495 Commits

Author SHA1 Message Date
yanlong.wang
7f04a65548
fix: curl response headers 2025-02-19 15:18:38 +08:00
yanlong.wang
d749809966
deps: ditch puppeteer-extra-plugin-stealth 2025-02-19 14:48:12 +08:00
yanlong.wang
b6846ab4b6
curl-impersonate: add zstd compression options 2025-02-19 14:22:19 +08:00
yanlong.wang
2991e300d8
puppeteer: tweak the ua a bit 2025-02-19 14:05:26 +08:00
yanlong.wang
cf24d84e8a
puppeteer: ditch puppeteer-stealth and use the real stable chrome 2025-02-19 13:47:33 +08:00
yanlong.wang
e4bc29aab8
fix: expect malformed url in iframes 2025-02-17 18:53:55 +08:00
yanlong.wang
92f636474d
style: prefer const for originalSrc 2025-02-17 17:43:23 +08:00
yanlong.wang
008dcbaf22
fix: image in summary 2025-02-17 17:41:39 +08:00
yanlong.wang
fc2824b115
fix: bump deps 2025-02-17 13:34:24 +08:00
yanlong.wang
0e8308e627
fix: some invalid uriComponent case 2025-02-17 12:28:57 +08:00
Yanlong Wang
05df989202
fix: unhandled rejection case 2025-02-14 17:59:18 +08:00
yanlong.wang
0b93e7da53
fix 2025-02-10 15:05:58 +08:00
yanlong.wang
f7fd8132b8
bump: deps 2025-02-10 14:13:19 +08:00
Yanlong Wang
033a53af30
fix: options handling in stand-alone script 2025-02-07 11:14:55 +08:00
yanlong.wang
0f36fe81a6
fix: compressed response from curl 2025-02-05 16:09:54 +08:00
Yanlong Wang
6a58de590c
deployment: dedicated server script for cloud-run (#1139)
* refactor: domain profile and attempt direct engine

* fix: direct engine

* fix: abuse in background phase

* fix

* wip

* use curl-impersonate in custom image

* local pdf for curl

* listen port from env

* fix

* fix

* fix

* fix: ditch http2

* cd: using gh action

* ci: token for thinapps-shared

* ci: setup node lock file path

* ci: tweak

* ci: mmdb

* ci: docker build

* fix: ci

* fix: ci
2025-02-05 14:50:18 +08:00
Yanlong Wang
a453ab5f16
fix: content suffix for markdown respond format 2025-02-04 15:59:01 +08:00
Yanlong Wang
cc6d2f3e29
fix: search params 2025-01-26 21:21:48 +08:00
yanlong.wang
234f61d066
remove more attrs in readerlm preprocessing 2025-01-20 11:54:31 +08:00
Yanlong Wang
140a6f86ae
fix: tweak readerlm 2025-01-17 12:24:05 +08:00
Yanlong Wang
f95eb027d7
fix: tweak readerlm parameters 2025-01-17 11:42:36 +08:00
yanlong.wang
4e5729372e
fix: readerlm repetition_penalty 2025-01-16 19:20:44 +08:00
yanlong.wang
3e58afc2ba
fix: readerlm params 2025-01-16 18:46:14 +08:00
yanlong.wang
e23d9f30a6
fix: base parameter 2025-01-16 15:37:16 +08:00
yanlong.wang
53821d0105
fix: lm and related options 2025-01-16 15:11:32 +08:00
yanlong.wang
80b9a6a5a0
fix: curl with errors 2025-01-15 19:29:59 +08:00
yanlong.wang
6be6051aa7
fix 2025-01-15 17:50:03 +08:00
yanlong.wang
06f359309e
feat: new lm engine 2025-01-15 17:38:49 +08:00
Yanlong Wang
51a4877933
feat: gemini to replace blip2 (#1129)
* feat: domain profile

* fix

* fix

* fix

* fix

* fix

* refactor: curl as direct engine

* fix

* wip

* fix

* fix

* fix

* fix

* fix

---------

Co-authored-by: Sha Zhou <sha.zhou@jina.ai>
2025-01-15 15:03:46 +08:00
Sha Zhou
c19ba65391
update scrapping options 2025-01-14 15:32:50 +08:00
Sha Zhou
8f25fe1d45
fix pageshot failure 2025-01-13 19:25:07 +08:00
Sha Zhou
dc80020ade
use browser engine when no-cache is set 2025-01-13 18:09:30 +08:00
Sha Zhou
54abc175bb
feat: domain profile (#1127)
* feat: domain profile

* fix

* fix

* fix

* fix

* fix

* refactor: curl as direct engine

* fix

---------

Co-authored-by: yanlong.wang <yanlong.wang@naiver.org>
2025-01-13 17:44:09 +08:00
Sha Zhou
6c23342cbf
feat: fetch page content by curl (#1119)
* feat: fetch url without script data

* refactor: rename X-Agent to X-Engine

Co-Authored-By: yanlong.wang@jina.ai <yanlong.wang@jina.ai>

* refactor: rename X-Agent to X-Engine

Co-Authored-By: yanlong.wang@jina.ai <yanlong.wang@jina.ai>

* refactor: rename X-Agent to X-Engine header and property (#1122)

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>

* refactor: rename X-Agent to X-Engine while preserving user-agent functionality (#1123)

- Remove duplicate X-Engine header definition
- Restore userAgent threadLocal.set
- Restore overrideUserAgent in crawler options
- Maintain engine-related changes

Link to Devin run: https://app.devin.ai/sessions/cd65e5d9466049a28a92002267c48e8b

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>

* fix: remove duplicate engine declarations in scrapping-options.ts (#1124)

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
2025-01-08 19:25:14 +08:00
yanlong.wang
2606c445d9
feat(crawl): viewport options 2024-12-24 19:07:48 +08:00
yanlong.wang
d8ad1cb6a1
feat: expose setViewport to page script 2024-12-24 18:48:32 +08:00
yanlong.wang
696536c7f2
feat(crawl): token budget 2024-12-24 18:31:54 +08:00
Yanlong Wang
b9d07e3692
fix: remove http abuse check 2024-12-06 10:21:57 +08:00
yanlong.wang
07023f8add
chore: tweak deployment 2024-11-27 12:36:00 +08:00
yanlong.wang
7a6d275979
bump: deps 2024-11-25 18:24:18 +08:00
yanlong.wang
f6c89e878c
fix: pdf upload in multipart 2024-11-25 17:50:01 +08:00
Yanlong Wang
deb0b6dc23
fix: potential gfm performance issue 2024-11-23 23:15:09 +08:00
yanlong.wang
16cabcaf22
feat: opt out gfm/table 2024-11-21 18:26:21 +08:00
yanlong.wang
2b29679801
fix: img turndown rules 2024-11-20 17:29:28 +08:00
Yanlong Wang
4400bef95b
fix: tricks applied by puppeteer-extra-plugin-stealth 2024-11-18 16:43:40 +08:00
Yanlong Wang
1f4620deef
fix: img with srcset only 2024-11-18 16:37:42 +08:00
Yanlong Wang
6fa8ce309e
fix: poorly transformed detection 2024-11-16 13:59:31 +08:00
Yanlong Wang
706de20e5c
fix: deps 2024-11-15 11:00:39 +08:00
Yanlong Wang
59dcc2db94
feat: image retention config 2024-11-14 22:36:53 +08:00
Yanlong Wang
ccb4b8a49d
fix: potential invalid html 2024-11-13 00:39:07 +08:00