yanlong.wang
|
7f04a65548
|
fix: curl response headers
|
2025-02-19 15:18:38 +08:00 |
|
yanlong.wang
|
d749809966
|
deps: ditch puppeteer-extra-plugin-stealth
|
2025-02-19 14:48:12 +08:00 |
|
yanlong.wang
|
b6846ab4b6
|
curl-impersonate: add zstd compression options
|
2025-02-19 14:22:19 +08:00 |
|
yanlong.wang
|
2991e300d8
|
puppeteer: tweak the ua a bit
|
2025-02-19 14:05:26 +08:00 |
|
yanlong.wang
|
cf24d84e8a
|
puppeteer: ditch puppeteer-stealth and use the real stable chrome
|
2025-02-19 13:47:33 +08:00 |
|
yanlong.wang
|
e4bc29aab8
|
fix: expect malformed url in iframes
|
2025-02-17 18:53:55 +08:00 |
|
yanlong.wang
|
92f636474d
|
style: prefer const for originalSrc
|
2025-02-17 17:43:23 +08:00 |
|
yanlong.wang
|
008dcbaf22
|
fix: image in summary
|
2025-02-17 17:41:39 +08:00 |
|
yanlong.wang
|
fc2824b115
|
fix: bump deps
|
2025-02-17 13:34:24 +08:00 |
|
yanlong.wang
|
0e8308e627
|
fix: some invalid uriComponent case
|
2025-02-17 12:28:57 +08:00 |
|
Yanlong Wang
|
05df989202
|
fix: unhandled rejection case
|
2025-02-14 17:59:18 +08:00 |
|
yanlong.wang
|
0b93e7da53
|
fix
|
2025-02-10 15:05:58 +08:00 |
|
yanlong.wang
|
f7fd8132b8
|
bump: deps
|
2025-02-10 14:13:19 +08:00 |
|
Yanlong Wang
|
033a53af30
|
fix: options handling in stand-alone script
|
2025-02-07 11:14:55 +08:00 |
|
yanlong.wang
|
0f36fe81a6
|
fix: compressed response from curl
|
2025-02-05 16:09:54 +08:00 |
|
Yanlong Wang
|
6a58de590c
|
deployment: dedicated server script for cloud-run (#1139)
* refactor: domain profile and attempt direct engine
* fix: direct engine
* fix: abuse in background phase
* fix
* wip
* use curl-impersonate in custom image
* local pdf for curl
* listen port from env
* fix
* fix
* fix
* fix: ditch http2
* cd: using gh action
* ci: token for thinapps-shared
* ci: setup node lock file path
* ci: tweak
* ci: mmdb
* ci: docker build
* fix: ci
* fix: ci
|
2025-02-05 14:50:18 +08:00 |
|
Yanlong Wang
|
a453ab5f16
|
fix: content suffix for markdown respond format
|
2025-02-04 15:59:01 +08:00 |
|
Yanlong Wang
|
cc6d2f3e29
|
fix: search params
|
2025-01-26 21:21:48 +08:00 |
|
yanlong.wang
|
234f61d066
|
remove more attrs in readerlm preprocessing
|
2025-01-20 11:54:31 +08:00 |
|
Yanlong Wang
|
140a6f86ae
|
fix: tweak readerlm
|
2025-01-17 12:24:05 +08:00 |
|
Yanlong Wang
|
f95eb027d7
|
fix: tweak readerlm parameters
|
2025-01-17 11:42:36 +08:00 |
|
yanlong.wang
|
4e5729372e
|
fix: readerlm repetition_penalty
|
2025-01-16 19:20:44 +08:00 |
|
yanlong.wang
|
3e58afc2ba
|
fix: readerlm params
|
2025-01-16 18:46:14 +08:00 |
|
yanlong.wang
|
e23d9f30a6
|
fix: base parameter
|
2025-01-16 15:37:16 +08:00 |
|
yanlong.wang
|
53821d0105
|
fix: lm and related options
|
2025-01-16 15:11:32 +08:00 |
|
yanlong.wang
|
80b9a6a5a0
|
fix: curl with errors
|
2025-01-15 19:29:59 +08:00 |
|
yanlong.wang
|
6be6051aa7
|
fix
|
2025-01-15 17:50:03 +08:00 |
|
yanlong.wang
|
06f359309e
|
feat: new lm engine
|
2025-01-15 17:38:49 +08:00 |
|
Yanlong Wang
|
51a4877933
|
feat: gemini to replace blip2 (#1129)
* feat: domain profile
* fix
* fix
* fix
* fix
* fix
* refactor: curl as direct engine
* fix
* wip
* fix
* fix
* fix
* fix
* fix
---------
Co-authored-by: Sha Zhou <sha.zhou@jina.ai>
|
2025-01-15 15:03:46 +08:00 |
|
Sha Zhou
|
c19ba65391
|
update scrapping options
|
2025-01-14 15:32:50 +08:00 |
|
Sha Zhou
|
8f25fe1d45
|
fix pageshot failure
|
2025-01-13 19:25:07 +08:00 |
|
Sha Zhou
|
dc80020ade
|
use browser engine when no-cache is set
|
2025-01-13 18:09:30 +08:00 |
|
Sha Zhou
|
54abc175bb
|
feat: domain profile (#1127)
* feat: domain profile
* fix
* fix
* fix
* fix
* fix
* refactor: curl as direct engine
* fix
---------
Co-authored-by: yanlong.wang <yanlong.wang@naiver.org>
|
2025-01-13 17:44:09 +08:00 |
|
Sha Zhou
|
6c23342cbf
|
feat: fetch page content by curl (#1119)
* feat: fetch url without script data
* refactor: rename X-Agent to X-Engine
Co-Authored-By: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* refactor: rename X-Agent to X-Engine header and property (#1122)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* refactor: rename X-Agent to X-Engine while preserving user-agent functionality (#1123)
- Remove duplicate X-Engine header definition
- Restore userAgent threadLocal.set
- Restore overrideUserAgent in crawler options
- Maintain engine-related changes
Link to Devin run: https://app.devin.ai/sessions/cd65e5d9466049a28a92002267c48e8b
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* fix: remove duplicate engine declarations in scrapping-options.ts (#1124)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
|
2025-01-08 19:25:14 +08:00 |
|
yanlong.wang
|
2606c445d9
|
feat(crawl): viewport options
|
2024-12-24 19:07:48 +08:00 |
|
yanlong.wang
|
d8ad1cb6a1
|
feat: expose setViewport to page script
|
2024-12-24 18:48:32 +08:00 |
|
yanlong.wang
|
696536c7f2
|
feat(crawl): token budget
|
2024-12-24 18:31:54 +08:00 |
|
Yanlong Wang
|
b9d07e3692
|
fix: remove http abuse check
|
2024-12-06 10:21:57 +08:00 |
|
yanlong.wang
|
07023f8add
|
chore: tweak deployment
|
2024-11-27 12:36:00 +08:00 |
|
yanlong.wang
|
7a6d275979
|
bump: deps
|
2024-11-25 18:24:18 +08:00 |
|
yanlong.wang
|
f6c89e878c
|
fix: pdf upload in multipart
|
2024-11-25 17:50:01 +08:00 |
|
Yanlong Wang
|
deb0b6dc23
|
fix: potential gfm performance issue
|
2024-11-23 23:15:09 +08:00 |
|
yanlong.wang
|
16cabcaf22
|
feat: opt out gfm/table
|
2024-11-21 18:26:21 +08:00 |
|
yanlong.wang
|
2b29679801
|
fix: img turndown rules
|
2024-11-20 17:29:28 +08:00 |
|
Yanlong Wang
|
4400bef95b
|
fix: tricks applied by puppeteer-extra-plugin-stealth
|
2024-11-18 16:43:40 +08:00 |
|
Yanlong Wang
|
1f4620deef
|
fix: img with srcset only
|
2024-11-18 16:37:42 +08:00 |
|
Yanlong Wang
|
6fa8ce309e
|
fix: poorly transformed detection
|
2024-11-16 13:59:31 +08:00 |
|
Yanlong Wang
|
706de20e5c
|
fix: deps
|
2024-11-15 11:00:39 +08:00 |
|
Yanlong Wang
|
59dcc2db94
|
feat: image retention config
|
2024-11-14 22:36:53 +08:00 |
|
Yanlong Wang
|
ccb4b8a49d
|
fix: potential invalid html
|
2024-11-13 00:39:07 +08:00 |
|