Yanlong Wang
|
05df989202
|
fix: unhandled rejection case
|
2025-02-14 17:59:18 +08:00 |
|
yanlong.wang
|
0b93e7da53
|
fix
|
2025-02-10 15:05:58 +08:00 |
|
yanlong.wang
|
f7fd8132b8
|
bump: deps
|
2025-02-10 14:13:19 +08:00 |
|
Yanlong Wang
|
033a53af30
|
fix: options handling in stand-alone script
|
2025-02-07 11:14:55 +08:00 |
|
yanlong.wang
|
0f36fe81a6
|
fix: compressed response from curl
|
2025-02-05 16:09:54 +08:00 |
|
Yanlong Wang
|
6a58de590c
|
deployment: dedicated server script for cloud-run (#1139)
* refactor: domain profile and attempt direct engine
* fix: direct engine
* fix: abuse in background phase
* fix
* wip
* use curl-impersonate in custom image
* local pdf for curl
* listen port from env
* fix
* fix
* fix
* fix: ditch http2
* cd: using gh action
* ci: token for thinapps-shared
* ci: setup node lock file path
* ci: tweak
* ci: mmdb
* ci: docker build
* fix: ci
* fix: ci
|
2025-02-05 14:50:18 +08:00 |
|
Yanlong Wang
|
a453ab5f16
|
fix: content suffix for markdown respond format
|
2025-02-04 15:59:01 +08:00 |
|
Yanlong Wang
|
cc6d2f3e29
|
fix: search params
|
2025-01-26 21:21:48 +08:00 |
|
yanlong.wang
|
234f61d066
|
remove more attrs in readerlm preprocessing
|
2025-01-20 11:54:31 +08:00 |
|
Yanlong Wang
|
140a6f86ae
|
fix: tweak readerlm
|
2025-01-17 12:24:05 +08:00 |
|
Yanlong Wang
|
f95eb027d7
|
fix: tweak readerlm parameters
|
2025-01-17 11:42:36 +08:00 |
|
yanlong.wang
|
4e5729372e
|
fix: readerlm repetition_penalty
|
2025-01-16 19:20:44 +08:00 |
|
yanlong.wang
|
3e58afc2ba
|
fix: readerlm params
|
2025-01-16 18:46:14 +08:00 |
|
yanlong.wang
|
e23d9f30a6
|
fix: base parameter
|
2025-01-16 15:37:16 +08:00 |
|
yanlong.wang
|
53821d0105
|
fix: lm and related options
|
2025-01-16 15:11:32 +08:00 |
|
yanlong.wang
|
80b9a6a5a0
|
fix: curl with errors
|
2025-01-15 19:29:59 +08:00 |
|
yanlong.wang
|
6be6051aa7
|
fix
|
2025-01-15 17:50:03 +08:00 |
|
yanlong.wang
|
06f359309e
|
feat: new lm engine
|
2025-01-15 17:38:49 +08:00 |
|
Yanlong Wang
|
51a4877933
|
feat: gemini to replace blip2 (#1129)
* feat: domain profile
* fix
* fix
* fix
* fix
* fix
* refactor: curl as direct engine
* fix
* wip
* fix
* fix
* fix
* fix
* fix
---------
Co-authored-by: Sha Zhou <sha.zhou@jina.ai>
|
2025-01-15 15:03:46 +08:00 |
|
Sha Zhou
|
c19ba65391
|
update scrapping options
|
2025-01-14 15:32:50 +08:00 |
|
Sha Zhou
|
8f25fe1d45
|
fix pageshot failure
|
2025-01-13 19:25:07 +08:00 |
|
Sha Zhou
|
dc80020ade
|
use browser engine when no-cache is set
|
2025-01-13 18:09:30 +08:00 |
|
Sha Zhou
|
54abc175bb
|
feat: domain profile (#1127)
* feat: domain profile
* fix
* fix
* fix
* fix
* fix
* refactor: curl as direct engine
* fix
---------
Co-authored-by: yanlong.wang <yanlong.wang@naiver.org>
|
2025-01-13 17:44:09 +08:00 |
|
Sha Zhou
|
6c23342cbf
|
feat: fetch page content by curl (#1119)
* feat: fetch url without script data
* refactor: rename X-Agent to X-Engine
Co-Authored-By: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* refactor: rename X-Agent to X-Engine
Co-Authored-By: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* refactor: rename X-Agent to X-Engine header and property (#1122)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* refactor: rename X-Agent to X-Engine while preserving user-agent functionality (#1123)
- Remove duplicate X-Engine header definition
- Restore userAgent threadLocal.set
- Restore overrideUserAgent in crawler options
- Maintain engine-related changes
Link to Devin run: https://app.devin.ai/sessions/cd65e5d9466049a28a92002267c48e8b
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* fix: remove duplicate engine declarations in scrapping-options.ts (#1124)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
|
2025-01-08 19:25:14 +08:00 |
|
yanlong.wang
|
2606c445d9
|
feat(crawl): viewport options
|
2024-12-24 19:07:48 +08:00 |
|
yanlong.wang
|
d8ad1cb6a1
|
feat: expose setViewport to page script
|
2024-12-24 18:48:32 +08:00 |
|
yanlong.wang
|
696536c7f2
|
feat(crawl): token budget
|
2024-12-24 18:31:54 +08:00 |
|
Yanlong Wang
|
b9d07e3692
|
fix: remove http abuse check
|
2024-12-06 10:21:57 +08:00 |
|
yanlong.wang
|
07023f8add
|
chore: tweak deployment
|
2024-11-27 12:36:00 +08:00 |
|
yanlong.wang
|
7a6d275979
|
bump: deps
|
2024-11-25 18:24:18 +08:00 |
|
yanlong.wang
|
f6c89e878c
|
fix: pdf upload in multipart
|
2024-11-25 17:50:01 +08:00 |
|
Yanlong Wang
|
deb0b6dc23
|
fix: potential gfm performance issue
|
2024-11-23 23:15:09 +08:00 |
|
yanlong.wang
|
16cabcaf22
|
feat: opt out gfm/table
|
2024-11-21 18:26:21 +08:00 |
|
yanlong.wang
|
2b29679801
|
fix: img turndown rules
|
2024-11-20 17:29:28 +08:00 |
|
Yanlong Wang
|
4400bef95b
|
fix: tricks applied by puppeteer-extra-plugin-stealth
|
2024-11-18 16:43:40 +08:00 |
|
Yanlong Wang
|
1f4620deef
|
fix: img with srcset only
|
2024-11-18 16:37:42 +08:00 |
|
Yanlong Wang
|
6fa8ce309e
|
fix: poorly transformed detection
|
2024-11-16 13:59:31 +08:00 |
|
Yanlong Wang
|
706de20e5c
|
fix: deps
|
2024-11-15 11:00:39 +08:00 |
|
Yanlong Wang
|
59dcc2db94
|
feat: image retention config
|
2024-11-14 22:36:53 +08:00 |
|
Yanlong Wang
|
ccb4b8a49d
|
fix: potential invalid html
|
2024-11-13 00:39:07 +08:00 |
|
Yanlong Wang
|
be993c2cb1
|
fix: there may be invalid root doc
|
2024-11-13 00:32:48 +08:00 |
|
Yanlong Wang
|
68c4df2df3
|
fix: deps and bugs
|
2024-11-13 00:27:39 +08:00 |
|
yanlong.wang
|
7ae2545a30
|
chore: tweak deployment
|
2024-11-12 17:33:23 +08:00 |
|
yanlong.wang
|
e2a187d126
|
fix: crawling IP url
|
2024-11-11 15:30:48 +08:00 |
|
yanlong.wang
|
67d4a9f45a
|
fix: expect cookie encoding issue
|
2024-11-11 14:58:00 +08:00 |
|
yanlong.wang
|
53bc91c31a
|
feat: compound response
|
2024-11-11 12:40:40 +08:00 |
|
Yanlong Wang
|
22647a0617
|
feat: script injecting and tools
|
2024-11-08 14:19:54 +08:00 |
|
Yanlong Wang
|
bd629a836b
|
search now requires authentication
|
2024-11-01 14:15:03 +08:00 |
|
Yanlong Wang
|
5d865651b1
|
chore: bump deps
|
2024-11-01 09:20:23 +08:00 |
|
yanlong.wang
|
b10931b8ed
|
fix: turndown rules
|
2024-10-31 17:22:51 +08:00 |
|