yanlong.wang
|
6be6051aa7
|
fix
|
2025-01-15 17:50:03 +08:00 |
|
yanlong.wang
|
06f359309e
|
feat: new lm engine
|
2025-01-15 17:38:49 +08:00 |
|
Yanlong Wang
|
51a4877933
|
feat: gemini to replace blip2 (#1129)
* feat: domain profile
* fix
* fix
* fix
* fix
* fix
* refactor: curl as direct engine
* fix
* wip
* fix
* fix
* fix
* fix
* fix
---------
Co-authored-by: Sha Zhou <sha.zhou@jina.ai>
|
2025-01-15 15:03:46 +08:00 |
|
Sha Zhou
|
c19ba65391
|
update scrapping options
|
2025-01-14 15:32:50 +08:00 |
|
Sha Zhou
|
8f25fe1d45
|
fix pageshot failure
|
2025-01-13 19:25:07 +08:00 |
|
Sha Zhou
|
dc80020ade
|
use browser engine when no-cache is set
|
2025-01-13 18:09:30 +08:00 |
|
Sha Zhou
|
54abc175bb
|
feat: domain profile (#1127)
* feat: domain profile
* fix
* fix
* fix
* fix
* fix
* refactor: curl as direct engine
* fix
---------
Co-authored-by: yanlong.wang <yanlong.wang@naiver.org>
|
2025-01-13 17:44:09 +08:00 |
|
Sha Zhou
|
6c23342cbf
|
feat: fetch page content by curl (#1119)
* feat: fetch url without script data
* refactor: rename X-Agent to X-Engine
Co-Authored-By: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* refactor: rename X-Agent to X-Engine
Co-Authored-By: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* refactor: rename X-Agent to X-Engine header and property (#1122)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* refactor: rename X-Agent to X-Engine while preserving user-agent functionality (#1123)
- Remove duplicate X-Engine header definition
- Restore userAgent threadLocal.set
- Restore overrideUserAgent in crawler options
- Maintain engine-related changes
Link to Devin run: https://app.devin.ai/sessions/cd65e5d9466049a28a92002267c48e8b
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
* fix: remove duplicate engine declarations in scrapping-options.ts (#1124)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: yanlong.wang@jina.ai <yanlong.wang@jina.ai>
|
2025-01-08 19:25:14 +08:00 |
|
yanlong.wang
|
2606c445d9
|
feat(crawl): viewport options
|
2024-12-24 19:07:48 +08:00 |
|
yanlong.wang
|
d8ad1cb6a1
|
feat: expose setViewport to page script
|
2024-12-24 18:48:32 +08:00 |
|
yanlong.wang
|
696536c7f2
|
feat(crawl): token budget
|
2024-12-24 18:31:54 +08:00 |
|
Yanlong Wang
|
b9d07e3692
|
fix: remove http abuse check
|
2024-12-06 10:21:57 +08:00 |
|
yanlong.wang
|
07023f8add
|
chore: tweak deployment
|
2024-11-27 12:36:00 +08:00 |
|
yanlong.wang
|
7a6d275979
|
bump: deps
|
2024-11-25 18:24:18 +08:00 |
|
yanlong.wang
|
f6c89e878c
|
fix: pdf upload in multipart
|
2024-11-25 17:50:01 +08:00 |
|
Yanlong Wang
|
deb0b6dc23
|
fix: potential gfm performance issue
|
2024-11-23 23:15:09 +08:00 |
|
yanlong.wang
|
16cabcaf22
|
feat: opt out gfm/table
|
2024-11-21 18:26:21 +08:00 |
|
yanlong.wang
|
2b29679801
|
fix: img turndown rules
|
2024-11-20 17:29:28 +08:00 |
|
Yanlong Wang
|
4400bef95b
|
fix: tricks applied by puppeteer-extra-plugin-stealth
|
2024-11-18 16:43:40 +08:00 |
|
Yanlong Wang
|
1f4620deef
|
fix: img with srcset only
|
2024-11-18 16:37:42 +08:00 |
|
Yanlong Wang
|
6fa8ce309e
|
fix: poorly transformed detection
|
2024-11-16 13:59:31 +08:00 |
|
Yanlong Wang
|
706de20e5c
|
fix: deps
|
2024-11-15 11:00:39 +08:00 |
|
Yanlong Wang
|
59dcc2db94
|
feat: image retention config
|
2024-11-14 22:36:53 +08:00 |
|
Yanlong Wang
|
ccb4b8a49d
|
fix: potential invalid html
|
2024-11-13 00:39:07 +08:00 |
|
Yanlong Wang
|
be993c2cb1
|
fix: there may be invalid root doc
|
2024-11-13 00:32:48 +08:00 |
|
Yanlong Wang
|
68c4df2df3
|
fix: deps and bugs
|
2024-11-13 00:27:39 +08:00 |
|
yanlong.wang
|
7ae2545a30
|
chore: tweak deployment
|
2024-11-12 17:33:23 +08:00 |
|
yanlong.wang
|
e2a187d126
|
fix: crawling IP url
|
2024-11-11 15:30:48 +08:00 |
|
yanlong.wang
|
67d4a9f45a
|
fix: expect cookie encoding issue
|
2024-11-11 14:58:00 +08:00 |
|
yanlong.wang
|
53bc91c31a
|
feat: compound response
|
2024-11-11 12:40:40 +08:00 |
|
Yanlong Wang
|
22647a0617
|
feat: script injecting and tools
|
2024-11-08 14:19:54 +08:00 |
|
Yanlong Wang
|
bd629a836b
|
search now requires authentication
|
2024-11-01 14:15:03 +08:00 |
|
Yanlong Wang
|
5d865651b1
|
chore: bump deps
|
2024-11-01 09:20:23 +08:00 |
|
yanlong.wang
|
b10931b8ed
|
fix: turndown rules
|
2024-10-31 17:22:51 +08:00 |
|
yanlong.wang
|
340fb517d8
|
chore: add internal slack report
|
2024-10-30 17:42:06 +08:00 |
|
yanlong.wang
|
a488bb8921
|
fix: headers in overridden request
|
2024-10-29 15:20:58 +08:00 |
|
yanlong.wang
|
3303763345
|
fix: salvaging with google cache does not work anymore
|
2024-10-29 15:09:50 +08:00 |
|
yanlong.wang
|
ebc09003d1
|
fix: work around locale setting bug
|
2024-10-29 15:09:20 +08:00 |
|
yanlong.wang
|
9242bb393a
|
fix: detect poorly transformed contents
|
2024-10-28 14:52:13 +08:00 |
|
yanlong.wang
|
a8793114bb
|
fix
|
2024-10-23 18:50:39 +08:00 |
|
yanlong.wang
|
e38c5514e1
|
fix
|
2024-10-23 18:12:43 +08:00 |
|
yanlong.wang
|
fb97410e99
|
fix: bump deps
|
2024-10-23 18:03:59 +08:00 |
|
yanlong.wang
|
d538726bdd
|
revert: domain cannot be un-doomed due to google function wrapper
acdfd93097/src/function_wrappers.ts (L109-L116)
|
2024-10-23 17:27:23 +08:00 |
|
yanlong.wang
|
fedffe3dd2
|
fix: force process quit on firebase issue
|
2024-10-23 16:08:02 +08:00 |
|
yanlong.wang
|
102a1686b0
|
feat: expand shadow dom
|
2024-10-23 14:58:46 +08:00 |
|
Yanlong Wang
|
00a1278385
|
chore: tweak deployment
|
2024-10-21 21:34:08 +08:00 |
|
yanlong.wang
|
d6ad9e75d6
|
chore: suspend data crunching
|
2024-10-21 12:07:14 +08:00 |
|
Yanlong Wang
|
cf32ab4fa7
|
bump: deps
|
2024-10-18 12:59:44 +08:00 |
|
Yanlong Wang
|
74eac2fc18
|
fix: remove link url escaping
|
2024-10-18 12:59:36 +08:00 |
|
yanlong.wang
|
a54816d12d
|
fix
|
2024-10-14 17:33:24 +08:00 |
|