339 Commits

Author SHA1 Message Date
Nicolas
4d0acc9722 Merge branch 'main' into v1-webscraper 2024-08-26 16:22:05 -03:00
Gergo Moricz
d591e0f51c block corterix.com for performance issues 2024-08-25 20:06:12 +02:00
Nicolas
173f4ee1bf Nick: chrome cdp main | simple autoscaler 2024-08-23 20:09:59 -03:00
Gergő Móricz
05c250d3b8 Merge branch 'main' into v1-webscraper 2024-08-23 19:38:57 +02:00
Nicolas
3d53f4e213 Nick: unblocking pin 2024-08-23 13:56:05 -03:00
Gergő Móricz
e7f267b6fe Merge branch 'main' into v1-webscraper 2024-08-23 17:21:54 +02:00
Gergő Móricz
8d9ff90bcb feat(fire-engine): propagate sentry trace 2024-08-22 23:38:04 +02:00
Gergő Móricz
8e3c2b2855 fix(crawler): verify URL 2024-08-22 23:30:19 +02:00
rafaelsideguide
7473b74021 fix: html and rawlhtmls for pdfs 2024-08-22 15:15:45 -03:00
rafaelsideguide
b1d61d8557 Merge remote-tracking branch 'origin/v1-webscraper' into v1/python-sdk 2024-08-22 13:39:09 -03:00
Gergő Móricz
6d48dbcd38 feat(sentry): add trace continuity for queue 2024-08-22 16:47:38 +02:00
Gergő Móricz
fbbc3878f1 fix(crawler): make sure includes/excludes is an array 2024-08-22 13:18:26 +02:00
rafaelsideguide
fe2e8c0b7a includehtml fix 2024-08-21 15:54:00 -03:00
Gergő Móricz
55009e51f5 fix: filter out invalid URLs from crawl links 2024-08-21 20:49:25 +02:00
rafaelsideguide
52abec41c2 fixing delete 2024-08-21 10:35:50 -03:00
rafaelsideguide
b66553867e reverting delete, fixed express bug on checkCredits 2024-08-21 09:28:20 -03:00
rafaelsideguide
138437d616 commenting out delete, crashing on fire-engine 2024-08-21 08:11:24 -03:00
rafaelsideguide
5e48bec1fd commenting out delete, crashing on fire-engine 2024-08-21 08:10:46 -03:00
Nicolas
90b32f16c8 Nick: fixes 2024-08-20 21:38:11 -03:00
Nicolas
819ad50af3 Update fireEngine.ts 2024-08-20 21:16:33 -03:00
rafaelsideguide
e9d6ca197e tests passing now 2024-08-20 20:00:41 -03:00
Nicolas
1b3ad60a2c Reapply "Merge pull request #561 from mendableai/bug/dealing-with-dns-error"
This reverts commit ffe11a5bf73e3c57657972cd36c3af1d0b9a432c.
2024-08-20 19:22:09 -03:00
Nicolas
441628998f Reapply "Merge pull request #561 from mendableai/bug/dealing-with-dns-error"
This reverts commit ffe11a5bf73e3c57657972cd36c3af1d0b9a432c.
2024-08-20 19:16:48 -03:00
Nicolas
ffe11a5bf7 Revert "Merge pull request #561 from mendableai/bug/dealing-with-dns-error"
This reverts commit 2030ec603109d6ce8786a011d431bc5c83917f1b, reversing
changes made to f494d2b707d40b690ae41611d17f77f683570fc2.
2024-08-20 18:16:11 -03:00
Gergő Móricz
1368f9a87f fix: treat existing screenshot as a scraper success condition 2024-08-20 22:24:18 +02:00
rafaelsideguide
f98be7d94e Update fireEngine.ts 2024-08-20 16:53:01 -03:00
rafaelsideguide
1f27182a13 added try catch 2024-08-20 15:42:39 -03:00
rafaelsideguide
e326249a57 added check job and cancel to fire-engine requests 2024-08-20 14:26:42 -03:00
rafaelsideguide
e1c9cbf709 bug fixed. crawl should not stop if sitemap url is invalid 2024-08-20 09:11:58 -03:00
rafaelsideguide
ecd472356b added variables to beta customers 2024-08-19 16:41:54 -03:00
rafaelsideguide
b8170aaa47 Update blocklist.ts 2024-08-19 08:51:48 -03:00
Nicolas
47123be783 Nick: weird activity block 2024-08-16 22:01:56 -04:00
rafaelsideguide
086ba6280b fixed markdown format 2024-08-16 18:39:13 -03:00
Gergő Móricz
aabfaf0ac5 clean up crawl-status, fix db ddos 2024-08-16 23:29:39 +02:00
rafaelsideguide
7a61325500 map + search + scrape markdown bug 2024-08-16 17:57:11 -03:00
Nicolas
23a033fe61 Nick: fixes and more e2e tests 2024-08-16 16:03:35 -04:00
rafaelsideguide
3f998b688d scrape ready 2024-08-16 15:14:37 -03:00
Nicolas
81b2479db3 Merge pull request #459 from mendableai/feat/queue-scrapes
feat: Move scraper to queue
2024-08-15 14:19:55 -04:00
Nicolas
86326f34e9 Update single_url.test.ts 2024-08-15 13:48:42 -04:00
Gergő Móricz
29f0d9ec94 propagate priority to fire-engine 2024-08-15 19:04:46 +02:00
Nicolas
6e1074cdd1 Update website_params.ts 2024-08-14 17:39:54 -04:00
Thomas Kosmas
6410e1a81d Update params 2024-08-15 00:10:14 +03:00
Gergo Moricz
d7549d4dc5 feat: remove webScraperQueue 2024-08-13 21:03:24 +02:00
Gergő Móricz
4a2c37dcf5 Merge branch 'main' into feat/queue-scrapes 2024-08-13 20:53:49 +02:00
Gergo Moricz
86e136beca feat: crawl to scrape conversion 2024-08-13 20:51:43 +02:00
Thomas Kosmas
98be29c963 Update parameters for platform.openai.com 2024-08-12 22:49:28 +03:00
rafaelsideguide
0591000b64 bugfix includes excludes 2024-08-09 14:30:41 -03:00
Nicolas
f1f5605010 Update website_params.ts 2024-08-08 12:31:58 -04:00
Gergő Móricz
5fc7fcb77c Merge branch 'main' into feat/queue-scrapes 2024-08-07 16:35:44 +02:00
Gergo Moricz
fe9fdb578b revert bad hotfixes 2024-08-07 16:34:25 +02:00