2218 Commits

Author SHA1 Message Date
Gergő Móricz
8dd5bf7bd9
feat(api/tests/scrape): Playwright test improvements (#1626)
* feat(api/tests/scrape): verify that proxy works on Playwright

* debug: logs

* remove logs

* feat(playwright): add contentType relaying

* fix tests

* debug

* fix json
2025-06-04 01:24:19 +02:00
Gergő Móricz
95f204aab7
Index (FIR-2177) (#1605)
* poc progress

* poc

* url splits and better url normalization

* feat(index): integrate into map

* fix on selfhost

* feat: modifiers

* separate index supa logic

* debug

* fix language comparison

* feat: dontStoreInCache

* feat(index): some rudimentary testing

* feat: use url split columns

* feat(queue-worker/kickoff): use index links to kickoff crawl

* feat(scrapeURL/index): behaviour on non-200 index entries

* feat/added benchmark for scrapes

* feat(map): ignoreIndex

* feat(index): batch insert

* fix(api/tests/scrape): fix index test to work with batching

* disable cacheable lookup for self hosting tests

* feat(js-sdk): dontStoreInCache

* chore(js-sdk): bump

* feat(index): FIRECRAWL_INDEX_WRITE_ONLY

* feat(api/test): index envs

* map benchmarks

* cleanup

* further fixes

* clean up on map

* remove extraneous log

* workflow test run

* asd

* improve fns

* try again

* wow i'm an idiot

* ok fixed

* wth

* revert

* async saving to index

* feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624)

* feat(selfhost): deploy a playwright image (#1625)

* Testing improvements (FIR-2209) (#1623)

* yeet ad blocking tests until further notice

* feat: re-enable billing tests

* more timeout

* cache issues with billing test

* weird thing

* fix(api/tests/scrape/status): propagation time

* stupid

* no log

* sws

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Ademílson Tonato <ademilsonft@outlook.com>
2025-06-03 21:30:19 +02:00
Gergő Móricz
406d696667
Testing improvements (FIR-2209) (#1623)
* yeet ad blocking tests until further notice

* feat: re-enable billing tests

* more timeout

* cache issues with billing test

* weird thing

* fix(api/tests/scrape/status): propagation time

* stupid

* no log

* sws
2025-06-03 21:16:36 +02:00
Ademílson Tonato
41897139da
feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624) 2025-06-03 18:16:46 +02:00
Nicolas
e108ff3525 Update search.ts 2025-06-02 23:46:55 -03:00
Nicolas
9347de6a41 Update scrape.ts 2025-06-02 23:15:59 -03:00
Nicolas
86a9d3525b Update queue-jobs.ts 2025-06-02 23:09:09 -03:00
Nicolas
cbc47305cc Update search.ts 2025-06-02 23:09:02 -03:00
Nicolas
8c661f5329 Update scrape.ts 2025-06-02 22:37:49 -03:00
Nicolas
8967b31465 Nick: bypass billing 2025-06-02 21:51:46 -03:00
Nicolas
bf919ceb82 Nick: __searchPreviewToken 2025-06-02 21:16:34 -03:00
Nicolas
ef789ce8d7 Nick: __experimental 2025-06-02 19:58:56 -03:00
Gergő Móricz
72be73473f
feat(api/scrape): credits_billed column + handle billing for /scrape calls on worker side with stricter timeout enforcement (FIR-2162) (#1607)
* feat(api/scrape): tighten timeout and handle billing and logging on queue-worker

* fix: abortsignal pre-check

* fix: proper level

* add comment to clarify is_scrape

* reenable billing tests

* Revert "reenable billing tests"

This reverts commit 98236fdfa03dde8cecdd6b763fcf86810e468a28.

* oof

* fix searxng logging

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-06-02 17:56:27 -03:00
Gergő Móricz
4167ec53eb
fix(scrapeURL): only allow disabling the adblock on playwright (FIR-2200) (#1616)
* fix(scrapeURL): only allow disabling the adblock on playwright

* feat(api/tests/scrape): re-enable ad blocking tests
2025-06-02 22:48:16 +02:00
Gergő Móricz
7a8be13220 remove indexes that are no longer used 2025-06-02 22:09:55 +02:00
Gergő Móricz
98ceda9bd5
feat(search): ignore concurrency limit for search (FIR-2187) (#1617)
* feat(search): ignore concurrency limit for search (temp)

* feat(search): only for low tier users for good DX
2025-06-02 17:07:44 -03:00
Nicolas
9297afd1ff Nick: search 2025-05-29 17:00:13 -03:00
Gergő Móricz
a8e0482718 feat(search): bill for PDFs properly 2025-05-29 20:59:15 +02:00
Gergő Móricz
a2f41fb650 feat(api/server): wait 60s for GCE load balancer drain timeout
To minimize 502s.
2025-05-29 20:08:52 +02:00
Gergő Móricz
3ea221b093 fix(api/queue): tighten expiries on indexQueue jobs 2025-05-29 16:36:55 +02:00
Gergő Móricz
c9dd0e609a fix(api/queue): tighten expiries on billingQueue jobs 2025-05-29 16:26:52 +02:00
Gergő Móricz
93655b5c0b
feat(scrapeURL/pdf): bill n credits per page (FIR-1934) (#1553)
* feat(scrapeURL/pdf): bill n credits per page

* Update scrape.ts

* Update queue-worker.ts

* separate billing logic

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-05-29 16:01:08 +02:00
Gergő Móricz
38c96b524f
feat(scrapeURL): handle contentType JSON better in markdown conversion (#1604) 2025-05-29 15:26:07 +02:00
Gergő Móricz
7e73b01599 fix(queue-worker): call webhook after job is in DB 2025-05-29 14:40:47 +02:00
Gergő Móricz
706d378a89 feat(api/v1/scrape-status): log supa lookup errors 2025-05-29 13:02:54 +02:00
Gergő Móricz
a5efff07f9
feat(apps/api): add support for a separate, non-eviction Redis (#1600)
* feat(apps/api): add support for a separate, non-eviction Redis

* fix: misimport
2025-05-28 09:58:04 +02:00
Nicolas
756b452a01 Update batch_billing.ts 2025-05-27 19:05:00 -03:00
Nicolas
299e3e29e0 Update batch_billing.ts 2025-05-27 18:44:24 -03:00
Gergő Móricz
a36c6a4f40
feat(scrapeURL): add unnormalizedSourceURL for url matching DX (FIR-2137) (#1601)
* feat(scrapeURL): add unnormalizedSourceURL for url matching DX

* fix(tests): fixc
2025-05-27 21:33:44 +02:00
Gergő Móricz
474e5a0543 fix(crawler): always set expiry on sitemap links in redis 2025-05-27 15:39:31 +02:00
Gergő Móricz
c3738063cf less logs even more 2025-05-25 15:50:20 +02:00
Gergő Móricz
492d97e889 reduce logging 2025-05-24 00:09:13 +02:00
Gergő Móricz
a3145ccacc
fix(extract-status): be able to get extract status even after TTL lapses (#1599) 2025-05-23 22:33:09 +02:00
Gergő Móricz
8389a1a78d
fix(html-transformer): bad outName for og:locale:alternate (FIR-2101) (#1597)
* fix(html-transformer): bad outName for og:locale:alternate

* oops
2025-05-23 17:10:09 +02:00
Gergő Móricz
3ec17e2d1a
fix(v1): avoid overwriting rateLimiterMode with FIRE-1 rate limiter (#1593) 2025-05-23 11:50:59 -03:00
Gergő Móricz
3df687e4db
feat(queue-worker/afterJobDone): improved ccq insert logic (#1595) 2025-05-23 11:50:14 -03:00
Gergő Móricz
a7894a2714 fix(scrapeURL/pdf): even better timeout detection 2025-05-23 16:29:28 +02:00
Gergő Móricz
8571b5a99d Revert "feat(queue-worker/afterJobDone): improved ccq insert logic"
This reverts commit 97c635676d228ed1342cdd1468cb2a1aef4fcfc9.
2025-05-23 15:42:15 +02:00
Gergő Móricz
97c635676d feat(queue-worker/afterJobDone): improved ccq insert logic 2025-05-23 15:41:57 +02:00
Gergő Móricz
f41af8241e fix(scrapeURL/pdf): better timeout error 2025-05-23 13:59:53 +02:00
Gergő Móricz
bfe731309c fix(scrapeURL/pdf/mu): remove log 2025-05-23 13:47:34 +02:00
Gergő Móricz
b03670a8b7
feat: parse PDFs on fc side and reject if too long for timeout (FIR-2083) (#1592)
* feat: pdf-parser, implementation in scrapeURL

* use pdf-parser for page count instead of mu

* fix(pdf-parser): bindings

* feat(scrapeURL/pdf): adjust MILLISECONDS_PER_PAGE

* implement post-runsync polling and fix

* fix(Dockerfile): copy in the pdf-parser source code

* fix(scrapeURL/pdf): better error for timeout below 0
2025-05-23 13:45:53 +02:00
Gergő Móricz
321fff1695 ok what 2025-05-23 11:41:34 +02:00
Gergő Móricz
00cc733972 more logs 2025-05-23 11:29:34 +02:00
Gergő Móricz
bb67b9812b check if enum is being overwritten somehow 2025-05-23 11:27:49 +02:00
Gergő Móricz
d4e7bde03d add stack 2025-05-23 10:18:30 +02:00
Gergő Móricz
6776292cc2 more log 2025-05-23 09:57:15 +02:00
Gergő Móricz
2e863da334 feat(api/v1/authMiddleware): add log to debug extract agent preview mode 2025-05-23 09:35:29 +02:00
Gergő Móricz
3e736f1e0d
feat(concurrency-log): add cclog endpoint (FIR-2067) (#1589)
* feat(concurrency-log): add cclog endpoint

* fix(api/routes/admin): misimport

* more misimports
2025-05-22 18:13:35 -03:00
Gergő Móricz
fd74299134
feat(scrapeURL, logJob): log pdf page count to db (FIR-2068) (#1587)
* feat(scrapeURL, logJob): log pdf page count to db

* devin stop the test littering pls
2025-05-22 17:26:01 -03:00