firecrawl

mirror of https://git.mirrors.martin98.com/https://github.com/mendableai/firecrawl synced 2025-06-04 11:24:40 +08:00

Author	SHA1	Message	Date
Gergő Móricz	8dd5bf7bd9	feat(api/tests/scrape): Playwright test improvements (#1626) * feat(api/tests/scrape): verify that proxy works on Playwright * debug: logs * remove logs * feat(playwright): add contentType relaying * fix tests * debug * fix json	2025-06-04 01:24:19 +02:00
Gergő Móricz	95f204aab7	Index (FIR-2177) (#1605) * poc progress * poc * url splits and better url normalization * feat(index): integrate into map * fix on selfhost * feat: modifiers * separate index supa logic * debug * fix language comparison * feat: dontStoreInCache * feat(index): some rudimentary testing * feat: use url split columns * feat(queue-worker/kickoff): use index links to kickoff crawl * feat(scrapeURL/index): behaviour on non-200 index entries * feat/added benchmark for scrapes * feat(map): ignoreIndex * feat(index): batch insert * fix(api/tests/scrape): fix index test to work with batching * disable cacheable lookup for self hosting tests * feat(js-sdk): dontStoreInCache * chore(js-sdk): bump * feat(index): FIRECRAWL_INDEX_WRITE_ONLY * feat(api/test): index envs * map benchmarks * cleanup * further fixes * clean up on map * remove extraneous log * workflow test run * asd * improve fns * try again * wow i'm an idiot * ok fixed * wth * revert * async saving to index * feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624) * feat(selfhost): deploy a playwright image (#1625) * Testing improvements (FIR-2209) (#1623) * yeet ad blocking tests until further notice * feat: re-enable billing tests * more timeout * cache issues with billing test * weird thing * fix(api/tests/scrape/status): propagation time * stupid * no log * sws --------- Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com> Co-authored-by: Ademílson Tonato <ademilsonft@outlook.com>	2025-06-03 21:30:19 +02:00
Gergő Móricz	406d696667	Testing improvements (FIR-2209) (#1623) * yeet ad blocking tests until further notice * feat: re-enable billing tests * more timeout * cache issues with billing test * weird thing * fix(api/tests/scrape/status): propagation time * stupid * no log * sws	2025-06-03 21:16:36 +02:00
Ademílson Tonato	41897139da	feat: enhance metadata extraction by including 'itemprop' attribute in HTML (#1624)	2025-06-03 18:16:46 +02:00
Nicolas	e108ff3525	Update search.ts	2025-06-02 23:46:55 -03:00
Nicolas	9347de6a41	Update scrape.ts	2025-06-02 23:15:59 -03:00
Nicolas	86a9d3525b	Update queue-jobs.ts	2025-06-02 23:09:09 -03:00
Nicolas	cbc47305cc	Update search.ts	2025-06-02 23:09:02 -03:00
Nicolas	8c661f5329	Update scrape.ts	2025-06-02 22:37:49 -03:00
Nicolas	8967b31465	Nick: bypass billing	2025-06-02 21:51:46 -03:00
Nicolas	bf919ceb82	Nick: __searchPreviewToken	2025-06-02 21:16:34 -03:00
Nicolas	ef789ce8d7	Nick: __experimental	2025-06-02 19:58:56 -03:00
Gergő Móricz	72be73473f	feat(api/scrape): credits_billed column + handle billing for `/scrape` calls on worker side with stricter timeout enforcement (FIR-2162) (#1607) * feat(api/scrape): stricten timeout and handle billing and logging on queue-worker * fix: abortsignal pre-check * fix: proper level * add comment to clarify is_scrape * reenable billing tests * Revert "reenable billing tests" This reverts commit 98236fdfa03dde8cecdd6b763fcf86810e468a28. * oof * fix searxng logging --------- Co-authored-by: Nicolas <nicolascamara29@gmail.com>	2025-06-02 17:56:27 -03:00
Gergő Móricz	4167ec53eb	fix(scrapeURL): only allow disabling the adblock on playwright (FIR-2200) (#1616) * fix(scrapeURL): only allow disabling the adblock on playwright * feat(api/tests/scrape): re-enable ad blocking tests	2025-06-02 22:48:16 +02:00
Gergő Móricz	7a8be13220	remove indexes that are no longer used	2025-06-02 22:09:55 +02:00
Gergő Móricz	98ceda9bd5	feat(search): ignore concurrency limit for search (FIR-2187) (#1617) * feat(search): ignore concurrency limit for search (temp) * feat(search): only for low tier users for good DX	2025-06-02 17:07:44 -03:00
Nicolas	9297afd1ff	Nick: search	2025-05-29 17:00:13 -03:00
Gergő Móricz	a8e0482718	feat(search): bill for PDFs properly	2025-05-29 20:59:15 +02:00
Gergő Móricz	a2f41fb650	feat(api/server): wait 60s for GCE load balancer drain timeout To minimize 502s.	2025-05-29 20:08:52 +02:00
Gergő Móricz	3ea221b093	fix(api/queue): tighten expiries on indexQueue jobs	2025-05-29 16:36:55 +02:00
Gergő Móricz	c9dd0e609a	fix(api/queue): tighten expiries on billingQueue jobs	2025-05-29 16:26:52 +02:00
Gergő Móricz	93655b5c0b	feat(scrapeURL/pdf): bill n credits per page (FIR-1934) (#1553) * feat(scrapeURL/pdf): bill n credits per page * Update scrape.ts * Update queue-worker.ts * separate billing logi --------- Co-authored-by: Nicolas <nicolascamara29@gmail.com>	2025-05-29 16:01:08 +02:00
Gergő Móricz	38c96b524f	feat(scrapeURL): handle contentType JSON better in markdown conversion (#1604)	2025-05-29 15:26:07 +02:00
Gergő Móricz	7e73b01599	fix(queue-worker): call webhook after job is in DB	2025-05-29 14:40:47 +02:00
Gergő Móricz	706d378a89	feat(api/v1/scrape-status): log supa lookup errors	2025-05-29 13:02:54 +02:00
Gergő Móricz	a5efff07f9	feat(apps/api): add support for a separate, non-eviction Redis (#1600) * feat(apps/api): add support for a separate, non-eviction Redis * fix: misimport	2025-05-28 09:58:04 +02:00
Nicolas	756b452a01	Update batch_billing.ts	2025-05-27 19:05:00 -03:00
Nicolas	299e3e29e0	Update batch_billing.ts	2025-05-27 18:44:24 -03:00
Gergő Móricz	a36c6a4f40	feat(scrapeURL): add unnormalizedSourceURL for url matching DX (FIR-2137) (#1601) * feat(scrapeURL): add unnormalizedSourceURL for url matching DX * fix(tests): fixc	2025-05-27 21:33:44 +02:00
Gergő Móricz	474e5a0543	fix(crawler): always set expiry on sitemap links in redis	2025-05-27 15:39:31 +02:00
Gergő Móricz	c3738063cf	less logs even more	2025-05-25 15:50:20 +02:00
Gergő Móricz	492d97e889	reduce logging	2025-05-24 00:09:13 +02:00
Gergő Móricz	a3145ccacc	fix(extract-status): be able to get extract status even after TTL lapses (#1599)	2025-05-23 22:33:09 +02:00
Gergő Móricz	8389a1a78d	fix(html-transformer): bad outName for og:locale:alternate (FIR-2101) (#1597) * fix(html-transformer): bad outName for og:locale:alternate * oops	2025-05-23 17:10:09 +02:00
Gergő Móricz	3ec17e2d1a	fix(v1): avoid overwriting rateLimiterMode with FIRE-1 rate limiter (#1593)	2025-05-23 11:50:59 -03:00
Gergő Móricz	3df687e4db	feat(queue-worker/afterJobDone): improved ccq insert logic (#1595)	2025-05-23 11:50:14 -03:00
Gergő Móricz	a7894a2714	fix(scrapeURL/pdf): even better timeout detection	2025-05-23 16:29:28 +02:00
Gergő Móricz	8571b5a99d	Revert "feat(queue-worker/afterJobDone): improved ccq insert logic" This reverts commit 97c635676d228ed1342cdd1468cb2a1aef4fcfc9.	2025-05-23 15:42:15 +02:00
Gergő Móricz	97c635676d	feat(queue-worker/afterJobDone): improved ccq insert logic	2025-05-23 15:41:57 +02:00
Gergő Móricz	f41af8241e	fix(scrapeURL/pdf): better timeout error	2025-05-23 13:59:53 +02:00
Gergő Móricz	bfe731309c	fix(scrapeURL/pdf/mu): remove log	2025-05-23 13:47:34 +02:00
Gergő Móricz	b03670a8b7	feat: parse PDFs on fc side and reject if too long for timeout (FIR-2083) (#1592) * feat: pdf-parser, implementation in scrapeURL * use pdf-parser for page count instead of mu * fix(pdf-parser): bindings * feat(scrapeURL/pdf): adjust MILLISECONDS_PER_PAGE * implement post-runsync polling and fix * fix(Dockerfile): copy in the pdf-parser source code * fix(scrapeURL/pdf): better error for timeout below 0	2025-05-23 13:45:53 +02:00
Gergő Móricz	321fff1695	ok what	2025-05-23 11:41:34 +02:00
Gergő Móricz	00cc733972	more logs	2025-05-23 11:29:34 +02:00
Gergő Móricz	bb67b9812b	check if enum is being overwritten somehow	2025-05-23 11:27:49 +02:00
Gergő Móricz	d4e7bde03d	add stack	2025-05-23 10:18:30 +02:00
Gergő Móricz	6776292cc2	more log	2025-05-23 09:57:15 +02:00
Gergő Móricz	2e863da334	feat(api/v1/authMiddleware): add log to debug extract agent preview mode	2025-05-23 09:35:29 +02:00
Gergő Móricz	3e736f1e0d	feat(concurrency-log): add cclog endpoint (FIR-2067) (#1589) * feat(concurrency-log): add cclog endpoint * fix(api/routes/admin): misimport * more misimports	2025-05-22 18:13:35 -03:00
Gergő Móricz	fd74299134	feat(scrapeURL, logJob): log pdf page count to db (FIR-2068) (#1587) * feat(scrapeURL, logJob): log pdf page count to db * devin stop the test littering pls	2025-05-22 17:26:01 -03:00

2218 Commits