firecrawl

mirror of https://git.mirrors.martin98.com/https://github.com/mendableai/firecrawl synced 2025-06-22 18:01:21 +08:00

Author	SHA1	Message	Date
Gergő Móricz	6637dce626	fix: status	2025-01-19 17:34:09 +01:00
Gergő Móricz	13abb2bc0e	fix(crawl-redis/finishCrawl): increase logging to hunt down race condition	2025-01-17 17:23:13 +01:00
Gergő Móricz	d5929af010	fix(queue-worker/kickoff): make crawls wait for kickoff to finish (matters on big sitemapped sites)	2025-01-17 16:04:01 +01:00
Nicolas	a82160a630	Update crawl-redis.ts	2025-01-10 21:31:23 -03:00
Nicolas	f4d10c5031	Nick: formatting fixes	2025-01-10 18:35:10 -03:00
Móricz Gergő	49e584f8e1	fix(queue-worker/crawl): use SCARD to generate num_docs field	2025-01-09 09:51:34 +01:00
Gergő Móricz	ccfada98ca	various queue fixes	2025-01-07 19:15:23 +01:00
Nicolas	3b6edef9fa	chore: formatting	2024-12-17 16:58:57 -03:00
Gergő Móricz	37f58efe45	fix(crawl-redis/lockURL): only add to visited_unique if lock succeeds	2024-12-15 21:01:31 +01:00
Gergő Móricz	98f27b0acc	fix(crawl-redis/addCrawlJobDone): further ensure that completed doesn't go over total	2024-12-15 16:29:09 +01:00
Gergő Móricz	4b5014d7fe	feat(v1/batch/scrape): add ignoreInvalidURLs option	2024-12-14 01:11:43 +01:00
Nicolas	8a1c404918	Nick: revert trailing comma	2024-12-11 19:51:08 -03:00
Nicolas	00335e2ba9	Nick: fixed prettier	2024-12-11 19:46:11 -03:00
Gergő Móricz	ce460a3a56	fix(v1/crawl/status): completed more than total if some scrape jobs fail or are discarded	2024-12-10 22:33:53 +01:00
Gergő Móricz	91a1a9a1fc	fix(crawl-redis/lockURL): reduce logging	2024-12-09 19:29:42 +01:00
Gergő Móricz	f82b9c205c	fix(crawl-redis): oops	2024-12-05 21:42:08 +01:00
Gergő Móricz	845c2744a9	feat(app): add extra crawl logging (app-side only for now)	2024-12-05 20:50:36 +01:00
Nicolas	52806807a1	Nick: crawl fixes	2024-12-03 16:25:55 -03:00
rafaelmmiller	5ddb7eb922	parameter	2024-11-29 16:44:54 -03:00
Gergő Móricz	b468bb4014	crawl fixes	2024-11-20 19:48:01 +01:00
Gergő Móricz	79a75e088a	feat(crawl): allowSubdomain	2024-11-19 18:38:59 +01:00
Gergő Móricz	31a0471bfa	fix(crawl-redis): ordered push to wrong side of list	2024-11-15 21:56:15 +01:00
Gergő Móricz	0310cd2afa	fix(crawl): redirect rebase	2024-11-13 21:38:44 +01:00
Gergő Móricz	5ce4aaf0ec	fix(crawl): initialURL setting is unnecessary	2024-11-12 23:35:07 +01:00
Gergő Móricz	fbabc779f5	fix(crawler): relative URL handling on non-start pages (#893) * fix(crawler): relative URL handling on non-start pages * fix(crawl): further fixing	2024-11-12 18:20:53 +01:00
Móricz Gergő	f6db9f1428	fix(crawl-redis): batch scrape lockURL	2024-11-12 11:52:34 +01:00
Gergő Móricz	68c9615f2d	fix(crawl/maxDepth): fix maxDepth behaviour	2024-11-11 22:02:17 +01:00
Gergő Móricz	a8dc75f762	feat(crawl): add parameter to treat differing query parameters as different URLs (#892) * add parameter to crawleroptions * add code to make it work	2024-11-11 21:36:22 +01:00
Gergő Móricz	8e4e49e471	feat(generateURLPermutations): add tests	2024-11-11 20:29:17 +01:00
Gergő Móricz	dc3a4e27fd	move param to the right place	2024-11-08 16:25:11 +01:00
Gergő Móricz	6ecf24b85e	feat(crawl): URL deduplication	2024-11-08 16:22:06 +01:00
Gergő Móricz	8d467c8ca7	`WebScraper` refactor into `scrapeURL` (#714) * feat: use strictNullChecking * feat: switch logger to Winston * feat(scrapeURL): first batch * fix(scrapeURL): error swallow * fix(scrapeURL): add timeout to EngineResultsTracker * fix(scrapeURL): report unexpected error to sentry * chore: remove unused modules * feat(transfomers/coerce): warn when a format's response is missing * feat(scrapeURL): feature flag priorities, engine quality sorting, PDF and DOCX support * (add note) * feat(scrapeURL): wip readme * feat(scrapeURL): LLM extract * feat(scrapeURL): better warnings * fix(scrapeURL/engines/fire-engine;playwright): fix screenshot * feat(scrapeURL): add forceEngine internal option * feat(scrapeURL/engines): scrapingbee * feat(scrapeURL/transformars): uploadScreenshot * feat(scrapeURL): more intense tests * bunch of stuff * get rid of WebScraper (mostly) * adapt batch scrape * add staging deploy workflow * fix yaml * fix logger issues * fix v1 test schema * feat(scrapeURL/fire-engine/chrome-cdp): remove wait inserts on actions * scrapeURL: v0 backwards compat * logger fixes * feat(scrapeurl): v0 returnOnlyUrls support * fix(scrapeURL/v0): URL leniency * fix(batch-scrape): ts non-nullable * fix(scrapeURL/fire-engine/chromecdp): fix wait action * fix(logger): remove error debug key * feat(requests.http): use dotenv expression * fix(scrapeURL/extractMetadata): extract custom metadata * fix crawl option conversion * feat(scrapeURL): Add retry logic to robustFetch * fix(scrapeURL): crawl stuff * fix(scrapeURL): LLM extract * fix(scrapeURL/v0): search fix * fix(tests/v0): grant larger response size to v0 crawl status * feat(scrapeURL): basic fetch engine * feat(scrapeURL): playwright engine * feat(scrapeURL): add url-specific parameters * Update readme and examples * added e2e tests for most parameters. Still a few actions, location and iframes to be done. * fixed type * Nick: * Update scrape.ts * Update index.ts * added actions and base64 check * Nick: skipTls feature flag? * 403 * todo * todo * fixes * yeet headers from url specific params * add warning when final engine has feature deficit * expose engine results tracker for ScrapeEvents implementation * ingest scrape events * fixed some tests * comment * Update index.test.ts * fixed rawHtml * Update index.test.ts * update comments * move geolocation to global f-e option, fix removeBase64Images * Nick: * trim url-specific params * Update index.ts --------- Co-authored-by: Eric Ciarla <ericciarla@yahoo.com> Co-authored-by: rafaelmmiller <8574157+rafaelmmiller@users.noreply.github.com> Co-authored-by: Nicolas <nicolascamara29@gmail.com>	2024-11-07 20:57:33 +01:00
Gergő Móricz	03b37998fd	feat: bulk scrape	2024-10-17 19:40:18 +02:00
Nicolas	d1b838322d	Merge pull request #721 from mendableai/feat/concurrency-limit Concurrency limits	2024-10-01 16:15:05 -03:00
Gergő Móricz	fe721fffbe	fix(crawl-redis): normalize URL before locking	2024-10-01 20:59:50 +02:00
Gergő Móricz	b696bfc854	fix(crawl-status): avoid race conditions where crawl may be deemed failed	2024-09-26 21:00:27 +02:00
Nicolas	d872bf0c4c	Merge branch 'main' into v1-webscraper	2024-08-28 12:42:23 -03:00
Nicolas	c7bfe4ffe8	Nick:	2024-08-21 22:20:40 -03:00
Gergő Móricz	eb84673b06	feat: crawl status websocket WIP	2024-08-17 01:04:14 +02:00
Gergő Móricz	5896153d19	fix: crawl status and redis fixes	2024-08-16 22:52:48 +02:00
Gergő Móricz	f20328bdbb	crawl status and document stuff	2024-08-16 22:48:05 +02:00
Gergő Móricz	d0a8382a5b	fix(queue-worker): crawl finishing race condition	2024-08-16 18:48:52 +02:00
Gergő Móricz	846610681b	fix: fix posthog, add dummy crawl DB items	2024-08-15 18:55:18 +02:00
Gergő Móricz	b8ec40dd72	fix(crawl): submit sitemapped jobs in bulk	2024-08-14 20:34:19 +02:00
Gergo Moricz	2e5e480cc2	fix(crawl): call webhooks	2024-08-13 22:10:17 +02:00
Gergo Moricz	86e136beca	feat: crawl to scrape conversion	2024-08-13 20:51:43 +02:00

46 Commits