34 Commits

Author SHA1 Message Date
devin-ai-integration[bot]
7ccbbec488
Fix LLMs.txt cache bug with subdomains and add bypass option (#1557)
* Fix LLMs.txt cache bug with subdomains and add bypass option (#1519)

Co-Authored-By: hello@sideguide.dev <hello+firecrawl@sideguide.dev>

* Nick:

* Update LLMs.txt test file to use helper functions and concurrent tests

Co-Authored-By: hello@sideguide.dev <hello+firecrawl@sideguide.dev>

* Remove LLMs.txt test file as requested

Co-Authored-By: hello@sideguide.dev <hello+firecrawl@sideguide.dev>

* Change parameter name to 'cache' and keep 7-day expiration

Co-Authored-By: hello@sideguide.dev <hello+firecrawl@sideguide.dev>

* Update generate-llmstxt-supabase.ts

* Update JS and Python SDKs to include cache parameter

Co-Authored-By: hello@sideguide.dev <hello+firecrawl@sideguide.dev>

* Fix LLMs.txt cache implementation to use normalizeUrl and exact matching

Co-Authored-By: hello@sideguide.dev <hello+firecrawl@sideguide.dev>

* Revert "Fix LLMs.txt cache implementation to use normalizeUrl and exact matching"

This reverts commit d05b9964677b7b2384453329d2ac99d841467053.

* Nick:

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: hello@sideguide.dev <hello+firecrawl@sideguide.dev>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-05-16 16:29:09 -03:00
Gergő Móricz
4740254b89 feat(rquests.http): add extract 2025-04-16 12:32:02 -07:00
Nicolas
6634d236bf
(feat/fire-1) FIRE-1 (#1462)
* wip

* integrating smart-scrape

* integrate smartscrape into llmExtract

* wip

* smart scrape multiple links

* fixes

* fix

* wip

* it worked!

* wip. there's a bug on the batchExtract TypeError: Converting circular structure to JSON

* wip

* retry model

* retry models

* feat/scrape+json+extract interfaces ready

* vertex -> googleapi

* fix/transformArrayToObject. required params on schema is still a bug

* change model

* o3-mini -> gemini

* Update extractSmartScrape.ts

* sessionId

* sessionId

* Nick: f-0 start

* Update extraction-service-f0.ts

* Update types.ts

* Nick:

* Update queue-worker.ts

* Nick: new interface

* rename analyzeSchemaAndPrompt -> F0

* refactor: rename agent ID to model in types and extract logic

* agent

* id->model

* id->model

* refactor: standardize agent model handling and validation across extraction logic

* livecast agent

* (feat/f1) sdks (#1459)

* feat: add FIRE-1 agent support to Python and JavaScript SDKs

Co-Authored-By: hello@sideguide.dev <hello@sideguide.dev>

* feat: add FIRE-1 agent support to scrape methods in both SDKs

Co-Authored-By: hello@sideguide.dev <hello@sideguide.dev>

* feat: add prompt and sessionId to AgentOptions interface

Co-Authored-By: hello@sideguide.dev <hello@sideguide.dev>

* Update index.ts

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: hello@sideguide.dev <hello@sideguide.dev>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>

* feat(v1): rate limits

* Update types.ts

* Update llmExtract.ts

* add cost tracking

* remove

* Update requests.http

* fix smart scrape cost calc

* log sm cost

* fix counts

* fix

* expose cost tracking

* models fix

* temp: skipLibcheck

* get rid of it

* fix ts

* dont skip lib check

* Update extractSmartScrape.ts

* Update queue-worker.ts

* Update smartScrape.ts

* Update requests.http

* fix(rate-limiter):

* types: fire-1 refine

* bill 150

* fix credits used on crawl

* ban from crawl

* route cost limit warning

* Update generic-ai.ts

* genres

* Update llmExtract.ts

* test server diff

* cletu

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Thomas Kosmas <thomas510111@gmail.com>
Co-authored-by: Ademílson F. Tonato <ademilsonft@outlook.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: hello@sideguide.dev <hello@sideguide.dev>
Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-04-15 00:19:45 -07:00
Ademílson F. Tonato
da2f17c757
feat(api/search): add search endpoint and update request limits
- Introduced a new POST endpoint for searching with a query and limit
- Updated the maximum limit for search results from 20 to 50 in the request schema
- Adjusted the default number of results in the Google search function from 7 to 5
2025-04-09 17:24:29 +01:00
Gergő Móricz
24f5199359
compare format (FIR-1560) (#1405) 2025-04-02 19:52:43 +02:00
Eric Ciarla
5a1886936c
Truncate llmstxt cache based on maxurls limit & improve maxurls handling (#1285)
* init

* Update generate-llmstxt-service.ts
2025-03-03 18:37:33 -03:00
Eric Ciarla
d984b50400
Add llmstxt generator endpoint (#1201)
* Nick:

* Revert "fix(v1/types): fix extract -> json rename (FIR-1072) (#1195)"

This reverts commit 586a10f40d354a038afc2b67809f20a7a829f8cb.

* Update deep-research-service.ts

* Nick:

* init

* part 2

* Update generate-llmstxt-service.ts

* Fix queue

* Update queue-worker.ts

* Almost there

* Final touches

* Update requests.http

* final touches

* Update requests.http

* Improve logging

* Change endpoint to /llmstxt

* Update queue-worker.ts

* Update generate-llmstxt-service.ts

* Nick: cache

* Update index.ts

* Update firecrawl.py

* Update package.json

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-02-19 14:42:33 -03:00
Rafael Miller
8d7e8c4f50
added cached scrapes to extract (#1107)
* added cached scrapes to extract

* dont save if exists

* no duplicates

* experimental tag

* Update requests.http

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-01-31 13:58:52 -03:00
Ademílson Tonato
34e3911a97
docs: update cancel crawl response
- add cancel crawl event to requests.http
2025-01-24 16:16:17 +00:00
Gergő Móricz
237d0dc197 fix(requests.http): map 2025-01-17 16:21:57 +01:00
Gergő Móricz
dde3aebac4 fix(v1/crawl-status): fix stuck on 0 jobs 2025-01-15 18:51:39 +01:00
Nicolas
ac6650e488 Update requests.http 2025-01-13 22:31:54 -03:00
Nicolas
5e5b5ee0e2
(feat/extract) New re-ranker + multi entity extraction (#1061)
* agent that decides if splits schema or not

* split and merge properties done

* wip

* wip

* changes

* ch

* array merge working!

* comment

* wip

* dereferentiate schema

* dereference schemas

* Nick: new re-ranker

* Create llm-links.txt

* Nick: format

* Update extraction-service.ts

* wip: cooking schema mix and spread functions

* wip

* wip getting there!!!

* nick:

* moved functions to helpers

* nick:

* cant reproduce the error anymore

* error handling all scrapes failed

* fix

* Nick: added the sitemap index

* Update sitemap-index.ts

* Update map.ts

* deduplicate and merge arrays

* added error handler for object transformations

* Update url-processor.ts

* Nick:

* Nick: fixes

* Nick: big improvements to rerank of multi-entity

* Nick: working

* Update reranker.ts

* fixed transformations for nested objs

* fix merge nulls

* Nick: fixed error piping

* Update queue-worker.ts

* Update extraction-service.ts

* Nick: format

* Update queue-worker.ts

* Update pnpm-lock.yaml

* Update queue-worker.ts

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Thomas Kosmas <thomas510111@gmail.com>
2025-01-13 22:30:15 -03:00
Rafael Miller
2c233bd321
Update requests.http 2024-12-16 11:48:48 -03:00
rafaelmmiller
d8150c6171 added type to reqs example 2024-12-16 11:46:56 -03:00
rafaelmmiller
2fb8a3c8dc fix schema 2024-11-19 10:04:42 -03:00
rafaelmmiller
53134b7c85 Rafa: removed throw error and added map to requests 2024-11-19 09:34:52 -03:00
Gergő Móricz
687ea69621 fix(requests.http): default to localhost baseUrl 2024-11-12 19:59:09 +01:00
Gergő Móricz
3a5eee6e3f feat: improve requests.http using format features 2024-11-12 19:58:07 +01:00
Gergő Móricz
8d467c8ca7
WebScraper refactor into scrapeURL (#714)
* feat: use strictNullChecking

* feat: switch logger to Winston

* feat(scrapeURL): first batch

* fix(scrapeURL): error swallow

* fix(scrapeURL): add timeout to EngineResultsTracker

* fix(scrapeURL): report unexpected error to sentry

* chore: remove unused modules

* feat(transfomers/coerce): warn when a format's response is missing

* feat(scrapeURL): feature flag priorities, engine quality sorting, PDF and DOCX support

* (add note)

* feat(scrapeURL): wip readme

* feat(scrapeURL): LLM extract

* feat(scrapeURL): better warnings

* fix(scrapeURL/engines/fire-engine;playwright): fix screenshot

* feat(scrapeURL): add forceEngine internal option

* feat(scrapeURL/engines): scrapingbee

* feat(scrapeURL/transformars): uploadScreenshot

* feat(scrapeURL): more intense tests

* bunch of stuff

* get rid of WebScraper (mostly)

* adapt batch scrape

* add staging deploy workflow

* fix yaml

* fix logger issues

* fix v1 test schema

* feat(scrapeURL/fire-engine/chrome-cdp): remove wait inserts on actions

* scrapeURL: v0 backwards compat

* logger fixes

* feat(scrapeurl): v0 returnOnlyUrls support

* fix(scrapeURL/v0): URL leniency

* fix(batch-scrape): ts non-nullable

* fix(scrapeURL/fire-engine/chromecdp): fix wait action

* fix(logger): remove error debug key

* feat(requests.http): use dotenv expression

* fix(scrapeURL/extractMetadata): extract custom metadata

* fix crawl option conversion

* feat(scrapeURL): Add retry logic to robustFetch

* fix(scrapeURL): crawl stuff

* fix(scrapeURL): LLM extract

* fix(scrapeURL/v0): search fix

* fix(tests/v0): grant larger response size to v0 crawl status

* feat(scrapeURL): basic fetch engine

* feat(scrapeURL): playwright engine

* feat(scrapeURL): add url-specific parameters

* Update readme and examples

* added e2e tests for most parameters. Still a few actions, location and iframes to be done.

* fixed type

* Nick:

* Update scrape.ts

* Update index.ts

* added actions and base64 check

* Nick: skipTls feature flag?

* 403

* todo

* todo

* fixes

* yeet headers from url specific params

* add warning when final engine has feature deficit

* expose engine results tracker for ScrapeEvents implementation

* ingest scrape events

* fixed some tests

* comment

* Update index.test.ts

* fixed rawHtml

* Update index.test.ts

* update comments

* move geolocation to global f-e option, fix removeBase64Images

* Nick:

* trim url-specific params

* Update index.ts

---------

Co-authored-by: Eric Ciarla <ericciarla@yahoo.com>
Co-authored-by: rafaelmmiller <8574157+rafaelmmiller@users.noreply.github.com>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2024-11-07 20:57:33 +01:00
Gergő Móricz
693dc14d9b remove invalid keys 2024-08-31 00:08:24 +02:00
rafaelsideguide
1baba3ce0a fix(go-sdk): submodules 2024-08-26 11:11:34 -03:00
rafaelsideguide
3ebdf93342 removed console.logs 2024-06-24 16:43:12 -03:00
rafaelsideguide
21d29de819 testing crawl with new.abb.com case
many unnecessary console.logs for tracing the code execution
2024-06-24 16:25:07 -03:00
rafaelsideguide
07e93ee5fd Update requests.http 2024-04-24 10:32:35 -03:00
Nicolas
3b5b868d0d Update requests.http 2024-04-23 18:13:58 -07:00
Nicolas
8939ca570b Merge branch 'main' into nsc/returnOnlyUrls 2024-04-23 18:05:48 -07:00
rafaelsideguide
a680c7ce84 [Feat] Server health check + slack message 2024-04-23 15:46:29 -03:00
Nicolas
ddf9ff9c9a Nick: 2024-04-20 11:46:06 -07:00
Nicolas
0113d66739 Update requests.http 2024-04-16 13:01:35 -04:00
Nicolas
3d260e94f3 Nick: fc- prefix 2024-04-15 20:39:25 -04:00
Nicolas
32af4ac226 Update requests.http 2024-04-15 17:38:33 -04:00
Nicolas
3418858cb7 Nick: revoked 2024-04-15 17:26:47 -04:00
Nicolas
a6c2a87811 Initial commit 2024-04-15 17:01:47 -04:00