51 Commits

Author SHA1 Message Date
rafaelmmiller
f92217e3b6 wip 2025-04-16 00:38:35 -07:00
Gergő Móricz
80b507e64e correlate with eid 2025-04-15 23:06:13 -07:00
Gergő Móricz
512a2b1cd4 feat(extract): run on original links if reranker is weird 2025-04-15 22:57:20 -07:00
Gergő Móricz
0abe60085b fix 2025-04-15 20:29:01 -07:00
Nicolas
6634d236bf
(feat/fire-1) FIRE-1 (#1462)
* wip

* integrating smart-scrape

* integrate smartscrape into llmExtract

* wip

* smart scrape multiple links

* fixes

* fix

* wip

* it worked!

* wip. there's a bug on the batchExtract TypeError: Converting circular structure to JSON

* wip

* retry model

* retry models

* feat/scrape+json+extract interfaces ready

* vertex -> googleapi

* fix/transformArrayToObject. required params on schema is still a bug

* change model

* o3-mini -> gemini

* Update extractSmartScrape.ts

* sessionId

* sessionId

* Nick: f-0 start

* Update extraction-service-f0.ts

* Update types.ts

* Nick:

* Update queue-worker.ts

* Nick: new interface

* rename analyzeSchemaAndPrompt -> F0

* refactor: rename agent ID to model in types and extract logic

* agent

* id->model

* id->model

* refactor: standardize agent model handling and validation across extraction logic

* livecast agent

* (feat/f1) sdks (#1459)

* feat: add FIRE-1 agent support to Python and JavaScript SDKs

Co-Authored-By: hello@sideguide.dev <hello@sideguide.dev>

* feat: add FIRE-1 agent support to scrape methods in both SDKs

Co-Authored-By: hello@sideguide.dev <hello@sideguide.dev>

* feat: add prompt and sessionId to AgentOptions interface

Co-Authored-By: hello@sideguide.dev <hello@sideguide.dev>

* Update index.ts

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: hello@sideguide.dev <hello@sideguide.dev>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>

* feat(v1): rate limits

* Update types.ts

* Update llmExtract.ts

* add cost tracking

* remove

* Update requests.http

* fix smart scrape cost calc

* log sm cost

* fix counts

* fix

* expose cost tracking

* models fix

* temp: skipLibcheck

* get rid of it

* fix ts

* dont skip lib check

* Update extractSmartScrape.ts

* Update queue-worker.ts

* Update smartScrape.ts

* Update requests.http

* fix(rate-limiter):

* types: fire-1 refine

* bill 150

* fix credits used on crawl

* ban from crawl

* route cost limit warning

* Update generic-ai.ts

* genres

* Update llmExtract.ts

* test server diff

* cletu

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Thomas Kosmas <thomas510111@gmail.com>
Co-authored-by: Ademílson F. Tonato <ademilsonft@outlook.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: hello@sideguide.dev <hello@sideguide.dev>
Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-04-15 00:19:45 -07:00
Gergő Móricz
6a10f0689d
ACUC: Dynamic Limits (FIR-1641) (#1434)
* extend acuc definition

* kill plan

* stuff

* stupid tests

* feat: better acuc

* feat(acuc): mock ACUC when not using db auth
2025-04-10 18:49:23 +02:00
Gergő Móricz
d3da790dc4 feat(extraction-service): teamId logging 2025-04-09 18:48:00 +02:00
Nicolas
20c93db43f
(feat/extract) URLs can now be optional in /extract (#1346)
* Nick: urls optional on extract

* Update index.ts
2025-03-16 22:29:25 -04:00
Nicolas
25d9bdb1f6
(feat/ai-sdk) Migrate to AI-SDK (#1220)
* Nick: init

* Update llmExtract.ts

* Update llmExtract.ts

* Nick rename

* fix(v1/types): extract json schema validation

* Update url-processor.ts

* feat(ai-sdk): ollama support

* feat(ai-sdk): further ollama support

* Nick: it is broken btw

* feat(ai-sdk): abstract model adapter

* Update pnpm-lock.yaml

* Update analyzeSchemaAndPrompt.ts

* Nick:

* feat(ai-sdk): ollama support

* doc(SELF_HOST): update with embedding param

* Nick:

* Update ranker.ts

* Nick:

* feat(ai-sdk): fixes

* Update llmExtract.ts

* feat: remove zod-to-json-schema

* fix

* Update llmExtract.ts

* use openai

* fixes

---------

Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-02-20 22:48:58 +01:00
Gergő Móricz
2200f084f3
SELFHOST FIXES (#1207)
* fix(extract): construct OpenAI on demand

Fixes hard-crash if api key not specified in a self-hosting environment.

* fix(ci): try sleeping

* fix(ci): override host

* fix(ci): wait for server to start

* Support /extract and /crawl for self-hosted (FIR-1097) (#1137)

* Support /extract for self-hosted

This returns the job response from redis rather than supabase when db auth is disabled (self hosted mode)

* Use getJob for extract and use correct types

* fix(v1/crawl-status): only poll DB for total count if DB is enabled

* feat(snips): TEST_SUITE_SELF_HOSTED

* fix(ci/test-server-self-host): use pr trigger

* fix(scrapeURL): f-e mocking in selfhosted env

* fix(snips): do not try to eval json format on selfhost

* fix(scrapeURL): further f-e mocking

* fix(snips): don't timeout on hard fail polling

* fix(v1/extract-status): fix-up the db-agnostic impl

unfortunately had to separate the functions since the schema
was too divergent :(

* fix(snips): boost screenshot delay

* feat(ci): test with openai

* feat(ci): extract, search testing

* fix(ci): matrix

* fix(ci): bleh

* Update: fix default google search (#1174)

* fix log title

* search should always work

* asd

* fix ci

---------

Co-authored-by: Nick Roth <nlr06886@gmail.com>
Co-authored-by: William <sdustusun@gmail.com>
2025-02-20 00:41:22 +01:00
Rafael Miller
ac5c88bffb
added scrapeOptions to extract (#1133) 2025-02-07 13:38:08 -03:00
Rafael Miller
8d7e8c4f50
added cached scrapes to extract (#1107)
* added cached scrapes to extract

* dont save if exists

* no duplicates

* experimental tag

* Update requests.http

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-01-31 13:58:52 -03:00
Nicolas
04c6f511b5
(feat/extract) Add sources to the extraction (#1101)
* Nick: good state

* Nick: source tracker class

* Nick: show sources under flag
2025-01-28 13:46:21 -03:00
Nicolas
6b9e65c4f6
(feat/extract) Refactor and Reranker improvements (#1100)
* Reapply "Nick: extract api reference"

This reverts commit 61d7ba76f76ce74e0d230f89a93436f29dc8d9df.

* Nick: refactor analyzer

* Nick: formatting

* Nick:

* Update extraction-service.ts

* Nick: fixes

* NIck:

* Nick: wip

* Nick: reverted to the old re-ranker

* Nick:

* Update extract-status.ts
2025-01-27 20:07:01 -03:00
Nicolas
61d7ba76f7 Revert "Nick: extract api reference"
This reverts commit 522c5b35da7d5cd997aa5ebe2002a38ede7ace93.
2025-01-26 21:06:37 -03:00
Nicolas
522c5b35da Nick: extract api reference 2025-01-26 21:00:40 -03:00
Móricz Gergő
05d79a875a fix(extract): oops 2025-01-24 11:55:41 +01:00
Móricz Gergő
4db9a4a675 fix(extraction-service): allow no multiEntityKeys if isMultiEntity is false 2025-01-24 11:33:49 +01:00
rafaelmmiller
f1cd891a70 added today to extract prompts 2025-01-23 17:14:45 -03:00
Gergő Móricz
6f696d32ae feat(extract): add log on 0 links 2025-01-23 19:25:12 +01:00
Gergő Móricz
5d56627bfa feat(extraction-service): highlight req schema generation 2025-01-23 19:24:24 +01:00
Móricz Gergő
9da51a7514 feat(extract): add original schema to logs 2025-01-23 14:59:54 +01:00
Móricz Gergő
d3518e85a8 feat(extract): add logging 2025-01-23 12:05:15 +01:00
Nicolas
ccb74a2b43 Nick: increased timeouts on extract + reduced extract redis usage 2025-01-23 01:28:26 -03:00
Nicolas
498558d358 Nick: formatting done 2025-01-22 18:47:44 -03:00
Nicolas
56f048aeff Reapply "Nick:"
This reverts commit 4b4385c520c7223cf79ebba981dded8ffaefde11.
2025-01-22 17:26:32 -03:00
Nicolas
4b4385c520 Revert "Nick:"
This reverts commit 6718ce89085339eaaceb1e88a0aa45ecff3216ac.
2025-01-22 17:26:09 -03:00
Nicolas
e1ef826ac6 Merge branch 'main' of https://github.com/mendableai/firecrawl 2025-01-22 17:25:49 -03:00
Nicolas
6718ce8908 Nick: 2025-01-22 17:25:48 -03:00
Gergő Móricz
208bd4ca0c fix(extraction-service): marginally improve logging 2025-01-22 19:38:09 +01:00
Nicolas
d786949639 Reapply "Merge pull request #1068 from mendableai/nsc/llm-usage-extract"
This reverts commit 8b17af40018688c34f95727ceaec289b02ab2023.
2025-01-19 22:04:12 -03:00
Nicolas
8b17af4001 Revert "Merge pull request #1068 from mendableai/nsc/llm-usage-extract"
This reverts commit 406f28c04aff2ba3ae65f483627da13f02943cc3, reversing
changes made to 34ad9ec25d73f37deb1e3adec2315a121ec52f0e.
2025-01-19 22:00:28 -03:00
Nicolas
64607f3f20 Update extraction-service.ts 2025-01-18 22:40:53 -03:00
Nicolas
9cd48d7f73 Nick: 2025-01-17 23:47:22 -03:00
Nicolas
1f6abf95e8 Nick: extract billing works 2025-01-17 20:59:44 -03:00
Nicolas
4db023280d Nick: introduce llm-usage cost analysis 2025-01-15 21:01:29 -03:00
Nicolas
957eea4113 Nick: extract without a schema should work as expected 2025-01-14 11:37:00 -03:00
Nicolas
61e6af2b16 Nick: streaming callback experimental 2025-01-14 02:13:42 -03:00
Nicolas
2dc87a2e1c Update extraction-service.ts 2025-01-14 01:59:52 -03:00
Nicolas
033e9bbf29 Nick: __experimental_streamSteps 2025-01-14 01:45:50 -03:00
Nicolas
5e5b5ee0e2
(feat/extract) New re-ranker + multi entity extraction (#1061)
* agent that decides if splits schema or not

* split and merge properties done

* wip

* wip

* changes

* ch

* array merge working!

* comment

* wip

* dereferentiate schema

* dereference schemas

* Nick: new re-ranker

* Create llm-links.txt

* Nick: format

* Update extraction-service.ts

* wip: cooking schema mix and spread functions

* wip

* wip getting there!!!

* nick:

* moved functions to helpers

* nick:

* cant reproduce the error anymore

* error handling all scrapes failed

* fix

* Nick: added the sitemap index

* Update sitemap-index.ts

* Update map.ts

* deduplicate and merge arrays

* added error handler for object transformations

* Update url-processor.ts

* Nick:

* Nick: fixes

* Nick: big improvements to rerank of multi-entity

* Nick: working

* Update reranker.ts

* fixed transformations for nested objs

* fix merge nulls

* Nick: fixed error piping

* Update queue-worker.ts

* Update extraction-service.ts

* Nick: format

* Update queue-worker.ts

* Update pnpm-lock.yaml

* Update queue-worker.ts

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Thomas Kosmas <thomas510111@gmail.com>
2025-01-13 22:30:15 -03:00
Nicolas
f4d10c5031 Nick: formatting fixes 2025-01-10 18:35:10 -03:00
Nicolas
aa31508ccd Nick: links-billed update (temp) 2025-01-08 15:13:33 -03:00
Gergő Móricz
1f2a76fc23
Update apps/api/src/lib/extract/extraction-service.ts 2025-01-07 20:18:10 +01:00
Nicolas
eb254547e5 Nick: 2025-01-07 16:16:01 -03:00
Nicolas
27457ed5db Nick: init 2025-01-03 20:44:27 -03:00
rafaelmmiller
ef0fc8d0d3 broader search if didnt find results 2025-01-02 18:00:18 -03:00
Nicolas
33632d2fe3 Update extraction-service.ts 2024-12-31 15:22:50 -03:00
Nicolas
e6da214aeb Nick: async background index 2024-12-30 21:42:01 -03:00
Nicolas
4332f18a8f Nick: making it optional for the user 2024-12-26 12:43:58 -03:00