2761 Commits

Author SHA1 Message Date
Nicolas
2dc87a2e1c Update extraction-service.ts 2025-01-14 01:59:52 -03:00
Nicolas
033e9bbf29 Nick: __experimental_streamSteps 2025-01-14 01:45:50 -03:00
Nicolas
558a7f4c08 Update package.json 2025-01-14 01:35:29 -03:00
Nicolas
9759f18725 Nick: temp file fixes 2025-01-13 23:56:53 -03:00
Nicolas
ac6650e488 Update requests.http 2025-01-13 22:31:54 -03:00
Nicolas
5e5b5ee0e2
(feat/extract) New re-ranker + multi entity extraction (#1061)
* agent that decides if splits schema or not

* split and merge properties done

* wip

* wip

* changes

* ch

* array merge working!

* comment

* wip

* dereferentiate schema

* dereference schemas

* Nick: new re-ranker

* Create llm-links.txt

* Nick: format

* Update extraction-service.ts

* wip: cooking schema mix and spread functions

* wip

* wip getting there!!!

* nick:

* moved functions to helpers

* nick:

* cant reproduce the error anymore

* error handling all scrapes failed

* fix

* Nick: added the sitemap index

* Update sitemap-index.ts

* Update map.ts

* deduplicate and merge arrays

* added error handler for object transformations

* Update url-processor.ts

* Nick:

* Nick: fixes

* Nick: big improvements to rerank of multi-entity

* Nick: working

* Update reranker.ts

* fixed transformations for nested objs

* fix merge nulls

* Nick: fixed error piping

* Update queue-worker.ts

* Update extraction-service.ts

* Nick: format

* Update queue-worker.ts

* Update pnpm-lock.yaml

* Update queue-worker.ts

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Thomas Kosmas <thomas510111@gmail.com>
2025-01-13 22:30:15 -03:00
Gergő Móricz
5c62bb1195
feat: new snips test framework (FIR-414) (#1033)
* feat: new snips test framework

* Update mock.ts

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-01-13 20:50:47 +01:00
Nicolas
9a13c1dede Nick: fixes to extract rephrase prompt 2025-01-11 20:22:36 -03:00
Nicolas
a82160a630 Update crawl-redis.ts 2025-01-10 21:31:23 -03:00
Nicolas
f4d10c5031 Nick: formatting fixes 2025-01-10 18:35:10 -03:00
Gergő Móricz
d1f3b96388 feat: add scrapeId in document.metadata 2025-01-09 20:52:12 +01:00
Gergő Móricz
29c1f126ab feat(scrape-status): adapt 2025-01-09 19:14:00 +01:00
Gergő Móricz
2849ce2f13 fix(queue-worker): errored job logging 2025-01-09 18:48:47 +01:00
Gergő Móricz
97bf54214f fix(scrapeURL/loop): re-add is long enough check with lt 0 2025-01-09 18:43:50 +01:00
Gergő Móricz
0da386914d fix(queue-worker): graceful shutdown 2025-01-09 16:04:59 +01:00
Móricz Gergő
3c614a2e5c fix(scrapeURL/engines/pdf,docx): support authorization 2025-01-09 10:03:27 +01:00
Móricz Gergő
49e584f8e1 fix(queue-worker/crawl): use SCARD to generate num_docs field 2025-01-09 09:51:34 +01:00
Móricz Gergő
9e8c629ff4 fix(log_job): don't redact with auth header 2025-01-09 09:51:34 +01:00
Nicolas
14f696805c Update auth.ts 2025-01-08 17:04:57 -03:00
Nicolas
51cb4b1615 Nick: temp rl for /extract 2025-01-08 15:24:38 -03:00
Nicolas
a199208e21 Update rate-limiter.ts 2025-01-08 15:15:21 -03:00
Nicolas
aa31508ccd Nick: links-billed update (temp) 2025-01-08 15:13:33 -03:00
Móricz Gergő
363021ea78 feat(crawl): ensure url trimming 2025-01-08 12:35:42 +01:00
Móricz Gergő
977a3e13c5 fix(scrapeURL): remove short content check 2025-01-08 11:23:25 +01:00
Nicolas
0a41fdd35d Merge branch 'nsc/extract-queue' 2025-01-07 18:21:57 -03:00
Nicolas
7918d0e1c9 Nick: bump 1.12.0 2025-01-07 18:20:56 -03:00
Nicolas
f82a742cd1
Merge pull request #1044 from mendableai/nsc/extract-queue
(feat/extract) Move extract to a queue system
2025-01-07 18:10:46 -03:00
Nicolas
b98e289f03 Nick: 2025-01-07 17:49:21 -03:00
Nicolas
a185c05a5c Nick: sdk async and get status 2025-01-07 17:27:40 -03:00
Nicolas
9ec08d7020 Nick: fixed the sdks 2025-01-07 17:20:49 -03:00
Nicolas
dd14744850 Update types.ts 2025-01-07 16:55:55 -03:00
Nicolas
9fdcfb9314 Update index.ts 2025-01-07 16:24:46 -03:00
Nicolas
51636352a6 Merge branch 'nsc/extract-queue' of https://github.com/mendableai/firecrawl into nsc/extract-queue 2025-01-07 16:21:58 -03:00
Nicolas
11af214db1 Nick: update extract in case there is an error 2025-01-07 16:21:51 -03:00
Gergő Móricz
1f2a76fc23 Update apps/api/src/lib/extract/extraction-service.ts 2025-01-07 20:18:10 +01:00
Nicolas
eb254547e5 Nick: 2025-01-07 16:16:01 -03:00
Gergő Móricz
c6a63793bb crawl incomplete issues 2025-01-07 19:38:17 +01:00
Gergő Móricz
ccfada98ca various queue fixes 2025-01-07 19:15:23 +01:00
Nicolas
86e34d7c6c Nick: wip 2025-01-07 12:13:12 -03:00
Móricz Gergő
7a03275575 add comment 2025-01-07 13:57:47 +01:00
Móricz Gergő
7d73ebdbf1 fix(crawl): never invalidate first crawl scrape if redirects 2025-01-07 13:57:23 +01:00
Móricz Gergő
b96b97ed72 fix(crawl): don't push rawhtml to db unless requested 2025-01-07 10:09:15 +01:00
Móricz Gergő
35d1d85978 fix(crawler): also take the hostname of the base url when determining isInternalLink 2025-01-07 09:29:58 +01:00
Nicolas
bb27594443 Merge branch 'main' into nsc/extract-queue 2025-01-06 13:01:15 -03:00
Kirill
736c3675b6 use new agent generation instead of expired one 2025-01-05 17:07:14 +04:00
Nicolas
ceb2104960
Merge pull request #1034 from mendableai/sdk/fixed-none-undefined-on-response
[SDK] fixed none and undefined on response
2025-01-04 16:31:41 -03:00
Gergő Móricz
461842fe8c fix(v1/crawl-status): handle job's returnvalue being explicitly null (db race) 2025-01-04 17:24:33 +01:00
Gergő Móricz
b92a4eb79b fix(queue-worker): only do redirect handling logic on crawls, not batch scrape 2025-01-04 16:59:35 +01:00
Nicolas
d48ddb8820 Update canonical-url.test.ts 2025-01-03 23:55:05 -03:00
Nicolas
f2e0bfbfe3 Nick: url normalization 2025-01-03 23:54:03 -03:00