1992 Commits

Author SHA1 Message Date
Nicolas
2e5785d8d9 Nick: fetch sitemap timeout param 2025-01-19 11:40:13 -03:00
Nicolas
24ddcd4a6d Update check-fire-engine.ts 2025-01-18 23:53:33 -03:00
Nicolas
382476cb36 Nick: auth extract 2025-01-18 23:16:25 -03:00
Nicolas
81c347f538 Update llmExtract.ts 2025-01-18 22:49:03 -03:00
Nicolas
64607f3f20 Update extraction-service.ts 2025-01-18 22:40:53 -03:00
Nicolas
b8a30a50e2 Update llm-cost.ts 2025-01-18 21:25:25 -03:00
Nicolas
0ec52613e2 Nick: 2025-01-18 21:10:11 -03:00
Nicolas
34b40f6a23 Nick: 2025-01-18 17:17:42 -03:00
Nicolas
9cd48d7f73 Nick: 2025-01-17 23:47:22 -03:00
Nicolas
260a726f37 Merge branch 'main' into nsc/llm-usage-extract 2025-01-17 23:02:12 -03:00
Nicolas
6e3ceccb5c Nick: fixed billing and acuc cache 2025-01-17 21:27:56 -03:00
Nicolas
1f6abf95e8 Nick: extract billing works 2025-01-17 20:59:44 -03:00
Gergő Móricz
dbc6d07871 fix(queue-worker): bring done add to earlier 2025-01-17 17:46:29 +01:00
Gergő Móricz
13abb2bc0e fix(crawl-redis/finishCrawl): increase logging to hunt down race condition 2025-01-17 17:23:13 +01:00
Gergő Móricz
078c0679aa fix(crawl-status): improve finished checking 2025-01-17 17:18:36 +01:00
Gergő Móricz
e6531278f6 feat(v1): crawl/batch scrape errors route 2025-01-17 17:12:04 +01:00
Gergő Móricz
dcd3d6d98d fix(kickoff): mark as finished if it errors out 2025-01-17 17:11:19 +01:00
Gergő Móricz
5992c57158 fix(crawler): bad urls from sitemap 2025-01-17 17:07:44 +01:00
Gergő Móricz
237d0dc197 fix(requests.http): map 2025-01-17 16:21:57 +01:00
Gergő Móricz
d5929af010 fix(queue-worker/kickoff): make crawls wait for kickoff to finish (matters on big sitemapped sites) 2025-01-17 16:04:01 +01:00
Gergő Móricz
23bb172592 fix(crawler): recognize sitemaps in robots.txt 2025-01-17 15:45:52 +01:00
Móricz Gergő
faf58dfca7 fix(removeUnwantedElements): post-includeTags excludeTags
Fixes #700
2025-01-17 12:41:00 +01:00
Móricz Gergő
de08b37480 feat: adjust CI testing 2025-01-17 11:51:46 +01:00
Móricz Gergő
4a947e385f fix(queue-worker): fill out time taken on failure too 2025-01-17 11:28:37 +01:00
Gergő Móricz
6c94db7ed0 fix(html,markdown): always get absolute links 2025-01-16 16:56:13 +01:00
Gergő Móricz
e824303d87 feat(html): always pick largest image from srcset 2025-01-16 16:51:33 +01:00
Gergő Móricz
655753cd27 fix(url): allow domains with ports 2025-01-16 16:30:14 +01:00
Nicolas
ca14c651da Update model-prices.ts 2025-01-15 21:07:53 -03:00
Nicolas
4db023280d Nick: introduce llm-usage cost analysis 2025-01-15 21:01:29 -03:00
Gergő Móricz
cbe67d89a5 feat(queue-worker): proactive job cancel 2025-01-15 19:02:20 +01:00
Gergő Móricz
ec039dcb8f fix(blocklist): unblock 2025-01-15 18:54:26 +01:00
Gergő Móricz
dde3aebac4 fix(v1/crawl-status): fix stuck on 0 jobs 2025-01-15 18:51:39 +01:00
Gergő Móricz
ce2f6ff884 fix(queue-worker/billing): fix crawl overbilling 2025-01-15 17:22:52 +01:00
Nicolas
db89e365eb Update check-fire-engine.ts 2025-01-15 01:16:42 -03:00
Nicolas
957eea4113 Nick: extract without a schema should work as expected 2025-01-14 11:37:00 -03:00
Nicolas
61e6af2b16 Nick: streaming callback experimental 2025-01-14 02:13:42 -03:00
Nicolas
c323c64671 Update extract-redis.ts 2025-01-14 02:00:47 -03:00
Nicolas
2dc87a2e1c Update extraction-service.ts 2025-01-14 01:59:52 -03:00
Nicolas
033e9bbf29 Nick: __experimental_streamSteps 2025-01-14 01:45:50 -03:00
Nicolas
9759f18725 Nick: temp file fixes 2025-01-13 23:56:53 -03:00
Nicolas
ac6650e488 Update requests.http 2025-01-13 22:31:54 -03:00
Nicolas
5e5b5ee0e2
(feat/extract) New re-ranker + multi entity extraction (#1061)
* agent that decides if splits schema or not

* split and merge properties done

* wip

* wip

* changes

* ch

* array merge working!

* comment

* wip

* dereferentiate schema

* dereference schemas

* Nick: new re-ranker

* Create llm-links.txt

* Nick: format

* Update extraction-service.ts

* wip: cooking schema mix and spread functions

* wip

* wip getting there!!!

* nick:

* moved functions to helpers

* nick:

* cant reproduce the error anymore

* error handling all scrapes failed

* fix

* Nick: added the sitemap index

* Update sitemap-index.ts

* Update map.ts

* deduplicate and merge arrays

* added error handler for object transformations

* Update url-processor.ts

* Nick:

* Nick: fixes

* Nick: big improvements to rerank of multi-entity

* Nick: working

* Update reranker.ts

* fixed transformations for nested objs

* fix merge nulls

* Nick: fixed error piping

* Update queue-worker.ts

* Update extraction-service.ts

* Nick: format

* Update queue-worker.ts

* Update pnpm-lock.yaml

* Update queue-worker.ts

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Thomas Kosmas <thomas510111@gmail.com>
2025-01-13 22:30:15 -03:00
Gergő Móricz
5c62bb1195
feat: new snips test framework (FIR-414) (#1033)
* feat: new snips test framework

* Update mock.ts

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2025-01-13 20:50:47 +01:00
Nicolas
9a13c1dede Nick: fixes to extract rephrase prompt 2025-01-11 20:22:36 -03:00
Nicolas
a82160a630 Update crawl-redis.ts 2025-01-10 21:31:23 -03:00
Nicolas
f4d10c5031 Nick: formatting fixes 2025-01-10 18:35:10 -03:00
Gergő Móricz
d1f3b96388 feat: add scrapeId in document.metadata 2025-01-09 20:52:12 +01:00
Gergő Móricz
29c1f126ab feat(scrape-status): adapt 2025-01-09 19:14:00 +01:00
Gergő Móricz
2849ce2f13 fix(queue-worker): errored job logging 2025-01-09 18:48:47 +01:00
Gergő Móricz
97bf54214f fix(scrapeURL/loop): re-add is long enough check with lt 0 2025-01-09 18:43:50 +01:00