17 Commits

Author SHA1 Message Date
Gergő Móricz
b03670a8b7
feat: parse PDFs on fc side and reject if too long for timeout (FIR-2083) (#1592)
* feat: pdf-parser, implementation in scrapeURL

* use pdf-parser for page count instead of mu

* fix(pdf-parser): bindings

* feat(scrapeURL/pdf): adjust MILLISECONDS_PER_PAGE

* implement post-runsync polling and fix

* fix(Dockerfile): copy in the pdf-parser source code

* fix(scrapeURL/pdf): better error for timeout below 0
2025-05-23 13:45:53 +02:00
Gergő Móricz
bd9673e104
Mog/cachable lookup (#1560)
* feat(scrapeURL): use cacheableLookup

* feat(queue-worker): add cacheablelookup

* fix(cacheable-lookup): make it work with tailscale on local

* add devenv

* try again

* allow querying all

* log

* fixes

* asd

* fix:

* fix(lookup):

* lookup
2025-05-16 15:44:52 +02:00
Gergő Móricz
d46ba95924 Revert "feat: use cacheable lookup everywhere (#1559)"
This reverts commit b8703b2a720765b92f5c4cab94cc90ea624198a8.
2025-05-16 15:31:06 +02:00
Gergő Móricz
b8703b2a72
feat: use cacheable lookup everywhere (#1559)
* feat(scrapeURL): use cacheableLookup

* feat(queue-worker): add cacheablelookup

* fix(cacheable-lookup): make it work with tailscale on local

* add devenv

* try again

* allow querying all

* log

* fixes

* asd

* fix:

* fix(lookup):
2025-05-16 15:27:24 +02:00
Nicolas
1c421f2d74
Nick: (#1492) 2025-04-22 21:42:37 -04:00
Nicolas
6634d236bf
(feat/fire-1) FIRE-1 (#1462)
* wip

* integrating smart-scrape

* integrate smartscrape into llmExtract

* wip

* smart scrape multiple links

* fixes

* fix

* wip

* it worked!

* wip. there's a bug on the batchExtract TypeError: Converting circular structure to JSON

* wip

* retry model

* retry models

* feat/scrape+json+extract interfaces ready

* vertex -> googleapi

* fix/transformArrayToObject. required params on schema is still a bug

* change model

* o3-mini -> gemini

* Update extractSmartScrape.ts

* sessionId

* sessionId

* Nick: f-0 start

* Update extraction-service-f0.ts

* Update types.ts

* Nick:

* Update queue-worker.ts

* Nick: new interface

* rename analyzeSchemaAndPrompt -> F0

* refactor: rename agent ID to model in types and extract logic

* agent

* id->model

* id->model

* refactor: standardize agent model handling and validation across extraction logic

* livecast agent

* (feat/f1) sdks (#1459)

* feat: add FIRE-1 agent support to Python and JavaScript SDKs

Co-Authored-By: hello@sideguide.dev <hello@sideguide.dev>

* feat: add FIRE-1 agent support to scrape methods in both SDKs

Co-Authored-By: hello@sideguide.dev <hello@sideguide.dev>

* feat: add prompt and sessionId to AgentOptions interface

Co-Authored-By: hello@sideguide.dev <hello@sideguide.dev>

* Update index.ts

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: hello@sideguide.dev <hello@sideguide.dev>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>

* feat(v1): rate limits

* Update types.ts

* Update llmExtract.ts

* add cost tracking

* remove

* Update requests.http

* fix smart scrape cost calc

* log sm cost

* fix counts

* fix

* expose cost tracking

* models fix

* temp: skipLibcheck

* get rid of it

* fix ts

* dont skip lib check

* Update extractSmartScrape.ts

* Update queue-worker.ts

* Update smartScrape.ts

* Update requests.http

* fix(rate-limiter):

* types: fire-1 refine

* bill 150

* fix credits used on crawl

* ban from crawl

* route cost limit warning

* Update generic-ai.ts

* genres

* Update llmExtract.ts

* test server diff

* cletu

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Thomas Kosmas <thomas510111@gmail.com>
Co-authored-by: Ademílson F. Tonato <ademilsonft@outlook.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: hello@sideguide.dev <hello@sideguide.dev>
Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2025-04-15 00:19:45 -07:00
Gergő Móricz
3a8de846e3
read from GCS (again) (#1433)
* feat(crawl-status): retrieve job data from GCS

* feat(gcs-jobs/save): retrying saving metadata (might conflict)

* feat(gcs-jobs/save): retry save operation

* fix(gcs-jobs/save): respect metadata rules

* feat(crawl-status): log if gcs job is not found

* feat(ci/test/server): add gcs
2025-04-09 12:47:51 +02:00
Gergő Móricz
71b6b83ec2
tally rework api switchover (#1328)
* tally rework api switchover

* fix and send logs

* temp: force main instance while RPCs propagate

* Revert "temp: force main instance while RPCs propagate"

This reverts commit 4c93379cfa64efd60eb4767dd8eced1bdd302531.
2025-03-12 20:10:33 +01:00
Gergő Móricz
e1cfe1da48
feat(crawl): includes/excludes fixes (FIR-1300) (#1303)
* feat(crawl): includes/excludes fixes pt. 1

* fix(snips): billing tests

* drop tha logs

* fix(ci): add replica url

* feat(crawl): drop initial scrape if it's not included

* feat(ci): more verbose logging

* fix crawl path in test

* fix(ci): wait for api

* fix(snips/scrape/ad): test for more pixels

* feat(js-sdk/crawl): add regexOnFullURL
2025-03-06 17:05:15 +01:00
Gergő Móricz
9ad947884d
feat(tests/snips): add billing tests + misc billing fixes (FIR-1280) (#1283)
* feat(tests/snips): add billing tests + misc billing fixes

* add testing key

* asd
2025-03-02 16:51:42 -03:00
Gergő Móricz
387cc60668 fix(ci/test-server): clean up old envs 2025-02-20 15:06:37 +01:00
Gergő Móricz
04218de2b0 Revert "feat(ci): use pull_request_target (+ manual approval)"
This reverts commit 9142030881e0d153396279520e127b74af8417c9.
2025-02-20 10:58:08 +01:00
Gergő Móricz
9142030881 feat(ci): use pull_request_target (+ manual approval) 2025-02-20 10:52:29 +01:00
Gergő Móricz
bc5a16d048 feat(ci/test-server): build go markdown parser 2025-02-20 10:05:39 +01:00
Gergő Móricz
f4f75fe184 fix(ci): path to lock 2025-02-19 22:15:41 +01:00
Gergő Móricz
e9cb8ac956 feat(ci): caching improvements 2025-02-19 22:11:32 +01:00
Gergő Móricz
1a9f6b985a feat(github/ci): improvements 2025-02-19 20:51:38 +01:00