* feat: pdf-parser, implementation in scrapeURL
* use pdf-parser for page count instead of mu
* fix(pdf-parser): bindings
* feat(scrapeURL/pdf): adjust MILLISECONDS_PER_PAGE
* implement post-runsync polling and fix
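A minimal sketch of what post-runsync polling could look like, assuming the pdf-parser runs as a RunPod-style serverless endpoint; the URL shape, status names, and intervals are assumptions, not the actual implementation:

```ts
// Hedged sketch: /runsync may return a finished job directly, or a queued
// one that has to be polled via /status/{id} until it reaches a terminal
// state. Endpoint layout and status strings are assumptions.
async function runSyncWithPolling(
  endpointUrl: string,
  apiKey: string,
  input: unknown,
  pollIntervalMs = 1000,
  timeoutMs = 60_000,
): Promise<unknown> {
  const headers = {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };

  const res = await fetch(`${endpointUrl}/runsync`, {
    method: "POST",
    headers,
    body: JSON.stringify({ input }),
  });
  let job = await res.json();

  const deadline = Date.now() + timeoutMs;
  while (job.status === "IN_QUEUE" || job.status === "IN_PROGRESS") {
    if (Date.now() > deadline) {
      throw new Error("pdf-parser job timed out");
    }
    await new Promise((r) => setTimeout(r, pollIntervalMs));
    const poll = await fetch(`${endpointUrl}/status/${job.id}`, { headers });
    job = await poll.json();
  }

  if (job.status !== "COMPLETED") {
    throw new Error(`pdf-parser job failed with status ${job.status}`);
  }
  return job.output;
}
```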
* fix(Dockerfile): copy in the pdf-parser source code
* fix(scrapeURL/pdf): better error for timeout below 0
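Together with the MILLISECONDS_PER_PAGE adjustment above, the timeout handling might look roughly like this; the constant's value and the error message are assumptions:

```ts
// Illustrative sketch: derive a per-document timeout from the page count,
// and fail with a clear error when the remaining time budget is already
// exhausted. The value of MILLISECONDS_PER_PAGE here is an assumption.
const MILLISECONDS_PER_PAGE = 150;

function pdfTimeout(pageCount: number, remainingBudgetMs: number): number {
  if (remainingBudgetMs <= 0) {
    // A non-positive budget would otherwise yield a nonsensical negative
    // timeout; an explicit error makes the failure mode obvious.
    throw new Error(
      `Cannot parse PDF: timeout budget is ${remainingBudgetMs}ms (<= 0)`,
    );
  }
  return Math.min(pageCount * MILLISECONDS_PER_PAGE, remainingBudgetMs);
}
```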
* feat(crawl): includes/excludes fixes pt. 1
* fix(snips): billing tests
* drop the logs
* fix(ci): add replica url
* feat(crawl): drop initial scrape if it's not included
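A rough sketch of the includes/excludes check the crawler might apply to the initial URL as well as discovered links; the regex-on-path semantics are assumed, not taken from the actual implementation:

```ts
// Hypothetical filter: a URL is crawled only if it matches at least one
// include pattern (when any are given) and no exclude pattern. Per the
// commit, the initial scrape is now subject to the same check instead of
// being crawled unconditionally.
function isUrlIncluded(
  url: string,
  includes: string[],
  excludes: string[],
): boolean {
  const path = new URL(url).pathname;
  if (includes.length > 0 && !includes.some((p) => new RegExp(p).test(path))) {
    return false;
  }
  if (excludes.some((p) => new RegExp(p).test(path))) {
    return false;
  }
  return true;
}
```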
* feat(ci): more verbose logging
* fix crawl path in test
* fix(ci): wait for api
* fix(snips/scrape/ad): test for more pixels
* feat(js-sdk/crawl): add regexOnFullURL
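A hedged usage example of the new SDK option (the option name comes from the commit; the surrounding call shape and option names are assumptions based on the public JS SDK, and the snippet assumes an ESM context for top-level await):

```ts
// Sketch: when regexOnFullURL is true, include/exclude patterns are
// presumably matched against the full URL rather than only the path
// component (behavior inferred from the option name).
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

const result = await app.crawlUrl("https://example.com", {
  includePaths: ["^https://example\\.com/blog/.*"],
  regexOnFullURL: true,
  limit: 10,
});
```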
* fix(extract): construct OpenAI on demand
Fixes a hard crash if the API key is not specified in a self-hosted environment.
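The fix likely amounts to deferring client construction, along these lines (a sketch, not the actual code): a module-level `new OpenAI()` throws at import time when `OPENAI_API_KEY` is unset, hard-crashing self-hosted deployments that never use LLM-backed features.

```ts
// Sketch: construct the OpenAI client lazily instead of at module load,
// so self-hosted instances without an API key only fail when an
// LLM-backed feature is actually invoked.
import OpenAI from "openai";

let client: OpenAI | null = null;

function getOpenAI(): OpenAI {
  if (!process.env.OPENAI_API_KEY) {
    throw new Error(
      "OPENAI_API_KEY is not set; LLM-backed features are unavailable",
    );
  }
  client ??= new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  return client;
}
```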
* fix(ci): try sleeping
* fix(ci): override host
* fix(ci): wait for server to start
* Support /extract and /crawl for self-hosted (FIR-1097) (#1137)
* Support /extract for self-hosted
This returns the job response from Redis rather than Supabase when DB auth is disabled (self-hosted mode).
* Use getJob for extract and use correct types
* fix(v1/crawl-status): only poll DB for total count if DB is enabled
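Both of these self-hosted fixes hinge on the same guard: only touch the database when DB auth is enabled, and fall back to Redis otherwise. A sketch of the pattern, with illustrative stub fetchers standing in for the real data access code:

```ts
// Hypothetical sketch of the branch: with DB auth disabled (self-hosted
// mode), read the finished job from Redis instead of querying Supabase.
type JobResponse = { id: string; status: string; data?: unknown };

// Illustrative stubs, not the real implementations.
const getJobFromSupabase = async (id: string): Promise<JobResponse> => {
  throw new Error("stub: would query Supabase for job " + id);
};
const getJobFromRedis = async (id: string): Promise<JobResponse> => {
  throw new Error("stub: would read the job result from Redis for " + id);
};

const useDbAuthentication = process.env.USE_DB_AUTHENTICATION === "true";

async function getJob(jobId: string): Promise<JobResponse> {
  return useDbAuthentication
    ? getJobFromSupabase(jobId)
    : getJobFromRedis(jobId);
}
```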
* feat(snips): TEST_SUITE_SELF_HOSTED
* fix(ci/test-server-self-host): use pr trigger
* fix(scrapeURL): f-e mocking in self-hosted env
* fix(snips): do not try to eval the JSON format on self-host
* fix(scrapeURL): further f-e mocking
* fix(snips): don't time out on hard-fail polling
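A sketch of the test-polling fix: bail out as soon as the job reports a terminal failure instead of spinning until the timeout; the status names and intervals are assumptions.

```ts
// Illustrative test helper: poll until a terminal state, throwing
// immediately on "failed" rather than waiting for the full timeout.
async function waitForCompletion(
  getStatus: () => Promise<"scraping" | "completed" | "failed">,
  timeoutMs = 120_000,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const status = await getStatus();
    if (status === "completed") return;
    if (status === "failed") {
      throw new Error("job hard-failed; aborting poll early");
    }
    await new Promise((r) => setTimeout(r, 500));
  }
  throw new Error("polling timed out waiting for completion");
}
```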
* fix(v1/extract-status): fix-up the db-agnostic impl
Unfortunately, the functions had to be separated since the schemas were too divergent :(
* fix(snips): boost screenshot delay
* feat(ci): test with openai
* feat(ci): extract, search testing
* fix(ci): matrix
* fix(ci): bleh
* Update: fix default Google search (#1174)
* fix log title
* search should always work
* asd
* fix ci
---------
Co-authored-by: Nick Roth <nlr06886@gmail.com>
Co-authored-by: William <sdustusun@gmail.com>