Gergő Móricz
8d467c8ca7
WebScraper refactor into scrapeURL (#714)
...
* feat: use strictNullChecking
* feat: switch logger to Winston
* feat(scrapeURL): first batch
* fix(scrapeURL): error swallow
* fix(scrapeURL): add timeout to EngineResultsTracker
* fix(scrapeURL): report unexpected error to sentry
* chore: remove unused modules
* feat(transformers/coerce): warn when a format's response is missing
* feat(scrapeURL): feature flag priorities, engine quality sorting, PDF and DOCX support
* (add note)
* feat(scrapeURL): wip readme
* feat(scrapeURL): LLM extract
* feat(scrapeURL): better warnings
* fix(scrapeURL/engines/fire-engine;playwright): fix screenshot
* feat(scrapeURL): add forceEngine internal option
* feat(scrapeURL/engines): scrapingbee
* feat(scrapeURL/transformers): uploadScreenshot
* feat(scrapeURL): more intense tests
* bunch of stuff
* get rid of WebScraper (mostly)
* adapt batch scrape
* add staging deploy workflow
* fix yaml
* fix logger issues
* fix v1 test schema
* feat(scrapeURL/fire-engine/chrome-cdp): remove wait inserts on actions
* scrapeURL: v0 backwards compat
* logger fixes
* feat(scrapeurl): v0 returnOnlyUrls support
* fix(scrapeURL/v0): URL leniency
* fix(batch-scrape): ts non-nullable
* fix(scrapeURL/fire-engine/chromecdp): fix wait action
* fix(logger): remove error debug key
* feat(requests.http): use dotenv expression
* fix(scrapeURL/extractMetadata): extract custom metadata
* fix crawl option conversion
* feat(scrapeURL): Add retry logic to robustFetch
* fix(scrapeURL): crawl stuff
* fix(scrapeURL): LLM extract
* fix(scrapeURL/v0): search fix
* fix(tests/v0): grant larger response size to v0 crawl status
* feat(scrapeURL): basic fetch engine
* feat(scrapeURL): playwright engine
* feat(scrapeURL): add url-specific parameters
* Update readme and examples
* added e2e tests for most parameters. Still a few actions, location and iframes to be done.
* fixed type
* Nick:
* Update scrape.ts
* Update index.ts
* added actions and base64 check
* Nick: skipTls feature flag?
* 403
* todo
* todo
* fixes
* yeet headers from url specific params
* add warning when final engine has feature deficit
* expose engine results tracker for ScrapeEvents implementation
* ingest scrape events
* fixed some tests
* comment
* Update index.test.ts
* fixed rawHtml
* Update index.test.ts
* update comments
* move geolocation to global f-e option, fix removeBase64Images
* Nick:
* trim url-specific params
* Update index.ts
---------
Co-authored-by: Eric Ciarla <ericciarla@yahoo.com>
Co-authored-by: rafaelmmiller <8574157+rafaelmmiller@users.noreply.github.com>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2024-11-07 20:57:33 +01:00
Sebastjan Prachovskij
503e83e83e
Add SearchApi to search
...
Add support for engines, improve status code error
Remove changes in package, add engine to env params
Improve description in env example
Remove unnecessary empty line
Improve text
2024-09-05 18:36:59 +03:00
rafaelsideguide
7a61325500
map + search + scrape markdown bug
2024-08-16 17:57:11 -03:00
Nicolas
09ca165d2e
Merge pull request #531 from kevinswiber/fix/respect-docker-env-file-comments
...
Self-host fix: Moving comments of .env.example values from end-of-line to above-line.
2024-08-12 16:54:56 -04:00
Kevin Swiber
33aa5cf0de
Moving comments of .env.example values from end-of-line to above-line. Self-host docs suggest using .env.example as a base. However, Docker doesn't respect end-of-line comments. It sets the comment as the actual value of the variable. This fix prevents that.
2024-08-12 12:24:46 -07:00
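The end-of-line comment problem described in the commit above can be illustrated with a minimal sketch. The parser below is hypothetical and is not Docker's actual implementation; it only mimics a reader that skips whole-line comments and keeps everything after `=` as the value. The variable name `PORT` and its comment text are assumptions made for illustration.

```python
def parse_env_lines(lines):
    """Hypothetical strict env parser: only whole-line comments are skipped."""
    env = {}
    for raw in lines:
        line = raw.strip()
        # Blank lines and lines that START with "#" are ignored.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        # Everything after "=" is kept verbatim, including any trailing " # ...".
        env[key.strip()] = value
    return env


# Comment at end of line: the comment text leaks into the value.
end_of_line = ["PORT=3002 # port the API listens on"]
# Comment moved above the line: the value stays clean.
above_line = ["# port the API listens on", "PORT=3002"]

print(parse_env_lines(end_of_line))
print(parse_env_lines(above_line))
```

Under this assumed behavior, moving the comment to its own line is enough to keep the value intact, which matches the intent of the fix.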
Rafael Miller
36e4b2cf49
Update .env.example
2024-08-12 10:37:00 -03:00
Quan Ming
a96ad4b0e2
Update redis url to use comment
2024-08-10 12:33:26 +08:00
Quan Ming
0221872a70
Update redis urls in example .env
2024-08-10 00:40:11 +08:00
rafaelsideguide
6208ecdbc0
added logger
2024-07-23 17:30:46 -03:00
Nicolas
17a1f9b55f
Update .env.example
2024-07-17 16:22:04 -04:00
Nicolas
5683bb2cc8
Nick:
2024-06-05 13:20:26 -07:00
Jakob Stadlhuber
9e5ddec207
Remove default webhook URL from .env.example
...
The default value for the SELF_HOSTED_WEBHOOK_URL in the .env.example file was removed to prevent unintentional exposure or usage. The users are now required to explicitly specify
2024-06-04 19:56:35 +02:00
Jakob Stadlhuber
6208f4207d
Add support for Self-Hosted Webhook URL Usage and added project_id into the webhook payload
...
This commit introduces the capability of using a Self-Hosted Webhook URL. The application now checks for a self-hosted URL before querying the database for the webhook settings. If a Self-Hosted Webhook URL is set in the environment variables, it is used directly, avoiding unnecessary database queries.
2024-06-04 19:55:07 +02:00
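The lookup order described in the commit above can be sketched roughly as follows. This is not the project's code: the function name `resolve_webhook_url`, the `team_id` argument, and the `query_db` callback are hypothetical stand-ins. Only the `SELF_HOSTED_WEBHOOK_URL` variable name comes from the commit messages.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_webhook_url(
    team_id: str,
    query_db: Callable[[str], Optional[str]],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the self-hosted webhook URL if set, else fall back to the database."""
    env = os.environ if env is None else env
    self_hosted = env.get("SELF_HOSTED_WEBHOOK_URL")
    if self_hosted:
        # Environment value wins, so the database is never queried.
        return self_hosted
    # Hypothetical database lookup, only reached when no env value is set.
    return query_db(team_id)
```

Passing `env` explicitly keeps the sketch easy to test. In a real service the default of reading `os.environ` would likely be enough.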
Rafael Miller
b80fb374e5
Merge branch 'main' into playwright-service-bug-222
2024-06-04 11:57:17 -03:00
rombru
3ff91ddd1f
fix: use @ instead of # for default BULL_AUTH_KEY. hash mark is reserved for URI fragments.
2024-06-03 21:28:25 +02:00
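Why a `#` in a key is a problem when the key ends up in a URL can be shown with the standard library's URL parser. This is a small sketch: the host, port, path layout, and the placeholder key values are assumptions for illustration, not the project's actual routes or defaults.

```python
from urllib.parse import urlsplit

# Hypothetical URLs that embed a key in the path.
with_hash = urlsplit("http://localhost:3002/admin/#key/queues")
with_at = urlsplit("http://localhost:3002/admin/@key/queues")

# "#" starts the fragment, so the key and the rest of the path
# are split off and the path itself is cut short.
print(with_hash.path, "|", with_hash.fragment)

# "@" is allowed inside a path segment, so the path stays whole.
print(with_at.path, "|", with_at.fragment)
```

Fragments are normally not sent to the server by clients, so a key that begins with `#` would not reach a route that expects it in the path.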
Matt Joyce
14896a9fdd
Fix PLAYWRIGHT_MICROSERVICE_URL
...
It needs to end in html, otherwise scrape will 404
2024-06-01 19:03:16 +10:00
Nicolas
ace46f340b
Nick: new limits, new pricing
2024-05-30 14:31:36 -07:00
Jakob Stadlhuber
9fc5a0ff98
Update comment in .env.example for proxy settings
...
This commit modifies the comment in .env.example to specify that proxy settings are for Playwright. This clarification aims to give users clearer context about when and why these proxy settings are used.
2024-05-24 17:45:59 +02:00
Jakob Stadlhuber
b001aded46
Add proxy and media blocking configurations
...
Updated environment variables and application settings to include proxy configurations and media blocking option. The proxy settings allow users to use a proxy service, while the media blocking is an optional feature that can help save bandwidth. Changes have been made in the .env.example, docker-compose.yaml, and main.py files.
2024-05-24 17:41:34 +02:00
Nicolas
a5e718b084
Nick: improvements
2024-05-21 18:34:23 -07:00
Nicolas
2644e1c029
Update .env.example
2024-05-20 13:36:51 -07:00
Nicolas
f473793ba3
Merge branch 'main' into feat/rate-limits
2024-05-19 12:23:34 -07:00
rafaelsideguide
54049be539
Added e2e tests
2024-05-17 15:37:47 -03:00
rafaelsideguide
40ad97dee8
added rate limits
2024-05-14 18:08:31 -03:00
rafaelsideguide
18480b2005
Removed .env.example, improved docs and docker compose envs
2024-05-10 11:38:17 -03:00
Eric Ciarla
caf3f9eede
Add Posthog Logging
2024-05-02 15:30:22 -04:00
Nicolas
9ded75adb7
Merge branch 'main' into nsc/mvp-search
2024-04-23 16:52:40 -07:00
Nicolas
41263bb4b6
Nick: serper support
2024-04-23 16:45:06 -07:00
rafaelsideguide
a680c7ce84
[Feat] Server health check + slack message
2024-04-23 15:46:29 -03:00
Caleb Peffer
ef4ffd3a18
Adding contributors guide
2024-04-21 10:56:30 -07:00
Caleb Peffer
ad7951a679
Merge branch 'main' of https://github.com/mendableai/firecrawl into cjp/contributors-guide-and
2024-04-20 19:56:55 -07:00
Caleb Peffer
e6b46178dd
Caleb: added .env.example
2024-04-20 19:53:27 -07:00