Update README.md

Nicolas 2024-08-28 13:03:07 -03:00
parent 4003d37fbc
commit 3039cc264f

README.md

@@ -6,7 +6,7 @@ _This repository is in its early development stages. We are still merging custom
## What is Firecrawl?
[Firecrawl](https://firecrawl.dev?ref=github) is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each. No sitemap required. Check out our [documentation](https://docs.firecrawl.dev).
_Pst. hey, you, join our stargazers :)_
@@ -41,18 +41,26 @@ To use the API, you need to sign up on [Firecrawl](https://firecrawl.dev) and ge
Used to crawl a URL and all accessible subpages. This submits a crawl job and returns a job ID to check the status of the crawl.
```bash
curl -X POST https://api.firecrawl.dev/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "limit": 100,
      "scrapeOptions": {
        "formats": ["markdown", "html"]
      }
    }'
```
Returns a crawl job ID and the URL to check the status of the crawl.
```json
{
  "success": true,
  "id": "123-456-789",
  "url": "https://api.firecrawl.dev/v1/crawl/123-456-789"
}
```
### Check Crawl Job
@@ -60,7 +68,7 @@ Returns a jobId
Used to check the status of a crawl job and get its result.
```bash
curl -X GET https://api.firecrawl.dev/v1/crawl/123-456-789 \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY'
```
@@ -68,18 +76,20 @@ curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
```json
{
  "status": "completed",
  "totalCount": 36,
  "creditsUsed": 36,
  "expiresAt": "2024-00-00T00:00:00.000Z",
  "data": [
    {
      "markdown": "[Firecrawl Docs home page![light logo](https://mintlify.s3-us-west-1.amazonaws.com/firecrawl/logo/light.svg)!...",
      "html": "<!DOCTYPE html><html lang=\"en\" class=\"js-focus-visible lg:[--scroll-mt:9.5rem]\" data-js-focus-visible=\"\">...",
      "metadata": {
        "title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
        "language": "en",
        "sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
        "description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
        "ogLocaleAlternate": [],
        "statusCode": 200
      }
    }
  ]
}
```
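Crawls run asynchronously, so a client typically polls this endpoint until the job finishes. Below is a minimal polling sketch in Python, assuming the `requests` library and only the endpoint and fields shown above (the API key and job id are placeholders):
```python
import time

import requests

API_KEY = "fc-YOUR_API_KEY"
CRAWL_ID = "123-456-789"  # the "id" returned when the crawl was submitted

# Poll the status endpoint until the crawl reports completion.
while True:
    resp = requests.get(
        f"https://api.firecrawl.dev/v1/crawl/{CRAWL_ID}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    status = resp.json()

    if status["status"] == "completed":
        # Each entry in "data" carries the requested formats plus metadata.
        for page in status["data"]:
            print(page["metadata"]["sourceURL"])
        break

    # Still running (error states are not handled in this sketch).
    time.sleep(5)
```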
@@ -88,14 +98,15 @@ curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
### Scraping
Used to scrape a URL and get its content in the specified formats.
```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "formats": ["markdown", "html"]
    }'
```
@@ -105,55 +116,83 @@ Response:
```json
{
  "success": true,
  "data": {
    "markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
    "html": "<!DOCTYPE html><html lang=\"en\" class=\"light\" style=\"color-scheme: light;\"><body class=\"__variable_36bd41 __variable_d7dc5d font-inter ...",
    "metadata": {
      "title": "Home - Firecrawl",
      "description": "Firecrawl crawls and converts any website into clean markdown.",
      "language": "en",
      "keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "Turn any website into LLM-ready data.",
      "ogUrl": "https://www.firecrawl.dev/",
      "ogImage": "https://www.firecrawl.dev/og.png?123",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "https://firecrawl.dev",
      "statusCode": 200
    }
  }
}
```
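If you are scripting against the API directly rather than using the SDKs below, the same request is easy to reproduce. A small sketch in Python, assuming the `requests` library and the request/response shape documented above:
```python
import requests

API_KEY = "fc-YOUR_API_KEY"

resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"url": "https://docs.firecrawl.dev", "formats": ["markdown", "html"]},
)
resp.raise_for_status()
data = resp.json()["data"]

# "data" holds one entry per requested format plus the page metadata.
print(data["metadata"]["title"])
print(data["markdown"][:200])  # first 200 characters of the markdown
```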
### Map (Alpha)
Used to map a URL and get the URLs of the website. This returns most links present on the website.
```bash cURL
curl -X POST https://api.firecrawl.dev/v1/map \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://firecrawl.dev"
    }'
```
Response:
```json
{
  "status": "success",
  "links": [
    "https://firecrawl.dev",
    "https://www.firecrawl.dev/pricing",
    "https://www.firecrawl.dev/blog",
    "https://www.firecrawl.dev/playground",
    "https://www.firecrawl.dev/smart-crawl",
  ]
}
```
#### Map with search
Map with the `search` param allows you to search for specific URLs inside a website.
```bash cURL
curl -X POST https://api.firecrawl.dev/v1/map \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://firecrawl.dev",
      "search": "docs"
    }'
```
Response will be an ordered list from the most relevant to the least relevant.
```json
{
  "status": "success",
  "links": [
    "https://docs.firecrawl.dev",
    "https://docs.firecrawl.dev/sdks/python",
    "https://docs.firecrawl.dev/learn/rag-llama3",
  ]
}
```
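A common pattern is to use map to discover pages and then scrape only the ones you care about. A rough sketch of that pipeline in Python, assuming the `requests` library and only the endpoints and fields documented above:
```python
import requests

API_KEY = "fc-YOUR_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# 1. Map the site, keeping only URLs that match "docs".
map_resp = requests.post(
    "https://api.firecrawl.dev/v1/map",
    headers=HEADERS,
    json={"url": "https://firecrawl.dev", "search": "docs"},
)
links = map_resp.json()["links"]

# 2. Scrape the first few matches as markdown.
for url in links[:3]:
    scrape_resp = requests.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers=HEADERS,
        json={"url": url, "formats": ["markdown"]},
    )
    page = scrape_resp.json()["data"]
    print(url, "->", len(page["markdown"]), "characters of markdown")
```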
### LLM Extraction (v0) (Beta)
Used to extract structured data from scraped pages.
@@ -220,6 +259,42 @@ curl -X POST https://api.firecrawl.dev/v0/scrape \
}
```
### Search (v0) (Beta)
Used to search the web, get the most relevant results, scrape each page and return the markdown.
```bash
curl -X POST https://api.firecrawl.dev/v0/search \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "query": "firecrawl",
      "pageOptions": {
        "fetchPageContent": true // false for a fast serp api
      }
    }'
```
```json
{
  "success": true,
  "data": [
    {
      "url": "https://mendable.ai",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www.mendable.ai/"
      }
    }
  ]
}
```
## Using Python SDK
### Installing Python SDK
@@ -231,24 +306,28 @@ pip install firecrawl-py
### Crawl a website
```python
from firecrawl.firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_status = app.scrape_url(
  'https://firecrawl.dev',
  params={'formats': ['markdown', 'html']}
)
print(scrape_status)

# Crawl a website:
crawl_status = app.crawl_url(
  'https://firecrawl.dev',
  params={
    'limit': 100,
    'scrapeOptions': {'formats': ['markdown', 'html']}
  },
  wait_until_done=True,
  poll_interval=30
)
print(crawl_status)
```
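Once the crawl completes, the returned object should mirror the crawl status response shown earlier (a `data` list with one entry per page); that shape is an assumption here, so verify it against your installed SDK version. A short sketch that walks the results and writes each page's markdown to disk:
```python
import pathlib

# Assumes crawl_status (from crawl_url above) mirrors the REST status
# response: a dict with a "data" list of pages, each carrying "markdown"
# and "metadata".
out_dir = pathlib.Path("crawl_output")
out_dir.mkdir(exist_ok=True)

for i, page in enumerate(crawl_status.get("data", [])):
    title = page.get("metadata", {}).get("title", f"page {i}")
    target = out_dir / f"{i:03d}.md"
    target.write_text(page.get("markdown", ""), encoding="utf-8")
    print(f"saved {title!r} -> {target}")
```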
### Extracting structured data from a URL
@@ -256,6 +335,11 @@ scraped_data = app.scrape_url(url)
With LLM extraction, you can easily extract structured data from any URL. We support pydantic schemas to make it easier for you too. Here is how to use it:
```python
from firecrawl.firecrawl import FirecrawlApp
from pydantic import BaseModel

app = FirecrawlApp(api_key="fc-YOUR_API_KEY", version="v0")

class ArticleSchema(BaseModel):
    title: str
    points: int
@@ -277,15 +361,6 @@ data = app.scrape_url('https://news.ycombinator.com', {
print(data["llm_extraction"])
```
## Using the Node SDK
### Installation
@@ -301,54 +376,33 @@ npm install @mendable/firecrawl-js
1. Get an API key from [firecrawl.dev](https://firecrawl.dev)
2. Set the API key as an environment variable named `FIRECRAWL_API_KEY` or pass it as a parameter to the `FirecrawlApp` class.
```js
import FirecrawlApp, { CrawlParams, CrawlStatusResponse } from '@mendable/firecrawl-js';

const app = new FirecrawlApp({apiKey: "fc-YOUR_API_KEY"});

// Scrape a website
const scrapeResponse = await app.scrapeUrl('https://firecrawl.dev', {
  formats: ['markdown', 'html'],
});

if (scrapeResponse) {
  console.log(scrapeResponse)
}

// Crawl a website
const crawlResponse = await app.crawlUrl('https://firecrawl.dev', {
  limit: 100,
  scrapeOptions: {
    formats: ['markdown', 'html'],
  }
} as CrawlParams, true, 30) as CrawlStatusResponse;

if (crawlResponse) {
  console.log(crawlResponse)
}
```
### Extracting structured data from a URL
@@ -360,6 +414,7 @@ import { z } from "zod";
const app = new FirecrawlApp({
  apiKey: "fc-YOUR_API_KEY",
  version: "v0"
});
// Define schema to extract contents into
@@ -384,19 +439,6 @@ const scrapeResult = await app.scrapeUrl("https://news.ycombinator.com", {
console.log(scrapeResult.data["llm_extraction"]);
```
## Contributing
We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.