Add notebook and markdown files for two articles: mastering /scrape and mastering /map

---
title: "How to Generate a Sitemap Using Firecrawl's /map Endpoint: A Complete Guide"
meta_description: "Learn how to generate XML and visual sitemaps using Firecrawl's /map endpoint. Step-by-step guide with Python code examples, performance comparisons, and interactive visualization techniques for effective website mapping."
slug: how-to-generate-sitemap-using-firecrawl-map-endpoint
date: 23 Nov, 2024
author: Bex Tuychiev
image: to_fill_in_later
categories: [tutorials]
keywords: [sitemap generator, website mapping, firecrawl, url discovery, site structure analysis, seo optimization, web crawling, website indexing, automated sitemap, site mapping tools, website architecture, sitemap creation, web development tools, website maintenance, seo tools, map endpoint, firecrawl map endpoint, python sitemap generator, visual sitemap creator, website structure visualization, interactive site mapping, xml sitemap generation, website crawling api, sitemap automation tools, website structure analysis tools]
---

## Introduction

In this guide, we'll explore Firecrawl's `/map` endpoint, a powerful tool for automated website mapping and URL discovery. We'll cover what it does, why it matters, and how to use it effectively in your web development workflow.

Website mapping has become increasingly critical in modern web development. As sites grow more complex with dynamic content and single-page applications, having a clear understanding of your site's structure and URL hierarchy is essential for SEO, maintenance, and user experience.

The `/map` endpoint helps solve common challenges like keeping track of site structure, identifying broken links, and ensuring search engines can properly crawl and index your content. Let's dive into how it works.

## Table of Contents

- [Introduction](#introduction)
- [Understanding Firecrawl's `/map` Endpoint: Features and Benefits](#understanding-firecrawls-map-endpoint-features-and-benefits)
- [What is Site Mapping and Why is it Essential for Modern Websites?](#what-is-site-mapping-and-why-is-it-essential-for-modern-websites)
- [Guide to Sitemap Types: Visual vs XML Sitemaps](#guide-to-sitemap-types-visual-vs-xml-sitemaps)
- [How Firecrawl's `/map` Endpoint Solves These Challenges](#how-firecrawls-map-endpoint-solves-these-challenges)
- [Limitations of `/map` in the alpha stage](#limitations-of-map-in-the-alpha-stage)
- [Step-by-Step Guide to Using the /map Endpoint](#step-by-step-guide-to-using-the-map-endpoint)
- [Further Configuration Options for Website Mapping](#further-configuration-options-for-website-mapping)
- [Optimizing URL discovery with `search` parameter](#optimizing-url-discovery-with-search-parameter)
- [Essential `/map` parameters for customized site mapping](#essential-map-parameters-for-customized-site-mapping)
- [Comparing `/crawl` and `/map`: When to Use Each Endpoint](#comparing-crawl-and-map-when-to-use-each-endpoint)
- [Step-by-Step Guide: Creating XML Sitemaps with `/map`](#step-by-step-guide-creating-xml-sitemaps-with-map)
- [Advanced Visualization: Building Interactive Visual Sitemaps with `/map`](#advanced-visualization-building-interactive-visual-sitemaps-with-map)
- [Conclusion](#conclusion)

## Understanding Firecrawl's `/map` Endpoint: Features and Benefits

To understand what the `/map` endpoint does, let's briefly cover what __site mapping__ is and why it is important.

### What is Site Mapping and Why is it Essential for Modern Websites?

Put simply, a sitemap is a list or a diagram that communicates the structure of the pages in a website. It is useful for a number of reasons.

First, it helps developers and site owners understand and maintain their website's structure. Having a clear overview of how pages are connected makes it easier to manage content, identify navigation issues, and ensure a logical flow for users.

Second, sitemaps are crucial for SEO. Search engines use sitemaps to discover and index pages more efficiently. A well-structured sitemap helps ensure all your important content gets crawled and indexed properly.

Third, sitemaps can help identify potential issues like broken links, orphaned pages (pages with no incoming links), or circular references. This makes troubleshooting and maintenance much more manageable.

Finally, sitemaps are valuable for planning site improvements and expansions. They provide a bird's-eye view that helps in making strategic decisions about content organization and information architecture.

### Guide to Sitemap Types: Visual vs XML Sitemaps

There are two main types of sitemaps: visual and XML.

Visual sitemaps are diagrams or flowcharts that show how websites are structured at a glance. They typically use boxes, lines, and other visual elements to represent pages and their relationships. These visual representations make it easy for stakeholders, designers, and developers to quickly understand site hierarchy, navigation paths, and content organization. They're particularly useful during the planning and design phases of web development, as well as for communicating site structure to non-technical team members.

![image alt="Example of a visual sitemap diagram showing website structure and page hierarchy for effective website planning and organization"](visual-sitemap-example.png)

Source: [FlowMapp](https://www.flowmapp.com/features/generate-sitemap-online)

XML sitemaps are shown to the public much less frequently because they contain structured XML code that can look intimidating to non-technical users. But an XML sitemap is just an organized file containing all the URLs of a website in a form readable to search engines. It includes important metadata about each URL like when it was last modified, how often it changes, and its relative importance. Search engines like Google use this information to crawl websites more intelligently and ensure all important pages are indexed. While XML sitemaps aren't meant for human consumption, they play a vital role in SEO and are often required for large websites to achieve optimal search engine visibility.

![Example of an XML sitemap showing structured URL data with lastmod dates and priority values for search engine optimization](xml-sitemap-example.png)

Source: [DataCamp](https://www.datacamp.com/sitemap/es/tutorial/category.xml)

### How Firecrawl's `/map` Endpoint Solves These Challenges

When you are building a website from scratch, you usually need a visual sitemap first and can develop the XML one over time as you add more pages. However, if you neglected these steps early on and suddenly find yourself with a massive website, possibly with thousands of URLs, creating either type of sitemap manually becomes an overwhelming task. This is where automated solutions like the `/map` endpoint become invaluable.

The real challenge of mapping existing sites is finding all the URLs that exist on your website. Without automated tools, you'd need to manually click through every link, record every URL, and track which pages link to which others. Traditional web scraping solutions using Python libraries like `beautifulsoup`, `scrapy` or `lxml` can automate this process, but they quickly become useless when dealing with modern web applications that rely heavily on JavaScript for rendering content, use complex authentication systems, or implement rate limiting and bot detection.

These traditional approaches are not only time-consuming but also error-prone, as it's easy to miss URLs in JavaScript-rendered content, dynamically generated pages, or deeply nested navigation menus.

The `/map` endpoint solves these challenges and provides the fastest and easiest way to go from a single URL to a map of the entire website. The `/map` endpoint is particularly useful in scenarios where:

- You want to give end-users control over which links to scrape by presenting them with options
- Rapid discovery of all available links on a website is crucial
- You need to focus on topic-specific content, which you can find with the `search` parameter
- You only want to extract data from particular sections of a website rather than crawling everything

### Limitations of `/map` in the alpha stage

While the `/map` endpoint is in the alpha stage, it has some limitations. The endpoint prioritizes speed, so it may not capture all website links. The vision for this endpoint is to maintain its blazing-fast speed while still capturing every single link in a given website. Feedback and suggestions are welcome.
## Step-by-Step Guide to Using the /map Endpoint

Firecrawl is a scraping engine exposed as a REST API, which means you can use it from the command line using cURL or through one of its SDKs in Python, Node, Go or Rust. In this tutorial, we will use its Python SDK, so please install it in your environment:

```bash
pip install firecrawl-py
```

The next step is obtaining a Firecrawl API key by signing up at [firecrawl.dev](https://firecrawl.dev) and choosing a plan (the free plan is fine for this tutorial).

Once you have your API key, you should save it in a `.env` file, which provides a secure way to store sensitive credentials without exposing them in your code:

```bash
touch .env
echo "FIRECRAWL_API_KEY='YOUR-API-KEY'" >> .env
```

Then, you should install `python-dotenv` to automatically load the variables in `.env` files in Python scripts and notebooks:

```bash
pip install python-dotenv
```

With that done, using the `/map` endpoint is as easy as the following code:

```python
from firecrawl import FirecrawlApp
from dotenv import load_dotenv

load_dotenv()

app = FirecrawlApp()

response = app.map_url(url="https://firecrawl.dev")
```

In this code snippet, we're using the Firecrawl Python SDK to map a URL. Let's break down what's happening:

First, we import two key components:

- `FirecrawlApp` from the `firecrawl` package, which provides the main interface to interact with Firecrawl's API
- `load_dotenv` from `dotenv` to load our environment variables containing the API key

After importing, we initialize a `FirecrawlApp` instance, which automatically picks up our API key from the environment variables.

Finally, we make a request to map the URL `https://firecrawl.dev` using the `map_url()` method. This crawls the website and returns information about its structure and pages, taking about two seconds on my machine (the speed may vary based on internet speeds).

Let's look at the `response` dictionary:

```python
response.keys()
```

```python
dict_keys(['success', 'links'])
```

It only has two keys: `success` and `links`. The `success` key indicates whether the request was successful, and the `links` key contains the URLs found on the website:

```python
len(response['links'])
```

```python
98
```
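
Since `links` is just a list of URL strings, ordinary list comprehensions are enough to slice it up. A minimal sketch, using a hand-built `response` dict standing in for the real one:

```python
# A response shaped like the one above (links trimmed for illustration)
response = {
    "success": True,
    "links": [
        "https://firecrawl.dev/",
        "https://firecrawl.dev/blog",
        "https://firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl",
        "https://firecrawl.dev/pricing",
    ],
}

# Keep only blog-related URLs
blog_links = [url for url in response["links"] if "/blog" in url]
print(len(blog_links))  # 2
```

The same pattern works for filtering by path prefix, file extension, or any other substring test.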

## Further Configuration Options for Website Mapping

### Optimizing URL discovery with `search` parameter

The most notable feature of the endpoint is its `search` parameter. This parameter allows you to filter the URLs returned by the crawler based on specific patterns or criteria. For example, you can use it to retrieve only URLs containing certain keywords or matching specific paths. This makes it incredibly useful for focused crawling tasks where you're only interested in a subset of pages on a massive website.

Let's use this feature on the Stripe documentation and search only for pages related to taxes:

```python
url = "https://docs.stripe.com"

response = app.map_url(url=url, params={"search": "tax"})
```

The response structure will be the same:

```python
response["links"][:10]
```

```python
['https://docs.stripe.com/tax',
 'https://docs.stripe.com/tax/how-tax-works',
 'https://docs.stripe.com/tax/reports',
 'https://docs.stripe.com/tax/calculating',
 'https://docs.stripe.com/api/tax_rates',
 'https://docs.stripe.com/tax/tax-codes',
 'https://docs.stripe.com/tax/zero-tax',
 'https://docs.stripe.com/tax/products-prices-tax-codes-tax-behavior',
 'https://docs.stripe.com/payments/checkout/taxes',
 'https://docs.stripe.com/billing/taxes/tax-rates']
```

Let's count the found links:

```python
len(response["links"])
```

```out
2677
```

More than 2600 links in only three seconds!

### Essential `/map` parameters for customized site mapping

There are some additional parameters `/map` accepts to control its behavior:

__`ignoreSitemap`__

- Type: boolean
- Default: true
- Description: When set to true, the crawler will not attempt to parse or use the website's `sitemap.xml` file during crawling. This can be useful when you want to discover pages through navigation links only.

__`sitemapOnly`__

- Type: boolean
- Default: false
- Description: When enabled, the crawler will exclusively return URLs found in the website's sitemap files, ignoring any links discovered through page crawling. This is useful for quickly indexing officially published pages.

__`includeSubdomains`__

- Type: boolean
- Default: false
- Description: Controls whether the crawler should follow and return links to subdomains (e.g., blog.example.com when crawling example.com). Enabling this provides a more comprehensive view of the entire web property.

__`limit`__

- Type: integer
- Default: 5000
- Description: Specifies the maximum number of URLs the crawler will return in a single request. This helps manage response sizes and processing time. Must be less than 5000 to prevent excessive server load for the time being.
- Required range: x < 5000

Let's rerun the Stripe example with some of these parameters, setting both `sitemapOnly` and `includeSubdomains` to `True`:

```python
url = "https://docs.stripe.com"

response = app.map_url(
    url=url,
    params={"search": "tax", "sitemapOnly": True, "includeSubdomains": True},
)
len(response['links'])
```

```python
2712
```

This time, the link count increased.

## Comparing `/crawl` and `/map`: When to Use Each Endpoint

If you read our [separate guide on the `/crawl` endpoint](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl) of Firecrawl, you may notice one similarity between it and the `/map` endpoint: if you set the response format of `/crawl` to "links", you also get a list of URLs found on the website. While the purpose is the same, there are huge differences performance-wise.

First, the `/crawl` endpoint is painfully slow for URL discovery, as evidenced by the execution times in the examples below:
```python
%%time

url = "books.toscrape.com"

crawl_response = app.crawl_url(url=url, params={"scrapeOptions": {"formats": ["links"]}})
```

```text
CPU times: user 843 ms, sys: 470 ms, total: 1.31 s
Wall time: 2min 9s
```

```python
%%time

url = "books.toscrape.com"

map_response = app.map_url(url=url)
```

```out
CPU times: user 4.91 ms, sys: 3.58 ms, total: 8.49 ms
Wall time: 2.04 s
```
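
The wall-clock numbers above work out to a speedup of roughly 60x, which a quick bit of arithmetic confirms:

```python
# Wall times reported by the %%time cells above
crawl_wall_seconds = 2 * 60 + 9   # 2min 9s
map_wall_seconds = 2.04

print(round(crawl_wall_seconds / map_wall_seconds))  # 63
```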

![Bar chart comparison showing crawl endpoint taking 129 seconds versus map endpoint taking only 2 seconds, demonstrating map's significantly faster performance for URL discovery](speed-comparison.png)

This is because `/crawl` needs to fully load and parse each page's HTML content, even when we only want the links. In contrast, `/map` is optimized specifically for URL discovery, making it much faster for generating sitemaps and link analysis.

But since `/map` is in alpha, it doesn't capture as many links as `/crawl`:
```python
crawl_links = set()

for page in crawl_response['data']:
    crawl_links.update(page["links"])

len(crawl_links)
```

```python
1216
```

```python
len(map_response["links"])
```

```python
298
```

In fact, `/map` found roughly a quarter of the links that `/crawl` did (298 versus 1216).

However, due to its speed, `/map` can still provide a good foundation for sitemap generation, and its accuracy will increase as it progresses towards being stable.
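
If you have both results in memory, a set difference shows exactly which URLs `/map` missed. A sketch with toy data standing in for the real `crawl_links` and `map_response` link sets:

```python
# Toy stand-ins for the link sets collected in the comparison above
crawl_links = {
    "https://books.toscrape.com/",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-3.html",
}
map_links = {
    "https://books.toscrape.com/",
    "https://books.toscrape.com/catalogue/page-2.html",
}

missed_by_map = crawl_links - map_links
print(missed_by_map)  # {'https://books.toscrape.com/catalogue/page-3.html'}
```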

## Step-by-Step Guide: Creating XML Sitemaps with `/map`

Now, let's see how to convert the links found with `/map` into an XML sitemap in Python. We will need to import the following packages:

```python
from datetime import datetime
import xml.etree.ElementTree as ET
from urllib.parse import urlparse
```

We'll use:

- `datetime`: To add timestamps to our sitemap entries
- `xml.etree.ElementTree`: To create and structure the XML sitemap file
- `urllib.parse`: To parse and validate URLs before adding them to the sitemap

Let's start by defining a new function, `create_xml_sitemap`:

```python
def create_xml_sitemap(urls, base_url):
    # Create the root element
    urlset = ET.Element("urlset")
    urlset.set("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9")
```
In the body of the function, we first create the root XML element named "urlset" using `ET.Element()`. Then we set its `xmlns` attribute to the sitemap schema URL `http://www.sitemaps.org/schemas/sitemap/0.9` to identify this as a valid sitemap XML document.

Then, we get the current date to provide a last-modified date (since `/map` doesn't return the modified dates of pages):

```python
def create_xml_sitemap(urls, base_url):
    # Create the root element
    ...

    # Get current date for lastmod
    today = datetime.now().strftime("%Y-%m-%d")
```

Then, we add each URL to the sitemap:
```python
def create_xml_sitemap(urls, base_url):
    ...

    # Add each URL to the sitemap
    for url in urls:
        # Only include URLs from the same domain
        if urlparse(url).netloc == urlparse(base_url).netloc:
            url_element = ET.SubElement(urlset, "url")
            loc = ET.SubElement(url_element, "loc")
            loc.text = url

            # Add optional elements
            lastmod = ET.SubElement(url_element, "lastmod")
            lastmod.text = today

            changefreq = ET.SubElement(url_element, "changefreq")
            changefreq.text = "monthly"

            priority = ET.SubElement(url_element, "priority")
            priority.text = "0.5"
```

The loop iterates through each URL in the provided list and adds it to the sitemap XML structure. For each URL, it first checks whether the domain matches the base URL's domain, to ensure we only include URLs from the same website. If it matches, it creates a new `<url>` element and adds several child elements:

- `<loc>`: Contains the actual URL
- `<lastmod>`: Set to today's date to indicate when the page was last modified
- `<changefreq>`: Set to "monthly" to suggest how often the page content changes
- `<priority>`: Set to "0.5" to indicate the relative importance of the page

This creates a properly formatted sitemap entry for each URL following the Sitemap XML protocol specifications.
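
To see what one such entry looks like, you can serialize a single hand-built element with the same `ElementTree` calls (a standalone sketch, separate from the function above):

```python
import xml.etree.ElementTree as ET

# Build one <url> entry by hand, the way the loop above does
url_element = ET.Element("url")
loc = ET.SubElement(url_element, "loc")
loc.text = "https://books.toscrape.com/index.html"
lastmod = ET.SubElement(url_element, "lastmod")
lastmod.text = "2024-11-23"

print(ET.tostring(url_element, encoding="unicode"))
# <url><loc>https://books.toscrape.com/index.html</loc><lastmod>2024-11-23</lastmod></url>
```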

After the loop finishes, we create and return the XML string:

```python
def create_xml_sitemap(urls, base_url):
    ...

    # Add each URL to the sitemap
    for url in urls:
        ...

    # Create the XML string
    return ET.tostring(urlset, encoding="unicode", method="xml")
```

Here is the full function:

```python
def create_xml_sitemap(urls, base_url):
    # Create the root element
    urlset = ET.Element("urlset")
    urlset.set("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9")

    # Get current date for lastmod
    today = datetime.now().strftime("%Y-%m-%d")

    # Add each URL to the sitemap
    for url in urls:
        # Only include URLs from the same domain
        if urlparse(url).netloc == urlparse(base_url).netloc:
            url_element = ET.SubElement(urlset, "url")
            loc = ET.SubElement(url_element, "loc")
            loc.text = url

            # Add optional elements
            lastmod = ET.SubElement(url_element, "lastmod")
            lastmod.text = today

            changefreq = ET.SubElement(url_element, "changefreq")
            changefreq.text = "monthly"

            priority = ET.SubElement(url_element, "priority")
            priority.text = "0.5"

    # Create the XML string
    return ET.tostring(urlset, encoding="unicode", method="xml")
```

Let's use the function on the links returned by the last `/map` endpoint call:

```python
base_url = "https://books.toscrape.com"
links = map_response["links"]

xml_sitemap = create_xml_sitemap(links, base_url)

# Save to file
with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write(xml_sitemap)
```

Here is what the file looks like after saving:

![Sample of a generated XML sitemap file showing structured URL entries with lastmod dates and priority values for search engine optimization](sample-sitemap.png)
This `sitemap.xml` file provides a standardized way for search engines to discover and crawl all pages on your website.
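
As a quick sanity check, you can parse the generated XML back and count its entries. A sketch using a minimal inline document in the same format, rather than the actual `sitemap.xml` file:

```python
import xml.etree.ElementTree as ET

# A minimal document in the same format as the one saved above
xml_doc = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://books.toscrape.com/</loc></url>"
    "<url><loc>https://books.toscrape.com/catalogue/page-2.html</loc></url>"
    "</urlset>"
)

# The sitemap namespace must be mapped explicitly for findall()
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(xml_doc)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(len(urls))  # 2
```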
## Advanced Visualization: Building Interactive Visual Sitemaps with `/map`

If you want a visual sitemap of a website, you don't have to sign up for expensive third-party services and platforms. You can automatically generate one using the `/map` endpoint, Plotly and a few other libraries.

The resulting graph would look like the following:

![Interactive Sankey diagram visualization of Stripe documentation website structure showing hierarchical organization of pages and sections](stripe-sankey-diagram.png)

The Sankey diagram above visualizes the hierarchical structure of [the Stripe documentation](https://docs.stripe.com) (which is quite large) by showing how pages are organized and connected across different sections. The width of each flow represents the number of pages in that section, making it easy to identify which parts of the website contain the most content. The colors help distinguish between different sections and their subsections.

The diagram starts from a central root node and branches out into the main sections of the website. Each section can then split further into subsections, creating a tree-like visualization of the site's architecture. This makes it simple to understand the overall organization and identify potential navigation or structural issues.

For example, you can quickly spot which sections are the largest (the API section), how content is distributed across different areas, and whether there's a logical grouping of related pages. This visualization is particularly useful for content strategists, SEO specialists, and web architects who need to analyze and optimize website structure.

[The script that generated this plot](https://github.com/FirstClassML/firecrawl_articles/blob/main/3_generating_sitemap/sitemap_generator.py) contains more than 400 lines of code, and I made it fully customizable. The code in `sitemap_generator.py` follows a modular, object-oriented approach with several key components:
1. A `HierarchyBuilder` class that analyzes URLs returned by `/map` or `/crawl` and builds a tree-like data structure up to four levels deep
2. A `SankeyDataPreparator` class that transforms this hierarchy into a format suitable for visualization, using thresholds to control complexity
3. A `SitemapVisualizer` class that creates the final Sankey diagram with proper styling and interactivity

The script automatically handles things like grouping smaller sections together, truncating long labels, generating color schemes, and adding hover information (the generated plots are all interactive through Plotly). All aspects like minimum branch size, relative thresholds, label length, and color schemes can be customized through parameters.
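
The hierarchy-building step can be sketched in a few lines. This is a simplified stand-in for the `HierarchyBuilder` class, not its actual code, counting pages per top-level path segment:

```python
from collections import Counter
from urllib.parse import urlparse

def top_level_sections(urls):
    """Count pages per first path segment, e.g. /tax/reports -> 'tax'."""
    sections = Counter()
    for url in urls:
        path = urlparse(url).path.strip("/")
        sections[path.split("/")[0] if path else "root"] += 1
    return sections

links = [
    "https://docs.stripe.com/tax",
    "https://docs.stripe.com/tax/reports",
    "https://docs.stripe.com/api/tax_rates",
]
print(top_level_sections(links))  # Counter({'tax': 2, 'api': 1})
```

Repeating the same split on the remaining path segments yields the deeper levels of the tree that the Sankey diagram draws.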

Here is another plot generated for the [PyData.org](https://pydata.org) website:

![Interactive Sankey diagram visualization of PyData.org website structure showing hierarchical organization of pages and sections](pydata-sankey-diagram.png)
## Conclusion

The `/map` endpoint represents a powerful tool in the modern web developer's toolkit, offering a fast and efficient way to discover and analyze website structures. While still in alpha, it provides significant advantages:

- __Speed__: As demonstrated, it's significantly faster than traditional crawling methods, making it ideal for quick site analysis
- __Flexibility__: With parameters like `search`, `sitemapOnly`, and `includeSubdomains`, it can be tailored to specific needs
- __Practical Applications__: From generating XML sitemaps for SEO to creating visual site hierarchies, the endpoint serves multiple use cases

While it may not capture every single URL compared to full crawling solutions, its speed and ease of use make it an excellent choice for rapid site mapping and initial structure analysis. As the endpoint continues to evolve, its combination of performance and accuracy will make it an increasingly valuable tool for website maintenance, SEO optimization, and content strategy.

To discover what more Firecrawl has to offer, be sure to read the following related resources:

- [Firecrawl Documentation](https://docs.firecrawl.dev)
- [Firecrawl Blog](https://www.firecrawl.dev/blog/category/tutorials)
- [Firecrawl API Reference](https://docs.firecrawl.dev/api-reference/introduction)
## Frequently Asked Questions

### How fast is Firecrawl's `/map` endpoint?

The `/map` endpoint typically processes websites in 2-3 seconds, compared to several minutes with traditional crawling methods.

### Can I use the `/map` endpoint for large websites?

Yes, the `/map` endpoint can handle large websites with a current limit of 5000 URLs per request, making it suitable for most medium to large websites.

### What's the difference between XML and visual sitemaps?

XML sitemaps are machine-readable files used by search engines for indexing, while visual sitemaps provide a graphical representation of website structure for human understanding and planning.
---
title: "How to Use Firecrawl's Scrape API: Complete Web Scraping Tutorial"
meta_description: Learn how to scrape websites using Firecrawl's /scrape endpoint. Master JavaScript rendering, structured data extraction, and batch operations with Python code examples.
slug: mastering-firecrawl-scrape-endpoint
date: 22 Nov, 2024
author: Bex Tuychiev
image: to_fill_in_later
categories: [tutorials]
keywords: ["firecrawl", "web scraping", "scrape endpoint", "data extraction", "javascript rendering", "structured data", "web automation", "python sdk", "api", "web crawling", "data collection", "web data", "web scraping tutorial", "python"]
---

## Getting Started with Modern Web Scraping: An Introduction

Traditional web scraping presents unique challenges. Relevant information is often scattered across multiple pages containing complex elements like code blocks, iframes, and media. JavaScript-heavy websites and authentication requirements add further complexity to the scraping process.

Even after successfully scraping, the content requires specific formatting to be useful for downstream processes like data engineering or training AI and machine learning models.

Firecrawl addresses these challenges by providing a specialized scraping solution. Its [`/scrape` endpoint](https://docs.firecrawl.dev/features/scrape) offers features like JavaScript rendering, automatic content extraction, bypassing blockers, and flexible output formats that make it easier to collect high-quality information and training data at scale.

In this guide, we'll explore how to effectively use Firecrawl's `/scrape` endpoint to extract structured data from static and dynamic websites. We'll start with a basic scraping setup and then dive into a real-world example of scraping weather data from weather.com, demonstrating how to handle JavaScript-based interactions, extract structured data using schemas, and capture screenshots during the scraping process.
## Table of Contents

- [Getting Started with Modern Web Scraping: An Introduction](#getting-started-with-modern-web-scraping-an-introduction)
- [What Is Firecrawl's `/scrape` Endpoint? The Short Answer](#what-is-firecrawls-scrape-endpoint-the-short-answer)
- [Prerequisites: Setting Up Firecrawl](#prerequisites-setting-up-firecrawl)
- [Basic Scraping Setup](#basic-scraping-setup)
- [Large-scale Scraping With Batch Operations](#large-scale-scraping-with-batch-operations)
- [Batch Scraping with `batch_scrape_urls`](#batch-scraping-with-batch_scrape_urls)
- [Asynchronous batch scraping with `async_batch_scrape_urls`](#asynchronous-batch-scraping-with-async_batch_scrape_urls)
- [How to Scrape Dynamic JavaScript Websites](#how-to-scrape-dynamic-javascript-websites)
- [Conclusion](#conclusion)
## What Is Firecrawl's `/scrape` Endpoint? The Short Answer

The `/scrape` endpoint is Firecrawl's core web scraping API that enables automated extraction of content from any webpage. It handles common web scraping challenges like:

- JavaScript rendering - Executes JavaScript to capture dynamically loaded content
- Content extraction - Automatically identifies and extracts main content while filtering out noise
- Format conversion - Converts HTML to clean formats like Markdown or structured JSON
- Screenshot capture - Takes full or partial page screenshots during scraping
- Browser automation - Supports clicking, typing and other browser interactions
- Anti-bot bypass - Uses rotating proxies and browser fingerprinting to avoid blocks

The endpoint accepts a URL and configuration parameters, then returns the scraped content in your desired format. It's designed to be flexible enough for both simple static page scraping and complex dynamic site automation.

Now that we understand what the endpoint does at a high level, let's look at how to set it up and start using it in practice.
|
||||
|
||||
## Prerequisites: Setting Up Firecrawl
|
||||
|
||||
Firecrawl's scraping engine is exposed as a REST API, so you can use command-line tools like cURL to use it. However, for a more comfortable experience, better flexibility and control, I recommend using one of its SDKs for Python, Node, Rust or Go. This tutorial will focus on the Python version.
|
||||
|
||||
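If you prefer the raw REST route, a call is just a POST with a bearer token. Below is a minimal sketch of the request shape; the `/v1/scrape` URL and payload fields mirror what the SDK sends, but treat the exact field names as assumptions to verify against the API documentation:

```python
import os

API_URL = "https://api.firecrawl.dev/v1/scrape"

def build_scrape_request(url, formats=None):
    """Assemble headers and JSON payload for a raw /scrape call."""
    headers = {
        # The key is read from the environment (set up in the .env step below)
        "Authorization": f"Bearer {os.getenv('FIRECRAWL_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    payload = {"url": url, "formats": formats or ["markdown"]}
    return headers, payload

headers, payload = build_scrape_request("https://arxiv.org", ["markdown", "links"])
# An HTTP client would then send it, e.g.:
# requests.post(API_URL, headers=headers, json=payload)
```

The SDK wraps exactly this kind of request, which is why its `params` dictionaries later in this tutorial look so similar to the JSON payload here.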
To get started, please make sure to:

1. Sign up at [firecrawl.dev](https://firecrawl.dev).
2. Choose a plan (the free one will work fine for this tutorial).

Once you sign up, you will be given an API key, which you can copy from your [dashboard](https://www.firecrawl.dev/app). The best way to save your key is in a `.env` file, which is ideal for the purposes of this article:

```bash
touch .env
echo "FIRECRAWL_API_KEY='YOUR_API_KEY'" >> .env
```

Now, let's install the Firecrawl Python SDK, `python-dotenv` to read `.env` files, and Pandas for data analysis later:

```bash
pip install firecrawl-py python-dotenv pandas
```

## Basic Scraping Setup

Scraping with Firecrawl starts by creating an instance of the `FirecrawlApp` class:

```python
from firecrawl import FirecrawlApp
from dotenv import load_dotenv

load_dotenv()

app = FirecrawlApp()
```

When you call the `load_dotenv()` function, the app can automatically use your loaded API key to establish a connection with the scraping engine. Then, scraping any URL takes a single line of code:

```python
url = "https://arxiv.org"
data = app.scrape_url(url)
```

Let's take a look at the response format returned by the `scrape_url` method:

```python
data['metadata']
```

```python
{
    "title": "arXiv.org e-Print archiveopen searchopen navigation menucontact arXivsubscribe to arXiv mailings",
    "language": "en",
    "ogLocaleAlternate": [],
    "viewport": "width=device-width, initial-scale=1",
    "msapplication-TileColor": "#da532c",
    "theme-color": "#ffffff",
    "sourceURL": "https://arxiv.org",
    "url": "https://arxiv.org/",
    "statusCode": 200
}
```

The response `metadata` includes basic information like the page title, viewport settings and a status code.

Now, let's look at the scraped content, which is converted into `markdown` by default:

```python
from IPython.display import Markdown

Markdown(data['markdown'][:500])
```

```text
arXiv is a free distribution service and an open-access archive for nearly 2.4 million
scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
Materials on this site are not peer-reviewed by arXiv.

Subject search and browse:

Physics

Mathematics

Quantitative Biology

Computer Science

Quantitative Finance

Statistics

Electrical Engineering and Systems Scienc
```

The response can include several other formats that we can request when scraping a URL. Let's try requesting multiple formats at once to see what additional data we can get back:

```python
data = app.scrape_url(
    url,
    params={
        'formats': [
            'html',
            'rawHtml',
            'links',
            'screenshot',
        ]
    }
)
```

Here is what these formats scrape:

- **html**: The processed HTML content of the page, with Firecrawl's cleaning applied.
- **rawHtml**: The unprocessed HTML content, exactly as it appears on the page.
- **links**: A list of all the hyperlinks found on the page.
- **screenshot**: An image capture of the page as it appears in a browser.

The HTML format is useful for developers who need to analyze or manipulate the structure of a webpage. The `rawHtml` format is ideal for cases where the exact original HTML content is required, such as for archival purposes or detailed comparison. The `links` format is beneficial for SEO specialists and web crawlers who need to extract and analyze all hyperlinks on a page. The `screenshot` format is perfect for visual documentation, quality assurance, and capturing the appearance of a webpage at a specific point in time.

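The `links` format returns plain URL strings, so ordinary list operations are enough to work with it. A quick sketch using a hypothetical sample of what `data['links']` might contain for arXiv:

```python
from urllib.parse import urlparse

# Hypothetical sample of data['links'] scraped from arxiv.org
links = [
    "https://arxiv.org/list/math/recent",
    "https://arxiv.org/abs/2411.09833",
    "https://info.arxiv.org/help",
    "https://subscribe.sorryapp.com/24846f03/email/new",
]

# Internal links: host is arxiv.org or one of its subdomains
internal = [u for u in links if urlparse(u).netloc.endswith("arxiv.org")]

# Abstract pages only
abstracts = [u for u in links if "/abs/" in u]

print(len(internal), abstracts)
```

This kind of filtering is the usual first step before feeding discovered URLs into the batch-scraping methods covered later in this article.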
Passing more than one scraping format to `params` adds additional keys to the response:

```python
data.keys()
```

```text
dict_keys(['rawHtml', 'screenshot', 'metadata', 'html', 'links'])
```

Let's display the screenshot Firecrawl took of arXiv.org:

```python
from IPython.display import Image

Image(data['screenshot'])
```

![arXiv homepage screenshot](image1)

Notice how the screenshot is cropped to fit a certain viewport. For most pages, it is better to capture the entire screen by using the `screenshot@fullPage` format:

```python
data = app.scrape_url(
    url,
    params={
        "formats": [
            "screenshot@fullPage",
        ]
    }
)

Image(data['screenshot'])
```

![arXiv homepage full-page screenshot](image2)

As a bonus, the `/scrape` endpoint can handle PDF links as well:

```python
pdf_link = "https://arxiv.org/pdf/2411.09833.pdf"
data = app.scrape_url(pdf_link)

Markdown(data['markdown'][:500])
```

```text
arXiv:2411.09833v1 \[math.DG\] 14 Nov 2024
EINSTEIN METRICS ON THE FULL FLAG F(N).
MIKHAIL R. GUZMAN
Abstract.LetM=G/Kbe a full flag manifold. In this work, we investigate theG-
stability of Einstein metrics onMand analyze their stability types, including coindices,
for several cases. We specifically focus onF(n) = SU(n)/T, emphasizingn= 5, where
we identify four new Einstein metrics in addition to known ones. Stability data, including
coindex and Hessian spectrum, confirms that these metrics on
```

### Further Scrape Configuration Options

By default, `scrape_url` converts everything it sees on a webpage to one of the specified formats. To control this behavior, Firecrawl offers the following parameters:

- `onlyMainContent`
- `includeTags`
- `excludeTags`

`onlyMainContent` excludes the navigation, footers, headers, etc. and is set to `True` by default.

`includeTags` and `excludeTags` can be used to whitelist/blacklist certain HTML elements:

```python
url = "https://arxiv.org"

data = app.scrape_url(url, params={"includeTags": ["p"], "excludeTags": ["span"]})

Markdown(data['markdown'][:1000])
```

```markdown
[Help](https://info.arxiv.org/help) \| [Advanced Search](https://arxiv.org/search/advanced)

arXiv is a free distribution service and an open-access archive for nearly 2.4 million
scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
Materials on this site are not peer-reviewed by arXiv.

[arXiv Operational Status](https://status.arxiv.org)

Get status notifications via
[email](https://subscribe.sorryapp.com/24846f03/email/new)
or [slack](https://subscribe.sorryapp.com/24846f03/slack/new)
```

`includeTags` and `excludeTags` also support referring to HTML elements by their `#id` or `.class-name`.

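For example, to keep an article container while dropping ads and comments, you could mix tag names with `#id` and `.class-name` selectors. The selector names below are hypothetical; inspect your target page with the browser's developer tools for the real ones:

```python
params = {
    # Keep only the main article container, its body, and paragraphs
    "includeTags": ["#main-article", ".article-body", "p"],
    # Drop ad banners, the comment section, and inline spans
    "excludeTags": [".ad-banner", "#comments", "span"],
}

# Passed to scrape_url exactly like the tag-only version above:
# data = app.scrape_url(url, params=params)
print(params)
```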
These configuration options help ensure efficient and precise scraping. While `onlyMainContent` filters out peripheral elements, `includeTags` and `excludeTags` enable surgical targeting of specific HTML elements - particularly valuable when dealing with complex webpage structures or when only certain content types are needed.

## Advanced Data Extraction: Structured Techniques

Scraping clean, LLM-ready data is the core philosophy of Firecrawl. However, certain web pages with complex structures can get in the way of this philosophy when scraped in their entirety. For this reason, Firecrawl offers two scraping methods for better structured outputs:

1. Natural language extraction - use prompts to extract specific information and have an LLM structure the response
2. Manual structured data extraction - define JSON schemas to have an LLM scrape data in a predefined format

In this section, we will cover both methods.

### Natural Language Extraction - Use AI to Extract Data

To illustrate natural language scraping, let's try extracting all news article links that may be related to the 2024 US presidential election from the New York Times:

```python
url = "https://nytimes.com"

data = app.scrape_url(
    url,
    params={
        'formats': ['markdown', 'extract', 'screenshot'],
        'extract': {
            'prompt': "Return a list of links of news articles that may be about the 2024 US presidential election"
        }
    }
)
```

To enable this feature, you add the `extract` option to the list of `formats` and provide a prompt in a dictionary under a separate `extract` field.

Once scraping finishes, the response will include a new `extract` key:

```python
data['extract']
```

```python
{'news_articles': [{'title': 'Harris Loss Has Democrats Fighting Over How to Talk About Transgender Rights',
   'link': 'https://www.nytimes.com/2024/11/20/us/politics/presidential-campaign-transgender-rights.html'},
  {'title': 'As Democrats Question How to Win Back Latinos, Ruben Gallego Offers Answers',
   'link': 'https://www.nytimes.com/2024/11/20/us/politics/ruben-gallego-arizona-latino-voters-democrats.html'},
  ...
  {'title': 'The Final Push for Ukraine?',
   'link': 'https://www.nytimes.com/2024/11/20/briefing/ukraine-russia-trump.html'}]}
```

Due to the nature of this scraping method, the returned output can have an arbitrary structure, as we can see above. This particular output follows the format:

```python
{
    "news_articles": [
        {"title": "article_title", "link": "article_url"},
        ...
    ]
}
```

This LLM-based extraction has endless applications, from extracting specific data points from complex websites to analyzing sentiment across multiple news sources to gathering structured information from unstructured web content.

To improve the accuracy of the extraction and give additional instructions, you have the option to include a system prompt for the underlying LLM:

```python
data = app.scrape_url(
    url,
    params={
        'formats': ['markdown', 'extract'],
        'extract': {
            'prompt': "Find any mentions of specific dollar amounts or financial figures and return them with their context and article link.",
            'systemPrompt': "You are a helpful assistant that extracts numerical financial data."
        }
    }
)
```

Above, we are dictating that the LLM must act as an assistant that extracts numerical financial data. Let's look at its response:

```python
data['extract']
```

```python
{'financial_data': [
    {
        'amount': 121200000,
        'context': 'René Magritte became the 16th artist whose work broke the nine-figure '
                   'threshold at auction when his painting sold for $121.2 million.',
        'article_link': 'https://www.nytimes.com/2024/11/19/arts/design/magritte-surrealism-christies-auction.html'
    },
    {
        'amount': 5000000,
        'context': 'Benjamin Netanyahu offers $5 million for each hostage freed in Gaza.',
        'article_link': 'https://www.nytimes.com/2024/11/19/world/middleeast/israel-5-million-dollars-hostage.html'
    }
]}
```

The output shows the LLM successfully extracted two financial data points from the articles.

The LLM not only identified the specific amounts but also provided relevant context and source article links for each figure.

### Schema-Based Data Extraction - Building Structured Models

While natural language scraping is powerful for exploration and prototyping, production systems typically require more structured and deterministic approaches. LLM responses can vary between runs of the same prompt, making the output format inconsistent and difficult to reliably parse in automated workflows.

For this reason, Firecrawl allows you to pass a predefined schema to guide the LLM's output when transforming the scraped content. To facilitate this feature, Firecrawl uses Pydantic models.

In the example below, we will extract news article links and titles, along with some additional details, from the New York Times:

```python
from pydantic import BaseModel, Field

class IndividualArticle(BaseModel):
    title: str = Field(description="The title of the news article")
    subtitle: str = Field(description="The subtitle of the news article")
    url: str = Field(description="The URL of the news article")
    author: str = Field(description="The author of the news article")
    date: str = Field(description="The date the news article was published")
    read_duration: int = Field(description="The estimated time it takes to read the news article")
    topics: list[str] = Field(description="A list of topics the news article is about")

class NewsArticlesSchema(BaseModel):
    news_articles: list[IndividualArticle] = Field(
        description="A list of news articles extracted from the page"
    )
```

Above, we define a Pydantic schema that specifies the structure of the data we want to extract. The schema consists of two models.

`IndividualArticle` defines the structure for individual news articles with fields for:

- `title`
- `subtitle`
- `url`
- `author`
- `date`
- `read_duration`
- `topics`

`NewsArticlesSchema` acts as a container model that holds a list of `IndividualArticle` objects, representing multiple articles extracted from the page. Without this container model, Firecrawl would only return the first news article it finds.

Each model field uses Pydantic's `Field` class to provide descriptions that help guide the LLM in correctly identifying and extracting the requested data. This structured approach ensures consistent output formatting.

The next step is passing this schema to the `extract` parameter of `scrape_url`:

```python
url = "https://nytimes.com"

structured_data = app.scrape_url(
    url,
    params={
        "formats": ["extract", "screenshot"],
        "extract": {
            "schema": NewsArticlesSchema.model_json_schema(),
            "prompt": "Extract the following data from the NY Times homepage: news article title, url, author, date, read_duration for all news articles",
            "systemPrompt": "You are a helpful assistant that extracts news article data from NY Times.",
        },
    },
)
```

While passing the schema, we call its `model_json_schema()` method to automatically convert it to valid JSON. Let's look at the output:

```python
structured_data['extract']
```

```python
{
    'news_articles': [
        {
            'title': 'How Google Spent 15 Years Creating a Culture of Concealment',
            'subtitle': '',
            'url': 'https://www.nytimes.com/2024/11/20/technology/google-antitrust-employee-messages.html',
            'author': 'David Streitfeld',
            'date': '2024-11-20',
            'read_duration': 9,
            'topics': []
        },
        # ... additional articles ...
        {
            'title': 'The Reintroduction of Daniel Craig',
            'subtitle': '',
            'url': 'https://www.nytimes.com/2024/11/20/movies/daniel-craig-queer.html',
            'author': '',
            'date': '2024-11-20',
            'read_duration': 9,
            'topics': []
        }
    ]
}
```

This time, the response fields exactly match the fields we set during schema definition:

```python
{
    "news_articles": [
        {...},  # Article 1
        {...},  # Article 2
        ...     # Article n
    ]
}
```

When creating the scraping schema, the following best practices can go a long way in ensuring reliable and accurate data extraction:

1. Keep field names simple and descriptive
2. Use clear field descriptions that guide the LLM
3. Break complex data into smaller, focused fields
4. Include validation rules where possible
5. Consider making fields optional when they may not always be present
6. Test the schema with a variety of content examples
7. Iterate and refine based on extraction results

To follow these best practices, the following Pydantic tips can help:

1. Use `Field(default=None)` to make fields optional
2. Add validation with `Field(min_length=1, max_length=100)`
3. Create custom validators with the `@validator` decorator
4. Use `conlist()` for list fields with constraints
5. Add example values with `Field(example="Sample text")`
6. Create nested models for complex data structures
7. Use computed fields with the `@property` decorator

If you follow all these tips, your schema can become quite sophisticated, like the one below:

```python
from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import datetime


class Author(BaseModel):
    # Required field - must be provided when creating an Author
    name: str = Field(
        ...,
        min_length=1,
        max_length=100,
        description="The full name of the article author",
    )

    # Optional field - can be None or omitted
    title: Optional[str] = Field(
        None, description="Author's title or role, if available"
    )


class NewsArticle(BaseModel):
    # Required field - must be provided when creating a NewsArticle
    title: str = Field(
        ...,
        min_length=5,
        max_length=300,
        description="The main headline or title of the news article",
        example="Breaking News: Major Scientific Discovery",
    )

    # Required field - must be provided when creating a NewsArticle
    url: str = Field(
        ...,
        description="The full URL of the article",
        example="https://www.nytimes.com/2024/01/01/science/discovery.html",
    )

    # Optional field - can be None or omitted
    authors: Optional[List[Author]] = Field(
        default=None, description="List of article authors and their details"
    )

    # Optional field - can be None or omitted
    publish_date: Optional[datetime] = Field(
        default=None, description="When the article was published"
    )

    # Optional field with default empty list
    financial_amounts: List[float] = Field(
        default_factory=list,
        max_length=10,
        description="Any monetary amounts mentioned in the article in USD",
    )

    @property
    def is_recent(self) -> bool:
        if not self.publish_date:
            return False
        return (datetime.now() - self.publish_date).days < 7
```

The schema above defines two key data models for news article data.

`Author` - represents article author information with:

- `name` (required): The author's full name
- `title` (optional): The author's role or title

`NewsArticle` - represents a news article with:

- `title` (required): The article headline (5-300 chars)
- `url` (required): Full article URL
- `authors` (optional): List of `Author` objects
- `publish_date` (optional): Article publication datetime
- `financial_amounts` (optional): List of monetary amounts in USD

The `NewsArticle` model also includes an `is_recent` property that checks whether the article was published within the last 7 days.

As you can see, the web scraping process becomes much easier and more powerful when you combine it with structured data models that validate and organize the scraped information. This allows for consistent data formats, type checking, and easy access to properties like checking if an article is recent.

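To see a computed property like this in action without scraping anything, here is the same recency logic rehearsed on a stdlib dataclass with made-up dates, so it runs even without Pydantic installed:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class ArticleStub:
    title: str
    publish_date: Optional[datetime] = None

    @property
    def is_recent(self) -> bool:
        # Mirrors the is_recent logic shown in the schema above
        if not self.publish_date:
            return False
        return (datetime.now() - self.publish_date).days < 7

fresh = ArticleStub("New piece", datetime.now() - timedelta(days=2))
stale = ArticleStub("Old piece", datetime.now() - timedelta(days=30))
undated = ArticleStub("No date")

print(fresh.is_recent, stale.is_recent, undated.is_recent)  # True False False
```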
## Large-scale Scraping With Batch Operations

Up to this point, we have been focusing on scraping pages one URL at a time. In reality, you will often work with many, perhaps thousands of, URLs that need to be scraped in parallel. This is where batch operations become essential for efficient web scraping at scale. Batch operations allow you to process multiple URLs simultaneously, significantly reducing the overall time needed to collect data from multiple web pages.

### Batch Scraping with `batch_scrape_urls`

The `batch_scrape_urls` method lets you scrape multiple URLs at once.

Let's scrape all the news article links we obtained from our previous schema extraction example:

```python
articles = structured_data['extract']['news_articles']
article_links = [article['url'] for article in articles]

class ArticleSummary(BaseModel):
    title: str = Field(description="The title of the news article")
    summary: str = Field(description="A short summary of the news article")

batch_data = app.batch_scrape_urls(article_links, params={
    "formats": ["extract"],
    "extract": {
        "schema": ArticleSummary.model_json_schema(),
        "prompt": "Extract the title of the news article and generate its brief summary",
    }
})
```

Here is what is happening in the code block above:

- We extract the list of news articles from our previous structured data result
- We create a list of article URLs by mapping over the articles and getting their `url` field
- We define an `ArticleSummary` model with title and summary fields to structure our output
- We use `batch_scrape_urls()` to process all article URLs in parallel, configuring it to:
  - Extract data in structured format
  - Use our `ArticleSummary` schema
  - Generate titles and summaries based on the article content

The response from `batch_scrape_urls()` is a bit different:

```python
batch_data.keys()
```

```python
dict_keys(['success', 'status', 'completed', 'total', 'creditsUsed', 'expiresAt', 'data'])
```

It contains the following fields:

- `success`: Boolean indicating if the batch request succeeded
- `status`: Current status of the batch job
- `completed`: Number of URLs processed so far
- `total`: Total number of URLs in the batch
- `creditsUsed`: Number of API credits consumed
- `expiresAt`: When the results will expire
- `data`: The extracted data for each URL

Let's focus on the `data` key, where the actual content is stored:

```python
len(batch_data['data'])
```

```out
19
```

The batch processing completed successfully with 19 articles. Let's examine the structure of the first article:

```python
batch_data['data'][0].keys()
```

```out
dict_keys(['extract', 'metadata'])
```

The response format here matches what we get from individual `scrape_url` calls.

```python
print(batch_data['data'][0]['extract'])
```

```out
{'title': 'Ukrainian Forces Face Increasing Challenges Amidst Harsh Winter Conditions', 'summary': 'As the war in Ukraine enters its fourth winter, conditions are worsening for Ukrainian soldiers who find themselves trapped on the battlefield, surrounded by Russian forces. Military commanders express concerns over dwindling supplies and increasingly tough situations. The U.S. has recently allowed Ukraine to use American weapons for deeper strikes into Russia, marking a significant development in the ongoing conflict.'}
```

The scraping was performed according to our specifications: each article's metadata was preserved, its title extracted, and a brief summary generated.

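Since we installed Pandas at the start, turning a batch result into a table takes only a few lines. A sketch over made-up records shaped like the items in `batch_data['data']` (the URLs and titles are illustrative):

```python
import pandas as pd

# Made-up records shaped like the items in batch_data['data']
batch_items = [
    {"extract": {"title": "Article A", "summary": "Summary A"},
     "metadata": {"sourceURL": "https://example.com/a", "statusCode": 200}},
    {"extract": {"title": "Article B", "summary": "Summary B"},
     "metadata": {"sourceURL": "https://example.com/b", "statusCode": 200}},
]

df = pd.DataFrame(
    [
        {
            "title": item["extract"]["title"],
            "summary": item["extract"]["summary"],
            "url": item["metadata"]["sourceURL"],
        }
        for item in batch_items
    ]
)

print(df)
```

From here, the usual Pandas tooling (deduplication, filtering, CSV export) applies directly to the scraped results.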
### Asynchronous batch scraping with `async_batch_scrape_urls`

Scraping the 19 NY Times articles in a batch took about 10 seconds on my machine. While that's not much, in practice we cannot wait around as Firecrawl batch-scrapes thousands of URLs. For these larger workloads, Firecrawl provides an asynchronous batch scraping API that lets you submit jobs and check their status later, rather than blocking until completion. This is especially useful when integrating web scraping into automated workflows or processing large URL lists.

This feature is available through the `async_batch_scrape_urls` method, and it works a bit differently:

```python
batch_scrape_job = app.async_batch_scrape_urls(
    article_links,
    params={
        "formats": ["extract"],
        "extract": {
            "schema": ArticleSummary.model_json_schema(),
            "prompt": "Extract the title of the news article and generate its brief summary",
        },
    },
)
```

When using `async_batch_scrape_urls` instead of the synchronous version, the response comes back immediately rather than waiting for all URLs to be scraped. This allows the program to continue executing while the scraping happens in the background.

```python
batch_scrape_job
```

```python
{'success': True,
 'id': '77a94b62-c676-4db2-b61b-4681e99f4704',
 'url': 'https://api.firecrawl.dev/v1/batch/scrape/77a94b62-c676-4db2-b61b-4681e99f4704'}
```

The response contains the ID of the background task that was started to process the URLs under the hood.

You can use this ID later to check the job's status with the `check_batch_scrape_status` method:

```python
batch_scrape_job_status = app.check_batch_scrape_status(batch_scrape_job['id'])

batch_scrape_job_status.keys()
```

```python
dict_keys(['success', 'status', 'total', 'completed', 'creditsUsed', 'expiresAt', 'data', 'error', 'next'])
```

If the job has finished scraping all URLs, its `status` will be set to `completed`:

```python
batch_scrape_job_status['status']
```

```out
'completed'
```

Let's look at how many pages were scraped:

```python
batch_scrape_job_status['total']
```

```out
19
```

The response always includes the `data` field, whether the job is complete or not, with the content scraped up to that point. It also has `error` and `next` fields that indicate whether any errors occurred during scraping and whether there are more results to fetch.

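When a job is still running, you would poll its status rather than check it once. A small sketch of that pattern - here `check_fn` stands in for `app.check_batch_scrape_status`, replaced by a fake checker so the loop can run without a live job:

```python
import time

def wait_for_batch(check_fn, job_id, poll_seconds=1, max_polls=60):
    """Poll a batch job until it completes, then return the final status dict."""
    for _ in range(max_polls):
        status = check_fn(job_id)
        if status["status"] == "completed":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not complete in time")

# Fake checker that 'completes' on the third poll, for demonstration only
calls = {"n": 0}
def fake_check(job_id):
    calls["n"] += 1
    state = "completed" if calls["n"] >= 3 else "scraping"
    return {"status": state, "completed": calls["n"], "total": 3}

result = wait_for_batch(fake_check, "demo-job", poll_seconds=0)
print(result["status"])  # completed
```

In production you would pass `app.check_batch_scrape_status` as `check_fn` and pick a poll interval that suits your workload.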
## How to Scrape Dynamic JavaScript Websites

Out in the wild, many websites you encounter will be dynamic, meaning their content is generated on-the-fly using JavaScript rather than being pre-rendered on the server. These sites often require user interaction like clicking buttons or typing into forms before displaying their full content. Traditional web scrapers that only look at the initial HTML fail to capture this dynamic content, which is why browser automation capabilities are essential for comprehensive web scraping.

Firecrawl supports dynamic scraping by default. In the parameters of `scrape_url` or `batch_scrape_urls`, you can define the actions necessary to reach the target state of the page you are scraping. As an example, we will build a scraper that extracts the following information from `https://weather.com`:

- Current temperature
- Temperature high
- Temperature low
- Humidity
- Pressure
- Visibility
- Wind speed
- Dew point
- UV index
- Moon phase

These details are displayed for every city you search for through the website:

![Weather.com city page](weather-page)

Unlike websites such as Amazon, where you can simply modify the URL's search parameter (e.g. `?search=your-query`), weather.com presents a unique challenge. The site generates dynamic and unique IDs for each city, making traditional URL manipulation techniques ineffective. To scrape weather data for any given city, you must simulate the actual user journey: visiting the homepage, interacting with the search bar, entering the city name, and selecting the appropriate result from the dropdown list. This multi-step interaction process is necessary because of how weather.com structures its dynamic content delivery (at this point, I urge you to visit the website and browse a few city pages).

Fortunately, Firecrawl natively supports such interactions through the `actions` parameter. It accepts a list of dictionaries, where each dictionary represents one of the following interactions:

- Waiting for the page to load
- Clicking on an element
- Writing text in input fields
- Scrolling up/down
- Taking a screenshot of the current state
- Scraping the current state of the webpage

Let's define the actions we need for weather.com:

```python
actions = [
    {"type": "wait", "milliseconds": 3000},
    {"type": "click", "selector": 'input[id="LocationSearch_input"]'},
    {"type": "write", "text": "London"},
    {"type": "screenshot"},
    {"type": "wait", "milliseconds": 1000},
    {"type": "click", "selector": "button[data-testid='ctaButton']"},
    {"type": "wait", "milliseconds": 3000},
]
```

Let's examine how we choose the selectors, as this is the most technical aspect of the actions. Using browser developer tools, we inspect the webpage elements to find the appropriate selectors. For the search input field, we locate an element with the ID `LocationSearch_input`. After entering a city name, we capture a screenshot for debugging to verify the text input was successful, then include a one-second wait to allow the dropdown search results to appear.

The final step involves clicking the first matching result, which is identified by a button element with the `data-testid` attribute `ctaButton`. Note that if you're implementing this in the future, these specific attribute names may have changed - you'll need to use browser developer tools to find the current correct selectors.

Now, let's define a Pydantic schema to guide the LLM:

```python
from pydantic import BaseModel, Field


class WeatherData(BaseModel):
    location: str = Field(description="The name of the city")
    temperature: str = Field(description="The current temperature in degrees Fahrenheit")
    temperature_high: str = Field(description="The high temperature for the day in degrees Fahrenheit")
    temperature_low: str = Field(description="The low temperature for the day in degrees Fahrenheit")
    humidity: str = Field(description="The current humidity as a percentage")
    pressure: str = Field(description="The current air pressure in inches of mercury")
    visibility: str = Field(description="The current visibility in miles")
    wind_speed: str = Field(description="The current wind speed in miles per hour")
    dew_point: str = Field(description="The current dew point in degrees Fahrenheit")
    uv_index: str = Field(description="The current UV index")
    moon_phase: str = Field(description="The current moon phase")
```
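As an aside on how this schema guides the LLM: `model_json_schema()` converts the Pydantic class into a plain JSON Schema dictionary, and the field descriptions travel with it, so writing precise descriptions directly improves extraction quality. Here is a minimal sketch with a trimmed two-field model (our own illustration, not part of Firecrawl):

```python
from pydantic import BaseModel, Field


class MiniWeather(BaseModel):
    location: str = Field(description="The name of the city")
    temperature: str = Field(description="The current temperature in degrees Fahrenheit")


# This is the kind of dictionary Firecrawl receives via the "schema" parameter
schema = MiniWeather.model_json_schema()
print(schema["required"])  # both fields are required since they have no defaults
print(schema["properties"]["temperature"]["description"])
```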
Finally, let's pass these objects to `scrape_url`:
```python
url = "https://weather.com"

data = app.scrape_url(
    url,
    params={
        "formats": ["screenshot", "markdown", "extract"],
        "actions": actions,
        "extract": {
            "schema": WeatherData.model_json_schema(),
            "prompt": "Extract the following weather data from the weather.com page: temperature, temperature high, temperature low, humidity, pressure, visibility, wind speed, dew point, UV index, and moon phase",
        },
    },
)
```
Scraping happens only after all the actions have been performed. Let's see whether the run was successful by looking at the `extract` key:
```python
data['extract']
```
```python
{'location': 'London, England, United Kingdom',
 'temperature': '33°',
 'temperature_high': '39°',
 'temperature_low': '33°',
 'humidity': '79%',
 'pressure': '29.52in',
 'visibility': '10 mi',
 'wind_speed': '5 mph',
 'dew_point': '28°',
 'uv_index': '0 of 11',
 'moon_phase': 'Waning Gibbous'}
```
All details are accounted for! For illustration, let's take a closer look at the response structure when browser actions are used:
```python
data.keys()
```
```python
dict_keys(['markdown', 'screenshot', 'actions', 'metadata', 'extract'])
```
The response has a new `actions` key:
```python
data['actions']
```
```python
{'screenshots': ['https://service.firecrawl.dev/storage/v1/object/public/media/screenshot-16bf71d8-dcb5-47eb-9af4-5fa84195b91d.png'],
 'scrapes': []}
```
The `actions` array contained a single screenshot-generating action, which is reflected in the output above.

Let's look at the screenshot:
```python
from IPython.display import Image

Image(data['actions']['screenshots'][0])
```

|
||||
|
||||
The image shows the stage where the scraper just typed the search query.
|
||||
|
||||
Now, we have to convert this whole process into a function that works for any given city:
|
||||
|
||||
```python
from typing import Optional

from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field


class WeatherData(BaseModel):
    location: str = Field(description="The name of the city")
    temperature: str = Field(
        description="The current temperature in degrees Fahrenheit"
    )
    temperature_high: str = Field(
        description="The high temperature for the day in degrees Fahrenheit"
    )
    temperature_low: str = Field(
        description="The low temperature for the day in degrees Fahrenheit"
    )
    humidity: str = Field(description="The current humidity as a percentage")
    pressure: str = Field(description="The current air pressure in inches of mercury")
    visibility: str = Field(description="The current visibility in miles")
    wind_speed: str = Field(description="The current wind speed in miles per hour")
    dew_point: str = Field(description="The current dew point in degrees Fahrenheit")
    uv_index: str = Field(description="The current UV index")
    moon_phase: str = Field(description="The current moon phase")


def scrape_weather_data(app: FirecrawlApp, city: str) -> Optional[WeatherData]:
    try:
        # Define the actions to search for the city
        actions = [
            {"type": "wait", "milliseconds": 3000},
            {"type": "click", "selector": 'input[id="LocationSearch_input"]'},
            {"type": "write", "text": city},
            {"type": "wait", "milliseconds": 1000},
            {"type": "click", "selector": "button[data-testid='ctaButton']"},
            {"type": "wait", "milliseconds": 3000},
        ]

        # Perform the scraping
        data = app.scrape_url(
            "https://weather.com",
            params={
                "formats": ["extract"],
                "actions": actions,
                "extract": {
                    "schema": WeatherData.model_json_schema(),
                    "prompt": "Extract the following weather data from the weather.com page: temperature, temperature high, temperature low, humidity, pressure, visibility, wind speed, dew point, UV index, and moon phase",
                },
            },
        )

        # Return the extracted weather data
        return WeatherData(**data["extract"])

    except Exception as e:
        print(f"Error scraping weather data for {city}: {str(e)}")
        return None
```
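Since a slow page load or a misfired click surfaces as a `None` return from this function, it can be worth retrying transient failures. A small backoff helper, sketched below (the `with_retries` name and its parameters are our own, not part of Firecrawl), smooths this over:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], Optional[T]], max_attempts: int = 3, base_delay: float = 2.0
) -> Optional[T]:
    """Call fn until it returns a non-None result, backing off between attempts."""
    for attempt in range(max_attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2**attempt)  # exponential backoff: 2s, 4s, ...
    return None


# Usage with the scraper defined above:
# weather = with_retries(lambda: scrape_weather_data(app, "Paris"))
```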
The code is the same as before, just wrapped inside a function. Let's test it on a few cities:
```python
cities = ["Tashkent", "New York", "Tokyo", "Paris", "Istanbul"]
data_full = []

for city in cities:
    weather_data = scrape_weather_data(app, city)
    if weather_data is not None:  # skip cities that failed to scrape
        data_full.append(weather_data)
```
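Each city takes several seconds because of the built-in waits, so the sequential loop above scales poorly. One way to speed it up, sketched below with a thread pool (the `scrape_many` helper is our own naming; mind your Firecrawl plan's rate limits before raising `max_workers`):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def scrape_many(
    scrape_fn: Callable[[str], Optional[object]],
    cities: List[str],
    max_workers: int = 3,
) -> List[object]:
    """Run scrape_fn over cities concurrently, dropping failed (None) results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(scrape_fn, cities)  # preserves input order
    return [r for r in results if r is not None]


# Usage with the scraper defined above:
# data_full = scrape_many(lambda city: scrape_weather_data(app, city), cities)
```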
We can convert the data for all cities into a DataFrame now:
```python
import pandas as pd

# Convert the list of WeatherData objects into dictionaries
data_dicts = [city.model_dump() for city in data_full]

# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(data_dicts)

print(df.head())
```
| location | temperature | temperature_high | temperature_low | humidity | pressure | visibility | wind_speed | dew_point | uv_index | moon_phase |
|----------|-------------|------------------|-----------------|----------|----------|------------|------------|-----------|----------|------------|
| Tashkent, Uzbekistan | 48 | 54 | 41 | 81 | 30.30 | 2.5 | 2 | 43 | 0 | Waning Gibbous |
| New York City, NY | 48° | 49° | 39° | 93% | 29.45 in | 4 mi | 10 mph | 46° | 0 of 11 | Waning Gibbous |
| Tokyo, Tokyo Prefecture, Japan | 47° | 61° | 48° | 95% | 29.94 in | 10 mi | 1 mph | 45° | 0 of 11 | Waning Gibbous |
| Paris, France | 34° | 36° | 30° | 93% | 29.42 in | 2.4 mi | 11 mph | 33° | 0 of 11 | Waning Gibbous |
| Istanbul, Türkiye | 47° | 67° | 44° | 79% | 29.98 in | 8 mi | 4 mph | 41° | 0 of 11 | Waning Gibbous |

We have successfully scraped weather data from multiple cities using Firecrawl and organized it into a structured DataFrame. This demonstrates how we can efficiently collect data from dynamic, JavaScript-driven websites and organize it for further analysis and monitoring.
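Notice that the scraped values are display strings with mixed units (`'48°'`, `'93%'`, `'29.45 in'`), which blocks numeric analysis. A possible cleanup step, sketched below on a couple of hypothetical rows mirroring the table above, strips the units with a regex and converts the columns to floats:

```python
import pandas as pd


def to_numeric(series: pd.Series) -> pd.Series:
    """Pull the leading number out of strings like '48°', '93%', or '29.45 in'."""
    extracted = series.astype(str).str.extract(r"(-?\d+\.?\d*)")[0]
    return pd.to_numeric(extracted, errors="coerce")


# Hypothetical rows standing in for the scraped DataFrame
sample_df = pd.DataFrame(
    {
        "location": ["New York City, NY", "Paris, France"],
        "temperature": ["48°", "34°"],
        "humidity": ["93%", "93%"],
        "pressure": ["29.45 in", "29.42 in"],
    }
)

for col in ["temperature", "humidity", "pressure"]:
    sample_df[col] = to_numeric(sample_df[col])

print(sample_df.dtypes)  # the unit columns are now float64
```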
## Conclusion
In this comprehensive guide, we've explored Firecrawl's `/scrape` endpoint and its powerful capabilities for modern web scraping. We covered:
- Basic scraping setup and configuration options
- Multiple output formats including HTML, markdown, and screenshots
- Structured data extraction using both natural language prompts and Pydantic schemas
- Batch operations for processing multiple URLs efficiently
- Advanced techniques for scraping JavaScript-heavy dynamic websites

Through practical examples like extracting news articles from the NY Times and weather data from weather.com, we've demonstrated how Firecrawl simplifies complex scraping tasks while providing flexible output formats suitable for data engineering and AI/ML pipelines.

The combination of LLM-powered extraction, structured schemas, and browser automation capabilities makes Firecrawl a versatile tool for gathering high-quality web data at scale, whether you're building training datasets, monitoring websites, or conducting research.
To discover more of what Firecrawl has to offer, refer to [our guide on the `/crawl` endpoint](https://www.firecrawl.dev/blog/mastering-the-crawl-endpoint-in-firecrawl), which scrapes websites in their entirety with a single command while using the `/scrape` endpoint under the hood.

For more hands-on use cases of Firecrawl, these posts may interest you as well:
- [Using Prompt Caching With Anthropic](https://www.firecrawl.dev/blog/using-prompt-caching-with-anthropic)
- [Scraping Job Boards With Firecrawl and OpenAI](https://www.firecrawl.dev/blog/scrape-job-boards-firecrawl-openai)
- [Scraping and Analyzing Airbnb Listings in Python Tutorial](https://www.firecrawl.dev/blog/scrape-analyze-airbnb-data-with-e2b)