We crawl all accessible subpages and give you clean markdown for each. No sitemap required.
```json
[
  {
    "url": "https://www.firecrawl.dev/",
    "markdown": "## Welcome to Firecrawl\nFirecrawl is a web scraper that allows you to extract the content of a webpage."
  },
  {
    "url": "https://www.firecrawl.dev/features",
    "markdown": "## Features\nDiscover how Firecrawl's cutting-edge features can transform your data operations."
  },
  {
    "url": "https://www.firecrawl.dev/pricing",
    "markdown": "## Pricing Plans\nChoose the perfect plan that fits your needs."
  },
  {
    "url": "https://www.firecrawl.dev/about",
    "markdown": "## About Us\nLearn more about Firecrawl's mission and the team behind our innovative platform."
  }
]
```
Note: The markdown has been edited for display purposes.
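Output like the above can be produced with a single crawl request. Here is a minimal sketch using Python's `requests` against the v1 `/crawl` endpoint; the `fc-YOUR_API_KEY` placeholder, the `limit` of 10, and the exact response shape are illustrative assumptions rather than values taken from this page:

```python
import requests

# fc-YOUR_API_KEY is a placeholder -- substitute your own key.
headers = {
    "Authorization": "Bearer fc-YOUR_API_KEY",
    "Content-Type": "application/json",
}

# Kick off an asynchronous crawl job (no sitemap needed).
job = requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    json={"url": "https://www.firecrawl.dev", "limit": 10},
    headers=headers,
).json()

# Fetch the job result; once completed, each entry in "data" carries
# the markdown for one subpage. (A real script would poll until
# result["status"] == "completed".)
result = requests.get(
    f"https://api.firecrawl.dev/v1/crawl/{job['id']}", headers=headers
).json()
for page in result.get("data", []):
    print(page["metadata"]["sourceURL"], page["markdown"][:80])
```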
Trusted by Top Companies
Integrate today
Enhance your applications with top-tier web scraping and crawling capabilities.
Firecrawl turns entire websites into clean, LLM-ready markdown or structured data. Scrape, crawl and extract the web with a single API. Ideal for AI companies looking to empower their LLM applications with web data.
What sites work?
Firecrawl is best suited for business websites, docs, and help centers. We currently don't support social media platforms.
Who can benefit from using Firecrawl?
Firecrawl is tailored for LLM engineers, data scientists, AI researchers, and developers looking to harness web data for training machine learning models, market research, content aggregation, and more. It simplifies the data preparation process, allowing professionals to focus on insights and model development.
Is Firecrawl open-source?
Yes, it is. You can check out the repository on GitHub. Keep in mind that the repository is still in its early stages of development; we are in the process of merging custom modules into this monorepo.
Scraping & Crawling
How does Firecrawl handle dynamic content on websites?
Unlike traditional web scrapers, Firecrawl is equipped to handle dynamic content rendered with JavaScript. It ensures comprehensive data collection from all accessible subpages, making it a reliable tool for scraping websites that rely heavily on JS for content delivery.
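As one illustration, the scrape endpoint accepts an `actions` list, so a page that renders content with JavaScript can be given time to settle before capture. A minimal sketch, assuming the v1 `/scrape` endpoint and a placeholder API key:

```python
import requests

# Wait 2 seconds for client-side rendering before the page is captured.
# fc-YOUR_API_KEY is a placeholder, not a real credential.
payload = {
    "url": "https://www.firecrawl.dev",
    "formats": ["markdown"],
    "actions": [{"type": "wait", "milliseconds": 2000}],
}
headers = {
    "Authorization": "Bearer fc-YOUR_API_KEY",
    "Content-Type": "application/json",
}
response = requests.post("https://api.firecrawl.dev/v1/scrape", json=payload, headers=headers)
print(response.json()["data"]["markdown"][:200])
```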
Why is it not crawling all the pages?
There are a few reasons why Firecrawl may not be able to crawl all the pages of a website. Common causes include rate limiting and anti-scraping mechanisms that block the crawler from accessing certain pages. If you're experiencing issues with the crawler, please reach out to our support team at help@firecrawl.com.
Can Firecrawl crawl websites without a sitemap?
Yes, Firecrawl can access and crawl all accessible subpages of a website, even in the absence of a sitemap. This feature enables users to gather data from a wide array of web sources with minimal setup.
What formats can Firecrawl convert web data into?
Firecrawl specializes in converting web data into clean, well-formatted markdown. This format is particularly suited for LLM applications, offering a structured yet flexible way to represent web content.
How does Firecrawl ensure the cleanliness of the data?
Firecrawl employs advanced algorithms to clean and structure the scraped data, removing unnecessary elements and formatting the content into readable markdown. This process ensures that the data is ready for use in LLM applications without further preprocessing.
Is Firecrawl suitable for large-scale data scraping projects?
Absolutely. Firecrawl offers various pricing plans, including a Scale plan that supports scraping of millions of pages. With features like caching and scheduled syncs, it's designed to efficiently handle large-scale data scraping and continuous updates, making it ideal for enterprises and large projects.
Does it respect robots.txt?
Yes, the Firecrawl crawler respects the rules set in a website's robots.txt file. If you notice any issues with the way Firecrawl interacts with your website, you can adjust the robots.txt file to control the crawler's behavior. Firecrawl's user agent name is 'FirecrawlAgent'. If you notice any unexpected behavior, please let us know at help@firecrawl.com.
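Since the crawler identifies itself as 'FirecrawlAgent', site owners can target it directly in robots.txt. An illustrative entry (the path shown is hypothetical):

```
User-agent: FirecrawlAgent
Disallow: /private/
Allow: /
```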
What measures does Firecrawl take to handle web scraping challenges like rate limits and caching?
Firecrawl is built to navigate common web scraping challenges, including reverse proxies, rate limits, and caching. It smartly manages requests and employs caching techniques to minimize bandwidth usage and avoid triggering anti-scraping mechanisms, ensuring reliable data collection.
Does Firecrawl handle captcha or authentication?
Firecrawl avoids captchas by using stealth proxies. When it does encounter a captcha, it attempts to solve it automatically, but this is not always possible; we are working to add support for more captcha-solving methods. Firecrawl can handle authentication if you provide auth headers to the API.
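For example, a cookie or bearer token for the target site can be forwarded with the scrape request. A minimal sketch, assuming the v1 `/scrape` payload accepts a `headers` object for the outbound request; both tokens and the URL are placeholders:

```python
import requests

# Forward an auth header to the target site so protected pages can be scraped.
payload = {
    "url": "https://example.com/account",
    "formats": ["markdown"],
    "headers": {"Authorization": "Bearer TARGET_SITE_TOKEN"},  # placeholder token
}
api_headers = {
    "Authorization": "Bearer fc-YOUR_API_KEY",  # placeholder Firecrawl key
    "Content-Type": "application/json",
}
response = requests.post("https://api.firecrawl.dev/v1/scrape", json=payload, headers=api_headers)
print(response.status_code)
```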
API Related
Where can I find my API key?
When logged in, click the Dashboard button in the top navigation menu; you will find your API key on the main screen, under API Keys.
Billing
Is Firecrawl free?
Firecrawl is free for the first 500 scraped pages (500 free credits). After that, you can upgrade to our Standard or Scale plans for more credits.
Is there a pay per use plan instead of monthly?
No, we do not currently offer a pay-per-use plan. Instead, you can upgrade to our Standard or Growth plans for more credits and higher rate limits.
How many credits do scraping, crawling, and extraction cost?
Scraping and crawling each cost 1 credit per page. For example, crawling a 500-page site consumes 500 credits, exactly the free allotment.
Do you charge for failed requests (scrape, crawl, extract)?
We do not charge for any failed requests (scrape, crawl, extract). Please contact support at caleb@firecrawl.com if you have any questions.
What payment methods do you accept?
We accept payments through Stripe, which supports most major credit cards, debit cards, and PayPal.
'actions': {'screenshots': []}, 'metadata': {'title': 'Home - Firecrawl', 'description': 'Firecrawl crawls and converts any website into clean markdown.', 'language': 'en', 'keywords': 'Firecrawl,Markdown,Data,Mendable,Langchain', 'robots': 'follow, index', 'ogTitle': 'Firecrawl', 'ogDescription': 'Turn any website into LLM-ready data.', 'ogUrl': 'https://www.firecrawl.dev/', 'ogImage': 'https://www.firecrawl.dev/og.png?123', 'ogLocaleAlternate': [], 'ogSiteName': 'Firecrawl', 'sourceURL': 'https://www.firecrawl.dev', 'statusCode': 200}
+ ]
+ }
+ ],
+ "source": [
+ "import requests\n",
+ "\n",
+ "payload = {\n",
+ " \"url\": \"https://www.firecrawl.dev\",\n",
+ " \"formats\": [\"html\"],\n",
+ " \"actions\": [{'type': 'click', 'selector': 'a[href=\"https://calendly.com/d/cj83-ngq-knk/meet-firecrawl\"]'}]\n",
+ " }\n",
+ "headers = {\n",
+ " \"Authorization\": f\"Bearer fc-fa95acf54c0e496fbe6b403745f246ab\",\n",
+ " \"Content-Type\": \"application/json\"\n",
+ " }\n",
+ "\n",
+ "response = requests.post(\"https://api.firecrawl.dev/v1/scrape\", json=payload, headers=headers)\n",
+ " \n",
+ "\n",
+ "scrape_result = response.json() \n",
+ "print(scrape_result)\n",
+ "\n",
+ " \n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/examples/o1_web_crawler_actions/o1_web_crawler_actions.py b/examples/o1_web_crawler_actions/o1_web_crawler_actions.py
new file mode 100644
index 00000000..e9d56987
--- /dev/null
+++ b/examples/o1_web_crawler_actions/o1_web_crawler_actions.py
@@ -0,0 +1,271 @@
+import os
+import json
+import requests
+from dotenv import load_dotenv
+from openai import OpenAI
+import re
+
+# ANSI color codes
+class Colors:
+ CYAN = '\033[96m'
+ YELLOW = '\033[93m'
+ GREEN = '\033[92m'
+ RED = '\033[91m'
+ MAGENTA = '\033[95m'
+ BLUE = '\033[94m'
+ RESET = '\033[0m'
+
+# Load environment variables
+load_dotenv()
+
+# Retrieve API keys from environment variables
+firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
+openai_api_key = os.getenv("OPENAI_API_KEY")
+
+# Initialize the OpenAI client
+client = OpenAI(api_key=openai_api_key)
+
+# Step 1: Get objective and URL
+def get_objective_and_url():
+ url = input(f"{Colors.BLUE}Enter the website to crawl: {Colors.RESET}")
+ objective = input(f"{Colors.BLUE}Enter your objective: {Colors.RESET}")
+ return objective, url
+
+# Function to get top N pages from a URL using Firecrawl Map API
+def get_top_pages(url, search_term, num_pages=3):
+ try:
+ print(f"{Colors.YELLOW}Mapping website using the Firecrawl Map API...{Colors.RESET}")
+ api_url = "https://api.firecrawl.dev/v1/map"
+ payload = {
+ "url": url,
+ "search": search_term,
+ }
+ headers = {
+ "Authorization": f"Bearer {firecrawl_api_key}",
+ "Content-Type": "application/json"
+ }
+ response = requests.post(api_url, json=payload, headers=headers)
+ if response.status_code == 200:
+ map_result = response.json()
+
+ if map_result.get('success'):
+ links = map_result.get('links', [])
+ top_pages = links[:num_pages]
+ print(f"{Colors.GREEN}Found {len(links)} links. Using top {num_pages} pages.{Colors.RESET}")
+ for i, page in enumerate(top_pages, 1):
+ print(f"{Colors.CYAN}URL {i}: {page}{Colors.RESET}")
+ return top_pages
+ else:
+ print(f"{Colors.RED}Error: Map API request was not successful{Colors.RESET}")
+ return []
+ else:
+ print(f"{Colors.RED}Error: Received status code {response.status_code} from Map API{Colors.RESET}")
+ return []
+ except Exception as e:
+ print(f"{Colors.RED}Error encountered during mapping: {str(e)}{Colors.RESET}")
+ return []
+
+# Step 2: Visit a page and get HTML
+def visit_page_and_get_html(url, actions):
+ try:
+ if actions:
+ print(f"{Colors.YELLOW}Scraping page: {url} with actions:{Colors.RESET}")
+ for action in actions:
+ print(f" - {action}")
+ else:
+ print(f"{Colors.YELLOW}Scraping page: {url}{Colors.RESET}")
+
+ payload = {
+ "url": url,
+ "formats": ["html"],
+ "actions": actions
+ }
+ headers = {
+ "Authorization": f"Bearer {firecrawl_api_key}",
+ "Content-Type": "application/json"
+ }
+
+ response = requests.post("https://api.firecrawl.dev/v1/scrape", json=payload, headers=headers)
+
+ if response.status_code == 200:
+ scrape_result = response.json()
+ html_content = scrape_result["data"]["html"]
+ if len(actions) > 0:
+ print("html_content: ", scrape_result)
+ print(f"{Colors.GREEN}Page scraping completed successfully.{Colors.RESET}")
+
+ return html_content
+ else:
+ print(f"{Colors.RED}Error: Received status code {response.status_code}{Colors.RESET}")
+ return None
+ except Exception as e:
+ print(f"{Colors.RED}Error encountered during page scraping: {str(e)}{Colors.RESET}")
+ return None
+
+# Step 3: Process the page to fulfill the objective or decide next action
+def process_page(html_content, objective):
+ try:
+ process_prompt = f"""
+You are an AI assistant helping to achieve the following objective: '{objective}'.
+Given the HTML content of a web page, determine if the objective is met.
+
+Instructions:
+1. If the objective is met, respond in JSON format as follows:
+{{
+ "status": "Objective met",
+ "data": {{ ... extracted information ... }}
+}}
+
+2. If the objective is not met, analyze the HTML content to decide the best next action to get closer to the objective. Provide the action(s) needed to navigate to the next page or interact with the page. Respond in JSON format as follows:
+{{
+ "status": "Objective not met",
+ "actions": [{{ ... actions to perform ... }}]
+}}
+
+3. The actions should be in the format accepted by the 'actions' parameter of the 'scrape_url' function in Firecrawl. Available actions include:
+ - {{"type": "wait", "milliseconds": }}
+ Example: {{"type": "wait", "milliseconds": 2000}}
+ - {{"type": "click", "selector": ""}}
+ Example: {{"type": "click", "selector": "#load-more-button"}}
+ - {{"type": "write", "text": "", "selector": ""}}
+ Example: {{"type": "write", "text": "Hello, world!", "selector": "#search-input"}}
+ - {{"type": "press", "key": ""}}
+ Example: {{"type": "press", "key": "Enter"}}
+ - {{"type": "scroll", "direction": "", "amount": }}
+ Example: {{"type": "scroll", "direction": "down", "amount": 500}}
+
+4. Do not include any explanations or additional text outside of the JSON response.
+
+HTML Content:
+{html_content[:20000]}
+"""
+
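+ # Note: o1-preview accepts only user/assistant messages (no system
+ # role), so the full instruction set is packed into one user prompt.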
+ completion = client.chat.completions.create(
+ model="o1-preview",
+ messages=[
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "text",
+ "text": process_prompt
+ }
+ ]
+ }
+ ]
+ )
+
+ response = completion.choices[0].message.content.strip()
+
+ # Remove any JSON code blocks from the response
+ response = re.sub(r'```json\s*(.*?)\s*```', r'\1', response, flags=re.DOTALL)
+
+ # Parse the response as JSON
+ try:
+ result = json.loads(response)
+ status = result.get('status')
+ if status == 'Objective met':
+ data = result.get('data')
+ return {'result': data}
+ elif status == 'Objective not met':
+ actions = result.get('actions')
+ return {'actions': actions}
+ else:
+ print(f"{Colors.RED}Unexpected status in response: {status}{Colors.RESET}")
+ return {}
+ except json.JSONDecodeError:
+ print(f"{Colors.RED}Error parsing assistant's response as JSON.{Colors.RESET}")
+ print(f"{Colors.RED}Response was: {response}{Colors.RESET}")
+ return {}
+ except Exception as e:
+ print(f"{Colors.RED}Error encountered during processing of the page: {str(e)}{Colors.RESET}")
+ return {}
+
+# Function to determine search term based on the objective
+def determine_search_term(objective):
+ try:
+ prompt = f"""
+Based on the following objective: '{objective}', provide a 1-2 word search term that would help find relevant pages on the website. Only respond with the search term and nothing else.
+"""
+ completion = client.chat.completions.create(
+ model="o1-preview",
+ messages=[
+ {
+ "role": "user",
+ "content": [
+ {
+ "type": "text",
+ "text": prompt
+ }
+ ]
+ }
+ ]
+ )
+ search_term = completion.choices[0].message.content.strip()
+ print(f"{Colors.GREEN}Determined search term: {search_term}{Colors.RESET}")
+ return search_term
+ except Exception as e:
+ print(f"{Colors.RED}Error determining search term: {str(e)}{Colors.RESET}")
+ return ""
+
+# Main function
+def main():
+ objective, url = get_objective_and_url()
+
+ print(f"{Colors.YELLOW}Initiating web crawling process...{Colors.RESET}")
+
+ # Determine search term based on objective
+ search_term = determine_search_term(objective)
+ if not search_term:
+ print(f"{Colors.RED}Could not determine a search term based on the objective.{Colors.RESET}")
+ return
+
+ # Get the top 3 pages using Firecrawl Map API
+ top_pages = get_top_pages(url, search_term, num_pages=3)
+ if not top_pages:
+ print(f"{Colors.RED}No pages found to process.{Colors.RESET}")
+ return
+
+ for page_url in top_pages:
+ print(f"{Colors.CYAN}Processing page: {page_url}{Colors.RESET}")
+
+ # Step 2: Visit page and get HTML
+ html_content = visit_page_and_get_html(page_url, actions=[])
+ if not html_content:
+ print(f"{Colors.RED}Failed to retrieve content from {page_url}{Colors.RESET}")
+ continue
+
+ # Step 3: Process HTML and objective
+ action_result = process_page(html_content, objective)
+ if action_result.get('result'):
+ print(f"{Colors.GREEN}Objective met. Extracted information:{Colors.RESET}")
+ print(f"{Colors.MAGENTA}{json.dumps(action_result['result'], indent=2)}{Colors.RESET}")
+ return
+ elif action_result.get('actions'):
+ print(f"{Colors.YELLOW}Objective not met yet. Suggested actions:{Colors.RESET}")
+ for action in action_result['actions']:
+ print(f"{Colors.MAGENTA}- {action}{Colors.RESET}")
+ actions = action_result['actions']
+ # Visit the page again with the actions
+ html_content = visit_page_and_get_html(page_url, actions)
+ if not html_content:
+ print(f"{Colors.RED}Failed to retrieve content from {page_url} with actions{Colors.RESET}")
+ continue
+ # Process the new HTML
+ action_result = process_page(html_content, objective)
+ if action_result.get('result'):
+ print(f"{Colors.GREEN}Objective met after performing actions. Extracted information:{Colors.RESET}")
+ print(f"{Colors.MAGENTA}{json.dumps(action_result['result'], indent=2)}{Colors.RESET}")
+ return
+ else:
+ print(f"{Colors.RED}Objective still not met after performing actions on {page_url}{Colors.RESET}")
+ continue
+ else:
+ print(f"{Colors.RED}No actions suggested. Unable to proceed with {page_url}.{Colors.RESET}")
+ continue
+
+ # If we reach here, the objective was not met on any of the pages
+ print(f"{Colors.RED}Objective not fulfilled after processing top 3 pages.{Colors.RESET}")
+
+if __name__ == "__main__":
+ main()
diff --git a/examples/o1_web_crawler_actions/requirements.txt b/examples/o1_web_crawler_actions/requirements.txt
new file mode 100644
index 00000000..249f8beb
--- /dev/null
+++ b/examples/o1_web_crawler_actions/requirements.txt
@@ -0,0 +1,3 @@
+firecrawl-py
+python-dotenv
+openai
\ No newline at end of file
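For reference, a plausible way to run the new example: install the dependencies with `pip install -r requirements.txt`, put `FIRECRAWL_API_KEY` and `OPENAI_API_KEY` in a `.env` file (the script loads both via `python-dotenv`), then run `python o1_web_crawler_actions.py` and enter a site URL and objective at the prompts.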