mirror of
https://git.mirrors.martin98.com/https://github.com/mendableai/firecrawl
synced 2025-06-20 19:17:30 +08:00
83 lines
1.9 KiB
Markdown
83 lines
1.9 KiB
Markdown
# GPT-4.1 Web Crawler
|
|
|
|
A smart web crawler powered by GPT-4.1 that intelligently searches websites to find specific information based on user objectives.
|
|
|
|
## Features
|
|
|
|
- Intelligently maps website content using semantic search
|
|
- Ranks website pages by relevance to your objective
|
|
- Extracts structured information using GPT-4.1
|
|
- Returns results in clean JSON format
|
|
|
|
## Prerequisites
|
|
|
|
- Python 3.8+
|
|
- Firecrawl API key
|
|
- OpenAI API key (with access to GPT-4.1 models)
|
|
|
|
## Installation
|
|
|
|
1. Clone this repository:
|
|
|
|
```
|
|
git clone https://github.com/yourusername/gpt-4.1-web-crawler.git
|
|
cd gpt-4.1-web-crawler
|
|
```
|
|
|
|
2. Install the required dependencies:
|
|
|
|
```
|
|
pip install -r requirements.txt
|
|
```
|
|
|
|
3. Set up environment variables:
|
|
```
|
|
cp .env.example .env
|
|
```
|
|
Then edit the `.env` file and add your API keys.
|
|
|
|
## Usage
|
|
|
|
Run the script:
|
|
|
|
```
|
|
python gpt-4.1-web-crawler.py
|
|
```
|
|
|
|
The program will prompt you for:
|
|
|
|
1. The website URL to crawl
|
|
2. Your specific objective (what information you want to find)
|
|
|
|
Example:
|
|
|
|
```
|
|
Enter the website to crawl: https://example.com
|
|
Enter your objective: Find the company's leadership team with their roles and short bios
|
|
```
|
|
|
|
The crawler will then:
|
|
|
|
1. Map the website
|
|
2. Identify the most relevant pages
|
|
3. Scrape and analyze those pages
|
|
4. Return structured information if the objective is met
|
|
|
|
## How It Works
|
|
|
|
1. **Mapping**: The crawler uses Firecrawl to map the website structure and find relevant pages based on search terms derived from your objective.
|
|
|
|
2. **Ranking**: GPT-4.1 analyzes the URLs to determine which pages are most likely to contain the information you're looking for.
|
|
|
|
3. **Extraction**: The top pages are scraped and analyzed to extract the specific information requested in your objective.
|
|
|
|
4. **Results**: If found, the information is returned in a clean, structured JSON format.
|
|
|
|
## License
|
|
|
|
[MIT License](LICENSE)
|
|
|
|
## Contributing
|
|
|
|
Contributions are welcome! Please feel free to submit a Pull Request.
|