# GPT-4.1 Web Crawler
A web crawler powered by GPT-4.1 that intelligently searches a website to find specific information based on a user-defined objective.
## Features
- Intelligently maps website content using semantic search
- Ranks website pages by relevance to your objective
- Extracts structured information using GPT-4.1
- Returns results in clean JSON format
## Prerequisites
- Python 3.8+
- Firecrawl API key
- OpenAI API key (with access to GPT-4.1 models)
## Installation
1. Clone this repository:
```
git clone https://github.com/yourusername/gpt-4.1-web-crawler.git
cd gpt-4.1-web-crawler
```
2. Install the required dependencies:
```
pip install -r requirements.txt
```
3. Set up environment variables:
```
cp .env.example .env
```
Then edit the `.env` file and add your API keys.
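For reference, a minimal `.env` might look like this (the variable names here are assumptions based on the two required services; check `.env.example` for the exact names the script expects):
```
# Assumed variable names; confirm against .env.example
FIRECRAWL_API_KEY=fc-your-firecrawl-key
OPENAI_API_KEY=sk-your-openai-key
```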
## Usage
Run the script:
```
python gpt-4.1-web-crawler.py
```
The program will prompt you for:
1. The website URL to crawl
2. Your specific objective (what information you want to find)
Example:
```
Enter the website to crawl: https://example.com
Enter your objective: Find the company's leadership team with their roles and short bios
```
The crawler will then:
1. Map the website
2. Identify the most relevant pages
3. Scrape and analyze those pages
4. Return structured information if the objective is met (see the example output below)
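For the leadership objective above, a successful run might print something along these lines (field names and values are purely illustrative; the actual shape depends on your objective):
```
{
  "leadership_team": [
    {
      "name": "Jane Doe",
      "role": "Chief Executive Officer",
      "bio": "Co-founded the company in 2012 and previously led product at a large SaaS firm."
    }
  ]
}
```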
## How It Works
1. **Mapping**: The crawler uses Firecrawl to map the website structure and find relevant pages based on search terms derived from your objective.
2. **Ranking**: GPT-4.1 analyzes the URLs to determine which pages are most likely to contain the information you're looking for.
3. **Extraction**: The top pages are scraped and analyzed to extract the specific information requested in your objective.
4. **Results**: If found, the information is returned in a clean, structured JSON format. The sketch below shows how these steps fit together.
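The following is a minimal sketch of that pipeline, assuming the `firecrawl-py` SDK (`map_url`/`scrape_url`) and the OpenAI v1 Python client with a `gpt-4.1` model name. Method names and response shapes vary between SDK versions, so treat this as an illustration of the flow rather than the script's actual code:
```
# Condensed sketch of the map -> rank -> extract pipeline (not the script's
# actual code). Assumes firecrawl-py's map_url/scrape_url and the OpenAI v1
# client; method names and response shapes differ between SDK versions.
import json
import os

from dotenv import load_dotenv
from firecrawl import FirecrawlApp
from openai import OpenAI

load_dotenv()  # pull FIRECRAWL_API_KEY / OPENAI_API_KEY from .env
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def crawl(url: str, objective: str, top_n: int = 3):
    # 1. Mapping: have Firecrawl list the site's URLs, filtered by the objective.
    site_map = app.map_url(url, params={"search": objective})
    links = site_map.get("links", []) if isinstance(site_map, dict) else site_map

    # 2. Ranking: ask the model which URLs most likely satisfy the objective.
    ranking = client.chat.completions.create(
        model="gpt-4.1",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": f"Objective: {objective}\nURLs:\n" + "\n".join(links)
            + f"\nReturn a JSON object with one key 'urls': the {top_n} most relevant URLs.",
        }],
    )
    top_urls = json.loads(ranking.choices[0].message.content)["urls"]

    # 3. Extraction: scrape each candidate page and try to pull out the answer.
    for page_url in top_urls:
        page = app.scrape_url(page_url, params={"formats": ["markdown"]})
        answer = client.chat.completions.create(
            model="gpt-4.1",
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": f"Objective: {objective}\nPage content:\n"
                + (page.get("markdown", "") if isinstance(page, dict) else str(page))
                + '\nIf the page satisfies the objective, return the data as JSON; '
                  'otherwise return {"found": false}.',
            }],
        )
        result = json.loads(answer.choices[0].message.content)
        if result.get("found", True):
            # 4. Results: return data from the first page that meets the objective.
            return result
    return None
```
Both model calls request `json_object` responses so the output can be parsed directly with `json.loads`.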
## License
[MIT License](LICENSE)
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.