https://github.com/unclecode/crawl4ai
Features of Crawl4AI
Crawl4AI offers a powerful set of features that simplify web crawling and data extraction. In summary:
Open source and free: Crawl4AI is completely free; developers can use its features at no cost.
AI-powered parsing: Crawl4AI leverages AI to automatically identify and parse page elements, saving time and effort.
Structured output: Crawl4AI converts extracted data into structured formats such as JSON and Markdown for easier analysis.
Versatility: Crawl4AI supports scrolling, multi-URL crawling, media tag extraction, metadata extraction, and screenshots (demonstrated in the sketch under Step 2 below).
A Step-by-Step Guide to Using Crawl4AI
The following is a quick guide to installing and using it.
Step 1: Installation and Setup
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk
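This installs Crawl4AI straight from GitHub along with the Transformers, PyTorch, and NLTK dependencies its models need. A quick sanity check that the install worked (a minimal sketch; it only confirms the WebCrawler class imports):
python -c "from crawl4ai import WebCrawler; print('crawl4ai ready')"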
Step 2: Data Extraction
Next, we create a Python script that launches the crawler and extracts data from a URL:
from crawl4ai import WebCrawler
# Create an instance of WebCrawler
crawler = WebCrawler()
# Warm up the crawler (load necessary models)
crawler.warmup()
# Run the crawler on a URL
result = crawler.run(url="https://openai.com/api/pricing/")
# Print the extracted content
print(result.markdown)
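The result object carries more than the Markdown conversion; it also exposes the media tags, links, metadata, and optional screenshots promised in the feature list. A minimal sketch of accessing them; the exact CrawlResult attribute names below are an assumption based on this version of the WebCrawler API:
import base64

# Sketch only: the fields used below (metadata, media, links, screenshot)
# are assumptions about this version's CrawlResult and may differ in newer releases.
result = crawler.run(url="https://openai.com/api/pricing/", screenshot=True)

print(result.metadata)           # page metadata (title, description, ...)
print(result.media["images"])    # media tags found on the page
print(result.links["external"])  # external links discovered on the page

# The screenshot comes back base64-encoded; decode it before writing to disk.
with open("pricing.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))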
Step 3: Structuring Data with an LLM
Define an extraction strategy with an LLM (large language model) and convert the extracted data into a structured format:
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
        Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
    ),
    bypass_cache=True,
)
print(result.extracted_content)
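Because the extraction follows the OpenAIModelFee schema, the JSON string in result.extracted_content can be parsed back into the Pydantic model for type-safe downstream use. A short sketch, assuming extracted_content is a JSON array of objects matching the schema:
import json

# Validate each extracted record against the schema defined above.
fees = [OpenAIModelFee(**item) for item in json.loads(result.extracted_content)]
for fee in fees:
    print(f"{fee.model_name}: input {fee.input_fee}, output {fee.output_fee}")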
Step 4: Integration with AI Agents
Integrate Crawl4AI with Praison CrewAI agents for efficient data processing:
pip install praisonai
Create a tool file (tools.py) that wraps Crawl4AI as an agent tool:
# tools.py
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from praisonai_tools import BaseTool
class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input token for the model.")
    output_fee: str = Field(..., description="Fee for output token for the model.")
class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts model fees for input and output tokens from the given pricing page."

    def _run(self, url: str):
        crawler = WebCrawler()
        crawler.warmup()
        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                schema=ModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"llm_model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
            ),
            bypass_cache=True,
        )
        return result.extracted_content
if __name__ == "__main__":
    # Test the ModelFeeTool
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)
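Running the file directly exercises the self-test at the bottom; it should print a JSON array of model fees, provided the OPENAI_API_KEY environment variable is set:
export OPENAI_API_KEY="your-key-here"
python tools.py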
Configure the AI agents to use the Crawl4AI tool for web scraping and data extraction. The YAML below defines a three-agent crew:
framework: crewai
topic: extract model pricing from websites
roles:
  web_scraper:
    backstory: An expert in web scraping with a deep understanding of extracting structured
      data from online sources. https://openai.com/api/pricing/ https://www.anthropic.com/pricing https://cohere.com/pricing
    goal: Gather model pricing data from various websites
    role: Web Scraper
    tasks:
      scrape_model_pricing:
        description: Scrape model pricing information from the provided list of websites.
        expected_output: Raw HTML or JSON containing model pricing data.
    tools:
    - 'ModelFeeTool'
  data_cleaner:
    backstory: Specialist in data cleaning, ensuring that all collected data is accurate
      and properly formatted.
    goal: Clean and organize the scraped pricing data
    role: Data Cleaner
    tasks:
      clean_pricing_data:
        description: Process the raw scraped data to remove any duplicates and inconsistencies,
          and convert it into a structured format.
        expected_output: Cleaned and organized JSON or CSV file with model pricing data.
    tools:
    - ''
  data_analyzer:
    backstory: Data analysis expert focused on deriving actionable insights from structured
      data.
    goal: Analyze the cleaned pricing data to extract insights
    role: Data Analyzer
    tasks:
      analyze_pricing_data:
        description: Analyze the cleaned data to extract trends, patterns, and insights
          on model pricing.
        expected_output: Detailed report summarizing model pricing trends and insights.
    tools:
    - ''
dependencies: []
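With the tool defined in tools.py and the YAML saved alongside it, the crew can be launched from the command line. The filename agents.yaml is an assumption here; to my understanding PraisonAI looks for agents.yaml in the working directory by default, so running plain praisonai should also work:
praisonai agents.yaml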
Example AI Agents
The Praison-AI agents above, for example, perform web scraping, data cleaning, and data analysis on the data Crawl4AI extracts.
Working together, they pull model pricing information from several provider websites and deliver a detailed report summarizing the findings.
Conclusion
Crawl4AI is a powerful tool that enables AI agents to perform web crawling and data extraction more efficiently and accurately.
With just a few lines of code, users can achieve efficient web crawling and data extraction.
Its open-source nature, AI-driven capabilities, and versatility make it a valuable asset for developers building intelligent, data-driven agents.
Feel free to share your thoughts in a comment below on how you plan to use Crawl4AI in your projects.
Author: 行动的大雄