Crawl4AI：免费开源 LLM 友好型异步Web抓取和数据提取工具可用于大型语言模型（LLMs）和AI应用程序

Crawl4AI 是一款开源的 LLM 友好型 Web 爬虫工具，旨在简化异步 Web 爬取和数据提取，专为大型语言模型 (LLM) 和 AI 应用程序设计。它可以作为 Python 包或通过 Docker 安装，提供灵活的使用方式。

Crawl4AI 的主要特点包括支持多 URL 并行爬取、提取所有媒体标签、外部和内部链接、元数据等。它支持自定义钩子、用户代理、页面截图、JavaScript 执行，并能生成结构化的输出，适合各种复杂的爬取场景，工具还具备异步架构和隐私保护功能。

Crawl4AI 功能

开源且免费：Crawl4AI 完全开源，开发人员可以自由使用和修改，无需担心成本问题。
AI 驱动的自动化数据提取：通过 LLM，Crawl4AI 能够智能化地识别和解析网页元素，自动进行数据提取，极大节省开发者的时间与精力。
结构化数据输出：支持将提取到的数据转换为 JSON、Markdown 等结构化格式，方便后续的分析和处理，确保数据能够无缝集成到 AI 模型训练中。
多功能支持/多URL抓取：支持滚动页面、抓取多个 URL、提取媒体标签（如图片、视频、音频）、元数据、外部/内部链接以及屏幕截图等。
高度定制化：支持用户自定义认证、请求头信息、爬取前页面修改、用户代理以及 JavaScript 脚本执行，确保爬虫可以针对不同网页做出灵活调整。
高级提取策略：支持多种提取策略，包括基于主题、正则表达式、句子的分块策略，以及利用 LLM 或余弦聚类的高级提取策略。

Crawl4AI 特点

🆓 完全免费且开源
🚀 性能超快，超越许多付费服务
🤖 LLM 友好的输出格式（JSON、清理的 HTML、markdown）
🌍 支持同时抓取多个 URL
🎨 提取并返回所有媒体标签（图像、音频和视频）
🔗 提取所有外部和内部链接
📚 从页面中提取元数据
🔄 爬取之前用于身份验证、标头和页面修改的自定义钩子
🕵️ 用户代理自定义
🖼️ 截取页面截图
📜 抓取前执行多个自定义 JavaScript
📊 使用 JsonCssExtractionStrategy 生成无需 LLM 的结构化输出
📚 各种分块策略：基于主题、正则表达式、句子等
🧠 高级提取策略：余弦聚类、LLM 等
🎯 CSS 选择器支持精确的数据提取
📝 传递指令/关键字以优化提取
🔒 代理支持，增强隐私和访问
🔄 针对复杂的多页面爬取场景的会话管理
🌐 异步架构，提高性能和可扩展性

Crawl4AI 安装

Crawl4AI提供灵活的安装选项，以适应各种用例。您可以将其安装为Python包或使用Docker。

使用pip

选择最适合您需求的安装选项：

基本安装

对于基本的Web爬网和抓取任务：

pip install crawl4ai

默认情况下，这将安装异步版本的Crawl4AI，使用Playwright进行Web抓取。

注意：当你安装Crawl4AI时，安装脚本应该自动安装和设置Playwright。但是，如果您遇到任何与剧作家相关的错误，您可以使用以下方法之一手动安装它：

通过命令行：

playwright install

如果上面的命令不起作用，请尝试以下更具体的命令：

python -m playwright install chromium

第二种方法在某些情况下被证明是更可靠的。

同步版本安装

如果您需要使用Selenium的同步版本：

pip install crawl4ai[sync]

开发安装

对于计划修改源代码的贡献者：

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .

使用Docker🐳

我们正在创建Docker镜像并将其推送到Docker Hub。这将提供一种在容器化环境中运行Crawl 4AI的简单方法。敬请关注更新！

快速入门

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

高级用法

执行JavaScript和使用CSS选择器

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector="article.tease-card",
            bypass_cache=True
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

使用代理

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

不使用LLM提取结构化数据

JsonCssExtractionStrategy允许使用CSS选择器从网页中精确提取结构化数据。

import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news_teasers():
    schema = {
        "name": "News Teaser Extractor",
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {
                "name": "category",
                "selector": ".unibrow span[data-testid='unibrow-text']",
                "type": "text",
            },
            {
                "name": "headline",
                "selector": ".wide-tease-item__headline",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": ".wide-tease-item__description",
                "type": "text",
            },
            {
                "name": "time",
                "selector": "[data-testid='wide-tease-date']",
                "type": "text",
            },
            {
                "name": "image",
                "type": "nested",
                "selector": "picture.teasePicture img",
                "fields": [
                    {"name": "src", "type": "attribute", "attribute": "src"},
                    {"name": "alt", "type": "attribute", "attribute": "alt"},
                ],
            },
            {
                "name": "link",
                "selector": "a[href]",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        news_teasers = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(news_teasers)} news teasers")
        print(json.dumps(news_teasers[0], indent=2))

if __name__ == "__main__":
    asyncio.run(extract_news_teasers())

使用OpenAI提取结构化数据

import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), 
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),            
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

会话管理和动态内容抓取

Crawl4AI擅长处理复杂的场景，例如抓取通过JavaScript加载的动态内容的多个页面。下面是一个在多个页面上抓取GitHub提交的例子：

import asyncio
import re
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_typescript_commits():
    first_commit = ""
    async def on_execution_started(page):
        nonlocal first_commit 
        try:
            while True:
                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await commit.evaluate('(element) => element.textContent')
                commit = re.sub(r'\s+', '', commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)

        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js=js_next_page if page > 0 else None,
                bypass_cache=True,
                js_only=page > 0
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

if __name__ == "__main__":
    asyncio.run(crawl_typescript_commits())

这个例子展示了Crawl4AI处理异步加载内容的复杂场景的能力。它抓取GitHub提交的多个页面，执行JavaScript来加载新内容，并使用自定义钩子来确保在继续之前加载数据。

展开阅读全文