Web Scraping with LLMs Using ScrapeGraphAI

Published by alex on May 6, 2024

Introduction: The Evolution of Web Scraping

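In data-driven industries, extracting useful insights from online sources is essential. From market analysis to academic research, the need for specific data has driven demand for robust web scraping tools. Traditionally, Python libraries such as BeautifulSoup and Scrapy have been the go-to solutions, requiring users to write code against the structure of each page they want to scrape.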

# BeautifulSoup Example
from bs4 import BeautifulSoup
import requests
url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)

# Scrapy Example
import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']
    def parse(self, response):
        title = response.css('title::text').get()
        print(title)

Introducing ScrapeGraphAI: Simplifying Data Extraction

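ScrapeGraphAI is a Python library that takes a different approach to web scraping. Developed by Marco Vinciguerra, Marco Perini, and Lorenzo Padoan, it uses large language models (LLMs) and direct graph logic to simplify data collection. Unlike its predecessors, ScrapeGraphAI lets users state what data they need in a prompt, abstracting away the mechanics of web scraping.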

%%capture
!apt install chromium-chromedriver
!pip install nest_asyncio
!pip install scrapegraphai
!playwright install
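# If you plan to use the text-to-speech or GPT-4 Vision models, be sure to use
# the correct API key.
OPENAI_API_KEY = "YOUR API KEY"
GOOGLE_API_KEY = "YOUR API KEY"

# Assumption: in a notebook an event loop is already running, so nested loops
# must be allowed (this is what nest_asyncio, installed above, is for).
import nest_asyncio
nest_asyncio.apply()

from scrapegraphai.graphs import SmartScraperGraph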
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)
result = smart_scraper_graph.run()
import json
# Pretty-print the result as indented JSON
print(json.dumps(result, indent=2))
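
Once `smart_scraper_graph.run()` returns, the `result` dictionary can be post-processed like any other Python data. The following is a minimal sketch, assuming the answer comes back as a dictionary with a `projects` list whose items have `title` and `description` keys. That shape is an assumption: the real keys depend on the prompt and on how the LLM structures its answer. `sample_result` and `projects_to_csv` are hypothetical names used only for illustration.

```python
import csv
import io

# Hypothetical placeholder, NOT real library output: the actual keys depend on
# the prompt and on how the LLM chooses to structure its answer.
sample_result = {
    "projects": [
        {"title": "Example A", "description": "First placeholder description"},
        {"title": "Example B", "description": "Second placeholder description"},
    ]
}


def projects_to_csv(result):
    """Flatten an assumed {"projects": [...]} result into CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "description"])
    writer.writeheader()
    for project in result.get("projects", []):
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({
            "title": project.get("title", ""),
            "description": project.get("description", ""),
        })
    return buffer.getvalue()


csv_text = projects_to_csv(sample_result)
print(csv_text)
```

Using `.get()` with defaults keeps the helper tolerant of answers that omit a field, which is worth planning for since LLM output is not guaranteed to follow a fixed schema.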

SpeechGraph

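SpeechGraph is a class representing one of the default scraping pipelines; it generates the answer together with an audio file. It is similar to SmartScraperGraph, with the addition of a TextToSpeechNode.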

from scrapegraphai.graphs import SpeechGraph
# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": OPENAI_API_KEY,
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "website_summary.mp3",
}
# Create the SpeechGraph instance
speech_graph = SpeechGraph(
    prompt="Create a summary of the website",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)
result = speech_graph.run()
answer = result.get("answer", "No answer found")

import json
# Pretty-print the answer as indented JSON
print(json.dumps(answer, indent=2))

from IPython.display import Audio, display
# Play the generated audio file inline in the notebook
wn = Audio("website_summary.mp3", autoplay=True)
display(wn)

GraphBuilder (Experimental)

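GraphBuilder creates a scraping pipeline from scratch based on the user's prompt. It returns a graph containing nodes and edges.

GraphBuilder is an experimental class that helps you create custom graphs from a prompt. It produces a JSON description of the graph's basic elements and lets you visualize it with graphviz. It knows the node types the library provides by default and connects them to help you reach your goal.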

from scrapegraphai.builders import GraphBuilder
# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}
# Example usage of GraphBuilder
graph_builder = GraphBuilder(
    user_prompt="Extract the news and generate a text summary with a voiceover.",
    config=graph_config
)
graph_json = graph_builder.build_graph()
# Convert the resulting JSON to Graphviz format
graphviz_graph = graph_builder.convert_json_to_graphviz(graph_json)
# Save the graph to a file and open it in the default viewer
graphviz_graph.render('ScrapeGraphAI_generated_graph', view=True)

# In a notebook, evaluate each name in its own cell to display the JSON and the rendered graph.
graph_json
graphviz_graph
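
To make the idea of "a JSON with nodes and edges" concrete, here is a minimal sketch that turns such a description into Graphviz DOT text using only the standard library. The JSON layout (`nodes` with `node_name`, `edges` with `from` and `to`) and the helper `graph_json_to_dot` are assumptions made for illustration; the structure that `build_graph()` actually returns may differ, and the library's own `convert_json_to_graphviz` should be preferred in practice.

```python
# Assumed, illustrative layout of a graph description (not verified library output).
sample_graph = {
    "nodes": [
        {"node_name": "FetchNode"},
        {"node_name": "ParseNode"},
        {"node_name": "GenerateAnswerNode"},
        {"node_name": "TextToSpeechNode"},
    ],
    "edges": [
        {"from": "FetchNode", "to": ["ParseNode"]},
        {"from": "ParseNode", "to": ["GenerateAnswerNode"]},
        {"from": "GenerateAnswerNode", "to": ["TextToSpeechNode"]},
    ],
}


def graph_json_to_dot(graph):
    """Render the assumed nodes/edges layout as Graphviz DOT source text."""
    lines = ["digraph ScrapeGraph {"]
    for node in graph["nodes"]:
        lines.append(f'    "{node["node_name"]}";')
    for edge in graph["edges"]:
        # Each edge may fan out to several target nodes.
        for target in edge["to"]:
            lines.append(f'    "{edge["from"]}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


dot_source = graph_json_to_dot(sample_graph)
print(dot_source)
```

The resulting text can be pasted into any DOT viewer, which is a handy way to inspect a generated pipeline when graphviz is not installed locally.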

How ScrapeGraphAI Works: A Closer Look

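ScrapeGraphAI works by interpreting the user's query and navigating the web content to retrieve the requested information. Using an LLM, it can build the scraping pipeline on its own, minimizing user intervention. This approach improves efficiency and lowers the barrier to entry, letting users focus on analyzing data rather than on technical details.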
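
The pipeline idea can be illustrated with a small, self-contained sketch: nodes are functions that read and extend a shared state, and edges decide which node runs next. This is not ScrapeGraphAI's internal code. The `Node` and `Graph` classes, the three node functions, and the keyword filter that stands in for the LLM call are all hypothetical simplifications, and the "fetch" step reads HTML already held in memory instead of downloading a page.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, Optional

State = Dict[str, object]


@dataclass
class Node:
    name: str
    run: Callable[[State], State]


@dataclass
class Graph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: Dict[str, str] = field(default_factory=dict)  # one outgoing edge per node
    entry: Optional[str] = None

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node
        if self.entry is None:  # the first node added is the entry point
            self.entry = node.name

    def add_edge(self, src: str, dst: str) -> None:
        self.edges[src] = dst

    def execute(self, state: State) -> State:
        current = self.entry
        while current is not None:
            state = self.nodes[current].run(state)
            current = self.edges.get(current)  # None ends the run
        return state


class TextExtractor(HTMLParser):
    """Collects the non-empty text chunks found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def fetch(state: State) -> State:
    # Stand-in for a real download: the HTML is already in the state.
    return {**state, "html": state["source"]}


def parse(state: State) -> State:
    extractor = TextExtractor()
    extractor.feed(state["html"])
    return {**state, "chunks": extractor.chunks}


def answer(state: State) -> State:
    # Stand-in for the LLM call: keep chunks that mention a prompt keyword.
    keywords = [word.lower() for word in state["prompt"].split()]
    hits = [c for c in state["chunks"] if any(k in c.lower() for k in keywords)]
    return {**state, "answer": hits}


def build_pipeline() -> Graph:
    graph = Graph()
    for name, fn in (("fetch", fetch), ("parse", parse), ("answer", answer)):
        graph.add_node(Node(name, fn))
    graph.add_edge("fetch", "parse")
    graph.add_edge("parse", "answer")
    return graph


html = "<html><body><h1>Projects</h1><p>Alpha: a demo project</p><p>Contact us</p></body></html>"
final_state = build_pipeline().execute({"source": html, "prompt": "demo"})
print(final_state["answer"])
```

Keeping each step as an independent node is what makes it possible to swap in a different fetcher or append an extra step, such as text-to-speech, without touching the rest of the pipeline.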

Improving Efficiency with ScrapeGraphAI

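ScrapeGraphAI can automate complex scraping tasks while aiming for high accuracy, which makes it useful to professionals across many fields. Whether monitoring competitors or conducting academic research, the tool helps users work with web data efficiently. As the digital landscape continues to evolve, ScrapeGraphAI is positioned as a valuable aid for data-driven decision-making.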

Conclusion

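In a data-centric world, efficient data extraction matters a great deal. ScrapeGraphAI represents a shift in how web scraping is done, offering a user-friendly approach built on LLMs. As businesses and researchers work to stay ahead in competitive environments, tools like ScrapeGraphAI can help them obtain actionable insights and make informed decisions.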

Source: https://medium.com/@amanatulla1606/llm-web-scraping-with-scrapegraphai-a-breakthrough-in-data-extraction-d6596b282b4d
