LlamaIndex: Chunking Strategies for Large Language Models

Published March 26, 2024 by alex

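Now that we have explored the building and querying process in depth, which is an important part of developing RAG applications, you may have some questions. What exactly is RAG? What does retrieval mean here? And how does LlamaIndex address the challenges we discussed earlier?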


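Retrieval-augmented generation (RAG) is a system that augments a large language model (LLM) by adding extra context or information from other sources.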


Retrieval is the process of bringing that extra information or context into the language model.


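LlamaIndex addresses the challenge of scaling language models to large document collections. It relies on two key strategies. First, it chunks documents into smaller contexts, such as sentences or paragraphs, which are called nodes and which a language model can process efficiently. Second, it indexes these nodes with vector embeddings, which enables fast semantic search.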


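By chunking documents and using vector embeddings, LlamaIndex enables scalable semantic search over large datasets (we will cover other related techniques in detail in the next post). It does this by retrieving relevant nodes from the index and using a language model to synthesize a response. In this article we focus only on chunking, that is, node parsing.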
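To make the retrieve-then-synthesize flow concrete, here is a minimal sketch of the retrieval half in plain Python. It is not how LlamaIndex works internally: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `embed`, `cosine`, `retrieve`, and the sample nodes are all hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, nodes: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k nodes most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(nodes, key=lambda n: cosine(query_vec, embed(n)), reverse=True)
    return ranked[:top_k]


# Hypothetical nodes, as if a document had already been chunked.
nodes = [
    "Chunking splits a document into smaller nodes.",
    "Vector embeddings make semantic search over nodes fast.",
    "The language model synthesizes a response from retrieved nodes.",
]
best = retrieve("How are documents split into nodes?", nodes)
print(best)
```

In a real pipeline the retrieved nodes would then be placed into the prompt so that the language model can synthesize an answer from them.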


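Let's start with text splitting, chunking, and node parsing, which are conceptually the same thing. We cannot pass unlimited data to the application, for two main reasons: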


1. Context limit: language models have a limited context window.

2. Signal-to-noise ratio: language models are more effective when the information they are given is relevant to the task.
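Both constraints can be illustrated with a small sketch: given chunks that already carry a relevance score, keep the most relevant ones that still fit a token budget. The word-count tokenizer, the scores, and the names `approx_tokens` and `select_within_budget` are assumptions made for this example; real tokenizers count differently.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words (real tokenizers differ)."""
    return len(text.split())


def select_within_budget(chunks: list[tuple[str, float]], max_tokens: int) -> list[str]:
    """Keep the most relevant chunks that fit into the token budget.

    Each chunk is a (text, relevance_score) pair; the scores are assumed
    to come from some earlier retrieval step.
    """
    selected: list[str] = []
    used = 0
    for text, _score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = approx_tokens(text)
        if used + cost <= max_tokens:
            selected.append(text)
            used += cost
    return selected


# Hypothetical scored chunks; the second one is noise for a refund question.
chunks = [
    ("The refund policy allows returns within 30 days.", 0.92),
    ("Our office dog is named Biscuit.", 0.05),
    ("Refunds are issued to the original payment method.", 0.88),
]
context = select_within_budget(chunks, max_tokens=16)
print(context)
```

The irrelevant chunk is dropped because the budget is already spent on the two high-scoring chunks, which addresses the context limit and the signal-to-noise ratio at the same time.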


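The goal of chunking is not to chunk for its own sake, but to get the data into a format that serves the intended task and can be retrieved for value later. Instead of asking "How should I chunk my data?", ask "What is the best way to pass this data to my language model so that it can do its task?"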


! pip install llama_index


Node Parsers

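A node parser breaks a list of documents down into Node objects. Each node represents a distinct chunk of its parent document and inherits all of the parent document's attributes.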
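The relationship between documents and nodes can be sketched without the library. The `Doc` and `Node` classes and the `parse_nodes` function below are hypothetical stand-ins, not LlamaIndex types; the sketch only illustrates that every node is a chunk of one parent and carries a copy of that parent's metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Node:
    """Hypothetical stand-in for a node: one chunk of a parent document."""
    text: str
    metadata: dict
    parent_index: int  # position of the source document in the input list


def parse_nodes(docs: list[Doc]) -> list[Node]:
    """Split each document on blank lines; every node copies its parent's metadata."""
    nodes = []
    for i, doc in enumerate(docs):
        for part in doc.text.split("\n\n"):
            if part.strip():
                nodes.append(Node(part.strip(), dict(doc.metadata), i))
    return nodes


docs = [Doc("Intro paragraph.\n\nSecond paragraph.", {"filename": "README.md"})]
nodes = parse_nodes(docs)
for node in nodes:
    print(node.metadata["filename"], "->", node.text)
```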


Node Parsers - File-Based

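To simplify node parsing, several file-based parsers are available, each tailored to a content type such as JSON or Markdown. The simplest approach is to combine FlatReader with SimpleFileNodeParser, which automatically picks the appropriate parser for each content type. You can also chain a text-based parser after it to account for the actual length of the text.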
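The idea of picking a parser per content type can be sketched as a simple dispatch on the file extension. This is only an illustration of the pattern under that assumption, not the logic SimpleFileNodeParser actually uses; `parse_markdown`, `parse_plain`, `PARSERS`, and `parse_file` are hypothetical names.

```python
from pathlib import Path


def parse_markdown(text: str) -> list[str]:
    """Hypothetical Markdown parser: one chunk per heading section."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def parse_plain(text: str) -> list[str]:
    """Fallback parser: keep the whole text as a single chunk."""
    return [text]


# Map file extensions to parsers; unknown types fall back to parse_plain.
PARSERS = {".md": parse_markdown}


def parse_file(filename: str, text: str) -> list[str]:
    """Pick a parser from the file extension, then apply it."""
    parser = PARSERS.get(Path(filename).suffix.lower(), parse_plain)
    return parser(text)


sample = "# Title\nIntro\n## Usage\nRun it"
print(parse_file("README.md", sample))
print(parse_file("notes.txt", sample))
```

The same text yields two chunks when treated as Markdown and a single chunk when treated as plain text, which is why matching the parser to the content type matters.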


Node Parser - Simple File


from llama_index.core.node_parser import SimpleFileNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path
md_docs = FlatReader().load_data(Path("/content/README (1).md"))
parser = SimpleFileNodeParser()
md_nodes = parser.get_nodes_from_documents(md_docs)
md_nodes[0]


# output
{
  "id": "1bab03a5-2071-4ea7-ab31-2d6211aac74a",
  "embedding": null,
  "metadata": {
    "Header 1": "Rasa Customer Service Bot",
    "filename": "README (1).md",
    "extension": ".md"
  },
  "excluded_embed_metadata_keys": [],
  "excluded_llm_metadata_keys": [],
  "relationships": {
    "SOURCE": {
      "node_id": "72319e2c-d36b-470d-9487-a59971ee19ca",
      "node_type": "DOCUMENT",
      "metadata": {
        "filename": "README (1).md",
        "extension": ".md"
      },
      "hash": "3d5f2703485b1f4903690fdb8a57085949b74cdfce9ac9729e483d0fec4831cf"
    },
    "NEXT": {
      "node_id": "9edaadf9-425b-417f-b18a-bb1df7f73cfc",
      "node_type": "TEXT",
      "metadata": {
        "Header 1": "Rasa Customer Service Bot",
        "Header 2": "File Structure",
        "filename": "README (1).md",
        "extension": ".md"
      },
      "hash": "584ffef592cb306af05bcfcc93d66aa9fbd72078ccaa10de78032a4de3b490ff"
    }
  },
  "text": "Rasa Customer Service Bot\n\nWelcome to the Rasa Customer Service Bot! This bot is designed to assist users from three different counties: Clay County, Utah, and West Hollywood. It provides customer service functionalities tailored to the needs and inquiries specific to each county.",
  "start_char_idx": 2,
  "end_char_idx": 283,
  "text_template": "{metadata_str}\n\n{content}",
  "metadata_template": "{key}: {value}",
  "metadata_seperator": "\n"
}


Node Parser - HTML

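This node parser uses Beautiful Soup to parse raw HTML. By default it parses a predefined set of HTML tags, but you can customize the selection. The default tags include "p", "h1" through "h6", "li", "b", "i", "u", and "section".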


import requests
from llama_index.core import Document
from llama_index.core.node_parser import HTMLNodeParser
# URL of the website to fetch HTML from
url = "https://www.utoronto.ca/"
# Send a GET request to the URL
response = requests.get(url)
print(response)
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Extract the HTML content from the response
    html_doc = response.text
    
    # Create a Document object with the HTML content
    document = Document(id_=url, text=html_doc)
    
    # Initialize the HTMLNodeParser with optional list of tags
    parser = HTMLNodeParser(tags=["p", "h1"])
    
    # Parse nodes from the HTML document
    nodes = parser.get_nodes_from_documents([document])
    
    # Print the parsed nodes
    print(nodes)
else:
    # Print an error message if the request was unsuccessful
    print("Failed to fetch HTML content:", response.status_code)


# output
<Response [200]>
[TextNode(id_='4316c3a3-8a91-4e5f-aa7f-555237742254', embedding=None, metadata={'tag': 'h1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://www.utoronto.ca/', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='b84e192f8243da374c83690615fad3543fa108588df3cb448376a932209fece1'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='0856479e-14cf-45c2-903c-7774deee317c', node_type=<ObjectType.TEXT: '1'>, metadata={'tag': 'p'}, hash='6c1cbb06e3dc64a06ebd226cd7dd1960a8da0950feaf4e221bb997bc1cf46a26')}, text='Welcome to University of Toronto', start_char_idx=2784, end_char_idx=2816, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), TextNode(id_='0856479e-14cf-45c2-903c-7774deee317c', embedding=None, metadata={'tag': 'p'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://www.utoronto.ca/', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='b84e192f8243da374c83690615fad3543fa108588df3cb448376a932209fece1'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='4316c3a3-8a91-4e5f-aa7f-555237742254', node_type=<ObjectType.TEXT: '1'>, metadata={'tag': 'h1'}, hash='e1e6af749b6a40a4055c80ca6b821ed841f1d20972e878ca1881e508e4446c26')}, text='Five things to look forward to at Entrepreneurship Week 2024\nYour guide to the U of T community\nThe University of Toronto is home to some of the world’s top faculty, students, alumni and staff. 
U of T Celebrates recognizes their award-winning accomplishments.\nDavid Dyzenhaus recognized with Gold Medal from Social Sciences and Humanities Research Council\nOur latest issue is all about feeling good: the only diet you really need to know about, the science behind cold plunges, a uniquely modern way to quit smoking, the “sex, drugs and rock ‘n’ roll” of university classes, how to become a better workplace leader, and more.\nResearch and Ideas\nYou’ve decided you want to eat better. Now what?\nThere are countless diets to choose from, but one rises above the rest, say U of T nutrition experts\n\nStatement of Land Acknowledgement\nWe wish to acknowledge this land on which the University of Toronto operates. For thousands of years it has been the traditional land of the Huron-Wendat, the Seneca, and the Mississaugas of the Credit. Today, this meeting place is still the home to many Indigenous people from across Turtle Island and we are grateful to have the opportunity to work on this land.\nRead about U of T’s Statement of Land Acknowledgement.\nUNIVERSITY OF TORONTO - SINCE 1827', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]
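To see what restricting the parser to a set of tags means, here is a rough sketch that uses only Python's standard `html.parser` module instead of Beautiful Soup. It is not HTMLNodeParser's implementation; the `TagTextExtractor` class and the sample HTML are hypothetical, and the sketch assumes well-formed markup with closing tags.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside a chosen set of HTML tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.open_tags = []   # stack of currently open tags we care about
        self.buffer = []      # text pieces for the outermost open tag
        self.chunks = []      # finished (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.open_tags.append(tag)

    def handle_data(self, data):
        # Only keep text that appears inside one of the selected tags.
        if self.open_tags:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.open_tags and tag == self.open_tags[-1]:
            self.open_tags.pop()
            if not self.open_tags:
                # Normalize whitespace before storing the chunk.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.chunks.append((tag, text))
                self.buffer = []


html = "<html><body><h1>Welcome</h1><div>skip me</div><p>Hello <b>world</b>.</p></body></html>"
extractor = TagTextExtractor(tags=["p", "h1"])
extractor.feed(html)
extractor.close()
print(extractor.chunks)
```

Text inside the `div` is ignored because `div` is not in the selected tags, while the bold text is kept because it sits inside a selected `p`.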


Node Parser - JSON

JSONNodeParser parses raw JSON.


from llama_index.core.node_parser import JSONNodeParser
url = "https://housesigma.com/bkv2/api/search/address_v2/suggest"
payload = {"lang": "en_US", "province": "ON", "search_term": "Mississauga, ontario"}
headers = {
    'Authorization': 'Bearer 20240127frk5hls1ba07nsb8idfdg577qa'
}
response = requests.post(url, headers=headers, data=payload)
if response.status_code == 200:
    # Create a Document object with the JSON response
    document = Document(id_=url, text=response.text)
    
    # Initialize the JSONNodeParser
    parser = JSONNodeParser()
    
    # Parse nodes from the JSON document
    nodes = parser.get_nodes_from_documents([document])
    
    # Print the parsed nodes
    print(nodes)
else:
    print("Failed to fetch JSON content:", response.status_code)


# output
[TextNode(id_='272b5be3-052c-42da-a77f-66a3b6229804', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://housesigma.com/bkv2/api/search/address_v2/suggest', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='d1865a5da7c9cc78321e84704aa8226ec9278cb3db9c9488d2a2e0a0f006ff9a')}, text='status True\ndata house_list id_listing owJKR7PNnP9YXeLP\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.75M\ndata house_list price 749,000\ndata house_list price_sold 690,000\ndata house_list tags Sold\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.71996\ndata house_list location lat 43.58322\ndata house_list addr 31 Ontario Crt\ndata house_list address 31 Ontario Crt\ndata house_list address_raw 31 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list rooms_text 4+1 Bedroom, 4 Bathroom, 2 Garage\ndata house_list date_preview 2015-03-16\ndata house_list ml_count_text Listed 1 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 31-ontario-crt\ndata house_list id_listing_history_v2 510QqypNo263LGlV\ndata house_list photo_url https://cache18.housesigma.com/Live/photos/FULL/1/115/W3129115.jpg?6cc85981\ndata house_list bedroom_string 4+1\ndata house_list washroom 4\ndata house_list garage 2\ndata house_list province_abbr ON\ndata house_list id_listing kbEDRYarbz1y1VaB\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.63M\ndata house_list price 
634,900\ndata house_list price_sold 629,000\ndata house_list tags Sold\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72032\ndata house_list location lat 43.58365\ndata house_list addr 28 Ontario Crt\ndata house_list address 28 Ontario Crt\ndata house_list address_raw 28 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list rooms_text 4+2 Bedroom, 4 Bathroom, 2 Garage\ndata house_list date_preview 2009-08-17\ndata house_list ml_count_text Listed 1 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 28-ontario-crt\ndata house_list id_listing_history_v2 6VLaGyGalLA3W1ZD\ndata house_list photo_url https://cache08.housesigma.com/Live/photos/FULL/1/637/W1641637.jpg?ab92e198\ndata house_list bedroom_string 4+2\ndata house_list washroom 4\ndata house_list garage 2\ndata house_list province_abbr ON\ndata house_list id_listing 0J6Em7brmxL7XBeq\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.45M\ndata house_list price 449,000\ndata house_list price_sold 410,000\ndata house_list tags Sold\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.71896\ndata house_list location lat 43.58448\ndata house_list addr 16 Ontario St W\ndata house_list address 16 Ontario St W\ndata house_list address_raw 16 Ontario St W\ndata house_list id_community 381\ndata house_list community_name 
Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list rooms_text 2+1 Bedroom, 2 Bathroom, 0 Garage\ndata house_list date_preview 2013-04-29\ndata house_list ml_count_text Listed 3 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 16-ontario-st-w\ndata house_list id_listing_history_v2 MB5bO3xqdmpYkWVP\ndata house_list photo_url https://cache19.housesigma.com/Live/photos/FULL/1/534/W2608534.jpg?c46e9414\ndata house_list bedroom_string 2+1\ndata house_list washroom 2\ndata house_list garage 0\ndata house_list province_abbr ON\ndata house_list address (Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing weQp5yOpz1V7d0ZE\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.61M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.71846\ndata house_list location lat 43.58431\ndata house_list addr 11 Ontario St W\ndata house_list address_raw 11 Ontario St W\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 2 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 11-ontario-st-w\ndata 
house_list id_listing_history_v2 ZNkKJ3J1x5Z7d4V6\ndata house_list photo_url https://cache06.housesigma.com/Live/photos/FULL/1/749/W3429749.jpg?e67c455d\ndata house_list province_abbr ON\ndata house_list id_listing xmZRW7ngVM13EBO9\ndata house_list house_type_in_map D\ndata house_list price_abbr 1M\ndata house_list price 999,990\ndata house_list price_sold 945,000\ndata house_list tags Sold\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.71835\ndata house_list location lat 43.58445\ndata house_list addr 9 Ontario St W\ndata house_list address 9 Ontario St W\ndata house_list address_raw 9 Ontario St W\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list rooms_text 4+1 Bedroom, 4 Bathroom, 2 Garage\ndata house_list date_preview 2019-12-02\ndata house_list ml_count_text Listed 9 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 9-ontario-st-w\ndata house_list id_listing_history_v2 ZEXrx30pQXp3OklN\ndata house_list photo_url https://cache05.housesigma.com/Live/photos/FULL/1/760/W4548760.jpg?a2d2c969\ndata house_list bedroom_string 4+1\ndata house_list washroom 4\ndata house_list garage 2\ndata house_list province_abbr ON\ndata house_list address (Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing JKdOYrGzb9Zy54lW\ndata 
house_list house_type_in_map D\ndata house_list price_abbr 1.3M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72076\ndata house_list location lat 43.58335\ndata house_list addr 34 Ontario Crt\ndata house_list address_raw 34 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 3 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 34-ontario-crt\ndata house_list id_listing_history_v2 0J6Em7bLwDjYXBeq\ndata house_list photo_url https://cache17.housesigma.com/Live/photos/FULL/1/078/W5743078.jpg?17bcecbf\ndata house_list province_abbr ON\ndata house_list id_listing a6zqW7dmkmv35eZE\ndata house_list house_type_in_map D\ndata house_list price_abbr 1.4M\ndata house_list price 1,449,000\ndata house_list price_sold None\ndata house_list tags Terminated\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 0\ndata house_list list_status status TER\ndata house_list list_status text Terminated\ndata house_list location lon -79.721\ndata house_list location lat 43.58264\ndata house_list addr 45 Ontario Crt\ndata house_list address 45 Ontario Crt\ndata house_list address_raw 45 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status TER\ndata house_list rooms_text 5 Bedroom, 5 Bathroom, 2 
Garage\ndata house_list date_preview 2015-07-06\ndata house_list ml_count_text Listed 5 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 45-ontario-crt\ndata house_list id_listing_history_v2 knbq6y1d9QPYo9DA\ndata house_list photo_url https://cache06.housesigma.com/Live/photos/FULL/1/263/W3198263.jpg?d6e729f3\ndata house_list bedroom_string 5\ndata house_list washroom 5\ndata house_list garage 2\ndata house_list province_abbr ON\ndata house_list address (Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing kbEDRYa8zpQ31VaB\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.75M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72076\ndata house_list location lat 43.58252\ndata house_list addr 43 Ontario Crt\ndata house_list address_raw 43 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 1 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 43-ontario-crt\ndata house_list id_listing_history_v2 VgAaOyLvZ1N3GxMb\ndata house_list photo_url https://cache-e13.housesigma.com/Live/photos/FULL/1/176/W2359176.jpg?67229221\ndata house_list province_abbr ON\ndata house_list address 
(Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing BXeEn7XJREdYrPo8\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.47M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72122\ndata house_list location lat 43.58302\ndata house_list addr 40 Ontario Crt\ndata house_list address_raw 40 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 3 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 40-ontario-crt\ndata house_list id_listing_history_v2 jJKdOYr5ZVZ354lW\ndata house_list photo_url https://cache17.housesigma.com/Live/photos/FULL/1/174/W1629174.jpg?1fba5908\ndata house_list province_abbr ON\ndata house_list id_listing obqB176q16WyZajD\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.54M\ndata house_list price 539,900\ndata house_list price_sold 522,500\ndata house_list tags Sold\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72021\ndata house_list location lat 43.58301\ndata house_list addr 35 Ontario Crt\ndata 
house_list address 35 Ontario Crt\ndata house_list address_raw 35 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list rooms_text 4+1 Bedroom, 4 Bathroom, 2 Garage\ndata house_list date_preview 2010-10-21\ndata house_list ml_count_text Listed 2 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 35-ontario-crt\ndata house_list id_listing_history_v2 ZNkKJ3JaQrx7d4V6\ndata house_list photo_url https://cache09.housesigma.com/Live/photos/FULL/1/279/W1962279.jpg?5b470131\ndata house_list bedroom_string 4+1\ndata house_list washroom 4\ndata house_list garage 2\ndata house_list province_abbr ON\ndata house_list address (Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing EeVbOYE14XGYx2P0\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.9M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72117\ndata house_list location lat 43.58278\ndata house_list addr 42 Ontario Crt\ndata house_list address_raw 42 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 2 times\ndata 
house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 42-ontario-crt\ndata house_list id_listing_history_v2 JjAXw7Q4MKMyQOzg\ndata house_list photo_url https://cache16.housesigma.com/Live/photos/FULL/1/062/W2731062.jpg?193f3cce\ndata house_list province_abbr ON\ndata house_list id_listing mLzQ1y5dvvjYqdeK\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.6M\ndata house_list price 599,000\ndata house_list price_sold None\ndata house_list tags Terminated\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 0\ndata house_list list_status status TER\ndata house_list list_status text Terminated\ndata house_list location lon -79.71883\ndata house_list location lat 43.58459\ndata house_list addr 12 Ontario St W\ndata house_list address 12 Ontario St W\ndata house_list address_raw 12 Ontario St W\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status TER\ndata house_list rooms_text 2 Bedroom, 3 Bathroom, 0 Garage\ndata house_list date_preview 2015-01-23\ndata house_list ml_count_text Listed 8 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 12-ontario-st-w\ndata house_list id_listing_history_v2 b1DBW7R6K2Q7qlAp\ndata house_list photo_url https://cache-e14.housesigma.com/Live/photos/FULL/1/380/W3101380.jpg?e7dc9b16\ndata house_list bedroom_string 2\ndata house_list washroom 3\ndata house_list garage 0\ndata house_list province_abbr ON\ndata house_list address (Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list 
date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing oK8OgYBLo4z7JmG2\ndata house_list house_type_in_map V\ndata house_list price_abbr 0.35M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.7208\ndata house_list location lat 43.58283\ndata house_list addr 46 Ontario Crt\ndata house_list address_raw 46 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 1 times\ndata house_list type_text Vacant Land\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 46-ontario-crt\ndata house_list id_listing_history_v2 02Zpj39nW9dYDrK8\ndata house_list photo_url https://cache18.housesigma.com/Live/photos/FULL/1/903/W2269903.jpg?5defed43\ndata house_list province_abbr ON\ndata place_list id owJKR7PNnP9YXeLP\ndata place_list text 31 Ontario Crt, Mississauga - Streetsville, ON\ndata place_list province_abbr ON\ndata place_list id_municipality 10205\ndata place_list seo_municipality mississauga-real-estate\ndata place_list lng -79.71996\ndata place_list lat 43.58322\ndata community_list municipality_name Red Rock Ontario\ndata community_list coordinate lon -88.2573624\ndata community_list coordinate lat 48.9421387\ndata community_list id_municipality 73087\ndata community_list community_name Red Rock Ontario\ndata community_list province_abbr ON\ndata community_list id_community 16245\ndata community_list seo_municipality red-rock-ontario-real-estate\ndata community_list municipality_name Thunder Bay, Ontario\ndata community_list 
coordinate lon -89.2625046\ndata community_list coordinate lat 48.3723145\ndata community_list id_municipality 81843\ndata community_list community_name Thunder Bay, Ontario\ndata community_list province_abbr ON\ndata community_list id_community 43959\ndata community_list seo_municipality thunder-bay-ontario-real-estate\ndata community_list municipality_name Longlac, Ontario\ndata community_list coordinate lon -86.5466461\ndata community_list coordinate lat 49.7723846\ndata community_list id_municipality 81674\ndata community_list community_name Longlac, Ontario\ndata community_list province_abbr ON\ndata community_list id_community 42851\ndata community_list seo_municipality longlac-ontario-real-estate\ndata community_list municipality_name Mississauga\ndata community_list coordinate lon -79.7529144\ndata community_list coordinate lat 43.5792542\ndata community_list id_municipality 10205\ndata community_list community_name Mississauga\ndata community_list province_abbr ON\ndata community_list id_community 15057\ndata community_list seo_municipality mississauga-real-estate\ndata community_list municipality_name Mississauga\ndata community_list coordinate lon -79.6569591\ndata community_list coordinate lat 43.53322\ndata community_list id_municipality 10205\ndata community_list community_name Sheridan\ndata community_list province_abbr ON\ndata community_list id_community 385\ndata community_list seo_municipality mississauga-real-estate\ndata community_list municipality_name Mississauga\ndata community_list coordinate lon -79.612725\ndata community_list coordinate lat 43.621956\ndata community_list id_municipality 10205\ndata community_list community_name Rathwood\ndata community_list province_abbr ON\ndata community_list id_community 404\ndata community_list seo_municipality mississauga-real-estate\ndata community_list municipality_name London\ndata community_list coordinate lon -81.24136\ndata community_list coordinate lat 42.9754\ndata community_list 
id_municipality 10176\ndata community_list community_name London Ontario\ndata community_list province_abbr ON\ndata community_list id_community 8866\ndata community_list seo_municipality london-real-estate\ndata community_list municipality_name London\ndata community_list coordinate lon -81.24184\ndata community_list coordinate lat 42.97614\ndata community_list id_municipality 10176\ndata community_list community_name London, Ontario\ndata community_list province_abbr ON\ndata community_list id_community 8868\ndata community_list seo_municipality london-real-estate\ndata community_list municipality_name Mississauga\ndata community_list coordinate lon -79.6213918\ndata community_list coordinate lat 43.5937139\ndata community_list id_municipality 10205\ndata community_list community_name Mississauga Valleys\ndata community_list province_abbr ON\ndata community_list id_community 398\ndata community_list seo_municipality mississauga-real-estate\ndata community_list municipality_name Elgin\ndata community_list coordinate lon -76.1232702\ndata community_list coordinate lat 44.6624265\ndata community_list id_municipality 12786\ndata community_list community_name Harlem Ontario Rideau Lakes\ndata community_list province_abbr ON\ndata community_list id_community 5889\ndata community_list seo_municipality elgin-real-estate\nerror code 0\nerror message \ndebug API v5.34.4\ndebug environment production\ndebug server_group ovh\ndebug server OVH01', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]
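The node text above consists of lines such as `data house_list price 749,000`, where each line is the key path to a leaf followed by its value. A minimal sketch that produces lines of the same shape is shown below. It is an approximation inferred from that output, not the library's code, and the `flatten` function and sample payload are hypothetical.

```python
import json


def flatten(value, path=()):
    """Yield one 'key path value' line per leaf of a nested JSON structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, path + (str(key),))
    elif isinstance(value, list):
        # List items share the parent's key path; indices are not recorded.
        for child in value:
            yield from flatten(child, path)
    else:
        yield " ".join(path + (str(value),))


# Hypothetical payload shaped like a small slice of the response above.
raw = '{"status": true, "data": {"house_list": [{"price": "749,000", "tags": ["Sold"]}]}}'
lines = list(flatten(json.loads(raw)))
print("\n".join(lines))
```

Because list indices are dropped, the flattened text no longer shows where one listing ends and the next begins, which is worth keeping in mind when a single JSON response becomes one large node.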


Node Parser - Markdown

MarkdownNodeParser parses raw Markdown text.


from llama_index.core.node_parser import MarkdownNodeParser
md_docs = FlatReader().load_data(Path("/content/README (1).md"))
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(md_docs)
nodes[0]


# output
TextNode(id_='02165e38-ed7d-4157-ad6f-af8fea5a4b2c', embedding=None, metadata={'Header 1': 'Rasa Customer Service Bot', 'filename': 'README (1).md', 'extension': '.md'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='a7c31f47-74e7-4991-841f-864a681ce53f', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'README (1).md', 'extension': '.md'}, hash='3d5f2703485b1f4903690fdb8a57085949b74cdfce9ac9729e483d0fec4831cf'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='fbf91bad-d53d-43f7-b6e4-a2a5e07f6b34', node_type=<ObjectType.TEXT: '1'>, metadata={'Header 1': 'Rasa Customer Service Bot', 'Header 2': 'File Structure'}, hash='9250da70b13c424cdadb0346b78eabba52f085cc7e8d1856612fe7c6959381b9')}, text='Rasa Customer Service Bot\n\nWelcome to the Rasa Customer Service Bot! This bot is designed to assist users from three different counties: Clay County, Utah, and West Hollywood. It provides customer service functionalities tailored to the needs and inquiries specific to each county.', start_char_idx=2, end_char_idx=283, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
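The metadata in the output above records the heading path of each node ("Header 1", "Header 2"). A simplified sketch of heading-based splitting that tracks the same kind of path could look like the following. It is an illustration only, not MarkdownNodeParser's implementation: `split_markdown` is a hypothetical name, and the sketch does not handle `#` characters inside fenced code blocks.

```python
import re


def split_markdown(text: str) -> list[dict]:
    """Split Markdown on headings and record the active heading path per chunk."""
    chunks = []
    headers = {}      # heading level -> heading title currently in effect
    current = []      # lines of the chunk being built

    def flush():
        body = "\n".join(current).strip()
        if body:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(headers.items())}
            chunks.append({"text": body, "metadata": metadata})
        current.clear()

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # close the previous section under the old headings
            level = len(match.group(1))
            # A new heading replaces its own level and drops deeper ones.
            headers = {lvl: t for lvl, t in headers.items() if lvl < level}
            headers[level] = match.group(2).strip()
            current.append(match.group(2).strip())
        else:
            current.append(line)
    flush()
    return chunks


doc = "# Bot\n\nWelcome.\n\n## File Structure\n\nSee below.\n\n# Other\n\nEnd."
chunks = split_markdown(doc)
for chunk in chunks:
    print(chunk["metadata"], "|", chunk["text"].replace("\n", " / "))
```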


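These node parsers are each designed for a specific file type, such as HTML, Markdown, or JSON. You can use them directly for specialized tasks, or choose SimpleFileNodeParser, which picks a suitable parser for each file automatically. Give them a try.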


Text Splitters


Code Splitter

CodeSplitter splits raw code text according to the language it is written in.


from llama_index.core.node_parser import CodeSplitter
documents = FlatReader().load_data(Path("/content/mnist_utils.py"))
splitter = CodeSplitter(
    language="python",
    chunk_lines=40,  # lines per chunk
    chunk_lines_overlap=15,  # lines overlap between chunks
    max_chars=1500,  # max chars per chunk
)
nodes = splitter.get_nodes_from_documents(documents)
nodes[0]


# output
TextNode(id_='85b26f19-39bb-45c1-99fa-477059f9f7f1', embedding=None, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='9b63f7a5-3c7e-46f8-a684-b865dbfc1976', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, hash='8808ab4e838d76ca2c412ba4c90720ec67c23dea1e45434670022106b5bc254e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='4b8b8c05-abc6-41ea-a9b5-0d2116cd1fe7', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='333ebfda7feae905b19eee9a88674348a2a2cae61a182cbbda347d17c2571d4b')}, text='# mnist_utils.py\n\nimport torch\nimport torch.nn as nn\nimport torch.optim as optim\nimport torchvision\nimport torchvision.transforms as transforms\nimport numpy as np\nimport matplotlib.pyplot as plt', start_char_idx=0, end_char_idx=194, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
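
To illustrate what "splitting by language" means, here is a minimal sketch that cuts Python source only at top-level statement boundaries and packs the pieces under a character limit. It is an assumption-laden stand-in: it uses Python's own `ast` module rather than whatever parser CodeSplitter relies on, it does not reproduce the `chunk_lines` or `chunk_lines_overlap` behaviour, and `split_python_source` is a hypothetical name.

```python
import ast


def split_python_source(source, max_chars=1500):
    """Group top-level statements into chunks of at most max_chars characters.

    A single statement longer than max_chars is kept whole rather than cut.
    Raises SyntaxError if the source does not parse.
    """
    lines = source.splitlines()
    tree = ast.parse(source)

    # First line (1-based) of every top-level statement, decorators included.
    starts = [
        min([node.lineno] + [d.lineno for d in getattr(node, "decorator_list", [])])
        for node in tree.body
    ]
    if not starts:
        return [source.rstrip()] if source.strip() else []
    starts[0] = 1  # keep leading comments with the first statement

    # Cut the file into segments that each begin at a statement boundary.
    bounds = starts + [len(lines) + 1]
    segments = [
        "\n".join(lines[begin - 1:end - 1])
        for begin, end in zip(bounds, bounds[1:])
    ]

    # Greedily pack consecutive segments into chunks under the size limit.
    chunks, current = [], ""
    for segment in segments:
        candidate = f"{current}\n{segment}" if current else segment
        if current and len(candidate) > max_chars:
            chunks.append(current.rstrip())
            current = segment
        else:
            current = candidate
    if current:
        chunks.append(current.rstrip())
    return chunks


sample = '''# utils.py
import math


def area(radius):
    return math.pi * radius ** 2


def perimeter(radius):
    return 2 * math.pi * radius
'''

for chunk in split_python_source(sample, max_chars=60):
    print(repr(chunk))
```

Because cuts only happen between statements, no function body is split in half, which is the main advantage over a plain character or line window.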


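Sentence Splitter

SentenceSplitter tries to split text while respecting sentence boundaries, so chunks end at the end of a sentence wherever the size limit allows.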


from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
nodes[0]


# output
TextNode(id_='c4cb8b40-8130-4351-85d0-d570af4435da', embedding=None, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='9b63f7a5-3c7e-46f8-a684-b865dbfc1976', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, hash='8808ab4e838d76ca2c412ba4c90720ec67c23dea1e45434670022106b5bc254e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='48cee59c-50da-4328-939f-ff94ad38e8ef', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='b2ce1941dd9ca73caa1e6d656d9ea1086acf519ceba4c069c3d6a7633539b950')}, text='# mnist_utils.py\n\nimport torch\nimport torch.nn as nn\nimport torch.optim as optim\nimport torchvision\nimport torchvision.transforms as transforms\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n\ndef load_mnist(batch_size=64):\n    """\n    Load the MNIST dataset and create data loaders for training and testing.\n\n    Parameters:\n    - batch_size (int): The batch size for data loaders. Default is 64.\n\n    Returns:\n    - trainloader (torch.utils.data.DataLoader): Data loader for the training set.\n    - testloader (torch.utils.data.DataLoader): Data loader for the test set.\n\n    This function loads the MNIST dataset using torchvision.datasets.MNIST. It applies\n    transforms to normalize the pixel values to the range [-1, 1]. It then creates data\n    loaders for the training and test sets using torch.utils.data.DataLoader. 
The training\n    data loader shuffles the data, while the test data loader does not shuffle the data.\n\n    Example Usage:\n    trainloader, testloader = load_mnist(batch_size=128)\n    """\n    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])\n    trainset = torchvision.datasets.MNIST(root=\'./data\', train=True, download=True, transform=transform)\n    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)\n    testset = torchvision.datasets.MNIST(root=\'./data\', train=False, download=True, transform=transform)\n    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)\n    return trainloader, testloader', start_char_idx=0, end_char_idx=1545, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
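
The idea of respecting sentence boundaries can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: whitespace-separated words stand in for tokens, sentences are found with a naive regex, and `sentence_chunks` is a hypothetical helper rather than SentenceSplitter's real logic.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_count(sentences):
    return sum(len(s.split()) for s in sentences)


def sentence_chunks(text, chunk_size=20, chunk_overlap=5):
    """Pack whole sentences into chunks of at most chunk_size words.

    Trailing sentences totalling at most chunk_overlap words are repeated at
    the start of the next chunk. A sentence longer than chunk_size is kept whole.
    """
    chunks, current = [], []  # current holds the sentences of the open chunk
    for sentence in split_sentences(text):
        if current and word_count(current) + len(sentence.split()) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over as many trailing sentences as fit in the overlap budget.
            overlap, budget = [], chunk_overlap
            for prev in reversed(current):
                words = len(prev.split())
                if words > budget:
                    break
                overlap.insert(0, prev)
                budget -= words
            # Drop the overlap if it would push the next chunk over the limit.
            if word_count(overlap) + len(sentence.split()) > chunk_size:
                overlap = []
            current = overlap
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Language models have limited context windows. "
    "Chunking keeps each input small. "
    "Sentence boundaries keep each idea intact. "
    "Overlap preserves nearby context."
)

for chunk in sentence_chunks(text, chunk_size=12, chunk_overlap=5):
    print(repr(chunk))
```

In this sample the second sentence is short enough to fit the overlap budget, so it closes the first chunk and opens the second.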


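Sentence Window Node Parser

SentenceWindowNodeParser works like the other node parsers, except that it splits every document into individual sentences. Each resulting node also stores, in its metadata, a "window" of the sentences surrounding it. Note that this metadata is not visible to the LLM or to the embedding model. The approach is most useful for generating embeddings with a very specific scope. Combined with a metadata replacement post-processor (MetadataReplacementPostProcessor in llama-index), it lets you replace each sentence with its surrounding context before the node is sent to the LLM.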


import nltk
from llama_index.core.node_parser import SentenceWindowNodeParser
node_parser = SentenceWindowNodeParser.from_defaults(
    # how many sentences on either side to capture
    window_size=3,
    # the metadata key that holds the window of surrounding sentences
    window_metadata_key="window",
    # the metadata key that holds the original sentence
    original_text_metadata_key="original_sentence",
)
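
To show what ends up in each node, here is a small pure-Python sketch of the sentence-window idea. The parameter names mirror the ones used above, but `sentence_window_nodes` and `replace_with_window` are hypothetical helpers, the nodes are plain dictionaries, and the sentence splitting is a naive regex.

```python
import re


def sentence_window_nodes(text, window_size=3,
                          window_metadata_key="window",
                          original_text_metadata_key="original_sentence"):
    """Return one node per sentence, with its neighbours stored in metadata."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,  # only the single sentence would be embedded
            "metadata": {
                window_metadata_key: " ".join(sentences[start:end]),
                original_text_metadata_key: sentence,
            },
        })
    return nodes


def replace_with_window(node, target_metadata_key="window"):
    """Mimic the post-processing step: swap the sentence for its window."""
    return node["metadata"].get(target_metadata_key, node["text"])


nodes = sentence_window_nodes("One. Two. Three. Four. Five.", window_size=1)
print(nodes[2]["text"])
print(replace_with_window(nodes[2]))
```

Retrieval matches against the narrow single-sentence text, while the LLM later receives the wider window, which is the trade-off this parser is built around.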


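Semantic Splitter Node Parser

"Semantic chunking" takes a different approach: instead of splitting text at a predetermined chunk size, the semantic splitter picks breakpoints between sentences adaptively, based on embedding similarity. This ensures that each "chunk" consists of sentences that are semantically related to one another.


Caveats:


  • The sentence-splitting regex is tuned primarily for English text.
  • You may need to adjust the breakpoint percentile threshold.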


from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)
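
The breakpoint selection can be sketched without any external service. In the example below, a bag-of-words counter stands in for a real embedding model such as OpenAIEmbedding, and the percentile rule is an assumption about how `breakpoint_percentile_threshold` is applied; `embed`, `cosine_distance`, `percentile`, and `semantic_chunks` are all hypothetical names.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(text, buffer_size=1, breakpoint_percentile_threshold=95):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    # Embed each sentence together with buffer_size neighbours on either side.
    groups = [
        " ".join(sentences[max(0, i - buffer_size):i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [embed(group) for group in groups]
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, breakpoint_percentile_threshold)

    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:  # large jump in meaning: break after sentence i
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


text = ("Cats purr softly. Cats sleep often. Cats chase mice. "
        "Stocks fell sharply. Stocks rose later.")
for chunk in semantic_chunks(text):
    print(repr(chunk))
```

Lowering the percentile produces more, smaller chunks, which is why the threshold usually needs tuning per corpus.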


Token Text Splitter

TokenTextSplitter tries to split text into chunks of a consistent size based on raw token counts.


from llama_index.core.node_parser import TokenTextSplitter
splitter = TokenTextSplitter(
    chunk_size=1024,
    chunk_overlap=20,
    separator=" ",
)
nodes = splitter.get_nodes_from_documents(documents)
nodes[0]


# output
TextNode(id_='6747586b-52bd-489e-afb4-8f4fa21f11c3', embedding=None, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='9b63f7a5-3c7e-46f8-a684-b865dbfc1976', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, hash='8808ab4e838d76ca2c412ba4c90720ec67c23dea1e45434670022106b5bc254e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='81ee01e6-e91a-4f87-a56e-958d2618ef0c', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='c518236a47be8b69aa783b7bc77b4f59abee3d8f8eac35f5016979931366df23')}, text='# mnist_utils.py\n\nimport torch\nimport torch.nn as nn\nimport torch.optim as optim\nimport torchvision\nimport torchvision.transforms as transforms\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n\ndef load_mnist(batch_size=64):\n    """\n    Load the MNIST dataset and create data loaders for training and testing.\n\n    Parameters:\n    - batch_size (int): The batch size for data loaders. Default is 64.\n\n    Returns:\n    - trainloader (torch.utils.data.DataLoader): Data loader for the training set.\n    - testloader (torch.utils.data.DataLoader): Data loader for the test set.\n\n    This function loads the MNIST dataset using torchvision.datasets.MNIST. It applies\n    transforms to normalize the pixel values to the range [-1, 1]. It then creates data\n    loaders for the training and test sets using torch.utils.data.DataLoader. 
The training\n    data loader shuffles the data, while the test data loader does not shuffle the data.\n\n    Example Usage:\n    trainloader, testloader = load_mnist(batch_size=128)\n    """\n    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])\n    trainset = torchvision.datasets.MNIST(root=\'./data\', train=True, download=True, transform=transform)\n    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)\n    testset = torchvision.datasets.MNIST(root=\'./data\', train=False, download=True, transform=transform)\n    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)\n    return trainloader, testloader\n\n\ndef create_simple_nn():\n    """\n    Create a simple neural network model for image classification.\n\n    Returns:\n    - model (torch.nn.Module): The neural network model.\n\n    This function defines a simple neural network model for image classification tasks.\n    The model consists of a sequence of layers:\n    - Flatten layer: Reshapes the input image tensor into a 1D tensor.\n    - Fully connected (Linear) layer: Converts the flattened input into a hidden representation.\n      The input size is 28*28 (MNIST image size) and the output size is 128.\n    - ReLU activation function: Applies element-wise ReLU activation to introduce non-linearity.\n    - Fully connected (Linear) layer: Converts the hidden representation into class probabilities.\n      The input size is 128 (output of the previous layer) and the output size is 10 (number of classes).\n\n    Example Usage:\n    model = create_simple_nn()\n    """\n    model = nn.Sequential(\n        nn.Flatten(),\n        nn.Linear(28*28, 128),\n        nn.ReLU(),\n        nn.Linear(128, 10)\n    )\n    return model\n\ndef train_model(model, trainloader, optimizer, criterion, epochs=10):\n    """\n    Train a neural network model using the provided data loader, optimizer, and loss 
function.\n\n    Parameters:\n    - model (torch.nn.Module): The neural network model to train.\n    - trainloader (torch.utils.data.DataLoader): Data loader for the training dataset.\n    - optimizer (torch.optim.Optimizer): Optimizer to update the model parameters.\n    - criterion (torch.nn.Module): Loss function to compute the training loss.\n    - epochs (int): Number of epochs for training. Default is 10.\n\n    Returns:\n    - history (dict): Dictionary containing training history (loss and accuracy).\n\n    This function trains a neural network model using the provided data loader, optimizer, and\n    loss function. It iterates over the specified number of epochs and updates the model parameters\n    based on the training data. At each epoch, it computes the average loss and accuracy, and stores\n    them in a dictionary called \'history\'. The \'history\' dictionary contains two lists: \'loss\' and\n    \'accuracy\', which track the training loss and accuracy over epochs, respectively.\n\n    Example Usage:\n    history = train_model(model, trainloader, optimizer, criterion, epochs=10)\n    """\n    history = {\'loss\': [], \'accuracy\': []}\n    for epoch in range(epochs):\n        running_loss', start_char_idx=0, end_char_idx=3962, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
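
Unlike the sentence splitter, this chunk stops mid-statement (`running_loss`), because only the token budget decides where to cut. A minimal sketch of that sliding-window behaviour follows, assuming separator-delimited words as a stand-in for real tokenizer tokens; `token_chunks` is a hypothetical helper.

```python
def token_chunks(text, chunk_size=1024, chunk_overlap=20, separator=" "):
    """Slide a window of chunk_size tokens over the text.

    Consecutive windows share chunk_overlap tokens, so the window advances by
    chunk_size - chunk_overlap tokens each step.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = [t for t in text.split(separator) if t]
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(separator.join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


text = " ".join(str(i) for i in range(10))
for chunk in token_chunks(text, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Every chunk has the same size except possibly the last, which makes this splitter predictable for context budgeting at the cost of ignoring sentence and code structure.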


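Relation-Based Node Parsers


Hierarchical Node Parser

This node parser chunks its input into a hierarchy of nodes: a single input yields several levels of chunks at different sizes, and each node holds a reference to its parent node.


When combined with the AutoMergingRetriever, retrieved nodes are automatically replaced by their parent node once a majority of that parent's children have been retrieved. This gives the LLM more complete context for response synthesis.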


from llama_index.core.node_parser import HierarchicalNodeParser
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
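
The parent references and the merge rule can be sketched in plain Python. This is an illustration under stated assumptions: chunk sizes are counted in words rather than tokens, nodes are plain dictionaries, the merge threshold of 0.5 is an assumed default, and `build_hierarchy` and `auto_merge` are hypothetical helpers rather than the library's API.

```python
def build_hierarchy(text, chunk_sizes=(8, 4, 2)):
    """Split text into nested word-count chunks; return a flat list of nodes."""
    nodes = []

    def split(words, level, parent_id):
        size = chunk_sizes[level]
        for start in range(0, len(words), size):
            piece = words[start:start + size]
            node_id = len(nodes)
            nodes.append({"id": node_id, "parent": parent_id,
                          "level": level, "text": " ".join(piece)})
            if level + 1 < len(chunk_sizes):
                split(piece, level + 1, node_id)  # children reference this node

    split(text.split(), 0, None)
    return nodes


def auto_merge(retrieved_ids, nodes, ratio=0.5):
    """Replace retrieved siblings by their parent when more than `ratio`
    of that parent's children were retrieved; repeat up the hierarchy."""
    retrieved = set(retrieved_ids)
    changed = True
    while changed:
        changed = False
        parents = {nodes[i]["parent"] for i in retrieved
                   if nodes[i]["parent"] is not None}
        for parent_id in parents:
            children = {n["id"] for n in nodes if n["parent"] == parent_id}
            hit = children & retrieved
            if len(hit) / len(children) > ratio:
                retrieved = (retrieved - children) | {parent_id}
                changed = True
    return sorted(retrieved)


text = " ".join(f"w{i}" for i in range(16))
nodes = build_hierarchy(text)
leaf_ids = [n["id"] for n in nodes if n["level"] == 2]
print(leaf_ids)
print(auto_merge([2, 3, 9], nodes))
```

Small leaf chunks keep retrieval precise, while merging back into a parent restores the wider passage when several of its pieces are relevant at once.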


Source: https://medium.com/@bavalpreetsinghh/llamaindex-chunking-strategies-for-large-language-models-part-1-ded1218cfd30