RAG Optimisation: Use an LLM to Chunk Your Text Semantically

Published July 9, 2024 by alex

In this article you will learn more advanced chunking and embedding techniques. The mechanisms we cover include hierarchical chunking, rule-based chunking, and semantic chunking.


First, why does semantic search need chunking?

When searching a bucket of documents semantically, you can create one vector representation per document. That approach works for short, focused documents. A news item about a specific topic can be turned into a single vector and used for topic search. Compare it to writing a summary of the content: if you can write a very short summary that contains all the required information, a single vector for the whole document will most likely work. Now imagine you are looking for a small fact that the summary does not mention. A single vector over the whole document will also miss the semantics of that fact.


You can solve this by using a splitter to create chunks from the full text. Two well-known chunking mechanisms are:

  • Sentence splitting: each sentence becomes its own chunk
  • Max-token splitting: each chunk holds at most a maximum number of tokens
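Both mechanisms can be sketched in a few lines of Python. This is a minimal illustration, not the splitter used later in this article: the sentence splitter breaks on punctuation, and the max-token splitter counts whitespace-separated words as a stand-in for real tokenizer tokens.

```python
import re


def split_sentences(text: str) -> list:
    # Naive sentence splitting: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_max_tokens(text: str, max_tokens: int) -> list:
    # Whitespace-separated words stand in for real tokenizer tokens here.
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]


text = "Semantic search is powerful. It goes beyond keywords. Chunking helps."
print(split_sentences(text))      # three sentence chunks
print(split_max_tokens(text, 5))  # two chunks of at most five words
```

Real implementations would use a proper sentence boundary detector and the tokenizer of the embedding model, but the idea is the same.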


The challenge with chunks

Each chunk must carry enough semantics to be relevant when people search. The problem with a sentence is that it may not contain all the information related to a semantic topic. The same can happen with max-token chunks; in addition, because of the token limit, multiple semantic or knowledge items can end up in the same chunk. A single vector then averages over everything in that chunk. As a result, a question may fail to match the chunk even though the chunk does contain the relevant information.
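A toy calculation makes the averaging problem concrete. Assume, purely for illustration, two orthogonal topic embeddings inside one chunk; the chunk's averaged vector matches a single-topic query noticeably worse than the topic vector itself would:

```python
import numpy as np


def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-d "embeddings" for two unrelated topics that share one chunk.
topic_a = np.array([1.0, 0.0, 0.0])  # e.g. workshop logistics
topic_b = np.array([0.0, 1.0, 0.0])  # e.g. vector stores
chunk_vec = (topic_a + topic_b) / 2  # the averaged chunk embedding

query = topic_b  # a query that is purely about topic B
print(cosine(query, topic_b))              # 1.0: perfect match on its own
print(round(cosine(query, chunk_vec), 3))  # 0.707: diluted by topic A
```

Real embeddings are not averages of clean topic vectors, but the effect is the same: the more unrelated topics a chunk mixes, the weaker its match with any single-topic query.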


Overcoming the semantic problems of chunks

In short: small chunks can carry semantics without enough context, and large chunks can carry multiple semantics. What if we could make chunks separate knowledge, or semantics, cleanly? Can an LLM help us do knowledge-based chunking?


The SectionSplitter we use in the example:


import re
from typing import List

# Splitter, InputDocument, and Chunk come from the RAG4p library.

class SectionSplitter(Splitter):
    def split(self, input_document: InputDocument) -> List[Chunk]:
        # Split the text into sections on empty lines.
        sections = re.split(r"\n\s*\n", input_document.text)
        print(f"Num sections: {len(sections)}")
        chunks_ = []
        for i, section in enumerate(sections):
            chunk_ = Chunk(input_document.document_id, i, len(sections), section, input_document.properties)
            chunks_.append(chunk_)
        return chunks_

    @staticmethod
    def name() -> str:
        return SectionSplitter.__name__


The parent Splitter class comes from the RAG4p library. We use this splitter to cut the text into sections that serve as input for the semantic, or knowledge, splitter. The next code block shows how to create knowledge chunks using OpenAI.


import json

from openai import OpenAI

openai_client = OpenAI(api_key=key_loader.get_openai_api_key())

def fetch_knowledge_chunks(orig_chunk: Chunk) -> List[Chunk]:
    prompt = f"""Task: Extract Knowledge Chunks
    
    Please extract knowledge chunks from the following text. Each chunk should 
    capture distinct, self-contained units of information in a 
    subject-description format. Return the extracted knowledge chunks as 
    a JSON object or array, ensuring that each chunk includes both the 
    subject and its corresponding description. Use the format: 
    {{"knowledge_chunks": [{{"subject": "subject", "description": 
    "description"}}]}}
    
    Text:
    {orig_chunk.chunk_text}
    """
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You are an assistant that takes apart a piece of text into semantic chunks to be used in a RAG system."},
            {"role": "user", "content": prompt},
        ],
        stream=False,
    )
    
    answer = json.loads(completion.choices[0].message.content)
    chunks_ = []
    for index, kc in enumerate(answer["knowledge_chunks"]):
        chunk_ = Chunk(orig_chunk.get_id(), index, len(answer["knowledge_chunks"]), f'{kc["subject"]}: {kc["description"]}', {"original_text": orig_chunk.chunk_text, "original_chunk_id": orig_chunk.get_id(), "original_total_chunks": orig_chunk.total_chunks})
        chunks_.append(chunk_)
        
    return chunks_


Note that the OpenAI client supports response_format, so you can tell OpenAI to return a JSON object as the response. I had to spell out the format to use in the prompt; before that, the responses were not consistent. We can run this on the sections produced in the previous step. The first section contains the following text:


Ever thought about building your own question-answering system? Like Siri, Alexa, or Google Assistant? Well, we have got something great lined up for you! In our hands-on workshop, we will guide you through the ins and outs of creating a question-answering system. We prefer to use Python for the training. We have prepared a GUI that works with Python. If you prefer another language, you can still attend the workshop, but you will miss out on the GUI used to test your application.


The following knowledge chunks are extracted in the "subject: description" format:

  • Building a question-answering system: learn how to build a question-answering system similar to Siri, Alexa, or Google Assistant.
  • Workshop: a hands-on workshop guides you through creating a question-answering system.
  • Programming language preference: Python is the preferred programming language for the workshop.
  • GUI for Python: a GUI that works with Python is available for testing your application.
  • Programming language flexibility: other programming languages can be used in the workshop, but without the GUI for testing.


Matching the query semantically against the answer context


Running a vector query

We can reuse some components from the RAG4p framework. We use the InternalContentStore to store chunks, create embeddings, and run queries.


from rag4p.integrations.openai import EMBEDDING_SMALL
# Create an in memory content store to hold some chunks
openai_embedder = OpenAIEmbedder(api_key=key_loader.get_openai_api_key(), embedding_model=EMBEDDING_SMALL)
content_store = InternalContentStore(embedder=openai_embedder, metadata=None)
for chunk in chunks:
    knowledge_chunks = fetch_knowledge_chunks(chunk)
    content_store.store(knowledge_chunks)


With the content store in place, we can start running queries:


result = content_store.find_relevant_chunks("What are examples of a RAG system?")
for found_chunk in result:
    print(f"Score: {found_chunk.score:.3f}, Chunk: {found_chunk.get_id()}, \
        Num chunks: {found_chunk.total_chunks} \n {found_chunk.chunk_text}")


Finding relevant chunks for query: What are examples of a RAG system?
Score: 1.188, Chunk: input-doc_0_0, Num chunks: 5 
 Building a question-answering system: The process of creating a system similar to Siri, Alexa, or Google Assistant that can answer questions.
Score: 1.213, Chunk: input-doc_4_3, Num chunks: 4 
 Pipeline Creation: Tools for creating a pipeline include Langchain and Custom solutions.
Score: 1.223, Chunk: input-doc_4_1, Num chunks: 4 
 Large Language Model: Large Language Models that can be used include OpenAI, HuggingFace, Cohere, PaLM, and Bedrock.
Score: 1.223, Chunk: input-doc_0_1, Num chunks: 5 
 Workshop offering: A hands-on workshop that guides participants through creating a question-answering system.


When a chunk is matched, we can fetch the original text it came from. That text is a good basis for the context we send back to the LLM to generate an answer to the provided question.


Answering the question using RAG (Retrieval-Augmented Generation)

Next, we use the context to answer the question with a Large Language Model. The next code block shows how we construct the answer.


question = "What are examples of a q&a systems?"
result = content_store.find_relevant_chunks(question, max_results=1)
found_chunk = result[0]
context = found_chunk.properties["original_text"]
openai_answer_generator = OpenaiAnswerGenerator(openai_api_key=key_loader.get_openai_api_key())
answer = openai_answer_generator.generate_answer(question, context)
print(answer)


Examples are Siri, Alexa, and Google Assistant.


And for the question "What will we learn?":


You will learn how to use vector stores and Large Language Models, and how to combine the two to perform semantic search, which goes beyond traditional keyword-based search.


Things get considerably harder when someone asks a question that needs multiple knowledge chunks, because it asks for more than one knowledge item. In that case, we have to do the same with the question: extract the knowledge parts from it.


First, we create a function that extracts sub-questions from the provided question.


def fetch_knowledge_question_chunks(orig_text: str) -> List[str]:
    prompt = f"""Task: Extract Knowledge parts from question to use in a RAG system
        
        Please extract sub questions from the following question. Each sub-question 
        should ask for distinct, self-contained units of information. Return 
        the subquestions as a JSON array, ensuring that each item is a question. 
        Use the format: {{"sub_questions": ["question1", "question2"]}}
        
        Text:
        {orig_text}
        """
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You are an assistant that takes apart a question into sub-questions."},
            {"role": "user", "content": prompt},
        ],
        stream=False,
    )
    answer_ = json.loads(completion.choices[0].message.content)
    parts_ = []
    if "sub_questions" not in answer_:
        print(f"Error in answer: {answer_}")
        return parts_
    for know_part in answer_["sub_questions"]:
        parts_.append(know_part)
    return parts_


In the next step, we extract the sub-questions, fetch the relevant chunk for each sub-question, combine the different original texts into one context, and ask the LLM to answer the complete question using that context.


question = "What is semantic search and what vector stores are we using?"
query_parts = fetch_knowledge_question_chunks(question)
context_parts = []
for part in query_parts:
    print(part)
    result = content_store.find_relevant_chunks(part, max_results=1)
    found_chunk = result[0]
    context_parts.append(found_chunk.properties["original_text"])
    print(
        f"Score: {found_chunk.score:.3f}, Chunk: {found_chunk.get_id()}, Num chunks: {found_chunk.total_chunks} \n{found_chunk.chunk_text}")
context = " ".join(context_parts)
openai_answer_generator = OpenaiAnswerGenerator(openai_api_key=key_loader.get_openai_api_key())
answer = openai_answer_generator.generate_answer(question, context)
print(f"Context: \n{context}")
print(f"\nAnswer: \n{answer}")


What is semantic search?
Finding relevant chunks for query: What is semantic search?
Score: 0.678, Chunk: input-doc_1_2, Num chunks: 4 
Introduction to semantic search: Semantic search is the next big thing after 
traditional keyword-based searches.
What vector stores are we using?
Finding relevant chunks for query: What vector stores are we using?
Score: 0.881, Chunk: input-doc_4_0, Num chunks: 4 
Vector Store: Tools for storing vectors include OpenSearch, Elasticsearch, 
and Weaviate.
Context: 
You'll get your hands dirty with vector stores and Large Language Models, we 
help you combine these two in a way you've never done before. You've probably 
used search engines for keyword-based searches, right? Well, prepare to have 
your mind blown. We'll dive into something called semantic search, which is 
the next big thing after traditional searches. It’s like moving from asking 
Google to search "best pizza places" to "Where can I find a pizza place that 
my gluten-intolerant, vegan friend would love?" – you get the idea, right? 
Some of the highlights of the workshop: 
- Use a vector store (OpenSearch, Elasticsearch, Weaviate)
- Use a Large Language Model (OpenAI, HuggingFace, Cohere, PaLM, Bedrock)
- Use a tool for content extraction (Unstructured, Llama)
- Create your pipeline (Langchain, Custom)

Answer: 
Semantic search is a type of search that goes beyond traditional keyword-based 
searches and understands the context and intent of the query. For example, 
instead of just searching for "best pizza places," semantic search can 
understand and find "Where can I find a pizza place that my gluten-intolerant, 
vegan friend would love?"
The vector stores being used in the workshop are OpenSearch, Elasticsearch, 
and Weaviate.
Source: https://medium.com/@jettro.coenradie/rag-optimisation-use-an-llm-to-chunk-your-text-semantically-ac768f1566d0