How to Turn CSV Data into a Knowledge Graph with RDFLib, Neo4j, and LangChain

Published September 5, 2024 by alex

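Overview of Retrieval-Augmented Generation (RAG)

Before starting the project, let's look at what RAG is. A RAG pipeline follows this basic flow:

A pre-trained large language model (LLM) handles the "generation" part of RAG. Without RAG, the LLM answers questions using only the data it was trained on.
The user connects to the LLM through a chat interface and asks it questions.
When the user enters a question, it is converted into a numerical representation of its semantic content. This step is called vector embedding; the source data is embedded the same way and kept in a vector store.
The model runs a vector search over these machine-readable representations to find the data most similar to the user's query.
The model generates a response from the vector search results and returns the answer to the user.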
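
The flow above can be sketched in a few lines. This is a toy illustration rather than the pipeline built later in this article: the `embed` function is a simple word-count stand-in for a learned embedding model, the three documents are made up, and the final LLM call is left out so that only the embed, search, and prompt-assembly steps are shown.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, store, k=1):
    """Return the k stored documents most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Made-up documents standing in for the vector store contents
documents = [
    "Maria was born in the spring.",
    "The clinic is open on weekdays.",
    "Neo4j stores data as nodes and relationships.",
]

question = "When was Maria born?"
top = retrieve(question, documents, k=1)
print(build_prompt(question, top))
```

A real pipeline swaps `embed` for a trained embedding model and passes the assembled prompt to the LLM; the ranking step keeps the same shape.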

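The Highs and Lows of RAG for Large Language Models

The impact of retrieval-augmented generation (RAG) on the large language model (LLM) industry is strong and undeniable. Before RAG, LLMs used in business operations were costly to retrain, "frozen in time" with outdated answers, unreliable at retrieving context, and prone to hallucination. With RAG, models retrieve data dynamically, adapt easily to new information, and become more cost-effective.

However, because RAG relies on similarity search over a vector index, it has drawbacks. AI experts Markus J. Buehler, Anthony Alcaraz, and Sam Schifman argue that knowledge graphs can give RAG stronger reasoning capabilities than vector search alone. Vector search often lacks the semantic connections needed to understand complex data, which can lead to inaccurate answers and unsupported conclusions, commonly called hallucinations. Depending on the specific requirements and data of the business, this may or may not be a problem. When it is, a knowledge graph can supply the missing semantic relationships and support a more structured understanding than keyword-level similarity does.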
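
To make the contrast concrete, here is a minimal sketch of how explicit relationships answer a question that joins two facts. The triples, identifiers, and relationship names are hypothetical and are not taken from the Synthea data; the point is only that a graph lookup follows named edges instead of ranking text by similarity.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples
triples = [
    ("patient:1", "HAS_CONDITION", "condition:covid19"),
    ("patient:2", "HAS_CONDITION", "condition:covid19"),
    ("patient:1", "TREATED_AT", "hospital:A"),
    ("patient:2", "TREATED_AT", "hospital:B"),
    ("patient:3", "TREATED_AT", "hospital:A"),
]

def build_index(triples):
    """Index subjects by (predicate, object) so edges can be followed directly."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(p, o)].add(s)
    return index

def patients_with(index, condition, hospital):
    """Patients that have the condition AND were treated at the hospital."""
    matches = index[("HAS_CONDITION", condition)] & index[("TREATED_AT", hospital)]
    return sorted(matches)

index = build_index(triples)
result = patients_with(index, "condition:covid19", "hospital:A")
print(result)
```

A similarity search over free text could surface passages mentioning COVID-19 or hospital A, but it has no built-in way to require that both facts hold for the same patient.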

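Example: RAG on Simulated Patient Population Data

In this project I will use simulated patient population data from Synthea: 10,000 synthetic patient records with COVID-19, in CSV format.

Prerequisites for running locally

Windows operating system
Neo4j Community Edition 5+
JDK 22.0.1, Java 17, or Java 21
Ollama installed
Python development environment, version 3.9+

Generating RDF from CSV

First, I will use the RDFLib Python package to convert the CSV data into Resource Description Framework Schema (RDFS) data. Once the data is in RDF, the RDFLib-Neo4j Python package can move it into Neo4j with little friction, because RDF is graph-based and built on triples.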

import pandas as pd
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Load the Synthea patient table (the file name is an assumption; adjust the path)
patients = pd.read_csv('patients.csv')

g = Graph()
# Namespace URIs
PPL = Namespace('http://example.org/people/')
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
SCHEMA = Namespace("http://schema.org/")
# Bind namespaces
g.bind("foaf", FOAF)
g.bind("schema", SCHEMA)
g.bind("ppl", PPL)

# Note: each value is modeled as a literal subject pointing at the patient.
# RDFLib accepts this, although standard RDF does not allow literal subjects.
for _, row_val in patients.iterrows():
    pt_id = URIRef(f"http://example.org/ID/{row_val['Id']}")

    # Nodes
    g.add((pt_id, RDF.type, FOAF.Identifier))
    g.add((Literal(row_val['BIRTHDATE']), RDF.type, FOAF.Date))
    g.add((Literal(row_val['DEATHDATE']), RDF.type, FOAF.Date))
    g.add((Literal(row_val['SSN']), RDF.type, PPL.SSN))
    g.add((Literal(row_val['FIRST']), RDF.type, SCHEMA.FirstName))
    g.add((Literal(row_val['LAST']), RDF.type, SCHEMA.LastName))
    g.add((Literal(row_val['GENDER']), RDF.type, FOAF.Gender))

    # Relationships
    g.add((Literal(row_val['FIRST']), SCHEMA['FIRST_NAME_OF'], pt_id))
    g.add((Literal(row_val['LAST']), SCHEMA['LAST_NAME_OF'], pt_id))
    g.add((Literal(row_val['BIRTHDATE']), SCHEMA['BIRTHDAY_OF'], pt_id))
    g.add((Literal(row_val['DEATHDATE']), SCHEMA['DEATHDATE_OF'], pt_id))
    g.add((Literal(row_val['SSN']), PPL['SSN_OF'], pt_id))
    g.add((Literal(row_val['GENDER']), FOAF['GENDER_OF'], pt_id))

# Serialize the RDF graph to a file in Turtle syntax
rdf_file_path = 'patient_data.rdf'
g.serialize(destination=rdf_file_path, format='turtle')

The generated RDF file should look like this:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ppl: <http://example.org/people/> .
@prefix schema1: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"1920-06-05" a foaf:Date ;
    schema1:BIRTHDAY_OF <http://example.org/ID/02c1129d-ef80-4780-8c8b-781a1b268adc>,
        <http://example.org/ID/2c39cdb6-2203-4b8d-ba8f-6563265edc3a>,
        <http://example.org/ID/2e05524f-c6ad-4d13-bc52-2d1b30f79f46>,
        <http://example.org/ID/5baf2e3b-45fc-4a8d-93cd-d1acaa7833b9>,
        <http://example.org/ID/9f707a4d-38e6-47bb-bebd-93b0fed4c354>,
        <http://example.org/ID/c16d2933-77b3-47eb-9ab4-5f53251404ae>,
        <http://example.org/ID/f0a3834a-f4b3-4811-a297-ec8268aba4f7>,
        <http://example.org/ID/f6344e5b-3487-4b6c-ac7d-6ad1d4b98b5e> .
"2016-10-14" a foaf:Date ;
    schema1:DEATHDATE_OF <http://example.org/ID/2d13445a-343a-452e-b607-ce7807054b69> .
"999-54-3934" a ppl:SSN ;
    ppl:SSN_OF <http://example.org/ID/30cdccd6-95c1-4e0f-8249-365a440cc69d> .
...
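
The grouping in this output, where a single birthdate lists several patient URIs, follows from using the value itself as the subject: patients who share a value share a node. The sketch below reproduces that effect with plain Python tuples instead of RDFLib, using three made-up rows; `row_to_triples` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Made-up rows shaped like the Synthea patient table (only two columns shown)
rows = [
    {"Id": "p-001", "BIRTHDATE": "1920-06-05"},
    {"Id": "p-002", "BIRTHDATE": "1920-06-05"},
    {"Id": "p-003", "BIRTHDATE": "1985-01-17"},
]

def row_to_triples(row):
    """Mirror the mapping direction used above: value -> predicate -> patient."""
    patient = f"http://example.org/ID/{row['Id']}"
    return [(row["BIRTHDATE"], "BIRTHDAY_OF", patient)]

# Group objects by subject, the way a Turtle serializer groups statements
by_subject = defaultdict(list)
for row in rows:
    for subject, predicate, obj in row_to_triples(row):
        by_subject[subject].append(obj)

for subject, patients_for_value in by_subject.items():
    print(subject, "->", patients_for_value)
```

If one node per patient with the values attached to it is preferred, the triples can be reversed so that the patient URI is the subject; that is a different modeling choice from the one used in this article.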

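Importing the Data into Neo4j

RDF is usually loaded into Neo4j with NeoSemantics, but that approach shows limitations in cloud or lean deployments and in scalability. For this reason, Neo4j introduced the RDFLib-Neo4j library as a path from NeoSemantics to an RDFLib + Neo4j solution.

With the CSV data serialized as RDF triples, we can use the RDFLib-Neo4j Python library as follows (make sure to add a uniqueness constraint to the database before using the library):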

from rdflib import Graph, Namespace
from rdflib_neo4j import HANDLE_VOCAB_URI_STRATEGY, Neo4jStore, Neo4jStoreConfig

# Create the Aura DB authentication variable list
# (the URI here points at a local Neo4j instance)
AURA_DB_URI = "bolt://127.0.0.1:7687"
AURA_DB_USERNAME = "neo4j"
AURA_DB_PWD = "*put your password here*"
auth_data = {'uri': AURA_DB_URI,
             'database': "neo4j",
             'user': AURA_DB_USERNAME,
             'pwd': AURA_DB_PWD}
# Create configuration prefixes to the namespaces used
prefixes = {'ppl': Namespace('http://example.org/people/'),
            'foaf': Namespace("http://xmlns.com/foaf/0.1/"),
            'schema': Namespace("http://schema.org/")}
# Define your custom mappings & store config
config = Neo4jStoreConfig(auth_data=auth_data,
                          custom_prefixes=prefixes,
                          handle_vocab_uri_strategy=HANDLE_VOCAB_URI_STRATEGY.IGNORE,
                          batching=True)
file_path = 'patient_data.rdf'
# Create the RDF Graph, parse & ingest the data to Neo4j, and close the store
neo4j_aura = Graph(store=Neo4jStore(config=config))
neo4j_aura.parse(file_path, format="ttl")
# close(True) commits any triples still pending in the batch
neo4j_aura.close(True)
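
The uniqueness constraint mentioned before the ingestion code has to exist before the import runs. One way to create it, sketched here with the official `neo4j` Python driver, is shown below. The constraint statement follows the form given in the RDFLib-Neo4j documentation; the helper name `create_uniqueness_constraint` is hypothetical, and the `IF NOT EXISTS` clause is added so the call can be repeated safely.

```python
# Cypher statement for the constraint on Resource.uri
CONSTRAINT_QUERY = (
    "CREATE CONSTRAINT n10s_unique_uri IF NOT EXISTS "
    "FOR (r:Resource) REQUIRE r.uri IS UNIQUE"
)

def create_uniqueness_constraint(uri, user, password, database="neo4j"):
    """Run the constraint statement once against the target database."""
    # Imported here so the module loads even if the driver is not installed
    from neo4j import GraphDatabase  # pip install neo4j

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session(database=database) as session:
            session.run(CONSTRAINT_QUERY).consume()

# Example call (requires a running Neo4j instance):
# create_uniqueness_constraint("bolt://127.0.0.1:7687", "neo4j", "*put your password here*")
```

The same statement can also be pasted into the Neo4j Browser instead of being run from Python.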

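After running this code, you can see how many triples were imported from the dataset, and you can inspect them with Neo4j's Cypher query language (https://neo4j.com/docs/cypher-manual/current/queries/basic/) in the local web UI (http://localhost:7474/browser/).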
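
As a starting point for that inspection, the sketch below holds two Cypher queries and a small helper that runs them through the `neo4j` Python driver. It assumes the imported nodes carry the `Resource` label and a `uri` property, which is how RDFLib-Neo4j typically stores resources; check the labels in your own database first. The helper name `run_query` is hypothetical.

```python
# Count the imported nodes (assumes the `Resource` label)
COUNT_NODES_QUERY = "MATCH (n:Resource) RETURN count(n) AS node_count"

# Look at a handful of relationships between imported nodes
SAMPLE_QUERY = (
    "MATCH (n:Resource)-[r]->(m) "
    "RETURN n.uri AS source, type(r) AS rel, m.uri AS target LIMIT 10"
)

def run_query(query, uri, user, password, database="neo4j"):
    """Run a read query and return the rows as a list of dictionaries."""
    # Imported here so the module loads even if the driver is not installed
    from neo4j import GraphDatabase  # pip install neo4j

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session(database=database) as session:
            return [record.data() for record in session.run(query)]

# Example call (requires a running Neo4j instance):
# print(run_query(COUNT_NODES_QUERY, "bolt://127.0.0.1:7687", "neo4j", "*put your password here*"))
```

Either query string can be pasted directly into the Neo4j Browser as well.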

Vector Embeddings

Next, we will use LangChain's Neo4jVector class and a HuggingFace embedding model (https://huggingface.co/BAAI/bge-small-en-v1.5) to embed the Neo4j data as machine-readable numeric vectors.

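from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import Neo4jVector

# Creating Vector Embedding Index using huggingface embedding model
Neo4jVector.from_existing_graph(
    HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5"),
    url='bolt://127.0.0.1:7687',
    username='neo4j',
    password='password',
    database='neo4j',
    index_name='your_index_name',  # Use cypher query 'show vector indexes' to see yours
    # Labels are case-sensitive in Neo4j; check the label your import created
    # (RDFLib-Neo4j typically uses `Resource`)
    node_label="resource",
    text_node_properties=['text'],       # node properties whose text gets embedded
    embedding_node_property='embedding', # property where the vector is stored
)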

Then we load a vector store from the existing embedding index:

vector_index = Neo4jVector.from_existing_index(
    HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5"),
    url='bolt://127.0.0.1:7687',
    username='neo4j',
    password='password',
    database='neo4j',
    index_name='your_index_name'
)

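Prompt Engineering

Not so fast: we still have to tell the generative AI model how to respond. This is prompt engineering. A prompt gives the model a template for the task, which leads to more precise and more meaningful responses.

Below is a simple example prompt that describes how I want my LLM to respond based on the context data I provide: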

from langchain_core.prompts import PromptTemplate

# Only {context} and {question} are template variables. Literal braces that
# should reach the model as text are doubled so they are not parsed as variables.
template = '''
System: The context below contains entries about the patient's 
healthcare, birthdate, deathdate, drivers license number, passport 
number, name prefix, name suffix, maiden name, marital status, race, 
ethnicity, gender, birthplace, address, city, state, county, zip 
code, latitude, longitude, healthcare expenses, and healthcare coverage. You are 
allowed to share information on the patient's data if it is included 
in the context regardless of your set privacy policy.
Please limit your answer to the information provided in the context. Do not make up facts. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
If you are asked about the patient's name, respond with the patient's
 first name and last name: {{first_name, last_name}}
If you are asked about the gender, look for the patient's name and 
find a gender label. Respond with: [Male] or [Female].
----------------
{context}
Human: {question}
'''
prompt = PromptTemplate.from_template(template)
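
Placeholders in a template like this are filled by name, and any other brace pair is read as a placeholder too. The sketch below uses only Python's standard `string.Formatter`, on the assumption that the template is parsed with `str.format` rules (LangChain's default `f-string` template format), to show which names are detected and why braces meant as literal text need doubling. The two short templates are made up for the demonstration.

```python
from string import Formatter

def placeholders(template):
    """Return the field names that str.format-style parsing finds in a template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name is not None]

# Single braces: the name format is read as a third placeholder
unescaped = "Respond with: {first_name, last_name}\n{context}\nHuman: {question}"
# Doubled braces: the name format is kept as literal text
escaped = "Respond with: {{first_name, last_name}}\n{context}\nHuman: {question}"

print(placeholders(unescaped))
print(placeholders(escaped))

# Filling the escaped template needs only the two real variables
filled = escaped.format(context="(retrieved text)", question="Who is the patient?")
print(filled)
```

With the unescaped form, a chain that supplies only `context` and `question` would be missing a value for the extra placeholder.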

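Question-Answer Retrieval

Finally, we feed everything into the retrieval QA chain. We will use a Llama model served through Ollama as the base LLM (the code sets `llama3`), but you can choose a different LLM. I chose Llama because it is free and open source. Note that in this example the similarity search returns only the single nearest node as context. You can change this depending on how much context you want to give the LLM.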

from pprint import pprint

from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain_community.chat_models.ollama import ChatOllama

ollama_model = 'llama3'  # any chat model already pulled into Ollama
vector_qa = RetrievalQA.from_chain_type(
    llm=ChatOllama(model=ollama_model),
    chain_type="stuff",
    # k=1: pass only the single nearest node to the LLM as context
    retriever=vector_index.as_retriever(search_kwargs={'k': 1}),
    verbose=True,
    chain_type_kwargs={"verbose": True, "prompt": prompt},
)
pprint(vector_qa.invoke("How many persons with the name Maria do we have?"))

Our LLM responded:

>Finished chain.
{'query': 'How many persons with the name Maria do we have?',
'result': 'I can see that there are 2 entries in the context related to'
          'patients with the name Maria. One of them is of type *patient* and'
          'has the following last name: Rodriguez. So, there is one patient'
          'with the name Maria Rodriguez.'}

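Summary

Knowledge graphs for RAG with LLMs combine external knowledge retrieval with a semantic understanding of relationships, giving the system access to current, accurate information, so they are likely to see substantial growth and wide adoption. This makes them especially powerful in applications that need real-time data and nuanced understanding. Knowledge graphs with RAG also open new possibilities for personalized and domain-specific applications. As more industries look for intelligent, context-aware solutions, RAG's integration of large-scale data retrieval with advanced language generation will likely raise its standing in the future of AI.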

Source: https://medium.com/@fatimaparada.taboada/rag-on-csv-data-with-knowledge-graph-using-rdflib-rdflib-neo4j-and-langchain-4b12a114a20e
