模型:
facebook/spar-wiki-bm25-lexmodel-query-encoder
这个模型是SPAR论文中的Wiki BM25词汇模型(Λ)的查询编码器。
Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One? Xilun Chen, Kushal Lakhotia, Barlas Oğuz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta和Wen-tau Yih Meta AI
相关的GitHub仓库可以在这里找到: https://github.com/facebookresearch/dpr-scale/tree/main/spar
这个模型是一个基于BERT-base大小的密集检索器,它通过模仿BM25的行为来训练维基百科文章。还有以下模型可用:
| Pretrained Model | Corpus | Teacher | Architecture | Query Encoder Path | Context Encoder Path |
|---|---|---|---|---|---|
| Wiki BM25 Λ | Wikipedia | BM25 | BERT-base | facebook/spar-wiki-bm25-lexmodel-query-encoder | facebook/spar-wiki-bm25-lexmodel-context-encoder |
| PAQ BM25 Λ | PAQ | BM25 | BERT-base | facebook/spar-paq-bm25-lexmodel-query-encoder | facebook/spar-paq-bm25-lexmodel-context-encoder |
| MARCO BM25 Λ | MS MARCO | BM25 | BERT-base | facebook/spar-marco-bm25-lexmodel-query-encoder | facebook/spar-marco-bm25-lexmodel-context-encoder |
| MARCO UniCOIL Λ | MS MARCO | UniCOIL | BERT-base | facebook/spar-marco-unicoil-lexmodel-query-encoder | facebook/spar-marco-unicoil-lexmodel-context-encoder |
这个模型应该与相关的上下文编码器一起使用,类似于 DPR 模型。
import torch
from transformers import AutoTokenizer, AutoModel
# The tokenizer is the same for the query and context encoder
tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')
query = "Where was Marie Curie born?"
contexts = [
"Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
"Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]
# Apply tokenizer
query_input = tokenizer(query, return_tensors='pt')
ctx_input = tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
# Compute embeddings: take the last-layer hidden state of the [CLS] token
query_emb = query_encoder(**query_input).last_hidden_state[:, 0, :]
ctx_emb = context_encoder(**ctx_input).last_hidden_state[:, 0, :]
# Compute similarity scores using dot product
score1 = query_emb @ ctx_emb[0] # 341.3268
score2 = query_emb @ ctx_emb[1] # 340.1626
由于Λ从一个稀疏的教师检索器中学习词汇匹配,它可以与标准的密集检索器(例如 DPR , Contriever )结合使用,以构建同时擅长词汇匹配和语义匹配的密集检索器。
在下面的示例中,我们展示了如何通过连接DPR和维基BM25 Λ的嵌入来构建开放域问答的SPAR-Wiki模型。
import torch
from transformers import AutoTokenizer, AutoModel
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
# DPR model
dpr_ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
dpr_query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")
# Wiki BM25 Λ model
lexmodel_tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')
query = "Where was Marie Curie born?"
contexts = [
"Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
"Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]
# Compute DPR embeddings
dpr_query_input = dpr_query_tokenizer(query, return_tensors='pt')['input_ids']
dpr_query_emb = dpr_query_encoder(dpr_query_input).pooler_output
dpr_ctx_input = dpr_ctx_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
dpr_ctx_emb = dpr_ctx_encoder(**dpr_ctx_input).pooler_output
# Compute Λ embeddings
lexmodel_query_input = lexmodel_tokenizer(query, return_tensors='pt')
lexmodel_query_emb = lexmodel_query_encoder(**query_input).last_hidden_state[:, 0, :]
lexmodel_ctx_input = lexmodel_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
lexmodel_ctx_emb = lexmodel_context_encoder(**ctx_input).last_hidden_state[:, 0, :]
# Form SPAR embeddings via concatenation
# The concatenation weight is only applied to query embeddings
# Refer to the SPAR paper for details
concat_weight = 0.7
spar_query_emb = torch.cat(
[dpr_query_emb, concat_weight * lexmodel_query_emb],
dim=-1,
)
spar_ctx_emb = torch.cat(
[dpr_ctx_emb, lexmodel_ctx_emb],
dim=-1,
)
# Compute similarity scores
score1 = spar_query_emb @ spar_ctx_emb[0] # 317.6931
score2 = spar_query_emb @ spar_ctx_emb[1] # 314.6144