Model:

microsoft/BiomedVLP-BioViL-T

Task:

Feature extraction

Libraries:

PyTorch Transformers

Language:

Other:

bert custom_code exbert

Preprint:

arxiv:2301.04558 arxiv:2204.09817

License:

mit

BioViL-T

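BioViL-T is a domain-specific vision-language model designed to analyze chest X-rays (CXRs) and radiology reports. It was trained with a temporal multi-modal pre-training procedure, which distinguishes it from its predecessor model (BioViL). Specifically, BioViL-T exploits the temporal structure between data points, which improves downstream performance on multiple benchmarks while using the same training dataset as its predecessor. In particular, the resulting model shows significant improvement in embedding the temporal information present in the image and text modalities (see results), as well as in the joint space. The canonical model can be adapted to both single-image and multi-image downstream applications, including natural language inference, phrase grounding, image/text classification, and language decoding.

The corresponding BERT language model is trained in two stages. First, we pretrain CXR-BERT-general from a randomly initialized BERT model via masked language modeling (MLM) on abstracts and clinical notes from the publicly available PubMed and MIMIC-III. The general model can be fine-tuned for research in other clinical domains by adjusting the parameters specific to the target domain. In the second stage, BioViL-T is continually pretrained from CXR-BERT-general using radiology reports and sequences of chest X-rays. We use the latent representation of the [CLS] token to align text and image embeddings.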
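
The first training stage above relies on masked language modeling. As a rough illustration of what that objective asks of the model, the sketch below corrupts a token sequence in plain Python. It assumes the masking recipe of the original BERT paper (15% of positions selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged). The model card does not state which settings were used for CXR-BERT-general, and the function name and toy vocabulary are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Corrupt a token list for MLM and return (corrupted_tokens, labels).

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would be computed only on the selected positions.
    The 15% / 80-10-10 split is an assumption taken from the original BERT
    recipe, not a documented BioViL-T setting.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        # Special tokens such as [CLS] and [SEP] are never selected.
        is_special = token.startswith("[") and token.endswith("]")
        if is_special or rng.random() >= mask_prob:
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)          # replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # replace with a random token
        else:
            corrupted.append(token)               # keep the original token
    return corrupted, labels


if __name__ == "__main__":
    sentence = ["[CLS]", "no", "pleural", "effusion", "is", "seen", "[SEP]"]
    toy_vocab = ["lung", "heart", "opacity", "stable"]  # hypothetical vocabulary
    print(mask_tokens(sentence, toy_vocab, seed=0))
```

During pretraining the model would be asked to recover the original token at every position where the label is not None.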

Language model variants

Model	Model identifier on HuggingFace	Vocabulary	Note
CXR-BERT-general	1238321	PubMed & MIMIC	Pretrained for biomedical literature and clinical domains
CXR-BERT-specialized	1239321	PubMed & MIMIC	Static pretraining for the CXR domain
BioViL-T	microsoft/BiomedVLP-BioViL-T	PubMed & MIMIC	Static & temporal pretraining for the CXR domain

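Image model

The image model is trained jointly with the text model in a multi-modal contrastive learning framework. It is a hybrid image encoder composed of a Vision Transformer and a ResNet-50, where the latter serves as the backbone network that extracts features from the image at each time point. The transformer is included in the design to aggregate and compare the image features extracted across the temporal dimension. The corresponding model definition and its loading functions can be accessed through our HI-ML-Multimodal GitHub repository. The joint image and text model, namely BioViL-T, can be used in phrase grounding applications, as shown in this Python notebook example. Additionally, please see the MS-CXR benchmark for a more systematic evaluation of joint image and text models in phrase grounding tasks.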
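
The paragraph above says the image and text models are trained jointly in a multi-modal contrastive framework. The sketch below shows one common form of such an objective, a symmetric InfoNCE-style loss, in plain Python. It is not taken from the BioViL-T code base: the function names, the temperature value, and the toy vectors are assumptions made for illustration, and the actual training objective may include further terms.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.5):
    """Average of image-to-text and text-to-image cross-entropy.

    image_emb and text_emb are lists of L2-normalised vectors where
    image_emb[i] and text_emb[i] form a matching pair. The temperature
    is a hypothetical value chosen for this sketch.
    """
    n = len(image_emb)
    # logits[i][j] compares image i with text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    # Image -> text: each row should put its mass on the diagonal entry.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: same idea applied to the columns.
    columns = [list(col) for col in zip(*logits)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matching_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(images, matching_texts))
    print(symmetric_contrastive_loss(images, swapped_texts))
```

The loss is small when each image embedding is closest to its own report embedding and grows when the pairs are mismatched, which is the property that pulls the two modalities into a shared space.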

Citation

The corresponding manuscript was accepted for presentation at the Conference on Computer Vision and Pattern Recognition (CVPR) 2023.

@misc{https://doi.org/10.48550/arXiv.2301.04558,
  doi = {10.48550/ARXIV.2301.04558},
  url = {https://arxiv.org/abs/2301.04558},
  author = {Bannur, Shruthi and Hyland, Stephanie and Liu, Qianchu and Perez-Garcia, Fernando and Ilse, Maximilian and Castro, Daniel C and Boecking, Benedikt and Sharma, Harshita and Bouzid, Kenza and Thieme, Anja and Schwaighofer, Anton and Wetscherek, Maria and Lungren, Matthew P and Nori, Aditya and Alvarez-Valle, Javier and Oktay, Ozan},
  title = {Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing},
  publisher = {arXiv},
  year = {2023},
}

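Model use

Intended use

This model is intended to be used solely for (I) future research on vision-language processing and (II) reproducibility of the experimental results reported in the reference paper.

Primary intended use

The primary intended use is to support AI researchers building on top of this work. CXR-BERT and its associated models should be helpful for exploring various clinical NLP and VLP research questions, especially in the radiology domain.

Out-of-scope use

Any deployed use case of the model, commercial or otherwise, is currently out of scope. Although we evaluated the models on a broad set of publicly available research benchmarks, the models and evaluations are not intended for deployed use cases. Under uncertain conditions, the models may make inaccurate predictions and display limitations, which may require additional mitigation strategies. We therefore discourage use of the model for automated diagnosis or in a medical device. Please refer to the associated paper for more details.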

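How to use

Here is how to use this model to extract radiological sentence embeddings and obtain their cosine similarity in the joint space (image and text):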

import torch
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer
url = "microsoft/BiomedVLP-BioViL-T"
tokenizer = AutoTokenizer.from_pretrained(url, trust_remote_code=True)
model = AutoModel.from_pretrained(url, trust_remote_code=True)

# Input text prompts describing findings.
# The order of prompts is adjusted to capture the spectrum from absence of a finding to its temporal progression.
text_prompts = ["No pleural effusion or pneumothorax is seen.",
                "There is no pneumothorax or pleural effusion.",
                "The extent of the pleural effusion is reduced.",
                "The extent of the pleural effusion remains constant.",
                "Interval enlargement of pleural effusion."]

# Tokenize and compute the sentence embeddings
with torch.no_grad():
    tokenizer_output = tokenizer.batch_encode_plus(batch_text_or_text_pairs=text_prompts,
                                                   add_special_tokens=True,
                                                   padding='longest',
                                                   return_tensors='pt')
    embeddings = model.get_projected_text_embeddings(input_ids=tokenizer_output.input_ids,
                                                     attention_mask=tokenizer_output.attention_mask)

    # Compute the cosine similarity of sentence embeddings obtained from input text prompts.
    sim = torch.mm(embeddings, embeddings.t())
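
The matrix product in the snippet above equals cosine similarity only if the projected embeddings have unit length. The dependency-free sketch below makes that normalisation step explicit and shows how the resulting matrix can be read. The toy 3-D vectors and the helper names are hypothetical and stand in for real sentence embeddings; with PyTorch, the same effect could be obtained by applying `torch.nn.functional.normalize` before the matrix product.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity: dot products of L2-normalised rows."""
    unit = [l2_normalize(v) for v in embeddings]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def nearest_neighbour(sim, index):
    """Index of the row most similar to `index`, excluding itself."""
    candidates = [(score, j) for j, score in enumerate(sim[index]) if j != index]
    return max(candidates)[1]


if __name__ == "__main__":
    # Hypothetical embeddings: rows 0 and 1 play the role of two paraphrases,
    # row 2 a sentence with a different meaning.
    toy_embeddings = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 1.0, 0.3]]
    sim = cosine_similarity_matrix(toy_embeddings)
    for row in sim:
        print([round(value, 3) for value in row])
    print("closest to sentence 0:", nearest_neighbour(sim, 0))
```

Applied to the five prompts above, one would expect the two "no pleural effusion" paraphrases to score close to 1 with each other and lower against the prompts that describe a change over time.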

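Data

This model builds upon existing publicly available datasets:

These datasets span a wide range of sources, from biomedical abstracts to intensive care unit notes to chest X-ray radiology notes.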

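Performance

By leveraging semantics and discourse characteristics more efficiently at training time, the model achieves state-of-the-art results in radiology natural language inference. Experiments were performed on the RadNLI and MS-CXR-T benchmarks, which measure the quality of text embeddings in terms of static and temporal semantics. BioViL-T is benchmarked against other commonly used SOTA domain-specific BERT models, including PubMedBERT and CXR-BERT. The results below show that BioViL-T increases the sensitivity of sentence embeddings to temporal content (MS-CXR-T) while better capturing the static content (RadNLI).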

Model	MS-CXR-T Accuracy	MS-CXR-T ROC-AUC	RadNLI (2 classes) Accuracy	RadNLI (2 classes) ROC-AUC
12322321	60.39	.542	81.38	.727
12323321	62.60	.601	87.59	.902
12324321	78.12	.837	89.66	.932
BioViL-T	87.77	.933	90.52	.947
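
For readers less familiar with the two metrics in the table above, the following sketch computes accuracy and ROC-AUC for a binary task in plain Python. It uses the standard pairwise definition of ROC-AUC (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half). The labels and scores are made-up toy values, not benchmark data, and the evaluation code used for the paper may differ.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy example: 1 = sentence pair labelled as a match, 0 = not a match.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]  # e.g. similarity scores
    toy_predictions = [1 if s >= 0.5 else 0 for s in toy_scores]
    print("accuracy:", accuracy(toy_labels, toy_predictions))
    print("ROC-AUC:", roc_auc(toy_labels, toy_scores))
```

Accuracy depends on the chosen decision threshold, whereas ROC-AUC summarises ranking quality across all thresholds, which is why the table reports both.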

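The new pretraining framework also yields better vision-language representations. Below are the zero-shot phrase grounding results obtained on the MS-CXR benchmark dataset, which evaluates the quality of image-text latent representations.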

Vision–Language Pretraining Method	MS-CXR Phrase Grounding (Avg. CNR Score)	MS-CXR Phrase Grounding (mIoU)
BioViL	1.07 +- 0.04	0.229 +- 0.005
BioViL-L	1.21 +- 0.05	0.202 +- 0.010
BioViL-T	1.33 +- 0.04	0.240 +- 0.005
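
The two phrase grounding metrics in the table above can be sketched as follows. Intersection over union (IoU) compares a predicted region with the annotated region, and mIoU is its mean over examples. For the contrast-to-noise ratio (CNR), the sketch assumes the definition |mean_inside - mean_outside| / sqrt(var_inside + var_outside) computed on the similarity heatmap inside and outside the annotated region; treat that formula, the helper names, and the toy values as assumptions rather than the benchmark's reference implementation.

```python
import math
import statistics


def intersection_over_union(pred_mask, true_mask):
    """IoU of two binary masks given as equally sized 2-D lists."""
    intersection = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    return intersection / union if union else 0.0


def contrast_to_noise_ratio(heatmap, true_mask):
    """CNR between heatmap values inside and outside the annotated region."""
    inside, outside = [], []
    for heat_row, mask_row in zip(heatmap, true_mask):
        for value, inside_region in zip(heat_row, mask_row):
            (inside if inside_region else outside).append(value)
    contrast = abs(statistics.fmean(inside) - statistics.fmean(outside))
    noise = math.sqrt(statistics.pvariance(inside) + statistics.pvariance(outside))
    return contrast / noise if noise else math.inf


if __name__ == "__main__":
    # Toy 2x3 example: the annotated finding covers the two top-left cells.
    true_mask = [[1, 1, 0], [0, 0, 0]]
    heatmap = [[0.9, 0.7, 0.1], [0.2, 0.1, 0.0]]
    # Hypothetical threshold used to binarise the heatmap for IoU.
    pred_mask = [[int(v >= 0.8) for v in row] for row in heatmap]
    print("IoU:", intersection_over_union(pred_mask, true_mask))
    print("CNR:", round(contrast_to_noise_ratio(heatmap, true_mask), 3))
```

IoU requires a threshold to turn the heatmap into a region, while CNR is threshold-free, so the two numbers capture complementary aspects of grounding quality.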

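Further experimental results and discussion can be found in the corresponding paper, "Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing", CVPR'23.

Limitations

This model was developed using English corpora and can therefore be considered English-only.

The training dataset contains only medical images and reports acquired from an intensive care unit (ICU), where longitudinal images are often collected within hours or at most a few days of each other. As a result, the model may show reduced performance when analyzing consecutive images acquired over longer periods (e.g. years), where significant anatomical variation is observed between scans.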