Streamline Computer Vision Workflows with Hugging Face Transformers and FiftyOne

Published March 19, 2024 by alex

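Apply Transformer models directly to your computer vision datasets

Transformer models may have started out in language modeling, but over the past few years the Vision Transformer (ViT) has become an essential tool in the computer vision toolbox. Whether you are working on traditional vision tasks such as image classification and semantic segmentation or on newer zero-shot tasks, Transformer models are either competitive or leading the state of the art. Hugging Face's Transformers library makes it remarkably easy to load, apply, and manipulate these models.

Now, with the integration between Hugging Face Transformers and FiftyOne, the open source library for data curation and visualization, it is easier than ever to bring Transformer models directly into your computer vision workflows.

In this post, we'll show you how to seamlessly connect your visual data with Transformer models.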

Setup

To follow along, you will need to install Hugging Face's transformers library, Voxel51's fiftyone library, and `torch` and `torchvision`:

pip install -U torch torchvision transformers fiftyone

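What is FiftyOne?

FiftyOne is a leading open source library for computer vision data curation and visualization. Its core data structure is the fiftyone.Dataset, which logically represents the metadata, labels, and any other information associated with media files such as images, videos, and point clouds.

You can load a dataset directly from the FiftyOne Dataset Zoo, or load your own data, with built-in support for loading from a directory, a glob pattern, or common formats such as COCO.

In this walkthrough, we'll use the Quickstart dataset, a subset of the COCO 2017 validation split: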

import fiftyone as fo
import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset("quickstart")
## just keep the ground truth labels
dataset.delete_sample_field("predictions")

Once you have a fiftyone.Dataset, you can filter it using pandas-like syntax.

You can also visualize and visually inspect your data in the FiftyOne App:

session = fo.launch_app(dataset)

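Why use FiftyOne? FiftyOne is purpose-built for computer vision. It brings all of your labels, features, and associated information together in one place, so you can make like-for-like comparisons, stay organized, and treat your data as a living, breathing object!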

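Overview of the Transformers Integration

With the integration between fiftyone and Hugging Face transformers, you can apply Transformer models directly to your data, whether that is an entire dataset or any filtered subset you choose, without writing any custom code.

The integration also supports computing embeddings directly, and using Transformer models directly in any downstream application that relies on embeddings, such as dimensionality reduction visualizations and semantic/similarity search.

For computing and using embeddings, the integration supports all image classification and object detection models that expose a last_hidden_state attribute, as well as all zero-shot image classification and object detection models that expose image features via get_image_features().

For semantic similarity search, only zero-shot classification and detection models that expose both text and image features are supported.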

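Inference with Transformer Models

In FiftyOne, sample collections (fiftyone.Dataset and fiftyone.DatasetView instances) have an apply_model() method that takes a model as input. This model can be any model from the FiftyOne Model Zoo, any fiftyone.Model instance, or a Hugging Face transformers model!

Traditional Image Inference Tasks

For image classification, you can load a model from the Transformers library either with the constructor for a specific architecture or with AutoModelForImageClassification, using from_pretrained() to specify the checkpoint in both cases. Taking BEiT as an example: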

## option 1
from transformers import BeitForImageClassification
model = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224"
)
## option 2
from transformers import AutoModelForImageClassification
model = AutoModelForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224"
)

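Once the model is loaded, you can apply it directly to your dataset, using the label_field argument to specify the name of the field in which the classification labels are stored: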

dataset.apply_model(model, label_field="beit-base", batch_size=16)
session = fo.launch_app(dataset)

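Object detection, semantic segmentation, and depth estimation tasks work similarly. For object detection, instantiate a model with AutoModelForObjectDetection or a specific architecture's constructor, and apply it with the same syntax: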

from transformers import DetrForObjectDetection
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
 
dataset.apply_model(model, label_field="detr")
session = fo.launch_app(dataset)

For semantic segmentation, you can load and apply any model with ForInstanceSegmentation or ForUniversalSegmentation in its constructor name, as long as the model's image processor has a post_process_semantic_segmentation() method.

For monocular depth estimation, you can load and apply any model with ForDepthEstimation in its constructor name. For example, to use DPT:

from transformers import DPTForDepthEstimation
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
 
dataset.apply_model(model, label_field="dpt_large")
session = fo.launch_app(dataset)

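Once predictions are generated, you can filter by label class and prediction confidence in the App, or by arbitrary attributes in Python. For example, to keep only the bounding boxes that cover less than 1/4 of the image: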

from fiftyone import ViewField as F
bbox_filter = F("bounding_box")[2] * F("bounding_box")[3] < 0.25
small_bbox_view = dataset.filter_labels("detr", bbox_filter, only_matches=True)
session = fo.launch_app(small_bbox_view)
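
To make the filter expression above concrete: FiftyOne stores each detection's bounding_box as [top-left-x, top-left-y, width, height] in coordinates relative to the image, so the product of elements 2 and 3 is the fraction of the image that the box covers. The plain-Python sketch below applies the same test to a few boxes. The helper name covers_less_than and the sample boxes are made up for illustration and are not part of FiftyOne.

```python
def covers_less_than(bounding_box, max_fraction=0.25):
    """Return True if a relative [x, y, width, height] box covers
    less than max_fraction of the image area."""
    _, _, width, height = bounding_box
    return width * height < max_fraction


# Hypothetical boxes in relative [x, y, width, height] format
boxes = [
    [0.10, 0.20, 0.30, 0.40],  # covers 12% of the image
    [0.00, 0.00, 0.80, 0.90],  # covers 72% of the image
    [0.25, 0.25, 0.50, 0.50],  # covers exactly 25% of the image
]

# Keep only the boxes that pass the strict "< 0.25" test
small_boxes = [box for box in boxes if covers_less_than(box)]
```

Because the comparison is strict, a box that covers exactly a quarter of the image is excluded, as it would be by the ViewField expression.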

You can evaluate the predictions for any of these tasks numerically using FiftyOne's Evaluation API.
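
Detection evaluation generally depends on matching predicted boxes to ground truth boxes by intersection over union (IoU). The plain-Python sketch below computes IoU for two boxes in FiftyOne's relative [x, y, width, height] format. It illustrates the metric only, and is not the Evaluation API's implementation; the function name iou is an assumption made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two relative [x, y, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# Two equally sized boxes, the second shifted right by half a box width
overlap = iou([0.0, 0.0, 0.5, 0.5], [0.25, 0.0, 0.5, 0.5])
```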

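Zero-Shot Inference Tasks

For zero-shot tasks, the recommended approach is to load Hugging Face models from the FiftyOne Model Zoo. A Transformer model for zero-shot image classification can be loaded with load_zoo_model(), passing "zero-shot-classification-transformer-torch" as the model type (the first argument) and then name_or_path=<hf-name-or-path>. You can pass the list of classes when the model is initialized, or set the model's classes later.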

import fiftyone.zoo as foz
model_type = "zero-shot-classification-transformer-torch"
name_or_path = "BAAI/AltCLIP" ## <- load AltCLIP
classes = ["cat", "dog", "bird", "fish", "turtle"] ## can override at any time
model = foz.load_zoo_model(
    model_type,
    name_or_path=name_or_path,
    classes=classes
)

You can then use apply_model() to apply the model to your images, just as in the traditional image classification setting.
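
Conceptually, a CLIP-style zero-shot classifier embeds the image and a text prompt for each class into a shared space, scores each class by the similarity between the two embeddings, and turns the scores into probabilities with a softmax. The sketch below illustrates that scoring step in plain Python. The three-dimensional vectors are made-up stand-ins for real model embeddings, and the function names and the scale value are assumptions for illustration, not part of FiftyOne or Transformers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, class_embeddings, scale=10.0):
    """Return (best_class, probabilities) from scaled similarities and a softmax."""
    classes = list(class_embeddings)
    logits = [
        scale * cosine_similarity(image_embedding, class_embeddings[name])
        for name in classes
    ]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    probabilities = {name: e / total for name, e in zip(classes, exps)}
    return max(probabilities, key=probabilities.get), probabilities


# Toy text embeddings, one per class (made up; a real model would produce these)
class_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "bird": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]  # toy image embedding

label, probabilities = zero_shot_classify(image_embedding, class_embeddings)
```

Changing the class list only changes which text embeddings are compared against, which is why the classes can be overridden at any time without retraining.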

Zero-shot object detection works the same way, but with the model type "zero-shot-detection-transformer-torch":

import fiftyone.zoo as foz

model_type = "zero-shot-detection-transformer-torch"
name_or_path = "google/owlvit-base-patch32" ## <- load OWL-ViT, a zero-shot detection checkpoint
classes = ["cat", "dog", "bird", "fish", "turtle"] ## can override at any time

model = foz.load_zoo_model(
    model_type,
    name_or_path=name_or_path,
    classes=classes
)

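Video Inference Tasks

One of the coolest things about this integration is that it preserves the flexibility inherent to FiftyOne datasets and Hugging Face transformers models. Without any additional work, you can apply any of the models from the image tasks above to the frames of a video dataset, and it just works!

This is all the code needed to apply YOLOS from the Transformers library to a video dataset: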

import fiftyone.zoo as foz
## load video dataset
video_dataset = foz.load_zoo_dataset("quickstart-video")
## load YOLOS model
from transformers import YolosForObjectDetection
model = YolosForObjectDetection.from_pretrained("hustvl/yolos-tiny")
## apply model
video_dataset.apply_model(model, label_field="yolos", batch_size=16)
## visualize the results
session = fo.launch_app(video_dataset)

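Working with Transformer Embeddings

Image and Patch Embeddings

Just as a Hugging Face model can be passed directly to a FiftyOne sample collection's apply_model() method for inference, a Transformer model can be passed directly to the collection's compute_embeddings() method. For example, the following computes embeddings for all images with a BEiT model and stores them in the "beit_embeddings" field on the samples: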

from transformers import BeitForImageClassification
model = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224"
)
dataset.compute_embeddings(model, embeddings_field="beit_embeddings", batch_size=16)

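You can also use compute_patch_embeddings() to compute and store embeddings for each object patch in a given label field of the dataset. For example, to compute embeddings for the ground truth object patches with CLIP: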

from transformers import CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
dataset.compute_patch_embeddings(
    model,
    patches_field="ground_truth",
    embeddings_field="clip_embeddings"
)

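Embedding Visualization with Dimensionality Reduction

Because Hugging Face Transformer models plug into FiftyOne datasets for embedding computation, they are also directly usable in dataset-wide computations that rely on embeddings. One such application is dimensionality reduction. By embedding our images (or patches) and then reducing the embeddings to 2D with t-SNE, UMAP, or PCA, we can visually inspect hidden structure in our data and interact with the data in new ways.

In FiftyOne, dimensionality reduction is performed with the FiftyOne Brain's compute_visualization() method, which has built-in support for t-SNE, UMAP, and PCA.

Just pass into the method any Hugging Face transformers model that exposes image embeddings, via last_hidden_state or get_image_features(), along with a brain_key that specifies where to store the results and the dimensionality reduction technique to use: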

import fiftyone.brain as fob
from transformers import AltCLIPModel
model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
fob.compute_visualization(
    dataset,
    model=model,
    method="umap",
    brain_key="altclip_umap_vis"
)
session = fo.launch_app(dataset)

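You can then view the reduced embeddings alongside your samples in the App.

This is a great way to compare embedding models and dimensionality reduction techniques!

Searching by Similarity

Another dataset-level application of embeddings is indexing unstructured or semi-structured data. In FiftyOne, this is done with the FiftyOne Brain's compute_similarity() method, and Hugging Face transformers models plug directly into these workflows as well!

Just pass the Transformer model directly into a compute_similarity() call, and you can query your dataset to find similar images: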

import fiftyone.brain as fob
## load model
from transformers import AutoModel
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
fob.compute_similarity(dataset, model=model, brain_key="siglip_sim")
session = fo.launch_app(dataset)
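
At its core, a similarity query ranks the indexed embeddings by their closeness to a query embedding and returns the top k. The sketch below shows that ranking in plain Python using cosine similarity. The sample IDs and two-dimensional vectors are made up, the function names are hypothetical, and a real index would typically use an optimized backend rather than this linear scan.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(query, index, k=2):
    """Return the k sample IDs whose embeddings are most similar to query."""
    ranked = sorted(
        index,
        key=lambda sample_id: cosine_similarity(query, index[sample_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy index mapping sample IDs to embeddings (made up for illustration)
index = {
    "sample_a": [1.0, 0.0],
    "sample_b": [0.7, 0.7],
    "sample_c": [0.0, 1.0],
    "sample_d": [-1.0, 0.0],
}

nearest = top_k_similar([0.9, 0.1], index, k=2)
```

The query vector can come from another image in the dataset or, for a multimodal model, from a text prompt embedded into the same space.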

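You can also build a similarity index over the object patches in your dataset by passing the name of the field containing the object patches via the patches_field argument.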

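If you want to search your images semantically using natural language, you can use a multimodal Transformer model that exposes both image and text features. To enable natural language queries, pass the model type as the model argument, and pass the model's name_or_path via model_kwargs: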

import fiftyone.brain as fob
model_type = "zero-shot-classification-transformer-torch"
name_or_path = "openai/clip-vit-base-patch32" ## <- CLIP
model_kwargs = {"name_or_path": name_or_path}
fob.compute_similarity(
    dataset,
    model=model_type,
    model_kwargs=model_kwargs,
    brain_key="clip_sim"
)
session = fo.launch_app(dataset)

Then you can query with text in the App using the magnifying glass icon, or by passing a query string to the dataset's sort_by_similarity() method in Python:

kites_view = dataset.sort_by_similarity(
    "kites flying in the sky",
    k=25,
    brain_key="clip_sim"
)

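Conclusion

Transformer models have become a mainstay of computer vision and multimodal machine learning, and their influence only seems to be growing. With the variety and versatility of Transformer models at an all-time high, connecting these models seamlessly to computer vision datasets is critical.

Source: https://medium.com/voxel51/streamline-computer-vision-workflows-with-hugging-face-transformers-and-fiftyone-0b377d4ac745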
