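Model:
asapp/sew-d-tiny-100k-ft-ls100h
A base model pretrained on speech audio sampled at 16 kHz. When using the model, make sure your speech input is also sampled at 16 kHz. Note that this model should be fine-tuned on a downstream task such as automatic speech recognition, speaker identification, intent classification, or emotion recognition.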
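If your audio was recorded at a different rate, it has to be resampled to 16 kHz before it reaches the processor. The following is a minimal, dependency-free sketch of that step using linear interpolation. The function name `resample_linear` is hypothetical and not part of any library, and linear interpolation applies no anti-aliasing filter, so in practice a band-limited resampler from an audio library is likely the better choice.

```python
def resample_linear(samples, orig_sr, target_sr=16000):
    """Resample a mono signal to target_sr by linear interpolation.

    Illustrative only: no low-pass filtering is applied, so downsampling
    can alias. `samples` is any sequence of floats.
    """
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_sr == target_sr or len(samples) == 0:
        return list(samples)

    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    step = orig_sr / target_sr  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)   # left neighbour, clamped to the signal
        hi = min(lo + 1, last)     # right neighbour, clamped to the signal
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Halve the rate of a short ramp: every second sample is kept.
downsampled = resample_linear([0.0, 1.0, 2.0, 3.0], orig_sr=32000)
```

The resulting list can be passed to the processor together with `sampling_rate=16000`.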
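Paper: Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition
Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi
Abstract: This paper studies performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0 and formalize several architecture designs that influence both model performance and efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.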
The original model can be found at https://github.com/asappresearch/sew#model-checkpoints.
To transcribe audio files, the model can be used as a standalone acoustic model as follows:
from transformers import Wav2Vec2Processor, SEWDForCTC
from datasets import load_dataset
import soundfile as sf
import torch
# load the model and preprocessor
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-tiny-100k-ft-ls100h")
model = SEWDForCTC.from_pretrained("asapp/sew-d-tiny-100k-ft-ls100h")
# load the dummy dataset with speech samples
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
# preprocess
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_values  # Batch size 1
# retrieve logits
logits = model(input_values).logits
# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
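The "take argmax and decode" step above is greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that rule on plain Python lists. The toy vocabulary, the blank id `0`, and the `|` word delimiter are assumptions made for illustration; the real mapping comes from the model's tokenizer, and `ctc_greedy_decode` is a hypothetical helper, not the library's implementation.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame argmax ids into text using the greedy CTC rule."""
    # 1) Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) Remove the blank token, which only separates repeated characters.
    tokens = [id_to_token[t] for t in collapsed if t != blank_id]
    # 3) Join characters and turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary (assumption, not the model's real vocabulary).
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "L", 5: "O", 6: "E"}
# Per-frame argmax ids; the blank between the two 4s keeps the double "L".
frames = [2, 2, 0, 6, 4, 4, 0, 4, 5, 5, 1, 1, 2, 3, 0]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)  # HELLO HI
```

Without the blank between the two runs of `4`, the repeated "L" would be merged into one, which is why the blank token is removed only after the collapse.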
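This code snippet shows how to evaluate asapp/sew-d-tiny-100k-ft-ls100h on LibriSpeech's "clean" test data. To evaluate on "other", load the "other" configuration of the dataset instead.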
from datasets import load_dataset
from transformers import SEWDForCTC, Wav2Vec2Processor
import torch
from jiwer import wer
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
model = SEWDForCTC.from_pretrained("asapp/sew-d-tiny-100k-ft-ls100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-tiny-100k-ft-ls100h")
def map_to_pred(batch):
    input_values = processor(batch["audio"][0]["array"], sampling_rate=16000,
                             return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch
result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))
Result (WER):
| "clean" | "other" |
|---|---|
| 10.47 | 22.73 |
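For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between reference and hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition, not jiwer's implementation; the function names are hypothetical. It returns a fraction, whereas the figures in the table above appear to be percentages.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (rolling-row DP)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word errors / total reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
example_wer = word_error_rate(["the cat sat on the mat"], ["the cat sit on mat"])
```

Multiplying the returned fraction by 100 gives a percentage comparable to the table.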