英文

用于英语语音识别的Fine-tuned wav2vec2大模型

在英语上使用训练集和验证集Fine-tuned facebook/wav2vec2-large 模型。使用此模型时,请确保语音输入采样率为16kHz。

这个模型是通过 OVHcloud 慷慨提供的GPU积分进行Fine-tuned的

用于训练的脚本可以在这里找到: https://github.com/jonatasgrosman/wav2vec2-sprint

使用方法

该模型可以直接使用(不带语言模型),如下所示...

使用 HuggingSound 库:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-english")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

编写自己的推理脚本:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
Reference Prediction
"SHE'LL BE ALL RIGHT." SHELL BE ALL RIGHT
SIX SIX
"ALL'S WELL THAT ENDS WELL." ALLAS WELL THAT ENDS WELL
DO YOU MEAN IT? W MEAN IT
THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS. THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESTION
HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE? HOW IS MOSILLA GOING TO BANDL AND BE WHIT IS LIKE QU AND QU
"I GUESS YOU MUST THINK I'M KINDA BATTY." RUSTION AS HAME AK AN THE POT
NO ONE NEAR THE REMOTE MACHINE YOU COULD RING? NO ONE NEAR THE REMOTE MACHINE YOU COULD RING
SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER. SAUCE FOR THE GUCE IS SAUCE FOR THE GONDER
GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD. GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD

评估

可以按照以下方法在Common Voice的英语(en)测试数据上评估模型。

import torch
import re
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-english"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")

测试结果:

在下表中,我报告了模型的词错误率(WER)和字符错误率(CER)。我还对其他模型运行了上述评估脚本(于2021年06月17日)。请注意,下表中显示的结果可能与已报告的结果不同,这可能是由于使用的其他评估脚本的某些特定性导致的。

Model WER CER
jonatasgrosman/wav2vec2-large-xlsr-53-english 18.98% 8.29%
jonatasgrosman/wav2vec2-large-english 21.53% 9.66%
facebook/wav2vec2-large-960h-lv60-self 22.03% 10.39%
facebook/wav2vec2-large-960h-lv60 23.97% 11.14%
boris/xlsr-en-punctuation 29.10% 10.75%
facebook/wav2vec2-large-960h 32.79% 16.03%
facebook/wav2vec2-base-960h 39.86% 19.89%
facebook/wav2vec2-base-100h 51.06% 25.06%
elgeish/wav2vec2-large-lv60-timit-asr 59.96% 34.28%
facebook/wav2vec2-base-10k-voxpopuli-ft-en 66.41% 36.76%
elgeish/wav2vec2-base-timit-asr 68.78% 36.81%

引用

如果您要引用此模型,可以使用以下引用:

@misc{grosman2021wav2vec2-large-english,
  title={Fine-tuned wav2vec2 large model for speech recognition in {E}nglish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-english}},
  year={2021}
}