
Fine-tuned XLSR-53 large model for speech recognition in Arabic

Fine-tuned facebook/wav2vec2-large-xlsr-53 on Arabic using the train and validation splits of Common Voice 6.1 and the Arabic Speech Corpus. When using this model, make sure that your speech input is sampled at 16 kHz.

This model was fine-tuned thanks to the GPU credits generously provided by OVHcloud.

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

Usage

The model can be used directly (without a language model) as follows...

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-arabic")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
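Each element returned by transcribe corresponds to one input path. A minimal sketch for inspecting the results is shown below; it assumes (as in the HuggingSound documentation) that each result is a dict with a "transcription" key, which may differ across library versions:

# Hypothetical inspection of the transcribe() output
# (assumes each result is a dict carrying a "transcription" field)
for path, result in zip(audio_paths, transcriptions):
    print(f"{path}: {result['transcription']}")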

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "ar"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
| Reference | Prediction |
| ------------- | ------------- |
| ألديك قلم ؟ | ألديك قلم |
| ليست هناك مسافة على هذه الأرض أبعد من يوم أمس. | ليست نالك مسافة على هذه الأرض أبعد من يوم الأمس م |
| إنك تكبر المشكلة. | إنك تكبر المشكلة |
| يرغب أن يلتقي بك. | يرغب أن يلتقي بك |
| إنهم لا يعرفون لماذا حتى. | إنهم لا يعرفون لماذا حتى |
| سيسعدني مساعدتك أي وقت تحب. | سيسئدنيمساعدتك أي وقد تحب |
| أَحَبُّ نظريّة علمية إليّ هي أن حلقات زحل مكونة بالكامل من الأمتعة المفقودة. | أحب نظرية علمية إلي هي أن حل قتزح المكوينا بالكامل من الأمت عن المفقودة |
| سأشتري له قلماً. | سأشتري له قلما |
| أين المشكلة ؟ | أين المشكل |
| وَلِلَّهِ يَسْجُدُ مَا فِي السَّمَاوَاتِ وَمَا فِي الْأَرْضِ مِنْ دَابَّةٍ وَالْمَلَائِكَةُ وَهُمْ لَا يَسْتَكْبِرُونَ | ولله يسجد ما في السماوات وما في الأرض من دابة والملائكة وهم لا يستكبرون |
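
If you only want to transcribe a couple of local files instead of pulling the Common Voice test split, a minimal sketch using the same processor/model calls as above could look like this (the file path below is just a placeholder):

import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load and resample the audio to the 16 kHz rate the model expects
speech, _ = librosa.load("/path/to/file.mp3", sr=16_000)  # placeholder path
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])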

Evaluation

The model can be evaluated as follows on the Arabic test data of Common Voice.

import torch
import re
import librosa
import warnings
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "ar"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "'", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running the model inference on the test set in batches
# and collecting the predicted transcriptions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")

Test Result:

In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I also ran the evaluation script above on other models (on 2021-05-14). Note that the table below may show slightly different results from those already reported; this may be caused by some specificities of the other evaluation scripts used.

| Model | WER | CER |
| ------------- | ------------- | ------------- |
| jonatasgrosman/wav2vec2-large-xlsr-53-arabic | 39.59% | 18.18% |
| bakrianoo/sinai-voice-ar-stt | 45.30% | 21.84% |
| othrif/wav2vec2-large-xlsr-arabic | 45.93% | 20.51% |
| kmfoda/wav2vec2-large-xlsr-arabic | 54.14% | 26.07% |
| mohammed/wav2vec2-large-xlsr-arabic | 56.11% | 26.79% |
| anas/wav2vec2-large-xlsr-arabic | 62.02% | 27.09% |
| elgeish/wav2vec2-large-xlsr-53-arabic | 100.00% | 100.56% |

Citation

If you want to cite this model, you can use the following:

@misc{grosman2021xlsr53-large-arabic,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {A}rabic},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-arabic}},
  year={2021}
}