
Norwegian Wav2Vec2 Model - 1B Bokmål

This model is fine-tuned from the feature extractor XLS-R from Facebook/Meta. The fine-tuned model achieves the following results on the test set with a 5-gram KenLM. The numbers in parentheses are the results without the language model:

  • WER: 0.0633 (0.0738)
  • CER: 0.0248 (0.0263)
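
As a quick sanity check of the released checkpoint, a minimal inference sketch (not part of the original card, assuming the Hugging Face transformers ASR pipeline and a local 16 kHz audio file) might look like this:

from transformers import pipeline

# Load the fine-tuned Bokmål checkpoint; if pyctcdecode and kenlm are installed,
# the pipeline can also use the bundled 5-gram language model when decoding.
asr = pipeline("automatic-speech-recognition", model="NbAiLab/nb-wav2vec2-1b-bokmaal")

# "audio.mp3" is a hypothetical placeholder for a 16 kHz Norwegian speech recording.
print(asr("audio.mp3")["text"])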

Model description

This is one of several Wav2Vec models our team created during the Robust Speech Event hosted by Hugging Face 🤗. Below is a complete list of our models and their final scores:

Model | Final WER
NbAiLab/nb-wav2vec2-1b-bokmaal (this model) | 6.33
1234321 | 7.03
1235321 | 12.22

Dataset

In parallel with the event, the team also converted the Norwegian Parliamentary Speech Corpus (NPSC) to the Hugging Face 🤗 Dataset format and used it as the main source of training data.
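
A minimal sketch (an assumption, not from the card) of loading this dataset with the Hugging Face datasets library; the config name matches the --dataset_config_name used in the parameters below:

from datasets import load_dataset

# The "16K_mp3_bokmaal" configuration is the one referenced in the training parameters.
# Access may require being logged in to the Hugging Face Hub (the training run passes --use_auth_token).
npsc = load_dataset("NbAiLab/NPSC", "16K_mp3_bokmaal", split="train")

# The transcription column is named "text" (see --text_column_name below).
print(npsc[0]["text"])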

Code

We have released all the code developed during the event so that the Norwegian NLP community can build on it when developing even better Norwegian ASR models. Fine-tuning these models is not very computationally demanding: following the instructions here, you should be able to train your own automatic speech recognition system on an average GPU in less than a day.

Team

The following people contributed to building this model: Rolv-Arild Braaten, Per Egil Kummervold, Andre Kåsen, Javier de la Rosa, Per Erik Solberg, and Freddy Wetjen.

Training procedure

To reproduce these results, we strongly recommend that you first follow the instructions from Hugging Face 🤗 and train a simple Swedish model.

Once you have verified that you can do this, create a fresh new repository. You can then start by copying the run.sh and run_speech_recognition_ctc.py files from our repo; running these will create all the other necessary files and allow you to reproduce our results. With some tweaks to the hyperparameters, you might even be able to build an even better ASR. Good luck!
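
If you prefer to fetch the two files programmatically rather than copying them by hand, a hedged sketch using huggingface_hub (the repo id and filenames are taken from this card) could look like this:

from huggingface_hub import hf_hub_download

# Download run.sh and run_speech_recognition_ctc.py from this model repository into
# the local Hugging Face cache; copy them from the returned paths into your own repo.
for filename in ["run.sh", "run_speech_recognition_ctc.py"]:
    local_path = hf_hub_download(repo_id="NbAiLab/nb-wav2vec2-1b-bokmaal", filename=filename)
    print(local_path)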

Language model

As the scores indicate, adding even a simple 5-gram language model improves the results. Hugging Face 🤗 has also provided another very nice blog explaining how to add a 5-gram language model to improve an ASR model. You can build this language model from your own corpus, for instance by extracting some suitable text from the Norwegian Colossal Corpus. You can also skip some of the steps in the guide and copy the 5-gram model from this repo.
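
For illustration, here is a hedged sketch of LM-boosted decoding, assuming this repo ships a Wav2Vec2ProcessorWithLM (which requires pyctcdecode and kenlm to be installed); the audio below is a silent placeholder:

import numpy as np
import torch
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

model = AutoModelForCTC.from_pretrained("NbAiLab/nb-wav2vec2-1b-bokmaal")
processor = Wav2Vec2ProcessorWithLM.from_pretrained("NbAiLab/nb-wav2vec2-1b-bokmaal")

# Placeholder: one second of 16 kHz silence; replace with a real Norwegian speech signal.
speech = np.zeros(16_000, dtype=np.float32)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode on the LM-aware processor runs beam search with the 5-gram KenLM.
print(processor.batch_decode(logits.numpy()).text[0])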

Parameters

The final model was trained with the following parameters:

--dataset_name="NbAiLab/NPSC"
--model_name_or_path="facebook/wav2vec2-xls-r-1b"
--dataset_config_name="16K_mp3_bokmaal"
--output_dir="./"
--overwrite_output_dir
--num_train_epochs="40"
--per_device_train_batch_size="12"
--per_device_eval_batch_size="12" 
--gradient_accumulation_steps="2" 
--learning_rate="2e-5" 
--warmup_steps="2000" 
--length_column_name="input_length" 
--evaluation_strategy="steps" 
--text_column_name="text" 
--save_steps="500" 
--eval_steps="500" 
--logging_steps="100" 
--layerdrop="0.041" 
--attention_dropout="0.094" 
--activation_dropout="0.055" 
--hidden_dropout="0.047" 
--save_total_limit="3"
--freeze_feature_encoder 
--feat_proj_dropout="0.04" 
--mask_time_prob="0.082" 
--mask_time_length="10" 
--mask_feature_prob="0.25" 
--mask_feature_length="64" 
--gradient_checkpointing
--min_duration_in_seconds="0.5" 
--max_duration_in_seconds="30.0" 
--ctc_zero_infinity=True 
--use_auth_token 
--seed="42" 
--fp16 
--group_by_length 
--do_train --do_eval 
--push_to_hub 
--preprocessing_num_workers="16"

With these settings, training may take 3-4 days on an average GPU. You can, however, get a decent model and faster results by tweaking these parameters.

Parameter | Comment
per_device_train_batch_size | Adjust this to the maximum of available memory. 16 or 24 might be good settings depending on your system.
gradient_accumulation_steps | Can be adjusted even further up to increase batch size and speed up training without running into memory issues.
learning_rate | Can be increased, maybe as high as 1e-4. Speeds up training but might add instability.
epochs | Can be decreased significantly. This is a huge dataset and you might get a decent result already after a couple of epochs.
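
As a small illustration of the first two rows (an assumption, not from the card), the effective batch size resulting from the settings above can be computed like this:

# Effective batch size = per-device batch size × gradient accumulation steps × number of GPUs.
per_device_train_batch_size = 12
gradient_accumulation_steps = 2
num_gpus = 1  # hypothetical; scale by however many devices you actually train on

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 24 with one GPU and the settings above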

Citation

@inproceedings{de-la-rosa-etal-2023-boosting,
    title = "Boosting {N}orwegian Automatic Speech Recognition",
    author = "De La Rosa, Javier  and
      Braaten, Rolv-Arild  and
      Kummervold, Per  and
      Wetjen, Freddy",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.55",
    pages = "555--564",
    abstract = "In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokm{\aa}l and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10{\%} to 7.60{\%}, with models achieving 5.81{\%} for Bokm{\aa}l and 11.54{\%} for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.",
}

See https://arxiv.org/abs/2307.01672