Model:
NbAiLab/nb-wav2vec2-1b-bokmaal
This model is finetuned on top of the XLS-R feature extractor from Facebook/Meta. The finetuned model achieves the following results on the test set with a 5-gram KenLM; the numbers in parentheses are the results without the language model:
This is one of several Wav2Vec models our team created during the 🤗-hosted Robust Speech Event. Here is the complete list of our models and their final scores:
Model | Final WER |
---|---|
NbAiLab/nb-wav2vec2-1b-bokmaal (this model) | 6.33 |
NbAiLab/nb-wav2vec2-300m-bokmaal | 7.03 |
NbAiLab/nb-wav2vec2-300m-nynorsk | 12.22 |
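If you just want to try the finetuned model before training anything yourself, a minimal inference sketch with the transformers pipeline API might look like this (this snippet is ours, not from the original card; the file name sample.wav is a placeholder for your own 16 kHz recording):

```python
# Minimal inference sketch. "sample.wav" is a placeholder for your own
# 16 kHz mono recording of Norwegian speech.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="NbAiLab/nb-wav2vec2-1b-bokmaal",
)
print(asr("sample.wav")["text"])
```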
In parallel with the event, the team also converted the Norwegian Parliamentary Speech Corpus (NPSC) to the 🤗 Dataset format and used it as the main source for training.
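For orientation, loading that converted dataset should amount to something like the sketch below. The config name matches --dataset_config_name in the training command further down; the "text" and "audio" column names follow the --text_column_name setting and the standard datasets audio feature, so treat them as assumptions:

```python
# Sketch: peek at the NPSC dataset in its Hugging Face Dataset format.
# streaming=True avoids downloading the full corpus just to inspect it.
from datasets import load_dataset

npsc = load_dataset("NbAiLab/NPSC", "16K_mp3_bokmaal", split="train", streaming=True)
example = next(iter(npsc))
print(example["text"])                    # transcription
print(example["audio"]["sampling_rate"])  # expected: 16000
```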
We have released all the code developed during this event so that the Norwegian NLP community can build upon it when developing even better Norwegian ASR models. The finetuning of these models is not very computationally demanding; following the instructions here, you should be able to train your own automatic speech recognition system on an average GPU in less than a day.
The following people contributed to building this model: Rolv-Arild Braaten, Per Egil Kummervold, Andre Kåsen, Javier de la Rosa, Per Erik Solberg, and Freddy Wetjen.
To reproduce these results, we strongly recommend that you follow the instructions from 🤗 to train a simple Swedish model.
When you have verified that you are able to do this, create a fresh new repo. You can then start by copying the files run.sh and run_speech_recognition_ctc.py from our repo. Running these will create all the other necessary files and should let you reproduce our results. With some tweaks to the hyperparameters, you might even be able to build an even better ASR. Good luck!
As the scores indicate, adding a simple 5-gram language model improves the results. 🤗 has also provided another very nice blog explaining how to add a 5-gram language model to improve an ASR model. You can build this language model from your own corpus, for instance by extracting some suitable text from the Norwegian Colossal Corpus. You can also skip some of the steps in the guide and copy the 5-gram model from this repo.
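As a rough outline of what that guide covers: you first build an ARPA file with KenLM (for example lmplz -o 5 < corpus.txt > 5gram.arpa, plus the end-of-sentence fix the blog walks through), then wrap it into the processor with pyctcdecode. A condensed sketch, with the file name 5gram.arpa as an assumption:

```python
# Sketch: attach a 5-gram KenLM to the CTC decoder, following the general
# recipe from the Hugging Face blog. "5gram.arpa" is an assumed file name.
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM

processor = AutoProcessor.from_pretrained("NbAiLab/nb-wav2vec2-1b-bokmaal")

# Sort the vocabulary by token id so the decoder labels line up with the
# model's output logits.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(labels=labels, kenlm_model_path="5gram.arpa")
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
processor_with_lm.save_pretrained("./with_lm")  # reusable processor directory
```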
The final model was run with the following parameters:
```
--dataset_name="NbAiLab/NPSC"
--model_name_or_path="facebook/wav2vec2-xls-r-1b"
--dataset_config_name="16K_mp3_bokmaal"
--output_dir="./"
--overwrite_output_dir
--num_train_epochs="40"
--per_device_train_batch_size="12"
--per_device_eval_batch_size="12"
--gradient_accumulation_steps="2"
--learning_rate="2e-5"
--warmup_steps="2000"
--length_column_name="input_length"
--evaluation_strategy="steps"
--text_column_name="text"
--save_steps="500"
--eval_steps="500"
--logging_steps="100"
--layerdrop="0.041"
--attention_dropout="0.094"
--activation_dropout="0.055"
--hidden_dropout="0.047"
--save_total_limit="3"
--freeze_feature_encoder
--feat_proj_dropout="0.04"
--mask_time_prob="0.082"
--mask_time_length="10"
--mask_feature_prob="0.25"
--mask_feature_length="64"
--gradient_checkpointing
--min_duration_in_seconds="0.5"
--max_duration_in_seconds="30.0"
--ctc_zero_infinity=True
--use_auth_token
--seed="42"
--fp16
--group_by_length
--do_train
--do_eval
--push_to_hub
--preprocessing_num_workers="16"
```
Using these settings, training might take 3-4 days on an average GPU. You should, however, be able to get a decent model and faster results by tweaking these parameters (note that with the 🤗 Trainer, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × the number of GPUs). A quick way to sanity-check the resulting WER is sketched after the table below.
Parameter | Comment |
---|---|
per_device_train_batch_size | Adjust this to the maximum your GPU memory allows; 16 or 24 might be good settings depending on your system |
gradient_accumulation_steps | Can be adjusted even further up to increase the effective batch size and speed up training without running into memory issues |
learning_rate | Can be increased, maybe as high as 1e-4; speeds up training but might add instability |
epochs | Can be decreased significantly; this is a huge dataset and you might get a decent result after just a couple of epochs |
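As mentioned above, once training finishes you can sanity-check your checkpoint against the WER numbers in the first table. A sketch using the evaluate library: "./" stands for your own checkpoint directory, the sample size of 64 is arbitrary, and we assume the test split's text column is normalized the same way as the model output:

```python
# Rough WER sanity check on a slice of the NPSC test split (a sketch,
# not the official evaluation). "./" is your finetuned checkpoint.
import evaluate
from datasets import load_dataset
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="./")
test = load_dataset("NbAiLab/NPSC", "16K_mp3_bokmaal", split="test").select(range(64))

predictions = [asr(ex["audio"]["array"])["text"] for ex in test]
references = [ex["text"] for ex in test]

wer = evaluate.load("wer")
print(f"WER: {100 * wer.compute(predictions=predictions, references=references):.2f}")
```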
```
@inproceedings{de-la-rosa-etal-2023-boosting,
  title     = "Boosting {N}orwegian Automatic Speech Recognition",
  author    = "De La Rosa, Javier and Braaten, Rolv-Arild and Kummervold, Per and Wetjen, Freddy",
  booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
  month     = may,
  year      = "2023",
  address   = "T{\'o}rshavn, Faroe Islands",
  publisher = "University of Tartu Library",
  url       = "https://aclanthology.org/2023.nodalida-1.55",
  pages     = "555--564",
  abstract  = "In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokm{\aa}l and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10{\%} to 7.60{\%}, with models achieving 5.81{\%} for Bokm{\aa}l and 11.54{\%} for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.",
}
```