Model:
NbAiLab/nb-wav2vec2-300m-bokmaal
This model is fine-tuned on top of the VoxRex model, a feature extractor released by the National Library of Sweden. The fine-tuned model achieves the test-set results listed below when decoded with a 5-gram KenLM; decoding without the language model yields somewhat higher error rates.
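If you just want to try the model, a minimal transcription sketch with the transformers pipeline could look like this (the file name audio.mp3 is a placeholder for your own 16 kHz recording):

```python
# A minimal sketch: transcribe a local recording with this model via the
# transformers pipeline. "audio.mp3" is a placeholder for your own 16 kHz file.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="NbAiLab/nb-wav2vec2-300m-bokmaal")
result = asr("audio.mp3")
print(result["text"])
```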
This is one of several Wav2Vec models our team created during the Robust Speech Event hosted by Hugging Face. Here is a complete list of our models and their final scores:
| Model | Final WER |
|---|---|
| NbAiLab/nb-wav2vec2-1b-bokmaal | 6.33 |
| NbAiLab/nb-wav2vec2-300m-bokmaal (this model) | 7.03 |
| NbAiLab/nb-wav2vec2-300m-nynorsk | 12.22 |
In parallel with this event, the team also converted the Norwegian Parliamentary Speech Corpus (NPSC) to the Hugging Face Dataset format and used it as the main source of training data.
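Loading that dataset for your own experiments is a one-liner; a small sketch, assuming the same 16K_mp3_bokmaal configuration used for this model (the dataset is hosted on the Hugging Face Hub and may require you to be logged in):

```python
# A sketch of loading the NPSC configuration we trained on.
from datasets import load_dataset

npsc = load_dataset("NbAiLab/NPSC", "16K_mp3_bokmaal", split="train")
print(npsc[0]["text"])  # transcript of the first utterance
```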
We have released all the code developed during the event so that the Norwegian NLP community can build on it when developing better Norwegian ASR models. Fine-tuning these models is not very computationally demanding: after following the instructions here, you should be able to train your own automatic speech recognition system on an average GPU in less than a day.
The following people contributed to building this model: Rolv-Arild Braaten, Per Egil Kummervold, Andre Kåsen, Javier de la Rosa, Per Erik Solberg, and Freddy Wetjen.
To reproduce these results, we strongly recommend that you follow the instructions from Hugging Face to train a simple Swedish model.
Once you have verified that you can do this, create a fresh new repository. You can then copy the files run.sh and run_speech_recognition_ctc.py from our repository. Running these will create all the other necessary files and should let you reproduce our results. With some adjustments to the hyperparameters, you might even be able to build an even better ASR model. Good luck!
As the scores indicate, even a simple 5-gram language model improves the results. Hugging Face has provided another very nice blog post explaining how to add a 5-gram language model to improve an ASR model. You can build such a model from your own corpus, for instance by extracting some suitable text from the Norwegian Colossal Corpus. You can also skip some of the steps in the guide and copy the 5-gram model from this repo.
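As a rough sketch of the final step in that guide, a KenLM binary can be attached to this model's tokenizer with pyctcdecode; the path path/to/5gram.bin and the output directory are placeholders, and the pyctcdecode and kenlm packages are assumed to be installed:

```python
# A sketch of wrapping the acoustic model's tokenizer with a KenLM decoder,
# along the lines of the blog post referenced above.
from transformers import Wav2Vec2Processor, Wav2Vec2ProcessorWithLM
from pyctcdecode import build_ctcdecoder

processor = Wav2Vec2Processor.from_pretrained("NbAiLab/nb-wav2vec2-300m-bokmaal")

# pyctcdecode expects the vocabulary sorted by token id
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(labels, kenlm_model_path="path/to/5gram.bin")  # placeholder path
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
processor_with_lm.save_pretrained("nb-wav2vec2-300m-bokmaal-with-lm")  # hypothetical output dir
```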
The final model was trained using the following parameters:
```
--dataset_name="NbAiLab/NPSC"
--model_name_or_path="KBLab/wav2vec2-large-voxrex"
--dataset_config_name="16K_mp3_bokmaal"
--output_dir="./"
--overwrite_output_dir
--num_train_epochs="15"
--per_device_train_batch_size="16"
--per_device_eval_batch_size="16"
--gradient_accumulation_steps="2"
--learning_rate="1e-4"
--warmup_steps="2000"
--length_column_name="input_length"
--evaluation_strategy="steps"
--text_column_name="text"
--save_steps="500"
--eval_steps="500"
--logging_steps="100"
--layerdrop="0.041"
--attention_dropout="0.094"
--activation_dropout="0.055"
--hidden_dropout="0.047"
--save_total_limit="3"
--freeze_feature_encoder
--feat_proj_dropout="0.04"
--mask_time_prob="0.082"
--mask_time_length="10"
--mask_feature_prob="0.25"
--mask_feature_length="64"
--gradient_checkpointing
--min_duration_in_seconds="0.5"
--max_duration_in_seconds="30.0"
--use_auth_token
--seed="42"
--fp16
--group_by_length
--do_train
--do_eval
--push_to_hub
--preprocessing_num_workers="32"
```
Using these settings, training may take 3-4 days on an average GPU. You can, however, get a decent model and faster results by tweaking these parameters; the table below lists the most useful knobs, and a code sketch follows it.
| Parameter | Comment |
|---|---|
| per_device_train_batch_size | Adjust this to the maximum of available memory. 16 or 24 might be good settings depending on your system. |
| gradient_accumulation_steps | Can be adjusted even further up to increase batch size and speed up training without running into memory issues. |
| learning_rate | Can be increased, maybe as high as 1e-4. Speeds up training but might add instability. |
| epochs | Can be decreased significantly. This is a huge dataset and you might get a decent result already after a couple of epochs. |
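As a sketch of how these knobs map onto code, the flags above roughly correspond to transformers.TrainingArguments (run_speech_recognition_ctc.py parses them with HfArgumentParser); the values shown are the ones from our final run, so adjust them as suggested in the table:

```python
# A sketch of the training-related flags expressed as TrainingArguments.
# Values are from the final run above; tune them per the table.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./",
    num_train_epochs=15,               # can be decreased significantly
    per_device_train_batch_size=16,    # raise to fill available GPU memory
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,     # raise for a larger effective batch size
    learning_rate=1e-4,                # higher speeds up training, may add instability
    warmup_steps=2000,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_steps=100,
    save_total_limit=3,
    gradient_checkpointing=True,
    group_by_length=True,
    fp16=True,
)
```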
```bibtex
@inproceedings{de-la-rosa-etal-2023-boosting,
    title = "Boosting {N}orwegian Automatic Speech Recognition",
    author = "De La Rosa, Javier and
      Braaten, Rolv-Arild and
      Kummervold, Per and
      Wetjen, Freddy",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.55",
    pages = "555--564",
    abstract = "In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokm{\aa}l and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10{\%} to 7.60{\%}, with models achieving 5.81{\%} for Bokm{\aa}l and 11.54{\%} for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.",
}
```