
Norwegian Wav2Vec2 Model - 300M - VoxRex - Bokmål

This model is fine-tuned on top of the feature extractor VoxRex-model from the National Library of Sweden. The fine-tuned model achieves the following results on the test set with a 5-gram KenLM; the numbers in parentheses are the results without the language model:

  • Word error rate (WER): 0.0703 (0.0979)
  • Character error rate (CER): 0.0269 (0.0311)
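
For quick testing, a minimal transcription sketch using the 🤗 transformers pipeline could look like the following; the audio file name is a placeholder for any Norwegian speech recording:

```python
from transformers import pipeline

# Load the fine-tuned model from the Hugging Face Hub.
asr = pipeline("automatic-speech-recognition", model="NbAiLab/nb-wav2vec2-300m-bokmaal")

# "audio.mp3" is a placeholder; point it at any Norwegian speech recording.
result = asr("audio.mp3")
print(result["text"])
```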

Model description

This is one of several Wav2Vec models our team created during the Robust Speech Event hosted by 🤗. Here is the complete list of our models and their final scores:

| Model | Final WER |
|---|---|
| NbAiLab/nb-wav2vec2-1b-bokmaal | 6.33 |
| NbAiLab/nb-wav2vec2-300m-bokmaal (this model) | 7.03 |
| NbAiLab/nb-wav2vec2-300m-nynorsk | 12.22 |

Dataset

In parallel with the event, the team also converted the Norwegian Parliamentary Speech Corpus (NPSC) to the 🤗 Dataset format and used that as the main source of training data.
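
As a sketch, loading that dataset with the 🤗 datasets library should look roughly like this; the configuration name matches the one passed to the training script in the Parameters section below:

```python
from datasets import load_dataset

# Load the Bokmål 16 kHz MP3 configuration of the NPSC dataset from the Hub.
# Depending on the dataset's access settings you may need to be logged in
# (the training command below passes --use_auth_token).
npsc = load_dataset("NbAiLab/NPSC", "16K_mp3_bokmaal", split="train")
print(npsc[0]["text"])
```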

Code

We have released all the code developed during the event so that the Norwegian NLP community can build on it when developing even better Norwegian ASR models. Fine-tuning these models is not very computationally demanding: after following the instructions here, you should be able to train your own automatic speech recognition system in less than a day on an average GPU.

Team

The following people contributed to building this model: Rolv-Arild Braaten, Per Egil Kummervold, Andre Kåsen, Javier de la Rosa, Per Erik Solberg, and Freddy Wetjen.

Training procedure

To reproduce these results, we strongly recommend that you follow the instructions from 🤗 to train a simple Swedish model.

Once you have verified that you can do this, create a fresh new repository. You can then copy the two files run.sh and run_speech_recognition_ctc.py from our repository. Running these will create all the other necessary files and should let you reproduce our results. With some tweaks to the hyperparameters, you might even be able to build a better ASR model. Good luck!

Language Model

As the scores indicate, even a simple 5-gram language model improves the results. 🤗 has provided another very nice blog explaining how to add a 5-gram language model to improve an ASR model. You can build this model from your own corpus, for instance by extracting some suitable text from the Norwegian Colossal Corpus. You can also skip some of the steps in the guide and copy the 5-gram model from this repo.
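
For illustration, here is a hedged sketch of decoding with the bundled language model via transformers' Wav2Vec2ProcessorWithLM, assuming the 5-gram ships in this repository alongside the model as described above (it requires the pyctcdecode and kenlm packages):

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_id = "NbAiLab/nb-wav2vec2-300m-bokmaal"
# The processor bundles the tokenizer, feature extractor, and the KenLM decoder.
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# `waveform` stands in for real audio: a 1-D float32 array sampled at 16 kHz
# (one second of silence here, just so the sketch runs end to end).
waveform = np.zeros(16_000, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode runs a CTC beam search rescored by the 5-gram KenLM.
print(processor.batch_decode(logits.numpy()).text[0])
```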

Parameters

The final model was trained using the following parameters:

--dataset_name="NbAiLab/NPSC" 
--model_name_or_path="KBLab/wav2vec2-large-voxrex" 
--dataset_config_name="16K_mp3_bokmaal" 
--output_dir="./" 
--overwrite_output_dir 
--num_train_epochs="15" 
--per_device_train_batch_size="16" 
--per_device_eval_batch_size="16" 
--gradient_accumulation_steps="2" 
--learning_rate="1e-4" 
--warmup_steps="2000" 
--length_column_name="input_length" 
--evaluation_strategy="steps" 
--text_column_name="text" 
--save_steps="500" 
--eval_steps="500" 
--logging_steps="100" 
--layerdrop="0.041" 
--attention_dropout="0.094" 
--activation_dropout="0.055" 
--hidden_dropout="0.047" 
--save_total_limit="3" 
--freeze_feature_encoder 
--feat_proj_dropout="0.04" 
--mask_time_prob="0.082" 
--mask_time_length="10" 
--mask_feature_prob="0.25" 
--mask_feature_length="64" 
--gradient_checkpointing 
--min_duration_in_seconds="0.5" 
--max_duration_in_seconds="30.0" 
--use_auth_token 
--seed="42" 
--fp16 
--group_by_length 
--do_train --do_eval 
--push_to_hub 
--preprocessing_num_workers="32"
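
For readers who prefer the Python API: run_speech_recognition_ctc.py parses these flags with HfArgumentParser, and a subset of the same settings can be expressed as a transformers TrainingArguments object, roughly as sketched below.

```python
from transformers import TrainingArguments

# A subset of the command-line flags above, expressed through the Python API.
training_args = TrainingArguments(
    output_dir="./",
    overwrite_output_dir=True,
    num_train_epochs=15,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    warmup_steps=2000,
    length_column_name="input_length",
    evaluation_strategy="steps",
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
    save_total_limit=3,
    gradient_checkpointing=True,
    seed=42,
    fp16=True,
    group_by_length=True,
    push_to_hub=True,
)
```

The model- and data-specific flags (dropout rates, masking probabilities, freeze_feature_encoder, duration limits, and so on) are handled by the script's own argument classes rather than by TrainingArguments.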

With these settings, training may take 3-4 days on an average GPU. You can, however, get a decent model and faster results by tweaking these parameters.

| Parameter | Comment |
|---|---|
| per_device_train_batch_size | Adjust this to the maximum of available memory. 16 or 24 might be good settings depending on your system |
| gradient_accumulation_steps | Can be adjusted even further up to increase batch size and speed up training without running into memory issues (see the sketch after this table) |
| learning_rate | Can be increased, maybe as high as 1e-4. Speeds up training but might add instability |
| epochs | Can be decreased significantly. This is a huge dataset and you might get a decent result already after a couple of epochs |
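
As a concrete example of how the batch-size knobs interact, the effective batch size per optimization step under the settings above works out as follows (assuming a single GPU):

```python
# Effective batch size per optimizer step with the parameters above.
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_gpus = 1  # an assumption; scale by your actual device count

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 32
```

Raising gradient_accumulation_steps therefore grows the effective batch size without increasing per-step memory use.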

Citation

@inproceedings{de-la-rosa-etal-2023-boosting,
    title = "Boosting {N}orwegian Automatic Speech Recognition",
    author = "De La Rosa, Javier  and
      Braaten, Rolv-Arild  and
      Kummervold, Per  and
      Wetjen, Freddy",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.55",
    pages = "555--564",
    abstract = "In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokm{\aa}l and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10{\%} to 7.60{\%}, with models achieving 5.81{\%} for Bokm{\aa}l and 11.54{\%} for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.",
}

See also https://arxiv.org/abs/2307.01672.