Model:
NbAiLab/nb-wav2vec2-300m-bokmaal
This model is fine-tuned on top of the VoxRex model, a feature extractor released by the National Library of Sweden. The fine-tuned model achieves the test-set results listed below when decoded with a 5-gram KenLM; decoding without the language model yields somewhat higher error rates.
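If you just want to try the model, a minimal transcription sketch with the transformers pipeline could look like this (the file name audio.mp3 is a placeholder for your own 16 kHz recording):

```python
# A minimal sketch: transcribe a local recording with this model via the
# transformers pipeline. "audio.mp3" is a placeholder for your own 16 kHz file.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="NbAiLab/nb-wav2vec2-300m-bokmaal")
result = asr("audio.mp3")
print(result["text"])
```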
This is one of several Wav2Vec models our team created during the Robust Speech Event hosted by Hugging Face. Here is a complete list of our models and their final scores:
| Model | Final WER |
|---|---|
| NbAiLab/nb-wav2vec2-1b-bokmaal | 6.33 |
| NbAiLab/nb-wav2vec2-300m-bokmaal (this model) | 7.03 |
| NbAiLab/nb-wav2vec2-300m-nynorsk | 12.22 |
In parallel with this event, the team also converted the Norwegian Parliamentary Speech Corpus (NPSC) to the Hugging Face Dataset format and used it as the main source of training data.
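Loading that dataset for your own experiments is a one-liner; a small sketch, assuming the same 16K_mp3_bokmaal configuration used for this model (the dataset is hosted on the Hugging Face Hub and may require you to be logged in):

```python
# A sketch of loading the NPSC configuration we trained on.
from datasets import load_dataset

npsc = load_dataset("NbAiLab/NPSC", "16K_mp3_bokmaal", split="train")
print(npsc[0]["text"])  # transcript of the first utterance
```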
We have released all the code developed during the event so that the Norwegian NLP community can build on it when developing better Norwegian ASR models. Fine-tuning these models is not very computationally demanding: after following the instructions here, you should be able to train your own automatic speech recognition system on an average GPU in less than a day.
The following people contributed to building this model: Rolv-Arild Braaten, Per Egil Kummervold, Andre Kåsen, Javier de la Rosa, Per Erik Solberg, and Freddy Wetjen.
To reproduce these results, we strongly recommend that you follow the instructions from Hugging Face to train a simple Swedish model.
Once you have verified that you can do this, create a fresh new repository. You can then copy the files run.sh and run_speech_recognition_ctc.py from our repository. Running these will create all the other necessary files and should let you reproduce our results. With some adjustments to the hyperparameters, you might even be able to build an even better ASR model. Good luck!
As the scores indicate, even a simple 5-gram language model improves the results. Hugging Face has provided another very nice blog post explaining how to add a 5-gram language model to improve an ASR model. You can build such a model from your own corpus, for instance by extracting some suitable text from the Norwegian Colossal Corpus. You can also skip some of the steps in the guide and copy the 5-gram model from this repo.
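As a rough sketch of the final step in that guide, a KenLM binary can be attached to this model's tokenizer with pyctcdecode; the path path/to/5gram.bin and the output directory are placeholders, and the pyctcdecode and kenlm packages are assumed to be installed:

```python
# A sketch of wrapping the acoustic model's tokenizer with a KenLM decoder,
# along the lines of the blog post referenced above.
from transformers import Wav2Vec2Processor, Wav2Vec2ProcessorWithLM
from pyctcdecode import build_ctcdecoder

processor = Wav2Vec2Processor.from_pretrained("NbAiLab/nb-wav2vec2-300m-bokmaal")

# pyctcdecode expects the vocabulary sorted by token id
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(labels, kenlm_model_path="path/to/5gram.bin")  # placeholder path
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
processor_with_lm.save_pretrained("nb-wav2vec2-300m-bokmaal-with-lm")  # hypothetical output dir
```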
The final model was trained using the following parameters:
```
--dataset_name="NbAiLab/NPSC"
--model_name_or_path="KBLab/wav2vec2-large-voxrex"
--dataset_config_name="16K_mp3_bokmaal"
--output_dir="./"
--overwrite_output_dir
--num_train_epochs="15"
--per_device_train_batch_size="16"
--per_device_eval_batch_size="16"
--gradient_accumulation_steps="2"
--learning_rate="1e-4"
--warmup_steps="2000"
--length_column_name="input_length"
--evaluation_strategy="steps"
--text_column_name="text"
--save_steps="500"
--eval_steps="500"
--logging_steps="100"
--layerdrop="0.041"
--attention_dropout="0.094"
--activation_dropout="0.055"
--hidden_dropout="0.047"
--save_total_limit="3"
--freeze_feature_encoder
--feat_proj_dropout="0.04"
--mask_time_prob="0.082"
--mask_time_length="10"
--mask_feature_prob="0.25"
--mask_feature_length="64"
--gradient_checkpointing
--min_duration_in_seconds="0.5"
--max_duration_in_seconds="30.0"
--use_auth_token
--seed="42"
--fp16
--group_by_length
--do_train
--do_eval
--push_to_hub
--preprocessing_num_workers="32"
```
Using these settings, training may take 3-4 days on an average GPU. You can, however, get a decent model and faster results by tweaking these parameters; the table below lists the most useful knobs, and a code sketch follows it.
| Parameter | Comment |
|---|---|
| per_device_train_batch_size | Adjust this to the maximum of available memory. 16 or 24 might be good settings depending on your system. |
| gradient_accumulation_steps | Can be adjusted even further up to increase batch size and speed up training without running into memory issues. |
| learning_rate | Can be increased, maybe as high as 1e-4. Speeds up training but might add instability. |
| epochs | Can be decreased significantly. This is a huge dataset and you might get a decent result already after a couple of epochs. |
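As a sketch of how these knobs map onto code, the flags above roughly correspond to transformers.TrainingArguments (run_speech_recognition_ctc.py parses them with HfArgumentParser); the values shown are the ones from our final run, so adjust them as suggested in the table:

```python
# A sketch of the training-related flags expressed as TrainingArguments.
# Values are from the final run above; tune them per the table.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./",
    num_train_epochs=15,               # can be decreased significantly
    per_device_train_batch_size=16,    # raise to fill available GPU memory
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,     # raise for a larger effective batch size
    learning_rate=1e-4,                # higher speeds up training, may add instability
    warmup_steps=2000,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_steps=100,
    save_total_limit=3,
    gradient_checkpointing=True,
    group_by_length=True,
    fp16=True,
)
```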
```bibtex
@inproceedings{de-la-rosa-etal-2023-boosting,
    title = "Boosting {N}orwegian Automatic Speech Recognition",
    author = "De La Rosa, Javier and
      Braaten, Rolv-Arild and
      Kummervold, Per and
      Wetjen, Freddy",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.55",
    pages = "555--564",
    abstract = "In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokm{\aa}l and Nynorsk. We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets. Additionally, we measure the performance of these models against previous state-of-the-art ASR models, as well as on out-of-domain datasets. We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10{\%} to 7.60{\%}, with models achieving 5.81{\%} for Bokm{\aa}l and 11.54{\%} for Nynorsk. We also discuss the challenges and potential solutions for further improving ASR models for Norwegian.",
}
```