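wav2vec2-common_voice_13_0-eo-3, an Esperanto speech recognizer

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the Esperanto subset of the mozilla-foundation/common_voice_13_0 dataset. It achieves the following results on the evaluation set:

  • Loss: 0.2191
  • Cer: 0.0208
  • Wer: 0.0687

The first 10 samples in the test set: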

Actual | Predicted | CER
la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo | la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo | 0.0
en la sekva jaro li ricevis premion | en la sekva jaro li ricevis prenion | 0.02857142857142857
ŝi studis historion ĉe la universitato de brita kolumbio | ŝi studis historion ĉe la universitato de brita kolumbio | 0.0
larĝaj ŝtupoj kuras al la fasado | larĝaj ŝtupoj kuras al la fasado | 0.0
la municipo ĝuas duan epokon de etendo kaj disvolviĝo | la municipo ĝuas duonepokon de tendo kaj disvolviĝo | 0.05660377358490566
li estis ankaŭ katedrestro kaj dekano | li estis ankaŭ katedresto kaj dekano | 0.02702702702702703
librovendejo apartenas al la muzeo | librovendejo apartenas al la muzeo | 0.0
ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵaro de arbaroj | ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵo de arbaroj | 0.02702702702702703
unue ili estas ruĝaj poste brunaj | unue ili estas ruĝaj poste brunaj | 0.0
la loĝantaro laboras en la proksima ĉefurbo | la loĝantaro laboras en la proksima ĉefurbo | 0.0
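
The CER values in this table are consistent with the usual definition: character-level edit distance divided by the length of the reference. The following is a minimal sketch of that definition in plain Python. It is an assumption that the figures above were computed this way; the helper names `edit_distance` and `cer` are illustrative and do not come from the training script.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        current = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


actual = "en la sekva jaro li ricevis premion"
predicted = "en la sekva jaro li ricevis prenion"
print(cer(actual, predicted))  # one substitution over 35 characters
```

For the second row of the table this gives 1/35, which agrees with the listed value.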

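Model description

See facebook/wav2vec2-large-xlsr-53.

Intended uses & limitations

Speech recognition for Esperanto. The base model was pre-trained and fine-tuned on speech audio sampled at 16 kHz. When using this model, make sure that your speech input is also sampled at 16 kHz.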
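
A minimal sketch of how the 16 kHz requirement might be met at inference time. It assumes the third-party packages `librosa` and `transformers` are installed and that `librosa` can decode the input file on your system; the function name `transcribe` is illustrative, and the default model id is the one named later in this card.

```python
TARGET_SAMPLING_RATE = 16_000  # Hz, the rate the base model was trained on


def transcribe(path, model_id="xekri/wav2vec2-common_voice_13_0-eo-3"):
    """Load an audio file, resample it to 16 kHz, and return the transcription."""
    # Third-party imports are local so this module imports without them.
    import librosa
    from transformers import pipeline

    # librosa.load resamples to the requested rate while decoding.
    audio, sampling_rate = librosa.load(path, sr=TARGET_SAMPLING_RATE, mono=True)
    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr({"raw": audio, "sampling_rate": sampling_rate})
    return result["text"]
```

Building the pipeline once and reusing it would be more efficient when transcribing many files; it is created inside the function here only to keep the sketch short.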

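Training and evaluation data

The training split was set to train[:15000] and the evaluation split to validation[:1500].

Training procedure

I used run_speech_recognition_ctc.py, passing it the following train.json file: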

{
  "dataset_name": "mozilla-foundation/common_voice_13_0",
  "model_name_or_path": "facebook/wav2vec2-large-xlsr-53",
  "dataset_config_name": "eo",
  "output_dir": "./wav2vec2-common_voice_13_0-eo-3",
  "train_split_name": "train[:15000]",
  "eval_split_name": "validation[:1500]",
  "eval_metrics": ["cer", "wer"],
  "overwrite_output_dir": true,
  "preprocessing_num_workers": 8,
  "num_train_epochs": 100,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "gradient_checkpointing": true,
  "learning_rate": 3e-5,
  "warmup_steps": 500,
  "evaluation_strategy": "steps",
  "text_column_name": "sentence",
  "length_column_name": "input_length",
  "save_steps": 1000,
  "eval_steps": 1000,
  "layerdrop": 0.1,
  "save_total_limit": 3,
  "freeze_feature_encoder": true,
  "chars_to_ignore": "-!\"'(),.:;=?_`¨«¸»ʼ‑–—‘’“”„…‹›♫?",
  "chars_to_substitute": {
    "przy": "pŝe",
    "byn": "bin",
    "cx": "ĉ",
    "sx": "ŝ",
    "ﬁ": "fi",
    "ﬂ": "fl",
    "ǔ": "ŭ",
    "ñ": "nj",
    "á": "a",
    "é": "e",
    "ü": "ŭ",
    "y": "j",
    "qu": "ku"
  },
  "fp16": true,
  "group_by_length": true,
  "push_to_hub": true,
  "do_train": true,
  "do_eval": true
}

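I inspected the dataset to find non-speech characters and put them in chars_to_ignore. There are also some character sequences that can be transcribed to Esperanto phonemes; I put those in chars_to_substitute as a dictionary. This required adding the following argument to the program: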

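from dataclasses import dataclass, field
from typing import Dict, Optional

def dict_field(default=None, metadata=None):
    # Wrap the default in a factory so a dict can be used as a dataclass default.
    return field(default_factory=lambda: default, metadata=metadata)

@dataclass
class DataTrainingArguments:
    ...
    chars_to_substitute: Optional[Dict[str, str]] = dict_field(
        default=None,
        metadata={"help": "A dict of characters to replace."},
    )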

I then added substitute_characters, modeled on remove_special_characters, to do the actual substitution:

    def remove_special_characters(batch):
        text = batch[text_column_name]
        if chars_to_ignore_regex is not None:
            text = re.sub(chars_to_ignore_regex, "", batch[text_column_name])
        batch["target_text"] = text.lower() + " "
        return batch

    def substitute_characters(batch):
        text: str = batch["target_text"]
        if data_args.chars_to_substitute is not None:
            for k, v in data_args.chars_to_substitute.items():
                # str.replace returns a new string, so keep the result
                text = text.replace(k, v)
        batch["target_text"] = text.lower()
        return batch

    with training_args.main_process_first(desc="dataset map special characters removal"):
        raw_datasets = raw_datasets.map(
            remove_special_characters,
            remove_columns=[text_column_name],
            desc="remove special characters from datasets",
        )

    with training_args.main_process_first(desc="dataset map special characters substitute"):
        raw_datasets = raw_datasets.map(
            substitute_characters,
            desc="substitute special characters in datasets",
        )
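
The effect of the two preprocessing steps can be checked outside the training script. The following is a small standalone sketch, assuming abbreviated versions of the chars_to_ignore and chars_to_substitute settings; the `normalize` helper and the sample sentence are illustrative only.

```python
import re

# Abbreviated, illustrative subsets of the train.json settings.
CHARS_TO_IGNORE = "-!\"'(),.:;=?_"
CHARS_TO_SUBSTITUTE = {"cx": "ĉ", "sx": "ŝ", "y": "j", "qu": "ku"}

# Escape every character so that none is read as regex syntax inside the class.
IGNORE_REGEX = "[" + re.escape(CHARS_TO_IGNORE) + "]"


def normalize(text: str) -> str:
    """Strip ignored characters, lowercase, then apply the substitutions in order."""
    text = re.sub(IGNORE_REGEX, "", text).lower()
    for old, new in CHARS_TO_SUBSTITUTE.items():
        # str.replace returns a new string, so the result must be reassigned.
        text = text.replace(old, new)
    return text


print(normalize("Cxu vi, Quentin?"))  # ĉu vi kuentin
```

Because dictionaries preserve insertion order, longer sequences such as "przy" should be listed before single characters such as "y", as they are in train.json.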

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • layerdrop: 0.1
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 100
  • mixed_precision_training: Native AMP
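
As a rough check on how these values relate, the arithmetic below derives the total batch size and the approximate epoch reached at a given optimizer step. It assumes a single training device and that all 15000 training samples are used; the Trainer's exact bookkeeping may differ slightly.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_train_samples = 15_000  # train[:15000]
num_devices = 1  # assumption: single-GPU training

total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = num_train_samples / total_train_batch_size


def epoch_at_step(step: int) -> float:
    """Approximate epoch reached after a given number of optimizer steps."""
    return step / steps_per_epoch


print(total_train_batch_size)         # 32
print(round(epoch_at_step(1000), 2))  # 2.13
```

This agrees with the first logged evaluation, which reports epoch 2.13 at step 1000.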

Training results

Training Loss | Epoch | Step | Cer | Validation Loss | Wer
2.6416 | 2.13 | 1000 | 0.1541 | 0.8599 | 0.6449
0.2633 | 4.27 | 2000 | 0.0335 | 0.1897 | 0.1431
0.1739 | 6.4 | 3000 | 0.0289 | 0.1732 | 0.1145
0.1378 | 8.53 | 4000 | 0.0276 | 0.1729 | 0.1066
0.1172 | 10.67 | 5000 | 0.0268 | 0.1773 | 0.1019
0.1049 | 12.8 | 6000 | 0.0255 | 0.1701 | 0.0937
0.0951 | 14.93 | 7000 | 0.0253 | 0.1718 | 0.0933
0.0851 | 17.07 | 8000 | 0.0239 | 0.1787 | 0.0834
0.0809 | 19.2 | 9000 | 0.0235 | 0.1802 | 0.0835
0.0756 | 21.33 | 10000 | 0.0239 | 0.1784 | 0.0855
0.0708 | 23.47 | 11000 | 0.0235 | 0.1748 | 0.0824
0.0657 | 25.6 | 12000 | 0.0228 | 0.1830 | 0.0796
0.0605 | 27.73 | 13000 | 0.0230 | 0.1896 | 0.0798
0.0583 | 29.87 | 14000 | 0.0224 | 0.1889 | 0.0778
0.0608 | 32.0 | 15000 | 0.0223 | 0.1849 | 0.0757
0.0556 | 34.13 | 16000 | 0.0223 | 0.1872 | 0.0767
0.0534 | 36.27 | 17000 | 0.0221 | 0.1893 | 0.0751
0.0523 | 38.4 | 18000 | 0.0218 | 0.1925 | 0.0729
0.0494 | 40.53 | 19000 | 0.0221 | 0.1957 | 0.0745
0.0475 | 42.67 | 20000 | 0.0217 | 0.1961 | 0.0740
0.048 | 44.8 | 21000 | 0.0214 | 0.1957 | 0.0714
0.0459 | 46.93 | 22000 | 0.0215 | 0.1968 | 0.0717
0.0435 | 49.07 | 23000 | 0.0217 | 0.2008 | 0.0717
0.0428 | 51.2 | 24000 | 0.0212 | 0.1991 | 0.0696
0.0418 | 53.33 | 25000 | 0.0215 | 0.2034 | 0.0714
0.0404 | 55.47 | 26000 | 0.0210 | 0.2014 | 0.0684
0.0394 | 57.6 | 27000 | 0.0210 | 0.2050 | 0.0681
0.0399 | 59.73 | 28000 | 0.0211 | 0.2039 | 0.0700
0.0389 | 61.87 | 29000 | 0.0214 | 0.2091 | 0.0694
0.038 | 64.0 | 30000 | 0.0210 | 0.2100 | 0.0702
0.0361 | 66.13 | 31000 | 0.0215 | 0.2119 | 0.0703
0.0359 | 68.27 | 32000 | 0.0213 | 0.2108 | 0.0714
0.0354 | 70.4 | 33000 | 0.0211 | 0.2120 | 0.0699
0.0364 | 72.53 | 34000 | 0.0211 | 0.2128 | 0.0688
0.0361 | 74.67 | 35000 | 0.0212 | 0.2134 | 0.0694
0.0332 | 76.8 | 36000 | 0.0210 | 0.2176 | 0.0698
0.0341 | 78.93 | 37000 | 0.0208 | 0.2170 | 0.0688
0.032 | 81.07 | 38000 | 0.0209 | 0.2157 | 0.0686
0.0318 | 83.33 | 39000 | 0.0209 | 0.2166 | 0.0685
0.0325 | 85.47 | 40000 | 0.0209 | 0.2172 | 0.0687
0.0316 | 87.6 | 41000 | 0.0208 | 0.2181 | 0.0678
0.0302 | 89.73 | 42000 | 0.0208 | 0.2171 | 0.0679
0.0318 | 91.87 | 43000 | 0.0211 | 0.2179 | 0.0702
0.0314 | 94.0 | 44000 | 0.0208 | 0.2186 | 0.0690
0.0309 | 96.13 | 45000 | 0.0210 | 0.2193 | 0.0696
0.031 | 98.27 | 46000 | 0.0208 | 0.2191 | 0.0686

Framework versions

  • Transformers 4.29.1
  • Pytorch 2.0.1+cu118
  • Datasets 2.12.0
  • Tokenizers 0.13.3

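Discussion

NaNs and Infs

While debugging other training runs on the Esperanto Common Voice dataset in which some loss computations returned inf or nan, I found that some training samples have a very high CER when run through this model. Some examples: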

File | Actual | Predicted | CER | Comment
common_voice_eo_25365027.mp3 | en la hansaj agentejoj komercistoj el la regiono renkontis kolegojn el aliaj regionoj | a taaj keo eoj eejn kigos eegoj eioeegiooj | 0.61 | No audio
common_voice_eo_25365472.mp3 | ili vendas armilojn kaj teknologiojn al la fanatikuloj por gajni monon monon monon | ila mamato aiil ajn kno ion a a aotigojn pu aiooo aj knon | 0.55 | Barely any audio, distorted
common_voice_eo_25365836.mp3 | industria apliko estas la kreado de modifitaj bakterioj kiuj produktas deziratan kemian substancon | iiti sieetas la eeadooddddooiooaotooeioj aiicenon | 0.67 | Barely any audio, distorted
2600 | ili akiras plenkreskan plumaron nur en la kvina jaro | ili aaros peetaj patato a a sia ro | 0.52 | It's literally someone saying 'injabum'. Thanks, troll.
7333 | poste sekvas difinoj de la termino | po | 0.94 | No audio
7334 | li gvidis multajn kursojn laŭ la csehmetodo | po | 0.98 | No audio
7429 | tamen pro la rekonstruo de kluzoj ne eblas trapasi komplete | po | 0.97 | No audio
11662 | lingvotesto estas postulata ekzemple por akceptiĝo en anglalingvaj altlernejoj | linkonteto estastitot etateerteito en pootaeaje lgijoj | 0.58 | No audio

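Some of these examples have no audio at all. Such files are completely useless and should be removed from the training set.

You can see the model trying to imagine the target content when there is barely any audio. That is very bad for truthfully reporting what was said. I would also like some measure of certainty, so that only transcriptions with relatively high certainty are accepted. However, I could not find out how to get such a value.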
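
One common heuristic for a rough certainty score, offered here only as a sketch and not something used for this model, is to average the probability of the most likely token over all output frames. With a Wav2Vec2ForCTC model the per-frame scores are available as the `logits` attribute of the model output; the toy lists below stand in for those logits, and the function names are illustrative. Such a score is uncalibrated and tends to be dominated by CTC blank frames, so any threshold would have to be tuned by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_frame_confidence(frame_logits):
    """Average, over frames, of the probability of the most likely token."""
    if not frame_logits:
        raise ValueError("need at least one frame")
    return sum(max(softmax(frame)) for frame in frame_logits) / len(frame_logits)


# Toy logits over a 3-token vocabulary: two peaked frames versus two flat ones.
confident = [[9.0, 0.0, 0.0], [0.0, 9.0, 0.0]]
uncertain = [[0.1, 0.0, 0.1], [0.0, 0.1, 0.0]]
print(mean_frame_confidence(confident) > mean_frame_confidence(uncertain))  # True
```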

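The Common Voice dataset also contains up-vote and down-vote counts. All of the high-CER sentences above have 2 up votes; some have no down votes and some have 1 down vote. So we cannot rely on up or down votes to detect quality.

So what can be done?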

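Option 1

Training seems to work fine despite these empty and low-quality files. However, we still need to deal with the loss becoming inf or nan, because that breaks the computation.

By running run_speech_recognition_ctc with do_train=false, model_name_or_path="xekri/wav2vec2-common_voice_13_0-eo-3", and eval_split_name set to test, validation, or train, and by modifying trainer.py as follows, I could check whether any losses are nan or inf: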

        # To be JSON-serializable, we need to remove numpy types or zero-d tensors
        metrics = denumpify_detensorize(metrics)

        if all_losses is not None:
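            # np.where returns a tuple of index arrays, so take the first
            # (and only) axis before testing whether anything was found.
            loss_nan = np.where(np.isnan(all_losses))[0]
            if len(loss_nan) != 0:
                print(f'LOSSES ARE NAN: {loss_nan}')
            loss_inf = np.where(np.isinf(all_losses))[0]
            if len(loss_inf) != 0:
                print(f'LOSSES ARE INF: {loss_inf}')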
            metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()

Doing this shows that one of the 14913 samples in the test set has a loss of inf:

common_voice_eo_25167318.mp3

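The audio is severely distorted. This sample should be filtered out of the test set.

No samples in the validation set have a loss of inf or nan.

18 of the 143984 samples in the training set have a loss of inf: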

common_voice_eo_25467641.mp3
common_voice_eo_25467723.mp3
common_voice_eo_25467791.mp3
common_voice_eo_25467820.mp3
common_voice_eo_25467943.mp3
common_voice_eo_25478612.mp3
common_voice_eo_25478623.mp3
common_voice_eo_25478631.mp3
common_voice_eo_25478756.mp3
common_voice_eo_25478762.mp3
common_voice_eo_25478768.mp3
common_voice_eo_25478769.mp3
common_voice_eo_25479150.mp3
common_voice_eo_25479203.mp3
common_voice_eo_25479229.mp3
common_voice_eo_25517673.mp3
common_voice_eo_25517677.mp3
common_voice_eo_25527739.mp3

These files have no audio.
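
The check in Option 1 comes down to finding the positions of non-finite loss values. Here is a minimal standalone sketch in plain Python, assuming one loss value per sample in dataset order; the loss values are made up and the helper name `find_non_finite` is illustrative. Mapping an index back to a file name only works if the split was evaluated without shuffling.

```python
import math


def find_non_finite(losses):
    """Return (nan_indices, inf_indices) for a sequence of per-sample losses."""
    nan_indices = [i for i, loss in enumerate(losses) if math.isnan(loss)]
    inf_indices = [i for i, loss in enumerate(losses) if math.isinf(loss)]
    return nan_indices, inf_indices


# Hypothetical loss values; index 2 is inf and index 4 is nan.
losses = [0.21, 0.18, float("inf"), 0.25, float("nan")]
nan_indices, inf_indices = find_non_finite(losses)
print(nan_indices, inf_indices)  # [4] [2]

# Keep only the samples whose loss is finite.
keep = [i for i, loss in enumerate(losses) if math.isfinite(loss)]
```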

Option 2

Another possibility is to go through the audio files and discard those whose peak audio level is below some threshold.
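
A minimal sketch of such a filter, assuming each clip has already been decoded to floating-point samples in the range [-1.0, 1.0]. The threshold of 0.01 and the clip names are hypothetical and would need tuning against real recordings.

```python
def peak_amplitude(samples):
    """Largest absolute sample value; 0.0 for an empty clip."""
    return max((abs(s) for s in samples), default=0.0)


def is_probably_silent(samples, threshold=0.01):
    """True if the clip never reaches the (hypothetical) peak threshold."""
    return peak_amplitude(samples) < threshold


# Toy clips standing in for decoded audio.
clips = {
    "speech.mp3": [0.0, 0.3, -0.5, 0.2],
    "silent.mp3": [0.0, 0.001, -0.002, 0.0],
    "empty.mp3": [],
}
kept = [name for name, samples in clips.items() if not is_probably_silent(samples)]
print(kept)  # ['speech.mp3']
```

A peak test only catches silent or near-silent files. Distorted recordings, or clips where the speaker says something other than the prompt, would still pass.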

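Option 3

Since this model seems to work well enough, I could run inference on all samples, discard those whose CER (as determined by this model) is too high, say above 0.5, and then train another model on the filtered samples. The high-CER examples are:

Test set

71 of the 14913 samples in the test set show a high CER.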

common_voice_eo_25214319.mp3
common_voice_eo_25006596.mp3
common_voice_eo_27472721.mp3
common_voice_eo_27715088.mp3
common_voice_eo_27715091.mp3
common_voice_eo_26677019.mp3
common_voice_eo_26677023.mp3
common_voice_eo_20555291.mp3
common_voice_eo_25001942.mp3
common_voice_eo_25457354.mp3
common_voice_eo_25457355.mp3
common_voice_eo_25457365.mp3
common_voice_eo_25457373.mp3
common_voice_eo_25457396.mp3
common_voice_eo_25457397.mp3
common_voice_eo_25457409.mp3
common_voice_eo_25457410.mp3
common_voice_eo_25457412.mp3
common_voice_eo_25457442.mp3
common_voice_eo_25457444.mp3
common_voice_eo_25457445.mp3
common_voice_eo_25457577.mp3
common_voice_eo_25457578.mp3
common_voice_eo_28064453.mp3
common_voice_eo_25047803.mp3
common_voice_eo_25048418.mp3
common_voice_eo_25048419.mp3
common_voice_eo_25048421.mp3
common_voice_eo_25048423.mp3
common_voice_eo_25048428.mp3
common_voice_eo_25048574.mp3
common_voice_eo_25885643.mp3
common_voice_eo_25885645.mp3
common_voice_eo_26794882.mp3
common_voice_eo_27356529.mp3
common_voice_eo_25012640.mp3
common_voice_eo_25303457.mp3
common_voice_eo_18153931.mp3
common_voice_eo_18776206.mp3
common_voice_eo_18776208.mp3
common_voice_eo_18776219.mp3
common_voice_eo_18776220.mp3
common_voice_eo_18776222.mp3
common_voice_eo_18776223.mp3
common_voice_eo_18776236.mp3
common_voice_eo_18776238.mp3
common_voice_eo_18776244.mp3
common_voice_eo_18776248.mp3
common_voice_eo_18776285.mp3
common_voice_eo_18776287.mp3
common_voice_eo_18776297.mp3
common_voice_eo_18776298.mp3
common_voice_eo_25047998.mp3
common_voice_eo_25047999.mp3
common_voice_eo_25048000.mp3
common_voice_eo_25048001.mp3
common_voice_eo_25048002.mp3
common_voice_eo_25053113.mp3
common_voice_eo_25068355.mp3
common_voice_eo_25333056.mp3
common_voice_eo_25371639.mp3
common_voice_eo_25371640.mp3
common_voice_eo_25371641.mp3
common_voice_eo_25371642.mp3
common_voice_eo_25371643.mp3
common_voice_eo_22441946.mp3
common_voice_eo_26622121.mp3
common_voice_eo_25167318.mp3
common_voice_eo_25252685.mp3
common_voice_eo_25252698.mp3
common_voice_eo_25518636.mp3

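A note on two of these examples: we know that "saluton kiel vi fartas" ("hello, how are you") and "atendu momenton" ("wait a moment") are good starting points for learning Esperanto, but if that is not the text you were asked to record, you are not really helping anyone.

Validation set

17 of the 14909 samples in the validation set show a high CER.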

common_voice_eo_25392669.mp3
common_voice_eo_25392674.mp3
common_voice_eo_25392675.mp3
common_voice_eo_25392676.mp3
common_voice_eo_25392678.mp3
common_voice_eo_25392693.mp3
common_voice_eo_25392694.mp3
common_voice_eo_25392695.mp3
common_voice_eo_25392697.mp3
common_voice_eo_25392701.mp3
common_voice_eo_25392702.mp3
common_voice_eo_25392708.mp3
common_voice_eo_25392709.mp3
common_voice_eo_25408881.mp3
common_voice_eo_25408882.mp3
common_voice_eo_25408885.mp3
common_voice_eo_27380623.mp3

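I did not include some samples whose high CER comes from hallucinations on a single-word recording with a lot of silence before and after the word. Those recordings themselves are fine.

Training set

135 of the 143984 samples in the training set produce a high CER. I removed from this list some samples that have a high CER but sound fine.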

common_voice_eo_25365027.mp3
common_voice_eo_25365472.mp3
common_voice_eo_25365480.mp3
common_voice_eo_25365532.mp3
common_voice_eo_25365695.mp3
common_voice_eo_25365744.mp3
common_voice_eo_25365804.mp3
common_voice_eo_25365836.mp3
common_voice_eo_25365855.mp3
common_voice_eo_25372587.mp3
common_voice_eo_25401060.mp3
common_voice_eo_25430837.mp3
common_voice_eo_25444509.mp3
common_voice_eo_25240777.mp3
common_voice_eo_24942754.mp3
common_voice_eo_24942755.mp3
common_voice_eo_24990372.mp3
common_voice_eo_24990385.mp3
common_voice_eo_24990390.mp3
common_voice_eo_24990397.mp3
common_voice_eo_24990413.mp3
common_voice_eo_24990427.mp3
common_voice_eo_24990429.mp3
common_voice_eo_24990435.mp3
common_voice_eo_24990441.mp3
common_voice_eo_24990454.mp3
common_voice_eo_24990457.mp3
common_voice_eo_24990459.mp3
common_voice_eo_24990490.mp3
common_voice_eo_25529345.mp3
common_voice_eo_25648750.mp3
common_voice_eo_28670472.mp3
common_voice_eo_27931966.mp3
common_voice_eo_28252265.mp3
common_voice_eo_25454951.mp3
common_voice_eo_25927616.mp3
common_voice_eo_25153203.mp3
common_voice_eo_25238543.mp3
common_voice_eo_25284237.mp3
common_voice_eo_25460131.mp3
common_voice_eo_25460185.mp3
common_voice_eo_25460186.mp3
common_voice_eo_25460188.mp3
common_voice_eo_25460189.mp3
common_voice_eo_25446723.mp3
common_voice_eo_26025150.mp3
common_voice_eo_26640189.mp3
common_voice_eo_26888468.mp3
common_voice_eo_24844824.mp3
common_voice_eo_25022506.mp3
common_voice_eo_25022507.mp3
common_voice_eo_25022516.mp3
common_voice_eo_25032858.mp3
common_voice_eo_25032859.mp3
common_voice_eo_25032865.mp3
common_voice_eo_25243988.mp3
common_voice_eo_25244009.mp3
common_voice_eo_25266094.mp3
common_voice_eo_25266141.mp3
common_voice_eo_25285278.mp3
common_voice_eo_25286768.mp3
common_voice_eo_25457171.mp3
common_voice_eo_25467641.mp3
common_voice_eo_25467723.mp3
common_voice_eo_25467791.mp3
common_voice_eo_25467820.mp3
common_voice_eo_25467943.mp3
common_voice_eo_25478612.mp3
common_voice_eo_25478623.mp3
common_voice_eo_25478631.mp3
common_voice_eo_25478756.mp3
common_voice_eo_25478762.mp3
common_voice_eo_25478768.mp3
common_voice_eo_25478769.mp3
common_voice_eo_25479150.mp3
common_voice_eo_25479203.mp3
common_voice_eo_25479229.mp3
common_voice_eo_25517673.mp3
common_voice_eo_25517677.mp3
common_voice_eo_25527739.mp3
common_voice_eo_25975149.mp3
common_voice_eo_26193748.mp3
common_voice_eo_28401039.mp3
common_voice_eo_28421315.mp3
common_voice_eo_28937347.mp3
common_voice_eo_24890414.mp3
common_voice_eo_25294479.mp3
common_voice_eo_25438966.mp3
common_voice_eo_28855568.mp3
common_voice_eo_29011007.mp3
common_voice_eo_24599888.mp3
common_voice_eo_26964252.mp3
common_voice_eo_26964496.mp3
common_voice_eo_26964510.mp3
common_voice_eo_25432789.mp3
common_voice_eo_26688158.mp3
common_voice_eo_28516354.mp3
common_voice_eo_24790865.mp3
common_voice_eo_24790897.mp3
common_voice_eo_24790898.mp3
common_voice_eo_24790899.mp3
common_voice_eo_24790900.mp3
common_voice_eo_25362713.mp3
common_voice_eo_27585084.mp3
common_voice_eo_24813131.mp3
common_voice_eo_25035262.mp3
common_voice_eo_26000289.mp3
common_voice_eo_26003943.mp3
common_voice_eo_26283983.mp3
common_voice_eo_28708931.mp3
common_voice_eo_28037217.mp3
common_voice_eo_29273106.mp3
common_voice_eo_26006657.mp3
common_voice_eo_25399924.mp3
common_voice_eo_27982431.mp3
common_voice_eo_25893779.mp3
common_voice_eo_27842061.mp3
common_voice_eo_25052385.mp3
common_voice_eo_25807395.mp3
common_voice_eo_25807985.mp3
common_voice_eo_25808039.mp3
common_voice_eo_25808407.mp3
common_voice_eo_25809036.mp3
common_voice_eo_27487795.mp3
common_voice_eo_28460556.mp3
common_voice_eo_28884851.mp3
common_voice_eo_24819719.mp3
common_voice_eo_25153594.mp3
common_voice_eo_25234585.mp3
common_voice_eo_25245164.mp3
common_voice_eo_27538877.mp3
common_voice_eo_24862771.mp3
common_voice_eo_25070167.mp3
common_voice_eo_26381720.mp3
common_voice_eo_28110376.mp3
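
The selection step of Option 3 can be sketched as a threshold on per-file CER combined with a set of files that were listened to and judged fine. This is only an illustration under those assumptions: the file names and CER values below are hypothetical, and `select_for_removal` is not part of the training script.

```python
CER_THRESHOLD = 0.5  # the cut-off suggested in Option 3


def select_for_removal(cer_by_file, reviewed_ok=frozenset(), threshold=CER_THRESHOLD):
    """Files whose CER exceeds the threshold, minus those reviewed as fine."""
    return sorted(
        name
        for name, value in cer_by_file.items()
        if value > threshold and name not in reviewed_ok
    )


# Hypothetical file names and CER values, for illustration only.
cer_by_file = {
    "clip_a.mp3": 0.02,
    "clip_b.mp3": 0.61,
    "clip_c.mp3": 0.97,
    "clip_d.mp3": 0.58,  # high CER, but the recording sounds fine
}
to_remove = select_for_removal(cer_by_file, reviewed_ok={"clip_d.mp3"})
print(to_remove)  # ['clip_b.mp3', 'clip_c.mp3']
```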

Option 3.1

For files with no audio or with distorted audio, maybe their target could be set to an empty string? Except for 'injabum'.
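
A minimal sketch of this idea, assuming each example is a dict whose "path" field holds the bare file name and whose transcript is stored under "target_text" as in the preprocessing code. The function name is illustrative, and whether CTC training behaves well with empty targets has not been tested here.

```python
def blank_target_if_unusable(example, unusable_paths):
    """Return a copy of the example with an empty target if its audio is unusable."""
    if example["path"] in unusable_paths:
        return {**example, "target_text": ""}
    return example


# File names from the no-audio list in Option 1; the sentence is a placeholder.
unusable_paths = {"common_voice_eo_25467641.mp3", "common_voice_eo_25467723.mp3"}
example = {"path": "common_voice_eo_25467641.mp3", "target_text": "placeholder sentence"}
print(blank_target_if_unusable(example, unusable_paths))
```

Recordings with audible but wrong content, such as the 'injabum' clip, would be left out of `unusable_paths` and dropped from the dataset instead.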

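Also

Since it is possible to sign up on Common Voice to review Esperanto audio files, I have done so, in the hope of contributing a little to the quality of the data.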