Model:
xekri/wav2vec2-common_voice_13_0-eo-3
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the mozilla-foundation/common_voice_13_0 Esperanto dataset. It achieves the following results on the evaluation set:
The first 10 samples in the test set:
| Actual | Predicted | CER |
|---|---|---|
| la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo | la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo | 0.0 |
| en la sekva jaro li ricevis premion | en la sekva jaro li ricevis prenion | 0.02857142857142857 |
| ŝi studis historion ĉe la universitato de brita kolumbio | ŝi studis historion ĉe la universitato de brita kolumbio | 0.0 |
| larĝaj ŝtupoj kuras al la fasado | larĝaj ŝtupoj kuras al la fasado | 0.0 |
| la municipo ĝuas duan epokon de etendo kaj disvolviĝo | la municipo ĝuas duonepokon de tendo kaj disvolviĝo | 0.05660377358490566 |
| li estis ankaŭ katedrestro kaj dekano | li estis ankaŭ katedresto kaj dekano | 0.02702702702702703 |
| librovendejo apartenas al la muzeo | librovendejo apartenas al la muzeo | 0.0 |
| ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵaro de arbaroj | ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵo de arbaroj | 0.02702702702702703 |
| unue ili estas ruĝaj poste brunaj | unue ili estas ruĝaj poste brunaj | 0.0 |
| la loĝantaro laboras en la proksima ĉefurbo | la loĝantaro laboras en la proksima ĉefurbo | 0.0 |
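The CER values in the table are consistent with the usual definition of character error rate: the Levenshtein edit distance between the actual and predicted text, divided by the number of characters in the actual text. The following is a minimal sketch of that definition, not the code used to produce the table; the function names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # delete ref_char
                current[j - 1] + 1,                         # insert hyp_char
                previous[j - 1] + (ref_char != hyp_char),   # substitute or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Second row of the table: one substituted character in a 35-character reference.
print(cer("en la sekva jaro li ricevis premion",
          "en la sekva jaro li ricevis prenion"))
```

For the second row this gives 1/35, which matches the reported 0.0286.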
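See facebook/wav2vec2-large-xlsr-53.
Intended for speech recognition in Esperanto. The base model was pre-trained and fine-tuned on speech audio sampled at 16 kHz. When using this model, make sure your speech input is also sampled at 16 kHz.
The training split was set to train[:15000] and the evaluation split to validation[:1500].
I used run_speech_recognition_ctc.py and passed it the following train.json file: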
```json
{
  "dataset_name": "mozilla-foundation/common_voice_13_0",
  "model_name_or_path": "facebook/wav2vec2-large-xlsr-53",
  "dataset_config_name": "eo",
  "output_dir": "./wav2vec2-common_voice_13_0-eo-3",
  "train_split_name": "train[:15000]",
  "eval_split_name": "validation[:1500]",
  "eval_metrics": ["cer", "wer"],
  "overwrite_output_dir": true,
  "preprocessing_num_workers": 8,
  "num_train_epochs": 100,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "gradient_checkpointing": true,
  "learning_rate": 3e-5,
  "warmup_steps": 500,
  "evaluation_strategy": "steps",
  "text_column_name": "sentence",
  "length_column_name": "input_length",
  "save_steps": 1000,
  "eval_steps": 1000,
  "layerdrop": 0.1,
  "save_total_limit": 3,
  "freeze_feature_encoder": true,
  "chars_to_ignore": "-!\"'(),.:;=?_`¨«¸»ʼ‑–—‘’“”„…‹›♫?",
  "chars_to_substitute": {
    "przy": "pŝe",
    "byn": "bin",
    "cx": "ĉ",
    "sx": "ŝ",
    "ﬁ": "fi",
    "ﬂ": "fl",
    "ǔ": "ŭ",
    "ñ": "nj",
    "á": "a",
    "é": "e",
    "ü": "ŭ",
    "y": "j",
    "qu": "ku"
  },
  "fp16": true,
  "group_by_length": true,
  "push_to_hub": true,
  "do_train": true,
  "do_eval": true
}
```
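As a quick sanity check on these settings, the effective batch size and the number of optimizer steps per epoch can be worked out directly. The sketch below assumes training on a single device and that no samples were dropped during preprocessing; neither is stated in this card.

```python
# Values taken from train.json.
train_samples = 15_000
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 1  # assumption: not stated in this card

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = train_samples / effective_batch_size
epoch_at_step_1000 = 1000 / steps_per_epoch

print(effective_batch_size, steps_per_epoch, round(epoch_at_step_1000, 2))
```

Under those assumptions each optimizer step sees 32 utterances, and step 1000 falls at about epoch 2.13, which agrees with the first row of the training results table.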
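I inspected the dataset, found the non-speech characters, and put them in chars_to_ignore. In addition, some character sequences can be transcribed into Esperanto phonemes, and I put those in chars_to_substitute as a dictionary. This required adding the following argument to the program: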
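```python
from dataclasses import dataclass, field
from typing import Dict, Optional


def dict_field(default=None, metadata=None):
    # Dataclasses reject mutable defaults, so wrap the value in a factory.
    return field(default_factory=lambda: default, metadata=metadata)


@dataclass
class DataTrainingArguments:
    ...
    chars_to_substitute: Optional[Dict[str, str]] = dict_field(
        default=None,
        metadata={"help": "A dict of characters to replace."},
    )
```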
I then made a copy of remove_special_characters that performs the actual substitution:
```python
# Excerpt from inside main() of run_speech_recognition_ctc.py, where re,
# text_column_name, chars_to_ignore_regex and data_args are already defined.
def remove_special_characters(batch):
    text = batch[text_column_name]
    if chars_to_ignore_regex is not None:
        text = re.sub(chars_to_ignore_regex, "", batch[text_column_name])
    batch["target_text"] = text.lower() + " "
    return batch

def substitute_characters(batch):
    text: str = batch["target_text"]
    if data_args.chars_to_substitute is not None:
        for k, v in data_args.chars_to_substitute.items():
            # str.replace returns a new string, so assign the result back.
            text = text.replace(k, v)
    batch["target_text"] = text.lower()
    return batch

with training_args.main_process_first(desc="dataset map special characters removal"):
    raw_datasets = raw_datasets.map(
        remove_special_characters,
        remove_columns=[text_column_name],
        desc="remove special characters from datasets",
    )

with training_args.main_process_first(desc="dataset map special characters substitute"):
    raw_datasets = raw_datasets.map(
        substitute_characters,
        desc="substitute special characters in datasets",
    )
```
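To see what the two steps do to a single sentence, here is a small standalone sketch. It uses only a subset of the chars_to_ignore and chars_to_substitute values from train.json, and the `normalize` helper is illustrative rather than part of the training script. Substitutions are applied in dictionary order, so longer sequences such as "przy" need to appear before shorter ones such as "y".

```python
import re

# Subset of the values in train.json, shortened for the demonstration.
CHARS_TO_IGNORE = "-!\"'(),.:;=?_"
CHARS_TO_SUBSTITUTE = {"przy": "pŝe", "cx": "ĉ", "sx": "ŝ", "y": "j", "qu": "ku"}

# re.escape keeps characters such as "-" literal inside the character class.
IGNORE_REGEX = re.compile(f"[{re.escape(CHARS_TO_IGNORE)}]")


def normalize(sentence: str) -> str:
    """Drop ignored characters, lowercase, then apply substitutions in order."""
    text = IGNORE_REGEX.sub("", sentence).lower()
    for old, new in CHARS_TO_SUBSTITUTE.items():
        text = text.replace(old, new)
    return text


print(normalize("Cxu vi sxatas yogurton?"))
```

Unlike the script, this sketch does not append the trailing space that remove_special_characters adds to each target.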
Training used the hyperparameters in train.json above. The results at each evaluation step were:
| Training Loss | Epoch | Step | Cer | Validation Loss | Wer |
|---|---|---|---|---|---|
| 2.6416 | 2.13 | 1000 | 0.1541 | 0.8599 | 0.6449 |
| 0.2633 | 4.27 | 2000 | 0.0335 | 0.1897 | 0.1431 |
| 0.1739 | 6.4 | 3000 | 0.0289 | 0.1732 | 0.1145 |
| 0.1378 | 8.53 | 4000 | 0.0276 | 0.1729 | 0.1066 |
| 0.1172 | 10.67 | 5000 | 0.0268 | 0.1773 | 0.1019 |
| 0.1049 | 12.8 | 6000 | 0.0255 | 0.1701 | 0.0937 |
| 0.0951 | 14.93 | 7000 | 0.0253 | 0.1718 | 0.0933 |
| 0.0851 | 17.07 | 8000 | 0.0239 | 0.1787 | 0.0834 |
| 0.0809 | 19.2 | 9000 | 0.0235 | 0.1802 | 0.0835 |
| 0.0756 | 21.33 | 10000 | 0.0239 | 0.1784 | 0.0855 |
| 0.0708 | 23.47 | 11000 | 0.0235 | 0.1748 | 0.0824 |
| 0.0657 | 25.6 | 12000 | 0.0228 | 0.1830 | 0.0796 |
| 0.0605 | 27.73 | 13000 | 0.0230 | 0.1896 | 0.0798 |
| 0.0583 | 29.87 | 14000 | 0.0224 | 0.1889 | 0.0778 |
| 0.0608 | 32.0 | 15000 | 0.0223 | 0.1849 | 0.0757 |
| 0.0556 | 34.13 | 16000 | 0.0223 | 0.1872 | 0.0767 |
| 0.0534 | 36.27 | 17000 | 0.0221 | 0.1893 | 0.0751 |
| 0.0523 | 38.4 | 18000 | 0.0218 | 0.1925 | 0.0729 |
| 0.0494 | 40.53 | 19000 | 0.0221 | 0.1957 | 0.0745 |
| 0.0475 | 42.67 | 20000 | 0.0217 | 0.1961 | 0.0740 |
| 0.048 | 44.8 | 21000 | 0.0214 | 0.1957 | 0.0714 |
| 0.0459 | 46.93 | 22000 | 0.0215 | 0.1968 | 0.0717 |
| 0.0435 | 49.07 | 23000 | 0.0217 | 0.2008 | 0.0717 |
| 0.0428 | 51.2 | 24000 | 0.0212 | 0.1991 | 0.0696 |
| 0.0418 | 53.33 | 25000 | 0.0215 | 0.2034 | 0.0714 |
| 0.0404 | 55.47 | 26000 | 0.0210 | 0.2014 | 0.0684 |
| 0.0394 | 57.6 | 27000 | 0.0210 | 0.2050 | 0.0681 |
| 0.0399 | 59.73 | 28000 | 0.0211 | 0.2039 | 0.0700 |
| 0.0389 | 61.87 | 29000 | 0.0214 | 0.2091 | 0.0694 |
| 0.038 | 64.0 | 30000 | 0.0210 | 0.2100 | 0.0702 |
| 0.0361 | 66.13 | 31000 | 0.0215 | 0.2119 | 0.0703 |
| 0.0359 | 68.27 | 32000 | 0.0213 | 0.2108 | 0.0714 |
| 0.0354 | 70.4 | 33000 | 0.0211 | 0.2120 | 0.0699 |
| 0.0364 | 72.53 | 34000 | 0.0211 | 0.2128 | 0.0688 |
| 0.0361 | 74.67 | 35000 | 0.0212 | 0.2134 | 0.0694 |
| 0.0332 | 76.8 | 36000 | 0.0210 | 0.2176 | 0.0698 |
| 0.0341 | 78.93 | 37000 | 0.0208 | 0.2170 | 0.0688 |
| 0.032 | 81.07 | 38000 | 0.0209 | 0.2157 | 0.0686 |
| 0.0318 | 83.33 | 39000 | 0.0209 | 0.2166 | 0.0685 |
| 0.0325 | 85.47 | 40000 | 0.0209 | 0.2172 | 0.0687 |
| 0.0316 | 87.6 | 41000 | 0.0208 | 0.2181 | 0.0678 |
| 0.0302 | 89.73 | 42000 | 0.0208 | 0.2171 | 0.0679 |
| 0.0318 | 91.87 | 43000 | 0.0211 | 0.2179 | 0.0702 |
| 0.0314 | 94.0 | 44000 | 0.0208 | 0.2186 | 0.0690 |
| 0.0309 | 96.13 | 45000 | 0.0210 | 0.2193 | 0.0696 |
| 0.031 | 98.27 | 46000 | 0.0208 | 0.2191 | 0.0686 |
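While debugging other training sessions on the Esperanto Common Voice dataset, in which some loss computations returned inf or nan, I found that some training samples have a very high CER when transcribed with this model. Some examples: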
| file | Actual --- Predicted | CER | Comment |
|---|---|---|---|
| common_voice_eo_25365027.mp3 | en la hansaj agentejoj komercistoj el la regiono renkontis kolegojn el aliaj regionoj --- a taaj keo eoj eejn kigos eegoj eioeegiooj | 0.61 | No audio |
| common_voice_eo_25365472.mp3 | ili vendas armilojn kaj teknologiojn al la fanatikuloj por gajni monon monon monon --- ila mamato aiil ajn kno ion a a aotigojn pu aiooo aj knon | 0.55 | Barely any audio, distorted |
| common_voice_eo_25365836.mp3 | industria apliko estas la kreado de modifitaj bakterioj kiuj produktas deziratan kemian substancon --- iiti sieetas la eeadooddddooiooaotooeioj aiicenon | 0.67 | Barely any audio, distorted |
| 2600 | ili akiras plenkreskan plumaron nur en la kvina jaro --- ili aaros peetaj patato a a sia ro | 0.52 | It's literally someone saying 'injabum'. Thanks, troll. |
| 7333 | poste sekvas difinoj de la termino --- po | 0.94 | No audio |
| 7334 | li gvidis multajn kursojn laŭ la csehmetodo --- po | 0.98 | No audio |
| 7429 | tamen pro la rekonstruo de kluzoj ne eblas trapasi komplete --- po | 0.97 | No audio |
| 11662 | lingvotesto estas postulata ekzemple por akceptiĝo en anglalingvaj altlernejoj --- linkonteto estastitot etateerteito en pootaeaje lgijoj | 0.58 | No audio |
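Some examples have no audio. All such files in the dataset are completely useless and should be removed from the training set.
You can see the model trying to imagine the target content when there is barely any audio. This is very bad for faithfully reporting what was said. I would also like some measure of certainty, so that perhaps only transcriptions with relatively high certainty are accepted. However, I could not find a way to obtain such a value.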
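One heuristic that is sometimes used for this, and which was not tried for this model, is to look at how peaked the CTC output distribution is in each frame. The sketch below is only a possible starting point: `logits` stands for the model's output for one utterance as a list of frames, each a list of per-token scores, and both function names are illustrative.

```python
import math


def frame_confidences(logits):
    """Softmax each frame's scores and keep the highest probability."""
    confidences = []
    for frame in logits:
        peak = max(frame)
        exps = [math.exp(x - peak) for x in frame]  # shift for numerical stability
        confidences.append(max(exps) / sum(exps))
    return confidences


def utterance_confidence(logits):
    """Geometric mean of the per-frame maximum probabilities, between 0 and 1."""
    confs = frame_confidences(logits)
    return math.exp(sum(math.log(c) for c in confs) / len(confs))


# Toy example with a 2-token vocabulary: peaked frames versus flat frames.
peaked = [[10.0, -10.0], [-10.0, 10.0]]
flat = [[0.0, 0.0], [0.0, 0.0]]
print(utterance_confidence(peaked), utterance_confidence(flat))
```

Softmax peaks tend to be overconfident, so any threshold on this score would need to be calibrated against recordings that are known to be good or bad.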
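The Common Voice dataset also includes upvote and downvote counts. All of the high-CER sentences above have 2 upvotes; some have no downvotes and some have 1 downvote. So we cannot rely on upvotes or downvotes to detect quality.
So what to do?
Despite these empty and low-quality files, training seems to proceed normally. However, we still need to deal with the loss becoming inf or nan, because that breaks the computation.
By running run_speech_recognition_ctc with do_train=false and model_name_or_path="xekri/wav2vec2-common_voice_13_0-eo-3", setting eval_split_name to test, validation, or train, and modifying trainer.py as follows, I can check whether any loss is nan or inf: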
```python
# To be JSON-serializable, we need to remove numpy types or zero-d tensors
metrics = denumpify_detensorize(metrics)

if all_losses is not None:
    # np.where returns a tuple of index arrays whose len() is never 0,
    # so collect the indices as a flat array and test its size instead.
    loss_nan = np.flatnonzero(np.isnan(all_losses))
    if loss_nan.size != 0:
        print(f'LOSSES ARE NAN: {loss_nan}')
    loss_inf = np.flatnonzero(np.isinf(all_losses))
    if loss_inf.size != 0:
        print(f'LOSSES ARE INF: {loss_inf}')
    metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
```
Doing this shows that, of the 14913 samples in the test set, one has a loss of inf:
common_voice_eo_25167318.mp3
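This audio is severely distorted. This sample should be filtered out of the test set.
No sample in the validation set has a loss of inf or nan.
Of the 143984 samples in the training set, 18 have a loss of inf: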
common_voice_eo_25467641.mp3 common_voice_eo_25467723.mp3 common_voice_eo_25467791.mp3 common_voice_eo_25467820.mp3 common_voice_eo_25467943.mp3 common_voice_eo_25478612.mp3 common_voice_eo_25478623.mp3 common_voice_eo_25478631.mp3 common_voice_eo_25478756.mp3 common_voice_eo_25478762.mp3 common_voice_eo_25478768.mp3 common_voice_eo_25478769.mp3 common_voice_eo_25479150.mp3 common_voice_eo_25479203.mp3 common_voice_eo_25479229.mp3 common_voice_eo_25517673.mp3 common_voice_eo_25517677.mp3 common_voice_eo_25527739.mp3
These files have no audio.
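One possible explanation, which was not verified for these files, is that the decoded audio is too short. CTC loss is infinite when the model produces fewer output frames than the target needs, which is at least one frame per character plus one extra frame between each pair of identical adjacent characters. The sketch below assumes the default wav2vec2 feature-encoder kernel sizes and strides; the function names are illustrative.

```python
# Default wav2vec2 feature-encoder configuration (assumed unchanged here).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Frames produced by the convolutional feature encoder for raw audio."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return max(length, 0)


def min_frames_for_target(target: str) -> int:
    """One frame per character, plus a blank between repeated characters."""
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def ctc_alignable(num_samples: int, target: str) -> bool:
    """True if the audio is long enough for a finite CTC loss."""
    return num_output_frames(num_samples) >= min_frames_for_target(target)


# One second of 16 kHz audio versus a 25 ms clip.
print(num_output_frames(16_000), num_output_frames(400))
print(ctc_alignable(400, "poste sekvas difinoj de la termino"))
```

If this is the cause, the Wav2Vec2 configuration also has a `ctc_zero_infinity` option that zeroes infinite losses, although that hides the bad samples rather than removing them.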
Another possibility is to go through the audio files and discard those whose peak amplitude is below some threshold.
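A minimal sketch of such a check is shown below. It assumes the audio has already been decoded to floating-point samples in the range -1 to 1, and the threshold of 0.01 is a placeholder that would need tuning against real recordings.

```python
def peak_amplitude(samples) -> float:
    """Largest absolute sample value; 0.0 for an empty clip."""
    return max((abs(s) for s in samples), default=0.0)


def is_effectively_silent(samples, threshold: float = 0.01) -> bool:
    """True if the clip never rises above the (hypothetical) threshold."""
    return peak_amplitude(samples) < threshold


quiet_clip = [0.0, 0.001, -0.002, 0.0005]
speech_clip = [0.0, 0.12, -0.4, 0.25]
print(is_effectively_silent(quiet_clip), is_effectively_silent(speech_clip))
```

A peak threshold catches empty recordings but not distorted ones, which can be loud and still unusable.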
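Since this model seems to be good enough, I can run inference on all samples and discard those whose CER (as determined by this model) is too high, say above 0.5, and then train another model on the filtered samples. The high-CER examples are:
Test set: 71 of the 14913 samples in the test set show a high CER.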
common_voice_eo_25214319.mp3 common_voice_eo_25006596.mp3 common_voice_eo_27472721.mp3 common_voice_eo_27715088.mp3 common_voice_eo_27715091.mp3 common_voice_eo_26677019.mp3 common_voice_eo_26677023.mp3 common_voice_eo_20555291.mp3 common_voice_eo_25001942.mp3 common_voice_eo_25457354.mp3 common_voice_eo_25457355.mp3 common_voice_eo_25457365.mp3 common_voice_eo_25457373.mp3 common_voice_eo_25457396.mp3 common_voice_eo_25457397.mp3 common_voice_eo_25457409.mp3 common_voice_eo_25457410.mp3 common_voice_eo_25457412.mp3 common_voice_eo_25457442.mp3 common_voice_eo_25457444.mp3 common_voice_eo_25457445.mp3 common_voice_eo_25457577.mp3 common_voice_eo_25457578.mp3 common_voice_eo_28064453.mp3 common_voice_eo_25047803.mp3 common_voice_eo_25048418.mp3 common_voice_eo_25048419.mp3 common_voice_eo_25048421.mp3 common_voice_eo_25048423.mp3 common_voice_eo_25048428.mp3 common_voice_eo_25048574.mp3 common_voice_eo_25885643.mp3 common_voice_eo_25885645.mp3 common_voice_eo_26794882.mp3 common_voice_eo_27356529.mp3 common_voice_eo_25012640.mp3 common_voice_eo_25303457.mp3 common_voice_eo_18153931.mp3 common_voice_eo_18776206.mp3 common_voice_eo_18776208.mp3 common_voice_eo_18776219.mp3 common_voice_eo_18776220.mp3 common_voice_eo_18776222.mp3 common_voice_eo_18776223.mp3 common_voice_eo_18776236.mp3 common_voice_eo_18776238.mp3 common_voice_eo_18776244.mp3 common_voice_eo_18776248.mp3 common_voice_eo_18776285.mp3 common_voice_eo_18776287.mp3 common_voice_eo_18776297.mp3 common_voice_eo_18776298.mp3 common_voice_eo_25047998.mp3 common_voice_eo_25047999.mp3 common_voice_eo_25048000.mp3 common_voice_eo_25048001.mp3 common_voice_eo_25048002.mp3 common_voice_eo_25053113.mp3 common_voice_eo_25068355.mp3 common_voice_eo_25333056.mp3 common_voice_eo_25371639.mp3 common_voice_eo_25371640.mp3 common_voice_eo_25371641.mp3 common_voice_eo_25371642.mp3 common_voice_eo_25371643.mp3 common_voice_eo_22441946.mp3 common_voice_eo_26622121.mp3 common_voice_eo_25167318.mp3 
common_voice_eo_25252685.mp3 common_voice_eo_25252698.mp3 common_voice_eo_25518636.mp3
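A note on two of these examples: we know that "saluton kiel vi fartas" ("hello, how are you") and "atendu momenton" ("wait a moment") are good places to start learning Esperanto, but if that is not the text you were asked to record, you are not really helping anyone.
Validation set: 17 of the 14909 samples in the validation set show a high CER.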
common_voice_eo_25392669.mp3 common_voice_eo_25392674.mp3 common_voice_eo_25392675.mp3 common_voice_eo_25392676.mp3 common_voice_eo_25392678.mp3 common_voice_eo_25392693.mp3 common_voice_eo_25392694.mp3 common_voice_eo_25392695.mp3 common_voice_eo_25392697.mp3 common_voice_eo_25392701.mp3 common_voice_eo_25392702.mp3 common_voice_eo_25392708.mp3 common_voice_eo_25392709.mp3 common_voice_eo_25408881.mp3 common_voice_eo_25408882.mp3 common_voice_eo_25408885.mp3 common_voice_eo_27380623.mp3
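I did not include some samples whose high CER was caused by hallucinations on single-word recordings with a lot of silence before and after the word. Those recordings themselves are fine.
Training set: 135 of the 143984 samples in the training set show a high CER. I removed from this list some samples that had a high CER but sounded fine.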
common_voice_eo_25365027.mp3 common_voice_eo_25365472.mp3 common_voice_eo_25365480.mp3 common_voice_eo_25365532.mp3 common_voice_eo_25365695.mp3 common_voice_eo_25365744.mp3 common_voice_eo_25365804.mp3 common_voice_eo_25365836.mp3 common_voice_eo_25365855.mp3 common_voice_eo_25372587.mp3 common_voice_eo_25401060.mp3 common_voice_eo_25430837.mp3 common_voice_eo_25444509.mp3 common_voice_eo_25240777.mp3 common_voice_eo_24942754.mp3 common_voice_eo_24942755.mp3 common_voice_eo_24990372.mp3 common_voice_eo_24990385.mp3 common_voice_eo_24990390.mp3 common_voice_eo_24990397.mp3 common_voice_eo_24990413.mp3 common_voice_eo_24990427.mp3 common_voice_eo_24990429.mp3 common_voice_eo_24990435.mp3 common_voice_eo_24990441.mp3 common_voice_eo_24990454.mp3 common_voice_eo_24990457.mp3 common_voice_eo_24990459.mp3 common_voice_eo_24990490.mp3 common_voice_eo_25529345.mp3 common_voice_eo_25648750.mp3 common_voice_eo_28670472.mp3 common_voice_eo_27931966.mp3 common_voice_eo_28252265.mp3 common_voice_eo_25454951.mp3 common_voice_eo_25927616.mp3 common_voice_eo_25153203.mp3 common_voice_eo_25238543.mp3 common_voice_eo_25284237.mp3 common_voice_eo_25460131.mp3 common_voice_eo_25460185.mp3 common_voice_eo_25460186.mp3 common_voice_eo_25460188.mp3 common_voice_eo_25460189.mp3 common_voice_eo_25446723.mp3 common_voice_eo_26025150.mp3 common_voice_eo_26640189.mp3 common_voice_eo_26888468.mp3 common_voice_eo_24844824.mp3 common_voice_eo_25022506.mp3 common_voice_eo_25022507.mp3 common_voice_eo_25022516.mp3 common_voice_eo_25032858.mp3 common_voice_eo_25032859.mp3 common_voice_eo_25032865.mp3 common_voice_eo_25243988.mp3 common_voice_eo_25244009.mp3 common_voice_eo_25266094.mp3 common_voice_eo_25266141.mp3 common_voice_eo_25285278.mp3 common_voice_eo_25286768.mp3 common_voice_eo_25457171.mp3 common_voice_eo_25467641.mp3 common_voice_eo_25467723.mp3 common_voice_eo_25467791.mp3 common_voice_eo_25467820.mp3 common_voice_eo_25467943.mp3 common_voice_eo_25478612.mp3 
common_voice_eo_25478623.mp3 common_voice_eo_25478631.mp3 common_voice_eo_25478756.mp3 common_voice_eo_25478762.mp3 common_voice_eo_25478768.mp3 common_voice_eo_25478769.mp3 common_voice_eo_25479150.mp3 common_voice_eo_25479203.mp3 common_voice_eo_25479229.mp3 common_voice_eo_25517673.mp3 common_voice_eo_25517677.mp3 common_voice_eo_25527739.mp3 common_voice_eo_25975149.mp3 common_voice_eo_26193748.mp3 common_voice_eo_28401039.mp3 common_voice_eo_28421315.mp3 common_voice_eo_28937347.mp3 common_voice_eo_24890414.mp3 common_voice_eo_25294479.mp3 common_voice_eo_25438966.mp3 common_voice_eo_28855568.mp3 common_voice_eo_29011007.mp3 common_voice_eo_24599888.mp3 common_voice_eo_26964252.mp3 common_voice_eo_26964496.mp3 common_voice_eo_26964510.mp3 common_voice_eo_25432789.mp3 common_voice_eo_26688158.mp3 common_voice_eo_28516354.mp3 common_voice_eo_24790865.mp3 common_voice_eo_24790897.mp3 common_voice_eo_24790898.mp3 common_voice_eo_24790899.mp3 common_voice_eo_24790900.mp3 common_voice_eo_25362713.mp3 common_voice_eo_27585084.mp3 common_voice_eo_24813131.mp3 common_voice_eo_25035262.mp3 common_voice_eo_26000289.mp3 common_voice_eo_26003943.mp3 common_voice_eo_26283983.mp3 common_voice_eo_28708931.mp3 common_voice_eo_28037217.mp3 common_voice_eo_29273106.mp3 common_voice_eo_26006657.mp3 common_voice_eo_25399924.mp3 common_voice_eo_27982431.mp3 common_voice_eo_25893779.mp3 common_voice_eo_27842061.mp3 common_voice_eo_25052385.mp3 common_voice_eo_25807395.mp3 common_voice_eo_25807985.mp3 common_voice_eo_25808039.mp3 common_voice_eo_25808407.mp3 common_voice_eo_25809036.mp3 common_voice_eo_27487795.mp3 common_voice_eo_28460556.mp3 common_voice_eo_28884851.mp3 common_voice_eo_24819719.mp3 common_voice_eo_25153594.mp3 common_voice_eo_25234585.mp3 common_voice_eo_25245164.mp3 common_voice_eo_27538877.mp3 common_voice_eo_24862771.mp3 common_voice_eo_25070167.mp3 common_voice_eo_26381720.mp3 common_voice_eo_28110376.mp3
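The CER-based filtering described above could be sketched as follows. It assumes the per-file CER values have already been computed with this model, and that each dataset example exposes the audio location in a `path` field, as Common Voice examples do. The helper names and the low-CER file in the demo are hypothetical.

```python
import os


def select_bad_files(cer_by_file, threshold=0.5, sounds_fine=()):
    """File names whose CER exceeds the threshold, minus manually cleared ones."""
    cleared = set(sounds_fine)
    return {
        name
        for name, cer in cer_by_file.items()
        if cer > threshold and name not in cleared
    }


def make_keep_predicate(bad_files):
    """Build a per-example predicate that rejects the listed files."""
    def keep(example):
        return os.path.basename(example["path"]) not in bad_files
    return keep


# CER values for the first three files come from the table of examples above.
cer_by_file = {
    "common_voice_eo_25365027.mp3": 0.61,
    "common_voice_eo_25365472.mp3": 0.55,
    "common_voice_eo_25365836.mp3": 0.67,
    "hypothetical_clean_clip.mp3": 0.02,  # made-up entry for the demo
}
bad_files = select_bad_files(cer_by_file)
keep = make_keep_predicate(bad_files)
print(sorted(bad_files))
```

A predicate of this shape can be passed to the `filter` method of a Hugging Face dataset. The `sounds_fine` argument mirrors the manual step of removing samples that had a high CER but sounded normal.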
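For files with no audio or distorted audio, perhaps their targets could be set to empty? Except for 'injabum'.
Since anyone can sign up on Common Voice to review Esperanto audio files, I have done so, hoping to contribute a little to the quality.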