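This is a general-purpose Estonian speech recognition model trained at the Lab of Language Technology at TalTech.
The model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, and talks.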
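```python
import soundfile
from espnet2.bin.asr_inference import Speech2Text

# Download the pretrained model and set the decoding parameters
model = Speech2Text.from_pretrained(
    "TalTechNLP/espnet2_estonian",
    lm_weight=0.6, ctc_weight=0.4, beam_size=60
)

# Read a sound file with a 16 kHz sample rate
speech, rate = soundfile.read("speech.wav")
assert rate == 16000

# The model returns a list of n-best hypotheses; each hypothesis is a
# tuple whose first element is the transcript
best_hypothesis, *_ = model(speech)
print(best_hypothesis[0])
```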
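The usage snippet asserts a 16 kHz sample rate, so it can help to inspect a file before loading the model. The following is a minimal sketch that uses only Python's standard `wave` module. The helper names `describe_wav` and `wav_problems` are hypothetical, the sketch only handles uncompressed PCM WAV files, and the mono check is an assumption rather than a documented requirement of the model.

```python
import wave

EXPECTED_RATE = 16000  # sample rate asserted in the usage snippet


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def wav_problems(path, expected_rate=EXPECTED_RATE):
    """List reasons the file may need conversion before decoding.

    Hypothetical helper: the mono requirement is an assumption.
    """
    rate, channels, _ = describe_wav(path)
    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

Calling `wav_problems("speech.wav")` returns an empty list when the file matches these expectations; otherwise each entry names something to convert first.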
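Limitations and bias:
Because the model was trained mainly on broadcast speech and web text, it may have problems correctly decoding other kinds of speech.

Acoustic training data: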
| Type | Amount (h) |
|---|---|
| Broadcast speech | 591 |
| Spontaneous speech | 53 |
| Elderly speech corpus | 53 |
| Talks, lectures | 49 |
| Parliament speeches | 31 |
| Total | 761 |
Language model training data:

The model was trained with the standard ESPnet2 Conformer recipe.
| Dataset | Snt | Wrd | Corr (%) | Sub (%) | Del (%) | Ins (%) | Err (%) | S.Err (%) |
|---|---|---|---|---|---|---|---|---|
| decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/aktuaalne2021.testset | 2864 | 56575 | 93.1 | 4.5 | 2.4 | 2.0 | 8.9 | 63.4 |
| decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.devset | 273 | 4677 | 93.9 | 3.6 | 2.4 | 1.2 | 7.3 | 46.5 |
| decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.testset | 818 | 11093 | 94.7 | 2.7 | 2.5 | 0.9 | 6.2 | 45.0 |
| decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.devset | 1207 | 13865 | 82.3 | 8.5 | 9.3 | 3.4 | 21.2 | 74.1 |
| decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.testset | 1648 | 22707 | 86.4 | 7.6 | 6.0 | 2.5 | 16.1 | 75.7 |
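
The Corr, Sub, Del, Ins, and Err columns look like the usual word-level scoring breakdown, in which Err is the sum of Sub, Del, and Ins as a share of reference words (in the first row, 4.5 + 2.4 + 2.0 = 8.9). As an illustration of how such figures are typically derived, here is a small sketch that aligns a reference and a hypothesis by edit distance. The function names are hypothetical, and this is not the scoring tool that produced the table. When several alignments have the same cost, the split between substitutions, deletions, and insertions may differ from other tools.

```python
def align_counts(ref_words, hyp_words):
    """Return (substitutions, deletions, insertions) for one minimum-cost alignment."""
    rows, cols = len(ref_words) + 1, len(hyp_words) + 1
    # dp[i][j] = (total_edits, subs, dels, ins) for ref_words[:i] vs hyp_words[:j]
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = dp[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                diagonal = (t, s, d, n)          # match, no cost
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Keep the cheapest option; ties prefer the diagonal move
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, subs, dels, ins = dp[-1][-1]
    return subs, dels, ins


def score(reference, hypothesis):
    """Return Corr, Sub, Del, Ins and Err as percentages of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = align_counts(ref_words, hyp_words)
    n = len(ref_words)
    return {
        "Corr": 100.0 * (n - subs - dels) / n,
        "Sub": 100.0 * subs / n,
        "Del": 100.0 * dels / n,
        "Ins": 100.0 * ins / n,
        "Err": 100.0 * (subs + dels + ins) / n,
    }
```

For instance, `score("a b c d", "a x c")` counts one substitution and one deletion against four reference words, so Err comes out at 50 percent.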
```bibtex
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```