模型:
hyyoka/wav2vec2-xlsr-korean-senior
Futher fine-tuned fleek/wav2vec-large-xlsr-korean using the AIhub 자유대화 음성(노인남녀) .
When using this model, make sure that your speech input is sampled at 16kHz.
The script used for training can be found here: https://github.com/hyyoka/wav2vec2-korean-senior
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import re
def clean_up(transcription):
hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')
result = hangul.sub('', transcription)
return result
model_name "hyyoka/wav2vec2-xlsr-korean-senior"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
speech_array, sampling_rate = torchaudio.load(wav_file)
feat = processor(speech_array[0],
sampling_rate=16000,
padding=True,
max_length=800000,
truncation=True,
return_attention_mask=True,
return_tensors="pt",
pad_token_id=49
)
input = {'input_values': feat['input_values'],'attention_mask':feat['attention_mask']}
outputs = model(**input, output_attentions=True)
logits = outputs.logits
predicted_ids = logits.argmax(axis=-1)
transcription = processor.decode(predicted_ids[0])
stt_result = clean_up(transcription)
迅速微调使用
AIhub 자유대화 음성(노인남녀)
的
fleek/wav2vec-large-xlsr-korean
。总训练数据大小:808,642;总验证数据大小:159,970。使用此模型时,请确保您的语音输入采样率为16kHz。训练所用的脚本可在此处找到:
https://github.com/hyyoka/wav2vec2-korean-senior
。推断:import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import re
def clean_up(transcription):
hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')
result = hangul.sub('', transcription)
return result
model_name "hyyoka/wav2vec2-xlsr-korean-senior"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
speech_array, sampling_rate = torchaudio.load(wav_file)
feat = processor(speech_array[0],
sampling_rate=16000,
padding=True,
max_length=800000,
truncation=True,
return_attention_mask=True,
return_tensors="pt",
pad_token_id=49
)
input = {'input_values': feat['input_values'],'attention_mask':feat['attention_mask']}
outputs = model(**input, output_attentions=True)
logits = outputs.logits
predicted_ids = logits.argmax(axis=-1)
transcription = processor.decode(predicted_ids[0])
stt_result = clean_up(transcription)
。