Dataset:

cdminix/libritts-r-aligned

Other:

automatic-speech-recognition audio speech

License:

cc-by-4.0

ArXiv:

arxiv:2211.16049 arxiv:1904.02882

Language:

Annotations Creators:

crowdsourced

Tasks:

Text-to-Speech

Automatic Speech Recognition


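This dataset is identical to cdminix/libritts-aligned, except that it uses the newly released LibriTTS-R corpus. Please cite Y. Koizumi, et al., "LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus", Interspeech 2023.

When using this dataset to download LibriTTS-R, make sure you agree to the terms on https://www.openslr.org.

Dataset Card for LibriTTS-R with Forced Alignments (and Measures)

This dataset downloads LibriTTS-R and preprocesses it on your machine, creating alignments with the Montreal Forced Aligner. You need to run pip install alignments phones before using this dataset. The first run can take an hour or two, but subsequent runs will be very fast.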

Requirements

pip install alignments phones (required)
pip install speech-collator (optional)

Note: This corpus requires alignments version >= 0.0.15.
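The version requirement above can be checked before loading the dataset. The following is a minimal sketch using only the standard library; the helper names (version_tuple, meets_minimum, check_alignments) are hypothetical and not part of the alignments package, and the version parsing is deliberately simple (it compares only the first three numeric components).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum version of the `alignments` package stated in this card.
MIN_ALIGNMENTS = (0, 0, 15)


def version_tuple(text):
    """Turn a string such as '0.0.15' into (0, 0, 15), ignoring suffixes."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def meets_minimum(installed, minimum=MIN_ALIGNMENTS):
    """Return True if the installed version string is at least `minimum`."""
    return version_tuple(installed) >= minimum


def check_alignments():
    """Return True if `alignments` is installed and recent enough."""
    try:
        installed = version("alignments")
    except PackageNotFoundError:
        return False
    return meets_minimum(installed)


if not check_alignments():
    print("Please run: pip install -U alignments phones")
```

If the check fails, upgrading with pip before the first (long) preprocessing run avoids having to start over.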

Example Item

{
    'id': '100_122655_000073_000002.wav',
    'speaker': '100',
    'text': 'the day after, diana and mary quitted it for distant b.',
    'start': 0.0,
    'end': 3.6500000953674316, 
    'phones': ['[SILENCE]', 'ð', 'ʌ', '[SILENCE]', 'd', 'eɪ', '[SILENCE]', 'æ', 'f', 't', 'ɜ˞', '[COMMA]', 'd', 'aɪ', 'æ', 'n', 'ʌ', '[SILENCE]', 'æ', 'n', 'd', '[SILENCE]', 'm', 'ɛ', 'ɹ', 'i', '[SILENCE]', 'k', 'w', 'ɪ', 't', 'ɪ', 'd', '[SILENCE]', 'ɪ', 't', '[SILENCE]', 'f', 'ɜ˞', '[SILENCE]', 'd', 'ɪ', 's', 't', 'ʌ', 'n', 't', '[SILENCE]', 'b', 'i', '[FULL STOP]'], 
    'phone_durations': [5, 2, 4, 0, 5, 13, 0, 16, 7, 5, 20, 2, 6, 9, 15, 4, 2, 0, 11, 3, 5, 0, 3, 8, 9, 8, 0, 13, 3, 5, 3, 6, 4, 0, 8, 5, 0, 9, 5, 0, 7, 5, 6, 7, 4, 5, 10, 0, 3, 35, 9],
    'audio': '/dev/shm/metts/train-clean-360-alignments/100/100_122655_000073_000002.wav'
}

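The phones are IPA phones, and the phone durations are given in frames (assuming a hop length of 256, a sample rate of 22050 and a window length of 1024). These attributes can be changed using the hop_length, sample_rate and window_length arguments to LibriTTSAlign.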
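Because durations are stored in frames, converting them to seconds only needs the hop length and sample rate. The sketch below applies this to the example item above, assuming the default values stated in this card; frames_to_seconds is a hypothetical helper, not part of the dataset code, and the result is approximate because framing and padding are ignored.

```python
# Default framing parameters stated in this card.
HOP_LENGTH = 256
SAMPLE_RATE = 22050


def frames_to_seconds(frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Convert a frame count to seconds: each frame advances by hop_length samples."""
    return frames * hop_length / sample_rate


# 'phone_durations' copied from the example item.
phone_durations = [5, 2, 4, 0, 5, 13, 0, 16, 7, 5, 20, 2, 6, 9, 15, 4, 2, 0,
                   11, 3, 5, 0, 3, 8, 9, 8, 0, 13, 3, 5, 3, 6, 4, 0, 8, 5, 0,
                   9, 5, 0, 7, 5, 6, 7, 4, 5, 10, 0, 3, 35, 9]

total_frames = sum(phone_durations)
total_seconds = frames_to_seconds(total_frames)

# The total should be close to the item's 'end' value (about 3.65 s).
print(total_frames, round(total_seconds, 3))
```

The summed durations (314 frames) come out within a few milliseconds of the item's end time, which is a quick sanity check that the durations and the audio agree.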

Data Collator

This dataset comes with a data collator that can be used to create batches for training. It can be installed with pip install speech-collator (MiniXC/speech-collator) and used as follows:

import json
from datasets import load_dataset
from speech_collator import SpeechCollator
from torch.utils.data import DataLoader

dataset = load_dataset("cdminix/libritts-r-aligned", split="train")

# Load the speaker and phone index mappings
with open("speaker2idx.json") as f:
    speaker2idx = json.load(f)
with open("phone2idx.json") as f:
    phone2idx = json.load(f)

collator = SpeechCollator(
    speaker2idx=speaker2idx,
    phone2idx=phone2idx,
)
dataloader = DataLoader(dataset, collate_fn=collator.collate_fn, batch_size=8)

You can either download the speaker2idx.json and phone2idx.json files from here, or create them yourself using the following code:

import json
from datasets import load_dataset
from speech_collator import SpeechCollator, create_speaker2idx, create_phone2idx

dataset = load_dataset("cdminix/libritts-r-aligned", split="train")

# Create speaker2idx and phone2idx
speaker2idx = create_speaker2idx(dataset, unk_idx=0)
phone2idx = create_phone2idx(dataset, unk_idx=0)

# save to json
with open("speaker2idx.json", "w") as f:
    json.dump(speaker2idx, f)
with open("phone2idx.json", "w") as f:
    json.dump(phone2idx, f)

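Measures

When using speech-collator, you can also use the measures argument to specify which measures to compute. The following example extracts pitch and energy on the fly.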

import json
from torch.utils.data import DataLoader
from datasets import load_dataset
from speech_collator import SpeechCollator, create_speaker2idx, create_phone2idx
from speech_collator.measures import PitchMeasure, EnergyMeasure

dataset = load_dataset("cdminix/libritts-r-aligned", split="train")

speaker2idx = json.load(open("data/speaker2idx.json"))
phone2idx = json.load(open("data/phone2idx.json"))

# Create SpeechCollator
speech_collator = SpeechCollator(
    speaker2idx=speaker2idx,
    phone2idx=phone2idx,
    measures=[PitchMeasure(), EnergyMeasure()],
    return_keys=["measures"]
)

# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=8,
    collate_fn=speech_collator.collate_fn,
)

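Coming soon: detailed documentation on how to use the measures. In the meantime, see MiniXC/speech-collator.

Splits

This dataset has the following splits:

train: All the training data, except one sample per speaker, which is used for validation.
dev: The validation data, one sample per speaker.
train.clean.100: Training set derived from the original materials of the train-clean-100 subset of LibriSpeech.
train.clean.360: Training set derived from the original materials of the train-clean-360 subset of LibriSpeech.
train.other.500: Training set derived from the original materials of the train-other-500 subset of LibriSpeech.
dev.clean: Validation set derived from the original materials of the dev-clean subset of LibriSpeech.
dev.other: Validation set derived from the original materials of the dev-other subset of LibriSpeech.
test.clean: Test set derived from the original materials of the test-clean subset of LibriSpeech.
test.other: Test set derived from the original materials of the test-other subset of LibriSpeech.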
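Any of the split names listed above can be passed to load_dataset through its split argument. The sketch below wraps that call in a small helper that rejects misspelled names before the download starts; load_split and SPLITS are hypothetical names introduced for illustration, and the repository id is assumed to be the one named at the top of this card. Depending on your datasets version, script-based datasets may need extra arguments that are not shown here.

```python
# Split names as listed in this card.
SPLITS = (
    "train", "dev",
    "train.clean.100", "train.clean.360", "train.other.500",
    "dev.clean", "dev.other",
    "test.clean", "test.other",
)


def load_split(name, repo="cdminix/libritts-r-aligned"):
    """Load one split, failing early if the name is not in SPLITS."""
    if name not in SPLITS:
        raise ValueError(f"Unknown split {name!r}; expected one of {SPLITS}")
    # Imported lazily so the name check works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(repo, split=name)
```

For example, load_split("dev.clean") would fetch only the dev-clean validation data, which is much smaller than the full training set and convenient for a first test of the pipeline.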

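Environment Variables

Several environment variables can be set:

LIBRITTS_VERBOSE: If set, prints more information about the dataset creation process.
LIBRITTS_MAX_WORKERS: The number of workers to use when creating the alignments. Defaults to cpu_count().
LIBRITTS_PATH: The path to download LibriTTS to. Defaults to the value of HF_DATASETS_CACHE.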
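These variables can be exported in the shell or set from Python before the dataset is loaded. The following is a minimal sketch; the specific values are assumptions for illustration (the card only says LIBRITTS_VERBOSE takes effect "if set", and the path shown is hypothetical).

```python
import os

# Assumption: any non-empty value enables verbose output.
os.environ["LIBRITTS_VERBOSE"] = "1"

# Limit alignment creation to 4 workers instead of the cpu_count() default.
os.environ["LIBRITTS_MAX_WORKERS"] = "4"

# Hypothetical download location; replace with a directory that has enough space.
os.environ["LIBRITTS_PATH"] = os.path.join(os.path.expanduser("~"), "libritts_data")

# Set the variables before calling load_dataset so the loading script can see them.
for name in ("LIBRITTS_VERBOSE", "LIBRITTS_MAX_WORKERS", "LIBRITTS_PATH"):
    print(name, "=", os.environ[name])
```

Lowering LIBRITTS_MAX_WORKERS can help on shared machines where using every core for alignment is not desirable.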

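Citation

When using LibriTTS-R, please cite the following paper:

Y. Koizumi, et al., "LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus", Interspeech 2023

When using the measures, please cite the following paper (ours):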

Evaluating and reducing the distance between synthetic and real speech distributions

Author:

cdminix

Dataset size:

278.83 MB