Dataset: facebook/multilingual_librispeech
Multilinguality: multilingual
Size: 100K<n<1M
Annotations creators: expert-generated
Source datasets: original
Preprint: arxiv:2012.03411
This is a streamable version of the Multilingual LibriSpeech (MLS) dataset. The data archives were restructured from the original ones from OpenSLR to make it easier to stream.
The MLS dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
The datasets library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the load_dataset function.
For example, to download the German config, simply specify the corresponding language config name (i.e., "german" for German):
from datasets import load_dataset
mls = load_dataset("facebook/multilingual_librispeech", "german", split="train")
 Using the datasets library, you can also stream the dataset on-the-fly by adding a streaming=True argument to the load_dataset function call. Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk.
from datasets import load_dataset
mls = load_dataset("facebook/multilingual_librispeech", "german", split="train", streaming=True)
print(next(iter(mls)))
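Because a streamed dataset is consumed as a plain Python iterable, the first few samples can be inspected without iterating over everything. The following is a minimal sketch: `peek` and `fake_stream` are hypothetical helper names, and `fake_stream` only stands in for the streamed `mls` object so that the snippet runs offline.

```python
from itertools import islice


def peek(stream, n=3):
    """Return the first n samples of an iterable, pulling only n items from it."""
    return list(islice(stream, n))


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields samples one at a time."""
    i = 0
    while True:
        yield {"id": f"sample_{i}", "text": f"transcription {i}"}
        i += 1


first = peek(fake_stream(), n=3)
print([sample["id"] for sample in first])
```

With the streamed dataset loaded above, the equivalent call would be `peek(mls, n=3)`.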
Bonus: create a PyTorch dataloader directly with your own datasets (local/streamed).
Local:
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler
mls = load_dataset("facebook/multilingual_librispeech", "german", split="train")
batch_sampler = BatchSampler(RandomSampler(mls), batch_size=32, drop_last=False)
dataloader = DataLoader(mls, batch_sampler=batch_sampler)
 Streaming:
from datasets import load_dataset
from torch.utils.data import DataLoader
mls = load_dataset("facebook/multilingual_librispeech", "german", split="train", streaming=True)
dataloader = DataLoader(mls, batch_size=32)
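Audio clips usually differ in length, so batching them generally requires padding. The following is a minimal sketch of a padding function that could be passed to `DataLoader` through its `collate_fn` argument. It assumes each sample carries an `audio` dictionary with an `array` entry and a `text` entry; `pad_collate` and the toy batch are illustrative names and data, not part of the dataset or of PyTorch.

```python
def pad_collate(batch, pad_value=0.0):
    """Pad every audio array in the batch to the length of the longest one."""
    lengths = [len(sample["audio"]["array"]) for sample in batch]
    max_len = max(lengths)
    padded = [
        list(sample["audio"]["array"]) + [pad_value] * (max_len - n)
        for sample, n in zip(batch, lengths)
    ]
    return {
        "audio": padded,      # equal-length lists of floats
        "lengths": lengths,   # original lengths, useful for masking
        "text": [sample["text"] for sample in batch],
    }


# Toy batch standing in for real samples (illustrative values only).
toy_batch = [
    {"audio": {"array": [0.1, 0.2, 0.3]}, "text": "first"},
    {"audio": {"array": [0.4]}, "text": "second"},
]
collated = pad_collate(toy_batch)
print(collated["lengths"])
```

It would be used as `DataLoader(mls, batch_size=32, collate_fn=pad_collate)`; in practice the padded lists would likely be converted to tensors inside the function.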
To find out more about loading and preparing audio datasets, head over to hf.co/blog/audio-datasets.
You can also train your own CTC or Seq2Seq automatic speech recognition models on Multilingual LibriSpeech with transformers.
A typical data point comprises the path to the audio file, called file, and its transcription, called text. Some additional information about the speaker and the passage that contains the transcription is also provided.
{'file': '10900_6473_000030.flac',
 'audio': {'path': '10900_6473_000030.flac',
  'array': array([-1.52587891e-04,  6.10351562e-05,  0.00000000e+00, ...,
          4.27246094e-04,  5.49316406e-04,  4.57763672e-04]),
  'sampling_rate': 16000},
 'text': 'więc czego chcecie odemnie spytałem wysłuchawszy tego zadziwiającego opowiadania broń nas stary człowieku broń zakrzyknęli równocześnie obaj posłowie\n',
 'speaker_id': 10900,
 'chapter_id': 6473,
 'id': '10900_6473_000030'}
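In this sample, the id appears to be the speaker id, the chapter id, and a running segment number joined by underscores. That layout is inferred from this single example and is not a documented guarantee. Under that assumption, a small sketch for splitting an id into its parts could look as follows; `UtteranceId` and `parse_utterance_id` are hypothetical names.

```python
from typing import NamedTuple


class UtteranceId(NamedTuple):
    speaker_id: int
    chapter_id: int
    segment: str  # kept as a string to preserve leading zeros


def parse_utterance_id(utt_id: str) -> UtteranceId:
    """Split an id of the assumed form '<speaker>_<chapter>_<segment>'."""
    speaker, chapter, segment = utt_id.split("_")
    return UtteranceId(int(speaker), int(chapter), segment)


parsed = parse_utterance_id("10900_6473_000030")
print(parsed)
```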
file: The filename of the audio file, in .flac format.
audio: A dictionary containing the audio filename, the decoded audio array, and the sampling rate. Note that when accessing the audio column with dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: dataset[0]["audio"] should always be preferred over dataset["audio"][0].
text: the transcription of the audio file.
id: unique id of the data sample.
speaker_id: unique id of the speaker. The same speaker id can be found for multiple data samples.
chapter_id: id of the audiobook chapter which includes the transcription.
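The advice on the audio field can be illustrated with a toy model. The class below is not the datasets implementation; `ToyAudioDataset` and its decode counter are hypothetical. It only mimics the behaviour described above, where indexing a row decodes one file while indexing the column decodes every file.

```python
class ToyAudioDataset:
    """Toy stand-in that counts how many files get 'decoded'."""

    def __init__(self, paths):
        self.paths = paths
        self.decode_calls = 0

    def _decode(self, path):
        # Pretend to decode one audio file (the expensive step).
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested sample.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: decode every sample in the dataset.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


paths = ["a.flac", "b.flac", "c.flac"]

row_first = ToyAudioDataset(paths)
row_first[0]["audio"]
row_first_calls = row_first.decode_calls

column_first = ToyAudioDataset(paths)
column_first["audio"][0]
column_first_calls = column_first.decode_calls

print(row_first_calls, column_first_calls)
```

In this toy model the row-first access triggers a single decode, while the column-first access triggers one decode per file before the first element is even selected.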
| | Train | Train.9h | Train.1h | Dev | Test |
|---|---|---|---|---|---|
| german | 469942 | 2194 | 241 | 3469 | 3394 | 
| dutch | 374287 | 2153 | 234 | 3095 | 3075 | 
| french | 258213 | 2167 | 241 | 2416 | 2426 | 
| spanish | 220701 | 2110 | 233 | 2408 | 2385 | 
| italian | 59623 | 2173 | 240 | 1248 | 1262 | 
| portuguese | 37533 | 2116 | 236 | 826 | 871 | 
| polish | 25043 | 2173 | 238 | 512 | 520 | 
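The training-split sizes in the table can be tallied to see how unevenly the languages are represented. The counts below are copied from the Train column above; the variable names are illustrative.

```python
# Number of training samples per language, copied from the table above.
train_counts = {
    "german": 469942,
    "dutch": 374287,
    "french": 258213,
    "spanish": 220701,
    "italian": 59623,
    "portuguese": 37533,
    "polish": 25043,
}

total = sum(train_counts.values())
shares = {lang: count / total for lang, count in train_counts.items()}

# Print languages from largest to smallest share of the training data.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<11}{train_counts[lang]:>8}{share:>8.1%}")
print(f"{'total':<11}{total:>8}")
```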
The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.
Public Domain, Creative Commons Attribution 4.0 International Public License (CC-BY-4.0)
@article{Pratap2020MLSAL,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03411}
}
 Thanks to @patrickvonplaten and @polinaeterna for adding this dataset.