Dataset Card for multi_language_conversation

Dataset Summary

The dataset contains 12,000 hours of multi-language conversational speech data recorded by native speakers, covering English, French, German, Russian, Spanish, Japanese, Korean, Hindi, Vietnamese, and other languages. Speakers build each conversation around a familiar topic so that it flows smoothly and naturally. The audio format is 16 kHz, 16-bit, mono-channel, uncompressed WAV. The sentence accuracy is over 95%. For more details, please refer to the link: https://bit.ly/39UzIwI
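The stated audio format (16 kHz, 16-bit, mono, uncompressed WAV) can be checked on a downloaded file with Python's standard `wave` module. The following is a minimal sketch rather than an official loader: the function names, the `EXPECTED` dictionary, and any file path passed in are hypothetical, since this card does not describe the file layout.

```python
import wave

# Format stated in the dataset summary: 16 kHz, 16-bit (2 bytes per sample), mono.
EXPECTED = {"sample_rate_hz": 16000, "sample_width_bytes": 2, "channels": 1}


def read_wav_format(path):
    """Return the basic format parameters of a WAV file.

    wave.open raises wave.Error if the file is not a WAV format
    that the module can read, such as a compressed encoding.
    """
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "sample_rate_hz": sample_rate,
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / sample_rate,
        }


def matches_card_format(path):
    """Return True if the file matches the format stated in the card."""
    info = read_wav_format(path)
    return all(info[key] == value for key, value in EXPECTED.items())
```

Calling `matches_card_format` on each file before training is a cheap way to catch resampled or stereo files early. The `duration_s` field can be summed across files to compare the total against the stated number of hours.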

Supported Tasks and Leaderboards

automatic-speech-recognition, audio-speaker-identification: The dataset can be used to train a model for Automatic Speech Recognition (ASR). It is also tagged for speaker identification.

Languages

English, French, German, Russian, Spanish, Japanese, Korean, Hindi, Vietnamese, and others.

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Commercial License: https://drive.google.com/file/d/1saDCPm74D4UWfBL17VbkTsZLGfpOQj1J/view?usp=sharing

Citation Information

[More Information Needed]

Contributions

Author:

Datatang

Dataset size:

73.66 MB