Dataset:

persiannlp/parsinlu_translation_fa_en

Task:

Translation

Languages:

fa en

Size:

1K<n<10K

Language creators:

expert-generated

Annotation creators:

expert-generated

Source datasets:

extended

Preprint:

arxiv:2012.06154

License:

cc-by-nc-sa-4.0

Dataset Card for ParsiNLU (Machine Translation)

Dataset Summary

A Persian translation dataset (Persian -> English).

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The text in the dataset is in Persian (fa) and English (en).

Dataset Structure

Data Instances

Here is an example from the dataset:

{
    "source": "چه زحمت‌ها که بکشد تا منابع مالی را تامین کند اصطلاحات را ترویج کند نهادهایی به راه اندازد.", 
    "targets": ["how toil to raise funds, propagate reforms, initiate institutions!"],  
    "category": "mizan_dev_en_fa"
}

Data Fields

source: the input sentence, in Persian.
targets: the list of gold target translations, in English.
category: the source from which the example was mined.
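
Because targets is a list, a single record may carry more than one reference translation. The following is a minimal Python sketch, assuming records are plain dictionaries shaped like the example above, that flattens one record into (source, target) pairs. The names Record and to_pairs are hypothetical helpers introduced here for illustration; they are not part of the dataset or of any library.

```python
from typing import Iterator, List, Tuple, TypedDict


class Record(TypedDict):
    """Assumed record layout, mirroring the fields listed above."""
    source: str         # input sentence, in Persian
    targets: List[str]  # gold English translations
    category: str       # where the example was mined from


# Sample record copied from the "Data Instances" example.
record: Record = {
    "source": "چه زحمت‌ها که بکشد تا منابع مالی را تامین کند اصطلاحات را ترویج کند نهادهایی به راه اندازد.",
    "targets": ["how toil to raise funds, propagate reforms, initiate institutions!"],
    "category": "mizan_dev_en_fa",
}


def to_pairs(rec: Record) -> Iterator[Tuple[str, str]]:
    """Yield one (Persian source, English target) pair per gold translation."""
    for target in rec["targets"]:
        yield rec["source"], target


pairs = list(to_pairs(record))
print(f"{len(pairs)} pair(s) from category {record['category']}")
```

Keeping the list of targets intact is useful for multi-reference evaluation, while the flattened pairs are the more convenient shape for training a translation model.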

Data Splits

The train/dev/test splits contain 1,622,281/2,138/47,745 samples, respectively.
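
To make the relative sizes concrete, the short Python sketch below computes the total and each split's share from the counts stated above. The variable names are illustrative only.

```python
# Split sizes as stated in this card.
split_sizes = {"train": 1_622_281, "dev": 2_138, "test": 47_745}

total = sum(split_sizes.values())

# Fraction of all samples that falls into each split.
shares = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name}: {size:,} samples ({shares[name]:.2%})")
print(f"total: {total:,} samples")
```

The dev split is very small compared with train and test, which is worth keeping in mind when using it for model selection.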

Dataset Creation

Curation Rationale

For details, see the corresponding paper draft (arXiv:2012.06154).

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

CC BY-NC-SA 4.0 License

Citation Information

@article{huggingface:dataset,
    title = {ParsiNLU: A Suite of Language Understanding Challenges for Persian},
    author = {Khashabi, Daniel and Cohan, Arman and Shakeri, Siamak and Hosseini, Pedram and Pezeshkpour, Pouya and Alikhani, Malihe and Aminnaseri, Moin and Bitaab, Marzieh and Brahman, Faeze and Ghazarian, Sarik and others},
    year = {2020},
    journal = {arXiv e-prints},
    eprint = {2012.06154}
}

Contributions

Thanks to @danyaljj for adding this dataset.

Author:

persiannlp

Dataset size:

12.33 KB