Dataset: csebuetnlp/xnli_bn
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotation creators: machine-generated
Source datasets: extended
This is a Natural Language Inference (NLI) dataset for Bengali, curated from the subset of MNLI data used in XNLI and translated with the state-of-the-art English-to-Bengali translation model introduced here.
from datasets import load_dataset
dataset = load_dataset("csebuetnlp/xnli_bn")
One example from the dataset is given below in JSON format.
{
  "sentence1": "আসলে, আমি এমনকি এই বিষয়ে চিন্তাও করিনি, কিন্তু আমি এত হতাশ হয়ে পড়েছিলাম যে, শেষ পর্যন্ত আমি আবার তার সঙ্গে কথা বলতে শুরু করেছিলাম",
  "sentence2": "আমি তার সাথে আবার কথা বলিনি।",
  "label": "contradiction"
}
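As a rough illustration of the record structure shown above, the sketch below parses one JSON record and checks its fields. The helper name `parse_record`, the placeholder sentences, and the set of three label names are assumptions for illustration: only "contradiction" appears in the example, and the hosted dataset may encode labels differently (for instance as integers).

```python
import json

# Standard three-way NLI classes. This set is an assumption: the example
# above only shows "contradiction".
NLI_LABELS = {"contradiction", "entailment", "neutral"}
REQUIRED_FIELDS = {"sentence1", "sentence2", "label"}


def parse_record(line: str) -> dict:
    """Parse one JSON record and verify it has the expected fields."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["label"] not in NLI_LABELS:
        raise ValueError(f"unexpected label: {record['label']!r}")
    return record


# Hypothetical placeholder text stands in for the Bengali sentences.
raw = json.dumps({
    "sentence1": "placeholder premise",
    "sentence2": "placeholder hypothesis",
    "label": "contradiction",
})
example = parse_record(raw)
print(example["label"])
```

A check like this is mainly useful when reading exported JSON lines, where a malformed record would otherwise surface much later in training.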
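The data fields are sentence1 (the premise), sentence2 (the hypothesis), and label (the inference class, such as "contradiction" in the example above). The number of examples in each split is as follows: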
| split | count | 
|---|---|
| train | 381449 | 
| validation | 2419 | 
| test | 4895 | 
The dataset was curated with the same procedure as XNLI: we translated the MultiNLI training data using the English-to-Bangla translation model introduced here. Because automatic translation can introduce errors, we computed the similarity between each translation and its original sentence using Language-Agnostic BERT Sentence Embeddings (LaBSE). All sentences with a similarity below the threshold of 0.70 were discarded.
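The filtering step can be sketched roughly as follows. This is not the authors' code: it assumes the similarity is cosine similarity between embedding vectors, and it swaps in a small dictionary of toy two-dimensional vectors where a real pipeline would call a LaBSE encoder. The names `cosine_similarity`, `filter_translations`, and `TOY_EMBEDDINGS` are hypothetical.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_translations(
    pairs: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],
    threshold: float = 0.70,
) -> List[Tuple[str, str]]:
    """Keep (source, translation) pairs whose similarity reaches the threshold."""
    kept = []
    for source, translation in pairs:
        if cosine_similarity(embed(source), embed(translation)) >= threshold:
            kept.append((source, translation))
    return kept


# Toy stand-in for a sentence encoder (assumption: a real run would use LaBSE).
TOY_EMBEDDINGS = {
    "en: source": [1.0, 0.0],
    "bn: faithful": [0.9, 0.1],
    "bn: drifted": [0.1, 0.9],
}

kept = filter_translations(
    [("en: source", "bn: faithful"), ("en: source", "bn: drifted")],
    TOY_EMBEDDINGS.__getitem__,
)
print(kept)
```

Passing the encoder in as a function keeps the threshold logic independent of the embedding model, so the same filter could be reused with any sentence encoder.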
Contents of this repository are restricted to non-commercial research purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.
If you use the dataset, please cite the following paper:
@misc{bhattacharjee2021banglabert,
      title={BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
      author={Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
      year={2021},
      eprint={2101.00204},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Thanks to @abhik1505040 and @Tahmid for adding this dataset.