Dataset:
embedding-data/QQP_triplets
This dataset will give anyone the opportunity to train and test models of semantic equivalence, based on actual Quora data. The data is organized as triplets (anchor, positive, negative).
Disclaimer: The team releasing Quora data did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Each example is a dictionary with three keys (query, pos, and neg) that together form a triplet. The first key contains an anchor sentence, the second a positive sentence, and the third a list of negative sentences.
{"query": [anchor], "pos": [positive], "neg": [negative1, negative2, ..., negativeN]}
{"query": [anchor], "pos": [positive], "neg": [negative1, negative2, ..., negativeN]}
...
{"query": [anchor], "pos": [positive], "neg": [negative1, negative2, ..., negativeN]}
This dataset is useful for training Sentence Transformers models. Refer to the following post on how to train them; a minimal, illustrative training sketch also appears after the loading example below.
Install the 🤗 Datasets library with pip install datasets and load the dataset from the Hub with:
from datasets import load_dataset
dataset = load_dataset("embedding-data/QQP_triplets")
The dataset is loaded as a DatasetDict and has the format:
DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 101762
    })
})
Review an example at index i with:
dataset["train"][i]["set"]
Here are a few important things to keep in mind about this dataset:
Thanks to Kornél Csernai, Nikhil Dandekar, and Shankar Iyer for adding this dataset.