Test-Stanford Dataset

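Dataset Summary

Hashtags from the Stanford Sentiment Analysis Dataset, manually annotated with gold segmentations by Bansal et al.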

Languages

English

Dataset Structure

Data Instances

{
    "index": 1467856821,
    "hashtag": "therapyfail",
    "segmentation": "therapy fail",
    "gold_position": 8,
    "rank": {
        "position": [
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10,
            11,
            12,
            13,
            14,
            15,
            16,
            17,
            18,
            19,
            20
        ],
        "candidate": [
            "therap y fail",
            "the rap y fail",
            "t her apy fail",
            "the rap yfail",
            "t he rap y fail",
            "thera py fail",
            "ther apy fail",
            "th era py fail",
            "therapy fail",
            "therapy fai l",
            "the r apy fail",
            "the rapyfa il",
            "the rapy fail",
            "t herapy fail",
            "the rapyfail",
            "therapy f ai l",
            "therapy fa il",
            "the rapyf a il",
            "therapy f ail",
            "the ra py fail"
        ]
    }
}

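Data Fields

index: a numerical index annotated by Kodali et al.
hashtag: the original hashtag.
segmentation: the gold segmentation of the hashtag.
gold_position: the position of the gold segmentation among the candidates in rank. In the example above, the value 8 is the zero-based index of "therapy fail" in rank.candidate, where it holds rank position 9.
rank: the rank of each candidate segmentation selected by a baseline word segmenter (Segmentations Seeder Module). The position and candidate lists are parallel: position[i] is the rank of candidate[i].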
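
The relationship between gold_position and rank can be sketched in Python. This is a minimal illustration built only from the example record above; it assumes gold_position is a zero-based index into rank.candidate, which holds for that record but is not stated explicitly by the dataset authors. The helper name gold_candidate is hypothetical.

```python
# Example record copied from the "Data Instances" section (index field omitted).
record = {
    "hashtag": "therapyfail",
    "segmentation": "therapy fail",
    "gold_position": 8,
    "rank": {
        "position": list(range(1, 21)),
        "candidate": [
            "therap y fail", "the rap y fail", "t her apy fail",
            "the rap yfail", "t he rap y fail", "thera py fail",
            "ther apy fail", "th era py fail", "therapy fail",
            "therapy fai l", "the r apy fail", "the rapyfa il",
            "the rapy fail", "t herapy fail", "the rapyfail",
            "therapy f ai l", "therapy fa il", "the rapyf a il",
            "therapy f ail", "the ra py fail",
        ],
    },
}


def gold_candidate(rec):
    """Return (rank position, candidate) found at index gold_position.

    Assumption: gold_position is a zero-based index into the parallel
    lists rank["position"] and rank["candidate"].
    """
    i = rec["gold_position"]
    rank = rec["rank"]
    return rank["position"][i], rank["candidate"][i]


rank_position, candidate = gold_candidate(record)
# For this record, the candidate found at gold_position is the gold segmentation.
assert candidate == record["segmentation"]
print(rank_position, candidate)  # prints: 9 therapy fail
```

Whether every record contains its gold segmentation among the candidates is not documented here, so a stricter check may be needed before relying on this lookup across the whole dataset.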

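Dataset Creation

All hashtag segmentation and identifier splitting datasets on this profile share the same basic fields: hashtag and segmentation, or identifier and segmentation.
The only difference between hashtag and segmentation, or between identifier and segmentation, is the whitespace characters. Spell checking, expanding abbreviations, and correcting characters to uppercase go into other fields.
There is always whitespace between an alphanumeric character and a sequence of any special characters (such as _, :, ~).
Any annotations for named entity recognition and other token classification tasks are given in a spans field.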
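
The two formatting rules above (whitespace-only differences, and whitespace around runs of special characters) can be checked with a short Python sketch. Both function names are hypothetical, and treating "alphanumeric" as ASCII letters and digits is a simplifying assumption, not something the dataset card specifies. The strings "foo _ bar" and "foo_bar" are made-up inputs used only to exercise the second rule.

```python
import re


def differs_only_by_whitespace(hashtag, segmentation):
    """True if the two strings are identical once all whitespace is removed."""
    return "".join(hashtag.split()) == "".join(segmentation.split())


# Matches an ASCII alphanumeric character directly adjacent to a character
# that is neither alphanumeric nor whitespace (assumed meaning of "special").
_UNSPACED_SPECIAL = re.compile(
    r"[A-Za-z0-9][^A-Za-z0-9\s]|[^A-Za-z0-9\s][A-Za-z0-9]"
)


def special_runs_are_spaced(segmentation):
    """True if no alphanumeric character touches a special character."""
    return _UNSPACED_SPECIAL.search(segmentation) is None


print(differs_only_by_whitespace("therapyfail", "therapy fail"))  # True
print(differs_only_by_whitespace("therapyfail", "Therapy fail"))  # False
print(special_runs_are_spaced("foo _ bar"))  # True
print(special_runs_are_spaced("foo_bar"))    # False
```

The second call returns False because a case change is not a whitespace difference; per the rules above, such corrections belong in other fields.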

Additional Information

Citation Information

@misc{bansal2015deep,
      title={Towards Deep Semantic Analysis Of Hashtags}, 
      author={Piyush Bansal and Romil Bansal and Vasudeva Varma},
      year={2015},
      eprint={1501.03210},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Contributions

This dataset was added by @ruanchaves while developing the hashformers library.

Author:

ruanchaves

Dataset size:

8.53 KB