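Dataset:
clarin-pl/aspectemo
AspectEmo Corpus is an extended version of PolEmo 2.0, a publicly available corpus of Polish customer reviews that has been used in many sentiment analysis projects with a variety of methods. AspectEmo consists of four subcorpora, each containing online customer reviews from one domain: school, medicine, hotels, and products. All documents are annotated at the aspect level with six sentiment categories: strong negative (minus_m), weak negative (minus_s), neutral (zero), weak positive (plus_s), strong positive (plus_m), and ambiguous (amb).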
| version | config name | description | default | notes |
|---|---|---|---|---|
| 1.0 | "1.0" | The version used in the paper. | YES | |
| 2.0 | - | Some bugs fixed. | NO | work in progress |
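Aspect-based sentiment analysis (ABSA) is a text analysis method that categorizes data by aspect and identifies the sentiment attributed to each aspect. It is a sequence tagging task.
Input ('tokens' column): sequence of tokens
Output ('labels' column): sequence of predicted token classes ("O" plus 6 possible classes: strong negative (a_minus_m), weak negative (a_minus_s), neutral (a_zero), weak positive (a_plus_s), strong positive (a_plus_m), ambiguous (a_amb))
Domain: school, medicine, hotels and products
Metric: F1 score (seqeval)
Example:
Input: ['Dużo', 'wymaga', ',', 'ale', 'bardzo', 'uczciwy', 'i', 'przyjazny', 'studentom', '.', 'Warto', 'chodzić', 'na', 'konsultacje', '.', 'Docenia', 'postępy', 'i', 'zaangażowanie', '.', 'Polecam', '.']
Input (translated by DeepL): 'Demands a lot, but very honest and student-friendly. Worth going to consultations. Appreciates progress and commitment. I recommend.'
Output: ['O', 'a_plus_s', 'O', 'O', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'a_zero', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O', 'O']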
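Because the tokens and labels are parallel sequences, the annotated aspects can be read off by pairing them position by position. The following is a minimal sketch using the example above; the helper name `aspect_tokens` is hypothetical and not part of the dataset or any library.

```python
from typing import List, Tuple


def aspect_tokens(tokens: List[str], labels: List[str]) -> List[Tuple[int, str, str]]:
    """Return (position, token, label) for every token not tagged "O"."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [
        (i, token, label)
        for i, (token, label) in enumerate(zip(tokens, labels))
        if label != "O"
    ]


# The example sentence and its labels, copied from the card.
tokens = ['Dużo', 'wymaga', ',', 'ale', 'bardzo', 'uczciwy', 'i', 'przyjazny',
          'studentom', '.', 'Warto', 'chodzić', 'na', 'konsultacje', '.',
          'Docenia', 'postępy', 'i', 'zaangażowanie', '.', 'Polecam', '.']
labels = ['O', 'a_plus_s', 'O', 'O', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O',
          'O', 'O', 'a_zero', 'O', 'a_plus_m', 'O', 'O', 'O', 'O', 'O', 'O']

for position, token, label in aspect_tokens(tokens, labels):
    print(position, token, label)
# 1 wymaga a_plus_s
# 5 uczciwy a_plus_m
# 13 konsultacje a_zero
# 15 Docenia a_plus_m
```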
| Subset | Cardinality (sentences) |
|---|---|
| train | 1173 |
| val | 0 |
| test | 292 |
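
Since the validation subset is empty, model selection needs a held-out part of the training data. The sketch below shows one possible way to carve it out with a seeded shuffle; the function name `holdout_split`, the 10% fraction, and the seed are assumptions chosen for illustration, not a split recommended by the dataset authors.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    examples: Sequence[T], val_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Split examples into (train, validation) using a seeded shuffle."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    # Keep at least one validation example when the input is non-empty.
    n_val = max(1, round(len(examples) * val_fraction))
    val_indices = set(indices[:n_val])
    train = [ex for i, ex in enumerate(examples) if i not in val_indices]
    val = [ex for i, ex in enumerate(examples) if i in val_indices]
    return train, val


# Stand-in for the 1173 training examples listed in the table above.
train_part, val_part = holdout_split(list(range(1173)), val_fraction=0.1)
print(len(train_part), len(val_part))  # 1056 117
```

When working with the `datasets` library directly, `Dataset.train_test_split` offers similar functionality.
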
| Class | train | validation | test |
|---|---|---|---|
| a_plus_m | 0.359 | - | 0.369 |
| a_minus_m | 0.305 | - | 0.377 |
| a_zero | 0.234 | - | 0.182 |
| a_minus_s | 0.037 | - | 0.024 |
| a_plus_s | 0.037 | - | 0.015 |
| a_amb | 0.027 | - | 0.033 |
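
The card does not state how these proportions were computed. Each column sums to roughly 1, which suggests they are shares among aspect-tagged tokens with "O" excluded. Under that assumption, a minimal sketch of the computation could look as follows; `class_shares` is a hypothetical helper and the toy data is invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def class_shares(
    label_sequences: Iterable[List[str]], outside: str = "O"
) -> Dict[str, float]:
    """Share of each class among all labels that differ from the outside tag."""
    counts = Counter(
        label
        for sequence in label_sequences
        for label in sequence
        if label != outside
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy label sequences, not taken from the corpus.
toy = [["O", "a_plus_m", "O", "a_zero"], ["a_plus_m", "O", "a_minus_m"]]
shares = class_shares(toy)
print(shares)  # {'a_plus_m': 0.5, 'a_zero': 0.25, 'a_minus_m': 0.25}
```
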
@misc{11321/849,
title = {{AspectEmo} 1.0: Multi-Domain Corpus of Consumer Reviews for Aspect-Based Sentiment Analysis},
author = {Koco{\'n}, Jan and Radom, Jarema and Kaczmarz-Wawryk, Ewa and Wabnic, Kamil and Zaj{\c a}czkowska, Ada and Za{\'s}ko-Zieli{\'n}ska, Monika},
url = {http://hdl.handle.net/11321/849},
note = {{CLARIN}-{PL} digital repository},
copyright = {The {MIT} License},
year = {2021}
}
The MIT License
from pprint import pprint
from datasets import load_dataset
dataset = load_dataset("clarin-pl/aspectemo")
pprint(dataset['train'][20])
# {'labels': [0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 3, 0, 5, 0, 0, 0, 0, 0, 0],
#  'tokens': ['Dużo',
#             'wymaga',
#             ',',
#             'ale',
#             'bardzo',
#             'uczciwy',
#             'i',
#             'przyjazny',
#             'studentom',
#             '.',
#             'Warto',
#             'chodzić',
#             'na',
#             'konsultacje',
#             '.',
#             'Docenia',
#             'postępy',
#             'i',
#             'zaangażowanie',
#             '.',
#             'Polecam',
#             '.']}
import random
from pprint import pprint
import evaluate
from datasets import load_dataset
dataset = load_dataset("clarin-pl/aspectemo")
references = dataset["test"]["labels"]
# generate random predictions
predictions = [
    [
        random.randrange(dataset["train"].features["labels"].feature.num_classes)
        for _ in range(len(labels))
    ]
    for labels in references
]
# transform to original names of labels
references_named = [
    [dataset["train"].features["labels"].feature.names[label] for label in labels]
    for labels in references
]
predictions_named = [
    [dataset["train"].features["labels"].feature.names[label] for label in labels]
    for labels in predictions
]
# transform to BILOU scheme (every labelled token becomes a single-token "U-" unit)
references_named = [
    [f"U-{label}" if label != "O" else label for label in labels]
    for labels in references_named
]
predictions_named = [
    [f"U-{label}" if label != "O" else label for label in labels]
    for labels in predictions_named
]
# utilise seqeval to evaluate
# (load_metric was removed from recent datasets releases; the evaluate package replaces it)
seqeval = evaluate.load("seqeval")
seqeval_score = seqeval.compute(
    predictions=predictions_named,
    references=references_named,
    scheme="BILOU",
    mode="strict",
)
pprint(seqeval_score)
# {'a_amb': {'f1': 0.00597237775289287,
#            'number': 91,
#            'precision': 0.003037782418834251,
#            'recall': 0.17582417582417584},
#  'a_minus_m': {'f1': 0.048306148055207034,
#                'number': 1039,
#                'precision': 0.0288551620760727,
#                'recall': 0.1482194417709336},
#  'a_minus_s': {'f1': 0.004682997118155619,
#                'number': 67,
#                'precision': 0.0023701002734731083,
#                'recall': 0.19402985074626866},
#  'a_plus_m': {'f1': 0.045933014354066985,
#               'number': 1015,
#               'precision': 0.027402473834443386,
#               'recall': 0.14187192118226602},
#  'a_plus_s': {'f1': 0.0021750951604132683,
#               'number': 41,
#               'precision': 0.001095690284879474,
#               'recall': 0.14634146341463414},
#  'a_zero': {'f1': 0.025159400310184387,
#             'number': 501,
#             'precision': 0.013768389287061486,
#             'recall': 0.14570858283433133},
#  'overall_accuracy': 0.13970115681233933,
#  'overall_f1': 0.02328248652368391,
#  'overall_precision': 0.012639312620633834,
#  'overall_recall': 0.14742193173565724}
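
Because the evaluation snippet above tags every labelled token as a single-token `U-` unit, the strict entity-level scores should reduce to counting positions where a non-"O" label matches. The sketch below illustrates that reading in plain Python. It is not seqeval's implementation, and the name `micro_scores` and the toy sequences are assumptions made for illustration.

```python
from typing import List, Tuple


def micro_scores(
    references: List[List[str]], predictions: List[List[str]], outside: str = "O"
) -> Tuple[float, float, float]:
    """Micro precision, recall and F1 over single-token labelled units."""
    correct = predicted = actual = 0
    for ref_seq, pred_seq in zip(references, predictions):
        if len(ref_seq) != len(pred_seq):
            raise ValueError("reference and prediction lengths differ")
        for ref, pred in zip(ref_seq, pred_seq):
            if pred != outside:
                predicted += 1  # every non-"O" prediction counts as one unit
            if ref != outside:
                actual += 1  # every non-"O" reference counts as one unit
            if ref != outside and ref == pred:
                correct += 1  # same position and same class
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data: one of three predicted units is right, one of two true units is found.
refs = [["O", "a_plus_m", "O", "a_zero"]]
preds = [["a_zero", "a_plus_m", "O", "a_minus_m"]]
precision, recall, f1 = micro_scores(refs, preds)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.333 0.5 0.4
```

This view also explains the shape of the random baseline: guessing uniformly over seven labels marks most tokens as aspects, so precision collapses, while the reported overall recall of about 0.147 stays close to 1/7.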