Model: ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs

Datasets: Multilingual_large_dataset_(multilarge), cnc/dm, xsum, mlsum, cnewsum, cnc, sumeczech

Tags: m2m_100, summarization, abstractive summarization, multilingual summarization, m2m100_418M, Czech, text2text generation, text generation, AutoTrain Compatible

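This model is a fine-tuned checkpoint of facebook/m2m100_418M, trained on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.

The model produces multi-sentence summaries in eight languages. By adding documents in other languages alongside a large number of Czech documents, we aimed to improve summarization in Czech. Supported languages: 'cs', 'en', 'de', 'es', 'fr', 'ru', 'tu', 'zh'
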
# Usage

The example assumes you are using the provided MultilingualSummarizer.ipynb file and the included files from the git repository.

    from collections import OrderedDict

    # MultiSummarizer comes from the provided MultilingualSummarizer.ipynb
    # and the files included in the git repository.

    ## Configuration of summarization pipeline
    #
    def summ_config():
        cfg = OrderedDict([
            ## summarization model - checkpoint
            #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
            #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
            #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
            ("model_name", "ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs"),
            ## language of summarization task
            #   language : string : cs, en, de, fr, es, tr, ru, zh
            ("language", "en"),
            ## generation method parameters in dictionary
            #
            ("inference_cfg", OrderedDict([
                ("num_beams", 4),
                ("top_k", 40),
                ("top_p", 0.92),
                ("do_sample", True),
                ("temperature", 0.95),
                ("repetition_penalty", 1.23),
                ("no_repeat_ngram_size", None),
                ("early_stopping", True),
                ("max_length", 128),
                ("min_length", 10),
            ])),
            # texts to summarize, values = (list of strings, string, dataset)
            ("texts",
                [
                    "english text1 to summarize",
                    "english text2 to summarize",
                ]
            ),
            # OPTIONAL: target summaries, values = (list of strings, string, None)
            ("golds",
                [
                    "target english text1",
                    "target english text2",
                ]
            ),
            # ("golds", None),
        ])
        return cfg

    cfg = summ_config()
    mSummarize = MultiSummarizer(**cfg)
    summaries, scores = mSummarize(**cfg)

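
Before handing a configuration to the summarizer, it can help to sanity-check it. The following is a minimal sketch, not part of the repository: `check_summ_config` and `SUPPORTED_LANGUAGES` are hypothetical names, the language set is taken from the comment in the configuration, and the rule that `texts` and `golds` should have equal length is an assumption.

```python
# Hypothetical helper, not part of the ctu-aic repository.
# Language codes copied from the comment in the configuration.
SUPPORTED_LANGUAGES = {"cs", "en", "de", "fr", "es", "tr", "ru", "zh"}


def check_summ_config(cfg):
    """Return a list of human-readable problems found in a config mapping."""
    problems = []
    if cfg.get("language") not in SUPPORTED_LANGUAGES:
        problems.append(f"unsupported language: {cfg.get('language')!r}")

    gen = cfg.get("inference_cfg", {})
    if gen.get("min_length", 0) > gen.get("max_length", float("inf")):
        problems.append("min_length exceeds max_length")
    top_p = gen.get("top_p")
    if top_p is not None and not 0.0 < top_p <= 1.0:
        problems.append("top_p must be in (0, 1]")

    # Assumption: when both are lists, each text has one target summary.
    texts, golds = cfg.get("texts"), cfg.get("golds")
    if isinstance(texts, list) and isinstance(golds, list) and len(texts) != len(golds):
        problems.append("texts and golds differ in length")
    return problems
```

An empty list means no problem was detected; for example, `check_summ_config(summ_config())` could be called before constructing the summarizer.
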
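The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and Daily Mail articles. Training used the entire training set and 72% of the validation set.
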
Train set: 3 464 563 docs

Validation set: 121 260 docs

| dataset | fragment compression | fragment density | fragment coverage | avg document length (nsent) | avg document length (nwords) | avg summary length (nsent) | avg summary length (nwords) | documents (count) |
|---|---|---|---|---|---|---|---|---|
| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K |
| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K |
| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93 | 425K |
| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K |
| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
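
The compression, density and coverage columns look like the extractive-fragment statistics introduced with the Newsroom dataset (Grusky et al., 2018); the card does not say how they were computed, so the following is only a sketch under that assumption. It uses whitespace tokenization and lowercasing, which is also an assumption, and the function names are illustrative.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest token spans of `summary` that also occur in `article`."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        # Find the longest match between summary[i:] and any position in the article.
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best
        else:
            i += 1  # token does not occur in the article
    return fragments


def fragment_stats(article_text, summary_text):
    """Compression, coverage and density for one document/summary pair."""
    article = article_text.lower().split()
    summary = summary_text.lower().split()
    if not summary:
        raise ValueError("summary must contain at least one token")
    frags = extractive_fragments(article, summary)
    n = len(summary)
    return {
        "compression": len(article) / n,                # article words per summary word
        "coverage": sum(len(f) for f in frags) / n,     # share of summary tokens copied
        "density": sum(len(f) ** 2 for f in frags) / n, # average squared fragment length
    }
```

Under these definitions, low density and coverage indicate that summaries are mostly abstractive, while high values indicate long spans copied from the document.
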
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 tokens for the decoder (summary).

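
As an illustration of what these limits mean for a single sequence, here is a minimal sketch in plain Python. The 512 and 128 limits come from the text above; `pad_or_truncate` is a hypothetical helper and `pad_id=0` is a placeholder, since the real padding id comes from the model's tokenizer.

```python
ENCODER_MAX_LEN = 512  # input text limit stated in the card
DECODER_MAX_LEN = 128  # summary limit stated in the card


def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Clip a token-id list to `max_length`, then right-pad it to exactly that length.

    Returns the fixed-length ids and an attention mask (1 = real token, 0 = padding).
    `pad_id=0` is a placeholder value, not the model's actual padding id.
    """
    clipped = list(token_ids[:max_length])
    n_pad = max_length - len(clipped)
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return clipped + [pad_id] * n_pad, attention_mask


# A 600-token document is cut to 512 tokens; a 3-token summary is padded to 128.
doc_ids, doc_mask = pad_or_truncate(range(600), ENCODER_MAX_LEN)
sum_ids, sum_mask = pad_or_truncate([5, 6, 7], DECODER_MAX_LEN)
```

Tokens beyond the limit are simply discarded, so content after the first 512 input tokens cannot influence the summary.
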
The model was trained with a cross-entropy loss.

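
For readers unfamiliar with the objective, the sketch below computes a token-level cross-entropy in plain Python. It is not the training code: `token_cross_entropy` is an illustrative name, and skipping targets equal to `-100` mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`, which is assumed here as the way padded summary positions are excluded.

```python
import math


def token_cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood of the target tokens.

    logits:  one list of vocabulary scores per decoder position
    targets: one target token id per position; `ignore_index` marks padding
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded positions do not contribute to the loss
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0
```

With uniform scores over a two-token vocabulary the loss is `log(2)`, and it approaches zero as the score of the correct token dominates.
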
Time: 3 days 10 hours

Epochs: 1072K steps = 10 (from 10)

GPUs: 4x NVIDIA A100-SXM4-40GB

eloss: 2.824 - 1.745

tloss: 4.559 - 1.615

| dataset | ROUGE-1 Precision | ROUGE-1 Recall | ROUGE-1 Fscore | ROUGE-2 Precision | ROUGE-2 Recall | ROUGE-2 Fscore | ROUGE-L Precision | ROUGE-L Recall | ROUGE-L Fscore |
|---|---|---|---|---|---|---|---|---|---|
| cnc | 30.13 | 22.56 | 25.21 | 10.53 | 8.01 | 8.9 | 22.47 | 16.92 | 18.86 |
| sumeczech | 26.6 | 19.66 | 22.01 | 8.17 | 6.12 | 6.82 | 19.93 | 14.81 | 16.54 |
| cnndm | 41.8 | 38.41 | 38.94 | 18.74 | 17.14 | 17.4 | 29.69 | 27.33 | 27.68 |
| xsum | 38.27 | 33.62 | 35.16 | 14.39 | 12.69 | 13.25 | 30.77 | 27.05 | 28.29 |
| mlsum-tu | 52.44 | 44.36 | 46.39 | 36.98 | 31.51 | 32.86 | 46.04 | 39.04 | 40.8 |
| mlsum-de | 42.19 | 40.5 | 40.7 | 28.8 | 28.51 | 28.37 | 38.95 | 37.7 | 37.79 |
| mlsum-fr | 34.57 | 27.74 | 29.95 | 16.27 | 13.04 | 14.08 | 27.18 | 21.89 | 23.6 |
| mlsum-es | 30.93 | 26.41 | 27.66 | 11.42 | 9.85 | 10.28 | 25.12 | 21.59 | 22.55 |
| mlsum-ru | 0.65 | 0.52 | 0.56 | 0.15 | 0.15 | 0.15 | 0.65 | 0.52 | 0.56 |
| cnewsum | 25.14 | 26.56 | 24.45 | 6.89 | 7.54 | 6.78 | 24.77 | 26.15 | 24.08 |
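
To make the precision, recall and F-score columns concrete, here is a minimal sketch of ROUGE-N based on n-gram overlap. It is not the evaluation script used for the table: `rouge_n` is an illustrative name, tokenization is plain lowercased whitespace splitting (an assumption), and scores are returned in the range 0 to 1, whereas the table appears to report percentages.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f-score) of n-gram overlap between two strings."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore
```

Precision is the share of generated n-grams found in the reference, recall is the share of reference n-grams recovered, and the F-score is their harmonic mean.
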