Model: microsoft/xtremedistil-l12-h384-uncased

Language: English

XtremeDistilTransformers for Distilling Massive Neural Networks

XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer to learn a small universal model which can be applied to arbitrary tasks and languages, as described in the paper XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation.

We leverage task transfer combined with the multi-task distillation techniques from the papers XtremeDistil: Multi-stage Distillation for Massive Multilingual Models and MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers, using the accompanying GitHub code for the experiments.

This l12-h384 checkpoint, with 12 layers, a hidden size of 384 and 12 attention heads, has 33 million parameters and a 2.7x speedup over BERT-base.

Other available checkpoints: xtremedistil-l6-h256-uncased and xtremedistil-l6-h384-uncased.
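
The checkpoint can be loaded like any other BERT-style encoder through the transformers AutoModel API. The snippet below is a minimal illustrative sketch, not taken from the original card; only the model ID comes from above, the example sentence and the printed checks are assumptions.

```python
# Minimal sketch: load the distilled checkpoint and extract hidden states.
# Only the model ID is from the card; everything else is illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "microsoft/xtremedistil-l12-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("XtremeDistilTransformers compresses massive neural networks.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden size is 384 for this h384 checkpoint: (batch, seq_len, 384)
print(outputs.last_hidden_state.shape)
# Parameter count should be on the order of 33M for l12-h384
print(sum(p.numel() for p in model.parameters()))
```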

The table below shows the results on the GLUE dev set and SQuAD-v2.

| Models | #Params (M) | Speedup | MNLI | QNLI | QQP | RTE | SST | MRPC | SQUAD2 | Avg |
|--------|-------------|---------|------|------|-----|-----|-----|------|--------|-----|
| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
| XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
| XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
| XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |

Tested with tensorflow 2.3.1, transformers 4.1.1 and torch 1.6.0.
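
The GLUE numbers in the table come from fine-tuning the distilled checkpoint on each downstream task. Below is a hedged sketch of what such a fine-tuning run could look like with the transformers Trainer on SST-2; the task choice, hyperparameters, and output directory are illustrative assumptions, not the authors' exact recipe.

```python
# Hedged sketch: fine-tune the checkpoint on a GLUE-style task (SST-2 here).
# Hyperparameters and the chosen task are assumptions, not the paper's recipe.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "microsoft/xtremedistil-l12-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    # SST-2 examples carry a single "sentence" field
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="xtremedistil-sst2",   # illustrative path
    per_device_train_batch_size=32,
    learning_rate=3e-5,               # assumed value, tune per task
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,              # enables dynamic padding via the default collator
)
trainer.train()
```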

If you use this checkpoint in your work, please cite:

@misc{mukherjee2021xtremedistiltransformers,
      title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation}, 
      author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
      year={2021},
      eprint={2106.04563},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}