Model: microsoft/Multilingual-MiniLM-L12-H384
MiniLM is a distilled model from the paper "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers".
Please find details on the preprocessing, training, and full specifics of MiniLM in the original MiniLM repository.
Please note: this checkpoint uses BertModel with XLMRobertaTokenizer, so AutoTokenizer does not work with this checkpoint!
Multilingual MiniLM uses the same tokenizer as XLM-R, but the Transformer architecture of our model is the same as BERT. We provide fine-tuning code on XNLI based on huggingface/transformers. Please replace run_xnli.py in transformers with ours to fine-tune Multilingual MiniLM.
We evaluate Multilingual MiniLM on the cross-lingual natural language inference benchmark (XNLI) and the cross-lingual question answering benchmark (MLQA).

Cross-Lingual Natural Language Inference - XNLI

We evaluate our model on cross-lingual transfer from English to other languages. Following Conneau et al. (2019), we select the best single model based on the joint dev set of all languages.
| Model | #Layers | #Hidden | #Transformer Parameters | Average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mBERT | 12 | 768 | 85M | 66.3 | 82.1 | 73.8 | 74.3 | 71.1 | 66.4 | 68.9 | 69.0 | 61.6 | 64.9 | 69.5 | 55.8 | 69.3 | 60.0 | 50.4 | 58.0 |
| XLM-100 | 16 | 1280 | 315M | 70.7 | 83.2 | 76.7 | 77.7 | 74.0 | 72.7 | 74.1 | 72.7 | 68.7 | 68.6 | 72.9 | 68.9 | 72.5 | 65.6 | 58.2 | 62.4 |
| XLM-R Base | 12 | 768 | 85M | 74.5 | 84.6 | 78.4 | 78.9 | 76.8 | 75.9 | 77.3 | 75.4 | 73.2 | 71.5 | 75.4 | 72.5 | 74.9 | 71.1 | 65.2 | 66.5 |
| mMiniLM-L12xH384 | 12 | 384 | 21M | 71.1 | 81.5 | 74.8 | 75.7 | 72.9 | 73.0 | 74.5 | 71.3 | 69.7 | 68.8 | 72.1 | 67.8 | 70.0 | 66.2 | 63.3 | 64.2 |
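As a sanity check on the table above, the Average column is presumably the unweighted mean over the 15 XNLI languages; for the mMiniLM-L12xH384 row:

```python
# Per-language XNLI accuracies for mMiniLM-L12xH384, in table order
# (en, fr, es, de, el, bg, ru, tr, ar, vi, th, zh, hi, sw, ur).
scores = [81.5, 74.8, 75.7, 72.9, 73.0, 74.5, 71.3, 69.7,
          68.8, 72.1, 67.8, 70.0, 66.2, 63.3, 64.2]

average = sum(scores) / len(scores)
print(round(average, 1))  # 71.1, matching the Average column
```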
This example code fine-tunes the 12-layer Multilingual MiniLM on XNLI.
```bash
# run fine-tuning on XNLI
DATA_DIR=/{path_of_data}/
OUTPUT_DIR=/{path_of_fine-tuned_model}/
MODEL_PATH=/{path_of_pre-trained_model}/
python ./examples/run_xnli.py --model_type minilm \
 --output_dir ${OUTPUT_DIR} --data_dir ${DATA_DIR} \
 --model_name_or_path microsoft/Multilingual-MiniLM-L12-H384 \
 --tokenizer_name xlm-roberta-base \
 --config_name ${MODEL_PATH}/multilingual-minilm-l12-h384-config.json \
 --do_train \
 --do_eval \
 --max_seq_length 128 \
 --per_gpu_train_batch_size 128 \
 --learning_rate 5e-5 \
 --num_train_epochs 5 \
 --per_gpu_eval_batch_size 32 \
 --weight_decay 0.001 \
 --warmup_steps 500 \
 --save_steps 1500 \
 --logging_steps 1500 \
 --eval_all_checkpoints \
 --language en \
 --fp16 \
 --fp16_opt_level O2
```
Cross-Lingual Question Answering - MLQA
Following Lewis et al. (2019b), we adopt SQuAD 1.1 as training data and use the MLQA English development data for early stopping.
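The checkpoint-selection rule above can be sketched as follows: train on SQuAD 1.1, evaluate each saved checkpoint on the MLQA English dev set, and keep the one with the highest dev F1. The `select_best_checkpoint` helper and the scores below are hypothetical, for illustration only:

```python
def select_best_checkpoint(dev_f1_by_step):
    """Early stopping on MLQA English dev: return the training step
    whose checkpoint scored the highest F1 on the dev set."""
    return max(dev_f1_by_step, key=dev_f1_by_step.get)

# Illustrative (made-up) dev F1 scores per saved checkpoint:
dev_f1 = {1500: 58.7, 3000: 61.9, 4500: 61.2}
best_step = select_best_checkpoint(dev_f1)  # -> 3000
```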
| Model (F1 Score) | #Layers | #Hidden | #Transformer Parameters | Average | en | es | de | ar | hi | vi | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mBERT | 12 | 768 | 85M | 57.7 | 77.7 | 64.3 | 57.9 | 45.7 | 43.8 | 57.1 | 57.5 |
| XLM-15 | 12 | 1024 | 151M | 61.6 | 74.9 | 68.0 | 62.2 | 54.8 | 48.8 | 61.4 | 61.1 |
| XLM-R Base (Reported) | 12 | 768 | 85M | 62.9 | 77.8 | 67.2 | 60.8 | 53.0 | 57.9 | 63.1 | 60.2 |
| XLM-R Base (Our fine-tuned) | 12 | 768 | 85M | 64.9 | 80.3 | 67.0 | 62.7 | 55.0 | 60.4 | 66.5 | 62.3 |
| mMiniLM-L12xH384 | 12 | 384 | 21M | 63.2 | 79.4 | 66.1 | 61.2 | 54.9 | 58.5 | 63.1 | 59.0 |
If you find MiniLM useful in your research, please cite the following paper:
```bibtex
@misc{wang2020minilm,
    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
    year={2020},
    eprint={2002.10957},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```