Model:
microsoft/deberta-v2-xxlarge
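DeBERTa improves on the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder. With 80GB of training data, it outperforms BERT and RoBERTa on the majority of NLU tasks.
Please check the official repository for more details and updates.
This is the DeBERTa V2 xxlarge model, with 48 layers and a hidden size of 1536. It has 1.5B parameters in total and was trained on 160GB of raw data.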
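As a rough sanity check on the stated size, the sketch below estimates the parameter count from the layer count and hidden size given above. The 4x feed-forward width and the roughly 128K-token vocabulary are assumptions made for this illustration, not figures from this card, and the estimate ignores biases, layer norms, and DeBERTa-specific components such as relative position embeddings. Under those assumptions the estimate comes out close to the 1.5B figure.

```python
# Back-of-the-envelope parameter estimate for a Transformer encoder.
# num_layers and hidden_size come from the model description; the
# remaining values are assumptions made only for this illustration.
num_layers = 48
hidden_size = 1536
ffn_multiplier = 4       # assumption: feed-forward width is 4x the hidden size
vocab_size = 128_000     # assumption: roughly 128K-token vocabulary

# Four h x h matrices per layer: query, key, value, and output projections.
attention_params = 4 * hidden_size * hidden_size
# Two matrices per layer: h -> 4h and 4h -> h.
ffn_params = 2 * ffn_multiplier * hidden_size * hidden_size
per_layer = attention_params + ffn_params

# Token embedding table.
embedding_params = vocab_size * hidden_size

total = num_layers * per_layer + embedding_params
print(f"per layer:  {per_layer / 1e6:.1f}M parameters")
print(f"embeddings: {embedding_params / 1e6:.1f}M parameters")
print(f"total:      {total / 1e9:.2f}B parameters")
```

Most of the parameters sit in the 48 encoder layers, which is why the layer count and hidden size dominate the total.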
We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.
| Model | SQuAD 1.1 | SQuAD 2.0 | MNLI-m/mm | SST-2 | QNLI | CoLA | RTE | MRPC | QQP | STS-B |
|---|---|---|---|---|---|---|---|---|---|---|
| | F1/EM | F1/EM | Acc | Acc | Acc | MCC | Acc | Acc/F1 | Acc/F1 | P/S |
| BERT-Large | 90.9/84.1 | 81.8/79.0 | 86.6/- | 93.2 | 92.3 | 60.6 | 70.4 | 88.0/- | 91.3/- | 90.0/- |
| RoBERTa-Large | 94.6/88.9 | 89.4/86.5 | 90.2/- | 96.4 | 93.9 | 68.0 | 86.6 | 90.9/- | 92.2/- | 92.4/- |
| XLNet-Large | 95.1/89.7 | 90.6/87.9 | 90.8/- | 97.0 | 94.9 | 69.0 | 85.9 | 90.8/- | 92.3/- | 92.5/- |
| DeBERTa-Large<sup>1</sup> | 95.5/90.1 | 90.7/88.0 | 91.3/91.1 | 96.5 | 95.3 | 69.5 | 91.0 | 92.6/94.6 | 92.3/- | 92.8/92.5 |
| DeBERTa-XLarge<sup>1</sup> | -/- | -/- | 91.5/91.2 | 97.0 | - | - | 93.1 | 92.1/94.3 | - | 92.9/92.7 |
| DeBERTa-V2-XLarge<sup>1</sup> | 95.8/90.8 | 91.4/88.9 | 91.7/91.6 | 97.5 | 95.8 | 71.1 | 93.9 | 92.0/94.2 | 92.3/89.8 | 92.9/92.9 |
| DeBERTa-V2-XXLarge<sup>1,2</sup> | 96.1/91.4 | 92.2/89.7 | 91.7/91.9 | 97.2 | 96.0 | 72.0 | 93.5 | 93.1/94.9 | 92.7/90.3 | 93.2/93.1 |
Run with Deepspeed:
pip install datasets
pip install deepspeed
# Download the deepspeed config file
wget https://huggingface.co/microsoft/deberta-v2-xxlarge/resolve/main/ds_config.json -O ds_config.json
export TASK_NAME=mnli
output_dir="ds_results"
num_gpus=8
batch_size=8
python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_glue.py \
  --model_name_or_path microsoft/deberta-v2-xxlarge \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 256 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 3e-6 \
  --num_train_epochs 3 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 10 \
  --logging_dir $output_dir \
--deepspeed ds_config.json
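
To see what the settings in the command above imply, the following sketch works out the global batch size and the approximate number of optimizer steps. The MNLI training-set size and the absence of gradient accumulation are assumptions made for this illustration; the values actually used depend on the dataset version and on the contents of ds_config.json.

```python
import math

# Values taken from the command above.
num_gpus = 8              # --nproc_per_node
per_device_batch = 8      # --per_device_train_batch_size
num_epochs = 3            # --num_train_epochs

# Assumptions made only for this illustration.
train_examples = 392_702  # assumed size of the MNLI training split
grad_accum_steps = 1      # assumed: no gradient accumulation

# Each optimizer step consumes one batch per GPU per accumulation step.
global_batch = num_gpus * per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(train_examples / global_batch)
total_steps = steps_per_epoch * num_epochs

print(f"global batch size: {global_batch}")
print(f"steps per epoch:   {steps_per_epoch}")
print(f"total steps:       {total_steps}")
```

Changing num_gpus or batch_size in the shell variables changes the global batch size, so the learning rate may need retuning if either is modified.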
You can also run with --sharded_ddp:
cd transformers/examples/text-classification/
export TASK_NAME=mnli
python -m torch.distributed.launch --nproc_per_node=8 run_glue.py \
  --model_name_or_path microsoft/deberta-v2-xxlarge \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 256 \
  --per_device_train_batch_size 8 \
  --learning_rate 3e-6 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir \
  --sharded_ddp \
  --fp16
If you find DeBERTa useful for your work, please cite the following paper:
@inproceedings{
he2021deberta,
title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}