qilowoq/AbLang_light

Model:

qilowoq/AbLang_light

Task:

Fill-Mask

Libraries:

PyTorch Transformers

Other:

roberta custom_code chemistry biology protein antibodies antibody light chain AbLang CDR OAS AutoTrain Compatible light+chain

License:

bsd

AbLang model for light chains

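This is a 🤗 version of AbLang, a language model for antibodies. It was introduced in the paper cited below (Olsen et al., 2022) and first released in the original AbLang repository. The model was trained on uppercase amino acids, so it only works with sequences written in capital letters.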

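Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks (TBA).

How to use

Here is how to use this model in PyTorch to get the features of a given antibody sequence: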

import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True allows transformers to run the custom model code shipped with the checkpoint
tokenizer = AutoTokenizer.from_pretrained('qilowoq/AbLang_light')
model = AutoModel.from_pretrained('qilowoq/AbLang_light', trust_remote_code=True)

# residues are passed to the tokenizer as uppercase letters separated by spaces
sequence_Example = ' '.join("GSELTQDPAVSVALGQTVRITCQGDSLRNYYASWYQQKPRQAPVLVFYGKNNRPSGIPDRFSGSSSGNTASLTISGAQAEDEADYYCNSRDSSSNHLVFGGGTKLTVLSQ")
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
model_output = model(**encoded_input)
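
Since the model only handles uppercase amino acids and the example above feeds the tokenizer space-separated residues, raw input can be normalized before tokenization. The following is a minimal sketch: `prepare_sequence` and `STANDARD_RESIDUES` are hypothetical names, not part of the model's code, and restricting input to the 20 standard one-letter codes is an assumption about the vocabulary that may need relaxing.

```python
# Hypothetical helper, not part of the AbLang_light repository.
# Assumption: only the 20 standard one-letter amino acid codes are accepted.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def prepare_sequence(raw: str) -> str:
    """Uppercase a raw sequence and separate residues with single spaces."""
    # Drop any whitespace (spaces, line breaks from wrapped input) and uppercase.
    seq = "".join(raw.split()).upper()
    unknown = sorted(set(seq) - STANDARD_RESIDUES)
    if unknown:
        raise ValueError(f"unexpected residue letters: {unknown}")
    return " ".join(seq)


print(prepare_sequence("gselt qdpav"))  # G S E L T Q D P A V
```

The returned string can be passed to the tokenizer in place of `sequence_Example`.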

Sequence embeddings can be produced as follows:

def get_sequence_embeddings(encoded_input, model_output):
    mask = encoded_input['attention_mask'].float()
    # map each row to its last attended position, i.e. the sep token
    d = {k: v for k, v in torch.nonzero(mask).cpu().numpy()}
    # make sep token invisible
    for i in d:
        mask[i, d[i]] = 0
    mask[:, 0] = 0.0  # make cls token invisible
    mask = mask.unsqueeze(-1).expand(model_output.last_hidden_state.size())
    # masked mean over the sequence dimension
    sum_embeddings = torch.sum(model_output.last_hidden_state * mask, 1)
    sum_mask = torch.clamp(mask.sum(1), min=1e-9)  # avoid division by zero
    return sum_embeddings / sum_mask

seq_embeds = get_sequence_embeddings(encoded_input, model_output)
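
To make the arithmetic in `get_sequence_embeddings` concrete, here is a framework-free sketch of the same pooling on plain Python lists for a single sequence. `masked_mean` is a hypothetical name used only for illustration. It assumes, as the function above does, that position 0 holds the cls token and the last attended position holds the sep token, and it expects at least one attended position.

```python
def masked_mean(hidden, attention_mask):
    """Average token vectors, skipping cls, sep and padding positions.

    hidden: list of per-token vectors (lists of numbers) for one sequence.
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    mask = list(attention_mask)  # copy so the caller's mask is untouched
    last_attended = max(i for i, m in enumerate(mask) if m)  # sep position
    mask[last_attended] = 0  # hide sep
    mask[0] = 0              # hide cls
    dim = len(hidden[0])
    total = [0.0] * dim
    for vec, m in zip(hidden, mask):
        if m:
            for j in range(dim):
                total[j] += vec[j]
    count = max(sum(mask), 1e-9)  # same guard against division by zero
    return [t / count for t in total]


# Toy input: cls, two residues, sep, one padding position.
hidden = [[9, 9], [1, 2], [3, 4], [9, 9], [0, 0]]
attention_mask = [1, 1, 1, 1, 0]
print(masked_mean(hidden, attention_mask))  # [2.0, 3.0]
```

Only the two residue vectors contribute, so the result is their element-wise mean.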

Fine-tuning

To save memory, we recommend using LoRA:

pip install git+https://github.com/huggingface/peft.git
pip install loralib

LoRA greatly reduces the number of trainable parameters and can match or outperform fine-tuning the full model.

from peft import LoraConfig, get_peft_model

def apply_lora_bert(model):
    config = LoraConfig(
        r=8, lora_alpha=32, 
        lora_dropout=0.3,
        target_modules=['query', 'value']
    )
    for param in model.parameters():
        param.requires_grad = False  # freeze the model - train adapters later
        if param.ndim == 1:
            # cast the small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)
    model.gradient_checkpointing_enable()  # reduce number of stored activations
    model.enable_input_require_grads()
    model = get_peft_model(model, config)
    return model

model = apply_lora_bert(model)

model.print_trainable_parameters()
# trainable params: 294912 || all params: 85493760 || trainable%: 0.3449514911965505
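
The count printed above can be reproduced by hand. Each LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds two matrices, of shapes `r x d_in` and `d_out x r`. The sketch below assumes 12 transformer layers with hidden size 768; these values are not stated in this card, but they are consistent with the printed total. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumptions (not stated in the model card): 12 layers, hidden size 768.
num_layers, hidden_size, rank = 12, 768, 8
targets_per_layer = 2  # the 'query' and 'value' projections

trainable = num_layers * targets_per_layer * lora_param_count(hidden_size, hidden_size, rank)
all_params = 85493760  # total reported by print_trainable_parameters()

print(trainable)  # 294912
print(f"{100 * trainable / all_params:.4f}%")  # 0.3450%
```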

Citation

@article{Olsen2022,
  title={AbLang: An antibody language model for completing antibody sequences},
  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
  journal={bioRxiv},
  doi={10.1101/2022.01.20.477061},
  year={2022}
}

Author:

Oleg Dmitriev

Dataset size:

327.41 MB