Model:

tiiuae/falcon-7b-instruct

Task:

Text generation

Libraries:

PyTorch, Core ML, Transformers

Datasets:

tiiuae/falcon-refinedweb

Language:

Other:

RefinedWebModel, custom_code, text-generation-inference

Preprints:

arxiv:2205.14135 arxiv:1911.02150 arxiv:2005.14165 arxiv:2104.09864 arxiv:2306.01116

License:

apache-2.0

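✨ Falcon-7B-Instruct

Falcon-7B-Instruct is a 7B-parameter causal decoder-only model built by TII on top of Falcon-7B and finetuned on a mixture of chat/instruct datasets. It is made available under the Apache 2.0 license.

Paper coming soon 😊.

🤗 To get started with Falcon (inference, finetuning, quantization, etc.), we recommend reading this great blog post from HF!

Why use Falcon-7B-Instruct?

You are looking for a ready-to-use chat/instruct model based on Falcon-7B.
Falcon-7B is a strong base model, outperforming comparable open-source models (e.g., MPT-7B, StableLM, RedPajama), thanks to being trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. See the OpenLLM Leaderboard.
It features an architecture optimized for inference, with FlashAttention (Dao et al., 2022) and multiquery attention (Shazeer et al., 2019).

💬 This is an instruct model, which may not be ideal for further finetuning. If you are interested in building your own instruct/chat model, we recommend starting from Falcon-7B.

🔥 Looking for an even more powerful model? Falcon-40B-Instruct is Falcon-7B-Instruct's big brother!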

from transformers import AutoTokenizer
import transformers
import torch

# Hugging Face Hub ID of the model
model = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
# Build a text-generation pipeline; trust_remote_code=True lets transformers
# run the custom modeling code shipped in the model repository
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
# Sample one continuation; max_length counts prompt plus generated tokens
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

💥 Falcon LLMs require PyTorch 2.0 for use with transformers!

For fast inference with Falcon, check out Text Generation Inference! Read more in this blog post.

You will need at least 16GB of memory to swiftly run inference with Falcon-7B-Instruct.
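
The 16GB figure is consistent with a back-of-the-envelope estimate of weight memory. The sketch below assumes roughly 7 billion parameters (rounded from the model name, not an exact count) stored in bfloat16 at 2 bytes each. Activations and the key/value cache come on top and are not modelled here.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: ~7e9 parameters, rounded from the "7B" in the model name.
N_PARAMS = 7e9

bf16_gb = weight_memory_gb(N_PARAMS, 2)  # bfloat16, as in the pipeline example
fp32_gb = weight_memory_gb(N_PARAMS, 4)  # float32, for comparison

print(f"bfloat16 weights: ~{bf16_gb:.0f} GB")
print(f"float32 weights:  ~{fp32_gb:.0f} GB")
```

Under these assumptions the bfloat16 weights alone take about 14 GB, which leaves little headroom on a 16GB device.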

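Model Card for Falcon-7B-Instruct

Model Details

Model Description

Developed by: https://www.tii.ae;
Model type: Causal decoder-only;
Language(s) (NLP): English and French;
License: Apache 2.0;
Finetuned from model: Falcon-7B.

Model Source

Paper: coming soon.

Uses

Direct Use

Falcon-7B-Instruct has been finetuned on a mixture of instruct and chat datasets.

Out-of-Scope Use

Production use without adequate assessment of risks and mitigation; any use case that may be considered irresponsible or harmful.

Bias, Risks, and Limitations

Falcon-7B-Instruct is mostly trained on English data and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.

Recommendations

We recommend that users of Falcon-7B-Instruct develop guardrails and take appropriate precautions for any production use.

How to Get Started with the Model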

from transformers import AutoTokenizer
import transformers
import torch

# Hugging Face Hub ID of the model
model = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
# Build a text-generation pipeline; trust_remote_code=True lets transformers
# run the custom modeling code shipped in the model repository
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
# Sample one continuation; max_length counts prompt plus generated tokens
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
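
The prompt in the example follows a simple pattern: a persona description, then dialogue turns written as `Speaker: text`, ending with the name of the speaker the model should continue as. A minimal sketch of a helper that builds such prompts is given below. `build_prompt` and its parameters are hypothetical names for illustration, and the layout is only inferred from the example above; it is not a documented chat format for the model.

```python
def build_prompt(persona: str, turns: list[tuple[str, str]], next_speaker: str) -> str:
    """Assemble a dialogue prompt in the style of the Girafatron example.

    Hypothetical helper: the layout is inferred from the example prompt,
    not taken from an official chat template.
    """
    lines = [persona]
    lines += [f"{speaker}: {text}" for speaker, text in turns]
    lines.append(f"{next_speaker}:")  # leave the reply open for the model
    return "\n".join(lines)


prompt = build_prompt(
    persona="Girafatron is obsessed with giraffes.",
    turns=[("Daniel", "Hello, Girafatron!")],
    next_speaker="Girafatron",
)
print(prompt)
```

The resulting string can be passed as the first argument to `pipeline(...)` in place of the hard-coded prompt.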

Training Details

Training Data

Falcon-7B-Instruct was finetuned on a 250M-token mixture of instruct/chat datasets.

Data source	Fraction	Tokens	Description
12321321	65%	164M	chat
12322321	25%	62M	instruct
12323321	5%	11M	instruct
12324321	5%	13M	massive web crawl

The data was tokenized with the Falcon-7B/40B tokenizer.
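
As a quick sanity check, the token counts in the table can be compared against the stated 250M total and the rounded fractions. The sketch below uses only the numbers from the table; the source identifiers are copied exactly as they appear there.

```python
# (source id, stated fraction in %, tokens in millions), copied from the table
mixture = [
    ("12321321", 65, 164),
    ("12322321", 25, 62),
    ("12323321", 5, 11),
    ("12324321", 5, 13),
]

total_tokens = sum(tokens for _, _, tokens in mixture)
print(f"total: {total_tokens}M tokens")

for source, stated, tokens in mixture:
    computed = 100 * tokens / total_tokens
    print(f"{source}: stated {stated}%, computed {computed:.1f}%")
```

The counts sum to 250M, and each computed share is within one percentage point of the stated, rounded fraction.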

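Evaluation

Paper coming soon.

See the OpenLLM Leaderboard for early results.

Note that this model variant is not optimized for NLP benchmarks.

Technical Specifications

For more information about pretraining, see Falcon-7B.

Model Architecture and Objective

Falcon-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predicting the next token).

The architecture is broadly adapted from the GPT-3 paper (Brown et al., 2020), with the following differences:

Positional embeddings: rotary (Su et al., 2021);
Attention: multiquery (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022);
Decoder block: parallel attention/MLP with a single layer norm.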
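
To illustrate the first item, here is a minimal pure-Python sketch of rotary position embeddings. It assumes the interleaved-pair convention and a base of 10000 from Su et al. (2021); the function names are illustrative, and the actual Falcon implementation may lay out the dimensions differently. The property it demonstrates is that the dot product of a rotated query and key depends only on their relative position.

```python
import math


def rotary(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by an angle proportional to `pos`."""
    dim = len(vec)
    assert dim % 2 == 0, "rotary embeddings need an even dimension"
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same offset (2 positions apart) at two different absolute positions.
score_near = dot(rotary(q, 5), rotary(k, 3))
score_far = dot(rotary(q, 105), rotary(k, 103))
print(abs(score_near - score_far) < 1e-9)
```

Because each pair is only rotated, the vector norm is unchanged, and shifting both positions by the same amount leaves the attention score the same up to floating-point error.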

Hyperparameter	Value	Comment
Layers	32
d_model	4544	Increased to compensate for multiquery
head_dim	64	Reduced to optimise for FlashAttention
Vocabulary	65024
Sequence length	2048
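
The comment on `d_model` refers to multiquery attention, where all query heads share a single key/value head. A rough sketch of what that means for the key/value cache at inference time follows. It assumes the number of query heads equals `d_model / head_dim` (4544 / 64 = 71), bfloat16 storage at 2 bytes per value, and one shared key/value head per layer; none of these are stated explicitly in the table.

```python
LAYERS = 32
D_MODEL = 4544
HEAD_DIM = 64
SEQ_LEN = 2048
BYTES_PER_VALUE = 2  # assumption: bfloat16 cache

# Assumption: query heads tile d_model exactly.
n_heads = D_MODEL // HEAD_DIM


def kv_cache_bytes(n_kv_heads: int, seq_len: int = SEQ_LEN) -> int:
    """Bytes for keys and values across all layers for one sequence."""
    return 2 * LAYERS * n_kv_heads * HEAD_DIM * seq_len * BYTES_PER_VALUE


multi_head = kv_cache_bytes(n_heads)  # one K/V head per query head
multi_query = kv_cache_bytes(1)       # a single shared K/V head

print(f"query heads: {n_heads}")
print(f"multi-head cache:  {multi_head / 2**20:.0f} MiB")
print(f"multi-query cache: {multi_query / 2**20:.0f} MiB")
```

Under these assumptions, a full 2048-token sequence needs a cache 71 times smaller with multiquery attention than with one key/value head per query head.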

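Compute Infrastructure

Hardware

Falcon-7B-Instruct was trained on AWS SageMaker, on 32 A100 40GB GPUs in P4d instances.

Software

Falcon-7B-Instruct was trained with Gigatron, a custom distributed training codebase. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.).

Citation

Paper coming soon 😊. In the meantime, you can use the following information to cite: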

@article{falcon40b,
  title={{Falcon-40B}: an open large language model with state-of-the-art performance},
  author={Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Hesslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme},
  year={2023}
}

To learn more about the pretraining dataset, see the 📓 RefinedWeb paper.

@article{refinedweb,
  title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal={arXiv preprint arXiv:2306.01116},
  eprint={2306.01116},
  eprinttype = {arXiv},
  url={https://arxiv.org/abs/2306.01116},
  year={2023}
}

License

Falcon-7B-Instruct is made available under the Apache 2.0 license.

Contact

falconllm@tii.ae

Author:

Technology Innovation Institute

Dataset size:

39.24 GB