Multilingual CPV Sector Classifier

This model is a fine-tuned version of bert-base-multilingual-cased on the Tenders Economic Daily Public Procurement Data. It achieves the following result on the evaluation set:

  • F1 Score: 0.686

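Model description

The model takes a procurement description written in any of 104 languages and classifies it into one of 45 sector classes, each represented by a CPV (Common Procurement Vocabulary) code description, as listed below.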

Common Procurement Vocabulary
Administration, defence and social security services. 👮‍♀️
Agricultural machinery. 🚜
Agricultural, farming, fishing, forestry and related products. 🌾
Agricultural, forestry, horticultural, aquacultural and apicultural services. 👨🏿‍🌾
Architectural, construction, engineering and inspection services. 👷‍♂️
Business services: law, marketing, consulting, recruitment, printing and security. 👩‍💼
Chemical products. 🧪
Clothing, footwear, luggage articles and accessories. 👖
Collected and purified water. 🌊
Construction structures and materials; auxiliary products to construction (excepts electric apparatus). 🧱
Construction work. 🏗️
Education and training services. 👩🏿‍🏫
Electrical machinery, apparatus, equipment and consumables; Lighting. ⚡
Financial and insurance services. 👨‍💼
Food, beverages, tobacco and related products. 🍽️
Furniture (incl. office furniture), furnishings, domestic appliances (excl. lighting) and cleaning products. 🗄️
Health and social work services. 👨🏽‍⚕️
Hotel, restaurant and retail trade services. 🏨
IT services: consulting, software development, Internet and support. 🖥️
Industrial machinery. 🏭
Installation services (except software). 🛠️
Laboratory, optical and precision equipments (excl. glasses). 🔬
Leather and textile fabrics, plastic and rubber materials. 🧵
Machinery for mining, quarrying, construction equipment. ⛏️
Medical equipments, pharmaceuticals and personal care products. 💉
Mining, basic metals and related products. ⚙️
Musical instruments, sport goods, games, toys, handicraft, art materials and accessories. 🎸
Office and computing machinery, equipment and supplies except furniture and software packages. 🖨️
Other community, social and personal services. 🧑🏽‍🤝‍🧑🏽
Petroleum products, fuel, electricity and other sources of energy. 🔋
Postal and telecommunications services. 📶
Printed matter and related products. 📰
Public utilities. ⛲
Radio, television, communication, telecommunication and related equipment. 📡
Real estate services. 🏠
Recreational, cultural and sporting services. 🚴
Repair and maintenance services. 🔧
Research and development services and related consultancy services. 👩‍🔬
Security, fire-fighting, police and defence equipment. 🧯
Services related to the oil and gas industry. ⛽
Sewage-, refuse-, cleaning-, and environmental services. 🧹
Software package and information systems. 🔣
Supporting and auxiliary transport services; travel agencies services. 🚃
Transport equipment and auxiliary products to transportation. 🚌
Transport services (excl. Waste transport). 💺

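Intended uses & limitations

  • Input descriptions should be written in any of the 104 languages that MBERT supports.
  • The model was evaluated on only 22 languages, so no performance information is available for the other languages.
  • The domain is also restricted to the descriptions of awarded procurement notices in the European Union. Evaluating on whole document texts might change the performance.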
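
The input constraints above suggest an inference call along the following lines. This is a minimal sketch, assuming the model is loaded through the Hugging Face transformers text-classification pipeline. The `model_id` argument is a placeholder for the repository name, which this card does not state, and `needs_review` with its 0.5 threshold is a hypothetical helper added only for illustration.

```python
def classify(descriptions, model_id):
    """Return one {'label': ..., 'score': ...} dict per description.

    model_id is a placeholder: pass the repository name of this model.
    """
    # Imported lazily so the helper below works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # truncation=True clips descriptions longer than the model's input limit.
    return classifier(list(descriptions), truncation=True)


def needs_review(predictions, min_score=0.5):
    """Return the indices of predictions scoring below min_score.

    The 0.5 default is an arbitrary illustration, not a tuned value.
    """
    return [i for i, p in enumerate(predictions) if p["score"] < min_score]
```

Calling `classify(["Lieferung von Laborgeräten"], model_id)` would return a list with one prediction, whose label is one of the 45 sector descriptions. Because performance is only reported for 22 languages, routing low-scoring predictions to manual review is one cautious way to use the output.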

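Training and evaluation data

  • The full dataset contains 744,360 rows. It was shuffled and split into training and validation sets in an 80%/20% ratio.
  • Each description represents a unique contract notice description awarded between 2011 and 2018.
  • The contract notice descriptions in both the training and validation data are written in 22 European languages. (Maltese and Irish were removed because they are scarce compared with the whole dataset.)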
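
The shuffle-and-split step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: the function name `shuffle_split`, the seed, and the use of Python's `random` module are all hypothetical choices.

```python
import random


def shuffle_split(rows, train_percent=80, seed=0):
    """Shuffle a copy of rows, then split it into (train, validation)."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 744,360 notice descriptions.
train, validation = shuffle_split(range(744_360))
print(len(train), len(validation))  # 595488 148872
```

With 744,360 rows, an 80%/20% split leaves 595,488 rows for training and 148,872 for validation.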

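Training procedure

Training was carried out on Google Cloud V3-8 TPUs. Thanks to Google for granting access to Cloud TPUs.

Training hyperparameters

The following hyperparameters were used during training: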

  • learning_rate: 2e-05
  • num_epochs: 3
  • gradient_accumulation_steps: 8
  • batch_size_per_device: 4
  • total_train_batch_size: 32
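
As a consistency check, the hyperparameters above can be related to the dataset size and to the step counts reported in the training results. The sketch below assumes that total_train_batch_size is the product of batch_size_per_device and gradient_accumulation_steps, and that the training set is 80% of the 744,360 rows. The card does not say how the TPU cores enter the calculation, so treat this as an interpretation rather than the authors' bookkeeping.

```python
rows_total = 744_360
train_rows = rows_total * 80 // 100  # 80% training split

batch_size_per_device = 4
gradient_accumulation_steps = 8
num_epochs = 3

# Assumption: effective batch = per-device batch x accumulation steps.
total_train_batch_size = batch_size_per_device * gradient_accumulation_steps

# Optimizer steps needed to see every training row once.
steps_per_epoch = train_rows // total_train_batch_size

# Cumulative step count at the end of each epoch.
cumulative_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(total_train_batch_size)  # 32
print(cumulative_steps)        # [18609, 37218, 55827]
```

Under these assumptions the computed values agree with the listed total_train_batch_size of 32 and with the per-epoch step counts of 18,609, 37,218 and 55,827.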

Training results

Epoch   Step     F1 Score
1       18,609   0.630
2       37,218   0.674
3       55,827   0.686

Language   F1 Score   Test Size
PL         0.759      13950
RO         0.736      3522
SK         0.719      1122
LT         0.687      2424
HU         0.681      1879
BG         0.675      2459
CS         0.668      2694
LV         0.664      836
DE         0.645      35354
FI         0.644      1898
ES         0.643      7483
PT         0.631      874
EN         0.631      16615
HR         0.626      865
IT         0.626      8035
NL         0.624      5640
EL         0.623      1724
SL         0.615      482
SV         0.607      3326
DA         0.603      1925
FR         0.601      33113
ET         0.572      458