FinBench Dataset Card

Dataset Statistics

[Introduction]

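Task Statistics

The first table below lists, for each dataset, the task and its description, the dataset name (used when loading the dataset), the number of classes (always 2), the number of features, and the size and positive-sample ratio of the train, validation, and test splits. The second table gives the total split sizes per task.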

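Task	Description	Dataset	#Classes	#Features	#Train [Pos%]	#Val [Pos%]	#Test [Pos%]
Credit-card Default	Predict whether a user will default on the credit card or not.	cd1	2	9	2738 [7.0%]	305 [6.9%]	1305 [6.2%]
Credit-card Default	Predict whether a user will default on the credit card or not.	cd2	2	23	18900 [22.3%]	2100 [22.3%]	9000 [21.8%]
Loan Default	Predict whether a user will default on the loan or not.	ld1	2	12	2118 [8.9%]	236 [8.5%]	1010 [9.0%]
Loan Default	Predict whether a user will default on the loan or not.	ld2	2	11	18041 [21.7%]	2005 [20.8%]	8592 [21.8%]
Loan Default	Predict whether a user will default on the loan or not.	ld3	2	35	142060 [21.6%]	15785 [21.3%]	67648 [22.1%]
Credit-card Fraud	Predict whether a user will commit fraud or not.	cf1	2	19	5352 [0.67%]	595 [1.1%]	2550 [0.90%]
Credit-card Fraud	Predict whether a user will commit fraud or not.	cf2	2	120	5418 [6.0%]	603 [7.3%]	2581 [6.0%]
Customer Churn	Predict whether a user will churn or not. (customer attrition)	cc1	2	9	4189 [23.5%]	466 [22.7%]	1995 [22.4%]
Customer Churn	Predict whether a user will churn or not. (customer attrition)	cc2	2	10	6300 [20.8%]	700 [20.6%]	3000 [19.47%]
Customer Churn	Predict whether a user will churn or not. (customer attrition)	cc3	2	21	4437 [26.1%]	493 [24.9%]	2113 [27.8%]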

Task	#Train	#Val	#Test
Credit-card Default	21638	2405	10305
Loan Default	162219	18026	77250
Credit-card Fraud	10770	1198	5131
Customer Churn	14926	1659	7108
Total	209553	23288	99794
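The per-task totals are the column sums of the per-dataset split sizes. As a quick consistency check, the following sketch recomputes them from the counts listed in the tables above. The names `SPLIT_SIZES`, `TASK_DATASETS`, `task_totals`, and `overall_totals` are illustrative and not part of FinBench.

```python
# (train, validation, test) sizes per dataset, copied from the statistics table.
SPLIT_SIZES = {
    "cd1": (2738, 305, 1305),
    "cd2": (18900, 2100, 9000),
    "ld1": (2118, 236, 1010),
    "ld2": (18041, 2005, 8592),
    "ld3": (142060, 15785, 67648),
    "cf1": (5352, 595, 2550),
    "cf2": (5418, 603, 2581),
    "cc1": (4189, 466, 1995),
    "cc2": (6300, 700, 3000),
    "cc3": (4437, 493, 2113),
}

# Which datasets belong to which task.
TASK_DATASETS = {
    "Credit-card Default": ["cd1", "cd2"],
    "Loan Default": ["ld1", "ld2", "ld3"],
    "Credit-card Fraud": ["cf1", "cf2"],
    "Customer Churn": ["cc1", "cc2", "cc3"],
}


def task_totals(task):
    """Sum the (train, validation, test) sizes over all datasets of one task."""
    sizes = [SPLIT_SIZES[name] for name in TASK_DATASETS[task]]
    return tuple(sum(column) for column in zip(*sizes))


def overall_totals():
    """Sum the (train, validation, test) sizes over all tasks."""
    per_task = [task_totals(task) for task in TASK_DATASETS]
    return tuple(sum(column) for column in zip(*per_task))


if __name__ == "__main__":
    for task in TASK_DATASETS:
        print(task, task_totals(task))
    print("Total", overall_totals())
```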

Data Sources

Task	Dataset	Source
Credit-card Default	cd1	1236321
Credit-card Default	cd2	1237321
Loan Default	ld1	1238321
Loan Default	ld2	1239321
Loan Default	ld3	12310321
Credit-card Fraud	cf1	12311321
Credit-card Fraud	cf2	12312321
Customer Churn	cc1	12313321
Customer Churn	cc2	12314321
Customer Churn	cc3	12315321

Language: English

Dataset Structure

Data Fields

import datasets

datasets.Features(
    {
        "X_ml": [datasets.Value(dtype="float")],  # (The scaled tabular data array of the current instance)
        "X_ml_unscale": [datasets.Value(dtype="float")],  # (The unscaled tabular data array of the current instance)
        "y": datasets.Value(dtype="int64"),  # (The label / ground-truth)
        "num_classes": datasets.Value("int64"),  # (The total number of classes)
        "num_features": datasets.Value("int64"),  # (The total number of features)
        "num_idx": [datasets.Value("int64")],  # (The indices of the numerical datatype columns)
        "cat_idx": [datasets.Value("int64")],  # (The indices of the categorical datatype columns)
        "cat_dim": [datasets.Value("int64")],  # (The dimension of each categorical column)
        "cat_str": [[datasets.Value("string")]],  # (The category names of categorical columns)
        "col_name": [datasets.Value("string")],  # (The name of each column)
        "X_instruction_for_profile": datasets.Value("string"),  # instructions (from tabular data) for profiles
        "X_profile": datasets.Value("string"),  # customer profiles built from instructions via LLMs
    }
)
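The index fields describe how to interpret the feature array. A minimal sketch of separating numerical from categorical columns is given here, assuming that `num_idx` and `cat_idx` hold positions into `X_ml` and that `col_name` is aligned with `X_ml` position by position. The helper `split_by_type` and the values in `toy_instance` are hypothetical and only mimic the schema.

```python
# Hypothetical instance that mimics the schema; the values are made up.
toy_instance = {
    "X_ml": [0.5, 1.0, -1.2, 2.0],
    "num_idx": [0, 2],  # positions of numerical columns
    "cat_idx": [1, 3],  # positions of categorical columns
    "col_name": ["age", "gender", "balance", "education"],
}


def split_by_type(instance, key="X_ml"):
    """Return two dicts (numerical, categorical) mapping column name to value.

    Assumes the index lists refer to positions in instance[key] and that
    instance["col_name"] has one name per position.
    """
    values = instance[key]
    names = instance["col_name"]
    numerical = {names[i]: values[i] for i in instance["num_idx"]}
    categorical = {names[i]: values[i] for i in instance["cat_idx"]}
    return numerical, categorical


numerical, categorical = split_by_type(toy_instance)
print(numerical)
print(categorical)
```

Passing `key="X_ml_unscale"` applies the same split to the other feature array, since both arrays share the same column layout under the assumption above.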

Data Loading

HuggingFace Login (Optional)

# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)

Load a Dataset

from datasets import load_dataset

ds_name = "cd1"  # change the dataset name here
dataset = load_dataset("yuweiyin/FinBench", ds_name)

Load Splits

from datasets import load_dataset

ds_name = "cd1"  # change the dataset name here
dataset = load_dataset("yuweiyin/FinBench", ds_name)

train_set = dataset["train"] if "train" in dataset else []
validation_set = dataset["validation"] if "validation" in dataset else []
test_set = dataset["test"] if "test" in dataset else []
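Once the splits are available, a common next step is to collect features and labels into parallel lists for a classifier. The sketch below assumes each instance behaves like a dict with the `X_ml` and `y` fields. The helper `to_xy` and the toy rows are hypothetical, and the function also accepts the empty-list fallback used above.

```python
def to_xy(split, feature_key="X_ml"):
    """Collect feature vectors and labels from an iterable of instances."""
    features, labels = [], []
    for instance in split:
        features.append(list(instance[feature_key]))
        labels.append(int(instance["y"]))
    return features, labels


# Hypothetical rows standing in for a loaded split.
toy_split = [
    {"X_ml": [0.1, 1.0], "y": 0},
    {"X_ml": [0.7, 0.0], "y": 1},
]

X_train, y_train = to_xy(toy_split)
print(len(X_train), len(y_train))
```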

Load Instances

from datasets import load_dataset

ds_name = "cd1"  # change the dataset name here
dataset = load_dataset("yuweiyin/FinBench", ds_name)
train_set = dataset["train"] if "train" in dataset else []

for train_instance in train_set:
    X_ml = train_instance["X_ml"]  # List[float] (The scaled tabular data array of the current instance)
    X_ml_unscale = train_instance["X_ml_unscale"]  # List[float] (The unscaled tabular data array of the current instance)
    y = train_instance["y"]  # int (The label / ground-truth)
    num_classes = train_instance["num_classes"]  # int (The total number of classes)
    num_features = train_instance["num_features"]  # int (The total number of features)
    num_idx = train_instance["num_idx"]  # List[int] (The indices of the numerical datatype columns)
    cat_idx = train_instance["cat_idx"]  # List[int] (The indices of the categorical datatype columns)
    cat_dim = train_instance["cat_dim"]  # List[int] (The dimension of each categorical column)
    cat_str = train_instance["cat_str"]  # List[List[str]] (The category names of categorical columns)
    col_name = train_instance["col_name"]  # List[str] (The name of each column)
    X_instruction_for_profile = train_instance["X_instruction_for_profile"]  # instructions for building profiles
    X_profile = train_instance["X_profile"]  # customer profiles built from instructions via LLMs
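Iterating over instances also makes it easy to check the class balance of a split against the positive ratios in the statistics table. This is a minimal sketch that assumes the positive class is encoded as `y == 1`. The helper `positive_rate` and the toy rows are hypothetical.

```python
def positive_rate(split):
    """Return the fraction of instances whose label equals 1 (0.0 if empty)."""
    labels = [int(instance["y"]) for instance in split]
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == 1) / len(labels)


# Hypothetical rows standing in for a loaded split.
toy_split = [{"y": 1}, {"y": 0}, {"y": 0}, {"y": 0}]
print(f"{positive_rate(toy_split):.1%}")
```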

Contributions

[Contributions]

Citation

yin2023finbench

References

[References]

Author:

yuweiyin

Dataset size:

695.22 MB