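Dataset:
yuweiyin/FinBench
[Introduction]
The table below reports, for each task, its description, the dataset name (used when loading the dataset), the size and positive-sample ratio of the training, validation, and test sets, the number of classes (2 for every dataset), and the number of features.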
| Task | Description | Dataset | #Classes | #Features | #Train [Pos%] | #Val [Pos%] | #Test [Pos%] |
|---|---|---|---|---|---|---|---|
| Credit-card Default | Predict whether a user will default on the credit card or not. | cd1 | 2 | 9 | 2738 [7.0%] | 305 [6.9%] | 1305 [6.2%] |
| Credit-card Default | Predict whether a user will default on the credit card or not. | cd2 | 2 | 23 | 18900 [22.3%] | 2100 [22.3%] | 9000 [21.8%] |
| Loan Default | Predict whether a user will default on the loan or not. | ld1 | 2 | 12 | 2118 [8.9%] | 236 [8.5%] | 1010 [9.0%] |
| Loan Default | Predict whether a user will default on the loan or not. | ld2 | 2 | 11 | 18041 [21.7%] | 2005 [20.8%] | 8592 [21.8%] |
| Loan Default | Predict whether a user will default on the loan or not. | ld3 | 2 | 35 | 142060 [21.6%] | 15785 [21.3%] | 67648 [22.1%] |
| Credit-card Fraud | Predict whether a user will commit fraud or not. | cf1 | 2 | 19 | 5352 [0.67%] | 595 [1.1%] | 2550 [0.90%] |
| Credit-card Fraud | Predict whether a user will commit fraud or not. | cf2 | 2 | 120 | 5418 [6.0%] | 603 [7.3%] | 2581 [6.0%] |
| Customer Churn | Predict whether a user will churn or not (customer attrition). | cc1 | 2 | 9 | 4189 [23.5%] | 466 [22.7%] | 1995 [22.4%] |
| Customer Churn | Predict whether a user will churn or not (customer attrition). | cc2 | 2 | 10 | 6300 [20.8%] | 700 [20.6%] | 3000 [19.47%] |
| Customer Churn | Predict whether a user will churn or not (customer attrition). | cc3 | 2 | 21 | 4437 [26.1%] | 493 [24.9%] | 2113 [27.8%] |

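The positive ratios in the table can be turned into rough absolute counts. The following is a minimal sketch: the sizes and percentages are copied from the table for three training splits, `approx_positives` is a hypothetical helper name, and the counts are only approximate because the published percentages are rounded.

```python
# Training-set size and positive ratio (in percent), copied from the table.
TRAIN_STATS = {
    "cd1": (2738, 7.0),
    "cf1": (5352, 0.67),
    "cc3": (4437, 26.1),
}


def approx_positives(n_samples: int, pos_percent: float) -> int:
    """Estimate the number of positive samples from a rounded percentage."""
    return round(n_samples * pos_percent / 100)


for name, (n, pct) in TRAIN_STATS.items():
    pos = approx_positives(n, pct)
    print(f"{name}: ~{pos} positive / ~{n - pos} negative training samples")
```

The cf1 training split holds only a few dozen positive samples by this estimate, so imbalance-aware metrics are probably worth considering for the fraud datasets.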
| Task | #Train | #Val | #Test | 
|---|---|---|---|
| Credit-card Default | 21638 | 2405 | 10305 | 
| Loan Default | 162219 | 18026 | 77250 | 
| Credit-card Fraud | 10770 | 1198 | 5131 | 
| Customer Churn | 14926 | 1659 | 7108 | 
| Total | 209553 | 23288 | 99794 | 
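The per-task totals can be cross-checked against the per-dataset split sizes. The sketch below copies the train/validation/test sizes from the first table and sums them per task; `SPLIT_SIZES` and `task_totals` are names introduced here for illustration only.

```python
# Per-dataset (train, validation, test) sizes copied from the first table.
SPLIT_SIZES = {
    "Credit-card Default": {"cd1": (2738, 305, 1305), "cd2": (18900, 2100, 9000)},
    "Loan Default": {
        "ld1": (2118, 236, 1010),
        "ld2": (18041, 2005, 8592),
        "ld3": (142060, 15785, 67648),
    },
    "Credit-card Fraud": {"cf1": (5352, 595, 2550), "cf2": (5418, 603, 2581)},
    "Customer Churn": {
        "cc1": (4189, 466, 1995),
        "cc2": (6300, 700, 3000),
        "cc3": (4437, 493, 2113),
    },
}


def task_totals(sizes):
    """Sum the split sizes of all datasets belonging to each task."""
    return {
        task: tuple(sum(column) for column in zip(*per_dataset.values()))
        for task, per_dataset in sizes.items()
    }


totals = task_totals(SPLIT_SIZES)
grand_total = tuple(sum(column) for column in zip(*totals.values()))

for task, (train, val, test) in totals.items():
    print(f"{task}: train={train}, val={val}, test={test}")
print(f"Total: train={grand_total[0]}, val={grand_total[1]}, test={grand_total[2]}")
```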

| Task | Dataset | Source |
|---|---|---|
| Credit-card Default | cd1 | 1236321 |
| Credit-card Default | cd2 | 1237321 |
| Loan Default | ld1 | 1238321 |
| Loan Default | ld2 | 1239321 |
| Loan Default | ld3 | 12310321 |
| Credit-card Fraud | cf1 | 12311321 |
| Credit-card Fraud | cf2 | 12312321 |
| Customer Churn | cc1 | 12313321 |
| Customer Churn | cc2 | 12314321 |
| Customer Churn | cc3 | 12315321 |

import datasets
datasets.Features(
    {
        "X_ml": [datasets.Value(dtype="float")],  # (The tabular data array of the current instance)
        "X_ml_unscale": [datasets.Value(dtype="float")],  # (The unscaled tabular data array of the current instance, as the field name indicates)
        "y": datasets.Value(dtype="int64"),  # (The label / ground-truth)
        "num_classes": datasets.Value("int64"),  # (The total number of classes)
        "num_features": datasets.Value("int64"),  # (The total number of features)
        "num_idx": [datasets.Value("int64")],  # (The indices of the numerical datatype columns)
        "cat_idx": [datasets.Value("int64")],  # (The indices of the categorical datatype columns)
        "cat_dim": [datasets.Value("int64")],  # (The dimension of each categorical column)
        "cat_str": [[datasets.Value("string")]],  # (The category names of categorical columns)
        "col_name": [datasets.Value("string")],  # (The name of each column)
        "X_instruction_for_profile": datasets.Value("string"),  # instructions (from tabular data) for profiles
        "X_profile": datasets.Value("string"),  # customer profiles built from instructions via LLMs
    }
)
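As an illustration of how the index fields relate to the data array, the sketch below separates one instance into numerical and categorical columns. It assumes that `num_idx` and `cat_idx` are positions into both `X_ml` and `col_name`, which the schema comments suggest but do not state outright; `split_features` and the example instance (column names and values) are made up for this sketch.

```python
def split_features(instance):
    """Return (numerical, categorical) dicts mapping column name -> value.

    Assumes num_idx / cat_idx index into both X_ml and col_name.
    """
    values = instance["X_ml"]
    names = instance["col_name"]
    numerical = {names[i]: values[i] for i in instance["num_idx"]}
    categorical = {names[i]: values[i] for i in instance["cat_idx"]}
    return numerical, categorical


# Hypothetical instance with made-up column names and values, for illustration only.
example = {
    "X_ml": [0.5, 1.0, -1.2],
    "col_name": ["age", "gender", "balance"],
    "num_idx": [0, 2],
    "cat_idx": [1],
}

numerical, categorical = split_features(example)
print(numerical)
print(categorical)
```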
# OR run huggingface-cli login
from huggingface_hub import login
hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)
from datasets import load_dataset
ds_name = "cd1"  # change the dataset name here
dataset = load_dataset("yuweiyin/FinBench", ds_name)

from datasets import load_dataset
ds_name = "cd1"  # change the dataset name here
dataset = load_dataset("yuweiyin/FinBench", ds_name)
train_set = dataset["train"] if "train" in dataset else []
validation_set = dataset["validation"] if "validation" in dataset else []
test_set = dataset["test"] if "test" in dataset else []
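To inspect all ten configurations at once, the loading snippet can be wrapped in a small loop. This is a sketch rather than part of the dataset's own tooling: `FINBENCH_CONFIGS`, `split_sizes`, and `report_all` are names introduced here, and `report_all` needs network access and the `datasets` package, so the demonstration at the bottom uses a plain dict as a stand-in for a loaded dataset.

```python
# The ten dataset names listed in the tables.
FINBENCH_CONFIGS = ["cd1", "cd2", "ld1", "ld2", "ld3", "cf1", "cf2", "cc1", "cc2", "cc3"]


def split_sizes(dataset):
    """Map each split name to its number of instances.

    Works on any mapping of split name -> sized collection.
    """
    return {split: len(rows) for split, rows in dataset.items()}


def report_all(configs=FINBENCH_CONFIGS):
    """Load every configuration and print its split sizes (needs network access)."""
    from datasets import load_dataset  # imported lazily so split_sizes stays dependency-free

    for name in configs:
        dataset = load_dataset("yuweiyin/FinBench", name)
        print(name, split_sizes(dataset))


# Offline demonstration with a plain dict standing in for a loaded dataset.
demo = {"train": [0] * 5, "validation": [0] * 2, "test": [0] * 3}
print(split_sizes(demo))
```

Calling `report_all()` should print sizes that can be compared with the #Train, #Val, and #Test columns.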

from datasets import load_dataset
ds_name = "cd1"  # change the dataset name here
dataset = load_dataset("yuweiyin/FinBench", ds_name)
train_set = dataset["train"] if "train" in dataset else []
for train_instance in train_set:
    X_ml = train_instance["X_ml"]  # List[float] (The tabular data array of the current instance)
    X_ml_unscale = train_instance["X_ml_unscale"]  # List[float] (The unscaled tabular data array of the current instance, as the field name indicates)
    y = train_instance["y"]  # int (The label / ground-truth)
    num_classes = train_instance["num_classes"]  # int (The total number of classes)
    num_features = train_instance["num_features"]  # int (The total number of features)
    num_idx = train_instance["num_idx"]  # List[int] (The indices of the numerical datatype columns)
    cat_idx = train_instance["cat_idx"]  # List[int] (The indices of the categorical datatype columns)
    cat_dim = train_instance["cat_dim"]  # List[int] (The dimension of each categorical column)
    cat_str = train_instance["cat_str"]  # List[List[str]] (The category names of categorical columns)
    col_name = train_instance["col_name"]  # List[str] (The name of each column)
    X_instruction_for_profile = train_instance["X_instruction_for_profile"]  # instructions for building profiles
    X_profile = train_instance["X_profile"]  # customer profiles built from instructions via LLMs
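For classical machine-learning models, the per-instance fields usually need to be gathered into a feature matrix and a label vector. The sketch below does this with plain Python lists and also recomputes the positive ratio. `to_xy` and `positive_ratio` are hypothetical helpers, and the toy instances are made up; in practice the input would be `dataset["train"]`.

```python
def to_xy(instances):
    """Collect feature vectors (X_ml) and labels (y) from an iterable of instances."""
    X, y = [], []
    for instance in instances:
        X.append(list(instance["X_ml"]))
        y.append(int(instance["y"]))
    return X, y


def positive_ratio(labels):
    """Fraction of labels equal to 1 (the positive class)."""
    return sum(1 for label in labels if label == 1) / len(labels)


# Made-up instances for illustration; real ones come from dataset["train"].
toy = [
    {"X_ml": [0.1, 0.2], "y": 0},
    {"X_ml": [0.3, 0.4], "y": 1},
]

X_train, y_train = to_xy(toy)
print(len(X_train), positive_ratio(y_train))
```

The resulting lists can be passed to most tabular learners, or converted to arrays first if the library requires it.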
[Contributions]
yin2023finbench
[References]