Dataset:

laion/laion-synthetic-115m

Language: English

laion-synthetic-115m

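This dataset is a version of laion-400m in which noisy or inaccurate captions have been replaced with captions generated by the BLIP model. The data was provided by Salesforce as part of BLIP and has been modified to be compatible with the img2dataset tool.

Download images with captions

Note: depending on your specific needs, you may need to change some of the keyword arguments.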

# Download the parquet file that maps image URLs to captions
wget -c https://huggingface.co/datasets/laion/laion-synthetic-115m/resolve/main/laion_synthetic_filtered_large.parquet
# Install the downloader
pip install img2dataset
# Download as many URLs as possible into webdataset (tars of txt/jpg files). Can also specify `files` instead.
img2dataset laion_synthetic_filtered_large.parquet --image_size 320 --resize_mode 'keep_ratio' --caption_col 'caption' --input_format parquet --output_format webdataset

> Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
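
Since the note above says some keyword arguments may need changing, here is a minimal sketch of the same command with the tunable values pulled into shell variables. The variable names, the default values, the output directory name, and the `DRY_RUN` switch are all assumptions made for illustration. `--output_folder`, `--processes_count`, and `--thread_count` are img2dataset options that do not appear in the original command; check `img2dataset --help` for your installed version before relying on them.

```shell
# Sketch only: variable names and default values are illustrative assumptions.
PARQUET="${PARQUET:-laion_synthetic_filtered_large.parquet}"
OUTPUT_FOLDER="${OUTPUT_FOLDER:-laion-synthetic-115m}"  # hypothetical directory name
IMAGE_SIZE="${IMAGE_SIZE:-320}"
PROCESSES_COUNT="${PROCESSES_COUNT:-4}"   # number of worker processes
THREAD_COUNT="${THREAD_COUNT:-32}"        # number of download threads
DRY_RUN="${DRY_RUN:-1}"                   # 1 = print the command, 0 = run it

run_img2dataset() {
  # Build the argument list in the function's positional parameters.
  set -- img2dataset "$PARQUET" \
    --input_format parquet \
    --caption_col caption \
    --output_format webdataset \
    --output_folder "$OUTPUT_FOLDER" \
    --image_size "$IMAGE_SIZE" \
    --resize_mode keep_ratio \
    --processes_count "$PROCESSES_COUNT" \
    --thread_count "$THREAD_COUNT"
  if [ "$DRY_RUN" = "1" ]; then
    # Show what would be executed without starting a download.
    echo "$*"
  else
    "$@"
  fi
}

run_img2dataset
```

With the defaults this only prints the command. Saved as a script (say, a hypothetical `download.sh`), it could be run for real with something like `DRY_RUN=0 IMAGE_SIZE=256 sh download.sh`.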