Dataset: laion/laion-synthetic-115m
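This dataset is a version of laion-400m in which noisy or inaccurate captions have been replaced with captions generated by the BLIP model. It was provided by Salesforce as part of BLIP and has been modified to be compatible with the img2dataset tool.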
Note: depending on your specific needs, you may have to change some of the keyword arguments.
    # Download the parquet file that maps image URLs to captions
    wget -c https://huggingface.co/datasets/laion/laion-synthetic-115m/resolve/main/laion_synthetic_filtered_large.parquet

    pip install img2dataset

    # Download as many URLs as possible into webdataset (tars of txt/jpg files).
    # Can also specify `files` as the output format instead.
    img2dataset laion_synthetic_filtered_large.parquet --image_size 320 --resize_mode 'keep_ratio' --caption_col 'caption' --input_format parquet --output_format webdataset

> Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
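Since the keyword arguments may need adjusting, one option is to collect them in a small helper so each can be overridden from the environment. The following is only a sketch: the function name `img2dataset_args`, the environment variable names, and the default values for output folder, process count, and thread count are illustrative assumptions, not part of the dataset card. It also assumes the URL column in the parquet file is named `url`, which is the img2dataset default.

```shell
# Hypothetical helper: prints img2dataset arguments, one per line.
# All variable names and fallback values are assumptions for illustration.
img2dataset_args() {
  printf '%s\n' \
    --url_list "$1" \
    --input_format parquet \
    --url_col "${URL_COL:-url}" \
    --caption_col "${CAPTION_COL:-caption}" \
    --output_format "${OUTPUT_FORMAT:-webdataset}" \
    --output_folder "${OUTPUT_FOLDER:-laion-synthetic-115m}" \
    --image_size "${IMAGE_SIZE:-320}" \
    --resize_mode keep_ratio \
    --processes_count "${PROCESSES:-4}" \
    --thread_count "${THREADS:-64}"
}
```

It could then be used as `img2dataset $(img2dataset_args laion_synthetic_filtered_large.parquet)`, or with an override such as `IMAGE_SIZE=256 OUTPUT_FORMAT=files`. The unquoted command substitution relies on word splitting, so this only works when none of the values contain spaces.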