Dataset:

laion/laion400m

License:

cc-by-4.0

LAION-400m_new

This dataset has two improvements over the original LAION_400m dataset:

  • It uses a multilingual text filter to remove malicious content.
  • The better open_clip ViT-H model was used to detect potentially harmful content in the images.

All in all, we filtered out around 6 million additional image-text pairs to improve dataset safety. This number probably includes a high rate of false positives.
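The two checks described above can be combined as in the following minimal sketch. It is an illustration only, not the pipeline that was used to build this dataset: the `Pair` record, its `text_flagged` and `image_unsafe_score` fields, the helper names, and the 0.5 threshold are all hypothetical. In practice the flag would come from a multilingual text filter and the score from an image classifier built on the open_clip model.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Pair:
    """One image-text pair with precomputed safety signals (hypothetical schema)."""
    url: str
    caption: str
    text_flagged: bool          # assumed output of a multilingual text filter
    image_unsafe_score: float   # assumed classifier score in [0, 1]


def keep_pair(pair: Pair, image_threshold: float = 0.5) -> bool:
    """Keep a pair only if neither the text check nor the image check flags it."""
    return (not pair.text_flagged) and pair.image_unsafe_score < image_threshold


def filter_pairs(pairs: Iterable[Pair], image_threshold: float = 0.5) -> List[Pair]:
    """Return the pairs that pass both checks, preserving input order."""
    return [p for p in pairs if keep_pair(p, image_threshold)]


sample = [
    Pair("https://example.com/a.jpg", "a cat on a sofa", False, 0.02),
    Pair("https://example.com/b.jpg", "flagged caption", True, 0.01),
    Pair("https://example.com/c.jpg", "harmless caption", False, 0.91),
]
kept = filter_pairs(sample)
print([p.url for p in kept])  # ['https://example.com/a.jpg']
```

Lowering `image_threshold` in this sketch removes more pairs, including more harmless ones. That is the same trade-off noted above: more false positives in exchange for a safer dataset.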