A major AI training data set contains millions of examples of personal data
TL;DR
AI Generated
A major AI training data set, DataComp CommonPool, contains millions of examples of personal data, including images of passports, credit cards, and birth certificates, according to new research. The researchers found thousands of images with identifiable faces and identity documents in CommonPool, and they estimate that the full data set contains hundreds of millions of such images. Released in 2023, CommonPool consists of 12.8 billion image-text pairs and is used to train generative text-to-image models. The findings raise concerns about personally identifiable information in the data set, highlighting both the privacy risks and the difficulty of filtering such data effectively. The researchers urge the machine-learning community to address these privacy issues and to reconsider the practice of indiscriminate web scraping.