Articles tagged with "AI Training Data, Data Privacy, Machine Learning, Generative Models, Personal Data Security"

MIT Technology Review

A major AI training data set contains millions of examples of personal data

A major AI training data set, DataComp CommonPool, contains millions of examples of personal data, including images of passports, credit cards, and birth certificates, according to new research. The researchers found thousands of images containing identifiable faces and identity documents in CommonPool, and they estimate that the full data set holds hundreds of millions of such images. Released in 2023, the data set consists of 12.8 billion image-text pairs and is used to train generative text-to-image models. The findings raise concerns about personally identifiable information in the data set, underscoring both the privacy risks and the difficulty of filtering such data effectively. The researchers urge the machine-learning community to address these privacy issues and to reconsider the practice of indiscriminate web scraping.
