Technology

A major AI training data set contains millions of examples of personal data

Source: MIT Technology Review

Published: Jul 18, 2025

TL;DR

AI Generated

A major AI training data set, DataComp CommonPool, contains millions of examples of personal data, including images of passports, credit cards, and birth certificates, according to new research. The researchers found thousands of images containing identifiable faces and identity documents in the portion of CommonPool they examined, and they estimate that the full data set holds hundreds of millions of such images. Released in 2023, CommonPool consists of 12.8 billion image-text pairs and is used to train generative text-to-image models. The findings highlight the privacy risks of personally identifiable information in training data and the difficulty of filtering it out effectively. The researchers urge the machine-learning community to address these privacy issues and to reconsider the practice of indiscriminate web scraping.

Similar Articles

YouTube is making its AI Deepfake detection tool available to all creators over the age of 18

YouTube is extending its AI deepfake detection tool to all creators over 18 to counter the growing misuse of individuals' likenesses in AI-generated content. The tool lets creators check whether their face has been used in unauthorized AI videos and request removal if necessary. The initiative aims to give creators peace of mind and early warning of potentially harmful content. The tool, introduced in 2024 for Partner Program members, is now open to all creators over 18, who can enroll through YouTube Studio. Deepfakes have become a significant concern as generative AI advances, and the tool is a step toward stronger security and privacy on the platform.

TweakTown • 4 weeks ago

AI chatbots are giving out people’s real phone numbers

AI chatbots such as Google's Gemini and OpenAI's ChatGPT are inadvertently exposing people's real phone numbers because personally identifiable information (PII) is present in their training data. As a result, some individuals receive calls or messages intended for others. Companies like DeleteMe have seen a 400% increase in privacy requests related to generative AI tools. Efforts to filter out PII have not eliminated the problem, which highlights how hard it is to keep personal data from being surfaced by AI chatbots.

MIT Technology Review • 4 weeks ago

World Models: 10 Things That Matter in AI Right Now

The article discusses the emergence of world models as a significant area of AI research. Executive editor Niall Firth explains why the field is receiving growing attention. MIT Technology Review is hosting a subscriber-only Roundtables discussion on how AI can better understand the real world and what that means for AI systems. The article also points to related stories on AI advances and on the future of AI as envisioned by experts such as Yann LeCun.

MIT Technology Review • 1 month ago

What’s Really Needed For Advanced Test?

The article discusses the importance of data quality in advanced testing within the semiconductor industry. It highlights the challenges related to data plumbing and the need for clean, complete, and correctly associated data. The article emphasizes the significance of good data infrastructure, particularly in metadata consistency and direct data collection at the point of measurement. It also touches on the application of machine learning in test operations, pointing out the current limitations in real-time model inference. Overall, the article underscores the critical role of data quality in enabling intelligent testing and the need for investments in data infrastructure.

SemiEngineering • 1 month ago
