AI Data Sourced from Web Scraping and Privacy Concerns: A Close Examination of CommonPool

AI Integration Today Extends Beyond Boundaries: From medical chatbots guiding patients to generative tools aiding artists, writers, and developers, AI is an integral part of everyday life. Despite their advanced appearance, these systems are tethered to a fundamental element: data. Primarily,...

, and Administrator

2025 September 20, 7:08 AM



In the rapidly evolving world of artificial intelligence (AI), a new dataset named CommonPool is drawing attention for its emphasis on transparency and community participation. Developed by a group dedicated to building open, large-scale multimodal datasets through structured, documented processes, CommonPool aims to address the issues that have plagued earlier AI datasets.

The dataset, which contains approximately 12.8 billion image-text pairs, is designed for multimodal AI research. It introduces a governance model, with versioned releases, structured metadata, and documented update cycles, ensuring a living dataset that evolves over time.

The construction of CommonPool follows a structured three-stage pipeline. In the first stage, automated programs collect data primarily from the public internet, similar to the efforts behind Common Crawl and LAION-5B. However, CommonPool seeks to set itself apart by excluding NSFW content and personally identifiable information (PII) such as names, email addresses, and facial photographs.

In the second stage, the dataset undergoes large-scale deduplication, using perceptual hashing and MinHash techniques to eliminate redundant entries. This stage also keeps the dataset traceable by retaining metadata such as source URLs and timestamps.
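To make the MinHash idea concrete, here is a minimal sketch of near-duplicate removal on captions. It is an illustration under stated assumptions, not CommonPool's actual pipeline: the function names, the 3-word shingles, the 64 hash functions, and the 0.8 threshold are all hypothetical choices, and the image side (perceptual hashing) is not shown because it needs an image library.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles for a caption (k is an assumed setting)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(item: str, seed: int) -> int:
    """64-bit hash of an item; the seed acts as a stand-in for a random permutation."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """Keep the minimum hash value under each of num_perm seeded hash functions."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(captions: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a caption only if it is not too similar to one already kept."""
    kept: list[tuple[str, list[int]]] = []
    for caption in captions:
        sig = minhash_signature(shingles(caption))
        if all(estimated_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((caption, sig))
    return [caption for caption, _ in kept]


if __name__ == "__main__":
    sample = [
        "a red bicycle leaning on a wall",
        "a red bicycle leaning on a wall",
        "two dogs playing in the snow",
    ]
    print(deduplicate(sample))
```

This sketch compares every new signature against every kept one, which is quadratic. At the scale of billions of pairs, a real system would more plausibly bucket signatures with locality-sensitive hashing so that only likely matches are compared.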

The third stage focuses on safety and compliance. Automated face detection and blurring, removal of personal identifiers, and detection of copyrighted materials are applied to further protect privacy and intellectual property. The metadata retained from the earlier stages, such as URLs and timestamps, also supports traceability and partial licensing checks.
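As a rough illustration of what removing one kind of personal identifier can look like, the sketch below redacts email addresses from captions with a regular expression. The pattern, the function name, and the `[EMAIL]` placeholder are assumptions made for this example and are not drawn from CommonPool. Names cannot be caught this way, and face blurring would need a computer-vision library, so neither is shown.

```python
import re

# A deliberately simple email pattern (an assumption for this sketch).
# It will miss unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace every email-like substring with a fixed placeholder."""
    return EMAIL_RE.sub(placeholder, text)


if __name__ == "__main__":
    print(redact_emails("Contact jane.doe@example.com for prints"))
```

Pattern matching of this kind only covers identifiers with a regular shape. Free-form identifiers such as names generally require additional techniques.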

To address concerns about sensitive content, CommonPool has a takedown protocol in place, allowing individuals and institutions to request the removal of such content. This protocol is a crucial step towards building trust and addressing ethical concerns that have arisen from the lack of consent in AI training.
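One way such a takedown request could be honored is by using the retained source URLs to filter records out of the next release. The sketch below is hypothetical: the `Record` fields, the `apply_takedowns` function, and the idea of matching on exact URL or host name are assumptions for illustration, not a description of CommonPool's protocol.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Record:
    """Hypothetical dataset entry carrying the traceability metadata."""
    source_url: str
    caption: str
    timestamp: str


def apply_takedowns(
    records: list[Record],
    blocked_urls: set[str],
    blocked_domains: set[str],
) -> list[Record]:
    """Drop records whose exact URL or host name appears in a takedown list."""
    kept = []
    for record in records:
        host = (urlsplit(record.source_url).hostname or "").lower()
        if record.source_url in blocked_urls or host in blocked_domains:
            continue  # removed on request
        kept.append(record)
    return kept


if __name__ == "__main__":
    records = [
        Record("https://example.com/a.jpg", "a harbor at dusk", "2025-01-01"),
        Record("https://photos.example.org/b.jpg", "a family portrait", "2025-01-02"),
    ]
    remaining = apply_takedowns(records, set(), {"photos.example.org"})
    print([r.source_url for r in remaining])
```

Filtering by metadata only works if source URLs are kept for every record, which is one practical reason traceability and takedown support go together.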

The importance of CommonPool extends beyond the realm of AI research. The global value of AI datasets is estimated at $3.2 billion, with potential growth to $16.3 billion by 2034. As AI models continue to rely on scraped data and datasets grow rapidly in size, the need for transparent, community-driven datasets like CommonPool becomes increasingly significant.

In the past, companies like OpenAI and Stability AI have faced lawsuits for using personal and copyrighted data without consent, highlighting the need for more stringent data collection and handling practices. By prioritising transparency and privacy, CommonPool is a step towards rebuilding public trust and ensuring ethical AI practices.
