AI Data Sourced from Web Scraping and Privacy Concerns: A Close Examination of CommonPool
In the rapidly evolving world of artificial intelligence (AI), a new dataset named CommonPool is drawing attention for its emphasis on transparency and community participation. Developed by a group dedicated to building open, large-scale multimodal datasets through structured, documented processes, CommonPool aims to address the problems that have plagued earlier AI datasets.
The dataset, which contains approximately 12.8 billion image-text pairs, is designed for multimodal AI research. It introduces a governance model, with versioned releases, structured metadata, and documented update cycles, ensuring a living dataset that evolves over time.
The construction of CommonPool follows a structured three-stage pipeline. In the first stage, data is collected primarily from the public internet by automated crawlers, similar to efforts seen in Common Crawl and LAION-5B. However, CommonPool seeks to set itself apart by excluding personally identifiable information (PII) such as names, email addresses, and facial photographs, as well as NSFW content.
In the second stage, the dataset undergoes large-scale deduplication using perceptual hashing and MinHash techniques to eliminate redundancies. This stage also keeps the dataset traceable by retaining metadata such as source URLs and timestamps.
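To illustrate how these two techniques flag near-duplicates, here is a minimal sketch in Python. It is not CommonPool's actual code: the function names, the 3-word shingles, the 64 hash functions, and the assumption that each image has already been reduced to a small grayscale grid are all illustrative choices. MinHash is applied to caption text, and a simple average hash stands in for the broader family of perceptual hashes.

```python
import hashlib

def shingles(text, k=3):
    """Split a caption into overlapping k-word shingles (k=3 is an assumption)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the smallest hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def average_hash(pixels):
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean.

    `pixels` is assumed to be a small grayscale grid (for example 8x8)
    produced by an earlier downscaling step that is not shown here.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(hash_a, hash_b):
    """Number of differing bits; a small distance suggests near-duplicate images."""
    return bin(hash_a ^ hash_b).count("1")

# Example: two captions that differ only in the final word.
sig_a = minhash_signature(shingles("a red bicycle leaning against a brick wall in the sun"))
sig_b = minhash_signature(shingles("a red bicycle leaning against a brick wall in the rain"))
print(estimated_jaccard(sig_a, sig_b))
```

In a pipeline of this kind, pairs whose estimated similarity exceeds a chosen threshold, or whose image hashes differ in only a few bits, would be treated as duplicates and collapsed to a single entry. The thresholds themselves are a tuning decision that the article does not specify.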
The third stage focuses on safety and compliance. Automated face detection and blurring, removal of personal identifiers, and detection of copyrighted materials are applied to further protect privacy and intellectual property. The URLs and timestamps kept as metadata also support traceability and partial licensing checks.
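As a rough illustration of what removing one kind of personal identifier from caption text can look like, the sketch below redacts email addresses with a regular expression. The pattern, the function name, and the `[EMAIL]` placeholder are assumptions made for this example; the article does not describe the specific method CommonPool uses, and a simple pattern like this one will miss obfuscated addresses.

```python
import re

# A deliberately simple email pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_emails(caption, placeholder="[EMAIL]"):
    """Replace every email-like substring in a caption with a placeholder."""
    return EMAIL_RE.sub(placeholder, caption)

print(redact_emails("Contact jane.doe@example.com for prints"))
# prints: Contact [EMAIL] for prints
```

Names and faces are much harder to catch than email addresses, which is why pipelines of this kind typically pair text rules with model-based detectors.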
To address concerns about sensitive content, CommonPool has a takedown protocol in place, allowing individuals and institutions to request the removal of such content. This protocol is a crucial step towards building trust and addressing ethical concerns that have arisen from the lack of consent in AI training.
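Because each entry keeps its source URL, a takedown request can in principle be applied by filtering on that field. The following is a minimal sketch under that assumption; the record layout (a dictionary with a `url` key) and the function name are hypothetical, and the article does not describe how the real protocol is implemented.

```python
def apply_takedowns(records, takedown_urls):
    """Drop records whose source URL appears in the takedown list.

    Returns the surviving records and the number of records removed.
    Matching is exact after trimming whitespace; a real system would
    likely need more careful URL normalisation.
    """
    blocked = {url.strip() for url in takedown_urls}
    kept = [record for record in records if record["url"].strip() not in blocked]
    return kept, len(records) - len(kept)

# Hypothetical records and a single takedown request.
records = [
    {"url": "https://example.com/a.jpg", "caption": "a harbour at dusk"},
    {"url": "https://example.com/b.jpg", "caption": "a family portrait"},
]
kept, removed = apply_takedowns(records, ["https://example.com/b.jpg"])
print(len(kept), removed)
# prints: 1 1
```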
The importance of CommonPool extends beyond the realm of AI research. The global value of AI datasets is estimated at $3.2 billion and could grow to $16.3 billion by 2034. As AI models continue to rely on scraped data and datasets grow rapidly in size, the need for transparent, community-driven datasets like CommonPool becomes increasingly significant.
Companies such as OpenAI and Stability AI have faced lawsuits over their alleged use of personal and copyrighted data without consent, highlighting the need for more stringent data collection and handling practices. By prioritising transparency and privacy, CommonPool represents a step towards rebuilding public trust and ensuring ethical AI practices.