Interview Questions for Kevin Yee, Co-Founder and CTO of BetterData
In the era of data-driven decision-making, privacy concerns have become a significant challenge. However, a host of innovative technologies is emerging to address these issues, ensuring data utility while protecting individual or sensitive information.
One such technique is Differential Privacy, a method used to prevent deep learning models from exposing users' private information in the datasets used to train them. The trade-off between privacy and accuracy in Differential Privacy is controlled by a parameter called ε (epsilon): smaller values give stronger privacy at the cost of accuracy. Companies such as Amazon, Roche, Alphabet's Waymo, Ford, Deloitte, and American Express use synthetic data for purposes including training AI models, clinical research, autonomous vehicles, and fraud detection.
Beyond synthetic data, privacy-preserving technologies are widely used in real-world applications. Differential Privacy, for instance, adds statistical noise to query results or datasets, allowing aggregate analysis without revealing any individual's information. It is used in official statistics and public data releases to ensure privacy.
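As an illustrative sketch (the function names and parameters below are hypothetical, not from any particular library), the classic Laplace mechanism implements this idea: noise scaled to the query's sensitivity divided by ε is added to the true answer.

```python
import random

def laplace_sample(scale: float) -> float:
    # A Laplace(0, scale) variate is the difference of two
    # independent exponential variates with mean `scale`.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(values, epsilon: float) -> float:
    # A counting query has sensitivity 1: adding or removing one
    # person changes the count by at most 1, so noise scale = 1/epsilon.
    return len(values) + laplace_sample(1.0 / epsilon)

ages = [34, 29, 41, 52, 38, 45, 27]
print(private_count(ages, epsilon=0.5))   # noisy count near 7
print(private_count(ages, epsilon=5.0))   # less noise, weaker privacy
```

Smaller ε injects more noise (stronger privacy); larger ε tracks the true count more closely.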
Zero-Knowledge Proofs (ZKP) are another cryptographic innovation that proves knowledge or validity of data without revealing the data itself. They are applied in secure identity verification and blockchain transactions.
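To make the idea concrete, here is a toy Schnorr-style identification protocol, a classic interactive ZKP: the prover convinces the verifier that it knows the discrete log x of y = g^x mod p without ever revealing x. The group parameters below are deliberately tiny and purely illustrative.

```python
import random

# Toy public parameters (real systems use ~256-bit standardized groups).
p = 1019   # safe prime: p = 2*q + 1
q = 509    # prime order of the subgroup generated by g
g = 4      # a generator of the order-q subgroup mod p

x = random.randrange(1, q)   # prover's secret
y = pow(g, x, p)             # public key: y = g^x mod p

def prove(secret, challenge):
    # In the full protocol the commitment t is sent *before* the
    # verifier picks the challenge; collapsed here for brevity.
    r = random.randrange(q)
    t = pow(g, r, p)                    # commitment
    s = (r + challenge * secret) % q    # response (reveals nothing about x)
    return t, s

def verify(t, s, challenge):
    # Accept iff g^s == t * y^c (mod p).
    return pow(g, s, p) == (t * pow(y, challenge, p)) % p

c = random.randrange(1, q)       # verifier's random challenge
print(verify(*prove(x, c), c))   # True: knowledge proven, x never sent
```

A prover who does not know x cannot answer a nonzero challenge correctly, which is what makes the proof sound.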
Trusted Execution Environments (TEE) provide specialized secure hardware areas that isolate sensitive computations and data from the rest of the system. They are used in mobile devices, IoT, and cloud for tasks like encryption, digital rights management, and secure payment processing. For example, Indonesia’s Ministry of Tourism used TEE to securely compute roaming data statistics while preserving user privacy.
Homomorphic Encryption (HE) allows computation on encrypted data without needing decryption, preserving confidentiality while enabling data analytics. It has been demonstrated in Intelligent Transportation Systems (ITS) for privacy-preserving traffic management.
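The core trick can be seen with textbook (unpadded) RSA, which is multiplicatively homomorphic: multiplying two ciphertexts yields a ciphertext of the product of the plaintexts. The key sizes below are deliberately tiny and insecure; practical HE schemes such as Paillier or CKKS support richer operations.

```python
# Textbook RSA demo: toy primes, no padding -- illustration only.
p, q = 1009, 1013
n = p * q                              # public modulus
e = 65537                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))      # private exponent (Python 3.8+)

def encrypt(m): return pow(m, e, n)
def decrypt(c): return pow(c, d, n)

c1, c2 = encrypt(6), encrypt(7)
product_cipher = (c1 * c2) % n         # computed WITHOUT decrypting
print(decrypt(product_cipher))         # 42, i.e. 6 * 7
```

The server multiplying the ciphertexts never learns 6, 7, or 42; only the key holder can decrypt the result.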
Federated Learning (FL) is a decentralized machine learning approach in which models are trained locally on users' devices or across multiple institutions, and only model updates, never the raw data, are shared. It is used extensively in healthcare, finance, mobile devices, and more, enabling collaborative disease detection, fraud detection, and personalization without exposing sensitive data.
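A minimal sketch of the federated averaging (FedAvg) idea, with three hypothetical clients each fitting a one-parameter model y = w * x on private data; only the locally trained weights, never the data points, reach the server.

```python
# Minimal FedAvg sketch: clients train locally, server averages weights.

def local_train(data, w=0.0, lr=0.01, epochs=50):
    # Plain SGD on squared error for the model y = w * x.
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

# Private datasets held on three devices (true relationship: y = 3x).
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(1.5, 4.5), (3.0, 9.0)],
    [(0.5, 1.5), (2.5, 7.5)],
]

global_w = 0.0
for _ in range(5):                            # communication rounds
    local_ws = [local_train(d, w=global_w) for d in clients]
    global_w = sum(local_ws) / len(local_ws)  # server aggregates weights
print(round(global_w, 3))                     # converges toward 3.0
```

Production systems add secure aggregation and differential privacy on top, since raw model updates can still leak information.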
Privacy-Preserving Collaborative Data Analysis Platforms, like PySyft, enable secure, privacy-protected statistical collaboration across organizations or countries without transferring sensitive data. Data owners can review and approve code and outputs to ensure data privacy during analysis.
Secure Multiparty Computation (SMPC) allows multiple parties to jointly compute a function over their private inputs without revealing those inputs to one another, but it relies on expensive cryptographic operations, resulting in high computational cost.
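The simplest flavor is additive secret sharing, sketched below with hypothetical salary figures: each party splits its input into random shares that sum to the secret, so no single share reveals anything, yet the share sums reconstruct the total.

```python
import random

M = 2**61 - 1   # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    # Split a secret into n random shares that sum to it mod M.
    shares = [random.randrange(M) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % M)
    return shares

# Three parties want the total of their salaries without revealing them.
salaries = [72_000, 85_000, 64_000]
all_shares = [share(s, 3) for s in salaries]

# Party i receives the i-th share from everyone and sums them locally.
partial_sums = [sum(col) % M for col in zip(*all_shares)]

# Combining the partial sums reveals only the total, not any input.
total = sum(partial_sums) % M
print(total)   # 221000
```

Real protocols extend this with multiplication gates and malicious-party protections, which is where the heavy cryptographic cost comes from.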
Fairness in AI models means ensuring that they do not discriminate against certain groups of people. A common remedy for dataset bias is to enforce demographic parity across protected subgroups, so that membership in a protected subgroup has no correlation with the predicted outcome of a downstream AI/ML model.
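Demographic parity can be measured directly by comparing positive-outcome rates across subgroups; the toy decisions and group labels below are hypothetical.

```python
# Demographic parity check: the positive-outcome rate should be
# (near) equal across protected subgroups.

def positive_rate(predictions, groups, value):
    selected = [p for p, g in zip(predictions, groups) if g == value]
    return sum(selected) / len(selected)

# Hypothetical binary loan decisions (1 = approved) and group labels.
preds  = [1, 0, 1, 1, 0, 1, 0, 0]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

rate_a = positive_rate(preds, labels, "A")   # 3/4 = 0.75
rate_b = positive_rate(preds, labels, "B")   # 1/4 = 0.25
gap = abs(rate_a - rate_b)
print(gap)   # 0.5 -> far from parity; 0.0 would be perfect parity
```

A generator for synthetic data can be tuned until this gap is near zero, which is the control over the data distribution that real datasets do not offer.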
Synthetic data, a substitute for real data in AI/ML development and data sharing, can help achieve demographic parity by allowing full control over the data generation process. Singapore-based startup betterdata, co-founded by Kevin Yee, is focusing on synthetic data to address these issues.
Yee emphasizes that AI practitioners often view fairness from a quantitative perspective, imposing fairness constraints on sensitive and legally protected attributes such as race, religion, occupation, income, and gender. Synthetic data can lead to "fairer" AI models by mitigating biases present in real data, ensuring that AI models are trained on a more representative and diverse dataset.
Advanced AI techniques like generative adversarial networks (GANs) are used to produce synthetic data that maintains statistical properties of the original data while ensuring privacy.
Lastly, blockchain technology is expected to become increasingly important for tracking data provenance, ensuring transparency, and enabling non-custodial ownership of personal data. It can help businesses innovate by making data portable and auditable, while synthetic data serves as the substitute for real data in AI/ML development and data sharing, avoiding the risk and compliance hurdles of using real data.
These technologies provide practical solutions for privacy concerns in sectors like healthcare, finance, mobile computing, transportation, and government statistics, enabling data utility while protecting individual or sensitive information.