Generative AI in the Age of Synthetic Data: Challenges and Solutions

Speaker
Sina Alemohammad
Ph.D. candidate at Rice University
Title
"Generative AI in the Age of Synthetic Data: Challenges and Solutions"
Abstract
The field of artificial intelligence (AI) is facing a shortage of real-world data to train increasingly large generative models, leading to growing reliance on synthetic data. In my presentation, I will provide both theoretical insights and experimental evidence on the unintended consequences of this shift, along with potential strategies to mitigate the risks. I will show that training new models on synthetic data generated by current or previous models can create a self-consuming cycle that degrades the quality and diversity of the generated data, a phenomenon known as model autophagy disorder (MAD). Prevailing views advise against training on synthetic data altogether to prevent this deterioration into MADness. However, I propose a different approach that uses synthetic data in a way that promotes self-improvement, prevents MAD, and allows the output data distribution to be controlled.
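The self-consuming cycle described above can be illustrated with a toy experiment (this is an illustrative sketch only, not the speaker's experimental setup): fit a simple Gaussian "generative model" to data, sample from it with quality-biased (truncated) sampling, refit on those samples, and repeat. The hypothetical truncation threshold stands in for the common practice of discarding low-probability outputs; over generations, diversity (the standard deviation) collapses.

```python
import random
import statistics

random.seed(0)

def fit(data):
    """'Train' a toy generative model: estimate mean and stddev."""
    return statistics.mean(data), statistics.pstdev(data)

def sample_biased(mu, sigma, n, trunc=1.5):
    """Sample from the fitted model, keeping only 'typical' samples
    (a stand-in for quality-biased sampling of generative models)."""
    out = []
    while len(out) < n:
        x = random.gauss(mu, sigma)
        if abs(x - mu) <= trunc * sigma:  # discard low-probability tails
            out.append(x)
    return out

# Generation 0: real data from N(0, 1).
data = [random.gauss(0.0, 1.0) for _ in range(500)]
sigma0 = statistics.pstdev(data)

# Each generation trains only on the previous generation's samples.
for gen in range(10):
    mu, sigma = fit(data)
    data = sample_biased(mu, sigma, 500)

sigma_final = statistics.pstdev(data)
# Diversity collapses: sigma_final is a small fraction of sigma0.
```

After ten self-consuming generations the fitted standard deviation has shrunk to a few percent of its original value, a simple analogue of the quality and diversity loss that characterizes MAD.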
Self-Improving Diffusion Models with Synthetic Data (SIMS) is a novel training method for diffusion models that leverages self-generated data to provide corrective guidance, steering the generative process away from flawed synthetic data and toward the real-world distribution. In my talk, I will demonstrate that SIMS not only avoids MAD but also achieves self-improvement, setting new benchmarks in the Fréchet inception distance (FID) metric on CIFAR-10 and ImageNet-64 generation, while also delivering competitive results on FFHQ-64 and ImageNet-512. To our knowledge, SIMS is the first generative AI algorithm that can iteratively train on its own synthetic data without succumbing to MAD. Additionally, SIMS can adjust the synthetic data distribution to match any target in-domain distribution, helping to reduce bias and ensure fairness.
About Speaker
Sina Alemohammad is a senior Ph.D. candidate at Rice University, specializing in deep learning theory. His research centers on the implications of synthetic data in light of recent advances in generative AI. Specifically, he investigates self-consuming training loops, watermarking and detection techniques for synthetic data, and robust training algorithms for generative models that use synthetic data.