Most LLMs you use today were already trained on synthetic data.
It’s not a thing of the future.
The big labs use a large model (e.g. gpt-4o) to generate training data for a smaller one (e.g. gpt-4o-mini).
This lets you build fast, cheap models that do one thing well.
This is “distillation”.
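A minimal sketch of what that looks like in code, assuming the OpenAI Python SDK; the prompts, file names, labels, and fine-tunable model snapshot are illustrative, not the labs' actual pipeline:

```python
# Sketch: distill a large "teacher" model's outputs into fine-tuning
# data for a small "student" model. Prompts and file names are illustrative.
import json
from openai import OpenAI

client = OpenAI()

questions = [
    "Is this review positive or negative? 'The battery died after two days.'",
    "Is this review positive or negative? 'Best purchase I made all year.'",
]

examples = []
for q in questions:
    # The large teacher model generates the answers...
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": q}],
    ).choices[0].message.content
    examples.append(
        {"messages": [{"role": "user", "content": q},
                      {"role": "assistant", "content": answer}]}
    )

# ...which become supervised fine-tuning data for the small student model.
with open("distilled.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

file = client.files.create(file=open("distilled.jsonl", "rb"), purpose="fine-tune")
client.fine_tuning.jobs.create(training_file=file.id, model="gpt-4o-mini-2024-07-18")
```

The student never sees hand-labeled data; it learns to imitate the teacher's answers on the narrow task you care about.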
But the vision for synthetic data is much bigger: enabling anyone to train specialized AI systems without having a lot of training data to start with.
Today we are talking to Adrien Morisot, an ML engineer at Cohere.
We talk about how Cohere uses synthetic data to train its models, the lessons they've learned along the way, and how you can use synthetic data in your own training runs.
We are diverging slightly from our usual search focus, but after the episode with Saahil I wanted to do a deeper dive into synthetic data.
You can use it in a lot of places: generating hard negatives, creating training samples for classifiers and rerankers, and much more.
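For instance, here is a minimal sketch of generating hard negatives for reranker training data, again assuming the OpenAI Python SDK; the model name, prompt, and helper function are illustrative assumptions, not the exact approach discussed in the episode:

```python
# Sketch: use an LLM to generate hard negatives for reranker training.
# Model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def hard_negative(query: str, positive: str) -> str:
    """Ask the LLM for a passage that looks relevant to the query but isn't."""
    prompt = (
        f"Query: {query}\n"
        f"Relevant passage: {positive}\n"
        "Write a short passage on the same topic that looks relevant "
        "but does NOT answer the query."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# One (query, positive, hard negative) triple per training example.
query = "How do I reset my router?"
positive = "Hold the reset button for 10 seconds until the lights flash."
triple = (query, positive, hard_negative(query, positive))
```

Hard negatives like these teach a reranker to distinguish passages that merely share vocabulary with the query from passages that actually answer it.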
Scaling Synthetic Data Creation: https://arxiv.org/abs/2406.20094
Adrien Morisot:
Nicolay Gerold:
00:00 Introduction to Synthetic Data in LLMs
00:18 Distillation and Specialized AI Systems
00:39 Interview with Adrien Morisot
02:00 Early Challenges with Synthetic Data
02:36 Breakthroughs and Rediscovery
03:54 The Evolution of AI and Synthetic Data
07:51 Data Harvesting and Internet Scraping
09:28 Generating Diverse Synthetic Data
15:37 Manual Review and Quality Control
17:28 Automating Data Evaluation
18:54 Fine-Tuning Models with Synthetic Data
21:45 Avoiding Behavioral Cloning
23:47 Ensuring Model Accuracy with Verification
24:31 Adapting Models to Specific Domains
26:41 Challenges in Financial and Legal Domains
28:10 Improving Synthetic Data Sets
30:45 Evaluating Model Performance
32:21 Using LLMs as Judges
35:42 Practical Tips for AI Practitioners
41:26 Synthetic Data in Training Processes
43:51 Quality Control in Synthetic Data
45:41 Domain Adaptation Strategies
46:51 Future of Synthetic Data Generation
47:30 Conclusion and Next Steps