Most of the LLMs you use today were already trained on synthetic data.
It’s not a thing of the future.
The large labs use a large model (e.g., GPT-4o) to generate training data for a smaller one (e.g., GPT-4o-mini).
This lets them build fast, cheap models that do one thing well.
This technique is called “distillation”.
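To make the idea concrete, here is a minimal sketch of that distillation loop, assuming the OpenAI Python SDK (v1+) and the GPT-4o / GPT-4o-mini pairing above. The task, example reviews, and file name are invented for illustration, not taken from the episode.

```python
# Minimal distillation sketch: a large "teacher" model labels examples,
# and the pairs are written out as fine-tuning data for a small "student".
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

task = "Classify the sentiment of this review as positive or negative."
reviews = [
    "The battery died after a week.",
    "Best purchase I've made all year.",
]

# 1. The teacher (GPT-4o) generates a label for each input.
training_data = []
for review in reviews:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": f"{task}\n\nReview: {review}\n\nLabel:"}
        ],
    )
    label = response.choices[0].message.content.strip()
    training_data.append(
        {
            "messages": [
                {"role": "user", "content": f"{task}\n\nReview: {review}"},
                {"role": "assistant", "content": label},
            ]
        }
    )

# 2. Write the pairs as JSONL (the chat fine-tuning format), then
#    fine-tune the student (e.g., GPT-4o-mini) on this file.
with open("distillation_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")
```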
But the vision for synthetic data is much bigger:
enabling people to train specialized AI systems without needing a lot of training data.
Today we are talking to Adrien Morisot, an ML engineer at Cohere.
We talk about how Cohere uses synthetic data to train its models, the lessons they have learned along the way, and how you can use synthetic data in your own training.
We are slightly diverging from our search focus, but I wanted to do a deeper dive into synthetic data after our episode with Saahil.
You can use it in a lot of places: generating hard negatives, creating training samples for classifiers and rerankers, and much more.
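As one example from that list, here is a hedged sketch of generating a hard negative for a reranker, using the same assumed OpenAI SDK as above. The query and positive passage are invented examples.

```python
# Hard-negative generation sketch: ask an LLM for a passage that is
# on-topic for the query (same vocabulary, same domain) but does NOT
# answer it. That near-miss quality is what makes the negative "hard".
from openai import OpenAI

client = OpenAI()

query = "How do I rotate an API key without downtime?"
positive = (
    "Issue the new key, deploy it alongside the old one, and revoke the "
    "old key once all traffic has moved over."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": (
                "Write a short technical passage about API keys that is "
                "on-topic for the query below but does NOT answer it.\n\n"
                f"Query: {query}"
            ),
        }
    ],
)
hard_negative = response.choices[0].message.content

# One reranker training triple: relevant passage vs. near-miss.
triple = {"query": query, "positive": positive, "negative": hard_negative}
print(triple)
```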
Scaling Synthetic Data Creation with 1,000,000,000 Personas:
https://arxiv.org/abs/2406.20094

Adrien Morisot:
Nicolay Gerold:
00:00 Introduction to Synthetic Data in LLMs
00:18 Distillation and Specialized AI Systems
00:39 Interview with Adrien Morisot
02:00 Early Challenges with Synthetic Data
02:36 Breakthroughs and Rediscovery
03:54 The Evolution of AI and Synthetic Data
07:51 Data Harvesting and Internet Scraping
09:28 Generating Diverse Synthetic Data
15:37 Manual Review and Quality Control
17:28 Automating Data Evaluation
18:54 Fine-Tuning Models with Synthetic Data
21:45 Avoiding Behavioral Cloning
23:47 Ensuring Model Accuracy with Verification
24:31 Adapting Models to Specific Domains
26:41 Challenges in Financial and Legal Domains
28:10 Improving Synthetic Data Sets
30:45 Evaluating Model Performance
32:21 Using LLMs as Judges
35:42 Practical Tips for AI Practitioners
41:26 Synthetic Data in Training Processes
43:51 Quality Control in Synthetic Data
45:41 Domain Adaptation Strategies
46:51 Future of Synthetic Data Generation
47:30 Conclusion and Next Steps