Elon Musk agrees with other AI experts that there is little real data left to train AI models on.
“We have now basically exhausted the cumulative sum of human knowledge…” Musk said in a conversation with Stagwell Chairman Mark Penn that was livestreamed on X late Wednesday. “That’s basically what happened last year.”
Musk, who owns AI company xAI, was echoing themes that former OpenAI chief scientist Ilya Sutskever raised in a speech at the machine learning conference NeurIPS last December. Sutskever, who said the AI industry has reached what he calls “peak data,” predicted that the shortage of training data will change how models are developed.
Musk suggested that synthetic data – data generated by AI models themselves – is the way forward. “The only way to supplement [real data] is to use synthetic data, where the AI generates [training data],” he said. “With synthetic data… [the AI] grades itself and goes through this self-learning process.”
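The loop Musk describes can be sketched in miniature: a model proposes candidate training examples, a grader scores each one, and only examples passing the grade are kept for the next round of training. The sketch below uses hypothetical stand-in functions (`generate_candidates`, `grade`, `curate`) on a toy arithmetic task; it is an illustration of the idea, not any lab's actual pipeline.

```python
import random

# Toy sketch of a self-grading synthetic-data loop. All functions are
# hypothetical stand-ins, not a real training API.

def generate_candidates(n):
    """Stand-in generator: emits (prompt, answer) pairs; some answers are wrong."""
    pairs = []
    for _ in range(n):
        a, b = random.randint(0, 9), random.randint(0, 9)
        noise = random.choice([0, 0, 0, 1])  # roughly a quarter are off by one
        pairs.append((f"{a}+{b}", a + b + noise))
    return pairs

def grade(prompt, answer):
    """Stand-in grader: scores a candidate by re-deriving the answer."""
    a, b = map(int, prompt.split("+"))
    return 1.0 if answer == a + b else 0.0

def curate(candidates, threshold=1.0):
    """Keep only candidates the grader rates at or above the threshold."""
    return [(p, ans) for p, ans in candidates if grade(p, ans) >= threshold]

random.seed(0)
batch = generate_candidates(100)
training_set = curate(batch)
print(f"kept {len(training_set)} of {len(batch)} synthetic examples")
```

In a real system the generator and grader would both be large models (possibly the same one), and the curated set would feed the next training run.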
Other companies, including tech giants Microsoft, Meta, OpenAI, and Anthropic, already use synthetic data to train their flagship AI models. Gartner estimated that 60% of the data used for AI and analytics projects in 2024 would be synthetically generated.
Microsoft’s Phi-4, open-sourced on Wednesday morning, was trained on a mix of synthetic and real data. The same is true of Google’s Gemma models. Anthropic used some synthetic data to build one of its best-performing systems, Claude 3.5 Sonnet, and Meta used AI-generated data to fine-tune its latest Llama series models.
Training on synthetic data also has other benefits, such as cost savings. AI startup Writer claims that its Palmyra X 004 model, built almost entirely from synthetic sources, cost just $700,000 to develop, compared with an estimated $4.6 million for a comparably sized OpenAI model.
But there are also disadvantages. Some studies have shown that synthetic data can lead to model collapse, in which models become less “creative” and more biased, eventually impairing their functionality seriously. And because synthetic data is itself generated by models, any biases and limitations in those models’ training data will similarly taint the output.