Artificial intelligence companies have amassed billions in funding, hoovering up data across the internet to train their models. That data includes copyrighted musical works, as Elon Musk confirmed at the DealBook Summit last year.
Anthropic has stated that it believes training AI models on copyrighted content is ‘fair use,’ so long as the output does not contain verbatim reproductions of copyrighted works. Major players in the music industry disagree with that assessment, of course, and those disagreements will be litigated in court. But is there any music that is safe from being scraped and trained on?
Former OpenAI Co-Founder and Chief Scientist Ilya Sutskever says ‘not really.’ Speaking at the Conference on Neural Information Processing Systems (NeurIPS) in Vancouver, Sutskever said the era of pre-training AI models on data scraped from the internet will soon end. “Pre-training as we know it will unquestionably end,” Sutskever told the audience. That’s because the industry is tapped out on new data to train on.
“We’ve achieved peak data and there’ll be no more,” Sutskever says. “We have to deal with the data that we have. There’s only one internet.” And our ‘one’ internet contains recorded music readily available to download on piracy websites, which many of these AI companies have liberally scraped for movies, books, music, and more.
Sutskever believes AI will move beyond large language models (LLMs) that can recite (and hallucinate) data back to you; instead, he sees the field heading toward an ‘agentic’ future: autonomous AI systems that can perform tasks, make decisions, and interact with software on their own. We saw glimpses of that future with hardware devices like what the Rabbit R1 promised to be, though the actual implementation of that device highlights how wrong AI can be.
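To make ‘agentic’ concrete, here is a minimal sketch of the control loop such systems run: the model chooses an action, the system executes it, and the result is fed back in until the model decides it is done. Everything in this sketch (the `call_model` stub, the tool names) is a hypothetical placeholder for illustration, not any vendor’s actual API.

```python
# Minimal sketch of an agentic loop. `call_model` is a stub standing in
# for a real LLM call; its canned replies just demonstrate the control flow.

def call_model(history):
    # A real implementation would send `history` to a hosted model here.
    if any(m["role"] == "tool" for m in history):
        # Once a tool result exists, pretend the model wraps up.
        return {"type": "final_answer", "content": history[-1]["content"]}
    # Otherwise, pretend the model asks for a tool call.
    return {"type": "tool_call", "tool": "search", "input": history[0]["content"]}

# Hypothetical tools the agent is allowed to invoke.
TOOLS = {
    "search": lambda query: f"results for {query!r}",  # stubbed search tool
}

def run_agent(task, max_steps=5):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(history)          # model decides what to do next
        if action["type"] == "final_answer":  # model decides it is finished
            return action["content"]
        observation = TOOLS[action["tool"]](action["input"])  # run the tool...
        history.append({"role": "tool", "content": observation})  # ...and loop
    return "step limit reached"

print(run_agent("find upcoming tour dates"))
```

The key design point is that the model itself drives the loop, deciding when to act and when to stop, rather than producing a single text completion.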
Current AI models are mostly pattern-matching based on what the model has seen before. A model trained on pictures of butterflies can accurately label a butterfly, but show it a ladybug it has never seen and it is clueless. Future AI is being trained to reason its way through problems step by step, more akin to how humans actually think.
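A toy illustration of that limitation, assuming scikit-learn and made-up numeric stand-ins for image features: a classifier trained only on ‘butterfly’ and ‘moth’ examples can only ever answer with those two labels, so it has no way to say ‘ladybug,’ no matter how unfamiliar the input looks.

```python
# Toy illustration: a classifier can only choose among labels it was trained on.
# The 2D features are made-up stand-ins for image embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
butterflies = rng.normal(loc=[1.0, 1.0], scale=0.2, size=(50, 2))
moths = rng.normal(loc=[-1.0, -1.0], scale=0.2, size=(50, 2))

X = np.vstack([butterflies, moths])
y = ["butterfly"] * 50 + ["moth"] * 50

clf = LogisticRegression().fit(X, y)

ladybug = [[3.0, -3.0]]  # a point unlike anything in the training set
print(clf.predict(ladybug))        # still answers "butterfly" or "moth"
print(clf.predict_proba(ladybug))  # and is often confidently wrong
```

The model never flags the input as unfamiliar; it simply forces the ladybug into the nearest pattern it already knows.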
“The new ladder to climb,” added Stanford’s Fei-Fei Li, “is the 3D ladder, which I call spatial intelligence.” Li likened relying on 2D data from the internet to building an AI for a ‘flat earth.’ In other words, future AI development is focused on teaching models to reason the way humans do, rather than regurgitate information (LLMs) or generate strings of similar-sounding patterns based on previous input (Udio).