In a bid to make each large language model (LLM) more powerful than the last, AI companies have used up almost all of the open internet and are running out of data. They may be forced to train their upcoming models on AI-generated data, which has its own problems.

AI companies are facing a monumental challenge, one that could render pointless the billions of dollars that Big Tech is investing in them: they are running out of internet.

In the race to develop ever-larger and more advanced large language models, AI companies have consumed practically all of the open internet and now face an imminent shortage of training data, as reported by the Wall Street Journal.

This shortage is pushing some firms to seek alternative sources of training data, such as publicly available video transcripts and AI-generated “synthetic data”. However, training AI models on AI-generated data is a problem in and of itself: it increases the chance of the models hallucinating.

Furthermore, discussions around synthetic data have raised serious concerns about the consequences of training AI models on AI-generated data. Experts believe that relying too heavily on such data leads to digital “inbreeding”, which could eventually cause a model to collapse in on itself.

While companies like DatologyAI, founded by former Meta and Google DeepMind researcher Ari Morcos, are exploring ways to train large models with less data and fewer resources, most major players are experimenting with unconventional and contentious approaches to sourcing training data.

OpenAI, for example, is considering training its GPT-5 model on transcriptions of publicly available YouTube videos, according to sources cited by the WSJ, even though the company is already facing criticism for using such videos to train Sora and may face lawsuits from video creators.

Nevertheless, companies like OpenAI and Anthropic plan to address the shortage by developing higher-quality synthetic data, although the specifics of their methods remain unclear.

Fears that AI companies will run out of data have been around for quite some time. Some, like Epoch researcher Pablo Villalobos, estimate that AI could exhaust its usable training data in the coming years, yet there is a prevailing sentiment that significant breakthroughs could mitigate these concerns.

However, there is an alternative solution to this dilemma: AI companies could refrain from pursuing ever-larger and more advanced models, given the environmental toll of developing them, including significant energy consumption and the reliance on rare-earth minerals for computing chips.

(With inputs from agencies)

