Data is the lifeblood of artificial intelligence, and machine learning models are only as good as the data they are trained on. With ChatGPT hype still going strong and Google racing to catch up with its own AI chatbot, Bard, we are convinced that what we are witnessing is only the beginning of the AI revolution.

Yet a lack of data might put the brakes on this rapid AI development.

A study carried out by researchers at Epoch predicts that we may run out of data to fuel machine learning models as soon as 2060. This blog explores the predictions outlined in the study and the possible consequences of data scarcity for AI.

AI developers use high-quality data from books, news articles, scientific papers, Wikipedia and filtered web content to train language models. The main distinction between high-quality and low-quality data is that high-quality data tends to be produced by professionals, while the rest, user-generated text such as social media posts, website comments and blogs, falls into the low-quality category.

It’s no surprise that low-quality data vastly outweighs high-quality data: Twitter alone produces 2 to 20 trillion user-generated words annually. Yet researchers are reluctant to trade quality for quantity. A previous attempt by Microsoft to incorporate user-generated data in AI training was shut down within a day after its chatbot Tay started copying Twitter users’ harmful behaviour.

The Epoch researchers calculate that the stock of high- and low-quality language data currently grows by about 7% annually, but they predict this growth rate will fall to 1% by 2100.
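To get a feel for what a slowdown from 7% to roughly 1% annual growth means in practice, here is a minimal back-of-the-envelope sketch. The linear interpolation of the growth rate between 2023 and 2100 and the starting stock of 1.0 are our own illustrative assumptions, not figures taken from the Epoch study.

```python
# Illustrative only: project the relative size of the language-data stock
# under a growth rate that declines linearly from 7% (2023) to 1% (2100).
# The interpolation scheme and the starting value are assumptions for this sketch.

START_YEAR, END_YEAR = 2023, 2100
START_RATE, END_RATE = 0.07, 0.01

stock = 1.0  # arbitrary units; only relative growth matters here
for year in range(START_YEAR, END_YEAR):
    # Linearly interpolate the annual growth rate between the two endpoints
    progress = (year - START_YEAR) / (END_YEAR - START_YEAR)
    rate = START_RATE + progress * (END_RATE - START_RATE)
    stock *= 1 + rate

print(f"Stock in {END_YEAR} is roughly {stock:.0f}x the {START_YEAR} level")
# For comparison, 77 years at a constant 7% would give about 1.07**77 ≈ 183x.
```

The point of the comparison is that a decelerating growth rate compounds into a dramatically smaller data stock by the end of the century than a constant 7% would suggest.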

The study identifies three factors that determine future data production rates: the human population, the internet penetration rate, and the average amount of data produced by each user. These factors predict the rate of low-quality data production reasonably well but are less reliable for high-quality data. Some high-quality data, such as website content or Wikipedia, is user-generated and can be estimated with the same model; the rest is produced by subject matter experts (scientists, authors) and is not directly driven by the factors used in the model.
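A rough sketch of how such a production model fits together is shown below. The specific numbers (population, internet penetration, words per user per year) are placeholder assumptions chosen for illustration, not the values used by the Epoch researchers.

```python
# Minimal sketch of the user-generated ("low-quality") data production model:
# annual word production ≈ population × internet penetration × words per user.
# All numbers below are illustrative placeholders, not figures from the study.

def annual_word_production(population: float,
                           internet_penetration: float,
                           words_per_user_per_year: float) -> float:
    """Estimate the number of words produced online in a given year."""
    internet_users = population * internet_penetration
    return internet_users * words_per_user_per_year

# Example: 8 billion people, ~65% online, ~10,000 words per user per year (assumed)
estimate = annual_word_production(8e9, 0.65, 1e4)
print(f"Roughly {estimate:.2e} words per year")  # ≈ 5.2e13 words
```

Projecting each of the three inputs forward in time (population growth, rising internet penetration, per-user output) is what turns this snapshot into a forecast of future data production.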

The growth of the global economy should lead to greater investment in science, which would in turn increase high-quality data production. Over the last 20 years, OECD countries have spent roughly 2% of their GDP on R&D. If that trend holds, high-quality data accumulation should be approximately proportional to the size of the world economy, which is predicted to grow by about 2% annually on average.
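If high-quality data production really does track world GDP, a constant ~2% growth rate implies simple compounding. A minimal worked example, assuming an arbitrary starting stock of 1.0:

```python
# Illustrative compounding: if high-quality data production grows ~2% per year
# in line with the world economy, the stock roughly doubles in about 35 years
# (rule of 72: 72 / 2 ≈ 36). The starting stock of 1.0 is an arbitrary unit.

GROWTH_RATE = 0.02
stock = 1.0
for year in range(35):
    stock *= 1 + GROWTH_RATE

print(f"After 35 years: about {stock:.2f}x the initial stock")  # ≈ 2.00x
```

In other words, economy-driven growth in expert-produced data is steady but slow compared with the appetite of ever-larger language models.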

The latest AI prodigy, ChatGPT, was trained on internet-sourced datasets comprising 570 GB of data from books, Wikipedia, research articles, websites and other forms of internet content, with approximately 300 billion words fed into the system. But will we have enough data to train the next generation of ChatGPT?

AI researchers could face data scarcity as early as 2026 if they keep using only high-quality data to train ML models. The study’s authors identify possible solutions, such as training ML models on smaller datasets, extracting high-quality data from low-quality sources, or using synthetic data. Could data scarcity become a roadblock on the path of AI evolution? What steps can we take to prepare for this potential challenge and ensure the continued growth of this technology?
