Why big data is becoming small
THE first two decades of this century witnessed people’s obsession with data. They tried to collect as much information as they could, apparently to develop a data-driven ‘winning strategy’ for every facet of life, though often without knowing clearly what for. There was an insane rush to gather data from any available source, without understanding how to leverage it. The advent of the Internet and social media spurred this trend.
In recent years, generative artificial intelligence (AI) has taken centre stage, and the gigantic volume of data that people stored but couldn’t use has found applications. The development and effectiveness of AI systems — their ability to learn, adapt and make informed decisions — are fuelled by data. About 570 gigabytes of text data, or around 300 billion words, were used to ‘train’ ChatGPT. Image generators are similarly data-hungry: the Stable Diffusion model was ‘trained’ on around 5.8 billion image-text pairs, and applications such as DALL-E and Midjourney likewise rely on vast image-text datasets. ‘Trained’ means that models are taught to identify patterns in data and then produce new, related data by applying those patterns. For instance, generative AI can produce meaningful sentences if it is ‘trained’ on English text, whereby it learns the statistical probability of one word following another.
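To make that principle concrete, here is a minimal sketch in Python: a toy ‘bigram’ model that counts which word follows which in a tiny corpus and then generates new text from those counts. It illustrates the statistical idea only; it is not how ChatGPT or any of the systems named above is actually built.

```python
import random
from collections import defaultdict

# Toy corpus standing in for 'training' text; a real model ingests billions of words.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# 'Training': record which words follow which (bigram statistics).
follows = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word].append(next_word)

# Generation: repeatedly pick an observed next word; duplicates in the list
# make the choice frequency-weighted, so statistically likelier words win.
word = "the"
output = [word]
for _ in range(8):
    candidates = follows.get(word)
    if not candidates:
        break
    word = random.choice(candidates)
    output.append(word)

print(" ".join(output))  # e.g. "the cat sat on the mat and the dog"
```

Real language models replace these raw counts with billions of learned parameters, but the underlying idea of predicting the next word from observed statistics is the same.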
An algorithm will produce inaccurate or low-quality output if it’s ‘trained’ on an insufficient amount of data. But it’s getting harder to find suitable ‘natural data’ — information derived from the real-world environment that is unprocessed or only lightly processed — which is essential for AI systems to advance. Thus, in the AI era, data is suddenly becoming scarce: big data, in effect, is shrinking into small data. The AI economy’s impending data crisis has turned the overall equation around.
A team of Epoch researchers published a paper in 2022 titled ‘Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning’. These researchers predicted that we would run out of high-quality text data before 2026 if the current AI ‘training’ trend continued. The researchers also estimated that low-quality language data would be exhausted sometime between 2030 and 2050 and low-quality image data between 2030 and 2060. “The current trend of ever-growing ML (Machine Learning) models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available,” they said.
In practice, to ‘train’ language models, AI developers use high-quality data from books, news stories, academic papers, Wikipedia and filtered Web content, the majority of it created by professionals. The rest, derived from user-generated texts such as blog entries, social media posts and comments on websites, is classified as low-quality. Text gleaned from social media platforms may also be biased or prejudiced, or contain illegal content, and a model ‘trained’ on it may reproduce those flaws. Overall, if the supply of natural data stalls, the models, and in effect the industry as a whole, may stagnate.
Are we, then, looking at a forced pause in AI research? Incidentally, a plea to ‘Pause Giant AI Experiments’ was co-signed by many eminent people, including Tesla CEO Elon Musk, Apple co-founder Steve Wozniak and famed author Yuval Noah Harari, in an open letter last year. So, is the impending data shortage a cause for concern about the trajectory of AI’s development? And is there any solution?
It seems plausible, however, that in the coming years high-performing AI systems will be ‘trained’ with less data, and perhaps even less computational power. This would also lessen AI’s carbon footprint. Meanwhile, several content creators have sued big AI companies for using their work to ‘train’ models. Paying people for their work might help restore some of the power imbalance between AI companies and creatives, and it might partially solve the data problem as well. When Hollywood actors were on strike last year, according to an MIT Technology Review article published in October, tech companies were offering out-of-work actors a gig: get paid $150 an hour to act out a variety of emotions in front of a camera to ‘train’ AI.
Creating ‘synthetic’ or ‘simulated’ data to ‘train’ AI systems is another option: developers can simply generate the data they need, curated to suit their AI model. According to Gartner, 60 per cent of the data used for AI will be synthetic by the year-end, up from 1 per cent in 2021.
Cost-effectiveness, the potential for data augmentation, privacy protection, the ability to create scenarios on demand and increased diversity and representativeness are some of the advantages of synthetic data. However, synthetic or simulated data might not precisely reflect real-world situations, which puts the reliability and effectiveness of AI systems at serious risk. Problems regarding transparency and the threat of bias would still be present. Furthermore, synthetic data should be validated to make sure it adequately reflects real-world data and is appropriate for the intended use.
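As a concrete illustration of that last point, here is a hedged sketch in Python: it generates a synthetic version of a hypothetical numeric feature from statistics estimated on ‘real’ samples, then applies a two-sample Kolmogorov-Smirnov test to check that the two distributions roughly agree. The feature, sample sizes and threshold are invented for illustration; real validation pipelines add domain-specific checks on top.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Hypothetical 'real' measurements (in practice, sampled from production data).
real_data = rng.normal(loc=70.0, scale=12.0, size=5_000)

# Synthetic data generated from parameters estimated on the real sample.
synthetic_data = rng.normal(loc=real_data.mean(), scale=real_data.std(), size=5_000)

# Validation: a two-sample Kolmogorov-Smirnov test compares the distributions.
statistic, p_value = stats.ks_2samp(real_data, synthetic_data)

# Illustrative threshold only.
if p_value < 0.05:
    print(f"Synthetic data drifts from the real distribution (p={p_value:.3f})")
else:
    print(f"Synthetic data is statistically consistent with the real data (p={p_value:.3f})")
```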
Synthetic data is a helpful temporary solution, but it doesn’t fully address the problem of data supply. It might work well for some uses, like face recognition, but might perform poorly for others, like natural language processing. Because of this, companies may need to adopt a more focused approach to data production, putting quality above quantity.
Generative models don’t just consume data; they produce it as well. It’s getting harder to separate good data from the junk churned out by spam bots, image generators, model hallucinations and deepfakes. And as the information haystacks get bigger, the finer signals become harder to find when the input data is laced with garbage.
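One partial defence is heuristic quality filtering before ‘training’. The Python sketch below shows the flavour of such a filter; the rules and thresholds here are invented for illustration and are far cruder than anything a production pipeline would use.

```python
# A toy quality filter: crude heuristics that flag likely junk text.
# Rules and thresholds are illustrative only.

def looks_like_junk(text: str) -> bool:
    words = text.split()
    if len(words) < 5:                       # too short to carry signal
        return True
    if len(set(words)) / len(words) < 0.5:   # highly repetitive, spam-like
        return True
    if sum(c.isalpha() for c in text) / max(len(text), 1) < 0.6:
        return True                          # mostly symbols or markup debris
    return False

documents = [
    "Buy now buy now buy now buy now buy now!!!",
    "The researchers measured how dataset quality affects model accuracy.",
    "@@## $$ ++ ~~ click here ~~ ++ $$ ##@@",
]

clean = [doc for doc in documents if not looks_like_junk(doc)]
print(clean)  # keeps only the second document
```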
What happens if this data is used to ‘train’ AI systems? Wouldn’t the AI models be increasingly biased with such ‘training’? The shadow of uncertainty looms large.