Despite years of explosive data growth, there may not be enough for AI
In a world of so much data, it seems counterintuitive to think we might run out of it. Yet that is precisely the situation when it comes to the availability of datasets to train AI models, specifically large language models or LLMs. On the most aggressive estimates, we could hit exhaustion in as little as 18 months.
AI, at its core, has always been a ‘data problem’.
This is typically a reference to data preparation: having clean, structured data in one central repository, from where it can be easily ingested into AI models. Organisations have data; it just isn't in one place, in one format, and conforming to one set of standard definitions that would make it immediately usable as a training set. Recent ESG research suggests 77% of organisations still have these data quality issues.
Solving this removes a key roadblock and enables organisations to progress generative AI initiatives.
However, there is a second ‘data problem’ on the horizon that is as big as the first, or bigger. Despite explosive growth over the years, data volumes suddenly appear finite compared to the voracious training appetites of LLMs. There is a real prospect that AI models will run out of data to train on.
Some of the shortages relate to privacy and consent. Just because a dataset exists doesn’t mean it can be used for AI model training, as some organisations are finding out.
Still, volume is anticipated to become a problem in and of itself. Estimates vary, but one study suggests exhaustion of text-based data between 2026 and 2032, and of image-based data from 2030 onwards.
Interest in AI is unlikely to wane between now and then, so the question becomes: what can organisations do to continue to innovate with AI in the face of training data exhaustion?
Organisations either need to find new ways to grow the amount of training data they can access, or find ways to make AI models much more focused and specific, reducing the size of the model and its training data needs. Efficient storage infrastructure that is optimised for AI use cases will aid organisations, no matter which path they choose.
Response 1: Generate synthetic data
The obvious response to a looming data shortage is to find more data to fill the gap. That data will have to come from somewhere, and that somewhere is likely to be AI itself: using the machine to build more data for the machine to consume.
AI-generated synthetic or simulated data holds the promise of producing enough data to continue training LLMs. The finite pool of real data that is available can be used to validate the accuracy of outputs from a model trained largely on synthetic data. Careful attention, however, will need to be paid to the quality of the synthetic data to ensure that biases are not amplified through the way it is generated.
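To make the idea concrete, the pattern looks something like the sketch below: the scarce real data is split so that a portion is held out purely for validation, the remainder seeds an AI generator that produces a much larger synthetic corpus, and the resulting model is only ever scored against real examples. This is a minimal illustration in Python under stated assumptions; the generator, fine_tune and evaluate callables are hypothetical placeholders, not any specific tool's API.

```python
# Illustrative sketch only: generator, fine_tune and evaluate are hypothetical
# callables supplied by the caller, not the API of a particular library.
import random

def train_with_synthetic_data(real_examples, generator, fine_tune, evaluate,
                              synthetic_count=100_000, holdout_fraction=0.2):
    """Train largely on synthetic data, validate only against held-out real data."""
    # Reserve a slice of the scarce real data purely for validation,
    # so accuracy is always measured against genuine examples.
    random.shuffle(real_examples)
    split = int(len(real_examples) * holdout_fraction)
    holdout, seed_set = real_examples[:split], real_examples[split:]

    # Use the generator (itself an AI model) to expand the seed set
    # into a much larger synthetic training corpus.
    synthetic = [generator(random.choice(seed_set)) for _ in range(synthetic_count)]

    # Train mostly on synthetic data, keeping the real seeds in the mix
    # to anchor the distribution and limit amplification of generator artefacts.
    model = fine_tune(training_data=synthetic + seed_set)

    # Score the model exclusively on real, held-out examples.
    score = evaluate(model, holdout)
    return model, score
```

Keeping some real examples in the training mix, and validating exclusively on held-out real data, are the two simplest guards against a generator quietly amplifying its own biases.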
While this would allow LLM development and training to continue much as it does today, it will pose resource issues. AI today is resource-intensive, requiring arrays of high-powered GPUs in servers in high-density rack environments. These have high energy consumption and cooling requirements, and will increasingly be served out of data centre facilities specifically designed to run AI workloads. Efficient data storage infrastructure is also required to store this huge influx of synthetic data and make it available for training the LLM.
Response 2: Shrink the models
While Australian enterprises may not be training LLMs themselves, they are big users of LLMs, so they will inevitably be impacted by what happens in that space. As training data exhaustion becomes an issue, and there are question marks over whether the gap can be accurately filled in by synthetic data, enterprises are likely to take matters into their own hands.
This is being observed with the rise of small-language models (SLMs) — smaller, more specialised AI models that are primed for more tailored task-based use cases in enterprise environments.
This approach is a natural evolution for many enterprises that ran their initial experiments through a single LLM. Rather than use one big model for everything, enterprises are seeing value in a multi-model approach, where a different SLM is used for each individual use case. While the data requirements for an SLM are far lower than for an LLM, even a smaller model still requires a substantial amount of data on its particular subject to be effective.
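In practice, the multi-model approach often amounts to little more than a routing layer in front of a set of task-specific models. The sketch below illustrates the pattern in Python; the model names, task labels and classify-by-task dispatch are hypothetical examples, and the point is the structure rather than any particular framework.

```python
# Illustrative sketch only: model names and task labels are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class TaskModel:
    name: str
    run: Callable[[str], str]  # a small, task-specific model behind any interface

class SLMRouter:
    """Route each request to the small model trained for that use case."""

    def __init__(self, models: Dict[str, TaskModel], fallback: TaskModel):
        self.models = models
        self.fallback = fallback

    def handle(self, task: str, prompt: str) -> str:
        # Pick the specialist for the task; fall back to a general model
        # when no specialist exists for that use case.
        model = self.models.get(task, self.fallback)
        return model.run(prompt)

# Usage: one specialist per use case, added as new needs emerge.
router = SLMRouter(
    models={
        "invoice_extraction": TaskModel("invoice-slm", run=lambda p: "..."),
        "support_triage": TaskModel("triage-slm", run=lambda p: "..."),
    },
    fallback=TaskModel("general-slm", run=lambda p: "..."),
)
print(router.handle("support_triage", "Customer reports a login failure"))
```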
The overall volume of data needed is also significant because enterprises are not going to run just one SLM; they may run hundreds of such models, limited only by the usual constraints of resourcing and imagination.

While an SLM approach still requires a lot of data, what it does enable is for enterprises to run their AI workloads far more efficiently. Each SLM can operate on a more contained basis, and enterprises will want to maximise the efficiency of each one, so they can support more and more use cases without a linear or multiplier impact on infrastructure costs and the operational burden that comes with them. Aligning with a storage infrastructure provider that prioritises efficiency to support AI workloads can significantly help enterprises navigate these emerging challenges.