Data quality is the key to generative AI success
The quality of outputs created by generative AI applications and large language models (LLMs) is heavily determined by the quality of the data the models are trained on. This is why data is the most critical element determining the success of organisational generative AI projects.
When AI models are trained with poor-quality, biased or incomplete data, they can generate spurious results. Like any computer system, generative AI is subject to the ‘garbage in, garbage out’ principle. No matter how well programmed the algorithms in the models are, training data quality remains the key determinant in how well these tools work in practical application.
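As a rough illustration of catching 'garbage' before it reaches a model, the sketch below counts incomplete and duplicated records in a small dataset. It is a minimal example under stated assumptions, not a full data quality framework: the `audit_records` function, the `QualityReport` class and the `id` and `text` field names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    total: int
    incomplete: int
    duplicates: int


def is_missing(value):
    """Treat None and blank strings as missing values."""
    return value is None or (isinstance(value, str) and not value.strip())


def audit_records(records, required_fields):
    """Count records with missing required fields and exact duplicate records."""
    seen = set()
    incomplete = 0
    duplicates = 0
    for record in records:
        # A record is incomplete if any required field is absent or blank.
        if any(is_missing(record.get(name)) for name in required_fields):
            incomplete += 1
        # Build a hashable fingerprint of the record to detect exact repeats.
        fingerprint = tuple(sorted((key, str(value)) for key, value in record.items()))
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)
    return QualityReport(len(records), incomplete, duplicates)


# Hypothetical sample data: one blank field, one missing field, one repeat.
records = [
    {"id": 1, "text": "Refunds are accepted within 30 days."},
    {"id": 2, "text": ""},
    {"id": 1, "text": "Refunds are accepted within 30 days."},
    {"id": 3},
]
report = audit_records(records, ["id", "text"])
print(report)
```

A report like this only flags structural problems. Bias and factual accuracy still need review by people who know the data.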
When Microsoft launched its chatbot Tay in 2016, the importance of training data was laid bare. While there’s no doubt the AI programming was excellent, Tay was allowed to ‘learn’ how to converse by using Twitter (now X) as one of its training data sources. Tay was switched off after 16 hours as the data it used from Twitter resulted in extremely offensive responses to questions. Tay’s successor, Zo, was eventually discontinued for going too far the other way and avoiding potentially controversial topics.
These early forays into generative AI highlight the importance of data as a foundation of this emerging technology.
Public-facing AI models such as ChatGPT and Google Gemini (formerly Google Bard) show what can happen when AI models get it wrong, even when they are trained on vast amounts of data. Where their training data has gaps, the models can try to fill in the blanks, and the result can be an AI hallucination: a plausible-sounding answer that is not grounded in fact.
The corporate sector, government agencies and other organisations that want to leverage generative AI and LLMs often face a decision. Should they invest their efforts in finding data scientists to help refine their models, or focus on ensuring the data they feed their models is as accurate and up to date as possible? Given the choice between better data and more data scientists, better data will deliver a greater return on investment.
The challenge for organisations is not whether they have enough data to adequately train their AI models. The challenge is knowing where the data is and making it securely accessible to AI models. And the data must be made available to the AI models quickly so new or changed information is integrated promptly.
In most organisations, data is stored in multiple systems, each with its own structure and security. The data may reside in cloud services and other offsite locations, as well as in on-premises repositories. This dispersion makes it difficult to use legacy approaches to data management to make information available to AI models.
A modern data preparation studio can make data available to LLMs and generative AI models in near real time. Rather than copying data, which can be technically complex, time consuming and costly, a data preparation studio can make that data accessible to those models without duplication while respecting data governance and security frameworks.
When AI models and LLMs are trained with accurate and current data from your systems, the likelihood of errors and hallucinations is reduced.
While generative AI projects have a strong technical component, this does not remove people from the equation. It remains critical that a person vets any output from a generative AI tool before it is used for a business decision. If an error is found, it needs to be fed back to the development team so they can refine the model and remove any erroneous data.
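The review step described above can be pictured as a simple queue in which nothing is released until a person has approved it, and every rejection is kept as feedback for the development team. This is a minimal sketch only: the `ReviewQueue` class and its method names are hypothetical, and a real workflow would add reviewer identity, audit logging and persistent storage.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewQueue:
    """Holds AI outputs until a person approves or rejects them."""

    pending: dict = field(default_factory=dict)
    approved: dict = field(default_factory=dict)
    feedback: list = field(default_factory=list)

    def submit(self, output_id, text):
        # New AI output waits here until a person has looked at it.
        self.pending[output_id] = text

    def approve(self, output_id):
        # Only approved outputs become available for business decisions.
        self.approved[output_id] = self.pending.pop(output_id)

    def reject(self, output_id, reason):
        # Rejected outputs are recorded so the development team can act on them.
        text = self.pending.pop(output_id)
        self.feedback.append({"id": output_id, "text": text, "reason": reason})


queue = ReviewQueue()
queue.submit("q1", "Forecast revenue rises next quarter.")
queue.submit("q2", "The warranty period is 10 years.")
queue.approve("q1")
queue.reject("q2", "Warranty period does not match the policy document.")
print(queue.approved)
print(queue.feedback)
```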
The success of generative AI projects is strongly dependent on the quality of the data the models are trained with. By using a data preparation studio that enables information to be used without copying it or increasing security risks, enterprise generative AI projects can be delivered faster and with less risk, bringing forward the return on investment.