Without a doubt, GenAI is in the hype cycle. Whether it has reached its early peak is hard to say, but the technology has the potential to fundamentally disrupt how we work and live. The range of GenAI use cases, and the places it can add value, seems nearly limitless. Mark Cuban believes that the first trillionaire will be the innovator who optimizes AI monetization before anyone else. McKinsey & Company estimates that GenAI could add the equivalent of $2.6 trillion to $4.4 trillion in productivity gains annually.
But questions remain: will humans and AI work together seamlessly, and how will people monetize their creativity in an ecosystem dominated by rapidly evolving GenAI models? The road to answering these questions will be paved with challenges, failures, and innovation. Organizations need to be prepared for future disruption, and the best way to do that is to ensure that data, an organization's greatest asset in this new ecosystem, is ready for the future.
Last year, the first large language model (LLM) tools such as ChatGPT, Microsoft Copilot, and Google Gemini emerged, leading to an explosion of GenAI experimentation. This year, these experimental models will be refined and moved into production. To scale these models successfully and make them perform effectively in production, organizations need access to diverse sets of high-quality data. Many will find that procuring this data is not easy and that much work remains. According to Wavestone’s Data & Analytics survey, only 5% of organizations have implemented Generative AI in production at scale.
While organizations may have more traction with traditional AI, Generative AI is a different animal with different data requirements. Traditional AI relies on supervised learning, where curated data sets train models to identify patterns and predict outcomes. GenAI, by contrast, leverages both structured and unstructured data and creates new data rather than simply predicting outcomes. Its learning is unsupervised, so the model learns from whatever data it can access. GenAI behaves more like a complex black box: even data scientists often cannot explain why a model makes the decisions it does. This lack of observability makes it paramount that GenAI models have access to the highest-quality data.
GenAI adoption in the enterprise focuses on fine-tuning off-the-shelf third-party models like ChatGPT. Building a unique LLM from scratch is not economically feasible for most organizations, so many instead train existing models on enterprise data. This is known as tuning the model.
While tuning adapts a model to a domain, Retrieval-Augmented Generation (RAG) is the mechanism GenAI uses to source facts from within the enterprise to support its answers. For example, if you ask a GenAI chatbot when your order will be delivered, it will use RAG to query the fulfillment system for the answer.
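The order-status example can be sketched in a few lines of Python. This is purely illustrative: the dictionary stands in for a real fulfillment system, and the function names and sample records are hypothetical.

```python
# Minimal sketch of the RAG pattern: retrieve an authoritative enterprise
# record, then ground the model's prompt in it. A plain dict stands in
# for a real fulfillment database or API.
FULFILLMENT_DB = {
    "A1001": {"status": "shipped", "eta": "2024-06-12"},
    "A1002": {"status": "processing", "eta": "2024-06-18"},
}

def retrieve_order(order_id: str) -> dict:
    """Retrieval step: fetch the record relevant to the user's query."""
    return FULFILLMENT_DB.get(order_id, {})

def build_prompt(question: str, order_id: str) -> str:
    """Augmentation step: inject the retrieved facts into the prompt."""
    record = retrieve_order(order_id)
    context = (
        f"Order {order_id}: status={record.get('status', 'unknown')}, "
        f"eta={record.get('eta', 'unknown')}"
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )

prompt = build_prompt("When will my order arrive?", "A1001")
```

The generation step (sending the prompt to the LLM) is omitted; the point is that the model answers from retrieved facts rather than from whatever it memorized in training.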
For GenAI to function effectively in the enterprise and support model tuning and RAG, data must be accessible, high quality, and secure.
Broad access to data is the first requirement of your GenAI strategy. To tune your models, they need access to the relevant training data, and for RAG to work, models must have access to operational data.
Effective model tuning requires a broad, diverse data set. If GenAI models are exposed only to narrow data sets, they tend to overfit, memorizing the training data rather than learning generalizable patterns. For models to learn to differentiate between distinct characteristics, they must be trained on varied data. These data sets should represent data from across the organization to create greater dimensionality. With greater representation, AI models will be less biased and more effective.
Using the right data sets, which may exist anywhere in your organization, is essential for tuning GenAI models. Smaller, high-quality data sets are better than large, low-quality ones. Low-quality data creates noise that confuses models and disrupts learning. Having access to all organizational data, and understanding its quality, will help you find the right training data for GenAI tuning.
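The "smaller but cleaner" principle can be illustrated with a simple curation pass over candidate training text. The length bounds and exact-duplicate check below are assumed heuristics for the sketch, not a prescribed pipeline.

```python
# Sketch: favor a smaller, cleaner training set over a large, noisy one.
# Records that are too short, too long, or duplicated are dropped.
def curate(records: list[str], min_len: int = 20, max_len: int = 2000) -> list[str]:
    seen = set()
    curated = []
    for text in records:
        normalized = " ".join(text.split()).lower()
        if not (min_len <= len(normalized) <= max_len):
            continue  # drop fragments and walls of text
        if normalized in seen:
            continue  # drop exact duplicates, which skew training
        seen.add(normalized)
        curated.append(text)
    return curated

samples = [
    "Customer asked about a refund for order A1001 after a late delivery.",
    "Customer asked about a refund for order A1001 after a late delivery.",
    "ok",  # too short to carry any signal
]
clean = curate(samples)
```

A real curation pipeline would add near-duplicate detection and domain-specific quality scores, but the shape is the same: filter before you tune.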
For GenAI to be useful in the organization, it must have access to the appropriate information in the proper context to answer user queries. Data products are a great way to support this by adding context and personalization to user queries. Integrated with GenAI, customer-focused data products can supply prompts or inputs that yield more personalized, contextual responses, while also providing the access and governance needed to ensure GenAI leverages the best data. For example, a chatbot can draw on a data product to greet a customer by name or ask about previous purchases, enhancing the experience.
The unique capability of GenAI to learn independently without supervision makes it revolutionary yet dangerous. The "black box" nature of the technology makes quality data paramount for successful GenAI implementations. Forty-two percent of data leaders cite data quality as the top data-related obstacle for the adoption of GenAI and large language models, according to Wakefield Research.
GenAI’s ability to learn from unstructured data also sets it apart from traditional AI. This data is usually the messiest and rarely cleaned or organized. To use this unstructured data in your RAG or training, preprocessing and normalization are required to help GenAI make sense of the data.
Cleaning unstructured data is different from cleaning structured data because the data is typically free-form text. Common cleaning steps include stripping markup and special characters, correcting spelling, expanding abbreviations, removing duplicates, and normalizing case, whitespace, and formats.
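A minimal sketch of such a text-cleaning pass, assuming an illustrative abbreviation map and simple regular expressions (a real pipeline would be far more thorough):

```python
import re

# Illustrative cleaning pass for unstructured text. The abbreviation
# map and regex patterns are assumptions for this sketch.
ABBREVIATIONS = {"acct": "account", "qty": "quantity"}

def clean_text(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)      # strip HTML remnants
    text = re.sub(r"[^\w\s.,]", " ", text)   # drop stray symbols
    text = " ".join(text.split()).lower()    # normalize whitespace and case
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)

cleaned = clean_text("<p>Acct #123, qty: 5!!</p>")
```

Each pass is cheap, but together they turn inconsistent free text into something a model can index and learn from reliably.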
Generative AI’s ability to process unstructured data is a game changer. However, inconsistency in the training data can lead to errors and hallucinations. To mitigate these errors, data labeling and effective metadata management are required to provide more structure.
Creating more structure around unstructured data makes it less noisy and contradictory, and humans are much better at resolving these conflicts than machines. A robust metadata strategy that spans all your databases helps create a single source of truth that AI can rely on. Mechanisms that let humans work alongside AI to label and categorize data help organizations ensure their enterprise data is ready for GenAI.
Letting GenAI loose on your sensitive and personal data requires additional control. GenAI's hunger for data drives the technology to use any data it can access, and RAG or training processes will breach privacy protocols if limits on access to personal data are not in place. Yet walling off all your data limits GenAI's effectiveness. To prepare data and systems for GenAI, enterprises need a strategy of granular access controls and data masking to teach models what is off-limits and to ensure they do not inappropriately share private data.
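Data masking can be sketched as a redaction pass applied before text reaches a RAG or training pipeline. The patterns below cover only email addresses and one phone format and are illustrative; a production system would pair a vetted PII-detection library with real access policies.

```python
import re

# Sketch: redact personal data before it reaches a GenAI pipeline.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace detected identifiers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

masked = mask_pii("Contact jane.doe@example.com or 555-123-4567.")
```

Masking at ingestion, rather than trusting the model to withhold secrets, means the private values never enter the prompt or the training set at all.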
Preparing data for innovative GenAI technology is no simple task. The power of the technology requires skilled humans to monitor it and ensure it operates correctly. When GenAI bots become the gateway between data and users, analysts, who traditionally controlled access to insights, are cut out of the process. They lose control over what data is accessed and whether it is of good quality. This shift demands new, more robust governance strategies that incorporate input and oversight from across the organization.
Teams managing these processes will require a diverse set of skills: they will need to understand how the models and underlying technology work as well as the business implications and requirements of those models.
The great thing about preparing data for GenAI is that GenAI can help with the process. AI tools can help humans tag data and automatically correct spelling or expand abbreviations. GenAI can also create synthetic data to fill gaps in data sets, fabricating data that closely mimics real-world conditions.
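Synthetic-data generation can be sketched as fitting simple statistics from real records and sampling new ones within the observed range. The field name and sample values here are illustrative; real generators model far richer structure and correlations.

```python
import random

# Sketch: generate synthetic records that stay within the bounds
# observed in real data. Field names and values are illustrative.
real_orders = [{"amount": 42.0}, {"amount": 58.5}, {"amount": 49.9}]

def synthesize(n: int, seed: int = 7) -> list[dict]:
    rng = random.Random(seed)  # seeded for reproducible sampling
    amounts = [o["amount"] for o in real_orders]
    lo, hi = min(amounts), max(amounts)
    return [{"amount": round(rng.uniform(lo, hi), 2)} for _ in range(n)]

synthetic = synthesize(5)
```

Because the samples mimic real-world ranges without copying real records, they can fill gaps in training sets without exposing actual customer data.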
Generative AI can learn from itself, but it needs to start somewhere. Where you start will have a profound impact on where you end up. Starting with the best quality data will put you in the best position for great outcomes.