The race is on for every organization to become more data-driven. Why? Because companies that use data to inform decision-making perform better. But today’s data management technologies still have a long way to go to break down data silos and make data accessible to all. An emerging ecosystem of technologies built on data virtualization can improve data access and usability.
For an analyst to get access to the data they need, they typically must approach a busy data engineer who has the SQL, Python, or Java skills to build a database query and pull a data set. The engineer must also be familiar with the relevant metadata and data model to know what data to query. When different departments use their own data models, that adds further complexity to manage. If the data needs to be transformed and merged with another table, more technical skills are required to build a pipeline. Once ETL pipelines are built, maintaining them is also a challenge, as they tend to be very rigid. When changes are needed, they must be re-engineered and tested, which not only makes changes hard but also makes it difficult to use a single pipeline for more than one purpose.
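To make that rigidity concrete, here is a minimal sketch of the kind of hard-coded ETL job described above, written in Python with pandas. The paths, table, and column names are hypothetical; the point is that every assumption about the source schema and the business logic is baked into this one script, so any change means re-engineering and retesting it.

```python
# A minimal sketch of a traditional, hard-coded ETL job. All paths, tables,
# and columns are hypothetical; sqlite3 stands in for any source or warehouse.
import sqlite3

import pandas as pd


def run_orders_pipeline(source_path: str, warehouse_path: str) -> None:
    # Extract: the query bakes in knowledge of the source's data model.
    with sqlite3.connect(source_path) as source:
        orders = pd.read_sql_query(
            "SELECT order_id, customer_id, amount, created_at FROM orders",
            source,
        )

    # Transform: the business logic is welded to this one pipeline.
    orders["created_at"] = pd.to_datetime(orders["created_at"])
    daily = orders.groupby(orders["created_at"].dt.date)["amount"].sum().reset_index()
    daily.columns = ["order_date", "total_amount"]

    # Load: a physical copy lands in the warehouse and must be kept in sync.
    with sqlite3.connect(warehouse_path) as warehouse:
        daily.to_sql("daily_order_totals", warehouse, if_exists="replace", index=False)
```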
With the demand for data growing so quickly, this model will not withstand the building pressure. Organizations can't keep adding engineers to their data engineering teams indefinitely, both because doing so is cost prohibitive and because there aren't enough of them in the market.
The net result of these challenges is that organizations are slow to make business decisions, placing them at a competitive disadvantage.
Data virtualization services provide the foundation for a new approach to data access. A data virtualization tool provides middleware that creates a virtual representation of data to make it available for analysis. Unlike ETL-based approaches that move data to where it is analyzed, virtualized data stays in place. Data does not have to be moved from its source system to a data lake and then to another system for analysis, as is common practice. While the actual data stays in place, the metadata is separated out and consolidated in a central repository.
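As an illustration of what "the metadata is separated and consolidated" can look like in practice, here is a minimal sketch of a central metadata registry in Python; the structure, field names, and connection strings are hypothetical and not tied to any particular product.

```python
# A minimal sketch of a consolidated metadata registry: each entry records
# where a data set physically lives and how the virtual layer exposes it.
# All names and connection details are hypothetical.
from dataclasses import dataclass


@dataclass
class VirtualTable:
    name: str                # name analysts see in the virtual layer
    source_system: str       # system where the data actually stays
    connection: str          # how the virtualization middleware reaches it
    columns: dict[str, str]  # column name -> data type, harvested from the source


CATALOG = [
    VirtualTable(
        name="sales.orders",
        source_system="on_prem_postgres",
        connection="postgresql://orders-db.internal:5432/sales",
        columns={"order_id": "bigint", "campaign_id": "bigint", "amount": "numeric"},
    ),
    VirtualTable(
        name="marketing.campaigns",
        source_system="cloud_warehouse",
        connection="snowflake://acme.example.com/marketing",
        columns={"campaign_id": "bigint", "channel": "varchar", "spend": "numeric"},
    ),
]
```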
With a data virtualization strategy, separating the logic from the underlying data makes changes to data queries much easier. When metadata is embedded in the data source and ETL pipelines need to change, engineers must understand not only the data model but also how the connections are set up and whether there are dependencies to consider. Adding data sources is much more straightforward when data is virtualized: referencing the metadata and tweaking the query gets the job done. With greater flexibility, data products or data assets can evolve iteratively to generate much more value for data consumers.
When we decouple metadata from the data it describes and centralize it, numerous new capabilities are enabled. Data federation is one of them. This is when metadata from multiple sources is organized to make data accessible through a uniform data model. By consolidating metadata, a universal data model makes it much easier to understand the underlying data distributed across disparate databases, making the process of accessing it much simpler.
A consolidated metadata layer also allows analysts to create a single query to pull data from multiple databases simultaneously, no matter where the data is stored, whether in the cloud or on-prem. The ability to access multiple databases and aggregate and transform data in real-time opens up a whole new world of capabilities.
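As a sketch of what such a federated query can look like, the example below uses Trino, one open-source engine built for this pattern, through its Python DB-API client; the host, catalog names, and tables are hypothetical, and other virtualization products expose the same idea through their own interfaces.

```python
# A single federated query spanning an on-prem database and a cloud warehouse.
# Trino is used only to illustrate the pattern; the host, catalogs, and tables
# are hypothetical.
from trino.dbapi import connect

conn = connect(host="query-gateway.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One statement joins data from two systems; the engine pushes work down to
# each source, and no data is copied into a warehouse ahead of time.
cur.execute("""
    SELECT c.channel,
           SUM(o.amount) AS revenue
    FROM postgres_onprem.sales.orders AS o
    JOIN cloud_warehouse.marketing.campaigns AS c
      ON o.campaign_id = c.campaign_id
    GROUP BY c.channel
""")

for channel, revenue in cur.fetchall():
    print(channel, revenue)
```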
With a unified data model available via data federation, a universal semantics layer can be built on top to make data more self-serve. When you adopt a single data model that represents multiple data stores and list your data assets in a single data catalog, it is much easier to explore data to pinpoint the facts you need. This enables greater innovation because, without the visibility that universal semantics provides, analysts cannot easily browse, experiment with, or discover new data. For greater usability, a universal semantics layer might include additional resources, such as business glossaries that standardize business terminology and metrics. This makes data even more accessible to business users, who can find data assets with little understanding of how the data is organized or where it is stored.
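As an illustration, here is a minimal sketch of what one entry in such a semantic layer might hold: a metric defined once in business terms and mapped onto the underlying virtual model. The shape of the definition is hypothetical rather than any specific product's format.

```python
# A minimal sketch of a semantic-layer entry: a business metric defined once
# and mapped onto the virtual data model. The structure is hypothetical.
from dataclasses import dataclass


@dataclass
class MetricDefinition:
    name: str         # the term a business user searches for in the catalog
    description: str  # plain-language definition from the business glossary
    expression: str   # how the metric is computed over the virtual model
    grain: str        # the level at which the metric is reported


REVENUE_BY_CHANNEL = MetricDefinition(
    name="Revenue by Channel",
    description="Total order amount attributed to each marketing channel.",
    expression="SUM(sales.orders.amount) grouped by marketing.campaigns.channel",
    grain="channel",
)
```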
Data governance is defined as everything you do to ensure data is secure, private, accurate, available, and usable. Emerging modern data technologies improve data governance along all these objectives.
The virtualized layer enables a single gateway that enforces centralized data governance and security.
By keeping data in place, where it can be better controlled, data virtualization can manage access across multiple data sources. With consolidated metadata, fine-grained access controls can be used to mask data at the column level to obscure identities.
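The sketch below shows one way column-level masking could be expressed at the virtualization gateway: the same result set comes back with identity-revealing columns obscured unless the caller holds an authorized role. The rules, roles, and columns are hypothetical and purely illustrative.

```python
# A minimal sketch of column-level masking enforced at the virtual layer.
# Roles, rules, and column names are hypothetical.
MASKING_RULES = {
    # fully qualified virtual column -> roles allowed to see the clear value
    "sales.orders.customer_email": {"data_steward"},
    "sales.orders.customer_id": {"data_steward", "finance_analyst"},
}


def mask_row(row: dict, table: str, caller_roles: set[str]) -> dict:
    """Return a copy of the row with protected columns obscured for this caller."""
    masked = {}
    for column, value in row.items():
        allowed = MASKING_RULES.get(f"{table}.{column}")
        if allowed is None or caller_roles & allowed:
            masked[column] = value           # unrestricted column or authorized caller
        else:
            masked[column] = "***MASKED***"  # identity-revealing value is obscured
    return masked


# An analyst without the data_steward role sees masked identities.
row = {"customer_email": "pat@example.com", "customer_id": 42, "amount": 99.0}
print(mask_row(row, "sales.orders", {"finance_analyst"}))
# {'customer_email': '***MASKED***', 'customer_id': 42, 'amount': 99.0}
```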
By keeping data in one place, your data can also be more accurate. There is no need to sync databases or move data, which reduces the potential for errors introduced along the way. When duplicate copies of data aren't scattered around the organization, the data in the source system becomes the single source of truth, reducing the conflicts caused by aging data sets.
Data virtualization makes data available in real-time. It also enables federated data governance, which provides business domains more autonomy to authorize access for those who need it.
The semantic layer enabled by data virtualization enables business users to access data through common definitions across business domains, making it more usable.
With the data virtualization layer functioning as a single gateway to data, it is much easier to control and monitor who has access to which data sets. With this oversight, authority can be distributed to data domains while IT still retains high-level governance. Federated data governance and universal semantics enable data mesh architectures that are domain-oriented and centered on data products.
Data fabrics are also built on data virtualization, data federation, and universal semantics layers. They differ from a data mesh in that they don’t incorporate federated data governance into the approach. In this model, IT retains responsibility for the organization's data, with data discovery enabled by knowledge graphs.
Data virtualization and the growing ecosystem of technologies around it constitute a transformative innovation because they build on the strengths of the platform that they run on – the cloud. Data lakes and ETL technologies were designed for an on-prem ecosystem, not taking into account the capabilities of the cloud. As data has moved to the cloud, new approaches should be considered that are enabled by this new environment. The always-on interconnectivity and instant scalability of the cloud are features that need to be considered when designing a modern data management strategy.
Why wait for batch processes when you can get data in real-time? Why not spin up a VM to store your data while you analyze it? Why not interconnect all your data and access it from a single place?
Adapting the old way of doing things to new platforms is a common trend in technology transformation and platform adoption. When the mobile platform emerged, enterprises modified their enterprise and web applications to run on the mobile OS. While this worked, those applications were not designed for a mobile device with limited power and bandwidth. The standard quickly became applications built natively for the operating system, designed around the constraints and opportunities of the platform. The same happened with moving applications to the cloud. The first iteration was moving whole monolithic applications into a container and calling it cloud-native. In reality, applications are only truly cloud-native if they were designed and built to run across multiple containers, leveraging the always-on interconnectivity and scalability of the cloud. Now it is data management’s turn to be cloud-native, and data virtualization is the foundational technology.
Data virtualization is a powerful technology, but it is only the foundation of a much broader modern data strategy.