You cannot manage and optimize what you cannot see. You need observability to understand how a system works and if it is operating effectively. Data products are delivering a new model for data access, and those creating data products need to track their quality and utility.
Great products require great raw materials. The quality of the data that goes into data products is critical to a successful data product strategy. Superior outcomes require tracking data quality from source to consumption and observing the data systems that manage the process.
Many organizations are adopting a data product strategy that builds reusable data products instead of creating a one-off data pipeline for each use case. Data products are easy-access data sets created once and adapted to multiple use cases.
A data product approach requires data engineers to think more proactively and treat data deliverables as products. This strategy relies on creators considering their users' needs and pain points. To inform product feature decisions, producers need information on how their data products are used. This feedback enables creators to improve their existing portfolio and build better data products in the future.
Data product observability tracks who is using different data products and how they use them. Understanding the roles of users can help producers better understand which cohorts are getting the most value from their data products and which ones are underserved. Insight into how data products are used to support models, dashboards, and analysis can also help spark innovative ideas for new data products. By understanding these trends, data product creators can be more proactive, so data is ready for users when they need it.
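As a minimal illustration of this kind of usage tracking, the sketch below aggregates hypothetical access-log events by data product and user role to show which cohorts rely on which products. The field names and sample records are invented for illustration and are not tied to any specific platform.

```python
from collections import Counter

# Hypothetical access-log records: who queried which data product, and their role.
access_events = [
    {"user": "ana", "role": "analyst", "data_product": "customer_360"},
    {"user": "bo", "role": "data_scientist", "data_product": "customer_360"},
    {"user": "ana", "role": "analyst", "data_product": "daily_sales"},
    {"user": "cy", "role": "analyst", "data_product": "daily_sales"},
]

# Count usage per (data product, role) to see which cohorts rely on which products.
usage_by_cohort = Counter(
    (event["data_product"], event["role"]) for event in access_events
)

for (product, role), count in usage_by_cohort.most_common():
    print(f"{product} used {count} time(s) by {role}s")
```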
Data product producers can also improve their products by gathering direct feedback on data products. Creating a culture of teamwork and implementing formal user feedback channels is a great tactic to increase value. Implementing forums where users and producers can interact, provide feedback, identify issues, and suggest new data products enhances the worth of the data product ecosystem.
Tracking costs through FinOps is another vital component of data product observability. Are data products using cloud resources efficiently? Could they be optimized to consume fewer resources? This type of tracking is critical to a profitable data product strategy. Identifying which data products consume the most memory is one example of cost observability.
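A rough sketch of this kind of cost observability, assuming you can export per-job resource and billing records (the field names below are invented for illustration), might aggregate spend and memory use by data product and rank the results:

```python
from collections import defaultdict

# Hypothetical per-job records exported from a billing or monitoring tool.
job_costs = [
    {"data_product": "customer_360", "memory_gb_hours": 420.0, "cost_usd": 38.50},
    {"data_product": "daily_sales", "memory_gb_hours": 95.0, "cost_usd": 7.10},
    {"data_product": "customer_360", "memory_gb_hours": 388.0, "cost_usd": 35.90},
]

totals = defaultdict(lambda: {"memory_gb_hours": 0.0, "cost_usd": 0.0})
for job in job_costs:
    totals[job["data_product"]]["memory_gb_hours"] += job["memory_gb_hours"]
    totals[job["data_product"]]["cost_usd"] += job["cost_usd"]

# Rank data products by memory consumption to spot optimization candidates.
for product, t in sorted(totals.items(), key=lambda kv: kv[1]["memory_gb_hours"], reverse=True):
    print(f"{product}: {t['memory_gb_hours']:.0f} GB-hours, ${t['cost_usd']:.2f}")
```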
Mechanisms that provide visibility into data products also must extend across business domains. Typically, producers and users may not interact regularly with managers and analysts in different business units. This separation limits the value and breadth a data product can deliver. A central forum for all to gather virtually is paramount for greater data product engagement and visibility.
Data producers are critical components of data product strategies, and their productivity should also be tracked. Knowing who is creating the most data products, and in which domains, provides greater visibility into your teams' effectiveness.
While tracking data product usage is important for success, so is ensuring data products are trustworthy. For data products to be trustworthy, analysts and users need to be able to observe their quality. Useful quality metrics include fuzzy matching, data sensibility, and referential integrity.
Fuzzy matching: This test measures the similarity of rows in a data product and tracks the probability that duplicate rows exist. It does not identify exact matches but flags similarities that require additional investigation to avoid duplication. Fuzzy matching is helpful when joining multiple data sets into a data product that may contain similar, near-duplicate records.
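A minimal sketch of a fuzzy matching check, here using Python's standard-library SequenceMatcher with a similarity threshold that is purely illustrative:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy rows from a data product; in practice these would come from a table scan.
rows = [
    "Acme Corp, 123 Main St, Springfield",
    "ACME Corporation, 123 Main Street, Springfield",
    "Globex Inc, 9 Elm Ave, Shelbyville",
]

SIMILARITY_THRESHOLD = 0.8  # illustrative cutoff for "probably duplicates"

# Compare every pair of rows and flag those that are similar but not identical.
for (i, a), (j, b) in combinations(enumerate(rows), 2):
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if score >= SIMILARITY_THRESHOLD and a != b:
        print(f"rows {i} and {j} look like possible duplicates (similarity {score:.2f})")
```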
Data sensibility: This test measures the completeness of the data in the data product. It counts the number of rows in a table and compares that count to a reference standard, determining whether it falls within the specified range. If the count is off, data may be missing, or erroneous data may have been inserted or duplicated.
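A simple illustration of this kind of completeness check, with an invented reference range standing in for the data contract:

```python
def completeness_check(row_count: int, expected_min: int, expected_max: int) -> str:
    """Compare a table's row count to a reference range from the data contract."""
    if row_count < expected_min:
        return "FAIL: row count below expected range - data may be missing"
    if row_count > expected_max:
        return "FAIL: row count above expected range - data may be duplicated or erroneous"
    return "PASS: row count within expected range"

# Example: the reference standard says the orders table should land between
# 9,500 and 10,500 rows per daily load (numbers are illustrative).
print(completeness_check(row_count=8_900, expected_min=9_500, expected_max=10_500))
```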
Referential integrity: This test checks whether a child table's foreign key matches the parent table's primary key. If keys change in the parent table, the test also ensures that the change is reflected in child tables.
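A small sketch of a referential integrity check using pandas, with illustrative table and column names:

```python
import pandas as pd

# Illustrative parent and child tables; column names are assumptions.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 2, 4]})

# Referential integrity: every customer_id in the child table must exist
# in the parent table's primary key column.
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]

if orphans.empty:
    print("PASS: all child keys resolve to a parent row")
else:
    print(f"FAIL: {len(orphans)} order(s) reference missing customers:")
    print(orphans)
```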
Lineage data also provides greater insights into the trustworthiness of data products. Users can view the source of the data in a data product and judge the quality. If data originates from reputable sources, decision-makers can be confident they are accessing quality data within the data products.
Observing and testing data products this way helps ensure that you are only bringing top-quality data products to your users. Trust scores that summarize quality metrics and user feedback are a great way for data product users to have some visibility into the quality of data products.
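One way to picture a trust score is as a weighted blend of automated check results and user feedback. The sketch below uses an arbitrary 70/30 weighting chosen for illustration, not a standard formula:

```python
def trust_score(quality_checks_passed: int, quality_checks_total: int,
                avg_user_rating: float, max_rating: float = 5.0) -> float:
    """Blend automated quality results with user feedback into a 0-100 trust score.

    The 70/30 weighting is an illustrative choice, not a standard.
    """
    quality_component = quality_checks_passed / quality_checks_total
    feedback_component = avg_user_rating / max_rating
    return round(100 * (0.7 * quality_component + 0.3 * feedback_component), 1)

# Example: 18 of 20 checks passed and users rate the product 4.2/5 on average.
print(trust_score(18, 20, 4.2))  # -> 88.2
```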
Observing the functioning of your data products is important, but visibility into the systems that produce data for your data products is also imperative. Organizations need to have strategies in place to monitor, understand, and troubleshoot data and systems that produce and store data. Organizations need to be able to observe several important factors that support data integrity. These factors include freshness, quality, volume, schema, and lineage.
Freshness represents how recently your data was updated. Stale data is low-quality data and cannot be trusted.
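A minimal freshness check might compare a table's last load time against an assumed SLA threshold:

```python
from datetime import datetime, timedelta, timezone

def is_fresh(last_updated: datetime, max_age: timedelta) -> bool:
    """Return True if the data set was updated within the allowed window."""
    return datetime.now(timezone.utc) - last_updated <= max_age

# Example: the table's metadata says it last loaded 3 hours ago, and the
# freshness SLA (an assumed threshold) allows data up to 6 hours old.
last_load = datetime.now(timezone.utc) - timedelta(hours=3)
print("fresh" if is_fresh(last_load, max_age=timedelta(hours=6)) else "stale")
```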
Quality tracks whether values are valid and correct. Quality data tests, such as the fuzzy matching, data sensibility, and referential integrity checks described above, can help you gain better observability of your data.
Volume tests count the number of rows in your data set; too few or too many rows can indicate a problem. Typical volume checks compare row counts against a reference range or a historical baseline.
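A sketch of a volume check that compares today's row count against a recent historical baseline; the three-sigma tolerance is an illustrative default:

```python
from statistics import mean, stdev

def volume_anomaly(todays_rows: int, history: list[int], tolerance: float = 3.0) -> bool:
    """Flag a row count that deviates more than `tolerance` standard deviations
    from the historical mean."""
    baseline, spread = mean(history), stdev(history)
    return abs(todays_rows - baseline) > tolerance * spread

# Example: recent daily loads landed around 10,000 rows; today only 4,000 arrived.
recent_counts = [9_800, 10_100, 10_050, 9_950, 10_200]
print(volume_anomaly(4_000, recent_counts))  # True -> investigate missing data
```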
Schema defines the organization of your data. If this organization changes, it can lead to errors. Tracking who made changes to the data schema, and when, is vital to tracking data health.
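A simple schema drift check can diff the columns and types observed today against the schema recorded when the data product was published; the schemas below are invented for illustration:

```python
# Expected schema recorded when the data product was published (illustrative).
expected_schema = {"order_id": "bigint", "customer_id": "bigint", "amount": "decimal"}

# Schema observed in the warehouse today, e.g. from an information_schema query.
observed_schema = {"order_id": "bigint", "customer_id": "varchar", "total_amount": "decimal"}

added = observed_schema.keys() - expected_schema.keys()
removed = expected_schema.keys() - observed_schema.keys()
retyped = {c for c in expected_schema.keys() & observed_schema.keys()
           if expected_schema[c] != observed_schema[c]}

if added or removed or retyped:
    print(f"schema drift detected - added: {added}, removed: {removed}, retyped: {retyped}")
```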
Lineage details how data assets are connected and how data tables are related. It also tracks the flow from data source to consumption. When there are issues, you need to be able to observe data lineage to track down root causes.
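A toy lineage graph makes the root-cause idea concrete: given an asset with a problem, walk upstream to find every source that feeds it. The asset names below are illustrative:

```python
# Minimal lineage graph: each asset maps to the assets it is built from.
lineage = {
    "exec_dashboard": ["sales_data_product"],
    "sales_data_product": ["orders_table", "customers_table"],
    "orders_table": ["orders_source_system"],
    "customers_table": ["crm_export"],
}

def upstream_sources(asset: str) -> set[str]:
    """Walk the lineage graph to find every upstream asset feeding `asset`,
    which is where root-cause investigation for a data issue usually starts."""
    found = set()
    stack = [asset]
    while stack:
        for parent in lineage.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(upstream_sources("exec_dashboard"))
```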
Observing data throughout your data stack is essential to keeping your data clean. Identifying errors promptly reduces the potential for them to cause harm. If bad data reaches decision-makers, managers lose trust in the integrity of company data. This loss of trust reduces the organization's ability to make decisions. Once trust is lost, it is hard to regain.
Good data observability solutions will not only identify errors but also help you pinpoint their source. These tools can help reduce the mean time to error resolution and identify bottlenecks so you can optimize system functionality.
Gaining end-to-end observability throughout your data stack can be a challenge. Complex data pipelines and distributed data silos make it difficult to observe data as it moves throughout your data systems. Different departments and data teams may be using various tools to observe data in their domain, making consistent observability across all these silos much more challenging. This fragmentation also makes it hard to trace the root causes of errors across different systems and pipelines.
The emergence of data federation and robust consolidated metadata management tools is helping to connect data visibility across these data silos. Data federation links each data silo to a centralized metadata management database. Metadata captures information about data sets, such as schema, freshness, and volume, which are key components of data observability. Centralizing this metadata enables observability across data silos, which is much harder in an ETL pipeline where data may make multiple stops and the original source metadata may not be loaded into target databases.
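A rough sketch of the idea, with invented collector functions and silo names, is to pull schema, volume, and collection-time metadata from each silo into one central catalog that observability queries can span:

```python
from datetime import datetime, timezone

# Hypothetical per-silo collector; in a real deployment each silo's own catalog
# or information_schema would be queried. Names and fields are illustrative.
def collect_metadata(silo_name: str, table: str, row_count: int, columns: list[str]) -> dict:
    return {
        "silo": silo_name,
        "table": table,
        "volume": row_count,
        "schema": columns,
        "collected_at": datetime.now(timezone.utc).isoformat(),  # freshness signal
    }

# A central metadata store keyed by fully qualified asset name.
central_catalog = {}
for entry in [
    collect_metadata("warehouse", "orders", 10_050, ["order_id", "customer_id", "amount"]),
    collect_metadata("lakehouse", "clickstream", 1_250_000, ["session_id", "event", "ts"]),
]:
    central_catalog[f"{entry['silo']}.{entry['table']}"] = entry

# With metadata centralized, observability queries can span silos.
for asset, meta in central_catalog.items():
    print(asset, meta["volume"], meta["collected_at"])
```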
Innovations in metadata management also incorporate automation that records metadata changes as they occur in the source data. This metadata is tracked in a central platform, which can support better reporting and error resolution.
Observability is critical to quality and valuable data products. In an age where data is driving more of our decision-making and fueling AI, tracking the health of our data and systems is vital to getting the most out of this asset.