Data Integration and Sync Strategies in Data Products


Over the years, a myriad of approaches to moving and merging data in support of business processes and decision-making have emerged, each with its own capabilities and benefits. Understanding these capabilities and how they relate to the desired outcomes of your data strategy is essential for designing an optimal system.

Multiple variables and requirements must be considered when building a strategy to keep all your data optimized, accessible, and integrated, including:

  • How quickly do you need fresh data?
  • What are your bandwidth and storage constraints?
  • What type of technical resources are available?
  • How many copies of the same data do you need to maintain?
  • How large are your data sets?

Batch Processing

Batch processing transfers data at regular intervals or in batches. These patterns include Extract Transform Load (ETL), Extract Load Transform (ELT), Reverse ETL, Mirroring or Replication, and Write-back tools.

ETL & ELT

ETL is the traditional approach to moving data. Data is extracted from a data source, transformed based on business rules, and loaded into a target database. This approach can be very secure but also inflexible: it requires data sets to be well structured and documented, which in turn makes complex transformations easier to manage. Because legacy system databases are typically highly structured and rigid, ETL is a good fit for them. To build these processes, developers must have deep knowledge of the data structures and programming languages involved.

ELT is similar to ETL, but the transformation occurs after the data is loaded into the target database. This more contemporary approach enables more flexibility because the raw data remains in the target database after it is transformed. Consequently, transformations can be run iteratively or retroactively without pulling data from the original source, reducing the resources required to extract a fresh data set. ELT is better suited to larger, less structured data sets, where data wrangling and transformations can be run closer to the end user.
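
To make the contrast concrete, here is a minimal sketch of the two patterns using SQLite purely for illustration. The table and column names (orders, amount_cents) are hypothetical and not taken from any specific system described above.

```python
# Minimal ETL vs. ELT sketch using SQLite; all names are hypothetical.
import sqlite3

source = sqlite3.connect("source.db")
target = sqlite3.connect("warehouse.db")

# Stand-in source data so the sketch is self-contained.
source.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount_cents INTEGER)")
source.execute("INSERT INTO orders VALUES (1, 1999), (2, 4500)")
source.commit()

# ETL: transform in the pipeline, then load only the finished result.
rows = source.execute("SELECT id, amount_cents FROM orders").fetchall()
transformed = [(oid, cents / 100.0) for oid, cents in rows]  # business rule applied in flight
target.execute("CREATE TABLE IF NOT EXISTS orders_clean (id INTEGER, amount_usd REAL)")
target.executemany("INSERT INTO orders_clean VALUES (?, ?)", transformed)

# ELT: load the raw data first, then transform inside the target, where the
# transformation can be re-run later without re-extracting from the source.
target.execute("CREATE TABLE IF NOT EXISTS orders_raw (id INTEGER, amount_cents INTEGER)")
target.executemany(
    "INSERT INTO orders_raw VALUES (?, ?)",
    source.execute("SELECT id, amount_cents FROM orders").fetchall(),
)
target.execute("""
    CREATE TABLE IF NOT EXISTS orders_usd AS
    SELECT id, amount_cents / 100.0 AS amount_usd FROM orders_raw
""")
target.commit()
```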

Reverse ETL

Reverse ETL is another batch process, but data flows in the opposite direction of an ETL pipeline: data is extracted from a central database or warehouse and loaded into operational third-party applications. This approach enables an organization to consolidate one version of a data set in a central warehouse while operational applications use the same data that other applications or analyses rely on. Reverse ETL supports a "single source of truth" for the organization.

The challenge is that Reverse ETL operates in batches. In dynamic organizations where data is constantly changing, distinct groups may end up using different versions of the same data because of differing update schedules. Reverse ETL synchronization can also overwhelm operational systems when large amounts of data are loaded into them from multiple sources, leading to conflicts and multiple versions of the same data.
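
As a rough sketch of the idea, the snippet below reads rows from a central warehouse and pushes them into an operational application over its REST API. The endpoint, auth token, and table name are hypothetical placeholders, not a real product's API.

```python
# Hedged reverse-ETL sketch: warehouse rows pushed to an operational app.
import sqlite3

import requests

warehouse = sqlite3.connect("warehouse.db")
rows = warehouse.execute(
    "SELECT customer_id, lifetime_value FROM customer_metrics"  # hypothetical table
).fetchall()

for customer_id, lifetime_value in rows:
    # One record per request keeps the example simple; a real pipeline would
    # batch updates to avoid overwhelming the operational system.
    requests.patch(
        f"https://crm.example.com/api/customers/{customer_id}",  # placeholder endpoint
        json={"lifetime_value": lifetime_value},
        headers={"Authorization": "Bearer <token>"},
        timeout=10,
    )
```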

Mirroring and Replication

Mirroring and replication save data to a separate database without any transformation and are often implemented to back up data sets in case of data loss. They can also be valuable in optimizing data management and creating a persistent data set when data is in motion. Mirroring replicates not only the data but also the entire database structure and management system.

Replication is slightly different from mirroring because it copies only the data, not the database management system. Accessing data from various systems is much easier with replication because the data is not bound to a particular database management system.

Mirroring is a better option for load balancing. The source and copied databases are identical, so data can be accessed from either source easily. Disaster recovery and optimizing resource usage are common use cases for mirroring. It is also suitable for capturing a snapshot of data in motion. For example, virtualized data that is used regularly can be mirrored to a persistent data store.
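
The following sketch contrasts the two, using SQLite as a stand-in: mirroring copies the whole database (schema and data), while replication copies only rows into whatever structure the target already defines. Table names are hypothetical.

```python
# Mirroring vs. replication, illustrated with SQLite.
import sqlite3

source = sqlite3.connect("source.db")
source.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, email TEXT)")
source.execute("INSERT OR REPLACE INTO customers VALUES (1, 'a@example.com')")
source.commit()

# Mirroring: a full copy of the database, structure and all.
mirror = sqlite3.connect("mirror.db")
source.backup(mirror)

# Replication: copy only the data; the target owns its own schema and engine.
replica = sqlite3.connect("replica.db")
replica.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, email TEXT)")
replica.executemany(
    "INSERT OR REPLACE INTO customers VALUES (?, ?)",
    source.execute("SELECT id, email FROM customers").fetchall(),
)
replica.commit()
```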

Real-time

Integrating and managing data in real time creates many complexities, particularly when merging diverse data sets. Because different data sets are constantly changing, discrepancies between the source and the merged data will always exist. Real-time integration is most applicable to operational systems that track sensor or financial data.

Change Data Capture (CDC)

CDC is an event-based pattern where changes in one database are automatically reflected in another when defined events occur. This pattern supports two-way data flows so source and target databases can exchange data to stay synchronized. CDC is a much more efficient way to move data than ETL, as only the data that changes is transferred to the target database, not the entire data set.

CDC also works well with databases or applications that cannot be integrated directly. CDC processes can stage data in a separate database, and the changes can then be shared from the staged data to target databases. This method requires maintaining a separate, dynamic data source, which adds complexity and increases the potential for errors.
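
A minimal sketch of the consuming side of CDC is shown below: change events are applied to a target table one by one, so only changed rows move. The event format (op, id, email) is an assumption for illustration; real CDC tools emit richer change records.

```python
# Applying change-data-capture events to a target table (illustrative only).
import sqlite3

target = sqlite3.connect("replica.db")
target.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, email TEXT)")

change_events = [  # would normally arrive from a change log or stream
    {"op": "insert", "id": 1, "email": "a@example.com"},
    {"op": "update", "id": 1, "email": "a+new@example.com"},
    {"op": "delete", "id": 1},
]

for event in change_events:
    if event["op"] in ("insert", "update"):
        # Upsert: only the changed row is written, never the whole data set.
        target.execute(
            "INSERT INTO customers (id, email) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
            (event["id"], event["email"]),
        )
    elif event["op"] == "delete":
        target.execute("DELETE FROM customers WHERE id = ?", (event["id"],))
target.commit()
```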

Streaming Data

Streaming data involves constant ingestion, processing, and movement of data to its destination. Typically, this means sensor data or financial services data. In many cases, streaming data is transformed en route to its target database.

While streaming data is fast, there are a few drawbacks. Streaming data is constantly moving and taking different paths across the internet, so there is no guarantee that it arrives in the correct order. Some records may arrive for processing after more recent ones, creating confusion about which values are the most current. Proper orchestration needs to be implemented to ensure that data is processed in the appropriate order.
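
One common way to handle out-of-order arrival is to buffer events briefly and release them in event-time order once an allowed-lateness window (a simple watermark) has passed. The sketch below shows the idea; field names and the lateness value are illustrative assumptions.

```python
# Reordering a stream by event time with a small buffer and a lateness window.
import heapq

ALLOWED_LATENESS = 5  # seconds an event may lag behind the newest one seen

def reorder(events):
    """Yield events in event-time order, holding back a small buffer."""
    buffer, newest_seen = [], 0
    for event in events:
        newest_seen = max(newest_seen, event["event_time"])
        heapq.heappush(buffer, (event["event_time"], event["value"]))
        # Release anything old enough that an earlier event is unlikely to still arrive.
        while buffer and buffer[0][0] <= newest_seen - ALLOWED_LATENESS:
            yield heapq.heappop(buffer)

arrivals = [  # arrival order differs from event time
    {"event_time": 1, "value": "a"},
    {"event_time": 3, "value": "c"},
    {"event_time": 2, "value": "b"},   # late arrival
    {"event_time": 10, "value": "d"},
]
print(list(reorder(arrivals)))  # [(1, 'a'), (2, 'b'), (3, 'c')]; 'd' is still buffered
```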

Streaming data also exposes the data model of the data source to downstream users. When downstream applications are connected directly to the source data model, changes to this model cause havoc downstream.

If streaming data is packaged within a data product, internal source data models can be mapped to a data product's data model. This structure makes it easier for data to be shared with external groups.
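
A toy illustration of that decoupling: a mapping layer translates internal field names into the data product's published schema, so a change in the source model only requires updating the mapping. All names here are hypothetical.

```python
# Mapping an internal source record to a data product's stable schema.
FIELD_MAP = {"cust_no": "customer_id", "amt": "amount_usd", "ts": "event_time"}

def to_product_model(source_record: dict) -> dict:
    """Translate an internal record into the data product's published model."""
    return {public: source_record[internal] for internal, public in FIELD_MAP.items()}

print(to_product_model({"cust_no": 42, "amt": 19.99, "ts": "2024-01-01T00:00:00Z"}))
```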

Write-back Capabilities

Write-back capabilities in front-end BI tools are another method for making changes to a source database. This emerging pattern allows analysts working in a BI tool to push changes to the source database directly from that tool, where they are immediately reflected in both the data source and the analysts' work. This allows the analyst who best understands the data to adjust or correct the original data source.

Write-back capabilities also reduce data analysts’ reliance on Excel spreadsheets, as they now have the power to update a database as quickly as they can with Excel.
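
In rough outline, write-back means an edit made in the front end lands in the source table rather than in an offline spreadsheet. The sketch below assumes a hypothetical sales table and a small allowlist of editable columns.

```python
# Hedged sketch of the write-back idea; table and column names are placeholders.
import sqlite3

def write_back(conn, record_id, column, new_value):
    """Apply an analyst's correction straight to the source database."""
    allowed = {"region", "segment", "amount_usd"}  # guard against arbitrary columns
    if column not in allowed:
        raise ValueError(f"write-back not permitted for column {column!r}")
    conn.execute(f"UPDATE sales SET {column} = ? WHERE id = ?", (new_value, record_id))
    conn.commit()

conn = sqlite3.connect("source.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, region TEXT, amount_usd REAL)")
conn.execute("INSERT OR REPLACE INTO sales VALUES (1, 'EMEA', 120.0)")
write_back(conn, record_id=1, column="region", new_value="APAC")
```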

Application Integration

SaaS applications usually share data via REST APIs, which provide a straightforward way to pull data from an application's database. On their own, APIs cannot transform data, but if the exchange runs through an iPaaS, automated transformations can be applied to the data before it is sent to the target application.
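
The sketch below shows the shape of such an exchange: records are pulled from one app's REST API, lightly transformed in the middle (the role an iPaaS would play), and posted to a second app. Both endpoints and field names are hypothetical.

```python
# App-to-app exchange over REST with a transformation step in the middle.
import requests

resp = requests.get("https://app-a.example.com/api/v1/contacts", timeout=10)  # placeholder
resp.raise_for_status()

for contact in resp.json():
    payload = {  # light transformation before delivery to the target app
        "full_name": f"{contact['first_name']} {contact['last_name']}".strip(),
        "email": contact["email"].lower(),
    }
    requests.post("https://app-b.example.com/api/v1/customers", json=payload, timeout=10)
```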

REST APIs work well for simple exchanges between one or two applications, but this approach does not scale well, and if an app changes its API, downstream applications can break.

Data Virtualization

Data virtualization is the practice of running code that creates a new, virtualized dataset from data pulled out of connected databases. This dataset is created each time the code is run, and the data is kept only as long as it is needed. This is the pattern the Avrio platform uses to generate data sets for analysis. There are several benefits to using this technology for data integration:

First, data virtualization separates the underlying database and structure from the data itself. This makes data virtualization much more scalable and flexible.

Second, data virtualization does not create a persistent copy of the database. Since a new data set is created each time the code is run, the freshest data is pulled from the source database. This approach avoids conflicts between multiple versions of the same data, and storage costs can also be minimized.

Third, sophisticated connectors, federated query engines, and virtualization can query, transform, and merge data from multiple databases at once. With a consolidated metadata store and a unified data model layered on top, this approach allows data analysts to pull data from various data stores as if they were a single database.

Finally, data virtualization enables more control of your data. Since data is moving through this virtualization layer, data quality checks can be run, and granular access controls can be implemented.
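
As a rough stand-in for these ideas, the sketch below attaches two separate SQLite databases and queries them as if they were one, keeping the merged result only in memory. A real virtualization layer does this across heterogeneous stores with a federated query engine; all database and table names here are hypothetical.

```python
# Federated-style query across two databases, with no persistent copy created.
import sqlite3

# Stand-in source databases so the sketch runs end to end.
crm = sqlite3.connect("crm.db")
crm.execute("CREATE TABLE IF NOT EXISTS customers (customer_id INTEGER, segment TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'enterprise')")
crm.commit()
crm.close()

sales = sqlite3.connect("sales.db")
sales.execute("CREATE TABLE IF NOT EXISTS orders (customer_id INTEGER, amount_usd REAL)")
sales.execute("INSERT INTO orders VALUES (1, 250.0)")
sales.commit()
sales.close()

# "Virtualized" view: attach both sources and query them as one; the merged
# result lives only in this in-memory session.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE 'crm.db' AS crm")
conn.execute("ATTACH DATABASE 'sales.db' AS sales")
virtual_dataset = conn.execute("""
    SELECT c.customer_id, c.segment, SUM(o.amount_usd) AS revenue
    FROM crm.customers AS c
    JOIN sales.orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
""").fetchall()
print(virtual_dataset)  # rerunning always pulls the freshest data from the sources
```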

Avrio and Data Integration Patterns

Each integration and synchronization approach has its benefits and drawbacks. Regardless of your approach, attention to data governance, security, and quality is paramount for a healthy data architecture. Using suitable approaches together in a consolidated platform can result in a powerful and flexible solution.

Core to Avrio’s capability is data virtualization, which is used to integrate data for analysis, creating a new dataset each time the code is executed. When persistent data is required but the data changes regularly, Avrio also features CDC capabilities that mirror only the changes in the source database to the mirrored environment.

Avrio also features mirroring capabilities to create persistent data sets from virtualized data sets. When data does not change often but is used regularly, mirroring can reduce the usage of bandwidth resources.

The Avrio platform enables third-party access to data products through an API. This front-end layer of Avrio also features write-back capabilities to incorporate changes made in a BI tool with the back-end database connected to the Avrio platform.

Combining multiple integration and synchronization patterns to achieve a business outcome makes strategic sense. Each situation is different and requires unique capabilities. The Avrio platform combines the right technologies and patterns to make data accessible across multiple data silos with self-serve data products that have built-in governance and security—designed for the AI age, rapid analysis, and better decision-making.

Avrio also fits well with your existing integration tools and infrastructure. If you are streaming data to a data lake, Avrio can create a data product to capture a snapshot of streaming data for analysis. If you have built ETL pipelines that are well established and do not change much, Avrio can incorporate this data into a data product that can make it more accessible and integrate data sets from other systems.
