Reading time: 7 min
Introduction: Data is of paramount importance for numerous reasons – from analyzing client behaviors and patterns, to identifying options to optimize operations and making strategic decisions about the business and service offerings. Most businesses, especially those who weren’t born-digital, still have an enterprise applications stack that contains legacy systems working in tandem with newer, modern off-the-shelf solutions.There’s another group of businesses that are undergoing transformation, either through legacy application modernization or completely migrating to newer and more powerful solutions. These and other similar scenarios create a growing need for effective, efficient and accurate data migration between similar or different databases. This need is further complicated by the growing volume, type, and patterns of data.
Migrating data from one source to another has always been challenging. Recalling the days from the ’50s when migrating data meant migrating the actual physical infrastructure from one location to another. This involved a huge effort in terms of cost and resources.
Here’s how hard it was to Move a 5MB IBM Hard Drive in 1956.
- Creating a database for report generation, which should be isolated from the transactional database.
- Consolidating data in a warehouse for data analysis and BI reports.
- Archiving historical data that is not accessed frequently.
Irrespective of the why, migrating complete database(s) to other destinations periodically will be, in most cases, an uphill task.
Today, an organization can have a large variety of data repositories as part of their legacy applications, custom-built solution or off-the-shelf applications. Adding to the complexity, different departments functioning within an organization can also have distributed databases/multi-database systems. In such situations, opting for a full load migration requires more time, bandwidth and also increases possibility of losing mission-critical data. The best solution in these instances is incremental data migration which ensures efficiency, accuracy, speed, data retention, and more.
What is Incremental Data Load?
It is an activity of loading only new or modified records from the source to the destination database. To execute this type of data load, the destination needs to have all data from the source database before the incremental data load process initiates.
Why do we need Incremental Data Load?
The method is more efficient as compared to full data load when working with a huge volume of data. Example: We want an existing BI report to be populated with near real-time data. The data source of the BI report is not a transactional database, so loading full data every time in BI data source will not be possible.
Incremental data load can be performed in one of 3 ways:
- Insert only - Destination will be appended with incoming records even if they are pre-existing.
- Upsert records – New records will be inserted and modified records will be updated in the Destination.
- Upsert & Delete records - Deleting records is a very rare scenario, but a valid need. Here, new records will be inserted, modified records will be updated and missing data will be deleted in the Destination.
Incremental Data Migration for MS SQL Databases.
Let’s look at how incremental data migration can be carried out for MS SQL databases through incremental data load. Microsoft SQL Server provides following options to execute an incremental data load process:
- Timestamp/Rowversion datatypes columns
- Created/Modified Date & Time columns
- Change Data Capture (CDC)
Option 1: Timestamp/Rowversion datatype columns
Microsoft SQL Server has a rowversion datatype (timestamp is the synonym for the rowversion data type and is subject to the behavior of data type synonyms) that generates unique binary numbers within the database. These unique binary numbers are stamped as versions of the table rows having a rowversion column.
So, if a new row is inserted into the table, a new unique binary number is generated and assigned to that row. In case an existing row is modified, a new unique binary number is generated and assigned to the modified row. The MSDN document provides more details on this topic.
This feature of SQL Server can be leveraged to fetch the incremental data to be loaded to the destination. We will have to preserve value of the last rowversion which is loaded in the destination, which will then be used in the next iteration to get the incremental data. The flow diagram below depicts the working of rowversion and incremental data load.
- Row versioning is automatically taken care of by SQL Server engine once we add a column in a table with data type rowversion/timestamp.
- Binary number stored against a column can be casted to a whole number which can ease the filtration process.
- Rowversion column needs to be added in every table to be migrated.
- Last rowversion value needs to be preserved for every migrated table.
Option 2: Created / Modified Date & Time columns
It is a very common practice to have CreatedOn & ModifiedOn Date & Time column while designing a table in a database. So, whenever a new row is inserted, CreatedOn column is populated with insertion date & time value. Similarly, ModifiedOn column gets populated with date & time value when row(s) are modified.
Ideally, these columns are used for auditing purpose but can also be leveraged to fetch incremental data to be loaded into the destination. The last load date & time needs to be preserved so that the next data load is identified beyond the last date & time. The below flow diagram depicts incremental data load using this method.
- We only need to preserve last load date & time, not table-wise.
- Date & Time value ease rows traversing process if any data missed
- In each insert or update operation, the values in CreatedOn & ModifiedOn columns have to be updated along with the other set of columns of respective table.
Option 3: Change Data Capture (CDC)
Change Data Capture (CDC), a feature added in SQL Server 2008, is widely used for audit logging of the data. It can also be used for incremental data load in the destination from the source database. Unlike the above options, CDC does not work on table columns. Instead, it is a database feature that needs to be enabled to capture changes in any table of that database.
First, we need to enable CDC on the database, then we can enable it on the table for which we need to capture the changes. A new table is created in cdc schema named cdc.dbo_<originaltablename>_CT where all the changes are logged. CDC comes with pre-built functions to retrieve changed records (fn_cdc_get_all_changes_<originaltablename>, fn_cdc_get_net_changes_<originaltablename>) and an agent job is created to clear CDC tables. This job can be scheduled to periodically clear CDC tables.
The __$operation column in CDC tables can be used to trace changed records, where value 1 = delete, 2 = insert and 3 = update (old row) 4 = update (new row). This flag can then be used to retrieve records and load them in the destination. Once changes are loaded in the destination, the cleanup agent job can be executed to clean up the CDC table.
- Configuration is easy as it is database level settings.
- With pre-built function and stored procedures, it is easy to maintain changes and retrieve the data.
- In case of a heavy transactional database, CDC will shoot up memory utilization by the database due to redundant record logging.
Option 4: Replication
Replication is a technology to copy and distribute data and database objects from one database (source) to another. Source database (Publisher) publishes modified data to the distributor whose responsibility it is to distribute the data to the destination database (Subscriber). Three types of replications are supported by SQL server:
- Snapshot Replication - Snapshots of Publisher are scheduled, which then move the data to another server.
- Transactional Replication - Whole data is copied from Publisher to Subscriber initially when replication is setup. Then, every time Publisher is modified, the Subscriber is also updated with required changes.
- Merge Replication - Data from multiple publishers are consolidated in a single database and then changes are distributed to Subscribers. This type is preferable when subscribers in a remote location will have intermittent connectivity.
- No changes need to be implemented in the database schema.
- Quick access to the data on destination (subscriber).
- Unlike the 3 previous options, the destination for replication can only be a database.
- Overhead to maintain Distributor.
Incremental data load is a highly effective and efficient solution in cases where incremental data needs to be synced with either local or remote databases. Microsoft SQL Server allows for designing an incremental data load in various ways as discussed in this blog. Based on the source database design & business requirements, one of the four optimal methods for incremental data load can be chosen. If your enterprise has systems that exchange data, contact us to discuss the best ways to implement incremental data load.