Change Data Capture (CDC) is an important part of responsive integration and data replication solution. As the name implies, the CDC tool or process detects the changes made in one system, then sends them to the messaging or middleware platforms, which in turn deliver the change events to registered consumer apps.
When architecting infoCDC solution, the most obvious performance metric is latency, which could be measured as the time it takes between the transaction committed to the system of record, and the time this change appears in target systems and/or shared event streams. When implementing or choosing the integration and data replication solution, it is important to understand the underlying infoCDC methods and the pros and cons associated with each. In this article, we will review the infoCDC approaches and options specifically for IBM i (AS/400, iSeries) based applications.
Pollers, triggers, or listeners?
At a high level, the infoCDC solutions either periodically poll the database tables, receive a change based on trigger action, or listen for change events (journal entries). As with many things in life, there’s no size-fit solution. Each option comes with pros and considerations and should be carefully evaluated to avoid any post-go-live surprises.
Polling is almost always easier to implement and tune for DIY solutions, but results in higher latency and empty compute cycles burned when there is no change.
Triggers fire right away and could often be considered good candidates for capturing the change and pushing it to the external systems, however, they execute within each process (job) that performs the change, and could significantly slow down the batch processes that modify a large number of records within a short period of time.
Listeners are typically harder to implement, but once developed and tested this approach results in better compromise for latency and IBM i system resource utilization.
Finally, when database change originates within specific back-end programs, some teams choose to modify the application to explicitly pass the change event to the messaging infrastructure or external systems right from the application flow. This can be a good option but it would miss any changes applied directly to the database tables (for example as part of a mass update script executed outside of IBM i application layer).
Database polling, naturally, involves periodically executing the queries to pull the latest changes from the database. On the surface it’s an easy way to detect and send the changes in IBM i tables, as long as there’s a reliable last changed date/time stamp column. However, as we start digging a bit deeper into design the solution quickly becomes more involved. Most polling solutions involve tagging the successfully processed records, or at a minimum saving the last time stamp and/or a table key processed during the previous poll, so the next poll would start where the previous run left off. This could be problematic for legacy databases especially for the tables without primary keys (yes there are quite a few of these still out there!).
The processing lag is directly correlated with the polling interval. For business processes that require near real-time integrations, the polling intervals should be fairly short, but should not be too short – otherwise, we are running a risk of starting a new poller instance before the previous one completes.
The journal polling requires periodic dumping and analysis of DB2 journal entries added since the last polling run. This method may solve some of the issues of DB polls and determine the changes even if the table has no date/time columns. Conceptually it follows the DB polling paradigm with similar pros and cons, however, the complexity of extracting, parsing, and processing the journal entries makes it a less popular option. Several commercial CDC tools rely on this approach (as opposed to real-time listeners or triggers).
This is one of the easiest options to implement. DB2 triggers can be built in a variety of languages, including SQL, RPGLE, Cobol, C / C++. The program is created and tested, the trigger is registered with DB2 table, and that’s it! The database management system fires the trigger program for every changed record and provides a reference to the before and after values of the columns. The trigger program can easily implement the filtering rules (for example only send the changes in the Order Status field but ignore the changes in other columns not relevant to this replication process). The main downside of the triggers is the potential performance penalty specifically for bulk updates or other batch processes that process a considerable number of records in a brief period of time. As a result, the CDC trigger programs should be as lightweight as possible, ideally not including any i/o or heavy logic operations or offloading it to a separate “worker” process(es), which quickly leads to increased solution complexity and operational overhead. While there are some tools that help generate and manage database triggers, for the most part, the IBM i dev teams are responsible for developing and operating their own trigger programs.
The alternative option to database triggers, the journal listener-based CDC programs can be registered to receive entries immediately when the change is performed. It is more difficult to make sense of the journal entries’ details and select the required change events. There are quite a few edge cases to consider when developing a resilient and high-performant journal listener-based CDC, including transaction management/commitment control, journal receiver management, resequencing, bulk operations such as table reorganization commands etc. The main benefits include near real-time processing of DB changes, and all work performed in a separate dedicated IBM i job (process). The latter addresses one of the major concerns with the trigger-based approach (the performance impact for the job that makes the database changes). As with triggers, the processing program can implement the filtering logic, only including relevant changes and ignoring the rest. Some commercial CDC tools leverage this approach for IBM i data replication and HA solutions.
Embed CDC into application programs
When companies have full control over the application source code and can reliably identify all programs that change the data to be replicated, one of the options is to insert a CDC logic into these programs (or better yet call a common process) to send these events to external consumers. This option is good primarily because it could deal with “functional” data models instead of the raw data stored in the database. For example, if the system of record includes complex on-the-fly calculations for stock availability, the trigger or journal-based component, or worse yet the target applications, would have to implement the same calculations. The obvious downside of this approach is that it would never catch any changes applied by external programs or direct database updates outside of the programs updated with CDC logic.
Tools vs DIY
There are several commercial data replication solutions for IBM i platform that includes the CDC component for detecting and sending the changes to the target platforms. Most vendors provide all-inclusive solutions that bundles CDC, data staging and transformation, message processing, and delivery to a number of pre-defined target systems. These comprehensive solutions focus primarily on data replication use cases and include tools to work with entire DB snapshots, near real-time incremental changes, replays, and other advanced features. Not surprisingly these solutions also happen to be pretty expensive, and complex and require considerable effort to implement and operate.
The alternative solution is to roll out your own custom-made CDC component based on the polls, triggers, or journals. Companies with competent IBM i dev teams with spare capacity should have no problem implementing such a solution in a few months. For example, our IBM i services team has completed several such projects for our customers, which eventually motivated us to build our own lightweight and flexible CDC tool. Our CDC, scheduled to be released in May 2022, is an easy to implement product that focuses exclusively on infoCDC capabilities while delegating the event processing to message infrastructures such as infoConnect for Kafka, or API integration platforms such as infoConnect for MuleSoft.
The longevity and relevance of the IBM i based systems of record can be greatly improved by adding the event streaming capabilities. We reviewed the most popular IBM i change data capture options, and of course managed to slip in a sales pitch promoting our own lightweight infoCDC tool :). The choice would depend on many factors, such as business process sensitivity to a time lag, the alignment of the underlying database model with a logical view of the data and complexity of calculated fields, the available tools, budgets, skills, and development resources. Near real-time integrations may not be easy to achieve with legacy systems and it takes experienced architects and capable teams to implement, but in the end, it almost always benefits the business as real-time always beats the slow ad-hoc batch integrations and delivers superior customer experience. Contact us to find out more!