Pekey's Blog

Implementation of Change Data Capture in ETL Process for Data Warehouse Using...

2018/04/18


  • Not much substance in the paper itself; a few of the works in its reference list are the valuable part
  • This article is worth consulting for its introduction to basic ETL terminology
  • The snapshot difference technique


  • Apache Spark can process large amounts of data through a relational scheme that can be tuned for maximum performance.
  • Apache Hadoop was used to provide distributed storage so that processing could run in parallel.
  • MapReduce programming was used for simple tasks such as transferring data from the database to HDFS with Apache Sqoop.


Parallel processing in the CDC method can be implemented with Spark SQL. Spark SQL is a module of Apache Spark that integrates relational processing with the Apache Spark API. It allows data to be processed with queries, much as in a database. Spark SQL can run the CDC method using common operations such as JOIN, FILTER, and OUTER JOIN, so CDC processing is easier to implement with Spark SQL than with MapReduce.
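The JOIN/FILTER/OUTER-JOIN style of CDC query can be illustrated with a small, self-contained SQL sketch. SQLite is used here purely for demonstration (the full outer join is emulated with two LEFT JOINs and a UNION); in Spark SQL the same comparison would be a single FULL OUTER JOIN over the two snapshots. The table names, columns, and data are hypothetical.

```python
import sqlite3

# Hypothetical example: two snapshots of the same source table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE old_snapshot (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE new_snapshot (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO old_snapshot VALUES (?, ?)",
                [(1, "alice"), (2, "bob"), (3, "carol")])
cur.executemany("INSERT INTO new_snapshot VALUES (?, ?)",
                [(1, "alice"), (2, "bobby"), (4, "dave")])

# Emulate a full outer join with LEFT JOIN + UNION (SQLite-compatible);
# in Spark SQL this would be one FULL OUTER JOIN with a FILTER on nulls.
changes = cur.execute("""
    SELECT n.id, o.name, n.name
    FROM new_snapshot n LEFT JOIN old_snapshot o ON n.id = o.id
    WHERE o.id IS NULL OR o.name <> n.name      -- inserts and updates
    UNION
    SELECT o.id, o.name, NULL
    FROM old_snapshot o LEFT JOIN new_snapshot n ON o.id = n.id
    WHERE n.id IS NULL                          -- deletes
""").fetchall()
print(sorted(changes))
# → [(2, 'bob', 'bobby'), (3, 'carol', None), (4, None, 'dave')]
```

A NULL on the old side marks an insert, a NULL on the new side marks a delete, and differing non-NULL values mark an update, which is exactly the null-value test the CDC program applies to the outer-join result.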



The whole ETL process is performed using Pentaho Data Integration (PDI). Meanwhile, the proposed ETL process extracts new and changed data using the MapReduce framework and Apache Sqoop. The transformation and loading steps are performed in PDI. This section describes the big data cluster environment and the implementation of CDC.



The CDC method can be implemented with MapReduce by adopting the divide-and-conquer principle, similar to the approach in the study by [9]. The data are divided into several parts, each processed separately. Each processed part then enters the reduce phase, which detects the changes in the data.
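A minimal pure-Python sketch of this map/reduce split (no Hadoop required, all names and data hypothetical): the map phase tags each record with the snapshot it came from, the grouping step stands in for the framework's shuffle, and the reduce phase compares the versions of each key to detect changes.

```python
from itertools import groupby

def map_phase(tagged_records):
    # Map: emit (key, (tag, row)) pairs; tag marks which snapshot a row came from.
    return [(key, (tag, row)) for tag, key, row in tagged_records]

def reduce_phase(pairs):
    # Shuffle stand-in: group pairs by key, as MapReduce would between phases.
    pairs.sort(key=lambda kv: kv[0])
    changes = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        versions = dict(v for _, v in group)   # e.g. {"old": row, "new": row}
        old, new = versions.get("old"), versions.get("new")
        if old is None:
            changes.append(("insert", key, new))
        elif new is None:
            changes.append(("delete", key, old))
        elif old != new:
            changes.append(("update", key, new))
    return changes

# Hypothetical input: (snapshot tag, key, payload) triples from two extractions.
records = [("old", 1, "alice"), ("old", 2, "bob"),
           ("new", 1, "alice"), ("new", 2, "bobby"), ("new", 3, "carol")]
print(reduce_phase(map_phase(records)))
# → [('update', 2, 'bobby'), ('insert', 3, 'carol')]
```

Each partition of the input can be mapped independently, which is where the divide-and-conquer parallelism comes from; only the per-key comparison has to see both snapshots' versions of a record.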



Application of Spark SQL

In this study, the method used compared two data snapshots and found the differences between them. The CDC method using snapshot difference was divided into several stages, as illustrated in Figure 8. The first stage took the data from the source by full load, extracting the entire data set from its initial to its final condition. The extraction results were loaded into HDFS. The data were then processed by a program that ran the CDC process using the snapshot difference technique. Snapshot difference was implemented with the outer-join function of Spark SQL. The program inspected the null values in the outer-join result of the two records, which served as the indicator of the newest data. If damage to the data source was detected during the process, the program could automatically switch the comparison reference and store the old data as a snapshot.
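The staged pipeline above can be sketched in plain Python, assuming each snapshot is a set of keyed records: the new full-load extraction is compared against the last stored snapshot through a dictionary-based full outer join (a missing side plays the role of the null value in the outer-join result), and a hypothetical validity check models the fallback to an older snapshot when the newest one is damaged.

```python
def snapshot_difference(old_snapshot, new_snapshot):
    # Dictionary-based full outer join on the key: a missing (null) side
    # marks a new or deleted row; differing payloads mark a changed row.
    delta = {}
    for key in sorted(old_snapshot.keys() | new_snapshot.keys()):
        old, new = old_snapshot.get(key), new_snapshot.get(key)
        if old is None:
            delta[key] = ("new", new)
        elif new is None:
            delta[key] = ("deleted", old)
        elif old != new:
            delta[key] = ("changed", new)
    return delta

def run_cdc(snapshots, new_extract, is_valid=lambda s: bool(s)):
    # Hypothetical fallback: if the latest stored snapshot fails validation,
    # use the previous one as the comparison reference instead.
    reference = snapshots[-1] if is_valid(snapshots[-1]) else snapshots[-2]
    delta = snapshot_difference(reference, new_extract)
    snapshots.append(new_extract)   # keep the new full load as the next snapshot
    return delta

snapshots = [{1: "alice", 2: "bob"}]
delta = run_cdc(snapshots, {1: "alice", 2: "bobby", 3: "carol"})
print(delta)
# → {2: ('changed', 'bobby'), 3: ('new', 'carol')}
```

The `is_valid` check is a placeholder for whatever corruption detection the real program applied; the point is only that the reference snapshot is chosen before the difference is computed, so a damaged extraction never becomes the baseline.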


  1. Introduction
  2. CDC in Distributed Systems
  3. Design and Implementation
    1. Application of MapReduce
    2. Application of Spark SQL