Pekey‘s Blog

Speeding ETL Processing in Data Warehouses Using High-Performance Joins for C...

2018/04/20 Share

本文主要介绍如下两种操作,来优化数据操作, 加速数据仓库处理

1.Join and 2.Aggregation – which will play an integral role during preprocessing as well in manipulating and consolidating data in a data warehouse.

1.连接和2.聚合 - 在预处理过程中以及在数据仓库中操纵和合并数据时,它们将扮演不可或缺的角色。


ETL systems move data from OLTP systems to a data warehouse, but they can also be used to move data from one data warehouse to another. A heterogeneous architecture for an ETL system is one that extracts data from multiple sources. The complexity of this architecture arises from the fact that data from more than one source must be merged, rather than from the fact that data may be formatted differently in the different sources.

ETL系统将数据从OLTP系统移动到数据仓库,但它们也可用于将数据从一个数据仓库移动到另一个数据仓库。 ETL系统的异构体系结构是从多个来源提取数据的体系结构。这种体系结构的复杂性源于以下事实:来自多个源的数据必须合并,而不是来自不同来源的数据格式不同的事实。


Joins are used to combine information from two or more data sources, such as database tables, and place it into a new data source suitable for downstream processing or reports. Joins are particularly powerful because they enable rapidly changing data to be organized into categories for subsequent report preparation through the use of matching keys. Joins are used to pre-process data, to improve the efficiency of queries, and to accelerate changed data capture (CDC).



Therefore if the businessrequirements are for only certain changes to be captured, then it would be wasteful to transfer all changes. The most advanced CDC solutions therefore provide filters that reduce the amount of information transferred, again minimizing resource requirements and maximizing speed and efficiency.

因此,如果业务需求只是为了捕获某些变化,那么转移所有变更将是浪费的。 因此,最先进的CDC解决方案提供了可减少传输信息量的过滤器,从而最大限度地降低资源需求并最大限度地提高速度和效率。


Rather than replacing the information in the data warehouse with the data in the entire online transactional database, a join will match the primary key of the previously loaded record with its corresponding new record and then compare the data portions of the two records to determine if they’ve changed. In this way, only added, deleted, and altered records are updated, which significantly reduces elapsed time of database loads. By using a high-performance join for CDC, data warehouse updates can be performed with far greater efficiency.

与其将数据仓库中的信息替换为整个在线交易数据库中的数据,联合会将先前加载的记录的主键与其相应的新记录相匹配,然后比较两个记录的数据部分以确定它们 已经改变了。 通过这种方式,只有添加,删除和更改的记录才会更新,这会显着减少数据库加载所花费的时间。 通过为CDC使用高性能连接,数据仓库更新可以以更高的效率执行。