Challenges in following ETL Process during Data Warehousing
In the context of data science, ETL i.e. Extract, Transform and Load is a commonly used acronym. The name ETL came into existence in the early years of the 21st century, when formalization of data science as a discipline came into existence. As ETL process involves various stages of transformation, homogenization, and cleansing, it is a routine programming problem for all business applications.
Challenges in extraction process
One of the challenges in integrating data across heterogeneous sources is the availability of compatible drivers across diverse data sources. Any data extraction tool, program or script needs to be able to parse the source data. This implies availability of efficient APIs which allows the ETL processes to interface with the data sources and extract the required data fields efficiently and accurately. If compatible and standard drivers are not available, it needs to be coded and maintained in the Extraction repository.
Extraction of data from unstructured data sources, e.g. html pages, textual data involves custom programming as most commercially available ETL tools do not have these plug ins. However, commercially available ETL tools work efficiently with static data structures and designing processes that recognize source data structure changes and repair extraction processes dynamically is a complex programming problem which is not supported out-of-the-box by them. This adds significantly to the cost of maintenance of the ETL process.
A key challenge in defining the CDC strategy is it has the potential to disrupt the transaction processing during extraction. A typical extraction schedule would be daily incremental extracts followed by either weekly or monthly full extracts to bring the warehouse in sync with the source on a weekly or monthly basis. Extraction of data from the master and transaction data bases has significant overheads since the transaction databases are designed for efficient inserts and updates rather than reads and queries, which an Extraction process starts off with.
Changes to the source data during extraction process has to be identified depending on the transformation and load strategy for changes which is classified as Slowly Changing Dimensions (SCD) patterns. Change recognition itself is a major design problem requiring applications to create audit trails or create ETL processes. Differing frequency of changes across multiple data sources and the ability to maintain time series data in each source gets further compounded by the very trivial clock synchronization across the data sources.
The volume of data extracted depends on the frequency of extraction, the amount of changes, the SCD pattern being implemented and the warehouse synchronization strategy – i.e. weekly or month full extracts for synchronization. Right timings for the extractions depend on the nature of analysis done on in the data warehouse. Challenges come when the frequency (dealt with in a previous section) is real time - each extraction and application of the change affects the data characteristics, influencing analyses dynamically, and even erroneously due to timing differences across different data sources during a real time data extraction and application cycle.
The combination of these challenges gets compounded with the number of data sources, each with their own frequency of changes, and accompanying CDC strategy. Correlation of the changes across multiple sources while extracting needs to handle ‘null’ values for correlated fields. Besides, each source may log changes in different formats and ways, some of it (DBMS logs) proprietary, needing source specific coding with accompanying maintenance overheads.
Transformation- an extension of the extraction workflow
Transformation can be done as an extension of the extraction workflow, or in a Data Staging Area, or a combination of both. There can be multiple stages and issues of the transformation process as well. Among them is data consistency. Data extracted from multiple sources result in ambiguity. Same data item from multiple sources can be of different data types and therefore cannot be merged without homogenization and representation mismatches. There will also be data which are outliers, i.e., beyond the normal range of possible values. As a part of merging data, a common challenge would be translation of multiple source values into a standard value. Another challenge in data mapping emerges if the target structure enforces referential integrity constraints.
The transformation stage also decides on derived data, which may range from simple concatenation, substitution, to complex statistical aggregations. These derivations may result in out-of-bound values, on-size errors and null values.
Loading, an eventual step to data cleansing and transformations
At this stage, the necessary data cleansing is done, and transformations and derivations are completed. The major challenge is the time taken to upload the warehouse. This in turn is dependent on whether a bulk upload strategy or a cursor upload strategy is adopted. When the data warehouse is being used for analysis, the underlying data should be available for use, and be relatively static so as not to affect readings taken. The duration of the load and concurrency levels available during the loads are important considerations. Data driven real time decision making would require real time data warehousing and decision support systems. The challenges in populating a data warehouse using ETL processes get compounded with the real time requirement. Most conventional data warehouses are built on a relational database environment and therefore the commercially available ETL tools work reasonably well if they are designed appropriately.