ETL Testing

Comprehensive testing of a data warehouse at every point throughout the ETL (extract, transform, and load) process is becoming increasingly important as more data is collected and used for strategic decision-making. Data warehouse or ETL testing is often initiated as a result of mergers and acquisitions, compliance and regulation, data consolidation, and the increased reliance on data-driven decision-making (use of Business Intelligence tools, etc.). ETL testing is commonly implemented either manually or with the help of a tool (a functional testing tool, an ETL tool, or proprietary utilities). Let us understand some of the basic ETL concepts.

ETL – Extract, Transform, and Load
ETL stands for extract, transform, and load. It can consolidate data scattered across an organization's different departments and sources, bundling all of it into a uniform presentation, such as for storing in a database or data warehouse. ETL can transform not only data from different departments but also data from entirely different source systems. For example, an organization may run its business on different environments such as Sybase and Oracle Apps. Informed business decisions often need an integrated view of these systems, i.e., their data consolidated into readable reports. ETL can take the data from these two source systems, integrate it into a single format, and load it into a single database.

Need for Data Warehouse testing
A data warehouse is the main repository of an organization's historical data; it contains the material and information for management's decision support system. Most organizations run their businesses on the basis of data collected for strategic decision-making. To gain a competitive edge, an organization should be able to review historical trends and monitor real-time functional data. To enable such strategic decisions, the data should be clean and free of defects. Defects detected in production can easily lead to erratic decision-making, causing very high financial losses. Data warehouse testing is therefore very much needed to avoid such losses and remain at the top of the business.

What is ETL Testing?
ETL testing primarily involves comparing the data between source and target as per the transformation/business rules, and checking that the process completes within the specified time period in an optimized way.
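At its simplest, this comparison can be written as a set-difference query. Below is a minimal sketch using hypothetical table and column names and Oracle's MINUS operator (EXCEPT in some other databases), for a field that is a straight move from source to target:

    -- Rows present in the source but missing or altered in the target;
    -- any rows returned indicate a load or transformation problem.
    SELECT customer_id, customer_name
    FROM   src_customer
    MINUS
    SELECT customer_id, customer_name
    FROM   tgt_customer;

The reverse query (target MINUS source) catches rows that were loaded but should not have been.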

Data warehouse testing phases
We follow the phases below in testing:

  • Business understanding
    • High Level Test Approach
    • Test Estimation
    • Review Business Specification
    • Attend Business Specification and Technical Specification walkthroughs
  • Test plan creation, review and sign-off
  • Test case creation, review and sign-off
  • Test Bed & Environment setup
  • Test case execution (and regression testing if required)
    • Comparing the predictions with the actual results by testing the business rules in the test environment
    • Raising defects for non-conformance to transformation/filter rules
  • Reporting test results

The above phases are followed for Component Integration Testing in the QA environment and for System Integration Testing in the staging environment.
Test data may be retrieved from a production-controlled baseline in the QA/SIT environment, ensuring that sensitive information is masked. At times the test data may not be sufficient to cover all possible scenarios with respect to the requirements; in such cases, data may be mocked up, as explained in a later section of this write-up.

Testing Rules for DW Testing
Before proceeding with the basic rules of data warehouse testing, let us see how data warehouse testing differs from web testing. The main difference is that web-based test scenarios are generally user driven, whereas data warehouse test scenarios are event driven. Events such as batch runs and feed arrivals are part of the functional requirements in data warehouse testing, and test cases have to be written to validate them. These test cases have SQL queries as their major component, which again is a differentiator for data warehouse (DW) testing.
It is vital to test both the initial extraction of the data from the source and the subsequent update/load into the target table. Filter rules are validated during extraction, and transformation rules are validated during loading.
A defect found is of value only if it is detected early and fixed at the right time, before it leads to a high cost. To achieve this, it is very important to set some basic testing rules. They are:

  • No Data losses
  • Correct transformation rules
  • Data validation

We now discuss each of these with reference to the practices and strategies we implement in our current project.

No Data Losses
We verify that all expected data gets loaded into the data warehouse. This includes validating that all records, all fields, and the full contents of each field are loaded without any truncation at any step in the process.
As and when required, negative scenarios are also validated; for example, validating special characters.
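The truncation check in particular lends itself to a simple script. A minimal sketch with hypothetical table and column names (LENGTH is Oracle syntax; some databases use LEN or CHAR_LENGTH):

    -- Truncation check: flag rows whose target value is shorter
    -- than the source value it was loaded from.
    SELECT s.customer_id
    FROM   src_customer s
    JOIN   tgt_customer t ON t.customer_id = s.customer_id
    WHERE  LENGTH(t.customer_name) < LENGTH(s.customer_name);

Record-level completeness (no missing records) is covered by the count and set-difference checks described under Data Completeness Testing below.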

Correct Transformation Rules
We ensure that all data is transformed correctly according to the business rules, whether the rule is a straight move, a simple transformation, or a complex transformation.
One good method is to pick some sample records, use the EXACT formula in an Excel worksheet, and compare the results to validate the data transformations manually. Ideally this is done once the tester is absolutely clear about the transformation rules in the business specification.
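The same spot-check can also be scripted in SQL once the rule is clear. Below is a minimal sketch for a hypothetical derived field, assuming a business rule that builds full_name from the trimmed first and last names:

    -- Flag rows where the loaded value differs from the rule's output.
    SELECT s.emp_id,
           t.full_name AS loaded_value,
           TRIM(s.first_name) || ' ' || TRIM(s.last_name) AS expected_value
    FROM   src_employee s
    JOIN   tgt_employee t ON t.emp_id = s.emp_id
    WHERE  t.full_name <> TRIM(s.first_name) || ' ' || TRIM(s.last_name);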

Data Validation
We achieve quality when we successfully fulfill the customer's requirements; in other words, when we deliver value to the customer. Since test execution in data warehouse testing revolves around the data, it is important to achieve that degree of excellence for the data itself. For that, we validate both the data extracted from the source and the data loaded into the target tables.
Data validation includes the following:

  • Data Completeness Testing
  • Data Transformation Testing
  • Data Quality Testing
  • Data Conditioning

Data Completeness Testing
Data Completeness testing involves the following:

  • Structure and Field Mapping Verification: In this step we verify that the data structures of the target tables are created as per the design in the mapping document. This means checking the field name, data type, field size, and constrained fields (e.g., primary key, foreign key, null, not null) between the database and the mapping document.
  • Record Count Verification: In this step we verify that the number of source records expected in the target table as per the business rules (excluding any captured exceptions) matches the number of records actually present in the target table. Both checks are sketched below.
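The sketch below assumes an Oracle database (user_tab_columns is Oracle's dictionary view; other databases expose similar metadata through INFORMATION_SCHEMA.COLUMNS) and hypothetical table names and exception rules:

    -- Structure verification: compare this output against the mapping document.
    SELECT column_name, data_type, data_length, nullable
    FROM   user_tab_columns
    WHERE  table_name = 'TGT_CUSTOMER'
    ORDER  BY column_id;

    -- Record count verification: the expected count per the business rules
    -- (here, a hypothetical rule excluding rejected records) versus the
    -- actual count in the target table.
    SELECT (SELECT COUNT(*) FROM src_customer WHERE status <> 'REJECTED') AS expected_count,
           (SELECT COUNT(*) FROM tgt_customer) AS actual_count
    FROM   dual;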

Data Transformation Testing
Data transformation testing is the verification of data integrity, i.e., verification of data consistency and the absence of discrepancies between the source and target tables.
Data in the source table fields is either pulled directly into the corresponding fields of the target table, transformed based on some logic and then loaded into the target table, or transformed through lookup tables.
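As an illustration of the lookup case, the minimal sketch below (hypothetical table and column names) checks a country_code that is resolved to a country_name through a reference table before load:

    -- Flag rows where the loaded value does not match the lookup result.
    SELECT s.customer_id,
           t.country_name AS loaded_value,
           l.country_name AS expected_value
    FROM   src_customer s
    JOIN   lkp_country  l ON l.country_code = s.country_code
    JOIN   tgt_customer t ON t.customer_id  = s.customer_id
    WHERE  t.country_name <> l.country_name;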

Data Quality Testing
Data quality testing is performed to validate the data present in the newly created tables. For systems where referential integrity is not enforced, various quality checks are done, such as looking for duplicate records on the same natural keys, checking for valid combinations of indicative fields, verifying a unique GCI, and checking the order of dimension key generation. Typical rules include the following; a sketch of these checks follows the list:

  • Surrogate keys (auto-generated) in the target table should be unique and non-null.
  • Natural keys (usually loaded from the source) in the target table should be unique and non-null.
  • Certain fields should not be null, and certain others should be null, for different records in the target table (null check).
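A minimal sketch of the key and null checks, again with hypothetical table and column names:

    -- Duplicate natural keys in the target table.
    SELECT natural_key, COUNT(*)
    FROM   tgt_customer
    GROUP  BY natural_key
    HAVING COUNT(*) > 1;

    -- Surrogate key checks: COUNT(column) ignores nulls, so these two
    -- expressions count duplicate and null surrogate keys respectively;
    -- both figures should be zero.
    SELECT COUNT(customer_sk) - COUNT(DISTINCT customer_sk) AS duplicate_sks,
           COUNT(*) - COUNT(customer_sk) AS null_sks
    FROM   tgt_customer;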

Data Conditioning
Data conditioning is carried out to manipulate (condition) the data in the source table and verify its behavior in the target table based on certain criteria.

Data may be filtered from the source table based on certain indicators and then loaded into the target table.
SQL test scripts are written to recreate the filter logic, and the resultant data (obtained by executing the test scripts) is compared, field by field, against the data loaded into the target tables.
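A minimal sketch, assuming a hypothetical filter rule that only active, non-test accounts are loaded (MINUS is Oracle's set-difference operator; some databases use EXCEPT):

    -- Rows the filter says should be loaded but are missing from the target.
    SELECT account_id FROM src_account
    WHERE  status_ind = 'A' AND test_acct_flag = 'N'
    MINUS
    SELECT account_id FROM tgt_account;

    -- Rows loaded into the target that the filter should have excluded.
    SELECT account_id FROM tgt_account
    MINUS
    SELECT account_id FROM src_account
    WHERE  status_ind = 'A' AND test_acct_flag = 'N';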

Conclusion:
Evolving needs of the business and changes in the source systems will drive continuous change in the data warehouse schema and the data being loaded. Hence, development and testing processes must be clearly defined and followed, with impact analysis and strong alignment between development, operations, and the business.