Testing & Data Validation

What is Data Validation?

Data validation is a set of checks that ensure data arrives in the format and to the specifications we expect. In effect, we are asking whether the data is valid for how it will be used. This question can be asked at multiple stages of the data application lifecycle: as we ingest input data, after we transform the data, or before publishing it for end-user consumption.

Common validations include checking whether (a minimal sketch follows the list):

  • a column contains null values
  • a required column exists
  • summary statistics fall within expected ranges
  • the DataFrame has the expected shape
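
A minimal sketch of these checks, assuming a pandas DataFrame with hypothetical column names (user_id, amount) and illustrative bounds:

```python
import pandas as pd

def validate_basics(df: pd.DataFrame) -> list[str]:
    """Return a list of validation failures; an empty list means the data passed."""
    failures = []

    # A required column exists ("user_id" is a hypothetical name).
    if "user_id" not in df.columns:
        failures.append("missing column: user_id")
    # The column contains no null values.
    elif df["user_id"].isnull().any():
        failures.append("user_id contains nulls")

    # A statistic is within an expected range (the bounds are illustrative).
    if "amount" in df.columns and not 0 <= df["amount"].mean() <= 10_000:
        failures.append("amount mean outside expected range")

    # The DataFrame has the expected shape (here: simply not empty).
    if df.shape[0] == 0:
        failures.append("DataFrame is empty")

    return failures
```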

Scenario 1 – Changes in Source Collection

Upstream changes in how data is collected can invalidate our pipeline: table schemas, column types, or even the set of acceptable values can change. Think of a column that has always had two categories when a third is suddenly added; downstream operations would need to change to handle it. New attributes may also start being captured, while others are retired from the domain.
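
A guard against this kind of drift is to compare the categories that actually arrive against the set the pipeline was built for; the column and category names below are hypothetical:

```python
import pandas as pd

# The two categories the downstream logic was written for (hypothetical values).
EXPECTED_CATEGORIES = {"online", "in_store"}

def check_categories(df: pd.DataFrame) -> set:
    """Return any category values the downstream operations do not know about."""
    return set(df["channel"].dropna().unique()) - EXPECTED_CATEGORIES

unexpected = check_categories(pd.DataFrame({"channel": ["online", "in_store", "mobile"]}))
if unexpected:
    raise ValueError(f"Unknown upstream categories: {unexpected}")
```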

Scenario 2 – Unexpected Data Shows Up

Corrupted or bad data is commonly introduced through manual input or updated source files, and it can break transformations or cause embarrassing output to be published. The transformation layer will fail if data is missing or arrives with a type different from the one the schema defines.
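
A lightweight schema check at the ingest boundary can reject such data before it reaches the transformation layer; the expected schema below is assumed for illustration:

```python
import pandas as pd

# The schema the pipeline was defined with (hypothetical columns and types).
EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "created_at": "datetime64[ns]"}

def check_schema(df: pd.DataFrame) -> list[str]:
    """Compare incoming columns and dtypes against the defined schema."""
    problems = []
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            problems.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
    return problems
```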

Scenario 3 – Data Collection Fails for a Bit

Data can go missing or become stale when the underlying infrastructure goes down temporarily. It can also happen because of delays: the system of record cannot send the data, or it sends the data but the ingest layer cannot process it for a couple of days, and the deltas build up.
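
A freshness check can surface staleness before it propagates; the event-time column name and the 24-hour tolerance are assumptions:

```python
import pandas as pd

def check_freshness(df: pd.DataFrame, max_lag_hours: int = 24) -> None:
    """Fail if the newest record is older than the allowed lag."""
    newest = pd.to_datetime(df["event_time"], utc=True).max()
    lag = pd.Timestamp.now(tz="UTC") - newest
    if lag > pd.Timedelta(hours=max_lag_hours):
        raise ValueError(f"Data is stale: newest record is {lag} old")
```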

Scenario 4 – Shared Pipelines

Data projects often share the same data for conformed dimensions. These pipelines can be shared by multiple teams, and they should be, to avoid unnecessary recomputation. But one team’s changes may alter the values in a table without another team knowing, and the shape of the output can drift over time, impacting every other tenant.
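
For a shared table, one cheap safeguard is to assert that its columns and row count stay within an agreed tolerance between runs; the baseline and tolerance here are hypothetical:

```python
import pandas as pd

def check_output_drift(df: pd.DataFrame, expected_columns: list[str],
                       baseline_rows: int, tolerance: float = 0.10) -> list[str]:
    """Flag column changes or row-count swings beyond the agreed tolerance."""
    issues = []
    if list(df.columns) != expected_columns:
        issues.append(f"columns changed: {list(df.columns)} != {expected_columns}")
    if abs(len(df) - baseline_rows) > baseline_rows * tolerance:
        issues.append(f"row count {len(df)} deviates more than {tolerance:.0%} from baseline {baseline_rows}")
    return issues
```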

Data Quality and Coverage

Consider building a variety of test cases to maximize data coverage (a few sketches follow the list):

  • per-column validations, plus validations across dependent columns
  • distribution validations
  • string validations
  • category-dependent validations
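
A few sketches of these categories, with hypothetical columns and thresholds:

```python
import pandas as pd

def coverage_checks(df: pd.DataFrame) -> list[str]:
    failures = []

    # Cross-column validation: a delivery cannot precede its order.
    if (pd.to_datetime(df["delivered_at"]) < pd.to_datetime(df["ordered_at"])).any():
        failures.append("delivered_at precedes ordered_at")

    # Distribution validation: null rate of an optional field stays under 5%.
    if df["coupon_code"].isnull().mean() > 0.05:
        failures.append("coupon_code null rate above 5%")

    # String validation: country codes are exactly two uppercase letters.
    if not df["country"].str.fullmatch(r"[A-Z]{2}").all():
        failures.append("country is not a two-letter code")

    # Category-dependent validation: refund rows must carry a negative amount.
    refunds = df[df["type"] == "refund"]
    if (refunds["amount"] >= 0).any():
        failures.append("refund rows with non-negative amount")

    return failures
```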

When is Data Validation Done?

Best practice is to run validation as often as possible using lightweight checks, typically at three checkpoints (a sketch of the wiring follows the list):

  1. Source Validation – checking that the source data is up to standard
  2. Transformation Validation – checking that the data was not altered incorrectly during the transformation steps
  3. Result Validation – checking the results right before they are published
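
Wired together, the three checkpoints might look like the sketch below; the emptiness check and the dropna() transform are illustrative stand-ins for a real pipeline’s steps:

```python
import pandas as pd

def validate(df: pd.DataFrame, stage: str) -> pd.DataFrame:
    """A single lightweight gate reused at each stage (the check is illustrative)."""
    if df.empty:
        raise ValueError(f"{stage} validation failed: empty DataFrame")
    return df

def run_pipeline(source: pd.DataFrame) -> pd.DataFrame:
    raw = validate(source, "source")                   # 1. source validation
    transformed = validate(raw.dropna(), "transform")  # 2. transformation validation
    return validate(transformed, "result")             # 3. result validation before publishing
```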

What Happens When Validation Fails?

Errors can be captured preemptively, and pipelines can be stopped with failure reports. In some cases the pipeline can continue running; continuing is often the better choice, since production support will be invoked to manage the issue and make a decision. The most important thing about validation in production is getting insight into potential failures before they occur: the validation is a contract the data must fulfill for the data pipeline to succeed.
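
One way to express the stop-or-continue decision is to split checks into blocking and non-blocking ones; the severity convention here is an assumption, not a standard:

```python
import logging

def enforce(failures: list[str], blocking: bool = True) -> None:
    """Stop the pipeline with a failure report, or report and continue."""
    if not failures:
        return
    report = "; ".join(failures)
    if blocking:
        raise RuntimeError(f"Validation contract broken: {report}")
    logging.warning("Validation warnings (pipeline continues): %s", report)
```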

How much data quality coverage do you have on your data pipeline?

How do you manage your findings?