Testing & Data Validation

What is Data Validation?

Data validation is a set of checks that ensure data arrives in the format and to the specifications we expect. In effect, we are asking whether the data is valid for how it will be used. This question can be asked at multiple stages of the data application lifecycle: as we ingest input data, after we transform the data, or before publishing it for end-user consumption.

Common validations include checking whether (a minimal sketch follows the list):

  • a column contains null values
  • a required column exists
  • summary statistics fall within expected ranges
  • the DataFrame has the expected shape
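
A minimal sketch of these checks, assuming a pandas DataFrame with hypothetical column names (user_id, amount) and illustrative bounds:

```python
import pandas as pd

def validate_basics(df: pd.DataFrame) -> list[str]:
    """Return a list of validation failures; an empty list means the data passed."""
    failures = []

    # A required column exists ("user_id" is a hypothetical name).
    if "user_id" not in df.columns:
        failures.append("missing column: user_id")
    # The column contains no null values.
    elif df["user_id"].isnull().any():
        failures.append("user_id contains nulls")

    # A statistic is within an expected range (the bounds are illustrative).
    if "amount" in df.columns and not 0 <= df["amount"].mean() <= 10_000:
        failures.append("amount mean outside expected range")

    # The DataFrame has the expected shape (here: simply not empty).
    if df.shape[0] == 0:
        failures.append("DataFrame is empty")

    return failures
```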

Scenario 1 – Changes in Source Collection

Upstream changes in how data is collected can invalidate our pipeline: table schemas, column types, or even the set of acceptable values can change. Think of a column that has always had two categories when a third is suddenly added; downstream operations would need to change to handle it. New attributes may also start being captured, while others are retired from the domain.
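
A guard against this kind of drift is to compare the categories that actually arrive against the set the pipeline was built for; the column and category names below are hypothetical:

```python
import pandas as pd

# The two categories the downstream logic was written for (hypothetical values).
EXPECTED_CATEGORIES = {"online", "in_store"}

def check_categories(df: pd.DataFrame) -> set:
    """Return any category values the downstream operations do not know about."""
    return set(df["channel"].dropna().unique()) - EXPECTED_CATEGORIES

unexpected = check_categories(pd.DataFrame({"channel": ["online", "in_store", "mobile"]}))
if unexpected:
    raise ValueError(f"Unknown upstream categories: {unexpected}")
```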

Scenario 2 – Unexpected Data Shows Up

Corrupted or bad data is commonly introduced through manual input or updated source files, and it can break transformations or cause embarrassing output to be published. The transformation layer will fail if data is missing or arrives with a type different from the one the schema defines.
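
A lightweight schema check at the ingest boundary can reject such data before it reaches the transformation layer; the expected schema below is assumed for illustration:

```python
import pandas as pd

# The schema the pipeline was defined with (hypothetical columns and types).
EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "created_at": "datetime64[ns]"}

def check_schema(df: pd.DataFrame) -> list[str]:
    """Compare incoming columns and dtypes against the defined schema."""
    problems = []
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            problems.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
    return problems
```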

Scenario 3 – Data Collection Fails for a Bit

Data can go missing or become stale when the underlying infrastructure goes down temporarily. It can also happen because of delays: the system of record cannot send the data, or it sends the data but the ingest layer cannot process it for a couple of days, and the deltas build up.
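
A freshness check can surface staleness before it propagates; the event-time column name and the 24-hour tolerance are assumptions:

```python
import pandas as pd

def check_freshness(df: pd.DataFrame, max_lag_hours: int = 24) -> None:
    """Fail if the newest record is older than the allowed lag."""
    newest = pd.to_datetime(df["event_time"], utc=True).max()
    lag = pd.Timestamp.now(tz="UTC") - newest
    if lag > pd.Timedelta(hours=max_lag_hours):
        raise ValueError(f"Data is stale: newest record is {lag} old")
```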

Scenario 4 – Shared Pipelines

Data projects often share the same data for conformed dimensions. These pipelines can be shared by multiple teams, and they should be, to avoid unnecessary recomputation. But one team’s changes may alter the values in a table without another team knowing, and the shape of the output can drift over time, impacting every other tenant.
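
For a shared table, one cheap safeguard is to assert that its columns and row count stay within an agreed tolerance between runs; the baseline and tolerance here are hypothetical:

```python
import pandas as pd

def check_output_drift(df: pd.DataFrame, expected_columns: list[str],
                       baseline_rows: int, tolerance: float = 0.10) -> list[str]:
    """Flag column changes or row-count swings beyond the agreed tolerance."""
    issues = []
    if list(df.columns) != expected_columns:
        issues.append(f"columns changed: {list(df.columns)} != {expected_columns}")
    if abs(len(df) - baseline_rows) > baseline_rows * tolerance:
        issues.append(f"row count {len(df)} deviates more than {tolerance:.0%} from baseline {baseline_rows}")
    return issues
```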

Data Quality and Coverage

Consider building a variety of test cases to maximize data coverage (a few sketches follow the list):

  • per-column validations, plus validations across dependent columns
  • distribution validations
  • string validations
  • category-dependent validations
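
A few sketches of these categories, with hypothetical columns and thresholds:

```python
import pandas as pd

def coverage_checks(df: pd.DataFrame) -> list[str]:
    failures = []

    # Cross-column validation: a delivery cannot precede its order.
    if (pd.to_datetime(df["delivered_at"]) < pd.to_datetime(df["ordered_at"])).any():
        failures.append("delivered_at precedes ordered_at")

    # Distribution validation: null rate of an optional field stays under 5%.
    if df["coupon_code"].isnull().mean() > 0.05:
        failures.append("coupon_code null rate above 5%")

    # String validation: country codes are exactly two uppercase letters.
    if not df["country"].str.fullmatch(r"[A-Z]{2}").all():
        failures.append("country is not a two-letter code")

    # Category-dependent validation: refund rows must carry a negative amount.
    refunds = df[df["type"] == "refund"]
    if (refunds["amount"] >= 0).any():
        failures.append("refund rows with non-negative amount")

    return failures
```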

When is Data Validation Done?

Best practice is to run validation as often as possible using lightweight checks, typically at three checkpoints (a sketch of the wiring follows the list):

  1. Source Validation – checking that the source data is up to standard
  2. Transformation Validation – checking that the data was not altered incorrectly during the transformation steps
  3. Result Validation – checking the results right before they are published
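
Wired together, the three checkpoints might look like the sketch below; the emptiness check and the dropna() transform are illustrative stand-ins for a real pipeline’s steps:

```python
import pandas as pd

def validate(df: pd.DataFrame, stage: str) -> pd.DataFrame:
    """A single lightweight gate reused at each stage (the check is illustrative)."""
    if df.empty:
        raise ValueError(f"{stage} validation failed: empty DataFrame")
    return df

def run_pipeline(source: pd.DataFrame) -> pd.DataFrame:
    raw = validate(source, "source")                   # 1. source validation
    transformed = validate(raw.dropna(), "transform")  # 2. transformation validation
    return validate(transformed, "result")             # 3. result validation before publishing
```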

What Happens When Validation Fails?

Errors can be captured preemptively, and pipelines can be stopped with failure reports. In some cases the pipeline can continue running; continuing is often the better choice, since production support will be invoked to manage the issue and make a decision. The most important thing about validation in production is getting insight into potential failures before they occur: the validation is a contract the data must fulfill for the data pipeline to succeed.
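
One way to express the stop-or-continue decision is to split checks into blocking and non-blocking ones; the severity convention here is an assumption, not a standard:

```python
import logging

def enforce(failures: list[str], blocking: bool = True) -> None:
    """Stop the pipeline with a failure report, or report and continue."""
    if not failures:
        return
    report = "; ".join(failures)
    if blocking:
        raise RuntimeError(f"Validation contract broken: {report}")
    logging.warning("Validation warnings (pipeline continues): %s", report)
```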

How much data quality coverage do you have on your data pipeline?

How do you manage your findings?