Data Warehouse Design Patterns

0
5876
design-patterns

This post will not dive into each topic in detail but serve more like a curriculum of things to research for the  Data Journey.  Anyone who needs to get into the Data Warehouse (DW) space should have a handle on the following Design Patterns:

Connection Patterns

There are 4 Patterns that can be used between applications in the Cloud and on premise.  The combinations are as follows

  • on-premise caller to Cloud provider
  • Cloud caller to on-premise provider
  • Cloud caller to Cloud provider
  1. Remote procedure calls (RPC) Connection Patterns
  2. Asynchronous (fire and forget) Connection Patterns using Queues
  3. Shared Database in cloud or on-premise
  4. Data/File synchronizing in Copying Data (ETL) flat file loads, database to database sources to targets.

Extract Transform Load (ETL) Patterns

Truncate and Load Pattern (AKA full load): its good for small to medium volume data sets which can load pretty fast.  it is good for staging areas and it is simple.  The key benefit is that if there are deletions in the source then the target is updated pretty easy.  The disadvantage is there is no history .kept and no tracking. CUID ie created, updated Inserted or Deleted cannot be tracked.

Slowly Changing Dimension Type 1 Pattern: This pattern is simple but it is very slow and should not be done for anything over 1000 rows.  See the dimensions definition for type 1

Slowly Changing Dimension Type 2 Pattern: This pattern is simple but it is very slow and should not be done for anything over 1000 rows.  See the dimensions definition for type 2

Declarative/Adhoc SQL Query Patterns

  1. Join patterns: directional, inner or equijoin, left and right outer join, full outer joinA theta join allows for arbitrary comparison relationships (such as ≥ or between).  An equijoin is a theta join using the equality operator.  A natural join is an equijoin on attributes that have the same name in each relationship
  2. Flattened Hierarchies which put all the levels on one row as columns vs Ragged hierarchies which like unbalanced hierarchies, the branches of the hierarchies can descend to different levels.
  3. Join Tables/ Translation tables usually when putting two silo systems in the same context so the data can be merged
  4. Parent/ Child Tables and Cardinality (Fan traps that occur when using aggregate measures), the parent is a foreign key (FK) on the child record so the relationship creates data clusters, We have to ensure we do not writes queries that multiple the aggregation values.
  5. Self Joins (aka Alias)  in the SQL we can refer to the same table by another name and join to itself eg Manager is a type of Person so technically the person table can be self joined to get the manager’s info.

Query Performance Patterns

  1. Explain Plans, Indexing and Partitions, this is the bedrock of performance tuning in relational databases.  This topic alone deserves its own post. It would be dependent on table storage and data types configurations at the Data Definition Language (DDL) setup. It will also need knowledge of the data cardinality to create balanced tree vs bitmap indexes and user query patterns to create covering indexes and getting more index range scans if query does not uniquely select the index or hit partitions to use much smaller data sets for faster queries.
  2. ETL Aggregation and Aggregate awareness for multiple aggregation tables
  3. Table Constraints in Data quality, including PK, FK and additional functions or regular expressions that can be put on columns to ensure the accurate data and not nulls are stored as needed.

Dashboard Design Patterns

  1. Layout Patterns
  2. Leading Indicators Aggregation Pattern
  3. Drill Down Pattern
  4. Progressive Filtering Choice Pattern

Security Patterns

I have a dedicated article to security patterns which are getting increasing complex as time progress and new regulations.