Basic Terminology of Data Warehousing (DW) for Business Intelligence (BI)

0
5460
data-warehouse

Why build a Data Warehouse/ Data Lake?

Transactions are essential, but the capture is of small value compared to the data’s value.  The purpose of a Data Warehouse (DW) is to create a place to clean, merge and keep historical data for reporting to gain insight into the business process.  A DW is a foundation for Business Intelligence (BI). 

Data is the treasure trove that we are sitting on but never notice because that data has to be turned into information, then into knowledge, and finally into income.  A DW is where all the data is gathered and massaged.  This stored data is now in a reportable and comprehensive format to gain insight. 

All DW projects begin by standing before a board and explain why they should invest money in a DW.  Here are the high-level reasons to consider.

  1. Single Version of the Truth, create one place that everyone can source data and have all reporting with the same numbers as its base data.  See definition for SOR below.  A data warehouse integrates data from multiple data sources.  Once data is in the data warehouse, it will track all history of changes.
  2. Data Aggregation, we live in the era of big data.  A human can’t consume and interpret this amount of data.  DW supports one version of the truth as all reporters will have the same numbers at a high level.
  3. Drilling down, we have to support and prove the high-level numbers.  Consumers can view summary numbers or data aggregates called Key Process Indicators (KPI).  Drill down on the KPIs to get the details that make up that aggregate value.  This lineage gives more trust to the consumers of the data.
  4. Trends across time: Store historical information in one place and the right context or grain.  This data will persists for the lifetime of the DW and give historical context.  History has a way of repeating patterns.
  5. Correlation of multiple data sources, silo business processes store its data in its databases.  We need to see the big picture of how a business process is impacting each other if there are relationships

Some technical terms that you will need to know when making a Data Warehouse (DW)

Two names come up as founders in the data warehouse space, and we should fundamentally know the difference between the two paradigms.

Bill Inmon’s paradigm: Data warehouse is one part of the overall business intelligence system.  An enterprise has one data warehouse, and data marts source their information from the data warehouse.  In the data warehouse, information is stored in 3rd normal form.

Ralph Kimball’s paradigm: Data warehouse is the conglomerate of all data marts within the enterprise.  Information is always stored in the dimensional model.

The dimensional model includes fact tables and dimension tables.  Fact tables connect to one or more dimension tables, but fact tables do not have direct relationships with one another.  Dimensions and hierarchies are represented by dimension tables.  Attributes are the non-key columns in the dimension tables.

Dimension: A category of information.

Dimension tables – usually give detailed information about the attributes for the ids (foreign keys) stored on fact tables. It is a table of attributes for that category of information.

Attribute: A unique level within a dimension.

Grain or Granularity –  The grain will be the lowest information level stored in the fact table.  The lower the level of detail, the larger the data amount in the fact table. How detailed your questions need to be will determine how close to the transaction you need to store the data in the DW. 

Surrogate key – a new key developed to be used by a new system-moves data from the source system to the new system like the DW.  It is excellent at creating historical rows for primary system keys as changes occur.

Fact tables usually answer who, what, where, when, and how much, kind of questions.  Many dimension tables support a fact table, and that defines the context.  The fact usually supports measures of interest—this the basis for a star schema.

There are three types of facts:

  • Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
  • Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
  • Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.  These make up the Factless Fact Table.

Cumulative Fact Tables or Aggregrate Tables: This type of fact table describes what has happened over time and is in summary format.   The grain might be monthly instead of daily.

Snapshot Fact Tables: This type of fact table describes the state of things at an instance of time and usually includes more semi-additive and non-additive facts

Junk Dimension is a technique of slimming down a fat Fact table for better performance. Remove all passive Non-Additive attributes to a dimension table and put the foreign key for the Junk Dimension table on the fact.

Normalized form – Store data in many rows instead of a flattened table with many columns and a single row.  A Normalized design will typically have many tables to manage each entity, prevent repeating data, and conserve storage space.

When normalizing, we typically go to the 3rd Normal Form, defined by transitive functional dependency.  We have transitive dependency if  A is functionally dependent on B, and B is functionally dependent on C. In this case, C is transitively dependent on A via B

De-normalized rows- flattens data to fit in one row might take more storage, but it is easier for reporting queries.

Slowly Changing Dimensions – Ralph Kimball introduced the concept of Slowly Changing Dimension (SCD) attributes in 1996.  The types are dependent on how we manage these changes.  The business must drive it as they know the data much better.

SCD approaches type 1 (overwrite, no history is kept); if the phone number changes, we do not track it

Key   Name Phone
1001   Stephen 123 456 7894

SCD type 2 (add a row to keep history), this is a scalable design

Key Effective_dt Name Phone
1001 1-Aug Stephen 123 456 7894
1001 1-Dec Stephen 456 524 7894

SCD type 3 (add a column track history), The original record is modified to reflect the change, so it’s not a scalable type of change if we need an unnamed amount of changes on the record.  The design is limited in its ability to scale changes.

Key Effective_dt Name Phone New Phone New_Effective_dt
1001 1-Aug Stephen 123 456 7894 456 524 7894 1-Dec

Since legibility is a vital component of the Kimball mantra, we sometimes wish Ralph had given these techniques more descriptive names, such as “overwrite” instead of “type 1.” But at this point, the SCD type numbers are part of our industry’s vernacular.

Fast-Changing Dimensions – these are usually more oversized tables that can be split into an SCD and a reference table in a DW design.

Conformed Dimension is a dimension that is referenced by more than one fact table.  It gives the same meaning when referring to different fact tables to give analytic integrity to the data model.  The time dimension table is a typical example of a conformed dimension.

Logical Data Model – describes the data in as much detail as possible in a domain-specific language.  It will have all the entities with a primary key on each defined and their relationships to each other with foreign keys.  At logical model design, you should resolve many-to-many relationships and normalize them.  The columns are defined by an Attribute type.

Physical Data Model – converts the logical design to physical tables, including column name, column data type, column constraints, primary key, foreign key, and relationships between tables.

Star Schema is a fact table surrounded by many dimensions tables.

Snow Flake Schema is many star schemas joined together.  Snowflakes are sometimes different domain tables left in the 3rd Normalized form and joined together.  We can also have a hybrid of star and 3rd Normal form.  A snowflake schema usually improves query performance due to smaller lookup tables and more specific fact tables.

KPI – Key Performance Indicators drive the objectives and goals of any given business.  Each business and each department will define its own KPI’s.

OLAP is about Multidimensional analysis.  OLAP stands for On-Line Analytical Processing.  There are two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP).  Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP data is stored in a multidimensional cube.  The storage is not in the relational database but proprietary formats.  The cube size is usually a limiting factor for computed data storage.

ROLAP analysis is essentially a SQL query (or multiple SQL queries) in the relational database

HOLAP gets the best of both worlds and typically will “drill through” from the cube into the underlying relational data to get more details.

Hierarchy is a style of modeling data in levels for ROLAP.  Each level will have attributes of that level.  This design will define how to roll up data or make a drill path for “drill down.”  The specification of levels represents the relationship between different attributes within a dimension.  Eg, every year has months as its next attribute level, and months have days as its next attribute level.  This nesting of related categories forms the hierarchy.

FASMI test, Fast Analysis of Shared Multidimensional Information.  Dr. Codd proposed 12 rules for OLAP.  For a more detailed description of both Dr. Codd’s rules and the FASMI test, visit The OLAP Report

Operational Reports are reports used consistently by a business to service its clients.  

Analytic Reports are reports used exploratory in nature and used by a business to gain insight into trends or make predictions for future decision-making.

ETL – Extract, Transform, and Load.  ETL is needed for all BI as we must typically pull data from many sources and put them in a design model to answer future questions.

Meta-Data -this is data about data.  It is the dictionary that describes the information captured in the DW.

Data Store – this is a domain-specific reporting database.  It would typically follow a star schema designed for query and analysis rather than for transaction processing.  It usually contains historical data derived from its transaction data.

System of Record (SOR) – or source system of record (SSoR) is a data management term for an information storage system that is the authoritative data source for a given data element or piece of information.  It can be the application where the data is captured or delegated to a data warehouse.  The SOR would act as the single version of the truth.

What should you look at next?  It would help if you considered looking at the Design Patterns for Data Warehousing