{"id":920,"date":"1999-09-16T21:07:01","date_gmt":"1999-09-17T03:07:01","guid":{"rendered":"http:\/\/www.designandexecute.com\/designs\/?p=920"},"modified":"2022-09-10T11:25:35","modified_gmt":"2022-09-10T17:25:35","slug":"basics-of-data-warehouse-dw","status":"publish","type":"post","link":"https:\/\/www.designandexecute.com\/designs\/basics-of-data-warehouse-dw\/","title":{"rendered":"Basic Terminology of Data Warehousing (DW) for Business Intelligence (BI)"},"content":{"rendered":"<h1><span data-preserver-spaces=\"true\">Why build a Data Warehouse\/ Data Lake?<\/span><\/h1>\n<p><span data-preserver-spaces=\"true\">Transactions are essential, but the capture is of small value compared to the data&#8217;s value.\u00a0 The purpose of a Data Warehouse (DW) is to create a place to clean, merge and keep historical data for reporting to gain insight into the business process.\u00a0 A DW is a foundation for Business Intelligence (BI).\u00a0<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Data is the treasure trove that we are sitting on but never notice because that data has to be turned into information, then into knowledge, and finally into income.\u00a0 A DW is where all the data is gathered and massaged.\u00a0 This stored data is now in a reportable and comprehensive format to gain insight.\u00a0<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">All DW projects begin by standing before a board and explain why they should invest money in a DW.\u00a0 Here are the high-level reasons to consider.<\/span><\/p>\n<ol>\n<li><strong><span data-preserver-spaces=\"true\">Single Version of the Truth<\/span><\/strong><span data-preserver-spaces=\"true\">, create one place that everyone can source data and have all reporting with the same numbers as its base data.\u00a0 <\/span><em><span data-preserver-spaces=\"true\">See definition for SOR below.\u00a0 <\/span><\/em>A data warehouse integrates data from multiple data sources.\u00a0 Once data is in the data warehouse, it will track all history of changes.<\/li>\n<li><strong><span data-preserver-spaces=\"true\">Data Aggregation<\/span><\/strong><span data-preserver-spaces=\"true\">, we live in the era of big data.\u00a0 A human can&#8217;t consume and interpret this amount of data.\u00a0 DW supports one version of the truth as all reporters will have the same numbers at a high level.<\/span><\/li>\n<li><strong><span data-preserver-spaces=\"true\">Drilling down<\/span><\/strong><span data-preserver-spaces=\"true\">, we have to support and prove the high-level numbers.\u00a0 Consumers can view summary numbers or data aggregates called Key Process Indicators (KPI).\u00a0 Drill down on the KPIs to get the details that make up that aggregate value.\u00a0 This lineage gives more trust to the consumers of the data.<\/span><\/li>\n<li><strong><span data-preserver-spaces=\"true\">Trends across time:\u00a0<\/span><\/strong><span data-preserver-spaces=\"true\">Store historical information in one place and the right context or grain.\u00a0 This data will persists for the lifetime of the DW and give historical context.\u00a0 History has a way of repeating patterns.<\/span><\/li>\n<li><strong><span data-preserver-spaces=\"true\">Correlation<\/span><\/strong><span data-preserver-spaces=\"true\"> of multiple data sources, silo business processes store its data in its databases.\u00a0 We need to see the big picture of how a business process is impacting each other if there are relationships<\/span><\/li>\n<\/ol>\n<h3><span data-preserver-spaces=\"true\">Some technical terms that you will need to know when making a Data Warehouse (DW)<\/span><\/h3>\n<p>Two names come up as founders in the data warehouse space, and we should fundamentally know the difference between the two paradigms<b>.<\/b><\/p>\n<p><b>Bill Inmon&#8217;s paradigm<\/b>: Data warehouse is one part of the overall business intelligence system.\u00a0 An enterprise has one data warehouse, and data marts source their information from the data warehouse.\u00a0 In the data warehouse, information is stored in 3rd normal form.<\/p>\n<p><b>Ralph Kimball&#8217;s paradigm<\/b>: Data warehouse is the conglomerate of all data marts within the enterprise.\u00a0 Information is always stored in the dimensional model.<\/p>\n<p><strong>The dimensional model<\/strong> includes fact tables and dimension tables.\u00a0 Fact tables connect to one or more dimension tables, but fact tables do not have direct relationships with one another.\u00a0 Dimensions and hierarchies are represented by dimension tables.\u00a0 Attributes are the non-key columns in the dimension tables.<\/p>\n<p><strong><span data-preserver-spaces=\"true\"><b>Dimension<\/b>: <\/span><\/strong>A category of information.<\/p>\n<p><strong><span data-preserver-spaces=\"true\">Dimension<\/span><\/strong><span data-preserver-spaces=\"true\"> tables &#8211; usually give detailed information about the attributes for the ids (foreign keys) stored on fact tables. It is a table of attributes for that category of information.<\/span><\/p>\n<p><b>Attribute<\/b>: A unique level within a dimension.<\/p>\n<p><strong><span data-preserver-spaces=\"true\">Grain or Granularity &#8211; <\/span><\/strong><span data-preserver-spaces=\"true\">\u00a0The grain will be the lowest information level stored in the fact table.\u00a0 The lower the level of detail, the larger the data amount in the fact table. How\u00a0detailed your questions need to be will determine how close to the transaction you need to store the data in the DW.\u00a0<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">Surrogate key<\/span><\/strong><span data-preserver-spaces=\"true\"> &#8211; a new key developed to be used by a new system-moves data from the source system to the new system like the DW.\u00a0 It is excellent at creating historical rows for primary system keys as changes occur.<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">Fact<\/span><\/strong><span data-preserver-spaces=\"true\"><strong> tables<\/strong> usually answer who, what, where, when, and how much, kind of questions.\u00a0 Many dimension tables support a fact table, and that defines the context.\u00a0 The fact usually supports measures of interest\u2014this the basis for a star schema.<\/span><\/p>\n<p>There are three types of facts:<\/p>\n<ul>\n<li><b>Additive<\/b>: Additive facts are facts that can be summed up through all of the dimensions in the fact table.<\/li>\n<li><b>Semi-Additive<\/b>: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.<\/li>\n<li><b>Non-Additive<\/b>: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.\u00a0 These make up the Factless Fact Table.<\/li>\n<\/ul>\n<p><b>Cumulative Fact Tables or Aggregrate Tables<\/b>: This type of fact table describes what has happened over time and is in summary format.\u00a0 \u00a0The grain might be monthly instead of daily.<\/p>\n<p><b>Snapshot Fact Tables<\/b>: This type of fact table describes the state of things at an instance of time and usually includes more semi-additive and non-additive facts<\/p>\n<p><b>Junk Dimension <\/b>is a technique of slimming down a fat Fact table for better performance. Remove all <strong>passive <\/strong><b>Non-Additive <\/b>attributes to a dimension table and put the foreign key for the Junk Dimension table on the fact.<\/p>\n<p><strong><span data-preserver-spaces=\"true\">Normalized<\/span><\/strong><span data-preserver-spaces=\"true\"> form &#8211; Store data in many rows instead of a flattened table with many columns and a single row.\u00a0 A Normalized design will typically have many tables to manage each entity, prevent repeating data, and conserve storage space. <\/span><\/p>\n<p><span data-preserver-spaces=\"true\">When normalizing, we typically go to the 3rd Normal Form, <\/span>defined by <span data-preserver-spaces=\"true\">transitive functional dependency.\u00a0 We have transitive dependency if\u00a0 A is functionally dependent on B, and B is functionally dependent on C. In this case, C is transitively dependent on A via B<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">De-normalized<\/span><\/strong><span data-preserver-spaces=\"true\"> rows- flattens data to fit in one row might take more storage, but it is easier for reporting queries.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\"><b>Slowly Changing Dimensions &#8211;\u00a0<\/b><\/span><span data-preserver-spaces=\"true\">Ralph Kimball introduced the concept of Slowly Changing Dimension (SCD) attributes in 1996.\u00a0 The types are dependent on how we manage these changes.\u00a0 The business must drive it as they know the data much better. <\/span><\/p>\n<p><span data-preserver-spaces=\"true\">SCD approaches type 1 (overwrite, no history is kept); if the phone number changes, we do not track it<\/span><\/p>\n<table width=\"321\">\n<tbody>\n<tr>\n<td width=\"64\">Key<\/td>\n<td width=\"97\">\u00a0<\/td>\n<td width=\"64\">Name<\/td>\n<td width=\"96\">Phone<\/td>\n<\/tr>\n<tr>\n<td style=\"font-weight: 400;\" width=\"64\">1001<\/td>\n<td style=\"font-weight: 400;\" width=\"97\">\u00a0<\/td>\n<td style=\"font-weight: 400;\" width=\"64\">Stephen<\/td>\n<td style=\"font-weight: 400;\" width=\"96\">123 456 7894<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span data-preserver-spaces=\"true\"><strong>SCD type 2<\/strong> (add a row to keep history), this is a scalable design<\/span><\/p>\n<table width=\"321\">\n<tbody>\n<tr>\n<td width=\"64\">Key<\/td>\n<td width=\"97\">Effective_dt<\/td>\n<td width=\"64\">Name<\/td>\n<td width=\"96\">Phone<\/td>\n<\/tr>\n<tr>\n<td style=\"font-weight: 400;\" width=\"64\">1001<\/td>\n<td style=\"font-weight: 400;\" width=\"97\">1-Aug<\/td>\n<td style=\"font-weight: 400;\" width=\"64\">Stephen<\/td>\n<td style=\"font-weight: 400;\" width=\"96\">123 456 7894<\/td>\n<\/tr>\n<tr>\n<td style=\"font-weight: 400;\" width=\"64\">1001<\/td>\n<td style=\"font-weight: 400;\" width=\"97\">1-Dec<\/td>\n<td style=\"font-weight: 400;\" width=\"64\">Stephen<\/td>\n<td style=\"font-weight: 400;\" width=\"96\">456 524 7894<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span data-preserver-spaces=\"true\"><strong>SCD type 3<\/strong> (add a column track history), The original record is modified to reflect the change, so it&#8217;s not a scalable type of change if we need an unnamed amount of changes on the record.\u00a0 The design is limited in its ability to scale changes.<\/span><\/p>\n<table width=\"569\">\n<tbody>\n<tr>\n<td width=\"64\">Key<\/td>\n<td width=\"97\">Effective_dt<\/td>\n<td width=\"64\">Name<\/td>\n<td width=\"96\">Phone<\/td>\n<td width=\"113\">New Phone<\/td>\n<td width=\"135\">New_Effective_dt<\/td>\n<\/tr>\n<tr>\n<td style=\"font-weight: 400;\" width=\"64\">1001<\/td>\n<td style=\"font-weight: 400;\" width=\"97\">1-Aug<\/td>\n<td style=\"font-weight: 400;\" width=\"64\">Stephen<\/td>\n<td style=\"font-weight: 400;\" width=\"96\">123 456 7894<\/td>\n<td style=\"font-weight: 400;\" width=\"113\">456 524 7894<\/td>\n<td style=\"font-weight: 400;\" width=\"135\">1-Dec<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span data-preserver-spaces=\"true\">Since legibility is a vital component of the Kimball mantra, we sometimes wish Ralph had given these techniques more descriptive names, such as &#8220;overwrite&#8221; instead of &#8220;type 1.&#8221; But at this point, the SCD type numbers are part of our industry&#8217;s vernacular.<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">Fast-Changing Dimensions<\/span><\/strong><span data-preserver-spaces=\"true\">\u00a0&#8211; these are usually more oversized tables that can be split into an SCD and a reference table in a DW design.<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">Conformed Dimension is\u00a0<\/span><\/strong><span data-preserver-spaces=\"true\">a dimension that is referenced by more than one fact table.\u00a0 It gives the same meaning when referring to different fact tables to give analytic integrity to the data model.\u00a0 The time dimension table is a typical example of a conformed dimension.<\/span><\/p>\n<p><b>Logical Data Model &#8211; <\/b>describes the data in as much detail as possible in a domain-specific language.\u00a0 It will have all the entities with a primary key on each defined and their relationships to each other with foreign keys.\u00a0 At logical model design, you should resolve many-to-many relationships and normalize them.\u00a0 The columns are defined by an Attribute type.<\/p>\n<p><b>Physical Data Model &#8211;\u00a0<\/b>converts the logical design to physical tables, including column name, column data type, column constraints, primary key, foreign key, and relationships between tables.<\/p>\n<p><strong><span data-preserver-spaces=\"true\">Star Schema is\u00a0<\/span><\/strong><span data-preserver-spaces=\"true\">a fact table surrounded by many dimensions tables.<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">Snow Flake<\/span><\/strong><span data-preserver-spaces=\"true\"> Schema is many star schemas joined together.\u00a0 Snowflakes are sometimes different domain tables left in the 3rd Normalized form and joined together.\u00a0 We can also have a hybrid of star and 3rd Normal form.\u00a0 A snowflake schema usually improves query performance due to smaller lookup tables and more specific fact tables.<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">KPI<\/span><\/strong><span data-preserver-spaces=\"true\"> &#8211; Key Performance Indicators drive the objectives and goals of any given business.\u00a0 Each business and each department will define its own KPI&#8217;s.<\/span><\/p>\n<p><b>OLAP <\/b>is about Multidimensional analysis.\u00a0 OLAP stands for On-Line Analytical Processing.\u00a0 There are two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP).\u00a0 Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.<\/p>\n<p><strong>MOLAP<\/strong> data is stored in a multidimensional cube.\u00a0 The storage is not in the relational database but proprietary formats.\u00a0 The cube size is usually a limiting factor for computed data storage.<\/p>\n<p><strong>ROLAP<\/strong> analysis is essentially a SQL query (or multiple SQL queries) in the relational database<\/p>\n<p><strong>HOLAP<\/strong> gets the best of both worlds and typically will &#8220;drill through&#8221; from the cube into the underlying relational data to get more details.<\/p>\n<p><strong><span data-preserver-spaces=\"true\">Hierarchy<\/span><\/strong><span data-preserver-spaces=\"true\"> is a style of modeling data in levels for ROLAP.\u00a0 Each level will have attributes of that level.\u00a0 This design will define how to roll up data or make a drill path for &#8220;drill down.&#8221;\u00a0 The specification of levels represents the relationship between different attributes within a dimension.\u00a0 Eg, every year has months as its next attribute level, and months have days as its next attribute level.\u00a0 This nesting of related categories forms the hierarchy.<\/span><\/p>\n<p><strong>FASMI<\/strong> test,\u00a0<i><b>F<\/b><\/i>ast\u00a0<i><b>A<\/b><\/i>nalysis of\u00a0<i><b>S<\/b><\/i>hared\u00a0<i><b>M<\/b><\/i>ultidimensional\u00a0<i><b>I<\/b><\/i>nformation.\u00a0 Dr. Codd proposed 12 rules for OLAP.\u00a0 For a more detailed description of both Dr. Codd&#8217;s rules and the FASMI test, visit <a href=\"https:\/\/web.archive.org\/web\/20100106080548\/http:\/\/www.olapreport.com\/FASMI.HTM\" target=\"_blank\" rel=\"noopener noreferrer\">The OLAP Report<\/a><\/p>\n<p><strong><span data-preserver-spaces=\"true\">Operational Reports\u00a0<\/span><\/strong><span data-preserver-spaces=\"true\">are reports used consistently by a business to service its clients.\u00a0\u00a0<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">Analytic Reports\u00a0<\/span><\/strong><span data-preserver-spaces=\"true\">are reports used exploratory in nature and used by a business to gain insight into trends or make predictions for future decision-making.<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">ETL<\/span><\/strong><span data-preserver-spaces=\"true\"> &#8211; Extract, Transform, and Load.\u00a0 ETL is needed for all BI as we must typically pull data from many sources and put them in a design model to answer future questions.<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">Meta-Data<\/span><\/strong><span data-preserver-spaces=\"true\"> -this is data about data.\u00a0 It is the dictionary that describes the information captured in the DW.<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">Data Store<\/span><\/strong><span data-preserver-spaces=\"true\"> &#8211; this is a domain-specific reporting database.\u00a0 It would typically follow a star schema designed for query and analysis rather than for transaction processing.\u00a0 It usually contains historical data derived from its transaction data.<\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">System of Record (SOR)<\/span><\/strong><span data-preserver-spaces=\"true\">\u00a0&#8211; or source\u00a0<\/span><strong><span data-preserver-spaces=\"true\">system of record<\/span><\/strong><span data-preserver-spaces=\"true\"> (SSoR) is a data management term for an information storage system that is the authoritative data source for a given data element or piece of information.\u00a0 It can be the application where the data is captured or delegated to a data warehouse.\u00a0 The SOR would act as the single version of the truth.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">What should you look at next?\u00a0 It would help if you considered looking at the <\/span><a class=\"editor-rtfLink\" href=\"http:\/\/www.designandexecute.com\/designs\/data-warehouse-design-patterns\/\" target=\"_blank\" rel=\"noopener noreferrer\"><span data-preserver-spaces=\"true\">Design Patterns for Data Warehousing<\/span><\/a><\/p>\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Why build a Data Warehouse\/ Data Lake? Transactions are essential, but the capture is of small value compared to the data&#8217;s value.\u00a0 The purpose of a Data Warehouse (DW) is to create a place to clean, merge and keep historical data for reporting to gain insight into the business process.\u00a0 A DW is a foundation [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":2781,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[32,30,31],"tags":[59,10,58],"class_list":["post-920","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bi-dashboards-analytics","category-bi-business-objects","category-bi-data-warehouse","tag-big-data","tag-business-intelligence","tag-data-warehouse"],"jetpack_featured_media_url":"https:\/\/www.designandexecute.com\/designs\/wp-content\/uploads\/1999\/09\/data-warehouse.jpg","_links":{"self":[{"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/posts\/920","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/comments?post=920"}],"version-history":[{"count":5,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/posts\/920\/revisions"}],"predecessor-version":[{"id":15540,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/posts\/920\/revisions\/15540"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/media\/2781"}],"wp:attachment":[{"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/media?parent=920"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/categories?post=920"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.designandexecute.com\/designs\/wp-json\/wp\/v2\/tags?post=920"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}