Tuesday 9 June 2015

Data Warehouse Concepts

A data warehouse is a repository (collection of resources that can be accessed to retrieve information) of an organization's electronically stored data, designed to facilitate reporting and analysis. In simple form data warehouse is a collection of large amount of data.
§  A DWH is a historical database because the database contains many years of historical business data for Decision making purpose.
§  A DWH system is designed to read the business data for business analysis processing but not for Transactional processing. Hence it is called as a Read only database.
A DWH is designed to take the decision. Hence it is also known as DSS (Decision Supportive System).

1.       The fathers of DWH are W.H. Inmon & Ralph Kimball. W.H.Inmon defined the DWH as
Time Variant, Non Volatile, Integrated and Subject Oriented.
                                I.            Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
                          A business user can analyze the business data in the warehouse to the different time periods like Year, Quarter, Month, and Weeks etc.
                              II.            Non Volatile: Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred. The data that is present in the DWH is Static.
                            III.            Integrated: Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
                           IV.            Subject Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case makes the data warehouse subject oriented.

 

1.       Types of DWH systems: There are mainly 2 types of DWH systems.
                                I.            EDW (Enterprise Data Warehouse): It contains the historical business data at the enterprise level to support the business needs of top management in the organization.
  • Contains data drawn from multiple operational systems.
  •  Supports time- series and trend analysis across different business areas.
  •                               Can be used to populate data marts
  • Can be used for everyday and strategic decision making.
                              II.            Data Marts: A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, which may be distributed to support business needs.
  •  Subset of enterprise data warehouse.
  •                               Organized around a single business process.
  •                              May or may not contain aggregates.

Less expensive and much smaller than a full blown corporate data warehouse. 

1.       Staging Area:
v A storage area and set of process that clean, transform, combine, removing duplicate, archive and prepare source data for use in data warehouse.
v It accepts data from different sources.
v The structure is closer to the Operational Systems rather than the DW.
v Data arriving at different point of time is merged and then loaded into the DW.
v Usually does not maintain history; only a temporary area.

2.       Types of DWH approach:-
                                I.            Top – Down approach: According to W.H.Inmon, we need to develop the enterprise DWH system and then from the EDW develop subject oriented databases called Datamarts according to the business needs.
                              II.            Bottom – Up approach: According to Ralph Kimball, 1st develop the datamarts according to business needs and then integrate all datamarts into EDW.
3.       Types of Data Marts:
                                I.            Dependent DM: The DM developed in Top – Down approach is known as Dependent DM. Because 1st we will load data into EDW and then into DM.
                              II.            Independent DM: The DM developed in Bottom – Up approach is known as Independent DM. Because 1st we will load data into DM and then into EDW.
·        Please find below the example for Dependent Datamart.

1.       Real Time DWH: Traditionally data warehouses do not contain today's data. They are usually loaded with data from operational systems at most weekly or in some cases nightly, but are in any case a window on the past.
                               As today's decisions in the business world become more real-time, the systems that support those decisions need to keep up. It is only natural that Data Warehouse, Business Intelligence, Decision Support, and OLAP systems quickly begin to incorporate real-time data. Data warehouses and business intelligence applications are designed to answer exactly the types of questions that users would like to pose against real-time data. Ad-hoc reporting is made easy using today's advanced OLAP tools.
2.       Data Acquisition:

§  Data Acquisition means Extraction, Transformation & Loading.
Here we will extract data from different sources like COBOL, ERP, Operational etc and bring into our Staging Area. Staging Area is a temporary storage area. From Staging Area we will load data into DWH or DM’s.
§  Data acquisition process is defined with Data extraction, Data transformation and Data loading.
                                I.            Data Extraction: It is a process of reading the data from different sources like Operational sources, ERP systems, COBOL files, Flat files etc.
                              II.            Data Transformation: It is a process of transforming data from one format to required business format. In Data transformation we are having 4 types.
                                                        i.            Data Merging: It is a process of integrating the data from similar sources with the similar structure and data type. Ex: Join, Union etc.
                                                      ii.            Data Cleansing: It is a process of identifying and changing the inconsistencies and in accuracies. Ex: Initcap, Lower, Upper, NVL etc.



                                                    iii.            Data Scrubbing: It is a process of deriving new definitions from existing source definitions. Ex: In target table we will add a new column ‘TAX’, which will be calculated based on the column ‘SAL’ coming from the source.
                                                    iv.            Data Aggregation: It is a process of where multiple detail values are summarized into a single summary values typically numeric like Sum, Average, Min, Max etc.
3.       Star Schema:
§ A star schema is a logical database design which contains a centrally located fact table surrounded by at least one or more dimension tables.
§  A Fact table contains composite keys (More than one key) where each candidate key is a foreign key to the dimension table.
§  The facts that the data warehouse helps analyze are classified along different dimensions:
·        The fact table holds the main data. It includes a large amount of aggregated data, such as price and units sold. There may be multiple fact tables in a star schema.
·        Dimension tables, which are usually smaller than fact tables, include the attributes that describe the facts. Often this is a separate table for each dimension. Dimension tables can be joined to the fact table(s) as needed.
·        Dimension tables have a simple primary key, while fact tables have a set of foreign keys which make up a compound primary key consisting of a combination of relevant dimension keys.
§  Example: Fact.Sales is the fact table and there are three dimension tables Dim.Date, Dim.Store and Dim.Product. Each dimension table has a primary key on its PK column, relating to one of the columns (viewed as rows in the example schema) of the Fact.Sales table's three-column (compound) primary key (Date_FK, Store_FK, Product_FK). The non-primary key [Units Sold] column of the fact table in this example represents a measure or metric that can be used in calculations and analysis. The non-primary key columns of the dimension tables represent additional attributes of the dimensions (such as the Year of the Dim.Date dimension).

1.       Snow Flake Schema:
§  In a Star schema database design, if the dimension table is split into a one or more dimension tables which results in Normalization. Since the database design looks like a snow flake. Hence it is known as Snow flake schema.
§  Generally these types of schema designs are not recommended for the warehouse implementations because dimension tables results in Normalization and decrease the performances.
§  The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized with each dimension represented by a single table.

·        The advantages and disadvantages of snow flake schema are given below.




1.       Galaxy Schema:
Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and therefore it's called a Galaxy Schema or a Fact Constellation. As we see in the two star schemas as above, the two fact tables, sales table and purchase table are now sharing both the 'product' and 'time' dimension tables. Therefore we decide to choose Galaxy Schema as the model for our data warehouse, which is displayed as follow:
Before applying Galaxy schema:



After applying Galaxy schema:



1.      Fact Tables:
§  A fact table contains composite keys (More than one key) where each candidate key is a foreign key to the dimension table.
§  A fact table contains facts. In DWH, facts are generally numeric.
§  A measure is a numeric attribute of a fact, representing the performance or behavior of the business relative to dimensions.
§  A fact table contains the fact information at the lowest level granularity.
§  The level at which fact information stores in a fact table is called as Fact Granularity or Grain of fact.
§  A fact table can contain fact information either in 1NF or 2NF or 3NF. (NF: Normalization Form).
§  To provide the meaningful business context to the facts design the dimension tables with a de-normalized business information.


 Types of Fact Tables:

                   I.            Additive Fact table:
§  A fact which can be summed up for any of the dimensions available in the fact table is called as Additive fact.

Example:  Sales Amount by Store by Product in a given day is an additive fact as it can be summed up across all STORE, PRODUCT and TIME dimensions. We can sum the Sales for a week to get the Total Sales Amount for that week.

                 II.            Semi Additive Fact table:
§  A fact which can be summed up for few dimensions but not for all the dimensions present in the fact table.
Example: Current Balance for an Account is the Semi-Additive fact as it can be summed up for all Accounts to find out the Total current balance of the Bank. It won’t make sense to sum up the Current Balance for an Account in a given month (Time).

              III.                   Non Additive Fact table:
§  A fact which cannot be summed up for any of the dimensions available in the fact table.
Example: AVERAGE.

              IV.                   Fact less Fact Table:
§  A fact which contains only Keys but not measures.

    Types of Facts:
                   I.            Accumulative Fact Table:
§  Generally these fact tables describe what has happened over the period of time.                     A cumulative fact table contains Additive or Semi additive facts. Ex: Transactional fact table, Orders fact table.
                 II.            Snap shot Fact Table:
§  This type of fact table describes the status of things at a particular instant of the time.
 Dimension Tables:
·        The dimension tables contain attributes (or fields) used to constrain and group data when performing data warehousing queries.
·        In a data warehouse, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.
·        For example, "Customer", "Date", and "Product" are all dimensions that could be applied meaningfully to a sales receipt.

    Types of Dimensions Tables:
                    I.            Conformed Dimension: The dimension that is shared across multiple fact tables. At the most basic level, conformed dimensions mean the exact same thing with every possible fact table to which they are joined. The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts. Ex: Time Dimension, Geographical Dimension.
                  II.            Junk Dimension:  Junk dimension is just a dimension that stores unwanted attributes. A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. By creating an abstract dimension, these flags and indicators are removed from the fact table while placing them into a useful dimensional framework. Ex:
Source

Junk Dimension
FLAG_1
Yes/No

Key_ID
Flag_1
Flag_2
FLAG_2
True/False

1
Yes
TRUE



2
Yes
FALSE



3
No
TRUE



4
No
FALSE

                III.            Degenerated Dimension: In a data warehouse, a degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table. The decision to use degenerate dimensions is often based on the desire to provide a direct reference back to a transactional system without the overhead of maintaining a separate dimension table. A dimension key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes and hence does not join to an actual dimension table. Degenerate dimensions are very common when the grain of a fact table represents a single transaction item or line item because the degenerate dimension represents the unique identifier of the parent.
               IV.            Slowly Changing Dimension: Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule. It’s further classified into 3 types.
·        SCD Type 1: This type of dimension table maintains the latest or current data.
·        SCD Type 2: This type of dimension table maintains complete history.
·        SCD Type 3: This type of dimension table maintains partial history.
                 V.            Monster Dimension: The dimension that contains > 100 million records.
Example: Customer Dimension.
              VI.            Casual Dimension: In most data warehouses, you build a fact table record when something happens.

                         The below are the differences between systems.
             OLTP                                                         OLAP
    1. It is dynamic.                                   1. It is static [unchanged].
    2. It follows normalization.                           2. It follows denormalization.
    3. It contains current data.                              3. It contains historical data.
    4. It is designed to support transactional      4. It is designed to support decision making process.                                                                        process.
    5. It contains detailed data.                                5. It contains summarized information.
   
                ODS                                                               DWH
    1. It is designed to support operational            1. It is designed to support decision making process.                                                            process.
    Similarities:-
    2. Integrated database.                     2. Integrated database.
    3. Enterprise data.                                         3. Enterprise data.
    4. Subject oriented database.                       4. Subject oriented database.
    Differences:-
    5. Contains current information.                 5. Contains historical information.
    6. Data is volatile.                              6. Data is non-volatile.
    7. Contains detail information.        7. Contains summary information.
   
                        ODS                                              OLTP
    1. Subject oriented database.                 1. Application oriented database.
   
        

                   OLTP                                                 DWH
    1. Data is volatile.                            1. Data is non-volatile.
    2. It contains current data.                         2. It contains historical data.
    3. It is application oriented database.     3. It is subject oriented database.
    4. It is not flexible.                                       4. It is flexible.
    5. It stored all data.                                      5. It stores relevant data.
   
                      OLTP                                                          DSS
    1. It is designed to support operational           1. It is designed to support decision making process.                                                            Process.                                                  
    2. Data is volatile.                                      2. Data is non-volatile.
    3. Data is in inconsistency form.                        3. It is in consistent form.
    4. It stores recent data for approximately       4. It stores One year data.
       4 to 6 months data.
    5. It follows normalized schema.                            5. It follows star schema.
   
                      DWH                                                        DM
    1. It is about entire organization.                          1. It is about individual department in the organization.
    2. It is created on RDBMS.                                         2. It is created on RDBMS & MDDB.
    3. It follows integrated schema design.     3. It follows star schema design.
    4. It is integrated database.                                       4. Subject oriented databases.

 

 






1 comment:

  1. You are amazing! Very informative and easy to understand.

    ReplyDelete