Wednesday, February 26, 2014

Keys Locking Up The Sane Data

Hi All,

In the previous blog we discussed about different Schema in which a Data Warehouse can be designed. Now lets discuss some of the other important concepts which are essential in order to provide flexibility and scalability to a Data Warehouse.

Some of the important concepts regarding the keys in order to maintain data integrity are as follows:

Surrogate Key:
The advent of Surrogate key comes from the basics of maintaining temporal or historical data in the database. Distinction between a Surrogate Key and Primary Key depends on whether the database is Current Database or Temporal Database. Lets discuss Surrogate Key usability in two different types of database
       
Current Database:
Since current database stores only current data , hence there is a one to one relationship between Surrogate Key and Primary Key. In this case the Surrogate Key may be used as a primary key.
     
Temporal Database:
As temporal database stores historical and temporary database therefore there is a one to many relationship between Surrogate Key and Business key. Distinction between these two keys is highly important as Surrogate Key identifies one row in the table uniquely, whereas business key identifies a business entity in the business model uniquely. The table holds the life span of the entity in the modeled world. An example of a temporal table containing Surrogate Keys.


Advantages of Surrogate Key:
  • Application can not lose their reference to a row in the database.
  • Surrogate key tend to be a compact data type, hence they perform faster when queried
  • Surrogate helps in maintaining the uniformity of detail or an entity throughout the data

Natural Key:
A natural key is a key that is formed on the entities or attributes that already exists in the world. For example a table consisting of Genetic DNA makeup of a person can be considered as a Natural Key.

A Natural key is a candidate key that has a logical relationship to the attributes within that row.A Natural Key is sometimes called as a Domain Key. It ensures that there can only be one row for a Key.

Disadvantages of Natural Key:
We need to choose natural key cautiously as choosing a column that may change dynamically can lead to problems. For example if we choose name of a person as natural key and if the name of the person changes due to religion transformation, changing the name of a person in the database and reflecting the changes across all the table containing a foreign key will be an overhead for the database.


With this i would like to conclude my blog on "Keys Locking Up The Sane Data". Stay tuned to learn about Slowly Changing Dimensions, the next important step in creating a Coherent Dimensional Model.

References:
The Data Warehouse Toolkit - Third Edition, Ralph Kimball and Margy Ross
http://en.wikipedia.org/wiki/Dimensional_Modelling
http://en.wikipedia.org/wiki/Surrogate_key
http://en.wikipedia.org/wiki/Natural_key







Sunday, February 9, 2014

Snow Covered Wagon hitched to a Star (Dimensional Modelling)

Hello One and All,

In my last Blog we discussed about Dimensional Modelling and some of its components. Today we will go through different Schema that can be used during Dimensional Modelling to create a Data Warehouse.

Before we start with today's topic , For my viewers those who are new to this field i would like to revisit some of the key points of my previous blogs:

1)  Business Intelligence is mainly divided into three parts as per my understanding
                 a) Data Warehouse design and Implementation (ETL process)
                 b) Data Analysis (Using OLAP cubes)
                 c) Reporting and Dashboard Creation
For further details revisit my First blog 

2) Important Components involved in Dimensional Modelling or Data Warehouse Designing
                 a)  Fact Tables (Additive Facts, Semi-Additive Facts, Non- Additive Facts)
                 b)  Dimension Table
                 c)  Grain
For further details revisit my Second blog

After a thorough revision of previous concepts lets start our today's discussion about different Schema involved in Dimensional Modelling or Data Warehouse Designing.

First of all i would like to explain the meaning of the topic i.e
Snow Covered Wagon Hitched to a Star = SnowFlake Schema and Star Schema are two types of Schema that are used while designing a Data Warehouse, Hence they can be explained as follows:

Star Schema:  A Star Schema is one of the simplest and easiest schema to understand. A schema which consists of Dimension tables only attached to Fact tables. A Star Schema get its name from its physical representation where we have Fact table at the center with Dimension table surrounding it representing the star points. Star Schema can also be called as a denormalized schema as it contains Dimensional table in the lowest normal form.

Star Schema
Snow Flake Schema:   The Snowflake Schema can be formed by Snowflaking(Normalizing) the Star Schema . The Dimension table in a Snowflake Schema may or may not contain a parent table depending on its normal form. All the dimension tables in Snowflake schema are normalized to third normal form (thus dimension table does not contain any transitive dependencies).


SnowFlake Schema
Pros and Cons of both the Schemas are as follows:

Star Schema:
Pros: A Star Schema is very quick in response to ad-hoc queries as this kind of schema can answer maximum queries with minimum number of joins.
Cons:  A Star Schema suffers in space and data management as the data is redundant in dimension table as the tables are in their lowest normal forms.

Snowflake Schema: 
Pros:    A SnowFlake Schema is preferred for Space and Data Management as the Schema consists of table in third normal form.
Cons:   A SnowFlake Schema takes large amount of time in querying database as this type of design involves joins to answer a query and as joins are highly expensive.

Hence a choice between Schema is made as per business requirement, the system that requires faster query performance uses Star Schema whereas the one that requires Space Management uses SnowFlake Schema.

With this i conclude my blog about Schema. Stay tuned to learn more about different challenges and tasks involved while designing a Data Warehouse. In my coming blog we will talk about Surrogate Keys, Natural Keys, Durable Keys, Slowly Changing Dimensions Type and much more.

References:
The Data Warehouse Toolkit - Third Edition, Ralph Kimball and Margy Ross
http://en.wikipedia.org/wiki/Dimensional_Modelling

















Wednesday, February 5, 2014

Unraveling the Unravelable Dimensional Modelling

Hello All,

In my previous blog, we went through different concepts on a macro level. From this blog onward we will delve deeper and will understand each concept in detail.

First concept that we will talk about in this blog is about Dimensional Modelling.

What is Dimensional Modelling?
From a high level, Dimensional Modelling is the technique, concepts in designing a Data Warehouse. We build data warehouse with certain design methodology, technique  in order to analyse billions of rows in few seconds which is called as Dimensional Modelling.

Some of the key elements of Dimensional Modelling are as follows:

  1. Fact Table
  2. Dimension Table
  3. Grain


Fact Table:   A fact table is a simple  table that contains facts and dimensions. Facts means numerical values  that helps in making decision when aggregated or analysed across different dimensions for example profit in a transaction, cost paid by customer in a transaction. The facts are stored in the fact table corresponding to a particular set of dimensions. For example we store profit by product by store by time by customer type. Therefore profit here is the fact and product, store, time, customer type are dimensions.

Facts are of three different types:

Additive Facts: These are the facts when aggregated across dimensions have a meaningful outcome For example :  Profit

Semi-Additive Facts: These are the facts that can be aggregated across some dimensions but does not     have a meaningful outcome when aggregated across other dimensions. For example: To aggregate the stock level  of a product on a particular day, but aggregating it over a period wont tell us about the correct stock level value about the product.

Non- Additive Facts: These are the Facts that can not be aggregated across any dimension. For example: ratio, percentage etc.

Dimension Table: This table contains values which adds meaning to fact values.
The table that contains dimensions across which facts are sliced, diced , drilled down and drilled through. Dimensions helps in bringing out the values that are useful for business people. For Example Fact such as profit is useless until we do not know that profit value is by product by time by location and many more which will help in inferring useful information.

Grain:  The highest level of detail that your data contains is called grain. Grain helps in defining the level of questions that your dimensional model can answer, for Example lets consider our transcript. A transcript contains data at three levels i.e.
     1) GPA at Subject Level
     2) GPA at Semester Level
     3) GPA at Year Level

These are three  different levels of grain,  hence if we store data regarding subject we can calculate our GPA at Semester and Year level, and if we store our GPA regarding semester level we can roll it up to Year level but can not drill down to Subject level, Similar is the case for Year level.

Thus determining the grain is the most important design aspect of Dimensional Modelling.With this i conclude my blog on Unraveling the Unravelable Dimensional Modelling

Stay tuned to know more about dimensional Modelling. In my next blog i will discuss about Star Schema, Snow-Flake Schema, Business process and a glimpse into Market Basket Analysis.

References:
The Data Warehouse Toolkit - Third Edition, Ralph Kimball and Margy Ross
http://en.wikipedia.org/wiki/Dimensional_Modelling