Thursday, 26 July 2012

What is a DATAWAREHOUSE? Advantage of a Datawarehouse

In 1993, the “father of data warehousing”, Bill Inmon, gave this definition of a data warehouse: 
“A datewarehouse is a subject oriented , integrated,non volatile and time variant collection of data in support of managements decisions” 
According to this definition, a data warehouse is different from an opera­tional database in four important ways.
Data Warehouse
Operational Database
subject oriented
application oriented
multiple diverse sources
real-time, current
Comparing a Data Warehouse and an Operational Database
An operational database is designed primarily to support day to day opera­tions.  A data warehouse is designed to support strategic decision making.
Data warehouse assembles data dispersed in various data sources across the enterprise and helps business stakeholders manage their operations by making better informed decisions. Most organizations have multiple data stores: relational databases, spreadsheets, mainframes, mail systems or even paper files. Each of these data stores tends to serve a subset of the enterprise. Data warehouse attempts to overcome this limitation by combining all relevant data and by allowing managers to view their business from many different angles.
Data warehouses are built using dimensional data models which consist of fact and dimension tables. Dimension tables are used to describe dimensions; they contain dimension keys, values and attributes. For example, the time dimension would contain every hour, day, week, month, quarter and year that has occurred since you started your business operations. Product dimension could contain a name and description of products you sell, their unit price, color, weight and other attributes as applicable.
Benefits of Data Warehousing
Data warehousing is being hailed as one of the most strategically significant developments in information processing in recent times.  One of the reasons for this is that it is seen as part of the answer to information overload. 
Some of the benefits of data warehousing that were seen as relevant to the Avondale College project are listed here.  The points highlighted in the Bill Inmon’s definition give some of the reasons why data warehousing is regarded as important.
1.      Has a subject area orientation
2.      Integrates data from multiple, diverse sources
3.      Allows for analysis of data over time
4.      Adds ad hoc reporting and enquiry
5.      Provides analysis capabilities to decision makers
6.      Relieves the development burden on IT
7.      Provides improved performance for complex analytical queries
8.      Relieves processing burden on transaction oriented databases
9.      Allows for a continuous planning process
10.  Converts corporate data into strategic information

Before building a data warehouse you should become familiar with the terminology used to describe various parts of the warehouse.
Dimension tables are typically small, ranging from a few to several thousand rows.Occasionally dimensions can grow fairly large, however.For example, a large credit card company could have a customer dimension with millions of rows. Dimension table structure is typically very lean, for example customer dimension could look like following: 
Although there might be other attributes that you store in the relational database, data warehouses might not need all of those attributes. For example, customer telephone numbers, email addresses and other contact information would not be necessary for the warehouse. Keep in mind that data warehouses are used to make strategic decisions by analyzing trends. It is not meant to be a tool for daily business operations. On the other hand, you might have some reports that do include data elements that aren't necessary for data analysis.
Most data warehouses will have one or multiple time dimensions. Since the warehouse will be used for finding and examining trends, data analysts will need to know when each fact has occurred. The most common time dimension is calendar time. However, your business might also need a fiscal time dimension in case your fiscal year does not start on January 1st as the calendar year.
Most data warehouses will also contain product or service dimensions since each business typically operates by offering either products or services to others. Geographically dispersed businesses are likely to have a location dimension.


Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording a sale of each item. The fact table of an educational entity could track credit hours awarded to students. A bakery could have a fact table that records manufacturing of various baked goods.

Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business this is often referred to as fact table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, the assembly line warehouse analysts might be very concerned in number of defective goods that were manufactured each hour. Similarly a marketing data warehouse might be concerned by the activity of a consumer group with a specific income-level rather than purchases made by each individual.

Data Staging Area
A storage area and set of processes that clean, transform, combine, de-duplicate,household, archive, and prepare source data for use in the data warehouse.The data staging area is everything in between the source system and the presentation server.Although it would be nice if the data staging area were a single centralized facility on one piece of hardware, it is far more likely that the data staging area is spread over a number of machines. The data staging area is dominatedby the simple activities of sorting & sequential processing and, in some cases, the data staging area does not need to be based on relational technology.

Below are the few other terms which comes mostly in OLAP side of Datawarehousing
It defines parent-child relationships among various levels within a single dimension. For instance in a time dimension, year level is parent of four quarters, each of which is a parent of three months, which are parents of 28 to 31 days, which are parents of 24 hours. Similarly in a geography dimension a continent is a parent of countries, country could be a parent of states, and state could be a parent of cities.


Level is a column within a dimension table that could be used for aggregating data. For example, product dimension could have levels of product type (beverage), product category (alcoholic beverage), product class (beer), product name (miller lite, budlite, corona, etc).

Drill-down is the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a lower level of summarization.  An example of drill-down is a multiple-step process where sales revenue is first analyzed by year, then by quarter, and finally by month.  Each iteration of drill-down returns sales revenue at a lower level of aggregation along the period dimension.
Roll-up is the opposite of drill-down.  Roll-up is the repetitive selection and analysis of summarized facts with each repetition of data selection occurring at a higher level of summarization.
5. Aggregation 
Aggregation is a key attribute of a data warehouse.  Summarization and consolidation are other words that used to convey the same meaning.  In the case of a multi-dimensional database, the summarizations are pre-computed for all the various combinations of the dimensions.  This allows for very fast response to slice and dice operations at any level of drill-down, and also allows for fast drill-down and roll-up operations.
Granularity is a term that is used to describe the level below which no supporting details are stored, only the summaries.  Good judgment needs to be exercised to determine granularity.  If the granularity is set to be too fine, unused data will be stored, wasting processing time during replication steps, and wasting disk space to hold it.  On the other hand, if the granularity is set too coarse, the detailed data will not be available if it is needed at some point in the future.

No comments:

Post a Comment

Related Posts Plugin for WordPress, Blogger...