“A
datewarehouse is a subject oriented , integrated,non volatile and time variant
collection of data in support of managements decisions”
According to this definition, a data warehouse is different from an operational database in four important ways.
According to this definition, a data warehouse is different from an operational database in four important ways.
Data Warehouse
|
Operational Database
|
subject
oriented
|
application
oriented
|
integrated
|
multiple
diverse sources
|
time-variant
|
real-time,
current
|
nonvolatile
|
updateable
|
Comparing
a Data Warehouse and an Operational Database
An
operational database is designed primarily to support day to day operations. A data warehouse is designed to support
strategic decision making.
Data
warehouse assembles data dispersed in various data sources across the
enterprise and helps business stakeholders manage their operations by making
better informed decisions. Most organizations have multiple data stores:
relational databases, spreadsheets, mainframes, mail systems or even paper
files. Each of these data stores tends to serve a subset of the enterprise.
Data warehouse attempts to overcome this limitation by combining all relevant
data and by allowing managers to view their business from many different
angles.
Data warehouses are built using dimensional data models which consist of fact and dimension tables. Dimension tables are used to describe dimensions; they contain dimension keys, values and attributes. For example, the time dimension would contain every hour, day, week, month, quarter and year that has occurred since you started your business operations. Product dimension could contain a name and description of products you sell, their unit price, color, weight and other attributes as applicable.
Data warehouses are built using dimensional data models which consist of fact and dimension tables. Dimension tables are used to describe dimensions; they contain dimension keys, values and attributes. For example, the time dimension would contain every hour, day, week, month, quarter and year that has occurred since you started your business operations. Product dimension could contain a name and description of products you sell, their unit price, color, weight and other attributes as applicable.
Benefits
of Data Warehousing
Data
warehousing is being hailed as one of the most strategically significant
developments in information processing in recent times. One of the reasons for this is that it is
seen as part of the answer to information overload.
Some
of the benefits of data warehousing that were seen as relevant to the Avondale
College project are listed here. The
points highlighted in the Bill Inmon’s definition give some of the reasons why
data warehousing is regarded as important.
Before
building a data warehouse you should become familiar with the terminology used
to describe various parts of the warehouse.
1. Has
a subject area orientation
2. Integrates
data from multiple, diverse sources
3. Allows
for analysis of data over time
4. Adds
ad hoc reporting and enquiry
5. Provides
analysis capabilities to decision makers
6. Relieves
the development burden on IT
7. Provides
improved performance for complex analytical queries
8. Relieves
processing burden on transaction oriented databases
9. Allows
for a continuous planning process
10. Converts
corporate data into strategic information
1.Dimensions
Customer_full_name
Customer_city
Customer_state
Customer_country
Dimension tables are typically small, ranging from a few to several thousand
rows.Occasionally dimensions can grow fairly large, however.For example, a
large credit card company could have a customer dimension with millions of
rows. Dimension table structure is typically very lean, for example customer
dimension could look like following:
Customer_keyCustomer_full_name
Customer_city
Customer_state
Customer_country
Although
there might be other attributes that you store in the relational database, data
warehouses might not need all of those attributes. For example, customer
telephone numbers, email addresses and other contact information would not be
necessary for the warehouse. Keep in mind that data warehouses are used to make
strategic decisions by analyzing trends. It is not meant to be a tool for daily
business operations. On the other hand, you might have some reports that do
include data elements that aren't necessary for data analysis.
Most data warehouses will have one or multiple time dimensions. Since the warehouse will be used for finding and examining trends, data analysts will need to know when each fact has occurred. The most common time dimension is calendar time. However, your business might also need a fiscal time dimension in case your fiscal year does not start on January 1st as the calendar year.
Most data warehouses will also contain product or service dimensions since each business typically operates by offering either products or services to others. Geographically dispersed businesses are likely to have a location dimension.
2.Facts
Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording a sale of each item. The fact table of an educational entity could track credit hours awarded to students. A bakery could have a fact table that records manufacturing of various baked goods.
Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business this is often referred to as fact table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, the assembly line warehouse analysts might be very concerned in number of defective goods that were manufactured each hour. Similarly a marketing data warehouse might be concerned by the activity of a consumer group with a specific income-level rather than purchases made by each individual.
Most data warehouses will have one or multiple time dimensions. Since the warehouse will be used for finding and examining trends, data analysts will need to know when each fact has occurred. The most common time dimension is calendar time. However, your business might also need a fiscal time dimension in case your fiscal year does not start on January 1st as the calendar year.
Most data warehouses will also contain product or service dimensions since each business typically operates by offering either products or services to others. Geographically dispersed businesses are likely to have a location dimension.
2.Facts
Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording a sale of each item. The fact table of an educational entity could track credit hours awarded to students. A bakery could have a fact table that records manufacturing of various baked goods.
Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business this is often referred to as fact table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, the assembly line warehouse analysts might be very concerned in number of defective goods that were manufactured each hour. Similarly a marketing data warehouse might be concerned by the activity of a consumer group with a specific income-level rather than purchases made by each individual.
Data Staging Area
A storage area and set
of processes that clean, transform, combine, de-duplicate,household, archive,
and prepare source data for use in the data warehouse.The data Below are the few other terms which comes mostly in OLAP side of Datawarehousing
1.Hierarchy
It defines parent-child relationships among various levels within a single dimension. For instance in a time dimension, year level is parent of four quarters, each of which is a parent of three months, which are parents of 28 to 31 days, which are parents of 24 hours. Similarly in a geography dimension a continent is a parent of countries, country could be a parent of states, and state could be a parent of cities.
2.Level
Level is a column within a dimension table that could be used for aggregating data. For example, product dimension could have levels of product type (beverage), product category (alcoholic beverage), product class (beer), product name (miller lite, budlite, corona, etc).
3.Drill-down
Drill-down is the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a lower level of summarization. An example of drill-down is a multiple-step process where sales revenue is first analyzed by year, then by quarter, and finally by month. Each iteration of drill-down returns sales revenue at a lower level of aggregation along the period dimension.
It defines parent-child relationships among various levels within a single dimension. For instance in a time dimension, year level is parent of four quarters, each of which is a parent of three months, which are parents of 28 to 31 days, which are parents of 24 hours. Similarly in a geography dimension a continent is a parent of countries, country could be a parent of states, and state could be a parent of cities.
2.Level
Level is a column within a dimension table that could be used for aggregating data. For example, product dimension could have levels of product type (beverage), product category (alcoholic beverage), product class (beer), product name (miller lite, budlite, corona, etc).
3.Drill-down
Drill-down is the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a lower level of summarization. An example of drill-down is a multiple-step process where sales revenue is first analyzed by year, then by quarter, and finally by month. Each iteration of drill-down returns sales revenue at a lower level of aggregation along the period dimension.
4.Roll-up
Roll-up is the opposite of drill-down. Roll-up is the repetitive selection and analysis of summarized facts with each repetition of data selection occurring at a higher level of summarization.
Roll-up is the opposite of drill-down. Roll-up is the repetitive selection and analysis of summarized facts with each repetition of data selection occurring at a higher level of summarization.
5. Aggregation
Aggregation is a key attribute of a data warehouse. Summarization and consolidation are other words that used to convey the same meaning. In the case of a multi-dimensional database, the summarizations are pre-computed for all the various combinations of the dimensions. This allows for very fast response to slice and dice operations at any level of drill-down, and also allows for fast drill-down and roll-up operations.
Aggregation is a key attribute of a data warehouse. Summarization and consolidation are other words that used to convey the same meaning. In the case of a multi-dimensional database, the summarizations are pre-computed for all the various combinations of the dimensions. This allows for very fast response to slice and dice operations at any level of drill-down, and also allows for fast drill-down and roll-up operations.
6.Granularity
Granularity is a term that is used to describe the level below which no supporting details are stored, only the summaries. Good judgment needs to be exercised to determine granularity. If the granularity is set to be too fine, unused data will be stored, wasting processing time during replication steps, and wasting disk space to hold it. On the other hand, if the granularity is set too coarse, the detailed data will not be available if it is needed at some point in the future.
Granularity is a term that is used to describe the level below which no supporting details are stored, only the summaries. Good judgment needs to be exercised to determine granularity. If the granularity is set to be too fine, unused data will be stored, wasting processing time during replication steps, and wasting disk space to hold it. On the other hand, if the granularity is set too coarse, the detailed data will not be available if it is needed at some point in the future.
No comments:
Post a Comment