The goal of performance tuning is to optimize session performance so that sessions run within the available load window for the Informatica Server. You can increase session performance in the following ways.
Network: The performance of the Informatica Server is related to network connections. Data generally moves across a network at less than 1 MB per second, whereas a local disk moves data five to twenty times faster. Network connections therefore often affect session performance, so minimize them where possible.
Flat files: If your flat files are stored on a machine other than the Informatica Server, move those files to the machine that hosts the Informatica Server.
Fewer connections: Minimize the connections to sources, targets, and the Informatica Server to improve session performance. Moving the target database onto the server machine may also improve session performance.
Staging areas: If you use staging areas, you force the Informatica Server to perform multiple data passes. Removing staging areas may improve session performance; use a staging area only when it is mandatory.
Informatica Servers: You can run multiple Informatica Servers against the same repository. Distributing the session load across multiple Informatica Servers may improve session performance. Running the Informatica Server in ASCII data movement mode also improves session performance, because ASCII data movement mode stores a character value in one byte, whereas Unicode mode takes two bytes per character.
Source qualifier: If a session joins
multiple source tables in one Source Qualifier, optimizing the query may
improve performance. Also, single table select statements with an ORDER BY or
GROUP BY clause may benefit from optimization such as adding indexes.
Drop constraints: If the target has key constraints and indexes, they slow the loading of data. To improve session performance in this case, drop the constraints and indexes before running the session (for example, while loading facts and dimensions) and rebuild them after the session completes.
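As a rough illustration (the table name, index names, and DDL below are hypothetical, and in PowerCenter this logic usually goes into the session's pre-SQL and post-SQL), the drop-and-rebuild pattern looks like this in Python:

```python
# Sketch: drop target indexes before a bulk load and rebuild them afterwards.
# Table, constraint, and index names are hypothetical placeholders.

PRE_LOAD_SQL = [
    "ALTER TABLE sales_fact DROP CONSTRAINT pk_sales_fact",
    "DROP INDEX idx_sales_fact_date",
]

POST_LOAD_SQL = [
    "ALTER TABLE sales_fact ADD CONSTRAINT pk_sales_fact PRIMARY KEY (sale_id)",
    "CREATE INDEX idx_sales_fact_date ON sales_fact (sale_date)",
]

def run_statements(connection, statements):
    """Execute a list of SQL statements on any DB-API 2.0 connection."""
    cursor = connection.cursor()
    for statement in statements:
        cursor.execute(statement)
    connection.commit()

def load_with_index_maintenance(connection, load_fn):
    """Drop indexes, run the load, then rebuild indexes even if the load fails."""
    run_statements(connection, PRE_LOAD_SQL)
    try:
        load_fn(connection)          # the actual bulk load (session run)
    finally:
        run_statements(connection, POST_LOAD_SQL)
```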
Parallel sessions: Running parallel sessions by using concurrent batches also reduces data loading time, so concurrent batches may increase session performance.
Partitioning: Partitioning a session improves performance by creating multiple connections to sources and targets and loading data through parallel pipelines.
Incremental Aggregation: If a session contains an Aggregator transformation, in some cases you can use incremental aggregation to improve session performance.
Transformation Errors: Avoid transformation errors to improve session performance. Before saving the mapping, validate it and rectify any transformation errors.
Lookup Transformations: If the session contains a Lookup transformation, you can improve session performance by enabling the lookup cache. The cache improves speed by keeping previously read lookup data, so the same rows do not have to be fetched again.
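To see why the cache helps, here is a minimal Python sketch of the idea, assuming a hypothetical customer lookup table and a generic DB-API connection: the first lookup for a key hits the database, repeat lookups are served from memory.

```python
# Sketch of a lookup cache: fetch each key from the lookup table once,
# then serve repeated lookups from an in-memory dictionary.
# The query, table, and parameter style are hypothetical placeholders.

class CachedLookup:
    def __init__(self, connection,
                 query="SELECT cust_name FROM customer_dim WHERE cust_id = ?"):
        self.connection = connection
        self.query = query
        self.cache = {}          # previously read rows live here

    def lookup(self, key):
        if key not in self.cache:                     # cache miss: go to the database
            cursor = self.connection.cursor()
            cursor.execute(self.query, (key,))
            row = cursor.fetchone()
            self.cache[key] = row[0] if row else None
        return self.cache[key]                        # cache hit: no database round trip
```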
Filter Transformations: If your session contains a Filter transformation, place it as close to the sources as possible, or use a filter condition in the Source Qualifier instead.
Group transformations: Aggregator, Rank, and Joiner transformations often decrease session performance because they must group data before processing it. To improve session performance in this case, use the sorted ports option, i.e. sort the data before it reaches the transformation.
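The benefit of sorted input can be sketched in plain Python: itertools.groupby can compute group totals in a single streaming pass only when rows arrive already sorted on the group key, which is the same reason the sorted ports option helps the Aggregator. The rows below are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (product, amount). Grouping in one streaming pass only
# works if the rows are already sorted on the group key, just as the
# Aggregator avoids buffering all rows when its input is sorted.
rows = [("beer", 5.0), ("wine", 9.0), ("beer", 3.5), ("wine", 7.0)]

sorted_rows = sorted(rows, key=itemgetter(0))        # sort before grouping

for product, group in groupby(sorted_rows, key=itemgetter(0)):
    total = sum(amount for _, amount in group)
    print(product, total)                            # beer 8.5 / wine 16.0
```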
Packet size: You can improve session performance by configuring the network packet size, which controls how much data crosses the network at one time. To do this, open the Server Manager and configure the database connections.
Session
parameters, like mapping parameters, represent values you might want to change
between sessions, such as a database connection or source file. Use session
parameters in the session properties, and then define the parameters in a
parameter file. You can specify the parameter file for the session to use in
the session properties. You can also specify it when you use pmcmd to start the session. The Workflow Manager provides one built-in session parameter, $PMSessionLogFile. With $PMSessionLogFile, you can change the name of the session log generated for the session. The Workflow Manager also allows you to create user-defined session parameters.
Naming Conventions for User-Defined Session Parameters

Parameter Type        Naming Convention
Database Connection   $DBConnectionName
Source File           $InputFileName
Target File           $OutputFileName
Lookup File           $LookupFileName
Reject File           $BadFileName
Use
session parameters to make sessions more flexible. For example, you have the
same type of transactional data written to two different databases, and you use
the database connections TransDB1 and TransDB2 to connect to the databases. You
want to use the same mapping for both tables. Instead of creating two sessions
for the same mapping, you can create a database connection parameter,
$DBConnectionSource, and use it as the source database connection for the
session. When you create a parameter file for the session, you set
$DBConnectionSource to TransDB1 and run the session. After the session
completes, you set $DBConnectionSource to TransDB2 and run the session again.
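A hedged sketch of that pattern is shown below: the script writes a small parameter file and then calls pmcmd for each source. The folder, workflow, and connection names are hypothetical, and the exact parameter-file section header and pmcmd arguments depend on your PowerCenter environment, so treat this as an outline rather than exact syntax.

```python
import subprocess

# Hypothetical sketch: run the same workflow twice, once per source database,
# by rewriting the parameter file between runs. Folder/workflow names, the
# section header format, and the pmcmd arguments must match your environment.

def write_param_file(path, connection_name):
    with open(path, "w") as f:
        f.write("[MyFolder.WF:wf_load_transactions.ST:s_m_load_transactions]\n")
        f.write(f"$DBConnectionSource={connection_name}\n")

def run_workflow(param_file):
    # pmcmd invocation is illustrative only; fill in your own integration
    # service, domain, and credentials.
    subprocess.run([
        "pmcmd", "startworkflow",
        "-sv", "IntegrationService", "-d", "Domain",
        "-u", "user", "-p", "password",
        "-f", "MyFolder", "-paramfile", param_file,
        "-wait", "wf_load_transactions",
    ], check=True)

for source in ("TransDB1", "TransDB2"):
    write_param_file("transactions.par", source)
    run_workflow("transactions.par")
```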
You
might use several session parameters together to make session management
easier. For example, you might use source file and database connection
parameters to configure a session to read data from different source files and
write the results to different target databases. You can then use reject file
parameters to write the session reject files to the target machine. You can use
the session log parameter, $PMSessionLogFile, to write to different session
logs in the target machine, as well.
When
you use session parameters, you must define the parameters in the parameter
file. Session parameters do not have default values. When the PowerCenter
Server cannot find a value for a session parameter, it fails to initialize the
session.
In 1993, the "father of data warehousing", Bill Inmon, gave this definition of a data warehouse: "A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management's decisions." According to this definition, a data warehouse is different from an operational database in four important ways.
Comparing a Data Warehouse and an Operational Database

Data Warehouse      Operational Database
subject oriented    application oriented
integrated          multiple diverse sources
time-variant        real-time, current
nonvolatile         updateable
An operational database is designed primarily to support day-to-day operations. A data warehouse is designed to support strategic decision making.
A data warehouse assembles data dispersed across various data sources in the enterprise and helps business stakeholders manage their operations by making better-informed decisions. Most organizations have multiple data stores: relational databases, spreadsheets, mainframes, mail systems or even paper files. Each of these data stores tends to serve a subset of the enterprise. A data warehouse attempts to overcome this limitation by combining all relevant data and by allowing managers to view their business from many different angles.
Data warehouses are built using dimensional data models which consist of fact
and dimension tables. Dimension tables are used to describe dimensions; they
contain dimension keys, values and attributes. For example, the time dimension
would contain every hour, day, week, month, quarter and year that has occurred
since you started your business operations. A product dimension could contain the name and description of the products you sell, their unit price, color, weight and other attributes as applicable.
Benefits of Data Warehousing
Data warehousing is being hailed as one of the most strategically significant developments in information processing in recent times. One of the reasons for this is that it is seen as part of the answer to information overload.
Some of the benefits of data warehousing that were seen as relevant to the Avondale College project are listed here. The points highlighted in Bill Inmon's definition give some of the reasons why data warehousing is regarded as important.
1. Has a subject area orientation
2. Integrates data from multiple, diverse sources
3. Allows for analysis of data over time
4. Adds ad hoc reporting and enquiry
5. Provides analysis capabilities to decision makers
6. Relieves the development burden on IT
7. Provides improved performance for complex analytical queries
8. Relieves the processing burden on transaction-oriented databases
9. Allows for a continuous planning process
10. Converts corporate data into strategic information
Before
building a data warehouse you should become familiar with the terminology used
to describe various parts of the warehouse.
1. Dimensions
Dimension tables are typically small, ranging from a few to several thousand rows. Occasionally dimensions can grow fairly large, however. For example, a large credit card company could have a customer dimension with millions of rows. Dimension table structure is typically very lean; for example, a customer dimension could look like the sketch below.
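The original illustration is not reproduced here, but a plausible lean customer dimension, written as hypothetical DDL, might carry only a surrogate key, the natural business key, and a few descriptive attributes:

```python
# Hypothetical sketch of a lean customer dimension: a surrogate key,
# the natural (business) key, and a handful of descriptive attributes.
CUSTOMER_DIM_DDL = """
CREATE TABLE customer_dim (
    customer_key   INTEGER PRIMARY KEY,   -- surrogate key referenced by fact tables
    customer_id    VARCHAR(20),           -- natural key from the source system
    customer_name  VARCHAR(100),
    gender         CHAR(1),
    income_band    VARCHAR(20),
    city           VARCHAR(50),
    state          VARCHAR(50)
)
"""
```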
Although
there might be other attributes that you store in the relational database, data
warehouses might not need all of those attributes. For example, customer
telephone numbers, email addresses and other contact information would not be
necessary for the warehouse. Keep in mind that data warehouses are used to make strategic decisions by analyzing trends; they are not meant to be tools for daily business operations. On the other hand, you might have some reports that do include data elements that aren't necessary for data analysis. Most
data warehouses will have one or multiple time dimensions. Since the warehouse
will be used for finding and examining trends, data analysts will need to know
when each fact has occurred. The most common time dimension is calendar time.
However, your business might also need a fiscal time dimension if your fiscal year does not start on January 1st with the calendar year.
Most
data warehouses will also contain product or service dimensions since each
business typically operates by offering either products or services to others.
Geographically dispersed businesses are likely to have a location dimension.
2. Facts
Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording the sale of each item. The fact table of an educational entity could track credit hours awarded to students. A bakery could have a fact table that records the manufacturing of various baked goods.
Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, assembly line warehouse analysts might be very interested in the number of defective goods that were manufactured each hour. Similarly, a marketing data warehouse might be concerned with the activity of a consumer group with a specific income level rather than purchases made by each individual.
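As a hedged example of choosing a grain, the hypothetical DDL below records one row per product, per store, per day, with foreign keys to the dimensions and additive measures as the facts:

```python
# Hypothetical sales fact at the "one row per product, per store, per day"
# grain: dimension keys plus additive measures.
SALES_FACT_DDL = """
CREATE TABLE sales_fact (
    date_key      INTEGER NOT NULL,   -- references the time dimension
    product_key   INTEGER NOT NULL,   -- references the product dimension
    store_key     INTEGER NOT NULL,   -- references the store/location dimension
    quantity_sold INTEGER,
    sales_amount  DECIMAL(12, 2),
    PRIMARY KEY (date_key, product_key, store_key)
)
"""
```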
Data Staging Area
A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse. The data staging area is everything in between the source system and the presentation server. Although it would be nice if the data staging area were a single centralized facility on one piece of hardware, it is far more likely that it is spread over a number of machines. The data staging area is dominated by the simple activities of sorting and sequential processing and, in some cases, it does not need to be based on relational technology.
Below are a few other terms that come up mostly on the OLAP side of data warehousing.
1. Hierarchy
A hierarchy defines parent-child relationships among various levels within a single dimension. For instance, in a time dimension the year level is a parent of four quarters, each of which is a parent of three months, which are parents of 28 to 31 days, which are parents of 24 hours. Similarly, in a geography dimension a continent is a parent of countries, a country could be a parent of states, and a state could be a parent of cities.
2. Level
A level is a column within a dimension table that could be used for aggregating data. For example, a product dimension could have levels of product type (beverage), product category (alcoholic beverage), product class (beer), and product name (Miller Lite, Bud Light, Corona, etc.).
3. Drill-down
Drill-down is the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a lower level of summarization. An example of drill-down is a multiple-step process where sales revenue is first analyzed by year, then by quarter, and finally by month. Each iteration of drill-down returns sales revenue at a lower level of aggregation along the period dimension.
4. Roll-up
Roll-up is the opposite of drill-down: the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a higher level of summarization.
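Both operations can be illustrated with a small pandas sketch over hypothetical revenue facts: the same rows are re-aggregated at year, quarter, and month level, and moving between those results is exactly drill-down and roll-up along the period dimension.

```python
import pandas as pd

# Hypothetical revenue facts with a small period dimension attached.
facts = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Apr", "May"],
    "revenue": [100.0, 120.0, 90.0, 110.0],
})

by_year    = facts.groupby("year")["revenue"].sum()                        # coarsest level
by_quarter = facts.groupby(["year", "quarter"])["revenue"].sum()           # one level lower
by_month   = facts.groupby(["year", "quarter", "month"])["revenue"].sum()  # finest level

print(by_year)      # drill-down moves from this result toward by_month;
print(by_quarter)   # roll-up moves back from by_month toward by_year.
print(by_month)
```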
5. Aggregation
Aggregation is a key attribute of a data warehouse. Summarization and consolidation are other words used to convey the same meaning. In the case of a multi-dimensional database, the summarizations are pre-computed for all the various combinations of the dimensions. This allows for very fast responses to slice-and-dice operations at any level of drill-down, and also allows for fast drill-down and roll-up operations.
6. Granularity
Granularity is a term used to describe the level below which no supporting details are stored, only the summaries. Good judgment needs to be exercised to determine granularity. If the granularity is set too fine, unused data will be stored, wasting processing time during replication steps and wasting disk space to hold it. On the other hand, if the granularity is set too coarse, the detailed data will not be available if it is needed at some point in the future.
A data warehouse does not generate any data; instead it is populated from various transactional or operational data stores. The process of importing and manipulating transactional data into the warehouse is referred to as Extraction, Transformation and Loading (ETL). Informatica is a widely used ETL tool for extracting source data and loading it into targets with the required transformations. SQL Server supplies an excellent ETL tool known as Data Transformation Services (DTS) in version 2000 and SQL Server Integration Services (SSIS) in version 2005.
In addition to importing data from various sources, ETL is also responsible for transforming data into a dimensional model. Depending on your data sources, the import process can be relatively simple or very complicated. For example, some organizations keep all of their data in a single relational engine, such as SQL Server. Others could have numerous systems that might not be easily accessible. In some cases you might have to rely on scanned documents or scrape report screens to get the data for your warehouse. In such situations you should bring all data into a common staging area first and then transform it into a dimensional model.
A staging area also provides a good place for assuring that your ETL is working correctly before data is loaded into dimension and fact tables. So your ETL could be made up of multiple stages (a sketch follows the list below):
1. Import data from various data sources into the staging area.
2. Cleanse data of inconsistencies (this can be either an automated or a manual effort).
3. Ensure that row counts of imported data in the staging area match the counts in the original data source.
4. Load data from the staging area into the dimensional model.
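A minimal Python sketch of those stages, using hypothetical extract, cleanse, and load helpers and an in-memory staging list; the row-count check corresponds to step 3.

```python
# Minimal sketch of a staged ETL run. The source, cleansing rules, and
# target loader are hypothetical stand-ins for real systems.

def extract_source():
    """Stage 1: pull raw rows from a (hypothetical) source system."""
    return [
        {"customer": " Alice ", "amount": "100"},
        {"customer": "Bob",     "amount": "250"},
    ]

def cleanse(rows):
    """Stage 2: remove obvious inconsistencies (trim names, fix types)."""
    return [
        {"customer": r["customer"].strip(), "amount": float(r["amount"])}
        for r in rows
    ]

def load_dimensional_model(rows):
    """Stage 4: placeholder for inserting into dimension/fact tables."""
    print(f"loaded {len(rows)} rows into the dimensional model")

source_rows  = extract_source()
staging_rows = cleanse(source_rows)

# Stage 3: reconcile row counts between source and staging before loading.
assert len(staging_rows) == len(source_rows), "row count mismatch: investigate before loading"

load_dimensional_model(staging_rows)
```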
ETL resolves the inconsistencies in entity and attribute naming across multiple data sources. For example, the same entity could be called customers, clients, prospects or consumers in various data stores. Furthermore, attributes such as address might be stored as three or more different columns (address line 1, address line 2, city, state, county, postal code, country and so forth). Each column can also be abbreviated or spelled out completely, depending on the data source. Similarly, there might be differences in data types, such as storing date and time as a string, number or date. During the ETL process data is imported from various sources and is given a common shape.
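A hedged sketch of that kind of conformance step in Python: each source's column names are mapped onto one agreed layout, and dates stored as strings are coerced into real dates. The source layouts and column names below are hypothetical.

```python
from datetime import datetime

# Hypothetical column-name mappings for two source systems that describe the
# same entity differently; both are conformed to one target layout.
COLUMN_MAP = {
    "crm":     {"client_nm": "customer_name", "addr1": "address_line1", "dob": "birth_date"},
    "billing": {"CUSTOMER": "customer_name", "ADDRESS_LINE_1": "address_line1", "BIRTH_DT": "birth_date"},
}

def conform(row, source):
    """Rename columns and coerce string dates into real dates."""
    mapped = {COLUMN_MAP[source].get(col, col): value for col, value in row.items()}
    if isinstance(mapped.get("birth_date"), str):
        mapped["birth_date"] = datetime.strptime(mapped["birth_date"], "%Y-%m-%d").date()
    return mapped

print(conform({"client_nm": "Alice", "addr1": "1 Main St", "dob": "1990-05-01"}, "crm"))
```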
There
are some over-arching themes in successful ETL system deployments regardless of
which tools and technologies are used. Most important — and most frequently
neglected — is the practice of designing the ETL system before development
begins. Too often we see systems that just evolved without any initial
planning. These systems are inefficient and slow, they break down all the time,
and they're unmanageable. The data warehouse team has no idea how to pinpoint
the bottlenecks and problem areas of the system.
List of the most popular ETL tools:
Informatica - PowerCenter
IBM - WebSphere DataStage (formerly known as Ascential DataStage)
SAP - BusinessObjects Data Integrator
IBM - Cognos Data Manager (formerly known as Cognos DecisionStream)
Microsoft - SQL Server Integration Services
Oracle - Data Integrator (formerly known as Sunopsis Data Conductor)
1. In Case of Task Failure
When an execution plan is executed, if a task fails, the status of the tasks that are dependent on the failed task is changed to Stopped. While tasks are still running, the execution plan's status is Running. When all the tasks have been run, and if one or more tasks have failed, the execution plan's status is changed to Failed. You can check the tasks that have failed in the Current Run tab of the Execute view, fix the problems, and then requeue the failed tasks by changing their status to Queued. You can then restart the ETL, and all the tasks will be rerun. You can also manually run a task, change its status to Completed, and then restart the ETL; tasks with a Completed status are skipped. To restart a Failed ETL, click Run Now from the Current Run tab of the Execute view.
2. In Case of Abnormal Termination of the DAC Server
If the DAC server fails during the execution of the ETL, the status of the ETL execution will remain Running. When the DAC server is started again, it will automatically run the ETL if the Auto Restart ETL DAC system property is set to True. If the same system property is set to False, when the server restarts it will set the correct status as Failed. In order to execute the ETL from the point of failure, submit the request to the server again. The DAC server will automatically terminate if it loses connection to the DAC repository.
Data Warehouse Administration Console (DAC) is a centralized console for schema management as well as configuration, administration, loading, and monitoring of the Oracle Business Analytics Warehouse. You can also read about how to do a full load in DAC.
The DAC provides a framework for the entire life cycle of data warehouse implementations. It allows you to create, configure, execute, and monitor modular data warehouse applications in a parallel, high-performing environment.
It manages the following data warehouse load processes:
Dynamic generation of subject areas and execution plans
Dynamic settings for parallelism and load balancing
Intelligent task queue engine based on user-defined and computed scores
OLAP (or Online
Analytical Processing) has been growing in popularity due to the increase
in data volumes and the recognition of the business value of analytics. It uses
database tables (fact and dimension tables) to enable multidimensional viewing,
analysis and querying of large amounts of data. For example, OLAP technology could provide management with fast answers to complex queries on their operational data, or enable them to analyze their company's historical data for trends and patterns.
OLAP allows business users to slice and dice data at will. Normally, data in an organization is distributed across multiple data sources that are incompatible with each other. A retail example: point-of-sale data and sales made via the call center or the Web are stored in different locations and formats. It would be a time-consuming process for an executive to obtain OLAP reports such as: What are the most popular products purchased by customers between the ages of 15 and 30?
Part of the OLAP
implementation process involves extracting data from the various data
repositories and making them compatible. Making data compatible involves
ensuring that the meaning of the data in one repository matches all other
repositories. An example of incompatible data: customer ages can be stored as a birth date for purchases made over the web and stored as age categories (i.e. between 15 and 30) for in-store sales. The major OLAP vendors are Hyperion, Cognos, Business Objects, and MicroStrategy.
OLAP systems are designed to give an overview analysis of what happened. Hence the data storage (i.e. the data model) has to be set up differently. The most common method is called the star design.
The central table in an OLAP star data model is called the fact table. The surrounding tables are called the dimensions. Using such a star data model, it is possible to build reports that answer questions such as:
Which supervisor gave the most discounts?
What quantity was shipped on a particular date, month, quarter or year?
In which pincode did product A sell the most?
To obtain answers such as the ones above from a data model, OLAP cubes are created. OLAP cubes are not strictly cuboids; it is the name given to the process of linking data from the different dimensions. The cubes can be developed along business units such as sales or marketing, or a giant cube can be formed with all the dimensions.
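A small pandas sketch of the slice-and-dice idea over hypothetical sales facts: a pivot acts as a tiny cube answering the "quantity shipped by quarter" style of question, and a slice answers the pincode question from the list above.

```python
import pandas as pd

# Hypothetical fact rows joined with a couple of dimension attributes.
sales = pd.DataFrame({
    "quarter":     ["Q1", "Q1", "Q2", "Q2", "Q2"],
    "pincode":     ["560001", "560002", "560001", "560002", "560001"],
    "product":     ["A", "A", "A", "B", "B"],
    "qty_shipped": [10, 7, 12, 5, 9],
})

# A tiny "cube": quantity shipped by quarter and pincode.
cube = sales.pivot_table(index="quarter", columns="pincode",
                         values="qty_shipped", aggfunc="sum", fill_value=0)
print(cube)

# Slice: in which pincode did product A sell the most?
product_a = sales[sales["product"] == "A"].groupby("pincode")["qty_shipped"].sum()
print(product_a.idxmax())   # pincode with the highest quantity of product A
```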
OLAP can be a valuable and rewarding business tool. Aside from producing reports, OLAP analysis can help an organization evaluate balanced scorecard targets.
Business
intelligence refers to computer-based methods for identifying and extracting
useful information from business data. OLAP (online analytical processing), as the name suggests, is a compilation of ways to query multi-dimensional databases.
OLAP is a class of systems which provide answers to multi-dimensional queries. Typically, OLAP is used for marketing, budgeting, forecasting and similar applications. It goes without saying that the databases used for OLAP are configured for complex and ad-hoc queries with quick performance in mind. Typically, a matrix is used to display the output of an OLAP query, with the rows and columns formed by the dimensions of the query. OLAP systems often use methods of aggregation on multiple tables to obtain summaries. For example, OLAP can be used to find out how this year's sales at Wal-Mart compare to last year's, what the prediction is for sales in the next quarter, and what can be said about the trend by looking at the percentage change.
OLAP tools provide multidimensional data analysis and summaries of the data; by contrast, data mining focuses on ratios, patterns and influences in the data set.