Pages

Saturday, 28 July 2012

How can we do Performance Tuning in Informatica

The goal of performance tuning is to optimize session performance so that the sessions run during the available load window for the Informatica Server.Increase the session performance by following.
  • Network:The performance of the Informatica Server is related to network connections. Data generally moves across a network at less than 1 MB per second, whereas a local disk moves data five to twenty times faster. Thus network connections often affect on session performance. So avoid network connections.
  • Flat files: If your flat files stored on a machine other than the informatica server, move those files to the machine that consists of informatica server.
  • Less Connections: Minimize the connections to sources ,targets and informatica server to improve session performance. Moving target database into server system may improve session performance.
  • Staging areas: If you use staging areas you force informatica server to perform multiple data passes.Removing of staging areas may improve session performance.Use staging area only when its mandatory
  • Informatica Servers:You can run the multiple informatica servers against the same repository. Distributing the session load to multiple informatica servers may improve session performance.
  • Run the informatica server in ASCII data movement mode improves the session performance. Because ASCII data movement mode stores a character value in one byte. Unicode mode takes 2 bytes to store a character.
  • Source qualifier: If a session joins multiple source tables in one Source Qualifier, optimizing the query may improve performance. Also, single table select statements with an ORDER BY or GROUP BY clause may benefit from optimization such as adding indexes.
  • Drop constraints: If target consists key constraints and indexes it slows the loading of data. To improve the session performance in this case drop constraints and indexes before we run the session(while loading facts and dimensions) and rebuild them after completion of session.
  • Parallel sessions:Running a parallel sessions by using concurrent batches will also reduce the time of loading the data. So concurrent batches may also increase the session performance.
  • Partitioning: the session improves the session performance by creating multiple connections to sources and targets and loads data in parallel pipe lines.
  • Incremental Aggregation:In some cases if a session contains an aggregator transformation ,you can use incremental aggregation to improve session performance.
  • Transformation Errors:Avoid transformation errors to improve the session performance.Before saving the mapping validate it and see and if any transformation errors rectify it.
  • Lookup Transformations: If the session contained lookup transformation you can improve the session performance by enabling the look up cache.The cache improves the speed by saving the previous data and hence no need to load that again
  • Filter Transformations: If your session contains filter transformation ,create that filter transformation nearer to the sources or you can use filter condition in source qualifier.
  • Group transformations: Aggregator, Rank and joiner transformation may often decrease the session performance .Because they must group data before processing it. To improve session performance in this case use sorted ports option ie sort the data before using the transformation.
  • Packet size: We can improve the session performance by configuring the network packet size, which allows data to cross the network at one time. To do this go to server manger ,choose server configure database connections.
 Also read about Diff between datastage and Informatica

Session Parameters in Informatica

Session Parameter

Session parameters, like mapping parameters, represent values you might want to change between sessions, such as a database connection or source file. Use session parameters in the session properties, and then define the parameters in a parameter file. You can specify the parameter file for the session to use in the session properties. You can also specify it when you use pmcmd to start the session.The Workflow Manager provides one built-in session parameter, $PMSessionLogFile.With $PMSessionLogFile, you can change the name of the session log generated for the session.The Workflow Manager also allows you to create user-defined session parameters.

Naming Conventions for User-Defined Session Parameters
Parameter Type
Naming Convention
Database Connection
$DBConnectionName
Source File
$InputFileName
Target File
$OutputFileName
Lookup File
$LookupFileName
Reject File
$BadFileName

Use session parameters to make sessions more flexible. For example, you have the same type of transactional data written to two different databases, and you use the database connections TransDB1 and TransDB2 to connect to the databases. 
You want to use the same mapping for both tables. Instead of creating two sessions for the same mapping, you can create a database connection parameter, $DBConnectionSource, and use it as the source database connection for the session. When you create a parameter file for the session, you set $DBConnectionSource to TransDB1 and run the session. After the session completes, you set $DBConnectionSource to TransDB2 and run the session again.

You might use several session parameters together to make session management easier. For example, you might use source file and database connection parameters to configure a session to read data from different source files and write the results to different target databases. You can then use reject file parameters to write the session reject files to the target machine. You can use the session log parameter, $PMSessionLogFile, to write to different session logs in the target machine, as well.

When you use session parameters, you must define the parameters in the parameter file. Session parameters do not have default values. When the PowerCenter Server cannot find a value for a session parameter, it fails to initialize the session.

Thursday, 26 July 2012

What is a DATAWAREHOUSE? Advantage of a Datawarehouse


In 1993, the “father of data warehousing”, Bill Inmon, gave this definition of a data warehouse: 
“A datewarehouse is a subject oriented , integrated,non volatile and time variant collection of data in support of managements decisions” 
According to this definition, a data warehouse is different from an opera­tional database in four important ways.
Data Warehouse
Operational Database
subject oriented
application oriented
integrated
multiple diverse sources
time-variant
real-time, current
nonvolatile
updateable
Comparing a Data Warehouse and an Operational Database
An operational database is designed primarily to support day to day opera­tions.  A data warehouse is designed to support strategic decision making.
Data warehouse assembles data dispersed in various data sources across the enterprise and helps business stakeholders manage their operations by making better informed decisions. Most organizations have multiple data stores: relational databases, spreadsheets, mainframes, mail systems or even paper files. Each of these data stores tends to serve a subset of the enterprise. Data warehouse attempts to overcome this limitation by combining all relevant data and by allowing managers to view their business from many different angles.
Data warehouses are built using dimensional data models which consist of fact and dimension tables. Dimension tables are used to describe dimensions; they contain dimension keys, values and attributes. For example, the time dimension would contain every hour, day, week, month, quarter and year that has occurred since you started your business operations. Product dimension could contain a name and description of products you sell, their unit price, color, weight and other attributes as applicable.
Benefits of Data Warehousing
Data warehousing is being hailed as one of the most strategically significant developments in information processing in recent times.  One of the reasons for this is that it is seen as part of the answer to information overload. 
Some of the benefits of data warehousing that were seen as relevant to the Avondale College project are listed here.  The points highlighted in the Bill Inmon’s definition give some of the reasons why data warehousing is regarded as important.
 
1.      Has a subject area orientation
2.      Integrates data from multiple, diverse sources
3.      Allows for analysis of data over time
4.      Adds ad hoc reporting and enquiry
5.      Provides analysis capabilities to decision makers
6.      Relieves the development burden on IT
7.      Provides improved performance for complex analytical queries
8.      Relieves processing burden on transaction oriented databases
9.      Allows for a continuous planning process
10.  Converts corporate data into strategic information

Before building a data warehouse you should become familiar with the terminology used to describe various parts of the warehouse.
1.Dimensions 
Dimension tables are typically small, ranging from a few to several thousand rows.Occasionally dimensions can grow fairly large, however.For example, a large credit card company could have a customer dimension with millions of rows. Dimension table structure is typically very lean, for example customer dimension could look like following: 
Customer_key
Customer_full_name
Customer_city
Customer_state
Customer_country
Although there might be other attributes that you store in the relational database, data warehouses might not need all of those attributes. For example, customer telephone numbers, email addresses and other contact information would not be necessary for the warehouse. Keep in mind that data warehouses are used to make strategic decisions by analyzing trends. It is not meant to be a tool for daily business operations. On the other hand, you might have some reports that do include data elements that aren't necessary for data analysis.
Most data warehouses will have one or multiple time dimensions. Since the warehouse will be used for finding and examining trends, data analysts will need to know when each fact has occurred. The most common time dimension is calendar time. However, your business might also need a fiscal time dimension in case your fiscal year does not start on January 1st as the calendar year.
Most data warehouses will also contain product or service dimensions since each business typically operates by offering either products or services to others. Geographically dispersed businesses are likely to have a location dimension.


2.Facts

Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording a sale of each item. The fact table of an educational entity could track credit hours awarded to students. A bakery could have a fact table that records manufacturing of various baked goods.

Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business this is often referred to as fact table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, the assembly line warehouse analysts might be very concerned in number of defective goods that were manufactured each hour. Similarly a marketing data warehouse might be concerned by the activity of a consumer group with a specific income-level rather than purchases made by each individual.

Data Staging Area
A storage area and set of processes that clean, transform, combine, de-duplicate,household, archive, and prepare source data for use in the data warehouse.The data staging area is everything in between the source system and the presentation server.Although it would be nice if the data staging area were a single centralized facility on one piece of hardware, it is far more likely that the data staging area is spread over a number of machines. The data staging area is dominatedby the simple activities of sorting & sequential processing and, in some cases, the data staging area does not need to be based on relational technology.

Below are the few other terms which comes mostly in OLAP side of Datawarehousing
1.Hierarchy 
It defines parent-child relationships among various levels within a single dimension. For instance in a time dimension, year level is parent of four quarters, each of which is a parent of three months, which are parents of 28 to 31 days, which are parents of 24 hours. Similarly in a geography dimension a continent is a parent of countries, country could be a parent of states, and state could be a parent of cities.

2.Level 

Level is a column within a dimension table that could be used for aggregating data. For example, product dimension could have levels of product type (beverage), product category (alcoholic beverage), product class (beer), product name (miller lite, budlite, corona, etc).

3.Drill-down
Drill-down is the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a lower level of summarization.  An example of drill-down is a multiple-step process where sales revenue is first analyzed by year, then by quarter, and finally by month.  Each iteration of drill-down returns sales revenue at a lower level of aggregation along the period dimension.
4.Roll-up 
Roll-up is the opposite of drill-down.  Roll-up is the repetitive selection and analysis of summarized facts with each repetition of data selection occurring at a higher level of summarization.
5. Aggregation 
Aggregation is a key attribute of a data warehouse.  Summarization and consolidation are other words that used to convey the same meaning.  In the case of a multi-dimensional database, the summarizations are pre-computed for all the various combinations of the dimensions.  This allows for very fast response to slice and dice operations at any level of drill-down, and also allows for fast drill-down and roll-up operations.
6.Granularity 
Granularity is a term that is used to describe the level below which no supporting details are stored, only the summaries.  Good judgment needs to be exercised to determine granularity.  If the granularity is set to be too fine, unused data will be stored, wasting processing time during replication steps, and wasting disk space to hold it.  On the other hand, if the granularity is set too coarse, the detailed data will not be available if it is needed at some point in the future.

Wednesday, 25 July 2012

What does ETL tool mean- Extraction, Transformation & Loading-ETL Basics

Extraction, Transformation & Loading:

A data warehouse does not generate any data; instead it is populated from various transactional or operational data stores. The process of importing and manipulating transactional data into the warehouse is referred to as Extraction, Transformation and Loading (ETL). Informatica is the widely used ETL tool for extracting source data and loading into target with required Transformations.SQL Server supplies an excellent ETL tool known as Data Transformation Services (DTS) in version 2000 and SQL Server Integration Services (SSIS) in version 2005.
In addition to importing data from various sources, ETL is also responsible for transforming data into a dimensional model. Depending on your data sources the import process can be relatively simple or very complicated. For eg, some organizations keep all of their data in a single relational engine, such as SQL Server. Others could have numerous systems that might not be easily accessible. In some cases you might have to rely on scanned documents or scrape report screens to get the data for your warehouse. In such situations you should bring all data into a common staging area first and then transform it into a dimensional model.
A staging area also provides a good place for assuring that your ETL is working correctly before data is loaded into dimension and fact tables. So your ETL could be made up of multiple stages:
  • Import data from various data sources into the staging area.
  • Cleanse data from inconsistencies (could be either automated or manual effort).
  • Ensure that row counts of imported data in the staging area match the counts in the original data source.
  • Load data from the staging area into the dimensional model.
ETL resolves the inconsistencies in entity and attribute naming across multiple data sources. For example the same entity could be called customers, clients, prospects or consumers in various data stores. Furthermore attributes such as address might be stored as three or more different columns (address line1, address line2, city, state, county, postal code, country and so forth). Each column can also be abbreviated or spelled out completely, depending on data source. Similarly there might be differences in data types, such as storing data and time as a string, number or date. During the ETL process data is imported from various sources and is given a common shape.

There are some over-arching themes in successful ETL system deployments regardless of which tools and technologies are used. Most important — and most frequently neglected — is the practice of designing the ETL system before development begins. Too often we see systems that just evolved without any initial planning. These systems are inefficient and slow, they break down all the time, and they're unmanageable. The data warehouse team has no idea how to pinpoint the bottlenecks and problem areas of the system.

List of the most popular ETL tools:
  • Informatica - Power Center
  • IBM - Websphere DataStage(Formerly known as Ascential DataStage)
  • SAP - BusinessObjects Data Integrator
  • IBM - Cognos Data Manager (Formerly known as Cognos DecisionStream)
  • Microsoft - SQL Server Integration Services
  • Oracle - Data Integrator (Formerly known as Sunopsis Data Conductor)
  • SAS - Data Integration Studio
  • Oracle - Warehouse Builder
  • AB Initio
  • Information Builders - Data Migrator
 Guys  Read  more  about  ETL
 

Execution of ETL using DAC

1.When the Execution of an Execution Plan Fails
When an execution plan is executed, if a task fails, the status of the tasks that are dependent on the failed task is changed to Stopped. While tasks are still running, the execution plan’s status is Running. When all the tasks have been run, and if one or more tasks have failed, the execution plan’s status is changed to Failed. You can check the tasks that have failed in the Current Run tab of the Execute view, fix the problems,and then requeue the failed tasks by changing the status to Queued. You can then restart the ETL.
All the tasks will then be rerun. You can also manually run a task, change its status to Completed,and then restart the ETL. Tasks with a Completed status are skipped.To restart a Failed ETL, click Run Now from the Current Run tab of the Execute view.

2.In Case of Abnormal Termination of the DAC Server
If the DAC server fails during the execution of the ETL, the status of the ETL execution will remain as Running. When the DAC server is started again, it will automatically run the ETL if the Auto Restart ETL DAC system property is set to True. If the same system property is set to False, when the server restarts, it will set the correct status as Failed. In order to execute the ETL from the point of failure, submit the request to the server again. The DAC server will automatically terminate if it looses connection to the DAC repository.

What is DAC in OBIA

Data Warehouse Administration Console (DAC) is a centralized console for schema management as well as configuration, administration, loading, and monitoring of the Oracle Business Analytics Warehouse.You can also read about How to do Full load in DAC
The DAC provides a framework for the entire life cycle of data warehouse implementations. It allows you to create, configure, execute, and monitor modular data warehouse applications in a parallel, high-performing environment.
  • It manages the following data warehouse load processes:
  • Dynamic generation of subject areas and execution plans
  • Dynamic settings for parallelism and load balancing
  • Intelligent task queue engine based on user-defined and computed scores
  • Index management for ETL and query performance
  • Embedded high-performance OTLP change capture techniques
  • Restart from any point of failure

Monday, 23 July 2012

What is an OLAP and Explain OLAP's advantages with example?

OLAP (or Online Analytical Processing) has been growing in popularity due to the increase in data volumes and the recognition of the business value of analytics. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. E.g. OLAP technology could provide management with fast answers to complex queries on their operational data or enable them to analyze their company's historical data for trends and patterns.
OLAP allows business users to slice and dice data at will. Normally data in an organization is distributed in multiple data sources and are incompatible with each other. A retail example: Point-of-sales data and sales made via call-center or the Web are stored in different location and formats. It would a time consuming process for an executive to obtain OLAP reports such as - What are the most popular products purchased by customers between the ages 15 to 30?
Part of the OLAP implementation process involves extracting data from the various data repositories and making them compatible. Making data compatible involves ensuring that the meaning of the data in one repository matches all other repositories. An example of incompatible data: Customer ages can be stored as birth date for purchases made over the web and stored as age categories (i.e. between 15 and 30) for in store sales.The major OLAP vendors are Hyperion, Cognos, Business Objects, MicroStrategy.
OLAPs are designed to give an overview analysis of what happened. Hence the data storage (i.e. data modeling) has to be set up differently. The most common method is called the star design.
                                                                     

The central table in an OLAP start data model is called the fact table. The surrounding tables are called the dimensions. Using the above data model, it is possible to build reports that answer questions such as:
  • The supervisor that gave the most discounts.
  • The quantity shipped on a particular date, month, year or quarter.
  • In which pincode did product A sell the most.
To obtain answers, such as the ones above, from a data model OLAP cubes are created. OLAP cubes are not strictly cuboids - it is the name given to the process of linking data from the different dimensions. The cubes can be developed along business units such as sales or marketing.Or a giant cube can be formed with all the dimensions.
OLAP can be a valuable and rewarding business tool. Aside from producing reports, OLAP analysis can aid an organization evaluate balanced scorecard targets.
Business intelligence refers to computer-based methods for identifying and extracting useful information from business data. OLAP (online analytical processing) as the name suggest is a compilation of ways to query multi-dimensional databases.
OLAP is a class of systems, which provide answers to multi-dimensional queries.Typically OLAP is used for marketing, budgeting, forecasting and similar applications. It goes without saying that the databases used for OLAP are configured for complex and ad-hoc queries with a quick performance in mind. Typically a matrix is used to display the output of an OLAP. The rows and columns are formed by the dimensions of the query. They often use methods of aggregation on multiple tables to obtain summaries.For example,it can be used to find out about the sales of this year in Wal-Mart compared to last year? What is the prediction on the sales in the next quarter? What can be said about the trend by looking at the percentage change?
OLAP tools provides multidimensional data analysis and they provide summaries of the data but contrastingly, data mining focuses on ratios, patterns and influences in the set of data.