Pages

Wednesday, 18 July 2012

Informatica ETL(Extract, Transform and Load) Tool

Here,we will see how the tool INFORMATICA is used to extract data from Source(s), transform it and then load it into the Target. Data transformation is done to eliminate any erroneous or redundant data. This ensures that only the correct data is loaded into the Target (OLAP), which will be used for analysis / reporting.
Informatica PowerCenter architecture is used to achieve the extract, transform and load of data. PowerCenter provides an environment that allows you to load data into a centralized location, such as a datamart, data warehouse, or operational data store (ODS). You can extract data from multiple sources, transform the data according to business
logic you build in the client application, and load the transformed data into file and relational targets. PowerCenter provides the following integrated components
  • PowerCenter repository. The PowerCenter repository is at the center of the PowerCenter suite. You create a set of metadata tables within the repository database that the PowerCenter applications and tools access. The PowerCenter Client and Server access the repository to save and retrieve metadata.
  • PowerCenter Repository Server. The PowerCenter Repository Server manages connections to the repository from client applications. It inserts, updates, and fetches objects from the repository database tables. It also maintains object consistency.
  • PowerCenter Client. Use the PowerCenter Client to manage users, define sources and targets, build mappings and mapplets with the transformation logic, and create workflows to run the mapping logic. The PowerCenter Client has the following client applications: Repository Manager, Repository Server Administration Console, Designer, Workflow Manager, and Workflow Monitor.
  • PowerCenter Server. The PowerCenter Server extracts the source data, performs the data transformation, and loads the transformed data into the targets.
Sources
PowerCenter accesses the following sources:
  • Relational. Oracle, Sybase, Informix, IBM DB2, Microsoft SQL Server, and Teradata.
  • File. Fixed and delimited flat file, COBOL file, and XML.
  • Application. You can purchase additional PowerCenter Connect products to access business sources, such as PeopleSoft, SAP R/3, Siebel, IBM MQSeries, and TIBCO.
  • Mainframe. You can purchase PowerExchange for faster access to IBM DB2 on MVS.
  • Other. Microsoft Excel and Access.
Targets
PowerCenter can load data into the following targets:
  • Relational. Oracle, Sybase, Sybase IQ, Informix, IBM DB2, Microsoft SQL Server, and Teradata.
  • File. Fixed and delimited flat file and XML.
  • Application. You can purchase additional PowerCenter Connect products to load data into SAP BW. You can also load data into IBM MQSeries message queues and TIBCO.
  • Other. Microsoft Access.
You can load data into targets using ODBC or native drivers, FTP, or external loaders. 

Repository
The PowerCenter repository resides on a relational database. The repository database tables contain the instructions required to extract, transform, and load data. PowerCenter Client applications access the repository database tables through the Repository Server.
You add metadata to the repository tables when you perform tasks in the PowerCenter Client application, such as creating users, analyzing sources, developing mappings or mapplets, or creating workflows. The PowerCenter Server reads metadata created in the Client application when you run a workflow. The PowerCenter Server also creates metadata, such as start and finish times of a session or session status.
You can develop global and local repositories to share metadata:
  • Global repository. The global repository is the hub of the domain. Use the global repository to store common objects that multiple developers can use through shortcuts. These objects may include operational or Application source definitions, reusable transformations, mapplets, and mappings.
  • Local repositories. A local repository is within a domain that is not the global repository. Use local repositories for development. From a local repository, you can create shortcuts to objects in shared folders in the global repository. These objects typically include source definitions, common dimensions and lookups, and enterprise standard transformations. You can also create copies of objects in non-shared folders.
  • Version control. A versioned repository can store multiple copies, or versions, of an object. Each version is a separate object with unique properties. PowerCenter version control features allow you to efficiently develop, test, and deploy metadata into production.
You can connect to a repository, back up, delete, or restore repositories using pmrep, a command line program.
You can view much of the metadata in the Repository Manager. The Informatica Metadata Exchange (MX) provides a set of relational views that allow easy SQL access to the Informatica metadata repository.
Repository Server
The Repository Server manages repository connection requests from client applications. For each repository database registered with the Repository Server, it configures and manages a Repository Agent process. The Repository Server also monitors the status of running Repository Agents, and sends repository object notification messages to client applications.
The Repository Agent is a separate, multi-threaded process that retrieves, inserts, and updates metadata in the repository database tables. The Repository Agent ensures the consistency of metadata in the repository by employing object locking. 

PowerCenter Client
The PowerCenter Client consists of the following applications that you use to manage the repository, design mappings, mapplets, and create sessions to load the data:
  • Repository Server Administration Console. Use the Repository Server Administration console to administer the Repository Servers and repositories.
  • Repository Manager. Use the Repository Manager to administer the metadata repository. You can create repository users and groups, assign privileges and permissions, and manage folders and locks.
  • Designer. Use the Designer to create mappings that contain transformation instructions for the PowerCenter Server. Before you can create mappings, you must add source and target definitions to the repository. The Designer has five tools that you use to analyze sources, design target schemas, and build source-to-target mappings:
    • Source Analyzer. Import or create source definitions.
    • Warehouse Designer. Import or create target definitions.
    • Transformation Developer. Develop reusable transformations to use in mappings.
    • Mapplet Designer. Create sets of transformations to use in mappings.
    • Mapping Designer. Create mappings that the PowerCenter Server uses to extract, transform, and load data.
  • Workflow Manager. Use the Workflow Manager to create, schedule, and run workflows. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data. The PowerCenter Server runs workflow tasks according to the links connecting the tasks. You can run a task by placing it in a workflow.
  • Workflow Monitor. Use the Workflow Monitor to monitor scheduled and running workflows for each PowerCenter Server. You can choose a Gantt chart or Task view. You can also access details about those workflow runs.
PowerCenter Server
The PowerCenter Server reads mapping and session information from the repository. It extracts data from the mapping sources and stores the data in memory while it applies the transformation rules that you configure in the mapping. The PowerCenter Server loads the transformed data into the mapping targets.
The PowerCenter Server can achieve high performance using symmetric multi-processing systems. The PowerCenter Server can start and run multiple workflows concurrently. It can also concurrently process partitions within a single session. When you create multiple partitions within a session, the PowerCenter Server creates multiple database connections to a single source and extracts a separate range of data for each connection, according to the properties you configure. 

Connectivity
PowerCenter uses the following types of connectivity:
  • Network protocol
  • Native drivers
  • ODBC
The PowerCenter Client uses ODBC and native drivers to connect to source and target databases. It uses TCP/IP to connect to the Repository Server. The Repository Server uses native drivers to connect to the repository database. The Workflow Manager and the PowerCenter Server use TCP/IP to communicate with each other.
The PowerCenter Server uses native drivers to connect to the databases to move data. You can optionally use ODBC to connect the PowerCenter Server to the source and target databases. It uses TCP/IP to connect to the PowerCenter Client. 

Database Connections
The Repository Server maintains a pool of reusable database connections for serving client applications. The server generates a Repository Agent process for each database. The Repository Agent creates new database connections only if all the current connections are in use.
For example, if 10 clients send requests to the Repository Agent one at a time, the agent requires only one connection. It reuses the same database connection for all the requests. If the 10 clients send requests simultaneously, the Repository Agent opens 10 connections. You can set the maximum number of open connections using the DatabasePoolSize parameter in the repository configuration file.
For a session, a reader object holds the connection for as long as it needs to read the data from the source tables. A writer object holds a connection for as long as it needs to write data to the target tables.
The PowerCenter Server maintains a database connection pool for stored procedure or lookup databases in a workflow. You can optionally set the MaxLookupSPDBConnections parameter to limit connections when you configure the PowerCenter service. The PowerCenter Server allows an unlimited number of connections to lookup or stored procedure databases. If a database user does not have permission for the number of connections a session requires, the session fails.
For pre-session, post-session, and load-stored procedures, consecutive stored procedures reuse a connection if they have identical connection attributes. Otherwise, the connection for one stored procedure closes and a new connection begins for the next stored procedure. 

Designer Windows
You can display the following windows in the Designer:
  • Navigator. Connect to repositories, and open folders within the Navigator. You can also copy objects and create shortcuts within the Navigator.
  • Workspace. Open different tools in this window to create and edit repository objects, such as sources, targets, mapplets, transformations, and mappings.
  • Output. View details about tasks you perform, such as saving your work or validating a mapping.
  • Status bar. Displays the status of the operation you perform.
  • Overview. An optional window to simplify viewing a workspace that contains a large mapping or multiple objects. Outlines the visible area in the workspace and highlights selected objects in color.
  • Instance data. View transformation data while you run the Debugger to debug a mapping.
  • Target data. View target data while you run the Debugger to debug a mapping.

2 comments: