Extraction, Transformation & Loading:
A data warehouse does not generate any data; instead it is populated from various transactional or operational data stores. The process of importing transactional data into the warehouse and manipulating it there is referred to as Extraction, Transformation and Loading (ETL). Informatica is a widely used ETL tool for extracting source data and loading it into a target with the required transformations. SQL Server supplies an excellent ETL tool known as Data Transformation Services (DTS) in version 2000 and SQL Server Integration Services (SSIS) in version 2005.
In addition to importing data from various sources, ETL is also responsible for transforming data into a dimensional model. Depending on your data sources, the import process can be relatively simple or very complicated. For example, some organizations keep all of their data in a single relational engine, such as SQL Server. Others could have numerous systems that might not be easily accessible. In some cases you might have to rely on scanned documents or scrape report screens to get the data for your warehouse. In such situations you should bring all data into a common staging area first and then transform it into a dimensional model.
A staging area also provides a good place for assuring that your ETL is working correctly before data is loaded into dimension and fact tables. So your ETL could be made up of multiple stages:
- Import data from various data sources into the staging area.
- Cleanse data of inconsistencies (this can be either an automated or a manual effort).
- Ensure that row counts of imported data in the staging area match the counts in the original data source.
- Load data from the staging area into the dimensional model.
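The stages above can be sketched end to end. The following is a minimal illustration, not a production pipeline: it uses an in-memory SQLite database to stand in for both the source system and the staging area, and all table and column names (`src_orders`, `stg_orders`, `fact_orders`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "Source" data, as it might arrive from an operational system
# (note the amounts are stored as strings, some with stray spaces).
cur.execute("CREATE TABLE src_orders (id INTEGER, amount TEXT)")
cur.executemany("INSERT INTO src_orders VALUES (?, ?)",
                [(1, "10.50"), (2, "  20.00"), (3, "5.25")])

# Stage 1: import into the staging area (here, just another table).
cur.execute("CREATE TABLE stg_orders AS SELECT * FROM src_orders")

# Stage 2: cleanse -- trim whitespace left over from the source extract.
cur.execute("UPDATE stg_orders SET amount = TRIM(amount)")

# Stage 3: row-count audit -- staging must match the source before loading.
src_count = cur.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
stg_count = cur.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
assert src_count == stg_count, "row counts diverged during import"

# Stage 4: load into the dimensional model with the final data types.
cur.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL)")
cur.execute("INSERT INTO fact_orders "
            "SELECT id, CAST(amount AS REAL) FROM stg_orders")
```

The row-count check is deliberately a hard failure: if the audit does not pass, nothing should reach the dimension and fact tables.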
ETL resolves the inconsistencies in entity and attribute naming across multiple data sources. For example, the same entity could be called customers, clients, prospects or consumers in various data stores. Furthermore, attributes such as address might be stored as three or more different columns (address line 1, address line 2, city, state, county, postal code, country and so forth). Each column can also be abbreviated or spelled out completely, depending on the data source. Similarly, there might be differences in data types, such as storing date and time as a string, number or date. During the ETL process, data is imported from various sources and given a common shape.
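One common way to give the data this shape is a mapping table from each source-specific column name to a single conformed name, plus per-source type conversions. The sketch below assumes two hypothetical source records and made-up column names; the mapping dictionaries are illustrative, not a standard.

```python
from datetime import datetime

# Two sources that name the same entity and attributes differently,
# and store the order date as a string in different formats.
source_a = {"client_name": "Acme", "addr1": "1 Main St", "order_dt": "2023-07-01"}
source_b = {"customer": "Acme", "address_line1": "1 Main St", "order_date": "01/07/2023"}

# Map every source-specific column to one conformed name.
COLUMN_MAP = {
    "client_name": "customer_name", "customer": "customer_name",
    "addr1": "address_line1", "address_line1": "address_line1",
    "order_dt": "order_date", "order_date": "order_date",
}

# Per-source date formats for converting strings to real date values.
DATE_FORMATS = {"order_dt": "%Y-%m-%d", "order_date": "%d/%m/%Y"}

def conform(record):
    """Rename columns and convert string-typed dates to date values."""
    out = {}
    for col, value in record.items():
        if col in DATE_FORMATS:
            value = datetime.strptime(value, DATE_FORMATS[col]).date()
        out[COLUMN_MAP[col]] = value
    return out
```

After conforming, records from both sources have identical column names and data types, so they can be loaded into the same dimension table.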
There
are some over-arching themes in successful ETL system deployments regardless of
which tools and technologies are used. Most important — and most frequently
neglected — is the practice of designing the ETL system before development
begins. Too often we see systems that just evolved without any initial
planning. These systems are inefficient and slow, they break down all the time,
and they're unmanageable. The data warehouse team has no idea how to pinpoint
the bottlenecks and problem areas of the system.
List of the most popular ETL tools:
- Informatica - Power Center
- IBM - WebSphere DataStage (Formerly known as Ascential DataStage)
- SAP - BusinessObjects Data Integrator
- IBM - Cognos Data Manager (Formerly known as Cognos DecisionStream)
- Microsoft - SQL Server Integration Services
- Oracle - Data Integrator (Formerly known as Sunopsis Data Conductor)
- SAS - Data Integration Studio
- Oracle - Warehouse Builder
- AB Initio
- Information Builders - Data Migrator