In this post we will take a look at the different storage file formats and record formats in Hive.
Before we move forward, let's briefly discuss Apache Hive.
Apache Hive, a data warehouse system for Hadoop first created at Facebook, facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a means to project structure onto this data and query it using a SQL-like language called HiveQL.
Among the different storage file formats that are used in Hive, the default and simplest is the TEXTFILE.
TEXTFILE
The data in a TEXTFILE is stored as plain text, one line per record. The TEXTFILE is very useful for sharing data with other tools and also when you want to manually edit the data in the file. However, TEXTFILE is less efficient than the other formats.
SYNTAX:
CREATE TABLE TEXTFILE_TABLE (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS TEXTFILE;
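Because TEXTFILE data is plain text, you will usually also want to tell Hive how the fields in each line are separated. Here is a minimal sketch (the table name and the comma delimiter are just assumptions; use whatever separator your data actually has):
CREATE TABLE TEXTFILE_TABLE_CSV (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;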
SEQUENCE FILE
In sequence files the data is stored in a binary format consisting of binary key-value pairs, with a complete row stored as a single binary value. Sequence files are more compact than text and fit well with the MapReduce output format. Sequence files support block compression and can be compressed at the record (value) or block level to further improve their IO profile.
SEQUENCEFILE is a standard format that is supported by Hadoop itself and is a good choice for Hive table storage, especially when you want to integrate Hive with other technologies in the Hadoop ecosystem.
The STORED AS SEQUENCEFILE clause lets you create a sequence file table. Here is an example statement to create a table stored as a sequence file:
CREATE TABLE SEQUENCEFILE_TABLE (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS SEQUENCEFILE;
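To actually get the block compression mentioned above, the relevant settings have to be enabled in the session before the data is written. A sketch of the commonly used settings, assuming the data is being copied over from the text table in the earlier example:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE TABLE SEQUENCEFILE_TABLE
SELECT * FROM TEXTFILE_TABLE;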
Due to the complexity of reading sequence files, they are often
only used for “in flight” data such as intermediate data storage used within a
sequence of MapReduce jobs.
RCFILE OR RECORD COLUMNAR FILE
The RCFILE is another file format that can be used with Hive. The RCFILE stores the columns of a table in a record columnar format rather than in a row-oriented fashion, and provides considerable compression and query performance benefits with highly efficient storage space utilization. Hive added the RCFile format in version 0.6.0.
The RCFile format is most useful when tables have a large number of columns but only a few columns are typically retrieved.
The RCFile combines multiple functions to provide the following features:
- Fast data storing
- Improved query processing
- Optimized storage space utilization
- Dynamic data access patterns
SYNTAX:
CREATE TABLE RCFILE_TABLE (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS RCFILE;
A compressed RCFile reduces IO and storage significantly compared to text, sequence file, and other row formats. Compression on a per-column basis is more efficient here, since it can take advantage of the similarity of the data within a column.
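Note that an RCFile table is typically not loaded directly from raw text files; the usual pattern is to stage the data in a plain text table first and then copy it over with an INSERT ... SELECT, letting Hive rewrite it into the columnar layout. A sketch using the tables from the earlier examples:
INSERT OVERWRITE TABLE RCFILE_TABLE
SELECT COLUMN1, COLUMN2, COLUMN3, COLUMN4
FROM TEXTFILE_TABLE;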
ORC FILE OR OPTIMIZED ROW COLUMNAR FILE
ORCFILE stands for Optimized Row Columnar File. It is a newer Hive file format that was created to provide many advantages over the RCFILE format when processing data. The ORC file format was introduced in Hive 0.11 and cannot be used with earlier versions.
Lightweight indexes are included in ORC files to improve performance. ORC also uses type-specific encoders for different column data types to further improve compression, e.g. variable-length encoding for integers. ORC stores collections of rows in one file, and within each collection the row data is stored in a columnar format, allowing parallel processing of row collections across a cluster.
ORC files compress better than RC files, enabling faster queries. To use the format, just add STORED AS ORC to the end of your CREATE TABLE statement, like this:
CREATE TABLE mytable (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS ORC;
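ORC also exposes table properties that control its internal compression, which can be set when the table is created. A minimal sketch (the table name is hypothetical, and SNAPPY is just one of the supported codec names; ZLIB is the default):
CREATE TABLE mytable_orc_snappy (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");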