In this post we will take a look at the different storage file formats and record formats in Hive.
Before we move forward, let's briefly discuss Apache Hive.
Apache Hive, a data warehouse system for Hadoop first created at Facebook, facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a means to project structure onto this data and query it using a SQL-like language called HiveQL.
Among the different storage file formats that are used in Hive, the default and simplest is the TEXTFILE.
TEXTFILE
The data in a TEXTFILE is stored as plain text, one line per record. The TEXTFILE is very useful for sharing data with other tools and also when you want to manually edit the data in the file. However, TEXTFILE is less efficient than the other formats.
SYNTAX:
CREATE TABLE TEXTFILE_TABLE (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS TEXTFILE;
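Because TEXTFILE data is plain text, you will usually also want to tell Hive how the fields in each line are separated. Here is a minimal sketch (the table name and the comma delimiter are just assumptions; use whatever separator your data actually has):
CREATE TABLE TEXTFILE_TABLE_CSV (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;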
SEQUENCE FILE
In sequence files the data is stored in a binary format consisting of binary key-value pairs, with a complete row stored as a single binary value. Sequence files are more compact than text and fit well with the MapReduce output format. Sequence files support block compression and can be compressed at the record (value) or block level to further improve their IO profile.
SEQUENCEFILE is a standard format that is supported by Hadoop itself and is a good choice for Hive table storage, especially when you want to integrate Hive with other technologies in the Hadoop ecosystem.
The STORED AS SEQUENCEFILE clause lets you create a sequence file table. Here is an example statement to create a table stored as a sequence file:
CREATE TABLE SEQUENCEFILE_TABLE (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS SEQUENCEFILE;
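To actually get the block compression mentioned above, the relevant settings have to be enabled in the session before the data is written. A sketch of the commonly used settings, assuming the data is being copied over from the text table in the earlier example:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE TABLE SEQUENCEFILE_TABLE
SELECT * FROM TEXTFILE_TABLE;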
Due to the complexity of reading sequence files, they are often
only used for “in flight” data such as intermediate data storage used within a
sequence of MapReduce jobs.
RCFILE OR RECORD COLUMNAR FILE
The RCFILE is another file format that can be used with Hive. The RCFILE stores the columns of a table in a record columnar format rather than in a row-oriented fashion, and provides considerable compression and query performance benefits with highly efficient storage space utilization. Hive added the RCFile format in version 0.6.0.
The RCFile format is most useful when tables have a large number of columns but only a few columns are typically retrieved.
The RCFile combines multiple functions to provide the following features:
- Fast data storing
- Improved query processing
- Optimized storage space utilization
- Dynamic data access patterns
SYNTAX:
CREATE TABLE RCFILE_TABLE (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS RCFILE;
A compressed RCFile reduces IO and storage significantly compared to text, sequence file, and other row formats. Compression on a per-column basis is more efficient here, since it can take advantage of the similarity of the data within a column.
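Note that an RCFile table is typically not loaded directly from raw text files; the usual pattern is to stage the data in a plain text table first and then copy it over with an INSERT ... SELECT, letting Hive rewrite it into the columnar layout. A sketch using the tables from the earlier examples:
INSERT OVERWRITE TABLE RCFILE_TABLE
SELECT COLUMN1, COLUMN2, COLUMN3, COLUMN4
FROM TEXTFILE_TABLE;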
ORC FILE OR OPTIMIZED ROW COLUMNAR FILE
ORCFILE stands for Optimized Row Columnar File. It is a newer Hive file format that was created to provide many advantages over the RCFILE format when processing data. The ORC file format was introduced in Hive 0.11 and cannot be used with earlier versions.
Lightweight indexes are included in ORC files to improve performance. ORC also uses type-specific encoders for different column data types to further improve compression, e.g. variable-length encoding for integers. ORC stores collections of rows in one file, and within each collection the row data is stored in a columnar format, allowing parallel processing of row collections across a cluster.
ORC files compress better than RC files, enabling faster queries. To use the format, just add STORED AS ORC to the end of your CREATE TABLE statement, like this:
CREATE TABLE mytable (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS ORC;
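ORC also exposes table properties that control its internal compression, which can be set when the table is created. A minimal sketch (the table name is hypothetical, and SNAPPY is just one of the supported codec names; ZLIB is the default):
CREATE TABLE mytable_orc_snappy (
  COLUMN1 STRING,
  COLUMN2 STRING,
  COLUMN3 INT,
  COLUMN4 INT
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");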