Saturday, 17 September 2016

How to Create/Change/Set Databases in Hive?

As discussed in previous posts, HIVE makes it easier for developers to port SQL-based applications to Hadoop, compared with other Hadoop languages and tools.


Hive is best suited for data warehouse applications where the data is static, fast response times are not required, and record-level inserts, updates, and deletes are not needed.

Creating a Database
The simplest syntax for creating a database in Hive is shown in the following example.
Go to the Hive shell by running the command sudo hive, then enter:


CREATE DATABASE <DATA BASE NAME>
EXAMPLE
HIVE> CREATE DATABASE HR_STAGING;

HIVE> CREATE DATABASE IF NOT EXISTS HR_STAGING;


By adding IF NOT EXISTS, we can suppress the error Hive raises when the database hr_staging already exists. The general syntax for creating a database in Hive is given below. The keyword ‘SCHEMA’ can be used instead of ‘DATABASE’ when creating a database.



CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] DATABASE_NAME
  [COMMENT DATABASE_COMMENT]
  [LOCATION HDFS_PATH]
  [WITH DBPROPERTIES (PROPERTY_NAME=PROPERTY_VALUE, ...)]
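Putting the optional clauses together, a database can be created with a comment, a custom HDFS location, and properties in a single statement. The path and property values below are only illustrative:

CREATE DATABASE IF NOT EXISTS HR_STAGING
  COMMENT 'Staging area for HR data'
  LOCATION '/user/hive/hr_staging.db'
  WITH DBPROPERTIES ('CREATOR' = 'XXXX', 'DATE' = '2016-09-17');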


The CREATE DATABASE command creates the database directory in HDFS under the default warehouse location: /user/hive/warehouse.
Hive creates a directory for each database, and tables in that database are stored in subdirectories of the database directory. The exception is the default database, which doesn’t have its own directory.
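You can check the directory Hive assigned to a database with DESCRIBE DATABASE, which prints the database name, its comment (if any), and its HDFS location:

HIVE> DESCRIBE DATABASE HR_STAGING;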

Syntax to list the databases that already exist in Hive:

  

HIVE> SHOW DATABASES;
HR_STAGING

HIVE> CREATE DATABASE EMP_STAGING;
HIVE> SHOW DATABASES;
HR_STAGING
EMP_STAGING

HIVE> SHOW DATABASES LIKE 'H.*';
HR_STAGING


Using a Database
The USE command sets a database as your working database, similar to changing working directories in a file system:

SYNTAX:
USE <DATABASE_NAME>
HIVE> USE HR_STAGING;
HIVE> USE DEFAULT;
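There is no command that shows which database is currently in use, but the Hive CLI can be told to display the current database as part of its prompt by enabling a standard CLI property:

HIVE> SET hive.cli.print.current.db=true;

With this set, the prompt changes to show the working database, e.g. hive (hr_staging)>.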


Dropping Database in Hive
Syntax to drop a database:

HIVE> DROP DATABASE IF EXISTS HR_STAGING;


Hive won’t allow you to drop a database if it contains tables. In that case, you must either drop the tables first or append the CASCADE keyword to the command, which causes Hive to drop the tables in the database before dropping the database itself.


DROP (DATABASE|SCHEMA) [IF EXISTS] DATABASE_NAME
[RESTRICT|CASCADE];

HIVE> DROP DATABASE IF EXISTS HR_STAGING CASCADE;



Alter Database in Hive
You can set key-value pairs in the DBPROPERTIES associated with a database using the ALTER DATABASE command. No other metadata about the database, including its name and directory location, can be changed:

ALTER (DATABASE|SCHEMA) DATABASE_NAME
SET DBPROPERTIES (PROPERTY_NAME=PROPERTY_VALUE, ...); 

ALTER (DATABASE|SCHEMA) DATABASE_NAME
SET OWNER [USER|ROLE] USER_OR_ROLE;

HIVE> ALTER DATABASE HR_STAGING
SET DBPROPERTIES ('EDITED-BY' = 'XXXX');
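The new property can then be verified with DESCRIBE DATABASE EXTENDED, which lists the DBPROPERTIES along with the database description:

HIVE> DESCRIBE DATABASE EXTENDED HR_STAGING;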




Monday, 20 June 2016

HIVE STORAGE FORMATS

In this post we will take a look at the different storage file formats and record formats in Hive.

Before we move forward, let’s briefly discuss Apache Hive.
Apache Hive, first created at Facebook, is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a means to project structure onto this data and query it using a SQL-like language called HiveQL.…read more on Hive here
  



Among the different storage file formats used in Hive, the default and simplest is the TEXTFILE.

TEXTFILE 
The data in a TEXTFILE is stored as plain text, one line per record. TEXTFILE is very useful for sharing data with other tools, and also when you want to manually edit the data in the file. However, TEXTFILE is less efficient than the other formats.

SYNTAX :

CREATE TABLE TEXTFILE_TABLE (
COLUMN1 STRING,
COLUMN2 STRING,
COLUMN3 INT,
COLUMN4 INT
) STORED AS TEXTFILE;
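By default Hive uses the Ctrl-A character (\001) as the field delimiter for text files. A ROW FORMAT clause can be added so the table reads and writes a more common layout such as comma-separated values (the table and column names here are illustrative):

CREATE TABLE CSV_TABLE (
COLUMN1 STRING,
COLUMN2 STRING,
COLUMN3 INT,
COLUMN4 INT
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;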


SEQUENCE FILE
In sequence files, the data is stored in a binary format consisting of binary key-value pairs; a complete row is stored as a single binary value. Sequence files are more compact than text files and fit well with the MapReduce output format. Sequence files support compression at the record or block level, which improves their IO profile further.

SEQUENCEFILE is a standard format supported by Hadoop itself and is a good choice for Hive table storage, especially when you want to integrate Hive with other technologies in the Hadoop ecosystem.

The STORED AS SEQUENCEFILE clause lets you create a sequence file table. Here is an example statement to create a table using a sequence file:




CREATE TABLE SEQUENCEFILE_TABLE (
COLUMN1 STRING,
COLUMN2 STRING,
COLUMN3 INT,
COLUMN4 INT
) STORED AS SEQUENCEFILE;

Due to the complexity of reading sequence files, they are often only used for “in flight” data such as intermediate data storage used within a sequence of MapReduce jobs.
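To actually get block compression when writing to a sequence file table, compression has to be enabled before inserting data. The following Hive and Hadoop settings are the usual way to do this (the Gzip codec here is an illustrative choice):

HIVE> SET hive.exec.compress.output=true;
HIVE> SET io.seqfile.compression.type=BLOCK;
HIVE> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;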




RCFILE OR RECORD COLUMNAR FILE

The RCFILE is another file format that can be used with Hive. RCFILE stores the columns of a table in a record columnar format rather than in a row-oriented fashion, and provides considerable compression and query performance benefits together with highly efficient storage space utilization. Hive added the RCFile format in version 0.6.0.

The RCFile format is most useful when tables have a large number of columns but only a few of those columns are typically retrieved.
  
The RCFile combines multiple functions to provide the following features:
  • Fast data storing
  • Improved query processing
  • Optimized storage space utilization
  • Adaptability to dynamic data access patterns
SYNTAX:

CREATE TABLE RCFILE_TABLE (
COLUMN1 STRING,
COLUMN2 STRING,
COLUMN3 INT,
COLUMN4 INT ) STORED AS RCFILE;

Compressed RCFiles reduce IO and storage significantly compared with text, sequence file, and other row-oriented formats. Compression on a per-column basis is more efficient here, since it can take advantage of the similarity of the data within a column.


ORC FILE OR OPTIMIZED ROW COLUMNAR FILE
ORCFILE stands for Optimized Row Columnar file, a newer Hive file format created to provide many advantages over the RCFILE format when processing data. The ORC file format was introduced in Hive 0.11 and cannot be used with earlier versions.

Lightweight indexes are included within ORC files to improve performance.
ORC also uses type-specific encoders for the different column data types to further improve compression, e.g. variable-length encoding for integers.
ORC stores collections of rows in one file, and within each collection the row data is stored in a columnar format, allowing parallel processing of row collections across a cluster.

ORC files compress better than RC files and enable faster queries. To use the format, just add STORED AS ORC to the end of your CREATE TABLE statement, like this:

CREATE TABLE ORCFILE_TABLE (
COLUMN1 STRING,
COLUMN2 STRING,
COLUMN3 INT,
COLUMN4 INT
) STORED AS ORC;
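ORC behaviour can also be tuned per table through TBLPROPERTIES; for example, the compression codec can be chosen at table creation time. SNAPPY below is an illustrative choice (ZLIB is the default):

CREATE TABLE ORC_SNAPPY_TABLE (
COLUMN1 STRING,
COLUMN2 STRING,
COLUMN3 INT,
COLUMN4 INT
) STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');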



