
Thursday, 31 October 2013

What is Hadoop (HDFS and MapReduce)

Hadoop is an open-source software framework that was inspired by Google's MapReduce and the Google File System. It is now widely regarded as one of the leading solutions for dealing with big data.

When we talk about big data, it can be anything, such as pictures, movies and so on, and it consumes a huge amount of space.

In Hadoop, storage is provided by HDFS, which keeps copies of the data on more than one machine to prevent loss of data in case of failure. Analysis is provided by MapReduce (data processing), which supports ad hoc analysis: it runs a query against a huge dataset and returns the result in a reasonable amount of time.
HDFS and MapReduce are the two key components of Hadoop.

MapReduce works well on unstructured and semi-structured data, for example a web log file. Such data is not organized into relational tables like the tables in an Oracle database, and MapReduce finds these datasets easy to process. Pig and Hive are two of the higher-level languages built on top of MapReduce.

MapReduce consists mainly of two functions: a map function and a reduce function. It works on huge datasets and returns the desired results. A query that looks complicated can be expressed as a MapReduce job.

The first step is passing in the input data. As mentioned, MapReduce has two phases: the map phase (map function) and the reduce phase (reduce function). The input data is passed to the map phase. Take unstructured data as an example: the map function processes the input, extracts the required fields, and passes them on to the reduce phase. This removes a lot of unwanted records.

The output of the map function is passed on to the reduce phase. The reduce function then processes the data further and produces the final output from the mapped data, based on the logic of the job.
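To make the two phases concrete, here is a minimal sketch in plain Python. It does not use the Hadoop API; it only imitates the flow of map, grouping by key, and reduce on a single machine. The log format, the sample lines and the function names (map_fn, shuffle, reduce_fn, run_job) are all assumptions made up for this illustration.

```python
from collections import defaultdict

# Hypothetical web log lines in the form "<ip> <method> <path> <status>".
LOG_LINES = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /about.html 404",
    "malformed line",
    "10.0.0.1 GET /index.html 200",
    "10.0.0.3 POST /login 500",
    "10.0.0.2 GET /index.html 404",
]


def map_fn(line):
    """Map: keep only the status field and drop records that do not parse."""
    fields = line.split()
    if len(fields) != 4:
        return []  # unwanted record, filtered out in the map phase
    return [(fields[3], 1)]


def shuffle(pairs):
    """Group the map output by key, as happens between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, grouping and reduce over the input lines."""
    mapped = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


result = run_job(LOG_LINES)
print(result)  # {'200': 2, '404': 2, '500': 1}
```

Here the map function throws away the malformed line and every field except the status code, and the reduce function counts the requests for each status code. In a real Hadoop job the same two functions would run across many machines.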

Looking in more detail at how a MapReduce job works, we can see that Hadoop runs the job as a set of map tasks. The input to the job is divided into pieces called splits, and one map task is assigned to each split. The split size has an important effect on the time the job takes to produce its output. Ideally, the split size should be the size of an HDFS block.
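As a rough illustration of the relation between input size, split size and the number of map tasks, here is a small Python sketch. The 128 MB block size is only an assumed value (the block size is configurable), num_splits is a hypothetical helper, and the calculation ignores the finer details of how Hadoop actually computes splits.

```python
# Assumed HDFS block size for this sketch; the real value is configurable.
BLOCK_SIZE = 128 * 1024 * 1024


def num_splits(file_size, split_size=BLOCK_SIZE):
    """Return how many splits (and so map tasks) a file of file_size bytes needs."""
    if file_size <= 0:
        return 0
    # Ceiling division with integers: a partly filled last split still counts.
    return -(-file_size // split_size)


one_gb = 1024 * 1024 * 1024
print(num_splits(one_gb))      # 8
print(num_splits(one_gb + 1))  # 9
```

With the split size equal to the block size, each map task can read its whole split from a single block. A much smaller split size would mean many more map tasks for the same input.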

Below are some other key terms:

  •  Data locality optimization
Running a map task on the node where its input data resides in HDFS.
  •  Combiner function
Reduces the amount of data transferred between the map and reduce phases. It runs on the output of the map function and condenses it before it is passed on to the reduce function.
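The effect of a combiner can be sketched in plain Python as below. This is not Hadoop code: the two lists stand in for the map output of two splits, and combine and reduce_all are hypothetical names chosen for this example. The sketch assumes the reduce logic is a sum, which is the kind of operation a combiner suits because the partial results can be added up in any order.

```python
from collections import Counter


def combine(pairs):
    """Combiner: sum the values per key within one map task's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_all(pairs):
    """Reducer: sum the values per key across everything it receives."""
    return dict(combine(pairs))


# Made-up map output from two splits: (status code, count) pairs.
split_a = [("200", 1), ("200", 1), ("404", 1)]
split_b = [("200", 1), ("500", 1), ("500", 1)]

without_combiner = split_a + split_b
with_combiner = combine(split_a) + combine(split_b)

print(len(without_combiner), len(with_combiner))  # 6 4
print(reduce_all(with_combiner))  # {'200': 3, '404': 1, '500': 2}
```

Fewer records reach the reduce phase when the combiner is used, yet the final result is the same.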

