Pages

Tuesday, 14 February 2017

What is MapReduce in Big Data ?

MapReduce is pivotal to big data as it has allowed the processing of massive datasets, which the earlier preferred format for data storage and analysis, RBDMS was not capable of doing.

In terms of big data analytics, the open-source framework of Hadoop has been a game changer. It has enabled storage of large datasets going up to petabytes in a cluster format, and faster processing of this distributed data. An essential feature of Hadoop is its ability for parallel processing which is executed via MapReduce.



MapReduce is defined as the distributed data processing and querying programming engine that effectively splits and spreads around the necessary computation activities on a dataset across a wide range of servers which are known as data clusters.  A query that needs to run through a mega data set may take hours if situated in one computer server. This is however cut down to minutes when done in parallel over a spread of servers.

The term MapReduce refers to two critical tasks it handles on the Hadoop Distributed File System (HDFS) – the Map Job and the Reduce Job. The Map function takes the different input data elements available and processes them into an output data element, creating key value pairs.  The Reduce function aggregates outputs  created under the key value pairs, put them back together quickly and reliably in order to produce the required end-result.

Structurally, MapReduce has a single master Job Tracker and several slaves Task Tracker, one each per cluster. The master distributes and schedules the tasks to these slaves and keeps track of the assigned jobs, redoing any that fail.  The slave tracker ensures that the assigned task is executed and communicates with the master

There are number of benefits of MapReduce that has made it an important element of Hadoop.
  • It allows developers to use any language like Java or C++ to write the applications, although Java is most preferred. 
  • MapReduce can handle all forms of data whether structured or unstructured. 
  • Another core feature of MapReduce is that it can easily run through petabytes of data stored in a data center due to its construct. 
  • The MapReduce framework is also highly flexible in case of failures. If one dataset fails but is available in another machine, it can index and use the alternate location.  

Today, there are several additional data processing engines like Pig or Hive that can be used to extract data from a Hadoop framework. These eliminate some of the complexity of MapReduce and make it easier to generate insights.

Will discuss more about Map reduce in our upcoming post.
Learn more about Pig and Hive here