Thursday, 11 May 2017

The Rise and Rule of Cassandra

The world of databases is dark and full of terrors. We spent decades working happily with relational databases, until realizing one day that relations are not enough. This paved the way for NoSQL databases, or alternatively, any database that does not use tables. These new databases were shiny and cool, but could not match the massive power that Oracle and SQL Server wielded. This changed with the arrival of Cassandra.

When two engineers at Facebook decided in 2008 to build a database different from other NoSQL databases, they only wanted to have their own database. Little did they know of the impact they would have on the industry.

Cassandra was introduced with a sole objective: to solve the crisis of scalability. And it managed that beautifully. In fact, it has consistently been cited as the only NoSQL database that can take as many machines as could be added to it, without breaking a sweat. In 2012, a group of researchers from the University of Toronto declared that as far as scalability goes, there is no match for Cassandra. But this has not been the only reason behind the vast popularity of Cassandra.

Cassandra prides itself on having no "single point of failure", which means that there is no single component whose failure can shut down the whole database. In a world where transactions are carried out every second, this robustness is of vital importance. Many talk about decentralization, but nobody does it better than Cassandra.

But having a couple of advantages does not make you better, not in a world of ruthless competition. MongoDB, Redis and others would not be amused by a database that would take away their market with a couple of features. This is why Cassandra tried to achieve perfection. Its fault-tolerant and decentralized nature makes it extremely durable, and thus the perfect choice for organizations that cannot afford to lose even an ounce of data. Its throughput increases linearly as nodes are added, which makes it extremely desirable for databases that are growing constantly. With all these features, it is no surprise that Cassandra is trusted by some of the biggest names in the industry, including CERN, eBay, Instagram, GoDaddy, Netflix and Reddit. In fact, Apple's deployment of Cassandra stores a whopping 10 PB of data across 75,000 (and growing) nodes. Cassandra can give lessons on scalability to every other non-relational database.

That said, Cassandra is still not the most popular database around; it is not even the most popular NoSQL database right now. There are a few things Cassandra still needs to improve, including simpler deployment, simpler operational maintenance and a better web interface, among other things. There is still the issue of unpredictable performance (which has been partially reduced, but never solved) and the needless complexity of the APIs in the client libraries. But Cassandra is growing strong, and the time is not far off when it will be a common name among all DB designers.

Tuesday, 14 February 2017

What is MapReduce in Big Data?

MapReduce is pivotal to big data because it allows the processing of massive datasets, something the previously preferred technology for data storage and analysis, the RDBMS, was not capable of doing.

In terms of big data analytics, the open-source framework of Hadoop has been a game changer. It has enabled the storage of large datasets, going up to petabytes, across a cluster, and faster processing of this distributed data. An essential feature of Hadoop is its ability to process data in parallel, which is done via MapReduce.

MapReduce is the distributed data processing and querying engine that splits the computation on a dataset and spreads it across a large group of servers, known as a cluster. A query that needs to run through a huge dataset may take hours on a single server. This is cut down to minutes when the work is done in parallel across many servers.

The term MapReduce refers to the two critical tasks it performs on data stored in the Hadoop Distributed File System (HDFS): the Map job and the Reduce job. The Map function takes the input data elements and processes them into output elements in the form of key-value pairs. The Reduce function aggregates the outputs that share a key, putting them back together quickly and reliably to produce the required end result.
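
The two steps can be illustrated with the classic word count. This is only a minimal single-machine sketch in Python, not Hadoop's own API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the grouping step that sits between Map and Reduce is simulated with a plain dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, standing in for the step between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run Map over every line, group by key, then Reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster, each call to map_phase would run on a different server against its own slice of the input, and each key group would be handed to a reducer somewhere else; the logic of the two functions stays the same.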

Structurally, MapReduce has a single master, the JobTracker, and several slaves, the TaskTrackers, one per node in the cluster. The master distributes and schedules the tasks to these slaves and keeps track of the assigned jobs, redoing any that fail. Each slave TaskTracker ensures that its assigned task is executed and communicates with the master.
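
The master's habit of redoing failed tasks can be pictured with a toy scheduler. The sketch below is an assumption-laden simplification and not how the JobTracker is actually implemented: run_job, healthy_worker and broken_worker are hypothetical names, workers are plain Python functions, and a failure is modelled as a RuntimeError.

```python
def run_job(tasks, workers, max_attempts=3):
    """Hand each task to a worker; on failure, retry it on the next worker.

    tasks   -- dict mapping a task id to its input
    workers -- list of callables standing in for slave nodes
    Returns (results, attempts), where attempts counts the tries per task.
    """
    results = {}
    attempts = {}
    for index, (task_id, payload) in enumerate(tasks.items()):
        for attempt in range(max_attempts):
            # Rotate through the workers so a retry lands on a different one.
            worker = workers[(index + attempt) % len(workers)]
            attempts[task_id] = attempt + 1
            try:
                results[task_id] = worker(payload)
                break
            except RuntimeError:
                continue  # the master notes the failure and reschedules
        else:
            raise RuntimeError(f"task {task_id} failed after {max_attempts} attempts")
    return results, attempts


def healthy_worker(value):
    """Hypothetical worker that completes its task."""
    return value * 2


def broken_worker(value):
    """Hypothetical worker that always fails."""
    raise RuntimeError("worker down")
```

Calling run_job({"a": 1, "b": 2}, [broken_worker, healthy_worker]) first tries task "a" on the broken worker, then reschedules it on the healthy one, which is the behaviour described above in miniature.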

There are a number of benefits that have made MapReduce an important element of Hadoop.
  • It allows developers to write applications in languages such as Java or C++, although Java is the most commonly preferred.
  • MapReduce can handle all forms of data, whether structured or unstructured.
  • Another core feature of MapReduce is that, by design, it can easily run through petabytes of data stored in a data center.
  • The MapReduce framework is also highly resilient to failures. If one copy of the data is unavailable but another copy exists on a different machine, it can locate and use the alternate copy.
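
The last point, falling back to another copy of the data, can be sketched in a few lines of Python. This is an illustration only: read_block and the replica layout are hypothetical, and each machine's storage is modelled as a simple dictionary.

```python
def read_block(block_id, replicas):
    """Return (node_name, data) from the first replica that still holds the block.

    replicas -- list of (node_name, storage) pairs, where storage is a dict
                standing in for the blocks available on that machine
    """
    for node_name, storage in replicas:
        data = storage.get(block_id)
        if data is not None:
            return node_name, data
    raise LookupError(f"no replica holds block {block_id!r}")
```

For example, if the first machine has lost the block, read_block("blk-7", [("node-1", {}), ("node-2", {"blk-7": b"rows"})]) skips node-1 and reads the data from node-2.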

Today, there are several additional data processing engines like Pig or Hive that can be used to extract data from a Hadoop framework. These eliminate some of the complexity of MapReduce and make it easier to generate insights.

We will discuss more about MapReduce in our upcoming post.
Learn more about Pig and Hive here
