Pages

Thursday, 26 January 2017

Overview of Mongo DB 3.4 : New Features


Mongo DB has been wildly popular ever since its introduction for plenty of reasons. The biggest one was because it got rid of the Object-Relation Mapping to a large extent, which had been the source of trouble of programmers for years. Even today, it is the 5th most popular database. However, the graph of popularity of Mongo DB decreased somewhat over the years, due to introduction of more advanced and simplified NoSQL databases. This might change with the release of Mongo DB 3.4, released late last year. According to the company, they seek to attain a "digital transformation" with this release.

The clear message that the company gave with this release was that it is aiming to simplify the life of large enterprises that have depended upon Mongo DB for long now. Like Python, Mongo DB is aiming to evolve so that it alone suffices for tasks that earlier required multiple technologies. Since we have seen this formula succeeding more than once, we have to admit this is a very smart move from the company.

Graph support was the need of the hour in Mongo DB for quite some time now. Taking more than 3 years to become a reality, it is arguably the biggest addition in the new version. While it does not seem to pose any threat to established graph databases like Neo4J, the graph support is sure to simplify things for its existing users. This feature is sure to have large impact, as it will facilitate companies to explore hitherto doubted avenues like Deep Analytics, Internet of Things and Artificial Intelligence. This would be further aided by Atlas, Mongo DB's database cloud service released earlier last year.

Ecommerce websites working upon Mongo DB had toiled hard for long to provide decent search functionality to its customers. This ends with the faceted navigation feature, which uses filters to narrow down the query results. This ensures faster and more relevant search results. Also, a read-only mode was introduced that could expose the information of an application while preventing any modification. Another huge feature was the creation of Geo-distributed Mongo DB zones, which deals with the problem of data sovereignty and solves it by providing tagging via a higher abstraction of “zones”.

The release also had few things in store for the regular users. The new SQL interface is sure to greatly ease things for the users who have struggled for long to import their SQL code into Mongo. Mongo DB also introduced the ($switch) operator, which greatly simplifies complex branching, while making it more readable. Like the popular "switch" expression, it tests a number of cases, executing only the one that turns out to be true. Another addition was the ($reduce) operator, that could reduce the results of multiple arrays into a single expression.

Apart from this, there has been a whole array of other additions, whose actual importance would only be realized in the long run. This includes elastic clustering, tunable consistency and enhanced DBA.

Overall, this release has been quite impressive and an instant success. Mongo DB has made its intention very clear: It is here to stay and win. With this, other NoSQL providers like Redis and Cassandra as well as established SQL players like MySQL and Oracle will have to up their game.

Tuesday, 24 January 2017

Pig vs Hive: Main differences between Apache Pig and Hive

Delving into the big data and extracting insights from it requires robust tools that allow flexibility in data management and querying – filtering, aggregating, and analyses. Typically, MapReduce code is leveraged to do this but the complexity involved in writing intricate Java code to prepare MapReduce scripts led to new languages being created that allowed users to access datasets with more ease.




Pig was created by researchers at Yahoo, and has the flexibility of multiple query approach. Although somewhat similar to SQL the traditional language for data analysis in some ways, doesn’t have its declarative nature and has limitations like- being dependent on relational database schemas. Pig is more of a programming language, and is often referred to as an abstraction of the complicated syntax of Java programming required for MapReduce. Pig has has different semantics than Hive and Sql.

Hive (invented at Facebook) on the other hand is highly similar to SQL, as it uses almost the same commands for data manipulation, making particularly suitable for those experienced in use of SQL.

These two components of the Hadoop ecosystem work atop Hadoop. The goal of both these tools is to make it easier to interact with massive datasets within Hadoop without having to write out complex MapReduce code.


Understanding the differences between Pig and Hive  
There are several differentiating elements between the two languages, and big data users need to appreciate these differences to make use of the right tool:
  •  As Hive adopts SQL-based declarative approach it is often preferred for structured data especially historical data. It is therefore often referred to as a data warehouse platform.
Pig on the other hand uses Procedural Data Flow Language and is preferred for semi- structured, unstructured or decentralized data. The flexibility of Pig allows better construction of data flows and its feature of self-optimization results in lesser number of data scans.

  • Hive use distinct query language called HQL whereas Pig use their own language called piglatin (procedural language).
  • Partitioning can be done using HIVE whereas it’s not possible in in PIG
  • In terms of practical usage, Hive is preferred for reporting and operates on the server side of a cluster while Pig is great for writing programs and operates on the client side.
  • Given its characteristics, Pig is typically used by researchers and programmers but Hive is preferred by data scientists who work on large quantitative datasets.
  • Hive usually executes quickly but loads slowly whereas Pig loads faster and more effectively.

Adopting a standard approach to big data analytics would hamper benefits from it.  Both Pig and Hive have their own advantages that make them apt for some situations but not in others. Analysts must carefully examine the insight requirements before deciding on the tool to use.