Saturday 14 October 2017

Overview of the Scala Programming Language: How to Install Scala?

Scala ("scalable language") is a modern programming language created by Martin Odersky, influenced by Java, Ruby, Smalltalk, and others.

Scala smoothly and effortlessly integrates object-oriented and functional programming: it supports a Java-like object-oriented coding style and simultaneously supports a functional style.

Combining the strengths of the functional and imperative programming models, Scala is a great tool for building highly concurrent applications without surrendering the advantages of an object-oriented methodology.


Features of Scala Programming Language
  • Scala code runs on the JVM, which lets you use the wealth of Java libraries that have been developed over the years.
  • Object-oriented programming (OOP) features such as classes and inheritance are fully supported in Scala, which also adds features of its own such as traits, singleton objects, and case classes.
  • Every value is an object (variable definitions start with var, or val for immutable values) and every “operator” is a method. Scala is also a functional programming (FP) language, so you can pass functions (function definitions start with def) around as values; see the sketch after this list.
  • Scala has advanced language features and rich Java integration. You can write your Scala code using OOP, FP, or both.
  • Scala has an expressive syntax and static typing.
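
A minimal sketch of how these ideas fit together, assuming Scala 2.x syntax (the object and value names are just placeholders):

object Sketch extends App {
  // A mutable variable; use val instead for an immutable value
  var counter = 0
  counter += 1

  // A method definition starts with def
  def double(n: Int): Int = n * 2

  // Functions are values, so they can be passed to other functions
  val numbers = List(1, 2, 3, 4)
  val doubled = numbers.map(double)        // pass an existing method
  val evens   = numbers.filter(_ % 2 == 0) // or an anonymous function

  println(doubled) // List(2, 4, 6, 8)
  println(evens)   // List(2, 4)
}

Note how the object-oriented pieces (an object with methods) and the functional pieces (functions passed to map and filter) coexist in a few lines.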


Installing Scala Software
To install Scala on Unix-like systems, download the software from the Scala download page to a directory on your computer such as $HOME/scala, and then add these lines to your $HOME/.bash_profile file:
export SCALA_HOME=$HOME/scala
export PATH=$PATH:$SCALA_HOME/bin

To install Scala on Microsoft Windows, you can follow an equivalent process; check the Scala download page for more information.
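
Once the Scala bin directory is on your PATH, a tiny program is enough to verify the installation; the file and object names below are just placeholders.

// Hello.scala
object Hello {
  def main(args: Array[String]): Unit =
    println("Scala is installed and working")
}

Compile it with scalac Hello.scala and run it with scala Hello, or simply type scala on its own to experiment in the interactive REPL.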


Books to Read on Scala
  1. Programming in Scala: A Comprehensive Step-by-Step Guide by Martin Odersky, Lex Spoon, and Bill Venners
  2. Scala for the Impatient by Cay Horstmann
  3. Scala in Depth by Joshua D. Suereth
  4. Introduction to the Art of Programming Using Scala by Mark Lewis
  5. Atomic Scala by Bruce Eckel and Dianne Marsh
  6. Functional Programming in Scala by Paul Chiusano and Rúnar Bjarnason


Wednesday 5 July 2017

Book Recommendation: Hands-On Machine Learning with Scikit-Learn and TensorFlow

This hands-on book (Link: Hands-On Machine Learning with Scikit-Learn and TensorFlow) shows you how to:
  • Explore the machine learning landscape, particularly neural nets.
  • Use scikit-learn to track an example machine-learning project end-to-end.
  • Explore several training models, including support vector machines, decision trees, random forests, and ensemble methods.
  • Use the TensorFlow library to build and train neural nets.
  • Dive into neural net architectures, including convolutional nets, recurrent nets, and deep reinforcement learning. 
  • Learn techniques for training and scaling deep neural nets.
  • Apply practical code examples without acquiring excessive machine learning theory or algorithm details.

MAIN CONTENTS 
  • The Machine Learning Landscape
      • What Is Machine Learning?
      • Why Use Machine Learning?
      • Types of Machine Learning Systems
      • Main Challenges of Machine Learning
  • End-to-End Machine Learning Project
      • Prepare the Data for Machine Learning Algorithms
  • Neural Networks and Deep Learning
  • Up and Running with TensorFlow
  • Introduction to Artificial Neural Networks
  • Training Deep Neural Nets
  • Distributing TensorFlow Across Devices and Servers
  • Convolutional Neural Networks
  • Recurrent Neural Networks
  • Autoencoders

Thursday 11 May 2017

The Rise and Rule of Cassandra

The world of databases is dark and full of terrors. We spent decades working happily with relational databases, until realizing one day that relations are not enough. This paved the way for NoSQL databases: broadly, databases that do not use relational tables. These new databases were shiny and cool, but could not match the massive power that Oracle and SQL Server wielded. That changed with the arrival of Cassandra.

When two engineers at Facebook decided in 2008 to build a database different from the other NoSQL databases, they only wanted a database of their own. Little did they know the impact they would have on the industry.

Cassandra was introduced with a sole objective: to solve the crisis of scalability. And it managed that beautifully. In fact, it has consistently been cited as a NoSQL database that can take on as many machines as are added to it without breaking a sweat. In 2012, a group of researchers from the University of Toronto concluded that, as far as scalability goes, there is no match for Cassandra. But this has not been the only reason behind Cassandra's vast popularity.

Cassandra prides itself on having no "single point of failure", meaning there is no single component whose failure can shut down the whole database. In a world where transactions are carried out every second, this robustness is of vital importance. Many talk about decentralization, but nobody does it better than Cassandra.

But having a couple of advantages does not make you better, not in a world of ruthless competition. MongoDB, Redis, and the others would not be amused by a database taking away their market with a couple of features. This is why Cassandra tried to achieve perfection. Its fault-tolerant and decentralized nature makes it extremely durable, and thus a natural choice for organizations that cannot afford to lose even an ounce of data. Throughput increases roughly linearly as the cluster grows, which makes it extremely desirable for databases that are growing constantly. Given all these features, it is no surprise that Cassandra is trusted by some of the biggest names in the industry, including CERN, eBay, Instagram, GoDaddy, Netflix, and Reddit. In fact, Apple's deployment of Cassandra stores a whopping 10 PB of data across 75,000 (and growing) nodes. Cassandra can give lessons on scalability to every other non-relational database.

That said, Cassandra is still not the most popular database around; it is not even the most popular NoSQL database right now. There are a few inherent shortcomings that Cassandra needs to fix, including simplifying deployment, simplifying operational maintenance, and improving its web interface, among other things. There is still the issue of unpredictable performance (which has been partially mitigated but never solved) and the unnecessary complexity of the client library APIs. But Cassandra is growing strong, and the time is not far off when it will be a common name among database designers.

Tuesday 14 February 2017

What is MapReduce in Big Data?

MapReduce is pivotal to big data because it allows the processing of massive datasets that the earlier preferred technology for data storage and analysis, the RDBMS, was not capable of handling.

In terms of big data analytics, the open-source Hadoop framework has been a game changer. It enables storage of large datasets, up to petabytes, across a cluster, and faster processing of this distributed data. An essential feature of Hadoop is its capacity for parallel processing, which is executed via MapReduce.



MapReduce is a distributed data processing and querying programming model that splits the necessary computation over a dataset and spreads it across a wide range of servers, collectively known as a cluster. A query that needs to run through a huge dataset may take hours on a single server, but is cut down to minutes when done in parallel over a spread of servers. For example, scanning 1 TB at roughly 100 MB/s takes close to three hours on one machine, but under two minutes when the scan is spread evenly across 100 machines.

The term MapReduce refers to the two critical tasks it performs on data in the Hadoop Distributed File System (HDFS): the Map job and the Reduce job. The Map function takes the input data elements and processes them into intermediate key-value pairs. The Reduce function aggregates the outputs grouped under each key, putting them back together quickly and reliably to produce the required end result.
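
As a rough illustration of the two phases (plain Scala collections, not the actual Hadoop API), the word-count sketch below emits a (word, 1) key-value pair for every word in the map phase and sums the counts grouped under each key in the reduce phase:

object WordCountSketch extends App {
  val lines = List("big data is big", "map and reduce")

  // Map phase: emit a (word, 1) key-value pair for every word
  val mapped: List[(String, Int)] =
    lines.flatMap(_.split(" ")).map(word => (word, 1))

  // Shuffle: group the pairs by key (Hadoop does this between the two phases)
  val grouped: Map[String, List[(String, Int)]] = mapped.groupBy(_._1)

  // Reduce phase: sum the counts for each key
  val counts: Map[String, Int] = grouped.map { case (word, pairs) =>
    word -> pairs.map(_._2).sum
  }

  println(counts) // e.g. Map(big -> 2, data -> 1, is -> 1, map -> 1, and -> 1, reduce -> 1)
}

In a real Hadoop job the mapper and reducer are separate classes run on different machines, and the grouping step is handled by the framework's shuffle phase.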

Structurally, classic MapReduce has a single master JobTracker per cluster and several slave TaskTrackers, one per cluster node. The master distributes and schedules tasks to these slaves and keeps track of the assigned jobs, re-executing any that fail. Each slave TaskTracker ensures that its assigned tasks are executed and reports back to the master.

There are a number of benefits of MapReduce that have made it an important element of Hadoop:
  • It allows developers to write applications in a range of languages, such as Java or C++, although Java is the most common choice.
  • MapReduce can handle all forms of data whether structured or unstructured. 
  • Another core feature of MapReduce is that it can easily run through petabytes of data stored in a data center, thanks to its distributed design.
  • The MapReduce framework is also resilient to failures. If one copy of the data is unavailable but a replica exists on another machine, it can locate and use the alternate copy.

Today, there are several additional data processing tools, such as Pig and Hive, that can be used to query data stored in Hadoop. These hide some of the complexity of MapReduce and make it easier to generate insights.

We will discuss more about MapReduce in an upcoming post.
Learn more about Pig and Hive here

Thursday 26 January 2017

Overview of MongoDB 3.4: New Features


MongoDB has been wildly popular ever since its introduction, for plenty of reasons. The biggest is that it largely did away with object-relational mapping, which had been a source of trouble for programmers for years. Even today, it is the fifth most popular database. However, MongoDB's popularity has dipped somewhat over the years with the introduction of newer, simpler NoSQL databases. This might change with MongoDB 3.4, released late last year. According to the company, it seeks to enable a "digital transformation" with this release.

The clear message of this release is that the company aims to simplify life for the large enterprises that have long depended on MongoDB. Like Python, MongoDB is evolving so that it alone suffices for tasks that earlier required multiple technologies. Since we have seen this formula succeed more than once, we have to admit it is a very smart move.

Graph support had been needed in MongoDB for quite some time. Taking more than three years to become a reality, it is arguably the biggest addition in the new version. While it does not seem to pose any threat to established graph databases like Neo4j, graph support is sure to simplify things for existing users. The feature should have a large impact, as it will help companies explore avenues like deep analytics, the Internet of Things, and artificial intelligence. This will be further aided by Atlas, MongoDB's cloud database service released earlier last year.

E-commerce websites built on MongoDB have long toiled to provide decent search functionality to their customers. That ends with the faceted navigation feature, which uses filters to narrow down query results, ensuring faster and more relevant searches. A read-only mode was also introduced that can expose an application's data while preventing any modification. Another notable feature is geo-distributed MongoDB zones, which address the problem of data sovereignty by providing tagging through the higher-level abstraction of “zones”.

The release also has a few things in store for regular users. The new SQL interface should greatly ease things for users who have long struggled to bring their SQL code to Mongo. MongoDB also introduced the $switch aggregation operator, which greatly simplifies complex branching while making it more readable: like the familiar "switch" statement, it tests a number of cases and evaluates only the branch whose condition turns out to be true. Another addition is the $reduce operator, which applies an expression to each element of an array and combines the results into a single value.

Apart from this, there is a whole array of other additions whose real importance will only be realized in the long run, including elastic clustering, tunable consistency, and enhanced DBA tooling.

Overall, this release has been quite impressive and an instant success. MongoDB has made its intention very clear: it is here to stay and win. Other NoSQL providers like Redis and Cassandra, as well as established SQL players like MySQL and Oracle, will have to up their game.

Tuesday 24 January 2017

Pig vs Hive: Main differences between Apache Pig and Hive

Delving into big data and extracting insights from it requires robust tools that allow flexibility in data management and querying: filtering, aggregating, and analysis. Typically, MapReduce code is used to do this, but the complexity involved in writing intricate Java code for MapReduce jobs led to new languages that let users access datasets with more ease.




Pig was created by researchers at Yahoo and offers the flexibility of a multi-query approach. Although somewhat similar in places to SQL, the traditional language for data analysis, it does not share SQL's declarative nature or its limitations, such as dependence on a relational database schema. Pig is more of a programming language, and is often described as an abstraction over the complicated Java code required for MapReduce. Its semantics differ from those of Hive and SQL.

Hive (invented at Facebook), on the other hand, is highly similar to SQL, as it uses almost the same commands for data manipulation, making it particularly suitable for those experienced with SQL.

Both tools work atop Hadoop, and the goal of each is to make it easier to interact with massive datasets within Hadoop without having to write complex MapReduce code.


Understanding the differences between Pig and Hive  
There are several differentiating elements between the two languages, and big data users need to appreciate these differences to make use of the right tool:
  • As Hive adopts a SQL-based declarative approach, it is often preferred for structured data, especially historical data; it is therefore often referred to as a data warehouse platform.
Pig, on the other hand, uses a procedural data-flow language and is preferred for semi-structured, unstructured, or decentralized data. Pig's flexibility allows better construction of data flows, and its self-optimization feature results in fewer data scans.

  • Hive uses a distinct query language called HiveQL (HQL), whereas Pig uses its own procedural language called Pig Latin.
  • Partitioning can be done in Hive, whereas it is not possible in Pig.
  • In terms of practical usage, Hive is preferred for reporting and operates on the server side of a cluster, while Pig is great for writing programs and operates on the client side.
  • Given these characteristics, Pig is typically used by researchers and programmers, while Hive is preferred by data scientists working on large quantitative datasets.
  • Hive usually executes queries quickly but loads data slowly, whereas Pig loads data faster and more efficiently.

Adopting a one-size-fits-all approach to big data analytics would limit its benefits. Both Pig and Hive have advantages that make them apt for some situations but not others. Analysts must carefully examine their insight requirements before deciding which tool to use.
