What does this post cover?

Apache Hadoop, Big Data, HDFS, MapReduce, Introduction to Apache Storm.

This post covers the basic concepts, along with important links where you can learn more.
There are lots of articles on the web that explain each of these topics in detail, but my objective here is to give you a kick-start into the Big Data world.

Note: 1024 KB = 1 MB in binary units (1000 KB = 1 MB in SI units), and the ladder continues: GB, TB, PetaByte [PB], EB, ZB, YB.
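To make the unit ladder concrete, here is a small illustrative sketch (my own, not from any library) that converts a raw byte count into the largest sensible binary unit:

```python
# Walk the unit ladder using binary multiples (1 KB = 1024 bytes here;
# SI units would divide by 1000 instead).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes):
    """Convert a byte count to the largest sensible unit."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024

print(human_readable(1024))          # 1.0 KB
print(human_readable(5 * 1024**4))   # 5.0 TB
```

Big Data sizes live near the bottom of that list: a 5 TB data set is already more than most single machines comfortably handle.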

Big Data

Big Data – Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.
“much IT investment is going towards managing and maintaining big data”

Big Data refers to huge unstructured (and sometimes structured) data.

If the data were not huge, you really wouldn't need a cluster to store or process it. For a small amount of data, it's better to get the job done on a single node/server/computer than to distribute the work across many nodes and then collaborate to compute/aggregate the result; the distribution just adds more time and cost compared to a single node doing the job.

So Big means BIG: sizes in TB, PB and beyond.

Unstructured means the data has no predefined structure; you just dump data that you obtain from IoT (Internet of Things) devices, social media, or the like. Structured data is what we normally store in relational databases: there is a schema, a table structure with columns, etc., and you persist the structured data in the database. The unstructured, huge data is stored in a cluster.

So you have the data; now what? Analyse and process the data.

So Hadoop does { data storage + processing of the stored data (data analytics) }

Big data is a term that describes the large volume of data – both structured and unstructured – that overloads a business on a day-to-day basis. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

More on defining Big Data

Big data is best understood by considering four different properties: volume, velocity, variety, and veracity.

  • Volume is the most obvious property of big data, and the first that comes to most people's minds when they hear the term. Data is constantly being generated every day from a multitude of sources: data generated by people via social media, data generated by software itself (website tracking, application logs, and so on), and user-generated data, such as Wikipedia, only scratch the surface of sources of data. When people think volume, companies such as Google, Facebook, and Twitter come to mind. Sure, all deal with enormous amounts of data, and we're certain you can name others, but what about companies that don't have that volume of data? There are many other companies that, by definition of volume alone, don't have big data, yet these companies use Storm. Why? This is where the second V, velocity, comes into play.
  • Velocity deals with the pace at which data flows into a system, both in terms of the amount of data and the fact that it's a continuous flow of data.
  • Variety refers to the different types/sources of data, such as images, text, audio, video, etc.
  • Veracity involves the accuracy of incoming and outgoing data. Sometimes we need our data to be extremely accurate; other times, a "close enough" estimate is all we need.

Big data tools

Many tools exist that address the various characteristics of big data (volume, velocity,
variety, and veracity). Within a given big data ecosystem, different tools can be used in
isolation or together for different purposes:

  • Data processing—These tools are used to perform some form of calculation and
    extract intelligence out of a data set.
  • Data transfer—These tools are used to gather and ingest data into the data processing systems (or transfer data between different components of the system).
    Data transfer tools come in many forms, but the most common is a message bus (or a queue).
    Examples of data transfer tools include Kafka, Flume, Scribe, and Sqoop.
  • Data storage—These tools are used to store the data sets during various stages of
    processing. They may include distributed file-systems such as HDFS or GlusterFS as well as NoSQL data stores such as Cassandra.

Data processing tools fall into two primary classes:

  • Batch processing and
  • Stream processing.   [Apache Storm is an example]

More recently, a hybrid between the two has emerged: micro-batch processing within a stream.
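The difference between the two classes is easiest to see side by side. This is a hypothetical sketch of my own, contrasting the two styles on the same numbers: a batch job sees the whole data set at once, while a stream processor updates running state one event at a time.

```python
def batch_total(events):
    # Batch: the complete data set is available up front.
    return sum(events)

class StreamingTotal:
    # Stream: events arrive one at a time, potentially forever,
    # so we keep running state instead of waiting for "all" the data.
    def __init__(self):
        self.total = 0

    def on_event(self, value):
        self.total += value
        return self.total

events = [3, 1, 4, 1, 5]
print(batch_total(events))     # 14, computed in one shot

stream = StreamingTotal()
for value in events:
    running = stream.on_event(value)
print(running)                 # 14, reached incrementally
```

A batch job can be re-run over the same input; a stream processor never sees "the end" of its input, which is exactly the case Storm is built for.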

Apache Storm

Storm is a distributed, real-time computational framework that makes processing unbounded streams of data easy.

Storm is a stream processing tool

Storm comes with a framework called Trident that lets you perform micro-batching within a stream.
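The idea behind micro-batching can be shown without Storm at all. This simplified sketch (not Trident's actual API) chops an incoming stream into small fixed-size batches that are then processed as units; the batch size of 3 is an arbitrary choice for the example.

```python
def micro_batches(stream, batch_size):
    # Accumulate events until a batch fills up, then emit it.
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:               # flush the final partial batch
        yield batch

stream = range(1, 8)        # stand-in for an unbounded stream
print(list(micro_batches(stream, 3)))   # [[1, 2, 3], [4, 5, 6], [7]]
```

Processing per batch rather than per event trades a little latency for much better throughput, which is the micro-batching compromise Trident makes.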


“In short, Hadoop is an open-source software framework overseen by the ASF for storing and processing huge datasets with clusters of commodity hardware.“

It is mostly used for reliable, scalable, distributed computing but can also be used as a general-purpose file store capable of keeping petabytes of data. A large number of companies and organizations use Hadoop for both research and production.

Hadoop is an open-source software framework [by Apache Software Foundation] for

  • storing data and
  • running applications on clusters of commodity hardware.

Hadoop provides :-

  • massive storage for any kind of data    [for storing huge data sets not for small data]
  • enormous processing power and
  • the ability to handle virtually limitless concurrent tasks or jobs.

Cluster – a set of machines on a single LAN

Commodity hardware – computer hardware that is affordable and easy to obtain. Typically it is a low-performance system that is IBM PC-compatible and capable of running Microsoft Windows, Linux, or MS-DOS without requiring any special devices or equipment, e.g. our personal computers and laptops. It differs from a server in terms of reliability and cost. Let's say a 1 TB commodity machine costs just $500, whereas a 1 TB machine built to be used as a server costs $25,000. That explains it.
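A quick back-of-the-envelope check using the example figures above shows why clusters of commodity machines win on cost: even when Hadoop stores three replicas of every block (its default), commodity hardware stays far cheaper per usable terabyte than server-grade hardware.

```python
# Figures taken from the example above; replication factor 3 is the
# HDFS default.
COMMODITY_COST_PER_TB = 500
SERVER_COST_PER_TB = 25_000
REPLICATION_FACTOR = 3

# Storing 1 usable TB on commodity hardware means buying 3 TB of disk.
commodity_usable_tb_cost = COMMODITY_COST_PER_TB * REPLICATION_FACTOR
print(commodity_usable_tb_cost)                         # 1500
print(SERVER_COST_PER_TB // commodity_usable_tb_cost)   # ~16x cheaper
```

Hadoop's design accepts that cheap machines fail, and compensates in software with replication rather than paying for hardware reliability.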

Hadoop consists of two core components:

1. Hadoop Distributed File System (HDFS).

HDFS is a “specially designed file-system” for storing “huge data-sets” with “clusters” of “commodity hardware” with “streaming access pattern”.

  • HDFS is responsible for storing data on the Hadoop cluster.
  • It is basically the primary storage system used by Hadoop applications.
  • It creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations
  • Data is distributed across many machines at load time.
  • HDFS is optimized for large, streaming reads of files rather than random reads
  • Files in HDFS are written once and no random writes to files are allowed
  • Applications can read and write HDFS files directly via the Java API
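The block-and-replica layout described above can be sketched in a few lines. This is a hedged, simplified model of my own, not the Java API: the 128 MB block size and replication factor 3 are HDFS defaults, but the round-robin placement here stands in for HDFS's real (rack-aware) placement policy.

```python
BLOCK_SIZE_MB = 128
REPLICATION = 3

def plan_blocks(file_size_mb, nodes):
    """Return a list of (block_index, [nodes holding a replica])."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)   # ceiling division
    layout = []
    for b in range(num_blocks):
        # Place each replica on a distinct node (simplified round-robin).
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        layout.append((b, replicas))
    return layout

layout = plan_blocks(300, ["node1", "node2", "node3", "node4"])
for block, replicas in layout:
    print(block, replicas)
# A 300 MB file becomes 3 blocks (128 + 128 + 44 MB), each held on 3 nodes.
```

Losing any single node still leaves two copies of every block, which is what lets HDFS treat commodity-hardware failure as routine.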

2. MapReduce.  [MapReduce algorithm and its implementation]

  • MapReduce is a system used for distributed computing and processing of large amounts of data in the Hadoop cluster. [basically for data processing]
  • It is a programming model and software framework for writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes.
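The programming model is easiest to grasp through the classic word-count example. The sketch below runs everything in memory on one machine, purely to illustrate the three stages; a real Hadoop job runs the same logic in parallel across the cluster, with the framework handling the shuffle.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all pairs by key (done by the framework in Hadoop).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine the grouped values for each key.
    return (key, sum(values))

lines = ["big data is big", "data is everywhere"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)   # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The key property is that map and reduce are both embarrassingly parallel: mappers work on independent input splits and reducers on independent keys, which is what lets MapReduce scale across thousands of nodes.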




What is Hadoop? History of Hadoop? Why Hadoop? – Read this article

What is BigData , its business significance? – Read this article

Hadoop topics that you can check out :-

Introduction to Hadoop

  • How data got big
  • Hadoop: The big data solution

Hadoop Core

  • Distributed file systems
  • The MapReduce paradigm

Hadoop Ecosystem

  • Flume
  • Sqoop
  • HBase
  • Hive
  • Oozie
  • Pig
  • Mahout
  • Drill
  • Spark

Hadoop Use Cases

  • Hadoop workflow
  • Data warehouse optimization use case
  • Improving log analysis use case
  • Recommendation engines use case





