Storm is a big-data framework similar to Hadoop, used to handle enormous amounts of real-time data. In short, it is an open-source, real-time computation system that can be used with any programming language; it is written in Clojure and supports Java by default. It can be used to perform activities such as:
i. Analyzing real-time data.
ii. Online machine learning.
iii. Continuous computation tasks.
iv. Distributed remote procedure calls.
v. ETL (extract, transform, load).
Storm performs real-time data interpretation and analysis. It is quite similar to Hadoop, but Hadoop performs offline (batch) data analytics and has its own programming paradigm and filesystem, which Storm does not. Storm is all about obtaining data from sources known as spouts and passing it through processing components called bolts. Storm's processing mechanism is extremely fast and well suited to real-time analytics. In some use cases, Hadoop and Storm can be combined elegantly to design a system that performs both real-time analytics and long-term pattern recognition.
Storm Topologies
Storm programs are organized as topologies. A topology is basically a computation graph: each node in the topology has its own processing logic, and the links between nodes indicate how data should be passed between them. A Storm topology can be run as below:
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
You can see the class "backtype.storm.MyTopology", which defines the topology and submits it to Nimbus, Storm's master daemon. The "storm jar" command takes care of connecting to Nimbus and uploading the jar.
Storm Clusters
A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run "jobs", on Storm you run "topologies". "Jobs" and "topologies" are themselves very different: a job eventually finishes, whereas a topology processes messages forever (or until you kill it).
There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a daemon called "Nimbus" that is similar to Hadoop's "JobTracker". Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster. Additionally, the Nimbus daemon and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper or on local disk. This design leads to Storm clusters being incredibly stable.
[Figure: a Storm cluster — Nimbus coordinating Supervisor worker nodes through Zookeeper.]
Streams
The core abstraction in Storm is the "stream". A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way.
The basic primitives Storm provides for doing stream transformations are "spouts" and "bolts". Spouts and bolts have interfaces that you implement to run your application-specific logic.
A spout is a source of streams. For example, a spout may read tuples off of a Kestrel queue and emit them as a stream. Or a spout may connect to the Twitter API and emit a stream of tweets.
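The spout idea can be illustrated with a plain-Java sketch (this is not the Storm API; the class name `WordSpout` is illustrative, though the method name mirrors Storm's `nextTuple()`). The spout simply hands out the next tuple on demand, backed by whatever source it wraps:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy sketch of a spout: a source that emits tuples one at a time.
// The in-memory queue stands in for an external source such as a
// Kestrel queue or the Twitter API; no Storm code is used here.
public class WordSpout {
    private final Queue<String> source = new ArrayDeque<>();

    public WordSpout(String... words) {
        for (String w : words) source.add(w);
    }

    // Called repeatedly by the framework; returns null when the source is idle.
    public String nextTuple() {
        return source.poll();
    }

    public static void main(String[] args) {
        WordSpout spout = new WordSpout("storm", "hadoop", "zookeeper");
        String tuple;
        while ((tuple = spout.nextTuple()) != null) {
            System.out.println(tuple);
        }
    }
}
```

In real Storm, the framework polls the spout continuously, which is what makes the resulting stream unbounded.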
A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Complex stream transformations, like computing a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts. Bolts can do anything: run functions, filter tuples, perform streaming aggregations and joins, talk to databases, and more.
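Two of those transformations — filtering and aggregation — can be sketched in plain Java (again, not the Storm API; `filterBolt` and `countBolt` are illustrative names) as a pair of chained bolts:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of two chained bolts. filterBolt drops tuples that fail a
// predicate; countBolt keeps a running word -> count aggregation.
public class BoltSketch {
    // A bolt that only passes through words longer than three characters.
    static List<String> filterBolt(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String w : in) if (w.length() > 3) out.add(w);
        return out;
    }

    // A bolt that performs a streaming-style aggregation over its input.
    static Map<String, Integer> countBolt(List<String> in) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : in) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> stream = List.of("the", "storm", "the", "storm", "bolt");
        // Chaining the bolts: the output stream of one is the input of the next.
        System.out.println(countBolt(filterBolt(stream)));
    }
}
```

In Storm the chaining is expressed by subscribing one bolt to the stream another bolt emits, rather than by a direct method call.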
Networks of spouts and bolts are packaged into a "topology" which is the top-level abstraction that you submit to Storm clusters for execution. A topology is a graph of stream transformations where each node is a spout or bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that stream.
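The subscription behavior can be sketched in plain Java (not the Storm API; `StreamFanOut` is an illustrative name): emitting a tuple delivers a copy to every bolt subscribed to that stream, which is exactly what an edge in the topology graph represents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy sketch of stream subscription: each edge in the topology graph is a
// (stream, bolt) pair, and emitting fans the tuple out to all subscribers.
public class StreamFanOut {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // Edge in the topology graph: the given bolt subscribes to the stream.
    public void subscribe(String stream, Consumer<String> bolt) {
        subscribers.computeIfAbsent(stream, k -> new ArrayList<>()).add(bolt);
    }

    // Emitting a tuple delivers it to every subscribed bolt.
    public void emit(String stream, String tuple) {
        for (Consumer<String> bolt : subscribers.getOrDefault(stream, List.of())) {
            bolt.accept(tuple);
        }
    }

    public static void main(String[] args) {
        StreamFanOut topo = new StreamFanOut();
        topo.subscribe("tweets", t -> System.out.println("count-bolt saw: " + t));
        topo.subscribe("tweets", t -> System.out.println("store-bolt saw: " + t));
        topo.emit("tweets", "hello storm"); // both bolts receive the tuple
    }
}
```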
Each node in a Storm topology executes in parallel. In your topology, you can specify how much parallelism you want for each node, and then Storm will spawn that number of threads across the cluster to do the execution.
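A rough plain-Java sketch of that idea (not the Storm API; `processAll` and its parameters are illustrative): a parallelism hint of N means N threads run the same node's logic over the incoming tuples.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy sketch of per-node parallelism: the thread pool size plays the role
// of the parallelism hint you would give Storm for a spout or bolt.
public class ParallelBolt {
    static int processAll(List<String> tuples, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        AtomicInteger processed = new AtomicInteger();
        for (String t : tuples) {
            pool.submit(() -> {
                // The node's processing logic would go here; we just count tuples.
                processed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processAll(List.of("a", "b", "c", "d"), 2)); // prints 4
    }
}
```

The real difference in Storm is that those threads are spread across worker processes on many machines, not confined to one JVM.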
A topology runs forever, or until you kill it. Storm will automatically reassign any failed tasks. Additionally, Storm guarantees that there will be no data loss, even if machines go down and messages are dropped.
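The no-data-loss guarantee rests on an acknowledgement scheme: a tuple is replayed until its processing is acked. A plain-Java sketch of that at-least-once idea (this is not Storm's actual acking protocol; `ReplayQueue` and `failFirstN` are illustrative):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy sketch of at-least-once delivery: a tuple stays pending until it is
// acked, so a failed attempt is replayed rather than lost.
public class ReplayQueue {
    static List<String> deliverAll(List<String> tuples, int failFirstN) {
        Queue<String> pending = new ArrayDeque<>(tuples);
        List<String> delivered = new ArrayList<>();
        int failures = 0;
        while (!pending.isEmpty()) {
            String t = pending.poll();
            if (failures < failFirstN) {
                failures++;
                pending.add(t); // not acked: replay later instead of dropping
            } else {
                delivered.add(t); // acked: processing succeeded
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Even with two simulated failures, every tuple is eventually delivered
        // (although possibly in a different order than it was emitted).
        System.out.println(deliverAll(List.of("a", "b", "c"), 2));
    }
}
```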