Tuesday, April 26, 2011

A Hadoop Use Case

I started learning Hadoop recently, and to make the concepts sink in I tried relating them to my previous experience.

We had a system that would load trades (nothing but contracts between two parties to buy or sell some assets) and do some processing on those trades. Millions of trades would be loaded into the system on a daily basis. We had a requirement to keep 30 days' worth of trade data in the system for analytical processing and for scheduled reports for the business users. We were storing all 30 days of data in a database and faced numerous issues with performance and scalability. We finally invested heavily in database racks, hardware and software, and made it work with some changes to the requirements.

Now when I look back at this requirement, I feel that had I known about Hadoop then, I would certainly have used it to solve the problem with some commodity servers - with a much smaller investment, and it would have been a much more elegant solution.

How does Hadoop help in this use case?
Hadoop provides a distributed file system (HDFS) that can be used to store huge volumes of data running into terabytes or petabytes. Scalability, portability and reliability are inherent in its architecture.
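
To give a feel for it, here is a rough sketch (in Java, using the Hadoop FileSystem API) of how a day's trade extract could be copied into HDFS. The paths and the class name are just my assumptions for illustration, not anything from the actual system:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TradeFileLoader {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (core-site.xml etc.) from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy one day's trade extract from local disk into a dated HDFS directory
        Path localFile = new Path("/data/extracts/trades-2011-04-26.csv");
        Path hdfsDir = new Path("/trades/2011/04/26/");
        fs.copyFromLocalFile(localFile, hdfsDir);

        fs.close();
    }
}

The same copy can also be done from the shell with something like: hadoop fs -put /data/extracts/trades-2011-04-26.csv /trades/2011/04/26/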

Hadoop also provides MapReduce, a programming model for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.

So now we can store the trades in HDFS and use MapReduce to run the analytics jobs on the data stored there.
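
To make that concrete, here is a minimal sketch of a MapReduce job that sums the traded notional per instrument across whatever trade files are sitting in HDFS. I am assuming the trades are stored as simple CSV lines of the form tradeId,instrument,quantity,price,...; the field positions, class names and paths are purely illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TradeNotionalByInstrument {

    // Mapper: for each CSV trade record, emit (instrument, quantity * price)
    public static class TradeMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Text instrument = new Text();
        private final DoubleWritable notional = new DoubleWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 4) {
                return; // skip malformed records
            }
            try {
                instrument.set(fields[1]);
                notional.set(Double.parseDouble(fields[2]) * Double.parseDouble(fields[3]));
                context.write(instrument, notional);
            } catch (NumberFormatException e) {
                // skip records with bad numeric fields
            }
        }
    }

    // Reducer: sum the notionals for each instrument
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        private final DoubleWritable total = new DoubleWritable();

        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "trade notional by instrument");
        job.setJarByClass(TradeNotionalByInstrument.class);
        job.setMapperClass(TradeMapper.class);
        job.setCombinerClass(SumReducer.class); // summing is associative, so reuse the reducer
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job could then be run over a day's (or all 30 days') files with something like: hadoop jar trades.jar TradeNotionalByInstrument /trades/2011/04 /reports/notional-2011-04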

How does this architecture help?
a. Separation of concerns - this way we can design our database for OLTP and keep it normalized, and have the OLAP components run on the Hadoop cluster.
b. Improve Performance and Scalability - As the database will now hold only the transactional data (roughly 1/30th of what it held before), performance would be much better, and the application could scale better on the same set of hardware and software.
c. Better Analytics - We can now analyse more historical data, as HDFS and MapReduce are capable of handling data sets running into petabytes.
d. Save Cost - We can save cost not just on software licensing but also on hardware, as a Hadoop cluster can run on commodity servers.

These are some of the advantages that I can think of right away.

2 comments:

  1. If maintaining an OLTP db + Hadoop is an option, then why not go for an OLTP db + OLAP RDBMS instance itself? Btw, did you use partitioning?

    IMO SQL would perform better than MapReduce, especially for analytic queries.

  2. Using an OLAP RDBMS is definitely an option, but the disadvantage would be the cost of the hardware and software.

    A Hadoop cluster also gives you the flexibility to add more data to the cluster without having initially planned for it.

    Also, the Hadoop sub-project Hive gives you the ability to use an SQL-like query language to query data in HDFS. You may read about it here: http://ankur-sa.blogspot.com/2011/04/mapreduce-hive-overview.html
