Tuesday, April 6, 2010

Cloudera for Hadoop, Hive & Pig

So, I am finishing up my Cloudera training for Hadoop development. My instructor was great! I might be biased, as I have worked with him in the past, but he did a good job of drilling into all of the facets of the Hadoop ecosystem. I have one day left but already feel like it was totally worth it.

So why are you reading this? Perhaps by this sentence you aren't anymore, but if you are, I hope not to disappoint.

Is this a promotion for taking the Cloudera Hadoop training? Not really. If you are just looking to get your feet wet and catch up on what you have been missing these last couple of years, they have a lot of good online resources (the videos are great): http://www.cloudera.com/resources/?type=Training

I also recommend Tom White's book on Hadoop, though you may not be the audience it was meant for. If you are a systems administrator or developer, it is definitely a must-read.

Now, for everyone else (business development, marketing, sales, analysts, scientists, stats folks), you just do not have a lot of resources yet to get you into the game.

What game is this?

It is the ability to do your job better than you have ever been able to do it before. Hadoop provides YOU the infrastructure to get all that data you always wanted, data that was either locked away across 5,000 databases or would have required "mountains to be moved" (and there are just not enough dentists on the planet for all the teeth pulling it takes with IT to get it done).

So the good news is that the tech is finally out there to bring about the aggregated information [when the data exists] that you want to slice and dice and predict/forecast/present on (etc., etc.).

Hadoop is still wrapped up in IT, but the IT folks now like getting at the data for you because it is much easier than it has ever been, and each time they do, it is cutting edge. It is cutting edge because there are still many moving parts and the ecosystem has yet to create "middleware".

Well, this is only partly true. There are some middleware projects going on, but there is not yet an event processing system to orchestrate and coordinate them. Cloudera has developed Sqoop (which allows DB import to Hadoop), and I suspect they are in the process of developing other data transfer and management components.

** insert proverbial crystal ball here **

I do not see this being in IT forever. There are projects already well underway and in mature use (Hive & Pig) that provide a higher-level abstraction over the Hadoop system. Now, while these are not really meant for non-tech folk (yet), they give the tech folk the ability to make your job of getting to this data easier. Hive and Pig were released by Facebook & Yahoo (respectively), and these "modules" hide all of that techie gobbledygook from you so you can get at your information. It is still a lot of gobbledygook, but it is pseudo code enough for you to sit with the tech folk and make sure you are on the same page about what data you are going after.
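
To give a feel for what that abstraction hides, here is a minimal sketch in plain Python. It is not Hive, Pig, or the Hadoop API, just an illustration of the map, shuffle and reduce steps that a one-line, SQL-like request such as "SELECT region, SUM(amount) FROM sales GROUP BY region" would roughly stand in for. The table, the column names and the sample records are all made up for this example.

```python
from collections import defaultdict

# Hypothetical sample records (region, sale amount); invented for illustration.
SALES = [
    ("east", 100),
    ("west", 250),
    ("east", 50),
    ("north", 75),
    ("west", 25),
]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Group all values that share a key (the step between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def total_sales_by_region(records):
    """Run the three phases in order, in memory, on a single machine."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(total_sales_by_region(SALES))
```

The point is the contrast: the one-line query is something you can read along with the tech folk, while the three functions are the kind of plumbing they would otherwise be writing (and on a real cluster, spreading across many machines).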

This is a HUGE step in the right direction toward the next layer of more cohesive (dare I say it) B.P.M.-type systems, along with B.I. and related reporting.

Now, it might take years for this all to come together and become "mainstream", but honestly I have to say that I am excited to be a part of it.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/