Monday, June 14, 2010

All Things Hadoop Blog And Podcast

I started a new blog specifically for my interest in Hadoop Besides this just being a Hadoop focused blog I also have a Podcast going with both informative content in regards to Hadoop as well as interviews with Hadoop professionals. I also have a twitter account going too just for Hadoop related tweets

Life is truly an adventure... sometimes you are the bear and sometimes you are the salmon =8^)

Joe Stein

Tuesday, April 6, 2010

Cloudera for Hadoop, Hive & Pig

So, I am finishing up my Cloudera training for Hadoop development. My instructor was great! I might be biased as I worked with him in the past but he did a good job in helping to drill into all of the facets of the Hadoop ecosystem. I have one day left but already feel like it was totally worth it.

So why are you reading this? Perhaps by this sentence you aren't anymore but if you are I hope to not disappoint.

Is this a promotion for taking the Cloudera Hadoop training? If you are looking to just get your feet wet and catch up to what you have been missing these last couple of years then they have a lot of good online resources (the videos are great)

I also recommend Tom White's book on Hadoop though you may not be the audience it was meant for. If you are a systems administrator or developer then definitely it is a must read.

Now for everyone else (bus dev, marketing, sales, analysts, scientists, stats folks) you just do not have a lot of resources yet to get you into the game.

What game is this?

It is the ability for you to do your job better than you have ever been able to-do it before. Hadoop, provides YOU the infrastructure to get all that data you always wanted that was either locked away across 5,000 databases or would have required "mountains to be moved" and there are just not enough dentists on the planet for all the teeth pulling it takes with IT to get it done.

So the good news is that the tech is finally out there to help bring about the aggregated information [when data exists] that you want to slice and dice and predict/forecast/present on (etc, etc).

Hadoop is still wrapped up in IT but the IT folks now like getting at the data for you because it is much easier than it has ever been and each time they do it is cutting edge. It is cutting edge because there are still many moving parts and the ecosystem is yet to have created "middleware".

Well, this is only partly true. There are some middleware projects going on but there is yet an event processing system to orchestrate and coordinate them. Cloudera has developed Sqoop (allows DB import to Hadoop) and I suspect they are in process developing other inter data transfer and management components.

** insert proverbial crystal ball here **

I do not see this being in IT forever. There are currently projects well underway and maturely being used (HIVE & PIG) that provide a higher level abstraction to the Hadoop system. Now, while this is not really meant for non tech folk (yet) it gives them the ability to make your job easier to get to this data. Hive and Pig were released by Facebook & Yahoo (respectively) and these "modules" hide all of that techie gobly gook from you to get at your information. It is still a lot of gobly gook but pseudo code enough for you to sit with the tech folk to make sure you are on the same page for what data you are going after.

This is a HUGE step in the right direction for the next layer of cohesive and more (dare I say it) B.P.M. type of systems along with B.I. and related reporting.

Now, it might take years for this to all come together and get "mainstream" but honestly I have to say that I am excited to be a part of it.

Joe Stein

Thursday, December 31, 2009

Linda, Tuples, Rinda, DRb & Parallel Processing For Distributed Computing

When building enterprise solutions often you start to get into orchestration of your data and process and really you just need to distribute your load. Traditionally this has been solved using MessageQueues, the occasional multi-threaded server (filled with pools and queues) etc, etc, etc. More recently solutions like Hadoop, Gearman, et el have sprung up but one way to solve this problem is with your own solution implementing Linda and tuple spaces.

There are a lot of open source and commercial software that can help with this paradigm but when you are building your own software then you sometimes just need straight up inter-process communication (without having to build your own socket based messaging platform) so that your software can pass objects off to another process (literally) in your architecture and parallelize the "crunching" of your data accordingly.

Now, when this inter-process communication is available across the network and even self aware (meaning that clients can find the server to register to get the data it needs) we really have a powerful solution... but I digress.

"In computer science, Linda is a model of coordination and communication among several parallel processes operating upon objects stored in and retrieved from shared, virtual, associative memory. This model is implemented as a "coordination language" in which several primitives operating on ordered sequence of typed data objects, "tuples," are added to a sequential language, such as C, and a logically global associative memory, called a tuplespace, in which processes store and retrieve tuples."

Ok, Ok, enough of the esoteric academic theory... let me introduce you to Rinda.

Rinda is the Ruby implementation of Linda and Rinda is a built in library to Ruby (specifically DRb which is Distributed Ruby). Yes this comes out of the pervebial box of Ruby 1.8.1 and greater.

What DRb elegantly provides you with Rinda is a RingServer which is basically a solution to manage the tuple spaces and a service for auto-magically finding the server providing you all of the inter-process (and via network) communication with your tuple spaces.

Without further ado I would like to send you here for your first look at Rinda as I found it especially useful and within 15 minutes I had it read, understood and implemented in my software.

Now if you do not know Ruby this might be a good reason to learn it.

Or (if you are like me) do not care what language you use [just using the language to implement solutions for the problems that are at hand] then you can check out some other Linda implementations. I have never used any of these yet but I am sure I will.

# Linda for C++
# Linda for C
# Erlinda (for Erlang)
# Linda for Java
# Linda for Prolog
# PyLinda (for Python)
# Rinda (for Ruby)
# Linda for Scala (on top of Scala Actors)

As cloud computing continues to evolve solutions like Linda and various implementations could become more and more how software frameworks will be implemented as multi-threading is a dead end (better than a dead lock) when trying to parallelize processing.

Sometimes all you need are chopsticks to catch a fly =8^)

Joe Stein

Thursday, October 29, 2009

Los Angeles goes to cloud computing with Google

It is somewhat appropriate that the city of Angels makes this move to get into cloud computing.

What is even more ironic is that they are doing it with Microsoft’s money.

“Google has pushed Google Apps as an option for government agencies, promising to ship a product called Government Cloud, which will be certified under the Federal Information Security Management Act (FISMA), sometime next year”

"According to a Sept. 15 memo from the Los Angeles Information Technology Agency, Google will "provide a new separate data environment called 'GovCloud.' The GovCloud will store both applications and data in a completely segregated environment that will only be used by public agencies.""

This is a big win for cloud computing on a few fronts as it continues to be seen as a way to save money while keeping (and at times enhancing) the confidentiality, integrity and availability of information systems.

Joe Stein

Wednesday, October 28, 2009

Seeing through Windows into the Cloud at the Eclipse

Microsoft(r) has announced collaboration for interoperability between Eclipse (my favorite Java IDE) and Microsoft Windows(r).

There are a couple of great highlights here and some fluff.

First the fluff (nothing wrong with looking nice while out on the town). Eclipse is going to be made useful for "next generation" user experience development for Windows 7 features.

Now on to the more exciting juicy pieces.

Microsoft has collaborated with Soyatec, a France-based IT solutions provider, to develop three solutions: These will open up the Azure cloud solution to not be 100% Microsoft based as well as give Microsoft a new following for it's Silverlight client framework in a community often with Sun in their eyes. More than anything this will open up the storage arena for MS to play a part.

Along with the SDK there is a Storage Explorer of Windows Azure Tools for Eclipse—it allows developers to browse data contained in the Windows Azure storage component, including blobs, tables, and queues. Storage Explorer was developed in Java (like any Eclipse extension), and they realized during the Windows Azure Tools for Eclipse development with Soyatec that abstracting the RESTful communication aspect between the Storage Explorer user interface and the Azure storage component made a lot of sense. So this led them to package the Windows Azure SDK for Java developers as open source, which is available at

Their interoperability strategy and open source direction is becoming competitive.

Joe Stein

Wednesday, October 21, 2009

Mobile Internet Outpaces Desktop Internet Adoption

Mobile internet is taking off faster than the desktop.

iPhone + iTouch users = 8X AOL Users 8 Quarters after launch.

Mary Meeker's Awesome Internet Presentation From Web 2.0 (Morgan Stanley). Click Here for the entire presentation.

Joe Stein

Cloud computing is not about providing a software architecture for scale... that is what Open Source does.

Recently I heard the comment "WOW, elastic cloud computing is great. I can take on a lot more stress with any load and in a few minutes stand up an instance to accommodate usage on demand and keep the app running without long term cost or even contractual commitments". While this person is right they did not know that by starting another instance you are likely just turning on another problem if the software application was never designed to be distributed.

Cloud computing (and the "elasticity" it can provide with Infrastructure as a Service IaaS) is not about providing a software architecture for scale. Let me repeat this again, cloud computing is not about providing a software architecture for scale. So what is it then you ask?

Cloud computing provides an on-demand infrastructure so that your well designed distributed enterprise software application can quickly scale based on the spikes and valleys of usage and interactions of your system (pay as you go for only what you need). IaaS is about hardware resourcing given to your software to reduce expenses but if your software is not designed to take advantage then the opposite will happen with CPU & Memory running away with a false sense of security.

The issue is often that the internal workings of a software system are designed (to coin the phrase) "cloud monolithic". This means that software is usually designed to execute on a single server with a database (often a cluster) and to scale it you just add more servers and put the clusters together. Over the last 3-5 years many *VERY* large cloud based services have emerged and they have open sourced the solutions for how they scaled.

It is important to understand the inner workings of:

1) a-synchronous processing
2) global caching
3) distributed and parallel processing

Without all three of these patterns working together you will actually compound your stress with load bottleneck for each blocking call inside of your software. Your safety net of cloud computing turns into the proverbial wet blanket faster than it ever did before.

Lets break each of these patterns out and how they apply and what solutions exists. In another post I will explain how these apply when dealing with a ridicules amount of information processing on a large scale with the time of process exponentially reduced (because of using the algorithms for map/reduce). I bring this up now because the way the map/reduce algorithms achieve this ability to handle and process so much information exponentially faster is realized from the software written to implement them which also use the technologies that are explained here.

1) A-Synchronous Processing - Ok well this is not really a new one and often there are too many solutions to choose from all with their own pros & cons ( you have to make this call yourself ). Queuing systems have been around for a long time and have numerous implementations in the marketplace. It is so numerous that often each language has it's own set of queue servers to choose from. This being said they are often NOT used correctly because #2 & #3 are not implemented also. I have seen many systems make use of an a-synchronous process which allows the bottleneck of a blocking synchronous call to more expediently return creating a perceived performance gain. The problem here is that you are just passing the problem for another process to either eat up unnecessary cycles or not utilize unused cycles on other parts of your infrastructure (ultimately requiring you to get more servers or now turn on more instances).

Creating a performant software application is about taking both synchronous and a-synchronous operations and making sure that they are utilizing information that has already "crunched" by other parts of the infrastructure [#2 global caching] and maximizes the hardware so the crunching happens on the parts of the infrastructure that currently has the least "crunching" occurring [#3 parallel distributed processing].

So now maybe you are getting the problem and solution so here is how to implement it.

Global Caching with Memcached "memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.".

What this means is two fold. Instead of your application querying the database for information it firsts checks the cache to see if that information is available. What makes this more powerful than using some static variable or custom solution is that memcached is a server that runs on EVERY machine that an instance of your application is running on. So, if server 3 pulls information from the database and adds it to the cache it is a "global cache" that the memcached server replicates for ALL instances/servers to make use of. This is extremely powerful because now every instance of your application is benefiting when all parts of the infrastructure are being used. In this scenario now you have un-compartmentalized "crunching" to no longer have to repeat some "crunch" of information to get to a result that another instance/server has already gotten to for their request/response.

Now this is a HUGE reduction to what often stresses a system but in of itself will not reduce the process to the degree that we are trying to get to because "at the end of the day" that "crunching" still has to occur. The crunching in the memcached implementation will still happen (hopefully a-synchronously once you find that it is not in your global cache and you have to "crunch" =8^0 ).

Now you need to crunch because your data is not memcached or perhaps you have to crunch for some other reason (that is what software does, right?) Just moving this to happen in the background off and onto another process provides no benefit within a multi-server environment.

A la "distributing the processing" and "executing it in parallel" which is where Gearman comes in "Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates."

Both memcached and Gearman are servers with a great following with multi-language client implementation support. They are written in C so that they will execute better than if the had to deal with an interpreter or virtual machine. They might be over kill but if you find yourself with bottlenecks I hope you think about your design and internal architecture of the system before you throw more hardware at your problem (especially now that this can be done with a few clicks to launch an instance).

Joe Stein