Hadoop...Pigs, Hives, and Zookeepers, Oh My!
If there is one aspect of Hadoop that I find particularly entertaining, it is the naming of the various tools that surround Hadoop. In my 7/3 post, I introduced Hadoop, the reasons for its growing popularity, and the core framework features. In this post, I will introduce you to the many different tools, and their clever names, that augment Hadoop and make it more powerful. And yes, the names in the title of this blog are actual tools.
The power behind Pig is that it provides developers with a simple scripting language that performs rather complex MapReduce queries. Originally developed by a team at Yahoo and named for its ability to devour any amount and any kind of data. The scripting language (yes, you guessed it, called Pig Latin) provides the developer with a set of high level commands to do all kinds of data manipulation like joins, filters, and sorts. Unlike the SQL language, Pig is a more procedure or script-oriented query language. SQL, by design, is more declarative. The benefit of a procedural design is that you have more control over the processing of your data. For example, you can inject user code at any point within the process to control the flow.
To complement Pig, Hive provides developers a declarative query language similar to SQL. For many developers who are familiar with building SQL statements for relational databases like SQL Server and Oracle, Hive will be significantly easier to master. Originally developed by a team at Facebook, it has quickly become one of the most popular methods of retrieving data from Hadoop. Hive uses a SQL-like implementation called HiveQL or HQL. Although it doesn't strictly conform to the SQL '92 standard, it does provide many of the same commands. The key language limitation relative to the standard is that there is no transactional support. HQL supports both ODBC and JDBC so developers can leverage many different programming languages like Java, C#, PHP, Python, and Ruby.
To tie these query languages together for complex tasks requires an advanced workflow engine. Enter Oozie-a workflow scheduler for Hadoop that allows multiple queries from multiple query languages to be assembled into a convenient automated step-by-step process. With Oozie, you have total control over the flow to perform branching, decision-making, joining, and more. It can be configured to run at specific times or intervals and reports back logging and status information to the system. Oozie workflows can also accept user input parameters to add additional control. This allows developers to tweak the flow based on changing states or conditions of the system.
When deploying a Hadoop solution, one of the first steps is populating the system with data. Although data can come from many different sources, the most likely would be a relational database like Oracle, MySQL or SQL Server. For moving data to and from relational databases, Apache's Sqoop is great tool to use. The name is derived from combining "SQL" and "Hadoop"; signifying the connection between SQL and Hadoop data. Part of Sqoop's power comes from the intelligence built-in to optimize the transfer of data both on the SQL side and the Hadoop side.
It can query the SQL table's schema to determine the structure of the incoming data, translate it into a set of intelligent data classes, and configure MapReduce to import the data efficiently into a Hadoop data store like HBase. Sqoop also provides the developer more granular control over the transfer by allowing them to import subsets of the data; for example, Sqoop can be told to only import specific columns within the table instead of the whole table. Sqoop was even chosen by Microsoft as their preferred tool for moving SQL Server data into Hadoop.
Another popular data source for Hadoop outside of relational data is log or streaming data. Web sites, in particular, have a propensity to generate massive amounts of log data and more and more companies are finding out how valuable this data is to better understand their audience and their buying habits. So another challenge for the Hadoop community to solve was how to move log-based data into Hadoop. Apache tackled that challenge and released Flume (yes, think of a log flume).
The flume metaphor symbolizes the fact that this tool is dealing with streaming data like water down a rushing river. Unlike Sqoop which is typically moving static data, Flume needs to manage constant changes in data flow and be able to adjust to handle very busy times. For example, Web data may be coming in at an extreme high rate during a promotion. Flume is designed to scale itself to handle these changes in rates. Flume can also receive data from multiple streaming sources, even beyond Web logs, and does so with guaranteed delivery.
There are many more tools I could cover but I'm going to wrap it up with one of my favorite tool names- Zookeeper. This tool comes into play when dealing with very large Hadoop installations. At some point in the growth of the system, as more and more computers are added to the cluster, there will be an increasing need to be able to manage and optimize the various nodes involved. Zookeeper collects information about all nodes and organizes them in a hierarchy similar to how your operating system will create a hierarchy of all the files on your hard drive to make them easier to manage.
The Zookeeper service is an in-memory service making it extremely fast, although it is limited by available RAM which may affect its scalability. It replicates itself across many of the nodes in the Hadoop system so that it maintains high availability and does not create a weak-link situation. Zookeeper becomes the main hub that client machines connect to in order to obtain health information about the system as a whole. It is constantly monitoring all the nodes and logging events as they happen. With Zookeeper's organized map of the system, it makes what could be a cumbersome task of checking on and maintaining each of the nodes individually a more enjoyable and manageable experience.
I hope this give you a taste of the many support tools that are available to Hadoop as well as illustrates the community's commitment to this project. As technology goes, Hadoop is in the very early stages of its lifespan and components and tools are constantly changing. For more information about these and other tools, be sure to check out our new Hadoop course.