Apache Hadoop project gains momentum
Yahoo and Twitter are among a host of corporate giants now contributing to the project, as Linux User & Developer’s Rory MacDonald explains…
Last month’s Hadoop Summit in Santa Clara, California, saw a number of interesting developments around this increasingly hot open source project. Hadoop is a top-level Apache project that provides a Java software framework for storing, managing, processing and analysing the massive datasets produced by enterprise web and cloud computing applications. Inspired by Google’s internal MapReduce framework and the Google File System, Hadoop includes the HBase distributed database, the HDFS distributed file system, the MapReduce application framework, the Hive data warehousing application and a number of other open source tools, languages and common utilities.
Hadoop’s list of users increasingly reads like a who’s who of the web. And, as revealed at the summit, many are now contributing significant, enterprise-tested technologies back into the project.
Yahoo announced it was handing over ‘Hadoop with Security’ and its in-house Oozie workflow engine to the open source community. Hadoop with Security is a custom integration of the open source Kerberos authentication standard that enables secure collaboration and sharing of datasets, as well as hardware sharing between different instances. Oozie, meanwhile, is a workflow and job management tool for complex work processes and global-scale ETL (extract, transform, load). Oozie works with most of Hadoop’s core components and, coming straight from Yahoo’s operational systems, has been pretty thoroughly tested.
Twitter also announced that it was open sourcing Crane, a tool for moving data from MySQL into a Hadoop infrastructure. Crane has, again, been pretty thoroughly tested, with Twitter using the tool as part of its arsenal to lift 7TB of data into Hadoop every day. While all of this may seem like immense corporate benevolence, Hadoop is a prime example of free software working as it should.
Hadoop’s large corporate users are effectively forced to contribute their most critical tools and tweaks back into the community. If they do not, they risk the project’s community developing rival tools, or taking the core components in directions that leave those companies constantly refactoring their proprietary tools and developments.
Somewhat at odds with this, then, Cloudera – the company that now employs some of the creators, committers and key contributors to the projects that make up Hadoop – used the summit to announce a risky new proprietary business model with a set of closed ‘enterprise’ tools for Hadoop.