
Data processing with MapReduce

by Christian Baun

Do you need to process lots of data (more than 1TB)? Do you want to parallelise the work across hundreds or thousands of CPUs, and keep it simple? Then you need MapReduce for large-scale parallel data processing. Have lower demands? No problem. MapReduce can do that too

This article originally appeared in issue 87 of Linux User & Developer magazine.


Google uses MapReduce extensively to process huge datasets across large numbers of nodes, for example to compute PageRank. The MapReduce programming model borrows from functional programming and implements two basic functions: Map and Reduce. Before the map phase can start, the input data is structured as (key, value) pairs.

  • Map: The master node splits the input into smaller sub-problems and distributes them to the worker nodes. Each worker node processes its sub-problem and passes the result back to the master node.
  • Reduce: The master node collects the answers to the sub-problems and combines them so that the original problem is answered.

The Map functions run in parallel, creating different intermediate values from different input data sets. The Reduce functions also run in parallel, each working on a different output key. All values are processed independently. There is just one bottleneck: the reduce phase can’t start until the map phase has completely finished.
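
To make the data flow concrete, here is a minimal single-machine sketch in Java (Hadoop’s implementation language) of the classic word-count example: the map step emits a (word, 1) pair for every word, the pairs are grouped by key, and the reduce step sums the values for each key. This only illustrates the model on one machine; it says nothing about how a real framework distributes the work.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Single-machine illustration of the MapReduce data flow:
// map every input record to intermediate (key, value) pairs,
// group the pairs by key, then reduce each group independently.
public class MapReduceSketch {
  public static void main(String[] args) {
    List<String> input = Arrays.asList("to be or not to be", "to do or not to do");

    Map<String, Integer> wordCounts = input.stream()
        .flatMap(line -> Arrays.stream(line.split("\\s+")))   // map: emit one (word, 1) pair per word
        .collect(Collectors.groupingBy(                        // shuffle: group pairs by key
            word -> word,
            Collectors.summingInt(word -> 1)));                // reduce: sum the values of each group

    wordCounts.forEach((word, count) -> System.out.println(word + "\t" + count));
  }
}

Running it prints each word together with the number of times it occurs in the input lines.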

While MapReduce is the name of the patented software framework and algorithm used by Google, various open source implementations exist. The best known is Hadoop, a project started in 2004 by Doug Cutting and written in Java, which re-implements Google’s proprietary infrastructure as an open source ecosystem of services for processing large-scale data.

Hadoop implements the MapReduce parallel programming model as open source software. It includes its own file system, the Hadoop Distributed File System (HDFS), but other file systems are supported too, including Amazon S3 and CloudStore. Hadoop also includes HBase, a database modelled on Google’s BigTable, the Pig programming environment and much more. Hadoop can be used for many applications, such as distributed grep and sort, document clustering, log file analysis and various kinds of statistics.
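
The same word count, written as a real Hadoop job, shows how the model maps onto Hadoop’s Java API: a Mapper emits (word, 1) pairs, a Reducer sums them, and a small driver wires the two together and points the job at its input and output paths. This is essentially the canonical WordCount example from the Hadoop documentation, lightly commented.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: turn every line of text into a series of (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure the job and point it at the input and output paths.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, the job is typically submitted with hadoop jar wordcount.jar WordCount <input> <output>, where the two arguments are HDFS paths.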

The largest Hadoop cluster is run by Yahoo! and consists of 4,000 nodes with 32,000 cores, 64TB of RAM and 16PB of storage. Amazon, with its Amazon Web Services, is also involved in MapReduce. With Amazon Elastic MapReduce the user can easily start a Hadoop cluster (master and slaves) inside the Amazon Elastic Compute Cloud (EC2). The Amazon Simple Storage Service (S3) serves as the source for the data being analysed, and the results are also copied back to S3.
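
As a rough sketch of what ‘easily start a Hadoop cluster’ looks like in code, the example below uses the AWS SDK for Java to launch a small Elastic MapReduce cluster and submit one Hadoop JAR step whose input and output live in S3. The bucket names, JAR path, instance types and EMR release label are assumptions made up for this example, and the IAM roles shown are the defaults AWS creates; adjust all of them to your own account.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class StartEmrCluster {
  public static void main(String[] args) {
    AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

    // One Hadoop step: run our WordCount JAR with S3 input and output paths.
    // The bucket and JAR names are placeholders for this example.
    StepConfig wordCountStep = new StepConfig()
        .withName("word count")
        .withActionOnFailure("TERMINATE_CLUSTER")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://my-bucket/jars/wordcount.jar")
            .withArgs("s3://my-bucket/input/", "s3://my-bucket/output/"));

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("MapReduce demo")
        .withReleaseLabel("emr-5.36.0")               // assumed EMR release; pick a current one
        .withServiceRole("EMR_DefaultRole")           // default IAM roles created by AWS
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withLogUri("s3://my-bucket/logs/")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(3)                     // one master, two slaves
            .withMasterInstanceType("m5.xlarge")
            .withSlaveInstanceType("m5.xlarge")
            .withKeepJobFlowAliveWhenNoSteps(false))  // shut the cluster down after the step
        .withSteps(wordCountStep);

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started cluster: " + result.getJobFlowId());
  }
}

Because the cluster is not kept alive once the step finishes, you only pay for the EC2 instances while the job is actually running; the results stay in the S3 output bucket.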

Want your own Hadoop installation at home? Try Cloudera. Its free Hadoop distribution is easy to install and runs on top of all up-to-date Linux distributions, inside VMware or inside Amazon Web Services.
