MapReduce


MapReduce is an open-source programming model and processing framework, delivered as part of Apache Hadoop, for processing very large data sets. The MapReduce system runs on a clustered computer system. The processing consists of two phases. The map phase transforms input data into intermediate key-value pairs. The output from the map phase subsequently feeds the reduce phase, which aggregates its input records into a smaller set of results.

Why is MapReduce Important?

If you need fast and economical processing of petabytes of data, you need MapReduce and Hadoop. Today, Hadoop is almost synonymous with Big Data because it allows organizations to employ thousands of commodity servers to process large datasets in parallel.

MapReduce evolved from the need for systems to efficiently handle data sets that can scale into the petabyte range. Data comes in many formats that must be processed, stored, and retrieved. Hadoop pairs MapReduce with the Hadoop Distributed File System (HDFS): HDFS stores the data, and MapReduce processes that wide variety of data in a scalable and flexible manner.

MapReduce jobs can be written in popular languages: Java natively, and languages such as Python and C++ through Hadoop Streaming and Hadoop Pipes.

Performing a MapReduce Operation

Performing a MapReduce on a data set requires implementing two core components: a Mapper and a Reducer.

The Mapper

The tasks that transform input records into intermediate records are known as Maps. One input pair can map to zero or more output pairs (intermediate records), and the output types need not match the input types. The Map phase runs in parallel, with one map task per input split, so if a dataset is divided into ten splits, ten map tasks are created.
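
As an illustration of what a Map task looks like in the Java API, here is a minimal word-count Mapper sketch; the TokenizerMapper class name and its details are illustrative, not taken from this article:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count Mapper: input is (byte offset, line of text),
// output is zero or more (word, 1) intermediate pairs.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // one intermediate pair per token
        }
    }
}
```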

Applications can use a built-in Counter to gather statistics on the operation. The output can be grouped, sorted, and partitioned before being passed on to the Reducer tasks, with one Reducer task per partition. The developer can also specify a combiner, which performs local aggregation of the map output to control the volume of data shuffled at this stage.
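
Registering a combiner is a single call in the job driver (a fuller driver sketch appears later in this article). Reusing the Reducer class as the combiner, as in this fragment, works when the reduce operation is associative and commutative, as a sum is; IntSumReducer is the hypothetical word-count Reducer sketched below:

```java
// Fragment of a job driver: run the Reducer logic as a map-side combiner
// so partial sums are computed before the intermediate data is shuffled.
job.setCombinerClass(IntSumReducer.class);
```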

The intermediate, sorted output records are in a (key-len, key, value-len, value) format. Applications can specify a compression codec in the configuration.
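
For example, intermediate (map-side) output compression is controlled through standard job configuration properties; the choice of Snappy here is purely illustrative:

```java
// Fragment of a job driver: compress the intermediate map output.
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
        org.apache.hadoop.io.compress.SnappyCodec.class,
        org.apache.hadoop.io.compress.CompressionCodec.class);
```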

Parallelizing the Map Tasks

The number of parallel Map tasks is governed by the total number of input blocks and the number of worker slots available per node, which can amount to thousands for a multi-terabyte file; for example, a 1 TB input with a 128 MB block size yields roughly 8,000 splits and therefore up to roughly 8,000 map tasks.

The Reduce Method

The intermediate values produced by the Mapper are run through the Reduce method, which reduces each set of values that share a key to a smaller set of values. The number of Reduce tasks is configurable.
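
As a sketch in the same vein as the Mapper above, the hypothetical IntSumReducer below sums the values that share a key; the class and field names are illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count Reducer: for each word it receives the list of
// counts emitted by the Mappers (and any combiner) and writes a single total.
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);  // one output record per distinct key
    }
}
```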

The Reducer proceeds through a series of steps that start with a shuffle and sort, which occur concurrently, followed by the reduce step.

In the Shuffle step, the framework fetches the relevant partition of each Mapper's output for the Reducer job. In the Sort step, the framework groups the Mapper output by key. Secondary key sorts are configurable.
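
For example, the Java API lets a job driver override how intermediate keys are sorted and grouped; the two comparator classes named below are hypothetical application-supplied implementations, not library classes:

```java
// Fragment of a job driver: a secondary sort is set up by sorting on a
// composite key while grouping reduce input on the natural key only.
// Both comparator classes are hypothetical RawComparator implementations.
job.setSortComparatorClass(CompositeKeySortComparator.class);
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
```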

The Reduce method is called once for every <key, (list of values)> pair in the grouped inputs, and its output is eventually written to the file system.

The output of the Reducer is not re-sorted. The reduce step can be made optional by setting the number of Reduce tasks to zero, which produces a map-only job.
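
Tying these pieces together, a minimal job driver might look like the sketch below; it assumes the TokenizerMapper and IntSumReducer classes sketched earlier and input/output paths passed on the command line, and it is an illustration rather than a prescribed setup. Calling job.setNumReduceTasks(0) instead would make it a map-only job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver for the word-count job sketched in this article.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The number of Reduce tasks is configurable; zero makes the job map-only.
        job.setNumReduceTasks(2);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```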

Partitioning

The intermediate map outputs are partitioned, typically using a hash function over the key (or a portion of it). The number of partitions is the same as the number of Reduce tasks for the job.
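
As a sketch, a custom partitioner that mirrors the default hash-based behavior might look like this; the WordPartitioner class name is illustrative:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner mirroring the default hash-based scheme:
// records with the same key always land in the same partition, and the
// partition index is bounded by the number of reduce tasks.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```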

Both the Mapper and the Reducer can use Counters to report statistics.
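
For example, a task can increment a named counter from inside its map() or reduce() method; the group and counter names here are illustrative:

```java
// Fragment of a map() or reduce() body: report a custom statistic through
// the framework's Counter mechanism. The names are illustrative only.
context.getCounter("DataQuality", "MalformedRecords").increment(1);
```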

The Benefits of MapReduce

Below is a summary of the key benefits of MapReduce:

  • A highly scalable way to process high-volume data.
  • Provides open-source economics.
  • Supports multiple data formats.
  • The MapReduce library is delivered as a standard part of the Apache Hadoop distribution.
  • Easy for developers to use.

Actian Data Platform and Hadoop

The Actian Data Platform is designed to scale within a Hadoop environment, enabling parallelism across nodes and access to all data formats that Apache Spark supports. Hadoop support is the key to scalable operations across a cluster, with partitioned data accessed through Apache Spark. Actian also provides scalable data access to cloud data sources such as Azure Blob Storage or GCS.