Data Lab: MapReduce Fundamentals

MapReduce Fundamentals

MapReduce is a programming model for efficient distributed computing.
The core concept of MapReduce in Hadoop is that input may be split into logical chunks, and each chunk may be initially processed independently, by a map task. The results of these individual processing chunks can be physically partitioned into distinct sets, which are then sorted. Each sorted chunk is passed to a reduce task.
Every MapReduce program must specify a Mapper and typically a Reducer.

Input and Output Formats

A MapReduce may specify how it’s input is to be read by specifying an InputFormat to be used
– InputSplit
– RecordReader
A MapReduce may specify how it’s output is to be written by specifying an OutputFormat to be used.
These default to TextInputFormat and TextOutputFormat, which process line-based text data.
SequenceFile: SequenceFileInputFormat and SequenceFileOutputFormat.
These are file-based, but they are not required to be.

How many Maps and Reduces

Maps
– Usually as many as the number of HDFS blocks being processed, this is the default.
– Else the number of maps can be specified as a hint.
– The number of maps can also be controlled by specifying the minimum split size.
– The actual sizes of the map inputs are computed by:

• max(min(block_size, data/#maps), min_split_size)

Reduces
– Unless the amount of data being processed is small

• 0.95*num_nodes*mapred.tasktracker.reduce.tasks.maximum

Data Lab

Friday, 15 March 2013

MapReduce Fundamentals

MapReduce Fundamentals

Input and Output Formats

How many Maps and Reduces

No comments:

Post a Comment