MapReduce Fundamentals
MapReduce is a programming model for efficient distributed computing.The core concept of MapReduce in Hadoop is that input may be split into logical chunks, and each chunk may be initially processed independently, by a map task. The results of these individual processing chunks can be physically partitioned into distinct sets, which are then sorted. Each sorted chunk is passed to a reduce task.
Every MapReduce program must specify a Mapper and typically a Reducer.
Input and Output Formats
A MapReduce may specify how it’s input is to be read by specifying an InputFormat to be used– InputSplit
– RecordReader
A MapReduce may specify how it’s output is to be written by specifying an OutputFormat to be used.
These default to TextInputFormat and TextOutputFormat, which process line-based text data.
SequenceFile: SequenceFileInputFormat and SequenceFileOutputFormat.
These are file-based, but they are not required to be.
How many Maps and Reduces
Maps– Usually as many as the number of HDFS blocks being processed, this is the default.
– Else the number of maps can be specified as a hint.
– The number of maps can also be controlled by specifying the minimum split size.
– The actual sizes of the map inputs are computed by:
• max(min(block_size, data/#maps), min_split_size)Reduces
– Unless the amount of data being processed is small
• 0.95*num_nodes*mapred.tasktracker.reduce.tasks.maximum
No comments:
Post a Comment