The MapReduce programming model

This page gives you background information on MapReduce to ease the development of the second assignment.

MapReduce:

In MapReduce, the computation is expressed with two functions: map and reduce. Both functions process key/value pairs, so we must encode our input and output accordingly. The map function reads the input and returns a set of intermediate key/value pairs. After the map function has been applied to all the input, the framework groups the intermediate pairs by key and applies the reduce function to each group.
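To make this concrete, here is a minimal sketch of the model in plain Python (no Hadoop involved), using the classic word-count example: map emits a (word, 1) pair per word, the framework groups the pairs by key, and reduce sums each group. The function names (map_fn, reduce_fn, run_mapreduce) are illustrative only, not part of any real MapReduce API.

```python
from collections import defaultdict

def map_fn(_key, value):
    # Input pair: (line number, line of text); emit one (word, 1) pair per word.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Input: a word and the list of all counts emitted for it; emit the total.
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input pair.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)  # grouping by key (the "shuffle")
    # Reduce phase: apply reduce_fn to each group of intermediate values.
    results = {}
    for key, values in sorted(intermediate.items()):
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# → {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In a real framework the map and reduce calls run in parallel on many machines, and the grouping step moves data across the network; the toy driver above only mimics the data flow on a single machine.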

On the Web there is plenty of documentation on MapReduce. I suggest you read the following links if you want to learn more about this topic:

Hadoop:

Hadoop is an open-source implementation of MapReduce originally developed at Yahoo! Currently, Hadoop is the most popular MapReduce framework and is widely used in many different contexts.

Recently, Hadoop has been split into several components. MapReduce is the subcomponent responsible for executing MapReduce programs; HDFS is the distributed filesystem where the data is stored; and so on.

Again, on the Web there is plenty of documentation on Hadoop. I list some of the main pointers below. To ease the development of your assignment, I strongly encourage you to read them!

Hadoop is installed on the DAS-4, so you can use the command "hadoop" to interact with the cluster. The web interface of the JobTracker is available at port 50130 on fs0; the web interface of the NameNode is available at port 50190.