CGC Programming Assignment 2

Overview

In the second assignment you must develop a simple MapReduce program that analyses a medium-sized collection of data. This page describes the assignment, so read it carefully.

Develop and test a real-world application

You should develop a simple MapReduce program and test its performance on the DAS4 cluster. Your MapReduce program will have to process a large collection of RDF data and extract some information about the connections between different datasets.

This is an important task in order to understand the "shape" of the Web. To facilitate your work, I have copied a small sample of Web data of about 750 million statements on fs0, in the following directory:

/var/scratch/jurbani/semweb

I have also copied the same directory into HDFS (so that you do not need to copy it to the distributed filesystem yourself), at:

/user/jurbani/semweb

The statements in this dataset come from different datasets. We recognize which dataset a URI belongs to by looking at its domain. For example, consider the first example statement reported below: it uses URIs from three different datasets, "http://www.vu.nl", "http://www.w3.org" and "http://dbpedia.org".
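To make this concrete, the dataset of a URI can be derived from its scheme and host, for instance as in the sketch below (the class and method names are only illustrative, not part of the assignment; you may extract the domain in any other reasonable way):

import java.net.URI;
import java.net.URISyntaxException;

// Minimal sketch: derive the "dataset" of a URI from its scheme and host,
// e.g. <http://dbpedia.org/resource/University> -> http://dbpedia.org
public final class Datasets {

    public static String datasetOf(String uri) {
        try {
            URI u = new URI(uri);
            if (u.getScheme() == null || u.getHost() == null) {
                return null; // literals, blank nodes or malformed URIs: no dataset
            }
            return u.getScheme() + "://" + u.getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }
}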

Your MapReduce program(s) will read the input RDF statements and calculate some analytical information about the datasets in the collection.

More specifically, your program should compute the list of pairs of datasets that are used in the same triple and, for each of these pairs, calculate:

  • The total number of triples which contain both datasets
  • The most popular URI that is used to connect these two datasets
  • The number of triples that use the most popular URI

The output should consist of a list of tuples of the form "dataset1 dataset2 count_triples pop_URI n_triples_URI", where count_triples is the number of statements that contain URIs from both datasets, pop_URI is the most popular URI that connects them, and n_triples_URI is the number of triples that use that URI. This list should be duplicate-free: that means a pair of datasets should appear only once. In your program, do not consider pairs with fewer than 1000 triples.
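One possible structure for the map phase is sketched below. It is only an illustration: the class names, the assumption of one statement per line, and the choice to pass all URIs of a statement on to the reducer are not prescribed; you should make and document your own design decisions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: for every input statement, emit one record per pair of
// distinct datasets that occur in it, keyed on the pair (stored in a fixed
// order so that each unordered pair of datasets maps to a single key).
public class PairMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Assumption: one statement per line, terms separated by whitespace,
        // URIs enclosed in angle brackets (as in the example statements below).
        String[] terms = line.toString().split("\\s+");
        List<String> uris = new ArrayList<>();
        List<String> datasets = new ArrayList<>();
        for (String term : terms) {
            if (term.startsWith("<") && term.endsWith(">")) {
                String uri = term.substring(1, term.length() - 1);
                String dataset = Datasets.datasetOf(uri); // helper sketched earlier
                if (dataset != null) {
                    uris.add(uri);
                    datasets.add(dataset);
                }
            }
        }
        Set<String> emitted = new HashSet<>(); // each pair at most once per statement
        for (int i = 0; i < datasets.size(); i++) {
            for (int j = i + 1; j < datasets.size(); j++) {
                String a = datasets.get(i);
                String b = datasets.get(j);
                if (a.equals(b)) {
                    continue; // both URIs belong to the same dataset: not a pair
                }
                String pair = a.compareTo(b) < 0 ? a + "\t" + b : b + "\t" + a;
                if (emitted.add(pair)) {
                    // Pass the URIs of the statement along, so that the reducer
                    // can decide which URI "connects" the two datasets.
                    ctx.write(new Text(pair), new Text(String.join(" ", uris)));
                }
            }
        }
    }
}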

For example, suppose that the input contains these two statements:

<http://www.vu.nl> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/resource/University>

and

<http://www.vu.nl> <http://www.geonames.org/located> <http://www.geonames.org/Amsterdam>

In this case your output should be (ignoring, for the sake of this small example, the 1000-triple threshold):

<http://www.vu.nl> <http://www.w3.org> 1 <http://dbpedia.org/resource/University> 1

<http://www.geonames.org> <http://www.vu.nl> 1 <http://www.geonames.org/located> 1

<http://dbpedia.org> <http://www.vu.nl> 1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 1

<http://dbpedia.org> <http://www.w3.org> 1 <http://www.vu.nl> 1
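On the reduce side, a sketch along the following lines could produce such output. Again, this is only an illustration: in particular, how the "most popular connecting URI" is chosen is a design decision that you should make and motivate yourself.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch: count the statements per dataset pair, keep track of how
// often each candidate connecting URI occurs, and only emit pairs with at
// least 1000 triples.
public class PairReducer extends Reducer<Text, Text, Text, Text> {

    private static final long MIN_TRIPLES = 1000;

    @Override
    protected void reduce(Text pair, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        long triples = 0;
        Map<String, Long> uriCounts = new HashMap<>();
        for (Text value : values) {
            triples++;
            // Assumption: the mapper passed the candidate connecting URIs of
            // the statement separated by single spaces.
            for (String uri : value.toString().split(" ")) {
                uriCounts.merge(uri, 1L, Long::sum);
            }
        }
        if (triples < MIN_TRIPLES) {
            return; // ignore pairs with fewer than 1000 triples
        }
        String popUri = null;
        long popCount = 0;
        for (Map.Entry<String, Long> entry : uriCounts.entrySet()) {
            if (entry.getValue() > popCount) {
                popUri = entry.getKey();
                popCount = entry.getValue();
            }
        }
        ctx.write(pair, new Text(triples + " " + popUri + " " + popCount));
    }
}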

Important

- The Hadoop framework takes care of the execution, but it is still your responsibility to write programs where the computation is spread equally across the reduce tasks.

- Your code should be clear and easy to understand. Undocumented or "cryptic" code will have a negative impact on your grade.

- You should submit:

  • Your code, together with a jar of your program;
  • A report (max 2 pages)

- The report should contain

  • A brief explanation of your approach, highlighting the problems you encountered and the solutions you adopted
  • A table/graph where you report the execution time with different settings (e.g. 2, 4, 8, 16, 32, 64 reducers) over the input I provided; a driver sketch showing how the number of reducers can be set is given at the end of this page. When measuring the execution time, take into account the possible overhead caused by a busy cluster
  • An analysis of the scalability. What do you think might hurt performance if we launch your program over a dataset that is 100 times larger? Look at the min/max/avg time spent by the mappers and reducers. What conclusions can you draw?

- If your program works and your report contains no mistakes (and covers the points mentioned above), you pass the assignment. If you want to get a very good grade, you must also work on improving the performance and scalability.
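For completeness, here is a minimal driver sketch that reuses the class names from the sketches above. The line relevant for the timing experiments is job.setNumReduceTasks(...), which controls how many reduce tasks are used (the same value can also be set on the command line with -D mapreduce.job.reduces=N, or -D mapred.reduce.tasks=N on older Hadoop versions).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: wires the mapper and reducer sketches together and makes
// the number of reduce tasks a command-line argument, so that the timing
// experiments with 2, 4, ..., 64 reducers are easy to repeat.
public class DatasetPairs {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dataset-pairs");
        job.setJarByClass(DatasetPairs.class);
        job.setMapperClass(PairMapper.class);
        job.setReducerClass(PairReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(Integer.parseInt(args[2])); // e.g. 2, 4, 8, 16, 32, 64
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/jurbani/semweb
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}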