Lab 1 Solution




This lab requires you to search a large collection of documents for those relevant to a query. You will need to understand the concepts of RDDs, transformations and actions in Apache Spark, and implement an algorithm at the RDD level. This is an individual lab assignment.

Users of an online shopping website would like to find the relevant reviews of items they are interested in. Write a Spark program that finds the top-k relevant documents given a query comprising a set of keywords.

A document can be modelled as a vector of words (or terms). Each entry in the vector is a TF-IDF value that reflects how important a word is to a document in a collection, computed as follows:

TF-IDF = (1 + log (TF)) * log (N/DF)

where N is total number of documents, TF is the count of the word in a document, and DF is the count of documents having the word.
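As a concrete check of the formula, here is a small helper (the function name is illustrative, not part of any assignment skeleton; natural log is assumed, since the log base only scales all scores uniformly and does not change the ranking):

```python
import math

def tf_idf(tf, df, n_docs):
    """TF-IDF = (1 + log(TF)) * log(N / DF), using natural log."""
    return (1 + math.log(tf)) * math.log(n_docs / df)

# A word appearing 3 times in a document, present in 10 of 100 documents:
score = tf_idf(tf=3, df=10, n_docs=100)
```

Note that a word appearing in every document (DF = N) gets a TF-IDF of 0, since log(N/DF) = log(1) = 0 — such a word carries no discriminating power.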

A query can also be represented as a vector where each entry represents a word with a value 1 if the word is in the query, and 0 otherwise.

We can compute a relevance score for each document d to a query q based on the cosine similarity of their corresponding vectors V1 and V2 as follows:

relevance(q, d) = cos(V1, V2) = (V1 · V2) / (|V1| |V2|)
With this, we can rank the documents with respect to a query.
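A minimal sketch of cosine similarity over sparse word vectors, represented here as `{word: weight}` dicts (a representation chosen for illustration, not mandated by the lab):

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two sparse vectors given as {word: weight} dicts."""
    dot = sum(v1[w] * v2.get(w, 0.0) for w in v1)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

Identical vectors score 1.0, and vectors with no words in common score 0.0.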

Algorithm: There are 5 main steps.

Step 1. Compute term frequency (TF) of every word in a document.

This is similar to the Word Count program in Lab0.
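In Spark this step has the Word Count shape: map each word occurrence to `((docId, word), 1)` and sum with `reduceByKey`. A plain-Python sketch of the same logic (the `docs` mapping and function name are illustrative assumptions):

```python
from collections import Counter

def term_frequencies(docs):
    """docs: {doc_id: list_of_words} -> {(doc_id, word): count}.
    Mirrors a map to ((doc_id, word), 1) followed by reduceByKey(add)."""
    tf = Counter()
    for doc_id, words in docs.items():
        for w in words:
            tf[(doc_id, w)] += 1
    return dict(tf)

tfs = term_frequencies({"d1": ["spark", "rdd", "spark"], "d2": ["rdd"]})
```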

Step 2. Compute TF-IDF of every word w.r.t. a document.

You can use a key-value pair RDD and the groupByKey() API. Choose a proper key for your pair RDD.
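The idea is to re-key the TF pairs by word (the natural groupByKey key), count the distinct documents per word to get DF, then combine with TF. A local sketch under those assumptions (names are illustrative):

```python
import math

def tf_idf_per_doc(tf, n_docs):
    """tf: {(doc_id, word): count} -> {(doc_id, word): tf-idf}.
    Groups by word to obtain DF, then applies (1 + log TF) * log(N / DF)."""
    postings = {}  # word -> set of doc_ids containing it (DF = set size)
    for (doc_id, word) in tf:
        postings.setdefault(word, set()).add(doc_id)
    return {
        (doc_id, word): (1 + math.log(count)) * math.log(n_docs / len(postings[word]))
        for (doc_id, word), count in tf.items()
    }
```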

Step 3. Compute normalized TF-IDF of every word w.r.t. a document.

Suppose the TF-IDF value of word1 in doc1 is t1 and the sum of squares of the TF-IDF values of all the words in doc1 is S; then the normalized TF-IDF value of word1 is t1 / sqrt(S).
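A local sketch of the normalization step (function name is illustrative): sum the squared TF-IDF values per document, then divide each entry by the square root of its document's sum.

```python
import math

def normalize(tfidf):
    """{(doc_id, word): t} -> {(doc_id, word): t / sqrt(sum of t^2 in that doc)}."""
    sq = {}  # doc_id -> sum of squared tf-idf values
    for (doc_id, _), t in tfidf.items():
        sq[doc_id] = sq.get(doc_id, 0.0) + t * t
    return {(d, w): t / math.sqrt(sq[d]) for (d, w), t in tfidf.items()}
```

After this step each document's vector has unit length, which simplifies the cosine computation in Step 4.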

Step 4. Compute the relevance of each document w.r.t. a query.
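Because the query vector's entries are all 0 or 1, the dot product reduces to summing the normalized TF-IDF values of the query words present in the document; the document norm is already 1 after Step 3, and the query norm sqrt(|q|) is the same for every document, so it does not affect the ranking. A local sketch under those assumptions (names are illustrative):

```python
import math

def relevance(norm_doc, query_words):
    """Cosine relevance of a normalized document vector to a 0/1 query vector.
    norm_doc: {word: normalized tf-idf}; query_words: set of query words."""
    dot = sum(norm_doc.get(w, 0.0) for w in query_words)
    return dot / math.sqrt(len(query_words))  # query norm; constant across docs
```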

Step 5. Sort and get top-k documents
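In Spark the final step is typically a `takeOrdered(k, ...)` on the scored RDD. A plain-Python sketch of the same top-k logic (names are illustrative):

```python
import heapq

def top_k(scores, k=20):
    """scores: {doc_id: relevance} -> top-k (doc_id, score) pairs,
    descending by score (mirrors rdd.takeOrdered(k, key=lambda kv: -kv[1]))."""
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

ranked = top_k({"d1": 0.2, "d2": 0.9, "d3": 0.5}, k=2)
# ranked == [("d2", 0.9), ("d3", 0.5)]
```

`takeOrdered` (or `nlargest` locally) avoids fully sorting all documents when only the top-k are needed.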


Input:

  • Set of keywords for a query (in query.txt)

  • Stopwords to remove (in stopwords.txt)

Construct a document from each review in bookreviews.json by combining the information in “reviewText” and “summary”. The document ID is obtained by combining the “reviewerID” and the product ID “asin”. You should have as many documents as there are reviews in the file bookreviews.json. Note that stopwords should be removed from the documents constructed.

Output: The output should contain one line per relevant review in the following format:

<document ID> <relevance score>

The output should be sorted in descending order of the relevance of the documents to the query. You only need to give the top-20 most relevant reviews.


Upload your Spark program (with documentation within the code) to the Lab1 folder in IVLE.

