Introduction to Data Mining Homework 1 Solution






Please use a word processing software (e.g., Microsoft word) to write your answers and submit a printed copy to me at the beginning of the class on Feb 8. The rationale is that it is sometimes hard to read and understand the hand-written answers.


All homeworks should be done individually.



Analytical Part (40 points)


Q1. Consider the following market-basket data, where each row is a basket and shows the list of items that are part of that basket.


  1. fA; B; Cg


  1. fA; C; D; Eg


  1. fA; B; F; G; Hg


  1. fA; B; X; Y; Zg


  1. fA; C; D; P; Q; R; Sg


  1. fA; B; L; M; Ng


  1. What is the absolute support of item set fA; Bg ? (3 points)


  1. What is the relative support of item set fA; Bg ? (3 points)


  1. What is the con dence of association rule A ) B ? (3 points)


Q2. Answer the below questions about storing frequent pairs using triangular matrix and tabular method.


  1. Suppose we use a triangular matrix to count pairs and the number of items n = 20. If we store this triangular matrix as a ragged one-dimensional array Count, what is the index where count of pair (7; 8) is stored? (3 points)


  1. Suppose you are provided with the prior knowledge that only ten percent of the total pairs will have a non-zero count. In this case, which method among triangular matrix and



tabular method should be preferred and why? (3 points)


Q3. This question is about the PCY algorithm for counting frequent pairs of items. Suppose we have six items numbered 1, 2, 3, 4, 5, 6. Consider the following twelve baskets.


  1. f1; 2; 3g


  1. f2; 3; 4g


  1. f3; 4; 5g


  1. f4; 5; 6g


  1. f1; 3; 5g


  1. f2; 4; 6g


  1. f1; 3; 4g


  1. f2; 4; 5g


  1. f3; 5; 6g


  1. f1; 2; 4g


  1. f2; 3; 5g


  1. f3; 4; 6g


Suppose the support threshold is 4. On the rst pass of the PCY algorithm, we use a hash table with 11 buckets, and the set fi; jg is hashed to i j mod 11.

  1. By any method, compute the support for each item and each pair of items. (5 points)


  1. Which pairs hash to which buckets? (5 points)


  1. Which buckets are frequent? (3 points)


  1. Which pairs are counted on the second pass of the PCY algorithm? (2 points)


Q4. Please read the following paper and write a brief summary of the main points in at most ONE page. You can skip the theoretical parts. (10 points)


Saul Schleimer, Daniel Shawcross Wilkerson, Alexander Aiken: Winnowing: Local Algo-


rithms for Document Fingerprinting. SIGMOD Conference 2003: 76-85


Programming and Experimental Part (60 points)


Product Recommendations: The action or practice of selling additional products or ser-vices to existing customers is called cross-selling. Giving product recommendation is one of the examples of cross-selling that are frequently used by online retailers. One simple method to give product recommendations is to recommend products that are frequently browsed together by the customers.


Suppose we want to recommend new products to the customer based on the products they have already browsed on the online website. Write a program using the A-priori algorithm to nd products which are frequently browsed together. Fix the support to s =100 (i.e., product pairs need to occur together at least 100 times to be considered frequent) and nd itemsets of size 2 and 3.


Use the online browsing behavior dataset provided with this homework. Each line represents a browsing session of a customer. On each line, each string of 8 characters represents the id of an item browsed during that session. The items are separated by spaces.


  1. Identify pairs of items (X; Y ) such that the support of fX; Y g is at least 100. For all such pairs, compute the con dence scores of the corresponding association rules: X ) Y , Y ) X. Sort the rules in decreasing order of con dence scores and list the top 5 rules in the writeup. Break ties, if any, by lexicographically increasing order on the left hand side of the rule.


  1. Identify item triples (X; Y; Z) such that the support of fX; Y; Zg is at least 100. For all such triples, compute the con dence scores of the corresponding association rules: (X; Y ) ) Z, (X; Z) ) Y , (Y; Z) ) X. Sort the rules in decreasing order of con dence scores and list the top 5 rules in the writeup. Order the left-hand-side pair lexicographically and break ties, if any, by lexicographical order of the rst then the second item in the pair.


Instructions for Code Submission and Output Format.


Please follow the below instructions. It will help us in grading your programming part of the homework. We will provide a dropbox folder link for code submission.


Supported programming languages: Python, Java, C++


Store all the relevant les in a folder and submit the corresponding zip le named after your student-id, e.g.,


This folder should have a script  le named


Executing this script should do all the necessary steps required for executing the code including compiling, linking, and execution





Assume relative  le paths in your code. Some examples: “./filename.txt” or “../hw1/filename.txt”


The output of your program should be dumped in a le named \output.txt” in the following format:




FRO11987 FRO12685 0.4325 FRO11987 ELE11375 0.4225 FRO11987 GRO94758 0.4125 FRO11987 SNA80192 0.4025 FRO11987 FRO18919 0.4015



FRO11987 FRO12685 DAI95741 0.4325 FRO11987 ELE11375 GRO73461 0.4225 FRO11987 GRO94758 ELE26917 0.4125 FRO11987 SNA80192 ELE28189 0.4025 FRO11987 FRO18919 GRO68850 0.4015




{ Line 1 should have \Output A”


{ Next  ve lines should have the top  ve rules with decreasing con dence scores for part (a) of the programming question.  Format: < item1 > < item2 > <


confidence > meaning fitem1g ) item2 { Line 7 should have \Output B”


{ Next ve lines should have the top ve rules with decreasing con dence scores for part (b) of the programming question. Format: < item1 > < item2 > < item3 > < confidence > meaning fitem1; item2g ) item3


Make sure the output.txt  le is dumped when you execute the script


Zip the entire folder and submit it as <student_id>.zip


Grading Rubric


Each question in the students work will be assigned a letter grade of either A,B,C,D, or F by the Instructor and TAs. This ve-point (discrete) scale is described as follows:


  1. A) Exemplary (=100%).


Solution presented solves the problem stated correctly and meets all requirements of the problem.


Solution is clearly presented.


Assumptions made are reasonable and are explicitly stated in the solution.


Solution represents an elegant and e ective way to solve the problem and is not overly complicated than is necessary.



  1. B) Capable (=75%).


Solution is mostly correct, satisfying most of the above criteria under the exemplary category, but contains some minor pitfalls, errors/ aws or limitations.



  1. C) Needs Improvement (=50%).


Solution demonstrates a viable approach toward solving the problem but contains some major pitfalls, errors/ aws or limitations.



  1. D) Unsatisfactory (=25%)


Critical elements of the solution are missing or signi cantly  awed.


Solution does not demonstrate su cient understanding of the problem and/or any rea-sonable directions to solve the problem.



  1. F) Not attempted (=0%) No solution provided.



The points on a given homework question will be equal to the percentage assigned (given by the letter grades shown above) multiplied by the maximum number of possible points worth for that question. For example, if a question is worth 6 points and the answer is awarded a B grade, then that implies 4.5 points out of 6.
















error: Content is protected !!