Description
NOTE 1: Please use word processing software (e.g., Microsoft Word or LaTeX) to write your answers and submit a printed copy to me at the beginning of class on Oct 23. The rationale is that handwritten answers are sometimes hard to read and understand.
NOTE 2: Please ensure that all graphs are appropriately labeled (x-axis, y-axis, and each curve). The caption or heading of each graph should be informative and self-contained.

(10 points) Let x = (x_1, x_2, …, x_d) and z = (z_1, z_2, …, z_d) be any two points in a high-dimensional space (i.e., d is very large).
a. (5 points) Prove the following, where the right-hand side is the standard Euclidean distance:

√d · | (1/d) ∑_{i=1}^{d} x_i − (1/d) ∑_{i=1}^{d} z_i | ≤ √( ∑_{i=1}^{d} (x_i − z_i)² )        (1)
Hint: Use Jensen's inequality: if X is a random variable and f is a convex function, then f(E[X]) ≤ E[f(X)].
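As a sanity check (not a proof), inequality (1) can be verified numerically. A minimal sketch using NumPy; the dimension and the random points are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
x = rng.normal(size=d)
z = rng.normal(size=d)

# Left-hand side of (1): sqrt(d) times the absolute difference of the means.
lhs = np.sqrt(d) * abs(x.mean() - z.mean())
# Right-hand side of (1): the Euclidean distance between x and z.
rhs = np.linalg.norm(x - z)
assert lhs <= rhs
```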


b. (5 points) We know that the computation of nearest neighbors is very expensive in a high-dimensional space. Discuss how we can use the above property to make the nearest-neighbor computation efficient.


(10 points) We briefly discussed in class the Locality Sensitive Hashing (LSH) algorithm for making the nearest-neighbor classifier efficient. Please read the following paper and briefly summarize the key ideas as you understood them:
Alexandr Andoni, Piotr Indyk: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM 51(1): 117-122 (2008). http://people.csail.mit.edu/indyk/p117-andoni.pdf
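To get a flavor of the idea before reading, here is a minimal sketch of random-hyperplane hashing for angular distance, a simpler cousin of the schemes surveyed in the paper (the points, seed, and bit count below are illustrative, not from the paper):

```python
import numpy as np

def hyperplane_hash(points, n_bits, seed=0):
    """Hash each point to an n_bits sign pattern using random hyperplanes.
    Points with a small angle between them tend to get the same signature,
    so candidate nearest neighbors can be found by bucket collisions."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_bits, points.shape[1]))
    return (points @ planes.T > 0).astype(int)

pts = np.array([[1.0, 0.1],    # point A
                [1.0, 0.12],   # point B, nearly the same direction as A
                [-1.0, -0.1]]) # point C, opposite direction from A
sigs = hyperplane_hash(pts, n_bits=8)
# A and B should agree on most bits; A and C should disagree on all of them.
```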

(15 points) We know that we can convert any decision tree into a set of if-then rules, with one rule per leaf node. Suppose you are given a set of rules R = {r_1, r_2, …, r_k}, where r_i corresponds to the i-th rule. Is it possible to convert the rule set R into an equivalent decision tree? Explain your construction or give a counterexample.

(10 points) You are provided with a training set of examples (see Figure 1). Which feature will you pick first to split the data as per the ID3 decision tree learning algorithm? Show all your work: compute the information gain for all four attributes and pick the best one.
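For reference, information gain is Gain(S, A) = H(S) − ∑_v (|S_v|/|S|) H(S_v), where H is the entropy of the class labels. A minimal sketch on made-up toy data (not the Figure 1 table):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Gain(S, A) = H(S) - sum_v (|S_v|/|S|) * H(S_v)."""
    n = len(labels)
    split = {}
    for ex, y in zip(examples, labels):
        split.setdefault(ex[feature], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

# Toy data (NOT Figure 1): 'wind' happens to predict the label perfectly,
# so its information gain equals the full label entropy of 1 bit.
data = [{"wind": "weak"}, {"wind": "strong"}, {"wind": "weak"}, {"wind": "strong"}]
ys = ["yes", "no", "yes", "no"]
gain = information_gain(data, ys, "wind")
```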

(10 points) Please read the following paper and briefly summarize the key ideas as you understood them (you can skip the proofs, but it is important to understand the main results):
Andrew Y. Ng, Michael I. Jordan: On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. NIPS 2001: 841-848. http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

(10 points)


Let us assume that the training data satisfies the Naive Bayes assumption (i.e., features are independent given the class label). As the amount of training data approaches infinity, which classifier will produce better results, Naive Bayes or Logistic Regression? Please explain your reasoning.

Figure 1: Table with training examples. Each row corresponds to a single training example. There are four features, namely outlook, temperature, humidity, and wind. "PlayTennis" is the class label.


Let us assume that the training data does NOT satisfy the Naive Bayes assumption. As the amount of training data approaches infinity, which classifier will produce better results, Naive Bayes or Logistic Regression? Please explain your reasoning.


(10 points)


Can we compute P(X) from the learned parameters of a Naive Bayes classifier? Please explain your reasoning.



Can we compute P(X) from the learned parameters of a Logistic Regression classifier? Please explain your reasoning.


(15 points) In class, we looked at the log-likelihood derivation and the corresponding gradient ascent algorithm to find the parameters of a binary logistic regression classifier (see slides 12 and 13). We want to extend the log-likelihood derivation and the parameter learning algorithm to the multiclass case. Suppose we have K different classes, and the posterior probability can be represented using the so-called softmax function (see slide 18):

P(y = k | x) = exp(w_k · x) / ∑_{i=1}^{K} exp(w_i · x)        (2)


Derive the log-likelihood and the corresponding gradient ascent algorithm to find the parameters.
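Once derived, the gradient can be sanity-checked numerically: for the softmax model in (2), the standard result is ∂LL/∂w_k = ∑_n (1[y_n = k] − P(y = k | x_n)) x_n. A minimal sketch with made-up toy numbers (the helper names and the step size are illustrative):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis, as in equation (2)."""
    s = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def grad_log_likelihood(W, X, y, K):
    """Gradient of the multiclass log-likelihood w.r.t. W (shape K x d):
    dLL/dw_k = sum_n (1[y_n = k] - P(y = k | x_n)) x_n."""
    P = softmax(X @ W.T)   # (n, K) posterior probabilities
    Y = np.eye(K)[y]       # (n, K) one-hot encoding of the labels
    return (Y - P).T @ X   # (K, d)

# One gradient-ascent step on toy data: W <- W + eta * gradient.
X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = np.array([0, 2])
W = np.zeros((3, 2))
W += 0.1 * grad_log_likelihood(W, X, y, K=3)
```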



Add a regularization term to the log-likelihood objective (see slide 16), and derive the gradient ascent update rule with this additional change.


(30 points) Fortune Cookie Classifier^1
You will build a binary fortune cookie classifier. This classifier will be used to classify fortune cookie messages into two classes: messages that predict what will happen in the future (class 1) and messages that just contain a wise saying (class 0). For example,
"Never go in against a Sicilian when death is on the line" would be a message in class 0.
"You will get an A in Machine Learning class" would be a message in class 1.
^1 Thanks to Weng-Keen Wong and his advisor Andrew Moore for sharing the data.
Files Provided: There are three sets of files. All words in these files are lowercase and punctuation has been removed.
1) The training data:
traindata.txt: This is the training data consisting of fortune cookie messages.
trainlabels.txt: This file contains the class labels for the training data.
2) The testing data:
testdata.txt: This is the testing data consisting of fortune cookie messages.
testlabels.txt: This file contains the class labels for the testing data.
3) A list of stopwords: stoplist.txt
There are two steps: the preprocessing step and the classification step. In the preprocessing step, you will convert fortune cookie messages into features to be used by your classifier. You will be using a bag-of-words representation. The following steps outline the process involved:
Form the vocabulary. The vocabulary consists of the set of all the words that are in the training data with stop words removed (stop words are common, uninformative words such as "a" and "the" that are listed in the file stoplist.txt). The vocabulary will now be the features of your training data. Keep the vocabulary in alphabetical order to help you with debugging.
Now, convert the training data into a set of features. Let M be the size of your vocabulary. Convert each fortune cookie message into a feature vector of size M, where each slot takes the value 0 or 1. If the i-th slot is 1, the i-th word in the vocabulary is present in the fortune cookie message; if it is 0, the i-th word is not present. Most of these feature vector slots will be 0. Since you are keeping the vocabulary in alphabetical order, the first feature will be the first word alphabetically in the vocabulary.
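The preprocessing steps above can be sketched as follows (the messages and stop words here are made up for illustration; the real data comes from the provided files):

```python
def build_features(messages, stopwords):
    """Binary bag-of-words: the vocabulary is the set of all non-stop
    words in the training messages, kept in alphabetical order."""
    vocab = sorted({w for msg in messages for w in msg.split()} - set(stopwords))
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for msg in messages:
        v = [0] * len(vocab)          # one slot per vocabulary word
        for w in msg.split():
            if w in index:
                v[index[w]] = 1       # mark words present in this message
        vectors.append(v)
    return vocab, vectors

msgs = ["you will find true love", "a wise man listens"]
vocab, vecs = build_features(msgs, stopwords={"a", "you", "will"})
# vocab -> ['find', 'listens', 'love', 'man', 'true', 'wise']
```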

Implement the Naive Bayes classifier (with Laplace smoothing) and run it on the training data. Compute the training and testing accuracy.
To debug and test your implementation, you can employ Weka (weka.classifiers.bayes.NaiveBayes): http://www.cs.waikato.ac.nz/ml/weka/downloading.html or scikit-learn (http://scikit-learn.org/stable/modules/naive_bayes.html).
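A minimal sketch of the scikit-learn cross-check on made-up toy vectors; assuming a binary bag-of-words representation, BernoulliNB with alpha=1.0 corresponds to Laplace (add-one) smoothing:

```python
from sklearn.naive_bayes import BernoulliNB

# Toy binary feature vectors standing in for the real bag-of-words data.
X_train = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y_train = [1, 1, 0, 0]

clf = BernoulliNB(alpha=1.0)  # alpha=1.0 -> Laplace (add-one) smoothing
clf.fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)  # compare against your own accuracy
```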


Run the off-the-shelf Logistic Regression classifier from Weka (weka.classifiers.functions.Logistic) or scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the training data. Compute the training and testing accuracy.


(30 points) Income Classifier using Decision Trees. You will use the Adult Income dataset from HW1 for this question.


Implement the ID3 decision tree learning algorithm that we discussed in class. The key step in decision tree learning is choosing the next feature to split on. Implement the information gain heuristic for selecting the next feature. Please see the lecture notes or https://en.wikipedia.org/wiki/ID3_algorithm for more details. I explained how to select candidate thresholds for continuous features: sort all candidate values for feature f from the training data. Suppose f_1, f_2, …, f_n is the sorted list. The candidate thresholds are chosen as f_i + (f_{i+1} − f_i)/2 for i = 1 to n−1.
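The threshold-selection rule above can be sketched as follows (the feature values are made up for illustration):

```python
def candidate_thresholds(values):
    """Midpoints between consecutive sorted feature values:
    f_i + (f_{i+1} - f_i) / 2 for each adjacent pair."""
    vs = sorted(set(values))
    return [vs[i] + (vs[i + 1] - vs[i]) / 2 for i in range(len(vs) - 1)]

# Toy values for a continuous feature such as age or hours-per-week.
ts = candidate_thresholds([40, 48, 60, 72, 80, 90])
# ts -> [44.0, 54.0, 66.0, 76.0, 85.0]
```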

Run the decision tree construction algorithm on the training examples. Compute the accuracy on validation examples and testing examples.

Implement the decision tree pruning algorithm discussed in class (via validation data).

Run the pruning algorithm on the decision tree constructed from the training examples. Compute the accuracy on validation examples and testing examples. List your observations by comparing the performance of the decision tree with and without pruning.
To debug and test your implementation, you can employ Weka (weka.classifiers.trees.J48): http://www.cs.waikato.ac.nz/ml/weka/downloading.html or scikit-learn (http://scikit-learn.org/stable/modules/tree.html).
Instructions for Code Submission and Output Format.
Please follow the instructions below. They will help us grade the programming part of your homework. We will provide a Dropbox folder link for code submission.
Supported programming languages: Python, Java, C++
Store all the relevant files in a folder and submit the corresponding zip file named after your student ID, e.g., 114513209.zip
This folder should have a script file named run_code.sh. Executing this script should do all the necessary steps required for executing the code, including compiling, linking, and execution.
Assume relative file paths in your code. Some examples: "./filename.txt" or "../hw2/filename.txt"
The output of your program should be dumped in a file named "output.txt". Make sure the output.txt file is written when you execute the script run_code.sh.
Zip the entire folder and submit it as <student_id>.zip
Grading Rubric
Each question in the student's work will be assigned a letter grade of A, B, C, D, or F by the instructor and TAs. This five-point (discrete) scale is described as follows:
A) Exemplary (=100%).
Solution presented solves the problem stated correctly and meets all requirements of the problem.
Solution is clearly presented.
Assumptions made are reasonable and are explicitly stated in the solution.
Solution represents an elegant and effective way to solve the problem and is no more complicated than necessary.
B) Capable (=75%).
Solution is mostly correct, satisfying most of the above criteria under the Exemplary category, but contains some minor pitfalls, errors/flaws, or limitations.
C) Needs Improvement (=50%).
Solution demonstrates a viable approach toward solving the problem but contains some major pitfalls, errors/flaws, or limitations.
D) Unsatisfactory (=25%)
Critical elements of the solution are missing or significantly flawed.
Solution does not demonstrate sufficient understanding of the problem and/or any reasonable direction to solve the problem.
F) Not attempted (=0%) No solution provided.
The points for a given homework question will be equal to the percentage assigned (given by the letter grades shown above) multiplied by the maximum number of points for that question. For example, if a question is worth 6 points and the answer is awarded a B grade, that implies 4.5 points out of 6.