Description
NOTE 1: Please use word processing software (e.g., Microsoft Word or LaTeX) to write your answers and submit a printed copy to me at the beginning of class on Feb 27. The rationale is that it is sometimes hard to read and understand handwritten answers.
NOTE 2: Please ensure that all graphs are appropriately labeled (x-axis, y-axis, and each curve). The caption or heading of each graph should be informative and self-contained.

1. (5 points) Answer the following questions with a yes or no, along with proper justification.


Is the decision boundary of the voted perceptron linear?



Is the decision boundary of the averaged perceptron linear?


2. (5 points) In class, we saw the Passive-Aggressive (PA) update that tries to achieve a margin equal to one after each update. Derive the PA weight update for achieving margin M.

3. (10 points) Consider the following setting. You are provided with n training examples: (x_{1}, y_{1}, h_{1}), (x_{2}, y_{2}, h_{2}), ..., (x_{n}, y_{n}, h_{n}), where x_{i} is the input example, y_{i} is the class label (+1 or −1), and h_{i} > 0 is the importance weight of the example. The teacher gave you some additional information by specifying the importance of each training example.


How will you modify the perceptron algorithm to be able to leverage this extra information? Please justify your answer.



How can you solve this learning problem using the standard perceptron algorithm? Please justify your answer. I'm looking for a reduction-based solution.


4. (10 points) Consider the following setting. You are provided with n training examples: (x_{1}, y_{1}), (x_{2}, y_{2}), ..., (x_{n}, y_{n}), where x_{i} is the input example and y_{i} is the class label (+1 or −1). However, the training data is highly imbalanced (say 90% of the examples are negative and 10% are positive), and we care more about the accuracy on the positive examples.


How will you modify the perceptron algorithm to solve this learning problem? Please justify your answer.



How can you solve this learning problem using the standard perceptron algorithm? Please justify your answer. I'm looking for a reduction-based solution.

5. (10 points) Suppose we have n_{+} positive training examples and n_{−} negative training examples. Let C_{+} be the center of the positive examples and C_{−} be the center of the negative examples, i.e., C_{+} = (1/n_{+}) \sum_{i: y_{i}=+1} x_{i} and C_{−} = (1/n_{−}) \sum_{i: y_{i}=−1} x_{i}. Consider a simple classifier called CLOSE that classifies a test example x by assigning it to the class whose center is closest.

Show that the decision boundary of the CLOSE classifier is a linear hyperplane of the form sign(w · x + b). Compute the values of w and b in terms of C_{+} and C_{−}.

Recall that the weight vector can be written as a linear combination of all the training examples: w = \sum_{i=1}^{n_{+}+n_{−}} \alpha_{i} y_{i} x_{i}. Compute the dual weights (\alpha's). How many of the training examples are support vectors?
6. (5 points) Suppose we use the following radial basis function (RBF) kernel: K(x_{i}, x_{j}) = exp(−(1/2) ||x_{i} − x_{j}||^{2}), which has some implicit unknown feature mapping φ(x).

Prove that the mapping φ(x) corresponding to the RBF kernel has infinite dimensions.

Prove that for any two input examples x_{i} and x_{j}, the squared Euclidean distance between their images in the higher-dimensional space defined by φ is less than 2, i.e., ||φ(x_{i}) − φ(x_{j})||^{2} < 2.
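As a starting point, note that the feature-space distance can be written entirely in terms of kernel evaluations: ||φ(x_{i}) − φ(x_{j})||^{2} = K(x_{i}, x_{i}) − 2 K(x_{i}, x_{j}) + K(x_{j}, x_{j}). A quick numerical sanity check of the claimed bound (a sketch for intuition only, not part of the required proof):

```python
import numpy as np

def rbf_kernel(xi, xj):
    # K(xi, xj) = exp(-(1/2) * ||xi - xj||^2)
    return float(np.exp(-0.5 * np.linalg.norm(xi - xj) ** 2))

def feature_space_sq_dist(xi, xj):
    # ||phi(xi) - phi(xj)||^2 expanded via kernel evaluations
    return rbf_kernel(xi, xi) - 2 * rbf_kernel(xi, xj) + rbf_kernel(xj, xj)

# the squared feature-space distance stays in [0, 2) for any pair of inputs
for _ in range(1000):
    xi, xj = np.random.randn(5), np.random.randn(5)
    d = feature_space_sq_dist(xi, xj)
    assert 0.0 <= d < 2.0
```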

7. (5 points) The decision boundary of an SVM with a kernel function (via implicit feature mapping φ(·)) is defined as follows:

w · φ(x) + b = \sum_{i ∈ SV} y_{i} α_{i} K(x_{i}, x) + b = f(x; α, b)

where w and b are parameters of the decision boundary in the feature space φ defined by the kernel function K, SV is the set of support vectors, and α_{i} is the dual weight of the i^{th} support vector.

Let us assume that we use the radial basis function (RBF) kernel K(x_{i}, x_{j}) = exp(−(1/2) ||x_{i} − x_{j}||^{2}); also assume that the training examples are linearly separable in the feature space and the SVM finds a decision boundary that perfectly separates the training examples.

If we choose a testing example x_{far} that is far away from any training instance x_{i} (distance here measured in the original feature space R^{d}), prove that f(x_{far}; α, b) ≈ b.

8. (5 points) The function K(x_{i}, x_{j}) = ⟨x_{i}, x_{j}⟩ is a valid kernel. Prove or disprove it.

9. (5 points) You are provided with n training examples: (x_{1}, y_{1}), (x_{2}, y_{2}), ..., (x_{n}, y_{n}), where x_{i} is the input example and y_{i} is the class label (+1 or −1). The teacher gave you some additional information by specifying the costs of different mistakes, C_{+} and C_{−}, which stand for the cost of misclassifying a positive and a negative example, respectively.


How will you modify the Soft-margin SVM formulation to be able to leverage this extra information? Please justify your answer.


10. (5 points) Consider the following setting. You are provided with n training examples: (x_{1}, y_{1}, h_{1}), (x_{2}, y_{2}, h_{2}), ..., (x_{n}, y_{n}, h_{n}), where x_{i} is the input example, y_{i} is the class label (+1 or −1), and h_{i} > 0 is the importance weight of the example. The teacher gave you some additional information by specifying the importance of each training example.


How will you modify the Soft-margin SVM formulation to be able to leverage this extra information? Please justify your answer.



How can you solve this learning problem using the standard SVM training algorithm? Please justify your answer.


11. (15 points) You are provided with a set of n training examples: (x_{1}, y_{1}), (x_{2}, y_{2}), ..., (x_{n}, y_{n}), where x_{i} is the input example and y_{i} is the class label (+1 or −1). Suppose n is very large (say on the order of millions). In this case, standard SVM training algorithms will not scale due to the large training set.
Tom wants to devise a solution based on the "Coarse-to-Fine" framework of problem solving. The basic idea is to cluster the training data; train an SVM classifier based on the clusters (coarse problem); refine the clusters as needed (fine problem); perform training on the finer problem; and repeat until convergence. Suppose we start with k_{+} positive clusters and k_{−} negative clusters (a cluster is defined as a set of examples). Please specify the mathematical formulation (define all the variables used in your formulation) and a concrete algorithm for each of the following steps to instantiate this idea:
a) How to define the SVM training formulation for a given level of coarseness: a set of k_{+} positive clusters and a set of k_{−} negative clusters?

b) How to refine the clusters based on the resulting SVM classifier?

c) What is the stopping criterion?
Optional question: For what kind of problems will this solution fail?

12. (20 points) You are provided with a set of n training examples: (x_{1}, y_{1}), (x_{2}, y_{2}), ..., (x_{n}, y_{n}), where x_{i} is the input example and y_{i} is the class label (+1 or −1). Suppose n is very large (say on the order of millions). In this case, online kernelized Perceptron algorithms will not scale if the number of allowed support vectors is unbounded.


Suppose you have trained using the kernelized Perceptron algorithm (without any bound on support vectors) and got a set of support vectors SV. Tom wants to use this classifier for real-time prediction and cannot afford more than B kernel evaluations for each classification decision. Please give an algorithm to select B support vectors from SV. You need to motivate your design choices in order to convince Tom to use your solution.



Tom wants to train using the kernelized Perceptron algorithm, but wants to use at most B support vectors during the training process. Please modify the standard kernelized Perceptron training algorithm (from class slides) for this new setting. You need to motivate your design choices in order to convince Tom to use your solution.


13. (50 points) Programming and empirical analysis question.
Implement a binary classifier with both the perceptron and passive-aggressive (PA) weight updates, as shown below.
Algorithm 1 Online Binary-Classifier Learning Algorithm
Input: D = training examples, T = maximum number of training iterations
Output: w, the final weight vector

1: Initialize the weights w = 0
2: for each training iteration itr ∈ {1, 2, ..., T} do
3:     for each training example (x_{t}, y_{t}) ∈ D do
4:         ŷ_{t} = sign(w · x_{t}) // predict using the current weights
5:         if mistake then
6:             w = w + τ y_{t} x_{t} // update the weights
7:         end if
8:     end for
9: end for
10: return final weight vector w
For the standard perceptron, you will use τ = 1, and for the Passive-Aggressive (PA) algorithm, you will compute the learning rate τ as follows:

τ = (1 − y_{t} (w · x_{t})) / ||x_{t}||^{2}    (1)
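For reference, Algorithm 1 with both update rules can be sketched in a few lines (a minimal NumPy version; the feature encoding and data loading for the income data are omitted, and the function and variable names are our own):

```python
import numpy as np

def train_online(X, y, T, update="perceptron"):
    """Algorithm 1: online binary-classifier learning.

    X : (n, d) array of examples; y : (n,) array of +1/-1 labels.
    Returns the final weight vector and the per-iteration mistake counts.
    """
    n, d = X.shape
    w = np.zeros(d)
    mistakes = []
    for itr in range(T):
        m = 0
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:   # mistake (or zero margin)
                if update == "perceptron":
                    tau = 1.0               # fixed learning rate
                else:                       # passive-aggressive rate, Eq. (1)
                    tau = (1 - y_t * np.dot(w, x_t)) / np.dot(x_t, x_t)
                w = w + tau * y_t * x_t
                m += 1
        mistakes.append(m)
    return w, mistakes
```

With τ = 1 this is the standard perceptron; with the rate in Equation (1) it is the PA update.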
You are provided with the income data. See https://archive.ics.uci.edu/ml/datasets/census+income for a description of the different features. You are provided with a training set, a development (held-out) set, and a testing set. The format of each file is as follows: each line is one classification example, a sequence of feature values with the class label at the end (≤50K or >50K), in comma-separated values (CSV) format.
Perceptron, Passive-Aggressive, and Averaged Classifier
Convert all features into binary format (0 or 1). Describe your conversion. How many features do you have (i.e., the dimensionality)?
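One common conversion, for reference (a sketch with hypothetical toy values, not the required encoding): give each distinct (feature index, value) pair its own 0/1 dimension. A real solution would also decide how to binarize the numeric census fields, e.g. by thresholding them into buckets first.

```python
def binarize(examples):
    """One 0/1 dimension per distinct (feature index, value) pair seen in training."""
    vocab = {}
    for ex in examples:
        for j, v in enumerate(ex):
            vocab.setdefault((j, v), len(vocab))
    rows = []
    for ex in examples:
        x = [0] * len(vocab)
        for j, v in enumerate(ex):
            if (j, v) in vocab:          # unseen pairs map to all-zeros
                x[vocab[(j, v)]] = 1
        rows.append(x)
    return rows, vocab

# two toy examples, two categorical features each
rows, vocab = binarize([["Private", "Male"], ["State-gov", "Male"]])
# dimensionality = number of distinct (feature, value) pairs
```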
Implement the Perceptron- and Passive-Aggressive-based classifiers from these binary features. Compute the online learning curve for both the Perceptron and PA algorithms by plotting the number of training iterations (1 to 5) on the x-axis and the number of mistakes on the y-axis. Compare the two curves and list your observations.
Compute the accuracy of both the Perceptron and PA algorithms on the training data, development data, and testing data for each training iteration (1 to 5). So you will have three accuracy curves for the Perceptron and another three for the PA algorithm. Compare the six curves and list your observations.
Implement the averaged perceptron algorithm in both the naive way (the one I wrote on the whiteboard) and the smart way (Algorithm 7 from Hal's book chapter). Measure the training time to perform 5 iterations for both implementations. Compute the accuracies on the training data, development data, and testing data with the averaged classifier and compare with those obtained using the standard perceptron. List your observations.
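For reference, the "smart" implementation keeps an auxiliary vector of counter-weighted updates so the averaged weights come out with O(d) extra work at the end, instead of summing the full weight vector after every example. A sketch of that idea (bias term omitted; check it against Algorithm 7 in the book chapter rather than treating this as the reference version):

```python
import numpy as np

def averaged_perceptron(X, y, T):
    """Averaged perceptron via the cached-sum trick."""
    n, d = X.shape
    w = np.zeros(d)   # current weights
    u = np.zeros(d)   # cached sum of counter-weighted updates
    c = 1             # example counter across all iterations
    for _ in range(T):
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:
                w += y_t * x_t
                u += c * y_t * x_t
            c += 1
    return w - u / c  # averaged weights, computed once at the end
```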
Compute the general learning curve (vary the number of training examples, starting from 5000 in increments of 5000) for 5 training iterations. Plot the number of training examples on the x-axis and the accuracy (development and testing) on the y-axis. List your observations from these curves.
Optional things you can try: (a) Shuffling the training examples during each iteration. What did you observe? (b) A variable learning rate with a schedule of your choice. What did you observe?

14. (50 points) Support Vector Machines (SVM) Classifier and Kernels
(30 points) Empirical analysis question. You can use a publicly available SVM classifier implementation for the SVM-related experiments, e.g., LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) or scikit-learn (http://scikit-learn.org/stable/modules/svm.html).



Using a linear kernel, train the SVM on the training data for different values of the C parameter: 10^{−4}, 10^{−3}, 10^{−2}, 10^{−1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}. Compute the training accuracy, validation accuracy, and testing accuracy for the SVM obtained with each value of the C parameter. Plot the training, validation, and testing accuracy as a function of C (C value on the x-axis and accuracy on the y-axis), one curve each for the training, validation, and testing data. Also, plot the number of support vectors (see the training log at the end of SVM training for LibSVM, or use the scikit-learn API if applicable) as a function of C. List your observations.
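With scikit-learn, the C sweep might look like the sketch below (toy data stands in for the binarized census features; `SVC.n_support_` reports per-class support-vector counts, which avoids parsing the LibSVM training log):

```python
import numpy as np
from sklearn.svm import SVC

# toy stand-in for the (binarized) census features
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [3.0, 2.5]])
y = np.array([-1, -1, 1, 1])

results = []
for C in [10.0 ** k for k in range(-4, 5)]:
    clf = SVC(kernel="linear", C=C)
    clf.fit(X, y)
    acc = clf.score(X, y)              # training accuracy
    n_sv = int(clf.n_support_.sum())   # total number of support vectors
    results.append((C, acc, n_sv))
```

The same loop, evaluated also on the validation and test splits, yields the three accuracy curves and the support-vector curve.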





Select the best value of the hyper-parameter C based on the accuracy on the validation set, and train a linear SVM on the combined set of training and validation examples. Compute the testing accuracy and the corresponding confusion matrix: a 2 × 2 matrix.
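The 2 × 2 confusion matrix can be tallied directly (a sketch; scikit-learn's `confusion_matrix` would also work, with attention to label ordering):

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    # rows: actual (+1, -1); columns: predicted (+1, -1)
    m = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[0 if t == 1 else 1][0 if p == 1 else 1] += 1
    return m

# e.g. one false negative among two positives:
# confusion_matrix_2x2([1, 1, -1, -1], [1, -1, -1, -1]) -> [[1, 1], [0, 2]]
```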





Repeat experiment (a), with the best C value from (a), using a polynomial kernel of degree 2, 3, and 4. Compare the training, validation, and testing accuracies, and the number of support vectors, for the different kernels (linear, and polynomial kernels of degree 2, 3, and 4). List your observations.


(20 points) Programming question. You will implement the standard kernelized Perceptron training algorithm (from class slides) for binary classification.
(a) Train the binary classifier for 5 iterations with a polynomial kernel (pick the best degree out of 2, 3, and 4 from the above experiment). Plot the number of mistakes as a function of training iterations. Compute the training, development, and testing accuracy at the end of 5 iterations.
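The kernelized perceptron keeps a dual weight (mistake count) per training example instead of an explicit w. A sketch with the polynomial kernel (our own rendering of the class-slides algorithm; the names and the exact kernel form, (1 + x·x')^degree, are assumptions):

```python
import numpy as np

def poly_kernel(a, b, degree):
    return (1.0 + np.dot(a, b)) ** degree

def kernel_perceptron(X, y, T, degree=2):
    """Dual perceptron: alpha[i] counts the mistakes made on example i."""
    n = len(X)
    alpha = np.zeros(n)
    mistakes_per_iter = []
    for _ in range(T):
        m = 0
        for t in range(n):
            # prediction from the current support vectors (alpha[i] > 0)
            s = sum(alpha[i] * y[i] * poly_kernel(X[i], X[t], degree)
                    for i in range(n) if alpha[i] > 0)
            if y[t] * s <= 0:
                alpha[t] += 1
                m += 1
        mistakes_per_iter.append(m)
    return alpha, mistakes_per_iter
```

On XOR-style data, which is not linearly separable in the original space, the degree-2 kernel lets the mistake count drop to zero.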
Instructions for Code Submission and Output Format.
Please follow the instructions below. They will help us in grading the programming part of your homework. We will provide a Dropbox folder link for code submission.
Supported programming languages: Python, Java, C++
Store all the relevant files in a folder and submit the corresponding zip file named after your student ID, e.g., 114513209.zip
This folder should have a script file named run_code.sh. Executing this script should perform all the steps required to run the code, including compiling, linking, and execution.
Assume relative file paths in your code. Some examples: "./filename.txt" or "../hw1/filename.txt"
The output of your program should be dumped in a file named "output.txt". Make sure the output.txt file is produced when you execute the script run_code.sh.
Zip the entire folder and submit it as <student_id>.zip
Grading Rubric
Each question in the student's work will be assigned a letter grade of A, B, C, D, or F by the instructor and TAs. This five-point (discrete) scale is described as follows:
A) Exemplary (=100%).
Solution presented solves the problem stated correctly and meets all requirements of the problem.
Solution is clearly presented.
Assumptions made are reasonable and are explicitly stated in the solution.
Solution represents an elegant and effective way to solve the problem and is not more complicated than necessary.
B) Capable (=75%).
Solution is mostly correct, satisfying most of the above criteria under the exemplary category, but contains some minor pitfalls, errors/flaws, or limitations.
C) Needs Improvement (=50%).
Solution demonstrates a viable approach toward solving the problem but contains some major pitfalls, errors/flaws, or limitations.
D) Unsatisfactory (=25%).
Critical elements of the solution are missing or significantly flawed.
Solution does not demonstrate sufficient understanding of the problem and/or any reasonable direction to solve the problem.
F) Not attempted (=0%) No solution provided.
The points on a given homework question will equal the percentage assigned (given by the letter grades shown above) multiplied by the maximum number of points for that question. For example, if a question is worth 6 points and the answer is awarded a B grade, that implies 4.5 points out of 6.