# Homework #1 Solution

\$35.00 \$29.05

Category:

## Description

NOTE 1: Please use a word processing software (e.g., Microsoft word or Latex) to write your answers and submit a printed copy to me at the begining of class on Feb 27. The rationale is that it is sometimes hard to read and understand the hand-written answers.

NOTE 2: Please ensure that all the graphs are appropriately labeled (x-axis, y-axis, and each curve). The caption or heading of each graph should be informative and self-contained.

1. (5 points) Answer the following questions with a yes or no along with proper justi cation.

1. Is the decision boundary of voted perceptron linear?

1. Is the decision boundary of averaged perceptron linear?

1. (5 points) In the class, we saw the Passive-Aggressive (PA) update that tries to achieve a margin equal to one after each update. Derive the PA weight update for achieving margin M.

1. (10 points) Consider the following setting. You are provided with n training examples: (x1; y1; h1); (x2; y2; h2); ; (xn; yn; hn), where xi is the input example, yi is the class label (+1 or -1), and hi > 0 is the importance weight of the example. The teacher gave you some additional information by specifying the importance of each training example.

1. How will you modify the perceptron algorithm to be able to leverage this extra information? Please justify your answer.

1. How can you solve this learning problem using the standard perceptron algorithm? Please justify your answer. I’m looking for a reduction based solution.

1. (10 points) Consider the following setting. You are provided with n training examples: (x1; y1); (x2; y2); ; (xn; yn), where xi is the input example, and yi is the class label (+1 or – 1). However, the training data is highly imbalanced (say 90% of the examples are negative and 10% of the examples are positive) and we care more about the accuracy of positive examples.

1. How will you modify the perceptron algorithm to solve this learning problem? Please justify your answer.

1. How can you solve this learning problem using the standard perceptron algorithm? Please justify your answer. I’m looking for a reduction based solution.

5. (10 points) Suppose we have n+ positive training examples and n negative training exam-ples. Let C+ be the center of the positive examples and C be the center of the negative

examples, i.e., C+ = n1+ Pi: yi=+1 xi and C = n1 Pi: yi= 1 xi. Consider a simple classi er called CLOSE that classi es a test example x by assigning it to the class whose center is

closest.

Show that the decision boundary of the CLOSE classi er is a linear hyperplane of the form sign(w x + b). Compute the values of w and b in terms of C+ and C .

Recall that the weight vector can be written as a linear combination of all the training

 examples: w = n++n i yi xi. Compute the dual weights ( ’s). How many of the i=1 training examples are support vectors? P

6. (5 points) Suppose we use the following radial basis function (RBF) kernel: K(xi; xj ) = exp( 12 kxi xj k2), which has some implicit unknown mapping (x).

Prove that the mapping (x) corresponding to RBF kernel has in nite dimensions.

Prove that for any two input examples xi and xj , the squared Euclidean distance of their corresponding points in the higher-dimensional space de ned by is less than 2, i.e., k (xi) (xj )k2 2.

1. (5 points) The decision boundary of a SVM with a kernel function (via implicit feature mapping (:)) is de ned as follows:

w (x) + b = i2SV yi iK(xi; x) + b = f(x; ; b)

• where w and b are parameters of the decision boundary in the feature space phi de ned by the kernel function K, SV is the set of support vectors, and i is the dual weight of the ith support vector.

Let us assume that we use the radial basis function (RBF) kernel K(xi; xj ) = exp( 12 kxi xj k2); also assume that the training examples are linearly separable in the feature space and SVM nds a decision boundary that perfectly separates the training examples.

If we choose a testing example xf ar that is far away from any training instance xi (distance here is measured in the original feature space <d). Prove that f(xf ar; ; b) b:

1. (5 points) The function K(xi; xj ) = h xi; xj i is a valid kernel. Prove or Disprove it.

1. (5 points) You are provided with n training examples: (x1; y1); (x2; y2); ; (xn; yn; ), where xi is the input example, yi is the class label (+1 or -1). The teacher gave you some additional

information by specifying the costs for di erent mistakes C+ and C , where C+ and C stand for the cost of misclassifying a positive and negative example respectively.

1. How will you modify the Soft-margin SVM formulation to be able to leverage this extra information? Please justify your answer.

1. (5 points) Consider the following setting. You are provided with n training examples: (x1; y1; h1); (x2; y2; h2); ; (xn; yn; hn), where xi is the input example, yi is the class label (+1 or -1), and hi > 0 is the importance weight of the example. The teacher gave you some additional information by specifying the importance of each training example.

1. How will you modify the Soft-margin SVM formulation to be able to leverage this extra information? Please justify your answer.

1. How can you solve this learning problem using the standard SVM training algorithm? Please justify your answer.

1. (15 points) You are provided with a set of n training examples: (x1; y1); (x2; y2); ; (xn; yn; ), where xi is the input example, yi is the class label (+1 or -1). Suppose n is very large (say in the order of millions). In this case, standard SVM training algorithms will not scale due to large training set.

Tom wants to devise a solution based on \Coarse-to-Fine” framework of problem solving. The basic idea is to cluster the training data; train a SVM classi er based on the clusters (coarse problem); re ne the clusters as needed ( ne problem); perform training on the ner problem; and repeat until convergence. Suppose we start with k+ positive clusters and k negative clusters to begin with (a cluster is de ned as a set of examples). Please specify the mathematical formulation (de ne all the variables used in your formulation) and concrete algorithm for each of the following steps to instantiate this idea:

a) How to de ne the SVM training formulation for a given level of coarseness: a set of k+

positive clusters and a set of k negative clusters?

1. How to re ne the clusters based on the resulting SVM classi er?

1. What is the stopping criteria?

Optional question: For what kind of problems will this solution fail?

1. (20 points) You are provided with a set of n training examples: (x1; y1); (x2; y2); ; (xn; yn; ), where xi is the input example, yi is the class label (+1 or -1). Suppose n is very large (say in the order of millions). In this case, online kernelized Perceptron algorithms will not scale if the number of allowed support vectors are unbounded.

1. Suppose you have trained using kernelized Perceptron algorithm (without any bounds on support vectors) and got a set of support vectors SV . Tom wants to use this classi er for real-time prediction and cannot a ord more than B kernel evaluations for each classi cation decision. Please give an algorithm to select B support vectors from SV . You need to motivate your design choices in order to convince Tom to use your solution.

1. Tom wants to train using kernelized Perceptron algorithm, but wants to use at most B support vectors during the training process. Please modify the standard kernelized Perceptron training algorithm (from class slides) for this new setting. You need to motivate your design choices in order to convince Tom to use your solution.

1. (50 points) Programming and empirical analysis question.

Implement a binary classi er with both perceptron and passive-aggressive (PA) weight update as shown below.

Algorithm 1 Online Binary-Classi er Learning Algorithm

Input: D = Training examples, T = maximum number of training iterations

Output: w, the nal weight vector

1. Initialize the weights w = 0

2: for each training iteration itr 2 f1; 2; ; T g do

1. for each training example (xt; yt) 2 D do

2. y^t = sign(w xt) // predict using the current weights

1. if mistake then

2. w = w + yt xt // update the weights

1. end if

1. end for

1. end for

1. return nal weight vector w

For standard perceptron, you will use = 1, and for Passive-Aggressive (PA) algorithm, you will compute the learning rate as follows.

 = 1 yt (w xt) (1) kxtk2

You are provided with the income data. See https://archive.ics.uci.edu/ml/datasets/ census+income for a description of di erent features. You are provided with a training set, development (held out) set, and testing set. The format of the le is as follows. Each line is one classi cation example: sequence of feature values with the class label at end ( 50K or >50K) in comma separated values (CSV) format.

Perceptron, Passive-Aggressive, and Averaged Classi er

Convert all features into binary format (0 or 1). Describe your conversion. How many features do you have (i.e., the dimensionality)?

Implement Perceptron and Passive-Aggressive based classi er from these binary features. Compute the online learning curve for both Perceptron and PA algorithm by plotting the number of training iterations (1 to 5) on the x-axis and the number of mistakes on

the y-axis. Compare the two curves and list your observations.

Compute the accuracy of both Perceptron and PA algorithm on the training data, de-velopment data, and testing data for each training iteration (1 to 5). So you will have three accuracy curves for Perceptron and another three accuracy curves for PA algorithm. Compare the six curves and list your observations.

Implement the averaged perceptron algorithm in both the naive way (one I wrote on the white board) and the smart way (Algorithm 7 from Hal’s book chapter). Measure the training time to perform 5 iterations for both implementations. Compute the accuracies on training data, development data, and testing data with averaged classi er and compare with those obtained using standard perceptron. List your observations.

Compute the general learning curve (vary the number of training examples starting from 5000 in the increments of 5000) for 5 training iterations. Plot the number of training examples on x-axis and the accuracy (development and testing) on the y-axis. List your observations from these curves.

Optional things you can try. (a) Shu ing the training examples during each iteration. What did you observe?; and (b) Variable learning rate with a schedule of your choice. What did you observe?

1. (50 points) Support Vector Machines (SVM) Classi er and Kernels

(30 points) Empirical analysis question. You can use a publicly available SVM classi-er implementation (e.g., LibSVM or scikit-learn) for SVM related experiments. LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/ and scikit-learn (http://scikit-learn. org/stable/modules/svm.html).

1. Using a linear kernel, train the SVM on the training data for di erent values of C parameter: 10 4; 10 3; 10 2; 10 1; 100; 101; 102; 103; 104. Compute the training accuracy, validation accuracy, and testing accuracy for the SVM obtained with di erent values of the C parameter. Plot the training accuracy, validation accuracy, and testing accuracy as a function of C (C value on x-axis and Accuracy on y-axis) { one curve each for training, validation, and testing data. Also, plot the number of support vectors (see the training log at the end of SVM training for LibSVM or use the scikit-learn API if applicable) as a function of C. List your observations.

1. Select the best value of hyper-parameter C based on the accuracy on validation set and train a linear SVM on the combined set of training and validation examples. Com-pute the testing accuracy and the corresponding confusion matrix: a 2 2 matrix.

1. Repeat the experiment (a) with the best C value from (a) with polynomial kernel of degree 2, 3, and 4. Compare the training, validation, testing accuracies, and the number of support vectors for di ernt kernels (linear, polynomial kernel of degree 2, polynomial kernel of degree 3, and polynomial kernel of degree 4). List your observations.

(20 points) Programming question. You will implement the standard kernelized Per-ceptron training algorithm (from class slides) for binary classi cation.

(a) Train the binary classi er for 5 iterations with polynomial kernel (pick the best degree out of 2, 3, and 4 from the above experiment). Plot the number of mistakes as a function of training iterations. Compute the training, development, and testing accuracy at the end of 5 iterations.

Instructions for Code Submission and Output Format.

Please follow the below instructions. It will help us in grading your programming part of the homework. We will provide a dropbox folder link for code submission.

Supported programming languages: Python, Java, C++

Store all the relevant les in a folder and submit the corresponding zip le named after your student-id, e.g., 114513209.zip

This folder should have a script le named

run_code.sh

Executing this script should do all the necessary steps required for executing the code including compiling, linking, and execution

Assume relative le paths in your code. Some examples: ‘‘./filename.txt’’ or ‘‘../hw1/filename.txt’’

The output of your program should be dumped in a le named \output.txt” Make sure the output.txt le is dumped when you execute the script

run_code.sh

Zip the entire folder and submit it as <student_id>.zip

Each question in the students work will be assigned a letter grade of either A,B,C,D, or F by the

Instructor and TAs. This ve-point (discrete) scale is described as follows:

A) Exemplary (=100%).

Solution presented solves the problem stated correctly and meets all requirements of the prob-lem.

Solution is clearly presented.

Assumptions made are reasonable and are explicitly stated in the solution.

Solution represents an elegant and e ective way to solve the problem and is not overly com-plicated than is necessary.

B) Capable (=75%).

Solution is mostly correct, satisfying most of the above criteria under the exemplary category, but contains some minor pitfalls, errors/ aws or limitations.

C) Needs Improvement (=50%).

Solution demonstrates a viable approach toward solving the problem but contains some major pitfalls, errors/ aws or limitations.

D) Unsatisfactory (=25%)

Critical elements of the solution are missing or signi cantly awed.

Solution does not demonstrate su cient understanding of the problem and/or any reasonable directions to solve the problem.

F) Not attempted (=0%) No solution provided.

The points on a given homework question will be equal to the percentage assigned (given by the letter grades shown above) multiplied by the maximum number of possible points worth for that question. For example, if a question is worth 6 points and the answer is awarded a B grade, then that implies 4.5 points out of 6.