Homework 3 Solution






Submit your write-up on Gradescope as a neatly typeset (not scanned nor handwritten) PDF document by 11:59 PM of the due date.


On Gradescope, be sure to select the pages containing your answer for each problem. More details can be found on the Gradescope Student Workflow help page:



(If you don’t select the pages containing your answer to a problem, you’ll receive a zero for that problem.)


Make sure your name and your UNI appears prominently on the first page of your write-up.



Source code


Please combine all requested source code files into a single ZIP file1, along with a plain text file called README that contains your name and briefly describes all of the other files in the ZIP file. Do not include the data files. Submit this ZIP file on Courseworks.


Clarity and precision


One of the goals in this class is for you to learn to reason about machine learning problems and algorithms.


To reason about these things, you must be able to make clear and precise claims and arguments about them.


A clear and precise argument is not the same as a long, excessively detailed argument. Unnecessary details and irrelevant side-remarks often make an argument less clear. And non-factual statements also detract from the clarity of an argument.


Points may be deducted for answers and arguments that lack sufficient clarity or precision. Moreover, a time-economical attempt will be made to interpret such answers/arguments, and the grade you receive will be based on this interpretation.


Problem 1 (40 points)


In this problem, you will implement the Online Perceptron algorithm with an online-to-batch conversion, and evaluate it on a classification task with a few different data representations.


Restaurant review data set


Download the review data set reviews_tr.csv (training data) and reviews_te.csv (test data) from Courseworks. This data set is comprised of reviews of restaurants in Pittsburgh; the label indicates whether or not the reviewer-assigned rating is at least four (on a five-point scale). The data are in CSV format (where the first line is the header); the first column is the label (label; 0 or 1), and the second column is the review text (text). The text has been processed to remove non-alphanumeric symbols and capitalization.


Data representations


In this problem, you will experiment with the following four different data representations.


  1. Unigram representation (i.e., term frequency).


In this representation, there is a feature for every word t, and the feature value associated with a word t in a document d is


tf(t; d) := number of times word t appears in document d.


  1. Term frequency-inverse document frequency (tf-idf).


This is like the unigram representation, except the feature associated with a word t in a document d from a collection of documents D (e.g., training data) is


tf(t; d) × log10(idf(t; D)),


where tf(t; d) is as defined above, and



idf(t; D) := number of documents in D that contain word t .

This representation puts more emphasis on rare words and less emphasis on common words. (There are many variants of tf-idf that are unfortunately all referred to by the same name.)


Note: When you apply this representation to a new document (e.g., a document in the test set), you should still use the idf defined with respect to D. This, however, becomes problematic if a word t appears in a new document but did not appear in any document in D: in this case, idf(t; D) = |D|/0 = ∞. It is not obvious what should be done in these cases. For this homework assignment, simply ignore words t that do not appear in any document in D.


  1. Bigram representation.


In addition to the unigram features, there is a feature for every pair of words (t1, t2) (called a bigram), and the feature value associated with a bigram (t1, t2) in a given document d is


tf((t1, t2); d) := number of times bigram (t1, t2) appears consecutively in document d.


In the sequence of words “a rose is a rose”, the bigrams that appear are: (a, rose), which appears twice; (rose, is); and (is, a).


  1. One more data representation of your own choosing (e.g., trigrams, skip-grams). It should be non-trivially different from the representations above (e.g., not just bigrams plus a few extra features).


In all four of these representations, include an “intercept” feature whose value is always equal to one (as in affine expansion).


Online Perceptron with online-to-batch conversion


Implement the Online Perceptron algorithm with the following online-to-batch conversion process (similar to one suggested in lecture):


  1. Run Online Perceptron to make two passes through the training data. Before each pass, randomly shuffle the order of the training examples. Note that a total of 2n + 1 linear classifiers wˆ1, . . . , wˆ2n+1 are created during the run of the algorithm, where n is the number of training examples.


  1. Return the linear classifier wˆfinal given by the simple average of the final n + 1 linear classifiers:


1 2n+1
wˆfinal := wˆi.
n + 1

Of course, you should not use (or even look at the source code for) any existing implementation of Perceptron or Online Perceptron; that would defeat the purpose of this assignment. You can, however, use existing library functions for basic operations needed in Online Perceptron (e.g., vector inner products), and also for loading the CSV files and creating the desired data representations. (You are responsible for determining whether or not an existing library function correctly implements the desired data representation.) As always, provide proper citations for any software packages you use.


Recall that as Online Perceptron is making its pass through the training examples, it only updates its weight vector in rounds in which it makes a prediction error. So, for example, if wˆ1 does not make a mistake in classifying the first training example, then wˆ2 = wˆ1. Use this fact to keep the memory usage of your code relatively modest.


You will be using this learning algorithm with the restaurant review data represented in four ways described above. The total number of features with each representation can be very large. However, because each review has only a relatively small numbers of words, it should still be possible to compactly represent the data. Use this fact to keep the running time of your code close to linear in the size of the training data.


Do the following with each of the four data representations above. Run your code on the training data to learn a linear classifier wˆfinal. Compute the training error rate of wˆfinal on this training data, and compute the test error rate of wˆfinal on the test data.


Post-hoc analysis


Now consider the classifier based on the unigram representation. To get a sense as to its behavior, determine the 10 words that have the highest (i.e., most positive) weights; also determine the 10 words that have the lowest (i.e., most negative) weights.


Pick two training examples that this classifier (based on unigram representation) incorrectly classifies. For each of these examples x = (x1, . . . , xd), identify the 10 words with the highest (i.e., most positive) wixi value, and also the 10 words with the lowest (i.e., most negative) wixi value. Attempt to explain why the classifier incorrectly classifies these examples.


What to submit in your write-up:


  • A precise specification of your fourth chosen data representation. Briefly explain why you chose this representation. Provide proper citations of any sources you used to come up with this representation.
  • Training error rates of the linear classifiers with all four data representations.


  • Test error rates of the linear classifiers with all four data representations.


  • The lists of words with highest/lowest weights, described above. (Sort each list alphabetically.)


  • The text of two misclassified examples, with associated lists of words, and explanations for why the examples were misclassified.


Please submit your source code on Courseworks.

Problem 2 (30 points)


In this problem, you’ll analyze an algorithm that finds a large margin linear separator whenever one exists.


Recall that the Perceptron algorithm quickly finds a linear separator for a data set S whenever there exists a large margin linear separator for S. But the linear separator returned by Perceptron is not necessarily one that achieves a large margin on S.


An alternative is the following algorithm, which we call Margin Perceptron. This algorithm, like Perceptron, takes as input a collection S of labeled examples from Rd × {−1, +1}).

  • Begin with wˆ1 := 0 ∈ Rd.
  • For t = 1, 2, . . .:

If there is a labeled example in S (call it (xt, yt)) such that ytwˆt>xt < 1, then set wˆt+1 := wˆt +ηytxt.

Else, return wˆt.


Above, there are two differences relative to the original Perceptron algorithm. The first is the criteria under which an example in S is used to perform an update: ytwˆt>xt < 1 (rather than ≤ 0). The second is the update itself: wˆt+1 := wˆt + ηytxt. Here, η > 0 is a parameter of the algorithm, which we’ll assume is set as



η :=



where L :=  max kxk2.



If Margin Perceptron terminates, it returns wˆt satisfying


ywˆt>x ≥ 1    for all (x, y) ∈ S.


But this fact alone does not mean that wˆt achieves a large margin on S. After all, if w is any linear separator satisfying yw>x > 0 for all (x, y) ∈ S, we can construct another w˜ satisfying yw˜>x ≥ 1 for all (x, y) ∈ S, simply by letting w˜ be a particular scaling of w.


(a) Assume that there exists a vector w? ∈ Rd such that


min  yw?>x = 1.



Mimic the proof of the Perceptron convergence theorem to prove that Margin Perceptron halts after at most 3kw?k22L2 loop iterations. Clearly explain every step of the proof, especially where it differs from that of the original Perceptron convergence theorem.


(b) Under the same assumption as in Part (a), prove that Margin Perceptron returns wˆt satisfying


kwˆtk2 ≤ 3kw?k2.


Hint: Use the claim you proved in Part (a) (and, possibly, part of its proof).


  • Very briefly explain why this means that the margin achieved by wˆt is (almost) as large as the margin achieved by w?.


  • (Optional.) For any given > 0, explain how to modify Margin Perceptron so that, under the same conditions as in the previous parts, it returns wˆt satisfying


kwˆtk2 ≤ (2 + )kw?k2.


Problem 3 (30 points)


In this problem, you’ll use Lagrange duality to derive the “dual” forms of ridge regression and a variant of the soft-margin SVM problem.


The ridge regression optimization problem can be written in a form that is reminiscent of the soft-margin


SVM problem (for C > 0):


1 X
min w 2 + n ξ2
2 k 2 i=1
w∈Rd∈Rn   k2 i
s.t. ξi = yi xi>w for all i = 1, . . . , n.

Above, ξ = (ξ1, . . . , ξn) ∈ Rn has a similar role as the slack variables in soft-margin SVM, but note that they are not required to be non-negative here.


The Lagrangian function associated with this optimization problem is


1 C n n
  X X
L(w, ξ, α) = kwk22 + ξi2 +    αi(yi xi>w ξi).
2 2 i=1


Here, α = (α1, . . . , αn) ∈ Rn are the Lagrange multipliers, which are unconstrained.


  • Fix α ∈ Rn. Derive expressions for the w and ξ that, together, minimize L(w, ξ, α). The expressions should be given in terms of α (as well as the data and C). Briefly justify each step of your derivations.


  • The dual objective function g : Rn → R is equal to the Lagrangian function evaluated at wα, ξα, α, where wα and ξα are the values of w and ξ from Part (a) that minimize L:


g(α) = L(wα, ξα, α) =          min       L(w, ξ, α).



Derive an expression for the α that maximizes g(α). The expression should be given in terms of the data and C. Moreover, the expression should only involve the xi’s through the matrix K ∈ Rn×n whose (i, j)-th entry is x>ixj. Briefly justify each step of your derivation.


  • (Optional.) A variant of the soft-margin SVM problem that is similar to ridge regression is the following optimization problem (for C > 0):


1 X
min w 2 + n ξ2
2 k 2 i=1
w∈Rd∈Rn   k2 i
s.t. yixi>w ≥ 1 − ξi for all i = 1, . . . , n.


This is almost the same as soft-margin SVM, except the slack variables are squared in the objective. The Lagrangian is


1 C n n
  X X
L(w, ξ, α) = kwk22 + ξi2 +    αi(1 − yixi>w ξi).
2 2 i=1


However, since the constraints are inequality constraints, the associated Lagrange multipliers α ∈ Rn are constrained to be non-negative.


Derive expressions for the w and ξ that, together, minimize L(w, ξ, α), and then derive an expression for the dual objective g(α) = minwRdRn L(w, ξ, α). Explain why the dual objective only depends on the xi’s through the matrix K ∈ Rn×n whose (i, j)-th entry is x>ixj.







error: Content is protected !!