Description
Instructions
Submit your write-up on Gradescope as a neatly typeset (neither scanned nor handwritten) PDF document by 11:59 PM of the due date.
On Gradescope, be sure to select the pages containing your answer for each problem. More details can be found on the Gradescope Student Workflow help page:
(If you don’t select the pages containing your answer to a problem, you’ll receive a zero for that problem.)
Make sure your name and your UNI appear prominently on the first page of your write-up.
Source code
Please combine all requested source code files into a single ZIP file, along with a plain text file called README that contains your name and briefly describes all of the other files in the ZIP file. Do not include the data files. Submit this ZIP file on Courseworks.
Clarity and precision
One of the goals in this class is for you to learn to reason about machine learning problems and algorithms.
To reason about these things, you must be able to make clear and precise claims and arguments about them.
A clear and precise argument is not the same as a long, excessively detailed argument. Unnecessary details and irrelevant side-remarks often make an argument less clear, and non-factual statements also detract from its clarity.
Points may be deducted for answers and arguments that lack sufficient clarity or precision. Moreover, a time-economical attempt will be made to interpret such answers/arguments, and the grade you receive will be based on this interpretation.
Problem 1 (40 points)
In this problem, you will implement the Online Perceptron algorithm with an online-to-batch conversion, and evaluate it on a classification task with a few different data representations.
Restaurant review data set
Download the review data set reviews_tr.csv (training data) and reviews_te.csv (test data) from Courseworks. This data set comprises reviews of restaurants in Pittsburgh; the label indicates whether or not the reviewer-assigned rating is at least four (on a five-point scale). The data are in CSV format (where the first line is the header); the first column is the label (label; 0 or 1), and the second column is the review text (text). The text has been processed to remove non-alphanumeric symbols and capitalization.
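For concreteness, the CSV files described above can be read with Python's standard csv module (the function name here is our own; the column names label and text come from the header):

```python
import csv

def load_reviews(path):
    """Load a reviews CSV whose header has columns `label` and `text`."""
    labels, texts = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            labels.append(int(row["label"]))  # 0 or 1
            texts.append(row["text"])
    return labels, texts
```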
Data representations
In this problem, you will experiment with the following four different data representations.
- Unigram representation (i.e., term frequency).
In this representation, there is a feature for every word t, and the feature value associated with a word t in a document d is
tf(t; d) := number of times word t appears in document d.
- Term frequency-inverse document frequency (tf-idf).
This is like the unigram representation, except the feature associated with a word t in a document d from a collection of documents D (e.g., training data) is
tf(t; d) × log_{10}(idf(t; D)),
where tf(t; d) is as defined above, and
idf(t; D) := |D| / (number of documents in D that contain word t).
This representation puts more emphasis on rare words and less emphasis on common words. (There are many variants of tf-idf that are unfortunately all referred to by the same name.)
Note: When you apply this representation to a new document (e.g., a document in the test set), you should still use the idf defined with respect to D. This, however, becomes problematic if a word t appears in a new document but did not appear in any document in D: in this case, idf(t; D) = |D|/0 = ∞. It is not obvious what should be done in these cases. For this homework assignment, simply ignore words t that do not appear in any document in D.
- Bigram representation.
In addition to the unigram features, there is a feature for every pair of words (t_{1}, t_{2}) (called a bigram), and the feature value associated with a bigram (t_{1}, t_{2}) in a given document d is
tf((t_{1}, t_{2}); d) := number of times bigram (t_{1}, t_{2}) appears consecutively in document d.
In the sequence of words “a rose is a rose”, the bigrams that appear are: (a, rose), which appears twice; (rose, is); and (is, a).
- One more data representation of your own choosing (e.g., trigrams, skip-grams). It should be non-trivially different from the representations above (e.g., not just bigrams plus a few extra features).
In all four of these representations, include an “intercept” feature whose value is always equal to one (as in affine expansion).
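As a sketch of the first three representations, plain dictionaries work as sparse vectors (the function names and the "__intercept__" key are our own choices, not prescribed here; if you use a library's tf-idf instead, check it against these exact definitions, since variants abound):

```python
import math
from collections import Counter

def unigram_tf(text):
    """Sparse unigram term-frequency vector, plus an always-one intercept feature."""
    feats = Counter(text.split())
    feats["__intercept__"] = 1  # affine expansion
    return feats

def bigram_tf(text):
    """Unigram features plus counts of consecutive word pairs (as tuple keys)."""
    words = text.split()
    feats = unigram_tf(text)
    feats.update(Counter(zip(words, words[1:])))
    return feats

def idf_table(docs):
    """idf(t; D) = |D| / (number of documents in D containing t), for words seen in D."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return {t: len(docs) / c for t, c in df.items()}

def tfidf(text, idf):
    """tf(t; d) * log10(idf(t; D)); words that never appeared in D are ignored."""
    feats = {t: c * math.log10(idf[t])
             for t, c in Counter(text.split()).items() if t in idf}
    feats["__intercept__"] = 1.0
    return feats
```

On the running example “a rose is a rose”, unigram_tf gives a ↦ 2, rose ↦ 2, is ↦ 1, and bigram_tf adds (a, rose) ↦ 2, (rose, is) ↦ 1, (is, a) ↦ 1.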
Online Perceptron with online-to-batch conversion
Implement the Online Perceptron algorithm with the following online-to-batch conversion process (similar to one suggested in lecture):
- Run Online Perceptron to make two passes through the training data. Before each pass, randomly shuffle the order of the training examples. Note that a total of 2n + 1 linear classifiers wˆ_{1}, . . . , wˆ_{2n+1} are created during the run of the algorithm, where n is the number of training examples.
- Return the linear classifier wˆ_{final} given by the simple average of the final n + 1 linear classifiers:
wˆ_{final} := (1/(n + 1)) Σ_{i=n+1}^{2n+1} wˆ_{i}.
Of course, you should not use (or even look at the source code for) any existing implementation of Perceptron or Online Perceptron; that would defeat the purpose of this assignment. You can, however, use existing library functions for basic operations needed in Online Perceptron (e.g., vector inner products), and also for loading the CSV files and creating the desired data representations. (You are responsible for determining whether or not an existing library function correctly implements the desired data representation.) As always, provide proper citations for any software packages you use.
Recall that as Online Perceptron is making its pass through the training examples, it only updates its weight vector in rounds in which it makes a prediction error. So, for example, if wˆ_{1} does not make a mistake in classifying the first training example, then wˆ_{2} = wˆ_{1}. Use this fact to keep the memory usage of your code relatively modest.
You will be using this learning algorithm with the restaurant review data represented in the four ways described above. The total number of features with each representation can be very large. However, because each review has only a relatively small number of words, it should still be possible to compactly represent the data. Use this fact to keep the running time of your code close to linear in the size of the training data.
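Both efficiency tricks above can be combined in one sketch: dicts serve as sparse vectors, and the average of the last n + 1 weight vectors is accumulated lazily rather than by summing full vectors each round. This is an illustrative plain-Python sketch under our own conventions (labels already mapped from {0, 1} to {−1, +1}, and the identity average = wˆ_{2n+1} − (1/(n+1)) Σ_{t} t · update_{t} over second-pass updates is our bookkeeping choice):

```python
import random

def dot(w, x):
    """Sparse inner product; w and x are dicts of feature -> value."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def train_avg_perceptron(examples, seed=0):
    """examples: list of (features, y) with features a dict and y in {-1, +1}.
    Two shuffled passes (shuffles the list in place); returns the average of
    the final n + 1 weight vectors wˆ_{n+1}, ..., wˆ_{2n+1}."""
    rng = random.Random(seed)
    n = len(examples)
    w, v = {}, {}  # v accumulates t * (update at step t) during the second pass
    for pass_idx in range(2):
        rng.shuffle(examples)
        for t, (x, y) in enumerate(examples, start=1):
            if y * dot(w, x) <= 0:  # prediction error: update; otherwise w is unchanged
                for f, val in x.items():
                    w[f] = w.get(f, 0.0) + y * val
                    if pass_idx == 1:
                        v[f] = v.get(f, 0.0) + t * y * val
    # an update at second-pass step t appears in n + 1 - t of the averaged vectors,
    # so the average equals w_final - v / (n + 1)
    return {f: w.get(f, 0.0) - v.get(f, 0.0) / (n + 1) for f in w}
```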
Do the following with each of the four data representations above. Run your code on the training data to learn a linear classifier wˆ_{final}. Compute the training error rate of wˆ_{final} on this training data, and compute the test error rate of wˆ_{final} on the test data.
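The error-rate computation itself is a one-liner given any weight dict; a small self-contained helper (the dict-of-features convention and the tie-breaking choice of counting a zero score as a mistake are our own):

```python
def error_rate(w, examples):
    """Fraction of (features, y) pairs, with y in {-1, +1}, misclassified by w."""
    def dot(w, x):
        return sum(w.get(f, 0.0) * v for f, v in x.items())
    mistakes = sum(1 for x, y in examples if y * dot(w, x) <= 0)
    return mistakes / len(examples)
```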
Post-hoc analysis
Now consider the classifier based on the unigram representation. To get a sense as to its behavior, determine the 10 words that have the highest (i.e., most positive) weights; also determine the 10 words that have the lowest (i.e., most negative) weights.
Pick two training examples that this classifier (based on unigram representation) incorrectly classifies. For each of these examples x = (x_{1}, . . . , x_{d}), identify the 10 words with the highest (i.e., most positive) w_{i}x_{i} value, and also the 10 words with the lowest (i.e., most negative) w_{i}x_{i} value. Attempt to explain why the classifier incorrectly classifies these examples.
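When weights are stored as a dict of word → weight, the two ten-word lists can be extracted with a couple of sorts; a hypothetical helper (the name and k parameter are ours):

```python
def extreme_words(w, k=10):
    """Return (top_k, bottom_k) word lists by weight, each sorted alphabetically."""
    words = sorted(w, key=w.get)  # ascending by weight
    top = sorted(words[-k:])      # k most positive weights, alphabetical order
    bottom = sorted(words[:k])    # k most negative weights, alphabetical order
    return top, bottom
```

If the always-one intercept feature lives in the same dict, you likely want to drop it before ranking.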
What to submit in your write-up:
- A precise specification of your fourth chosen data representation. Briefly explain why you chose this representation. Provide proper citations of any sources you used to come up with this representation.
- Training error rates of the linear classifiers with all four data representations.
- Test error rates of the linear classifiers with all four data representations.
- The lists of words with highest/lowest weights, described above. (Sort each list alphabetically.)
- The text of two misclassified examples, with associated lists of words, and explanations for why the examples were misclassified.
Please submit your source code on Courseworks.
Problem 2 (30 points)
In this problem, you’ll analyze an algorithm that finds a large margin linear separator whenever one exists.
Recall that the Perceptron algorithm quickly finds a linear separator for a data set S whenever there exists a large margin linear separator for S. But the linear separator returned by Perceptron is not necessarily one that achieves a large margin on S.
An alternative is the following algorithm, which we call Margin Perceptron. This algorithm, like Perceptron, takes as input a collection S of labeled examples from R^{d} × {−1, +1}.
- Begin with wˆ_{1} := 0 ∈ R^{d}.
- For t = 1, 2, . . .:
– If there is a labeled example in S (call it (x_{t}, y_{t})) such that y_{t}wˆ_{t}^{>}x_{t} < 1, then set wˆ_{t+1} := wˆ_{t} + ηy_{t}x_{t}.
– Else, return wˆ_{t}.
Above, there are two differences relative to the original Perceptron algorithm. The first is the criterion under which an example in S is used to perform an update: y_{t}wˆ_{t}^{>}x_{t} < 1 (rather than ≤ 0). The second is the update itself: wˆ_{t+1} := wˆ_{t} + ηy_{t}x_{t}. Here, η > 0 is a parameter of the algorithm, which we’ll assume is set as
η := 1/L^{2}, where L := max_{(x,y)∈S} ‖x‖_{2}.
If Margin Perceptron terminates, it returns wˆ_{t} satisfying
ywˆ_{t}^{>}x ≥ 1 for all (x, y) ∈ S.
But this fact alone does not mean that wˆ_{t} achieves a large margin on S. After all, if w is any linear separator satisfying yw^{>}x > 0 for all (x, y) ∈ S, we can construct another w˜ satisfying yw˜^{>}x ≥ 1 for all (x, y) ∈ S, simply by letting w˜ be a particular scaling of w.
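The loop above can be transcribed directly; a plain-Python sketch (the decision to scan for the first margin violator is our own, since the algorithm leaves the choice of (x_{t}, y_{t}) unspecified, and termination on the toy data below follows from linear separability):

```python
def margin_perceptron(points):
    """points: list of (x, y) with x a list of floats and y in {-1, +1}.
    Returns w satisfying y * <w, x> >= 1 for every example in points."""
    d = len(points[0][0])
    L2 = max(sum(v * v for v in x) for x, _ in points)  # L^2, L = max ||x||_2
    eta = 1.0 / L2
    w = [0.0] * d
    while True:
        updated = False
        for x, y in points:
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1:  # margin violated
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                updated = True
                break
        if not updated:
            return w  # every example now has margin at least 1
```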
(a) Assume that there exists a vector w_{⋆} ∈ R^{d} such that
min_{(x,y)∈S} yw_{⋆}^{>}x = 1.
Mimic the proof of the Perceptron convergence theorem to prove that Margin Perceptron halts after at most 3‖w_{⋆}‖_{2}^{2}L^{2} loop iterations. Clearly explain every step of the proof, especially where it differs from that of the original Perceptron convergence theorem.
(b) Under the same assumption as in Part (a), prove that Margin Perceptron returns wˆ_{t} satisfying
‖wˆ_{t}‖_{2} ≤ 3‖w_{⋆}‖_{2}.
Hint: Use the claim you proved in Part (a) (and, possibly, part of its proof).
(c) Very briefly explain why this means that the margin achieved by wˆ_{t} is (almost) as large as the margin achieved by w_{⋆}.
(d) (Optional.) For any given ε > 0, explain how to modify Margin Perceptron so that, under the same conditions as in the previous parts, it returns wˆ_{t} satisfying
‖wˆ_{t}‖_{2} ≤ (2 + ε)‖w_{⋆}‖_{2}.
Problem 3 (30 points)
In this problem, you’ll use Lagrange duality to derive the “dual” forms of ridge regression and a variant of the soft-margin SVM problem.
The ridge regression optimization problem can be written in a form that is reminiscent of the soft-margin
SVM problem (for C > 0):
min_{w∈R^{d}, ξ∈R^{n}}  (1/2)‖w‖_{2}^{2} + (C/2) Σ_{i=1}^{n} ξ_{i}^{2}
s.t.  ξ_{i} = y_{i} − x_{i}^{>}w  for all i = 1, . . . , n.
Above, ξ = (ξ_{1}, . . . , ξ_{n}) ∈ R^{n} has a similar role as the slack variables in soft-margin SVM, but note that they are not required to be non-negative here.
The Lagrangian function associated with this optimization problem is
L(w, ξ, α) = (1/2)‖w‖_{2}^{2} + (C/2) Σ_{i=1}^{n} ξ_{i}^{2} + Σ_{i=1}^{n} α_{i}(y_{i} − x_{i}^{>}w − ξ_{i}).
Here, α = (α_{1}, . . . , α_{n}) ∈ R^{n} are the Lagrange multipliers, which are unconstrained.
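A quick numerical sanity check on this Lagrangian (illustrative only, not part of the requested derivation): whenever ξ is chosen to satisfy the equality constraints, every multiplier term vanishes, so L coincides with the primal objective regardless of α. A small plain-Python sketch:

```python
import random

def lagrangian(w, xi, alpha, X, y, C):
    """L(w, xi, alpha) for the ridge problem with equality constraints."""
    reg = 0.5 * sum(v * v for v in w)
    slack = 0.5 * C * sum(s * s for s in xi)
    resid = [yi - sum(wj * xj for wj, xj in zip(w, x)) - s
             for x, yi, s in zip(X, y, xi)]
    return reg + slack + sum(a * r for a, r in zip(alpha, resid))

rng = random.Random(0)
n, d, C = 4, 3, 2.0
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]
w = [rng.gauss(0, 1) for _ in range(d)]
# feasible slack: xi_i = y_i - x_i^T w, so every constraint term is zero
xi = [yi - sum(wj * xj for wj, xj in zip(w, x)) for x, yi in zip(X, y)]
alpha = [rng.gauss(0, 1) for _ in range(n)]
primal = 0.5 * sum(v * v for v in w) + 0.5 * C * sum(s * s for s in xi)
assert abs(lagrangian(w, xi, alpha, X, y, C) - primal) < 1e-9
```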
(a) Fix α ∈ R^{n}. Derive expressions for the w and ξ that, together, minimize L(w, ξ, α). The expressions should be given in terms of α (as well as the data and C). Briefly justify each step of your derivations.
(b) The dual objective function g : R^{n} → R is equal to the Lagrangian function evaluated at w_{α}, ξ_{α}, α, where w_{α} and ξ_{α} are the values of w and ξ from Part (a) that minimize L:
g(α) = L(w_{α}, ξ_{α}, α) = min_{w∈R^{d}, ξ∈R^{n}} L(w, ξ, α).
Derive an expression for the α that maximizes g(α). The expression should be given in terms of the data and C. Moreover, the expression should only involve the x_{i}’s through the matrix K ∈ R^{n×n} whose (i, j)-th entry is x_{i}^{>}x_{j}. Briefly justify each step of your derivation.
(c) (Optional.) A variant of the soft-margin SVM problem that is similar to ridge regression is the following optimization problem (for C > 0):
min_{w∈R^{d}, ξ∈R^{n}}  (1/2)‖w‖_{2}^{2} + (C/2) Σ_{i=1}^{n} ξ_{i}^{2}
s.t.  y_{i}x_{i}^{>}w ≥ 1 − ξ_{i}  for all i = 1, . . . , n.
This is almost the same as soft-margin SVM, except the slack variables are squared in the objective. The Lagrangian is
L(w, ξ, α) = (1/2)‖w‖_{2}^{2} + (C/2) Σ_{i=1}^{n} ξ_{i}^{2} + Σ_{i=1}^{n} α_{i}(1 − y_{i}x_{i}^{>}w − ξ_{i}).
However, since the constraints are inequality constraints, the associated Lagrange multipliers α ∈ R^{n} are constrained to be non-negative.
Derive expressions for the w and ξ that, together, minimize L(w, ξ, α), and then derive an expression for the dual objective g(α) = min_{w∈R^{d}, ξ∈R^{n}} L(w, ξ, α). Explain why the dual objective only depends on the x_{i}’s through the matrix K ∈ R^{n×n} whose (i, j)-th entry is x_{i}^{>}x_{j}.