Description
Instructions
Submit your write-up on Gradescope as a neatly typeset (not scanned nor handwritten) PDF document by 11:59 PM of the due date.
On Gradescope, be sure to select the pages containing your answer for each problem. More details can be found on the Gradescope Student Workflow help page:
(If you don’t select the pages containing your answer to a problem, you’ll receive a zero for that problem.)
Make sure your name and your UNI appears prominently on the first page of your write-up.
Source code
Please combine all requested source code files into a single ZIP file^{1}, along with a plain text file called README that contains your name and briefly describes all of the other files in the ZIP file. Do not include the data files. Submit this ZIP file on Courseworks.
Clarity and precision
One of the goals in this class is for you to learn to reason about machine learning problems and algorithms.
To reason about these things, you must be able to make clear and precise claims and arguments about them.
A clear and precise argument is not the same as a long, excessively detailed argument. Unnecessary details and irrelevant side-remarks often make an argument less clear. And non-factual statements also detract from the clarity of an argument.
Points may be deducted for answers and arguments that lack suﬃcient clarity or precision. Moreover, a time-economical attempt will be made to interpret such answers/arguments, and the grade you receive will be based on this interpretation.
Problem 1 (30 points)
In this problem, you will practice using linear regression on a simple data set.
Download the wine data set wine.mat from the course website. The first feature (given in data) is a constant feature, always equal to one; I have already applied “aﬃne expansion” to the data. The next d = 11 features are fixed acidity, volatile acidity, . . . , alcohol as described in http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names. I have applied a “standardization” transformation to these features so that, on the training data, the empirical mean of each feature is zero, and the empirical standard deviation of each feature is one; the same transformation was applied to the test data. The output (given in labels) is quality, the quality grade of the wine (an integer between 0 and 10).
Ordinary least squares
Compute the ordinary least squares estimator based on the training data (i.e., the squared loss ERM). You can use any software package you like to do this provided that you include proper citations.^{2} Make sure
the solution satisfies the appropriate normal equations! This is a way to check if you’ve done things correctly. Compute the test squared loss risk on the test data (testdata and testlabels).
Sparse linear predictor
ˆ | ˆ | ˆ | ˆ | ˆ | for 1 ≤ i ≤ d are | |
Write a program to find a coeﬃcient vector β | = (β_{0} | , β_{1} | , . . . , β_{d}) where at most three of the β_{i} | |||
ˆ | ||||||
non-zero, with smallest empirical risk on the training data. (The coeﬃcient β_{0} corresponds to the always-one | ||||||
feature and is also allowed to be non-zero.) This can be done by enumerating over all | ^{d}3 | triplets of the d | ||||
features. For this chosen coeﬃcient vector, compute the test squared loss risk on the | test data. | |||||
For each of the (at most) three variables v ∈ {fixed acidity, volatile acidity, . . . , alcohol} with non-
ˆ
zero coeﬃcients in β, determine the two variables (diﬀerent from v) with highest and second-highest (in absolute value) sample correlation with variable v, as based on the test data. Record both the variable names and the corresponding correlation values.
What to submit in your write-up:
- Test risks of the ordinary least squares estimator and of the sparse linear predictor.
- Names (e.g., fixed acidity, volatile acidity) of the variables with non-zero coeﬃcients in the
ˆ
sparse linear predictor, along with the coeﬃcient values. Also give the value of the coeﬃcient β_{0}.
- For each variable v with non-zero coeﬃcient, the names of the two other variables most correlated (in absolute value) with variable v, along with the corresponding correlation values.
Please submit your source code on Courseworks.
Problem 2 (35 points)
In this problem, you will study an “approximation error-vs-variance” trade-oﬀ in the fixed design setting (which was covered in the reading assignment).
Suppose you are given n distinct real numbers z_{1}, . . . , z_{n} ∈ [−π, π], and you observe the realizations of n uncorrelated real-valued random variables Y_{1}, . . . , Y_{n}, where E(Y_{i}) = h(z_{i}) for some unknown continuous function h: R → R, and var(Y_{i}) = 1 for all i = 1, . . . , n. Taking inspiration from the Weierstrass approximation theorem, you construct a feature vector for each z_{i} using a degree-k polynomial expansion (for some k ∈ N),
x | := (1, z | , z^{2}, . . . , z^{k}) | for each i = 1, . . . , n, | |||
i | i | i | i | |||
ˆ | ∈ R | k+1 | > ˆ | |||
and use ordinary least squares to construct β | to approximate each h(z_{i}) with x_{i} | β. |
(a) Suppose it turns out that
h(z_{i}) = cos(z_{i}) for all i = 1, . . . , n.
Derive an upper-bound on the expected (fixed design) risk
E | “_{n} | _{ } | (x_{i} β − E(Y_{i})) | 2 | |
1 | n | ^{>} ˆ | |||
X | |||||
i=1
#
.
The bound should be expressed as a simple function of k and n. Give detailed justifications for each step in your derivation. (It is fine to use asymptotic notations, like O(1/n).)
Hint: Use Taylor’s (remainder) theorem.
- If k is chosen (as a function of n) to make the expected risk from part (a) as small as possible, what is this best possible expected risk? Your answer should be expressed only in terms of n. Briefly justify your answer. (It is fine to use asymptotic notations, like O(1/n).)
- Let n = 1001, and let z_{1}, . . . , z_{n} be uniformly spaced in the interval [−π, π] (with z_{1} = −π and z_{n} = π). Simulate the random variables Y_{1}, . . . , Y_{n}, where Y_{i} = sin(3z_{i}/2) + ε_{i} for each i, and ε_{1}, . . . , ε_{n} are iid N(0, 1)-random variables. For each k = 1, . . . , 20, compute the predictions of E(Y_{i}) obtained using ordinary least squares on the degree-k polynomial expansion, along with the fixed design risk
1 | n | > ˆ | 2 |
X | (x_{i} β − E(Y_{i})) . | ||
n | i=1 | ||
Repeat this process 1000 times, and average the 1000 resulting fixed design risks (to estimate the expected fixed design risk) for each k. Report these average fixed design risks for each k, both in a table and in a plot. (Make sure the table and plot are clearly labeled.)
Please submit your source code on Courseworks.
Problem 3 (35 points)
In this problem, you will carry out a simulation that illustrates a phenomenon called Freedman’s paradox.
Download the data set freedman.mat from the course website. The n × d matrix in data contains n feature vectors x^{(1)}, . . . , x^{(n)} ∈ R^{d} (as rows), and the n × 1 vector in labels contains the corresponding labels y^{(1)}, . . . , y^{(n)} ∈ R. Note that n < d, so we don’t expect ordinary least squares to work well.
Note that the features and labels are approximately standardized (i.e., mean zero and variance one).
Consider the following three-step procedure:
- Compute the following estimates of the correlations between the features and the label:
ρˆ_{j} := ^{1} | n | _{x}_{j}^{(i)}_{y}(i) | for j = 1, . . . , d. | |
X | ||||
n | i=1 | |||
√
- Let Jb:= {j_{√}∈ {1, . . . , d} : |ρˆ_{j}| > 1.75/ n} be the features for which the estimated correlation is larger than 1.75/ n, and discard the features not in J
ˆ | d | ||||||||||||||||||
3. Construct the ordinary least squares estimate β(J) ∈ R | using only features in J. In other words, | ||||||||||||||||||
ˆ | b | 1 | n | > (i) | (i) 2 | b | |||||||||||||
β | J | ∈ | min | X | β x | y | . | ||||||||||||
( | ) ∈ | arg | n | ( | − | ) | |||||||||||||
d | |||||||||||||||||||
b | β_{j} = 0 for all j / J | i=1 | |||||||||||||||||
β | R : |
∈ b
Steps 1 & 2 comprise a screening procedure whose purpose is to remove irrelevant features. If the number of features k := |Jb| remaining after the screening is much smaller than n, then one might think that ordinary least squares (using just the features in Jb) will work well.
(a) How many features remain (in Jb) after the screening? (It should be much smaller than n.)
ˆ
(b) What is the empirical risk of β(Jb), i.e.,
X | ˆ | b | > | (i) | (i) 2 | ||||
1 ^{n} | |||||||||
(β(J) | x | − y | ) | ? | |||||
n |
i=1
(It should be much smaller than the empirical variance of the labels.)
ˆ ˆ _{{} _{}}
(c) Construct β(I) in the same manner as β(Jb) in Step 3 above, except using I := 1, . . . , 75 in place of
ˆ
Jb. What is the empirical risk of β(I)? (It should be higher than what you got in part (b).)
(d) It turns out that the features and labels are drawn independently from N(0, 1)—i.e., there is no real
ˆ
correlation between the features and labels. Yet based on the relatively small empirical risk of β(Jb), ˆ
together with the small number of features used in β(Jb) relative to the sample size n, you might have thought that there is a relationship between the features and labels! If this seems unbelievable to you, write some code to generate the data yourself, and vary the numerical parameters of the simulation (e.g., n and d) to see how robust the phenomenon is to changes to these parameters. Try to explain why this happens, in your own words.
(e) Suppose you were able to carry out Step 3 using a new, independent but identically distributed set of
ˆ
data. What would you expect β(Jb) to be? And what would you expect its empirical risk (on this new data set) to be? Approximate answers are fine, but briefly justify your answers.
Please submit your source code on Courseworks.
4