Homework 2 Solution






Submit your write-up on Gradescope as a neatly typeset (not scanned nor handwritten) PDF document by 11:59 PM of the due date.


On Gradescope, be sure to select the pages containing your answer for each problem. More details can be found on the Gradescope Student Workflow help page:



(If you don’t select the pages containing your answer to a problem, you’ll receive a zero for that problem.)


Make sure your name and your UNI appears prominently on the first page of your write-up.



Source code


Please combine all requested source code files into a single ZIP file1, along with a plain text file called README that contains your name and briefly describes all of the other files in the ZIP file. Do not include the data files. Submit this ZIP file on Courseworks.


Clarity and precision


One of the goals in this class is for you to learn to reason about machine learning problems and algorithms.


To reason about these things, you must be able to make clear and precise claims and arguments about them.


A clear and precise argument is not the same as a long, excessively detailed argument. Unnecessary details and irrelevant side-remarks often make an argument less clear. And non-factual statements also detract from the clarity of an argument.


Points may be deducted for answers and arguments that lack sufficient clarity or precision. Moreover, a time-economical attempt will be made to interpret such answers/arguments, and the grade you receive will be based on this interpretation.



Problem 1 (30 points)


In this problem, you will practice using linear regression on a simple data set.


Download the wine data set wine.mat from the course website. The first feature (given in data) is a constant feature, always equal to one; I have already applied “affine expansion” to the data. The next d = 11 features are fixed acidity, volatile acidity, . . . , alcohol as described in http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names. I have applied a “standardization” transformation to these features so that, on the training data, the empirical mean of each feature is zero, and the empirical standard deviation of each feature is one; the same transformation was applied to the test data. The output (given in labels) is quality, the quality grade of the wine (an integer between 0 and 10).


Ordinary least squares


Compute the ordinary least squares estimator based on the training data (i.e., the squared loss ERM). You can use any software package you like to do this provided that you include proper citations.2 Make sure


the solution satisfies the appropriate normal equations! This is a way to check if you’ve done things correctly. Compute the test squared loss risk on the test data (testdata and testlabels).


Sparse linear predictor


ˆ ˆ ˆ ˆ ˆ for 1 ≤ id are
Write a program to find a coefficient vector β = (β0 , β1 , . . . , βd) where at most three of the βi
non-zero, with smallest empirical risk on the training data. (The coefficient β0 corresponds to the always-one
feature and is also allowed to be non-zero.) This can be done by enumerating over all d3 triplets of the d
features. For this chosen coefficient vector, compute the test squared loss risk on the test data.

For each of the (at most) three variables v ∈ {fixed acidity, volatile acidity, . . . , alcohol} with non-


zero coefficients in β, determine the two variables (different from v) with highest and second-highest (in absolute value) sample correlation with variable v, as based on the test data. Record both the variable names and the corresponding correlation values.


What to submit in your write-up:


  • Test risks of the ordinary least squares estimator and of the sparse linear predictor.


  • Names (e.g., fixed acidity, volatile acidity) of the variables with non-zero coefficients in the


sparse linear predictor, along with the coefficient values. Also give the value of the coefficient β0.

  • For each variable v with non-zero coefficient, the names of the two other variables most correlated (in absolute value) with variable v, along with the corresponding correlation values.


Please submit your source code on Courseworks.


Problem 2 (35 points)


In this problem, you will study an “approximation error-vs-variance” trade-off in the fixed design setting (which was covered in the reading assignment).


Suppose you are given n distinct real numbers z1, . . . , zn ∈ [−π, π], and you observe the realizations of n uncorrelated real-valued random variables Y1, . . . , Yn, where E(Yi) = h(zi) for some unknown continuous function h: R → R, and var(Yi) = 1 for all i = 1, . . . , n. Taking inspiration from the Weierstrass approximation theorem, you construct a feature vector for each zi using a degree-k polynomial expansion (for some k ∈ N),

x := (1, z , z2, . . . , zk) for each i = 1, . . . , n,  
i i i   i  
ˆ ∈ R k+1 > ˆ
and use ordinary least squares to construct β to approximate each h(zi) with xi β.

(a) Suppose it turns out that

h(zi) = cos(zi)      for all i = 1, . . . , n.



Derive an upper-bound on the expected (fixed design) risk

E n   (xi β − E(Yi)) 2
1 n > ˆ







The bound should be expressed as a simple function of k and n. Give detailed justifications for each step in your derivation. (It is fine to use asymptotic notations, like O(1/n).)


Hint: Use Taylor’s (remainder) theorem.


  • If k is chosen (as a function of n) to make the expected risk from part (a) as small as possible, what is this best possible expected risk? Your answer should be expressed only in terms of n. Briefly justify your answer. (It is fine to use asymptotic notations, like O(1/n).)


  • Let n = 1001, and let z1, . . . , zn be uniformly spaced in the interval [−π, π] (with z1 = −π and zn = π). Simulate the random variables Y1, . . . , Yn, where Yi = sin(3zi/2) + εi for each i, and ε1, . . . , εn are iid N(0, 1)-random variables. For each k = 1, . . . , 20, compute the predictions of E(Yi) obtained using ordinary least squares on the degree-k polynomial expansion, along with the fixed design risk


1 n > ˆ 2
X (xi β − E(Yi)) .
n i=1


Repeat this process 1000 times, and average the 1000 resulting fixed design risks (to estimate the expected fixed design risk) for each k. Report these average fixed design risks for each k, both in a table and in a plot. (Make sure the table and plot are clearly labeled.)


Please submit your source code on Courseworks.


Problem 3 (35 points)


In this problem, you will carry out a simulation that illustrates a phenomenon called Freedman’s paradox.


Download the data set freedman.mat from the course website. The n × d matrix in data contains n feature vectors x(1), . . . , x(n) ∈ Rd (as rows), and the n × 1 vector in labels contains the corresponding labels y(1), . . . , y(n) ∈ R. Note that n < d, so we don’t expect ordinary least squares to work well.

Note that the features and labels are approximately standardized (i.e., mean zero and variance one).


Consider the following three-step procedure:


  1. Compute the following estimates of the correlations between the features and the label:
ρˆj := 1 n xj(i)y(i) for j = 1, . . . , d.
n i=1


  1. Let Jb:= {j∈ {1, . . . , d} : |ρˆj| > 1.75/ n} be the features for which the estimated correlation is larger than 1.75/ n, and discard the features not in J
ˆ d  
3. Construct the ordinary least squares estimate β(J) ∈ R using only features in J. In other words,
ˆ b 1 n   > (i) (i) 2 b
β   J   min X β x   y   .
( ) ∈ arg n   ( )
b βj = 0 for all j / J   i=1
β   R :

∈ b


Steps 1 & 2 comprise a screening procedure whose purpose is to remove irrelevant features. If the number of features k := |Jb| remaining after the screening is much smaller than n, then one might think that ordinary least squares (using just the features in Jb) will work well.

(a) How many features remain (in Jb) after the screening? (It should be much smaller than n.)



(b) What is the empirical risk of β(Jb), i.e.,


X ˆ b > (i) (i) 2
(β(J) x   y ) ?



(It should be much smaller than the empirical variance of the labels.)

ˆ                                                      ˆ                                                                            {                 }

(c) Construct β(I) in the same manner as β(Jb) in Step 3 above, except using I :=  1, . . . , 75  in place of


Jb. What is the empirical risk of β(I)? (It should be higher than what you got in part (b).)


(d) It turns out that the features and labels are drawn independently from N(0, 1)—i.e., there is no real


correlation between the features and labels. Yet based on the relatively small empirical risk of β(Jb), ˆ


together with the small number of features used in β(Jb) relative to the sample size n, you might have thought that there is a relationship between the features and labels! If this seems unbelievable to you, write some code to generate the data yourself, and vary the numerical parameters of the simulation (e.g., n and d) to see how robust the phenomenon is to changes to these parameters. Try to explain why this happens, in your own words.


(e) Suppose you were able to carry out Step 3 using a new, independent but identically distributed set of


data. What would you expect β(Jb) to be? And what would you expect its empirical risk (on this new data set) to be? Approximate answers are fine, but briefly justify your answers.

Please submit your source code on Courseworks.








error: Content is protected !!