# Homework 4 Solution

\$30.00

Category:

## Description

Instructions

Submit your write-up on Gradescope as a neatly typeset (not scanned nor handwritten) PDF document by 11:59 PM of the due date.

On Gradescope, be sure to select the pages containing your answer for each problem. More details can be found on the Gradescope Student Workflow help page:

(If you don’t select the pages containing your answer to a problem, you’ll receive a zero for that problem.)

Source code

Please combine all requested source code files into a single ZIP file1, along with a plain text file called README that contains your name and briefly describes all of the other files in the ZIP file. Do not include the data files. Submit this ZIP file on Courseworks.

Clarity and precision

One of the goals in this class is for you to learn to reason about machine learning problems and algorithms.

To reason about these things, you must be able to make clear and precise claims and arguments about them.

A clear and precise argument is not the same as a long, excessively detailed argument. Unnecessary details and irrelevant side-remarks often make an argument less clear. And non-factual statements also detract from the clarity of an argument.

Points may be deducted for answers and arguments that lack suﬃcient clarity or precision. Moreover, a time-economical attempt will be made to interpret such answers/arguments, and the grade you receive will be based on this interpretation.

Problem 1 (30 points)

In this problem, you will study a case where maximum likelihood fails but empirical risk minimization succeeds.

Consider the following probability distribution P on X × Y, for X = {0, 1}2 and Y = {0, 1}.

 P (X = (x1 , x2)) : x1 = 0 x1 = 1 P (Y = 1 X = (x1, x2)) : x1 = 0 x1 = 1 1− 1− x = 0 | x 2 = 0 0 1 2 2 2 x2 = 1 x2 = 1 1 0 2 2

Above, regard as a small positive number between 0 and 1. Let f? be the Bayes optimal classifier for P . (In this problem, we are concerned with zero-one loss.)

Now also consider the statistical model P for Y | X: P = {PA, PB}, where

 | (5 if x1 = 1; | (1 if x1 = 1. PA(Y = 1  X = (x1, x2)) = 4 if x1 = 0, PB(Y = 1  X = (x1, x2)) = 0 if x1 = 0, 5 1

Note that P/ P, and that distributions in P do not specify the distribution of X (like logistic regression).

Let fA and fB be the Bayes optimal classifiers, respectively, for PA and PB.

The maximum likelihood approach to learning a classifier selects the distribution in P of highest likelihood given training data (which are regarded as an iid sample), then returns the optimal classifier for the chosen distribution (i.e., fA if PA is the maximum likelihood distribution, otherwise fB).

(a) Give simple expressions for the Bayes optimal classifiers for P , PA, and PB. E.g.,

(

f?(x) =       1    if x1 + x2 = 2,

• otherwise.

• What is the risk of each classifier from Part (a) under distribution P ?

• Suppose in the training data, the number of training examples of the form ((x1, x2), y) is equal to Nx1,x2,y, for (x1, x2, y) ∈ {0, 1}3. Give a simple rule for determining the maximum likelihood distribution in P in terms of Nx1,x2,y for (x1, x2, y) ∈ {0, 1}3. Briefly justify your answer.

• If the training data is a typical iid sample from P (with large sample size n), which classifier from Part
• is returned by the maximum likelihood approach? Briefly justify your answer.

• If the training data is an iid sample from P of size n, then how large should n be so that, with probability at least 0.99, Empirical Risk Minimization returns the classifer in {fA, fB} with smallest risk? Briefly justify your answer.

Problem 2 (20 points)

In this problem, you will practice verifying the convexity of certain functions.

• Let D be a finite non-empty subset of Rd, and consider the statistical model {Pθ : θ ∈ Rd} for iid random variables X1, . . . , Xn, where the probability mass function for X1 is given by

exp(θ>x)

pθ(x) = P               ,      x ∈ D,

y D exp(θ>y)

and pθ(x) = 0 for x 6∈ D. Prove that for any x1, . . . , xn ∈ D, the log-likelihood function

!

n

Y

LL(θ) = ln             pθ(xi)          for all θ ∈ Rd

i=1

is concave (i.e., − LL is convex).

• A twice-diﬀerentiable function f : Rd → R is strictly convex if its second-derivative matrix at any x ∈ Rd

is positive definite. Is the function f : R → R given by

 f(x) = exp(−x) for all x ∈ R strictly convex? Explain (with a short proof) why or why not. (c) Let θ ∈ Rd be a non-zero vector. Is the function f : Rd → R given by f(x) = kx − θk22 + 2θ>x + 1  for all x ∈ Rd strictly convex? Explain (with a short proof) why or why not. (d) Let θ ∈ Rd be a non-zero vector. Is the function f : Rd → R given by f(x) = exp(θ>x) for all x ∈ Rd

strictly convex? Explain (with a short proof) why or why not.

Problem 3 (20 points)

In this problem, you will formulate optimization problems as convex optimization problems using only “simple” objective and constraint functions.

Let (x1, y1), . . . , (xn, yn) ∈ Rd × R be given data. Write each of the following optimization problems as a convex optimization problem in standard form in which the objective function and constraint functions are linear or aﬃne functions. You may introduce additional variables and constraints, but the number of variables and constraints should be no more than a polynomial function of n and d (with constant degree). Briefly explain why the optimization problem you formulate has the same optimal solutions as the given one.

 (a) min max x>w − y i| . w∈Rd i=1,…,n | i (b) n min X ), φ(x>w − y w∈Rd i=1 i i where φ: R → R is the function defined by ( t |t| ( ) := 1  if > 1. φ t 0 if t ≤ 1, | | − | | (c) d min X j=1 | w j| w∈Rd s.t. |xi>w − yi| ≤ 1 for i = 1, . . . , n.

(d)

k

X

min                   |r|(j)

w∈Rd,r∈Rn

j=1

s.t.    x>iw yi = ri       for i = 1, . . . , n.

Above, k is a given positive integer between 1 and d. Also, for a vector v = (v1, . . . , vn),

|v|(j) = |vπ(j)|

for the permutation π on {1, . . . , n} that orders the entries of v in non-increasing absolute value:

|v|(1) ≥ |v|(2) ≥ · · · ≥ |v|(n).

Problem 4 (30 points)

In this problem, you will experimentally study convergence behavior of gradient descent for logistic regression.

Let (x1, y1), . . . , (xn, yn) be training examples from Rd × {0, 1} for a binary classification problem. The following optimization problem specifies the logistic regression MLE parameters (with explicit aﬃne expansion):

 β0 1  n n ln (1 + exp( 0 +  i )) − i( 0 +  i  )o ∈R,β∈Rd n i=1 min X β x>β y β x>β  .

• Give concise and unambiguous pseudocode for a gradient descent algorithm that approximately solves this optimization problem. Be explicit about how the gradients are computed. Assume the initial solution, step sizes, and number of iterations are provided as inputs.

• Implement the gradient descent algorithm from part (a), except use β0 = 0 and β = 0 as the initial solution, choose the step sizes using a backtracking line search (with initial step size η = 1)2, and use as many iterations as are required to achieve a prescribed objective value. You can use library functions that implement standard linear algebraic operations and simple functions such as exp and log; of course, you should not use (or look at the source code for) existing implementations of gradient descent or other optimization algorithms. Run your gradient descent code on the data set logreg.mat from the course website (which has training features vectors and labels stored as data and labels, respectively). How many iterations of gradient descent are needed to achieve an objective value that is at most 0.65064?3

• The feature vectors in the data set from logreg.mat are three-dimensional, so they are (relatively) easy to inspect. Investigate the data by plotting it and/or computing some statistics about the features. Do you notice anything peculiar about the features? Use what you discover to design an invertible linear transformation of the feature vectors xi 7→Axi such that running gradient descent on this transformed data (Ax1, y1), . . . , (Axn, yn) reaches an objective value of 0.65064 in (many) fewer iterations. Describe the steps and reasoning in this investigation, as well as your chosen linear transformation (as a 3×3 matrix). How many iterations of gradient descent were required to achieve this stated objective value?

• Create a new version of your gradient descent code with the following changes.

1. Use only the first b0.8nc examples to define the objective function; keep the remaining n − b0.8nc examples as a validation set.4
2. Use the following stopping condition. After every power-of-two (20, 21, 22, etc.) iterations of gradient descent, record the validation error rate (i.e., zero-one loss validation risk) for the linear classifier based on the current (β0, β). If this validation error rate is more than 0.99 times that of the best validation error rate previously computed, and the number of iterations executed is at least 32 (which is somewhat of an arbitrary number), then stop.

Run this modified gradient descent code on the original logreg.mat data, as well as the linearly transformed data (from part (c)). In each case, report: (1) the number of iterations executed, (2) the final objective value, and (3) the final validation error rate.