## Description

In this assignment we will be working with the Boston Houses dataset. This dataset contains 506 entries. Each entry consists of a house price and 13 features for houses within the Boston area. We suggest working in Python and using the scikit-learn package to load the data.

Starter code written in python is provided for each question.

**1 Learning basics of regression in Python (3%)**

This question will take you step-by-step through performing basic linear regression on the Boston Houses dataset. You need to submit modular code for this question. Your code needs to be functional once downloaded. Non-functional code will result in losing all marks for this question. If your code is non-modular or otherwise difficult to understand, you risk losing a significant portion of the marks, even if the code is functional.

**Environment setup:** For this question you are strongly encouraged to use the following Python packages:

• sklearn

• matplotlib

• numpy

It is strongly recommended that you download and install Anaconda 3.4 to manage the above installations. This is a Data Science package that can be downloaded from https://www.anaconda.com/download/.

You will submit a complete regression analysis for the Boston Housing data. To do that, here are the necessary steps:

• Load the Boston housing data from the sklearn datasets module

• Describe and summarize the data in terms of number of data points, dimensions, target, etc.

• Visualization: present a single grid containing plots for each feature against the target. Choose the appropriate axis for dependent vs. independent variables. **Hint:** use the pyplot.tight_layout function to make your grid readable

• Divide your data into training and test sets, where the training set consists of 80% of the data points (chosen at random). **Hint:** You may find numpy.random.choice useful

• Write code to perform linear regression to predict the targets using the training data. Remember to add a bias term to your model.

• Tabulate each feature along with its associated weight and present them in a table. Explain what the sign of the weight means in the third column ('INDUS') of this table. Does the sign match what you expected? Why?

• Test the fitted model on your test set and calculate the Mean Square Error of the result.

• Suggest and calculate two more error measurement metrics; justify your choice.

• Feature Selection: Based on your results, what are the most significant features that best predict the price? Justify your answer.
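The steps above can be sketched end-to-end as follows. This is a minimal illustration, not a complete submission: it uses synthetic data of the same shape in place of the Boston set (recent scikit-learn versions have removed `load_boston`; with an older version you would load the real data there), and names such as `X_bias` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Boston data (506 points, 13 features); with an
# older scikit-learn you would instead call sklearn.datasets.load_boston().
N, D = 506, 13
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D + 1)               # includes a bias weight
X_bias = np.hstack([np.ones((N, 1)), X])      # prepend a column of ones (bias term)
y = X_bias @ true_w + 0.1 * rng.normal(size=N)

# 80/20 train/test split via numpy.random.choice, as the hint suggests
train_idx = rng.choice(N, size=int(0.8 * N), replace=False)
test_idx = np.setdiff1d(np.arange(N), train_idx)

# Least-squares fit on the training set
w, *_ = np.linalg.lstsq(X_bias[train_idx], y[train_idx], rcond=None)

# Mean squared error on the held-out test set
mse = np.mean((X_bias[test_idx] @ w - y[test_idx]) ** 2)
print(f"test MSE: {mse:.4f}")
```

The fitted weights `w` (one per feature, plus the bias) are what the tabulation step asks you to report.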

**2 Locally reweighted regression (6%)**

1. Given $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ and positive weights $a^{(1)}, \ldots, a^{(N)}$, show that the solution to the weighted least squares problem

$$w^* = \arg\min_w \frac{1}{2}\sum_{i=1}^{N} a^{(i)}\left(y^{(i)} - w^T x^{(i)}\right)^2 + \frac{\lambda}{2}\|w\|^2 \tag{1}$$

is given by the formula

$$w^* = \left(X^T A X + \lambda I\right)^{-1} X^T A y \tag{2}$$

where $X$ is the design matrix (defined in class) and $A$ is a diagonal matrix where $A_{ii} = a^{(i)}$.
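The closed-form solution $w^* = (X^T A X + \lambda I)^{-1} X^T A y$ above can be evaluated without forming an explicit inverse; a minimal sketch (the helper name `weighted_ridge` is illustrative, not part of the starter code):

```python
import numpy as np

def weighted_ridge(X, y, a, lam):
    """Solve (X^T A X + lam*I) w = X^T A y with a linear solve (no explicit inverse)."""
    A = np.diag(a)                                   # diagonal matrix with A_ii = a^(i)
    lhs = X.T @ A @ X + lam * np.eye(X.shape[1])
    rhs = X.T @ A @ y
    return np.linalg.solve(lhs, rhs)

# Sanity check: with unit weights and lam = 0 this reduces to ordinary least squares.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = weighted_ridge(X, y, a=np.ones(50), lam=0.0)
print(w)
```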

2. Locally reweighted least squares combines ideas from k-NN and linear regression. For each new test example $x$ we compute distance-based weights for each training example

$$a^{(i)} = \frac{\exp\left(-\|x - x^{(i)}\|^2 / 2\tau^2\right)}{\sum_j \exp\left(-\|x - x^{(j)}\|^2 / 2\tau^2\right)},$$

compute

$$w^* = \arg\min_w \frac{1}{2}\sum_{i=1}^{N} a^{(i)}\left(y^{(i)} - w^T x^{(i)}\right)^2 + \frac{\lambda}{2}\|w\|^2,$$

and predict $\hat{y} = x^T w^*$. Complete the implementation of locally reweighted least squares by providing the missing parts for q2.py.

Important things to notice while implementing: First, do not invert any matrix; use a linear solver (numpy.linalg.solve is one example). Second, notice that

$$\frac{\exp(A_i)}{\sum_j \exp(A_j)} = \frac{\exp(A_i - B)}{\sum_j \exp(A_j - B)},$$

but if we use $B = \max_j A_j$ it is much more numerically stable, as $\exp(A_i)/\sum_j \exp(A_j)$ overflows/underflows easily. This is handled automatically in the scipy package with the scipy.misc.logsumexp function.
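A small illustration of this stabilization trick (separate from q2.py); the helper name `stable_weights` is illustrative:

```python
import numpy as np
from scipy.special import logsumexp   # lived in scipy.misc.logsumexp in older SciPy

def stable_weights(A):
    """Compute exp(A_i) / sum_j exp(A_j) by shifting into log space,
    equivalent to subtracting B = max_j A_j before exponentiating."""
    return np.exp(A - logsumexp(A))

# Large negative exponents (e.g. -||x - x^(i)||^2 / 2 tau^2 for tiny tau)
# underflow when exponentiated naively, but are fine in log space.
A = np.array([-1000.0, -1001.0, -1002.0])
with np.errstate(invalid="ignore", under="ignore"):
    naive = np.exp(A) / np.exp(A).sum()   # 0/0 -> nan
stable = stable_weights(A)
print(naive)
print(stable, stable.sum())
```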

3. Use k-fold cross-validation to compute the average loss for different values of τ in the range [10,1000] when performing regression on the Boston Houses dataset. Plot these loss values for each choice of τ .
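The cross-validation loop might be organized along the following lines. The `predict(x_train, y_train, x_test, tau)` callable is a hypothetical stand-in for the LRLS predictor you complete in q2.py; the demo below plugs in a trivial constant predictor only so the sketch is runnable.

```python
import numpy as np

def kfold_losses(x, y, taus, predict, k=5, seed=0):
    """Average held-out squared error for each tau under k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    losses = np.zeros(len(taus))
    for j, tau in enumerate(taus):
        for f in range(k):
            test = folds[f]
            train = np.concatenate([folds[g] for g in range(k) if g != f])
            pred = predict(x[train], y[train], x[test], tau)
            losses[j] += np.mean((pred - y[test]) ** 2) / k
    return losses

# Demo with a constant-prediction stand-in for the real regressor
x = np.arange(100, dtype=float)
y = 2.0 * x + 1.0
const = lambda xtr, ytr, xte, tau: np.full(len(xte), ytr.mean())
losses = kfold_losses(x, y, taus=[10.0, 100.0, 1000.0], predict=const)
print(losses)
```

The returned array of losses, one per $\tau$, is what the question asks you to plot.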

4. How does this algorithm behave when τ → ∞? When τ → 0?

**3 Mini-batch SGD Gradient Estimator (6%)**

Consider a dataset $\mathcal{D}$ of size $n$ consisting of $(x, y)$ pairs. Consider also a model $M$ with parameters $\theta$ to be optimized with respect to a loss function

$$L(x, y, \theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\left(x^{(i)}, y^{(i)}, \theta\right).$$

We will aim to optimize L using mini-batches drawn randomly from D of size m. The indices of these points are contained in the set I = {i_{1} , . . . , i_{m} }, where each index is distinct and drawn uniformly without replacement from {1, . . . , n}. We define the loss function for a single mini-batch as,

$$L_I(x, y, \theta) = \frac{1}{m}\sum_{i \in I} \ell\left(x^{(i)}, y^{(i)}, \theta\right) \tag{3}$$

1. Given a set $\{a_1, \ldots, a_n\}$ and random mini-batches $I$ of size $m$, show that

$$\mathbb{E}_I\left[\frac{1}{m}\sum_{i \in I} a_i\right] = \frac{1}{n}\sum_{i=1}^{n} a_i$$
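This identity can be spot-checked numerically with a small Monte Carlo sketch; the sizes $n = 506$ and $m = 50$ mirror the assignment but are otherwise arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=506)           # the set {a_1, ..., a_n}
n, m, K = len(a), 50, 20000

# Average K mini-batch means, each over m distinct uniformly drawn indices;
# this should approach the full mean (1/n) * sum_i a_i as K grows.
est = np.mean([a[rng.choice(n, m, replace=False)].mean() for _ in range(K)])
print(est, a.mean())
```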

2. Show that E_{I} [∇L_{I} (x, y, θ)] = ∇L(x, y, θ)

3. Explain, in a sentence, the importance of this result.

4. (a) Write down the gradient, $\nabla L$ above, for a linear regression model with cost function $\ell(x, y, \theta) = (y - w^T x)^2$.

(b) Write code to compute this gradient.
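One possible sketch (not the required submission): for $\ell(x, y, \theta) = (y - w^T x)^2$ the full-batch gradient works out to $\nabla L = -\frac{2}{n} X^T (y - Xw)$, which can be validated against a finite-difference approximation:

```python
import numpy as np

def lin_reg_gradient(X, y, w):
    """Gradient of L = (1/n) * sum_i (y_i - w^T x_i)^2 with respect to w,
    i.e. grad = -(2/n) X^T (y - X w)."""
    n = X.shape[0]
    return (-2.0 / n) * X.T @ (y - X @ w)

# Finite-difference check of the first coordinate
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
w = rng.normal(size=4)
L = lambda w: np.mean((y - X @ w) ** 2)
eps = 1e-6
e0 = np.zeros(4); e0[0] = eps
fd = (L(w + e0) - L(w - e0)) / (2 * eps)
print(fd, lin_reg_gradient(X, y, w)[0])
```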

5. Using your code from the previous section, for $m = 50$ and $K = 500$ compute

$$\frac{1}{K}\sum_{k=1}^{K} \nabla L_{I_k}(x, y, \theta),$$

where $I_k$ is the mini-batch sampled for the $k$th time.

Randomly initialize the weight parameters for your model from a N (0, I ) distribution. Compare the value you have computed to the true gradient, ∇L, using both the squared distance metric and cosine similarity. Which is a more meaningful measure in this case and why?

[Note: Cosine similarity between two vectors $a$ and $b$ is given by $\cos(\theta) = \frac{a \cdot b}{\|a\|_2 \|b\|_2}$.]
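A small illustration of how the two metrics behave differently (helper names are illustrative): two vectors pointing the same way but with different magnitudes are far apart in squared distance yet have cosine similarity exactly 1, which is why cosine similarity isolates agreement in gradient direction.

```python
import numpy as np

def squared_dist(a, b):
    return float(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same direction, different magnitude: large squared distance, cosine of 1.
a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a
print(squared_dist(a, b), cosine_similarity(a, b))
```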

6. For a single parameter, $w_j$, compare the sample variance, $\tilde{\sigma}_j$, of the mini-batch gradient estimate for values of $m$ in the range [1,400] (using $K = 500$ again). Plot $\log \tilde{\sigma}_j$ against $\log m$.
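One way to organize this experiment, sketched on synthetic Boston-sized data with illustrative variable names (the real version would use your fitted model and data):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 506, 13                                   # Boston-sized synthetic data
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)
per_example = -2.0 * (y - X @ w)[:, None] * X    # gradient of each ell_i, shape (n, d)

K, j = 500, 0                                    # K resamples, parameter index j
ms = [1, 10, 50, 100, 400]
variances = []
for m in ms:
    samples = [per_example[rng.choice(n, m, replace=False)].mean(axis=0)[j]
               for _ in range(K)]
    variances.append(np.var(samples, ddof=1))
print(variances)
```

The resulting ($\log m$, $\log \tilde{\sigma}_j$) pairs can then be plotted with matplotlib.pyplot.loglog; the variance shrinks roughly like $1/m$, so the plot is close to a line of slope $-1$.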

**Submission**

Assignments must be submitted by 10:00 pm on the day they are due; a late penalty of 10% per day will be assessed thereafter (up to 3 days, after which submission is blocked). We highly recommend using Python 3.6 embedded in Anaconda 3.4 to solve the programming questions. However, you are welcome to use any programming language you like, provided that your code solves the problem and is functional and modular.

**What to submit**

• All work to be submitted via Markus (details to follow soon).

• All mathematical work has to be clearly marked with the question it belongs to. We highly recommend using a math editor such as MathType, Word Equations, or LaTeX. However, if you prefer to use paper and pen(cil), you are welcome to do so as long as your handwriting is readable. Unreadable work will receive no marks for that question.

• Provide a separate file containing your code for each question. Each file name has to be indicative of the question it belongs to.

• Your code must be clearly structured with an obvious entry point.

• All graphs and plots have to be clearly marked with the question they belong to. Use readable grids whenever applicable.