# Assignment 3: Linear Regression


## Introduction

In this assignment you will explore gradient descent and perform linear regression on a dataset using cross-validation to analyze your results.

As with all homeworks, you cannot use any functions that are against the "spirit" of the assignment. For this assignment, that means any linear regression functions. You may use statistical and linear algebra functions to do things like:

- mean, std, cov
- inverse
- matrix multiplication, transpose
- etc.

And as always your code should work on any dataset that has the same general form as the provided one.

Although all assignments will be weighted equally in computing your homework grade, below is the grading rubric we will use for this assignment:

| Part                       | Points |
|----------------------------|--------|
| Part 1 (Theory)            | 15     |
| Part 2 (Gradient Descent)  | 20     |
| Part 3 (Closed-form LR)    | 40     |
| Part 4 (S-Folds LR)        | 15     |
| Report                     | 10     |
| TOTAL                      | 100    |


## Datasets

**Fish Length Dataset (x06Simple.csv).** This dataset consists of 44 rows of data, each of the form:

1. Index

1. Age (days)

1. Temperature of Water (degrees Celsius)

1. Length of Fish

The first row of the data contains header information.

Data obtained from: http://people.sc.fsu.edu/~jburkardt/datasets/regression/regression.html
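A minimal loading sketch for a file of this shape (assuming it is comma-separated, as the `.csv` extension suggests, and sits in the working directory; the function name is my own):

```python
import numpy as np

def load_dataset(path="x06Simple.csv"):
    """Read the fish dataset: skip the header row and drop the index column.

    Returns an (N, 3) array of [age, temperature, length]; the last
    column is the regression target.
    """
    raw = np.genfromtxt(path, delimiter=",", skip_header=1)
    return raw[:, 1:]
```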


## Theory

1. (10pts) Consider the following data, where each row is one (x, y) observation (the matrix in the original PDF did not extract cleanly; any minus signs were lost):

   | x | y  |
   |---|----|
   | 2 | 1  |
   | 5 | 4  |
   | 3 | 1  |
   | 0 | 3  |
   | 8 | 11 |
   | 2 | 5  |
   | 1 | 0  |
   | 5 | 1  |
   | 1 | 3  |
   | 6 | 1  |

1. Compute the coefficients for the linear regression using the least squares estimate (LSE), where the second value (column) is the dependent variable (the value to be predicted) and the first column is the sole feature. Show your work, and remember to add a bias feature and to standardize the features. Compute this model using all of the data (don't worry about separating into training and testing sets).
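The mechanics of this computation (standardize, add bias, solve the normal equations) can be sketched as follows; the (x, y) pairs here are made up for illustration and are not the table above:

```python
import numpy as np

# Hypothetical data for illustration only -- substitute the pairs from the table.
x = np.array([1.0, 2.0, 3.0, 4.0])   # feature (first column)
y = np.array([2.0, 3.0, 5.0, 4.0])   # target (second column)

x_std = (x - x.mean()) / x.std(ddof=1)             # standardize the feature
X = np.column_stack([np.ones_like(x_std), x_std])  # add the bias feature
theta = np.linalg.solve(X.T @ X, X.T @ y)          # LSE: theta = (X^T X)^-1 X^T y
```

Because the standardized feature has zero mean, the bias column is orthogonal to it and the bias coefficient comes out as the mean of y.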

1. For the function g(x) = (x - 1)^4, where x is a single value (not a vector or matrix):

1. Plot x vs. g(x) using a software package of your choosing.
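The gradient descent section will need the gradient of this function; it follows directly from the chain rule (and shows the minimum is at x = 1, since a fourth power is nonnegative):

```latex
g(x) = (x-1)^4, \qquad g'(x) = 4(x-1)^3, \qquad g'(x) = 0 \iff x = 1
```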


## Gradient Descent

In this section we want to visualize the gradient descent process on the function g(x) = (x - 1)^4. You should have already derived (pun?) the gradient of this function in the theory section. To bootstrap the process, initialize x = 0 and terminate when the change in x from one iteration to the next is less than 2^-23.

### 2.1 Fixed Learning Rate

First, experiment with your choice of the learning parameter. From your theory work you should know what the actual minimum is. A common starting guess for the learning rate is 1.0.

In your report you will need:

1. Plot iteration vs g(x).

1. Plot iteration vs x.

1. The chosen value of the learning rate.
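The fixed-rate loop might look like the sketch below. Note that on this particular g, a fixed rate of 1.0 overshoots badly from x = 0 and diverges, so a smaller rate (0.01 here, an arbitrary choice) is used; the gradient 4(x - 1)^3 is the one derived in the theory part:

```python
def gradient_descent(eta=0.01, x=0.0, tol=2 ** -23):
    """Fixed-learning-rate descent on g(x) = (x - 1)**4.

    Returns the final x and the path of iterates, which supplies the
    iteration-vs-x and iteration-vs-g(x) plots.
    """
    path = [x]
    while True:
        x_new = x - eta * 4 * (x - 1) ** 3   # x <- x - eta * g'(x)
        path.append(x_new)
        if abs(x_new - x) < tol:             # stop once x barely moves
            return x_new, path
        x = x_new
```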

Next, let's try to "intelligently" adapt the learning rate. Start with a rate of 1.0 and reduce it by 1/2 whenever the sign of the gradient changes, since a sign change indicates that we may have over-jumped the minimum.

In your report you will need:

1. Plot iteration vs g(x).

1. Plot iteration vs x.
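The halving rule can be read a couple of ways; the sketch below rejects the overshooting step and retries from the same point with the halved rate (one interpretation, not the only one):

```python
def adaptive_descent(eta=1.0, x=0.0, tol=2 ** -23):
    """Descent on g(x) = (x - 1)**4, halving eta when the gradient flips sign."""
    grad = lambda x: 4 * (x - 1) ** 3
    g_prev = grad(x)
    while True:
        x_new = x - eta * g_prev
        if grad(x_new) * g_prev < 0:   # sign flip: we jumped over the minimum
            eta /= 2                   # halve the rate and retry from the same x
            continue
        if abs(x_new - x) < tol:       # stop once x barely moves
            return x_new
        x, g_prev = x_new, grad(x_new)
```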


## Closed-Form Linear Regression

Download the dataset x06Simple.csv from Blackboard. This dataset has header information in its first row, and all subsequent rows are in the format:

`RowID, X_{i,1}, X_{i,2}, Y_i`

Your code should work on any CSV dataset whose first row is header information and whose first column is some integer index, followed by D columns of real-valued features and ending with a target value.

Write a script that:

1. Reads in the data, ignoring the first row (header) and the first column (index).

1. Randomizes the data

1. Selects the first 2/3 (round up) of the data for training and the remaining for testing

1. Standardizes the data (except for the last column, of course) using the training data

1. Computes the closed-form solution of linear regression

1. Applies the solution to the testing samples

1. Computes the root mean squared error (RMSE), sqrt((1/N) * Σ_{i=1}^{N} (Ŷ_i - Y_i)^2), where Ŷ_i is the predicted value for observation X_i.

### Implementation Details

1. Seed the random number generator with zero prior to randomizing the data

1. Don’t forget to add in the bias feature!
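Putting the steps above together, a sketch of the whole pipeline (here `data` is the N×(D+1) array of features plus target, after the header row and index column have already been dropped; the function name and signature are my own):

```python
import numpy as np

def closed_form_lr(data, seed=0, train_frac=2 / 3):
    """Shuffle, split, standardize on training stats, fit closed-form LR.

    Returns (theta, rmse): the model coefficients (bias first) and the
    root mean squared error on the held-out test split.
    """
    rng = np.random.default_rng(seed)               # seed with zero
    data = data[rng.permutation(len(data))]         # randomize the rows
    n_train = int(np.ceil(train_frac * len(data)))  # first 2/3, rounded up
    train, test = data[:n_train], data[n_train:]
    Xtr, ytr = train[:, :-1], train[:, -1]
    Xte, yte = test[:, :-1], test[:, -1]
    mu, sd = Xtr.mean(axis=0), Xtr.std(axis=0, ddof=1)  # training stats only
    Xtr, Xte = (Xtr - mu) / sd, (Xte - mu) / sd         # standardize both splits
    Xtr = np.hstack([np.ones((len(Xtr), 1)), Xtr])      # add the bias feature
    Xte = np.hstack([np.ones((len(Xte), 1)), Xte])
    theta = np.linalg.solve(Xtr.T @ Xtr, Xtr.T @ ytr)   # normal equations
    rmse = np.sqrt(np.mean((Xte @ theta - yte) ** 2))
    return theta, rmse
```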

In your report you will need:

1. The final model in the form y = θ₀ + θ₁x₁ + …

1. The root mean squared error.

| RMSE       |
|------------|
| around 800 |

Table 2: Closed-Form Regression Evaluation


## S-Folds Cross-Validation

Cross-validation is a technique used to get reliable evaluation results when we don't have that much data (and it is therefore difficult to train and/or test a model reliably).

In this section you will do S-folds cross-validation for a few different values of S. For each run you will divide your data up into S parts (folds), test S different models using S-folds cross-validation, and evaluate via root mean squared error. In addition, to observe the effect of system variance, we will repeat these experiments several times (shuffling the data each time prior to creating the folds). We will again be doing our experiment on the provided fish dataset.

Write a script that:

1. Reads in the data, ignoring the first row (header) and the first column (index).

1. Does the following 20 times:

   1. Randomizes the data

   1. Creates S folds.

   1. For i = 1 to S:

      1. Select fold i as your testing data and the remaining (S - 1) folds as your training data

      1. Standardize the data (except for the last column, of course) based on the training data

      1. Train a closed-form linear regression model

      1. Compute the squared error for each sample in the current testing fold

   1. You should now have N squared errors. Compute the RMSE for these.

1. You should now have 20 RMSE values. Compute the mean and standard deviation of these. The former should give us a better "overall" mean, whereas the latter should give us a feel for the variance of the models that were created.

### Implementation Details

1. Don’t forget to add in the bias feature!

1. Set your seed value at the very beginning of your script (if you set it within the 20 tests, each test will have the same randomly shuffled data!).
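The repeated S-fold loop above can be sketched as follows (the closed-form fit is the same normal-equations solve as in the previous part; function name and signature are my own):

```python
import numpy as np

def s_fold_cv(data, S, repeats=20, seed=0):
    """Repeated S-fold CV for closed-form LR; returns (mean, std) of the RMSEs."""
    rng = np.random.default_rng(seed)    # seed once, before all repeats
    rmses = []
    for _ in range(repeats):
        shuffled = data[rng.permutation(len(data))]  # reshuffle before folding
        folds = np.array_split(shuffled, S)
        sq_errors = []
        for i in range(S):
            test = folds[i]                                   # fold i tests...
            train = np.vstack([folds[j] for j in range(S) if j != i])  # ...rest train
            Xtr, ytr = train[:, :-1], train[:, -1]
            Xte, yte = test[:, :-1], test[:, -1]
            mu, sd = Xtr.mean(axis=0), Xtr.std(axis=0, ddof=1)  # training-fold stats
            Xtr, Xte = (Xtr - mu) / sd, (Xte - mu) / sd
            Xtr = np.hstack([np.ones((len(Xtr), 1)), Xtr])      # bias feature
            Xte = np.hstack([np.ones((len(Xte), 1)), Xte])
            theta = np.linalg.solve(Xtr.T @ Xtr, Xtr.T @ ytr)
            sq_errors.extend((Xte @ theta - yte) ** 2)
        rmses.append(np.sqrt(np.mean(sq_errors)))  # one RMSE over all N errors
    return np.mean(rmses), np.std(rmses)
```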

In your report you will need:

1. The average and standard deviation of the root mean squared error for S = 3 over the 20 different seed values.

1. The average and standard deviation of the root mean squared error for S = 5 over the 20 different seed values.

1. The average and standard deviation of the root mean squared error for S = 20 over the 20 different seed values.

1. The average and standard deviation of the root mean squared error for S = N (where N is the number of samples) over the 20 different seed values. This is basically leave-one-out cross-validation.

| S  | Average RMSE | Std of RMSE |
|----|--------------|-------------|
| 3  | 650          | 45          |
| 5  | 650          | 35          |
| 20 | 620          | 10          |
| N  | 620          | 0           |

Table 3: Evaluation Using S-Fold Cross-Validation


## Submission

For your submission, upload to Blackboard a single zip file, with no spaces in the file or directory names, that contains:

1. PDF Writeup

1. Source Code

The readme.txt file should contain information on how to run your code to reproduce the results for each part of the assignment.

The PDF document should contain the following:

1. Part 1:

   1. Your solutions to the theory questions

1. Part 2:

   1. Your two figures using the learning-rate value of your choosing, as well as that value.

1. Part 3:

   1. Final model

   1. RMSE

1. Part 4:

   1. Averages and standard deviations of the RMSEs for the different cross-validations.

