Description

Instructions:
You may form small groups (e.g. of up to four people) to work on this assignment, but you must write up all solutions by yourself. List your study partners for the homework on the first page, or "none" if you had no partners.
Keep all responses brief, a few sentences at most. Show all work for full credit.
Start each problem on a new page, and be sure to clearly label where each problem and subproblem begins. All problems must be submitted in order (all of P1 before P2, etc.).
No late homeworks will be accepted. This is not out of a desire to be harsh, but rather out of fairness to all students in this large course.

Overfitting
Overfitting is a common problem in data science work.

(a) How can you tell if a model you have trained is overfitting?

(b) Why would it be bad to deploy a model that has been overfitted to your training data?

(c) Briefly describe how overfitting can occur.

(d) In lecture and discussion we have covered multiple methods for dealing with overfitting. One such technique was regularization. Please describe two different regularization techniques and explain how they help to mitigate overfitting.

Overfitting Mitigations
For each of the strategies below, state whether or not it might help to mitigate overfitting and explain why.

(a) Using a smaller training dataset

(b) Restricting the maximum value any parameter can take on

(c) Training your neural network for longer (more iterations)

(d) Training a model with more parameters

(e) Randomly setting the outputs of 50% of the nodes in your neural network to zero

(f) Incorporating additional sparse features into your model

(g) Initializing your parameters randomly instead of to zero

(h) Training your model on a graphics processing unit or specialized accelerator chip instead of a CPU

K-Nearest Neighbors

(a) Why is it important to normalize your data when using the k-nearest neighbors algorithm (KNN)?

(b) When doing k-nearest neighbors, an odd value for k is typically used. Why is this?

(c) Say you have the dataset in Table 1, where x and y are features and L is the label. Normalize the features by scaling them so that all values in a column lie in the range [0, 1], so that they can be used with KNN.

(d) Use KNN with k = 3 to make predictions for the data points in Table 2.
(e) Use KNN with k = 1 to make predictions for the data points in Table 2.
(f) Use KNN with k = 5 to make predictions for the data points in Table 2.

(g) What is a potential issue with using a low k for KNN?

(h) What is a potential issue with selecting a k that is too high when doing KNN?
x    y    L
0.2  350  0
0.1  750  0
0.3  700  0
0.1  500  0
0.2  500  0
1.2  400  1
0.9  410  1
1.1  390  1
0.1  760  1
Table 1: Raw KNN data for training

x    y    L_{pred}
0.1  760
0.2  700
0.6  200
Table 2: Raw KNN data for evaluation

(i) Say you had a dataset and wanted to understand how well KNN with k = 3 performed on it. How could you quantify its performance? Assume you do not have access to any samples beyond those in your dataset.

(j) Say you had a dataset consisting of 200 samples. How might you select the optimal k to use for KNN?
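
The min-max normalization and nearest-neighbor voting steps used in this problem can be sketched as follows. This is a minimal illustration rather than a reference solution: the points are made-up values, not the rows of Table 1 or Table 2, the helper names `min_max_scale` and `knn_predict` are hypothetical, and Euclidean distance with a simple majority vote is an assumed (though common) choice. Scaling the query point with the training minimum and maximum is likewise an assumption.

```python
from collections import Counter
import math


def min_max_scale(rows, lo=None, hi=None):
    """Scale each column to [0, 1] using per-column minimum and maximum.

    If lo/hi are given (e.g. taken from the training set), they are reused,
    so scaled values may fall outside [0, 1] for unseen points.
    """
    cols = list(zip(*rows))
    if lo is None:
        lo = [min(c) for c in cols]
    if hi is None:
        hi = [max(c) for c in cols]
    scaled = [
        [(v - l) / (h - l) if h != l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]
    return scaled, lo, hi


def knn_predict(train_x, train_y, query, k):
    """Return the majority label among the k training points nearest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical data: two features on very different scales.
train_raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [9.0, 900.0], [8.0, 700.0]]
labels = [0, 0, 0, 1, 1]

train_scaled, lo, hi = min_max_scale(train_raw)
query_scaled, _, _ = min_max_scale([[7.0, 800.0]], lo, hi)

print(knn_predict(train_scaled, labels, query_scaled[0], k=3))
```

Without the scaling step, the second feature would dominate every distance simply because its raw values are about a hundred times larger.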

Principal Component Analysis
For each of the situations below, state whether or not PCA would work well, and briefly explain why.

(a) Data with a linear structure

(b) Data lying on a hyperbolic plane

(c) A dataset containing non-normalized features

(d) A dataset where each feature is statistically independent of all others

Artificial Neural Networks
Consider the following computation graph for a simple neural network for binary classification. Here x_{1} and x_{2} are input features and y is their associated class. The network has multiple parameters, including weights w and biases b, as well as a nonlinearity function g. The network will output a value y_{pred}, representing the probability that a sample belongs to class 1. We use a loss function Loss to help train our model. The network is initialized with the parameters in Table 3.
You will first train the model using some sample datapoints and then evaluate its performance.
For any questions that ask for performance metrics, generate them using the samples in Table 4.
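[Figure omitted: computation graph with inputs x_{1} and x_{2}, weights w_{0} through w_{5}, biases b_{0}, b_{1}, and b_{2}, three summation nodes each followed by the nonlinearity g, and the output y_{pred} fed together with y into Loss.]
Figure 1: Neural network architecture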

(a) Initialize the neural network with the parameters in Table 3 and then train it using the samples in Table 4. Use gradient descent (forward pass followed by backpropagation) to update the parameters.

Suppose the loss function is quadratic, Loss(y, y_{pred}) = (1/2)(y - y_{pred})^{2}, and g is the sigmoid function g(z) = 1 / (1 + e^{-z}) (note: it’s typically better to use a different type of loss, cross-entropy, for classification problems, but using a quadratic loss function keeps the math easier for this assignment).
Assume that your learning rate is 0.1.
Pass the samples through the network one at a time in order from top to bottom. Report the final values of all parameters. Show all work.
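
To make the update rule concrete, here is a minimal sketch of one gradient-descent step for a single sigmoid unit trained with the quadratic loss 0.5 * (y - y_pred)^2. It is not the network in Figure 1, the starting weights are hypothetical, and the function names `sigmoid` and `sgd_step` are illustrative assumptions. The chain-rule term dLoss/dz = (y_pred - y) * y_pred * (1 - y_pred) is the same pattern that backpropagation applies at each unit of a larger network.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(w, b, x, y, lr):
    """One gradient-descent update for y_pred = sigmoid(w . x + b)
    under the quadratic loss 0.5 * (y - y_pred) ** 2."""
    # Forward pass.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_pred = sigmoid(z)
    # Backward pass via the chain rule:
    #   dLoss/dy_pred = (y_pred - y)
    #   dy_pred/dz    = y_pred * (1 - y_pred)
    delta = (y_pred - y) * y_pred * (1.0 - y_pred)
    # dz/dw_i = x_i and dz/db = 1, so step against the gradient.
    new_w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
    new_b = b - lr * delta
    return new_w, new_b, y_pred


# Hypothetical starting point, unrelated to Table 3.
w, b = [0.0, 0.0], 0.0
w, b, y_pred = sgd_step(w, b, x=[1.0, 1.0], y=1.0, lr=0.1)
print(w, b, y_pred)
```

Processing samples one at a time means calling such a step once per sample, with each call starting from the parameters left by the previous one.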

(b) What is your model’s accuracy?
(c) What is your model’s precision?

(d) What is your model’s recall?
(e) What is your model’s F_{1} score?
CS 188, HW 3


Parameter  Initial value
b_{0}      1
b_{1}      6
b_{2}      3.93
w_{0}      3
w_{1}      4
w_{2}      6
w_{3}      5
w_{4}      2
w_{5}      4
Table 3: ANN initial state

x_{1}  x_{2}  y
0      0      0
1      1      1
0      1      1
Table 4: ANN training data

(f) Plot the ROC curve.

(g) What is your model’s AUC-ROC?
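
Accuracy, precision, recall, and F_{1} all derive from the four confusion-matrix counts. The sketch below shows the standard definitions on made-up labels and scores, not on the output of the network in this problem. The 0.5 decision threshold, the convention of returning 0.0 when a denominator is zero, and the helper name `classification_metrics` are assumptions.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Scores at or above `threshold` are treated as class 1 (an assumed rule).
    """
    y_hat = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against empty denominators; 0.0 is a convention, not a rule.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.8, 0.2, 0.1]
metrics = classification_metrics(y_true, y_score)
print(metrics)
```

Sweeping the threshold and recording the true positive rate and false positive rate at each value yields the points of a ROC curve.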

Reinforcement Learning
Consider the following instance of reinforcement learning:
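[Figure omitted: a grid with columns numbered 1 to 4 and rows numbered 1 to 3. The cell labeled Start is in row 1, a cell marked +1 is in row 3, and a cell marked -1 is in row 2.]
Figure 2: Reinforcement learning grid world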
Assume the following:

The agent lives in a grid

The agent starts in the bottom left

Walls (shaded grids) block the agent’s path

The agent’s actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West
  - 10% of the time, North takes the agent East
  - If there is a wall in the direction the agent would have taken, the agent stays put

= 0.8

Big rewards come at the end

Goal: maximize sum of rewards
Use Q-learning to find an optimal path in the grid. Show each step. State all your assumptions.
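
As a rough sketch of the two ingredients involved, the code below samples the noisy outcome of an intended action and applies the tabular Q-learning update Q(s, a) <- Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a)). The learning rate `alpha`, the discount `gamma`, their placeholder values, the assumption that every action slips to its two perpendicular directions the way North does, and all function names are illustrative choices rather than part of the problem statement. Walls, rewards, and the grid layout are deliberately left out.

```python
import random
from collections import defaultdict

ACTIONS = ["N", "E", "S", "W"]
# Perpendicular slips for each intended action.
# Assumed to mirror the rule stated for North.
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def sample_outcome(intended, rng):
    """Return the direction actually taken: 80% intended, 10% each side."""
    u = rng.random()
    if u < 0.8:
        return intended
    first, second = SLIPS[intended]
    return first if u < 0.9 else second


def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning update in place; return the new Q(s, a)."""
    # Unvisited (state, action) pairs default to 0.0, which also covers
    # terminal states that are never updated.
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)
# Hypothetical experience: from cell (1, 1), action N, reward 0, landing in (1, 2),
# where one action already has an estimated value of 1.0.
Q[((1, 2), "E")] = 1.0
new_value = q_update(Q, (1, 1), "N", 0.0, (1, 2), alpha=0.5, gamma=0.8)
print(new_value)
```

A full solution would wrap these in an episode loop that picks actions (for example epsilon-greedily), moves the agent subject to walls, and repeats until the values settle.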

(Money)ball So Hard
Watch the 2011 movie Moneyball (directed by Bennett Miller and written by Steven Zaillian and Aaron Sorkin).

(a) Write a detailed formulation of the problem discussed in the movie and outline, step by step, how you would go about solving the problem (data collection, data cleaning, ML, and all other steps).

(b) Discuss what can go wrong when using this method in practice and how it can be improved.