## Description

Q0 (0 pts for a correct answer, -1,000 pts for an incorrect answer): A correct answer to the following questions is worth 0 pts. An incorrect answer is worth -1,000 pts, which carries over to other homeworks and exams and can result in an F grade in the course.

- Student interaction with other students / individuals:

- I have copied part of my homework from another student or another person (plagiarism).

- Yes, I discussed the homework with another person but came up with my own answers. Their name(s) is (are)

- No, I did not discuss the homework with anyone

- On using online resources:

- I have copied one of my answers directly from a website (plagiarism).

- I have used online resources to help me answer this question, but I came up with my own answers (you are allowed to use online resources as long as the answer is your own). Here is a list of the websites I have used in this homework:

- I have not used any online resources except the ones provided in the course website.


- Homework 3: Stochastic Gradient Descent and Adaptive Learning Rates

Learning Objectives: Understand Stochastic Gradient Descent (SGD) and optimization algorithms with adaptive learning rates.

Learning Outcomes: After you finish this homework, you should be capable of explaining and implementing minibatch SGD from scratch, and of implementing algorithms for adaptive learning rates.

1.1 Theoretical Part

Please answer the following questions concisely. All the answers, along with your name and email, should be clearly typed in some editing software, such as LaTeX or MS Word.

- Minibatch Stochastic Gradient Descent (SGD) has been shown to give competitive results using deep neural networks in many tasks. Compared to regular Gradient Descent (GD), the gradients computed by minibatch SGD are noisy. Can you use the Robbins-Monro stochastic approximation algorithm to explain why the noisy gradients need to be unbiased (i.e., on average, the noisy gradient is the true gradient)? What would happen if the gradients had a bias?

- In Robbins-Monro stochastic approximation, the adaptive learning rate $\eta_t$ must satisfy two conditions: (a) $\sum_{t=1}^{\infty} \eta_t = \infty$ and (b) $\sum_{t=1}^{\infty} \eta_t^2 < \infty$. Explain what would happen with the approximation if condition (a) was not satisfied. Now explain what would happen with the approximation if condition (b) was not satisfied.
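As a concrete illustration (not part of the required answer), the classic schedule $\eta_t = 1/t$ satisfies both conditions: the harmonic series diverges while the sum of squares converges to $\pi^2/6$. A quick numerical sanity check:

```python
import math

def partial_sums(T):
    """Partial sums of eta_t = 1/t and eta_t^2 = 1/t^2 up to step T."""
    s1 = sum(1.0 / t for t in range(1, T + 1))       # sum of eta_t: grows without bound
    s2 = sum(1.0 / t ** 2 for t in range(1, T + 1))  # sum of eta_t^2: approaches pi^2/6
    return s1, s2

s1_small, s2_small = partial_sums(1_000)
s1_large, s2_large = partial_sums(100_000)
# s1 keeps growing with T, while s2 stays bounded by pi^2/6 ~ 1.645
```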

- Give an intuitive definition of loss "flatness" in the parameter space that helps explain why the local minima found by SGD tend to generalize well on the test data.

- Early stopping uses the validation data to decide which parameters we should keep during our SGD optimization. Explain why models obtained by early stopping tend to generalize better than models obtained by running a fixed number of epochs. Also explain why early stopping should never use the training data.

- Give a mathematical reason why CNNs are sensitive to image rotations. Describe the role of max-pooling in a CNN and why it should be applied over small image patches.

- Explain why min-pooling is never applied to CNNs with ReLU activations.


1.2 Programming Part

In this part, you are going to implement multiple (1) gradient descent variants and (2) regularization methods. This HW is built on top of HW2. In case you did not work out HW2, we provide the skeleton code, which is essentially the solution of HW2, so that you can start from a solid base. If you are confident in your own HW2 implementation, you are welcome to use it, but you need to make sure that it works well with the new executable provided. The rule of thumb is that you can make any changes inside the my_neural_networks package, but you need to keep the main executable, hw3_minibatch.py, untouched.

Skeleton Package: A skeleton package is available on Piazza with the execution script hw3_minibatch.py. You should be able to download it and use the folder structure provided, as in HW2.

1.2.1 HW Overview

You are going to add a few new components to the previous HW2 package. optimizers.py is a module that contains four learning algorithms (optimizers) that you will explore. Those optimizers should be called in networks.py for updating network parameters. Moreover, in networks.py, you will add L2 regularization. So spend most of your effort on improving these two modules.

Again, here is the folder structure that you should use:


```
[your_purdue_login]_hw3/
    my_neural_networks/
        __init__.py
        optimizers.py
        networks.py
        CNN_networks.py
        activations.py
        mnist.py
        minibatcher.py
        shuffled_labels.py
        any_others.py
    report.pdf
    ReadMe
    hw3_minibatch.py
```

[your_purdue_login]_hw3: the top-level folder that contains all the files required in this homework. You should replace [your_purdue_login] with your Purdue login and follow the naming convention.

report.pdf: Your written solutions to all the homework questions, including the theoretical and programming parts. It should be submitted in PDF format.

ReadMe: Your ReadMe should begin with a couple of example commands, e.g., "python hw3.py data", used to generate the outputs you report. The TA will replicate your results with the commands provided here. More detailed options, usage notes, and design descriptions of your program can follow. You can also list any concerns that you think the TA should know about while running your program. Put the information that you think is most important at the top. Moreover, the file should be written in plain text format that can be displayed with the Linux "less" command.

hw3_minibatch.py: The main executable to run training with minibatch SGD.

my_neural_networks: Your Python neural network package. The package name in this homework is my_neural_networks, which should NOT be changed when submitting. At a minimum, it should include the two modules that you are going to develop:

- optimizers.py
- networks.py


The details will be provided in the task descriptions. All other modules are there for your convenience. It is not required to use them, but exploring them will be good practice in re-using code. Again, you are welcome to architect the package in your own favorite way, for instance by adding another module, called utils.py, to facilitate your implementation.

We also recommend that you explore minibatcher.py, because you could use it to create minibatches.

You are also welcome to do it completely by yourself.
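If you do it yourself, the core idea can be sketched as follows (a minimal illustration; `make_minibatches` is a hypothetical helper, not the skeleton's actual API):

```python
import numpy as np

def make_minibatches(X, y, batch_size, rng):
    """Shuffle example indices once, then yield consecutive slices as batches."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batches = list(make_minibatches(X, y, batch_size=3, rng=rng))
# 10 examples with batch size 3 -> 4 batches; the last batch holds 1 example
```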

1.2.2 Data: MNIST

You are going to conduct a simple classification task on MNIST (http://yann.lecun.com/exdb/mnist/): classifying images of hand-written digits (0-9). Each example is a $28 \times 28$ image.

The full dataset contains 60k training examples and 10k testing examples.

We provide a data loader (read_images(.) and read_labels(.) in my_neural_networks/mnist.py) that will automatically download the data.

1.2.3 Gradient Descent Variants

Gradient Descent does its job decently when the dataset is small. However, that is not the case we see in Deep Learning. Multiple variants of the learning algorithm have been proposed to make training with huge amounts of data feasible. We've learned several in class. Now it's time to test your understanding of them.

Task 1a: Minibatch Stochastic Gradient Descent (SGD)

This is a warm-up task. You are going to adapt your HW2 from full-batch GD to minibatch SGD. Most of the code is given. Only the minibatch training requires some of your effort.

In Minibatch SGD, the gradients come from minibatches:

$$L(W) = \frac{1}{N_b} \sum_{i=1}^{N_b} L_i(x_i; y_i; W) \qquad (1)$$

$$\nabla_W L(W) = \frac{1}{N_b} \sum_{i=1}^{N_b} \nabla_W L_i(x_i; y_i; W), \qquad (2)$$

where $N_b$ is the number of examples in one batch. It is similar to the full-batch case, but now you only feed a subset of the training data and update the weights according to the gradients calculated from this subset.
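To make Equations (1)-(2) concrete, here is a minimal sketch of repeated SGD steps on a linear least-squares loss (an illustration only; your network's loss, parameters, and update code will differ):

```python
import numpy as np

def sgd_step(W, X_batch, y_batch, lr):
    """One SGD step for L(W) = (1/N_b) * sum_i (x_i . W - y_i)^2."""
    N_b = len(X_batch)
    residual = X_batch @ W - y_batch
    grad = (2.0 / N_b) * (X_batch.T @ residual)  # average gradient over the batch
    return W - lr * grad

# toy data with exactly recoverable linear targets
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 3))
true_W = np.array([1.0, -2.0, 0.5])
y = X @ true_W
W = np.zeros(3)
for _ in range(500):
    W = sgd_step(W, X, y, lr=0.05)
```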

Related Modules:


hw3_minibatch.py

my_neural_networks/optimizers.py

my_neural_networks/networks.py

Action Items:

- Go through all the related modules. Specifically, you should understand networks.BasicNeuralNetwork well. Note that the "train_one_epoch" method is renamed "train_one_batch" to accommodate our purpose, but it does the same thing: consider all the input examples and update the weights.

- You should notice that we moved the weight updates to optimizers.py. The optimizers.SGDOptimizer is fully implemented. You should look into it and understand its interactions with the network. Inside optimizers.py you will find a basic skeleton for your implementation. You are required to fill in the missing parts of the code, denoted with ⟨fill in⟩.

- Now, move to hw3_minibatch.py. This file basically has the same structure as the executable in HW2. We only made minor changes to adapt to the library.

- In this file, implement the training part using minibatch SGD (you should see the comments that ask you to fill in your code).

- There is a "--minibatch_size" command-line option that specifies the minibatch size. You should use networks.BasicNeuralNetwork.train_one_batch(.) for training one batch.

- Once you finish the implementation, run your code with the command "python hw3_minibatch.py -n 20000 -m 300 -v data" (assuming your corpus is in a folder called "data"), which specifies a minibatch size of 300; by default, the maximum number of epochs is 100.

- Report your results by filling in Table 1 with losses, training accuracies, and testing accuracies (you should have 3 tables). Use the batch sizes and learning rates given and report the results from the final epoch. Describe your observations in the report.

- hw3_minibatch.py can output two plots, "Loss vs. Epochs" and "Accuracies vs. Epochs," if you collect the results correctly in the code. Make use of those plots for debugging and analyzing the models.

- Running the program without specifying the batch size (remove the "-m 300" option) gives you regular Gradient Descent (GD). Compare the results of using regular GD and minibatch SGD, and state your observations in the report.


| Batch Size \ Learning Rate | 1e-3 | 1e-4 | 1e-5 |
|---------------------------:|------|------|------|
| 100                        |      |      |      |
| 500                        |      |      |      |
| 3000                       |      |      |      |
| 5000                       |      |      |      |

Table 1: The results to be reported.

Task 1b: Adaptive Learning Rate Algorithms

You should have learned the adaptive learning rate algorithms in the lecture. In addition to SGD, here we are going to implement several more: Momentum, Nesterov Momentum, and Adam.

Note that you can still report the results with full-batch training if you skipped the minibatch SGD implementation in Task 1a.

Since each algorithm may have minor variants, we define here the exact version we want you to implement:

SGD (you already have it; just for reference): for each parameter $x_t$ at iteration $t$, you update it with

$$x_{t+1} = x_t - \eta \nabla f(x_t), \qquad (3)$$

where $\eta$ is the learning rate.

Momentum: for each parameter $x_t$ at iteration $t$, and the corresponding velocity $v_t$, we have

$$v_{t+1} = \alpha v_t + \eta \nabla f(x_t) \qquad (4)$$

$$x_{t+1} = x_t - v_{t+1}, \qquad (5)$$

where $\eta$ is the learning rate and $\alpha$ is the hyperparameter for the momentum rate. A common default option for $\alpha$ is 0.9.
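A minimal NumPy sketch of this exact update (the class name and `step` interface are illustrative; your skeleton's optimizer API may differ):

```python
import numpy as np

class MomentumOptimizer:
    """Implements v <- alpha*v + eta*grad and x <- x - v, as in Eqs. (4)-(5)."""
    def __init__(self, lr=0.05, alpha=0.9):
        self.lr = lr
        self.alpha = alpha
        self.v = None
    def step(self, x, grad):
        if self.v is None:
            self.v = np.zeros_like(x)
        self.v = self.alpha * self.v + self.lr * grad
        return x - self.v

# demo: minimize f(x) = x^2, whose gradient is 2x
opt = MomentumOptimizer(lr=0.05, alpha=0.9)
x = np.array([5.0])
for _ in range(500):
    x = opt.step(x, 2.0 * x)
```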

Nesterov Momentum: for each parameter $x_t$ at iteration $t$, and the corresponding velocity $v_t$, we have

$$v_{t+1} = \alpha v_t - \eta \nabla f(x_t + \alpha v_t) \qquad (6)$$

$$x_{t+1} = x_t + v_{t+1}, \qquad (7)$$

where $\eta$ is the learning rate and $\alpha$ is the hyperparameter for the momentum rate. Here $x_t + \alpha v_t$ is the look-ahead point: first "jump" along the current velocity toward where we are about to move, then evaluate the gradient there as a "correction" (this look-ahead weight is the one sent for backprop). A common default option for $\alpha$ is 0.9.
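A sketch of the look-ahead update (interface illustrative; note that `step` takes a gradient *function*, because the gradient is evaluated at the look-ahead point rather than at the current parameters):

```python
import numpy as np

class NesterovOptimizer:
    """Implements v <- alpha*v - eta*grad_f(x + alpha*v), x <- x + v (Eqs. (6)-(7))."""
    def __init__(self, lr=0.05, alpha=0.9):
        self.lr = lr
        self.alpha = alpha
        self.v = None
    def step(self, x, grad_fn):
        if self.v is None:
            self.v = np.zeros_like(x)
        lookahead = x + self.alpha * self.v  # "jump" along the current velocity
        self.v = self.alpha * self.v - self.lr * grad_fn(lookahead)  # "correction"
        return x + self.v

# demo: minimize f(x) = x^2, whose gradient is 2x
opt = NesterovOptimizer(lr=0.05, alpha=0.9)
x = np.array([5.0])
for _ in range(500):
    x = opt.step(x, lambda z: 2.0 * z)
```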


Adam: for each parameter $x_t$ at iteration $t$, we have

$$m_1 \leftarrow \beta_1 m_1 + (1 - \beta_1) \nabla f(x_t) \qquad (8)$$

$$m_2 \leftarrow \beta_2 m_2 + (1 - \beta_2) (\nabla f(x_t))^2 \qquad (9)$$

$$u_1 = \frac{m_1}{1 - \beta_1^t} \qquad (10)$$

$$u_2 = \frac{m_2}{1 - \beta_2^t} \qquad (11)$$

$$x_{t+1} = x_t - \eta \frac{u_1}{\sqrt{u_2} + \epsilon}, \qquad (12)$$

where $\eta$ is the learning rate, $m_1$ and $m_2$ are the first and second moments, $u_1$ and $u_2$ are the first and second moments' bias corrections, and $\beta_1$, $\beta_2$, and $\epsilon$ are hyperparameters. A set of common choices is $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-5}$, and $\eta = 10^{-4}$. We initialize $m_1 = m_2 = 0$.
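A minimal sketch of Eqs. (8)-(12) (class interface illustrative; the demo uses lr=0.1 instead of the homework default 1e-4 so the toy problem converges quickly):

```python
import numpy as np

class AdamOptimizer:
    """Adam with bias-corrected first/second moments, as in Eqs. (8)-(12)."""
    def __init__(self, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-5):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m1 = self.m2 = 0.0
        self.t = 0
    def step(self, x, grad):
        self.t += 1
        self.m1 = self.beta1 * self.m1 + (1 - self.beta1) * grad
        self.m2 = self.beta2 * self.m2 + (1 - self.beta2) * grad ** 2
        u1 = self.m1 / (1 - self.beta1 ** self.t)  # bias-corrected first moment
        u2 = self.m2 / (1 - self.beta2 ** self.t)  # bias-corrected second moment
        return x - self.lr * u1 / (np.sqrt(u2) + self.eps)

# demo: minimize f(x) = x^2, whose gradient is 2x
opt = AdamOptimizer(lr=0.1)
x = np.array([3.0])
for _ in range(1000):
    x = opt.step(x, 2.0 * x)
```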

Related Modules:

my_neural_networks/optimizers.py

Action Items:

- The corresponding class prototypes for the three algorithms are given in the skeleton. You should finish each of them.

- There is an "--optimizer" command-line option in hw3_minibatch.py, so you can test your optimizers individually with the following commands:

"python hw3_minibatch.py -n 20000 -m 300 --optimizer sgd data"

"python hw3_minibatch.py -n 20000 -m 300 --optimizer momentum data"

"python hw3_minibatch.py -n 20000 -m 300 --optimizer nesterov data"

"python hw3_minibatch.py -n 20000 -m 300 --optimizer adam data"

- Plot "Loss vs. Epochs" and "Accuracies vs. Epochs" for each algorithm (including SGD) and attach them to the report. You should have 8 plots in total.

- Describe your observations from the plots in the report. Make algorithm comparisons.

1.2.4 Regularization

You will be implementing two common regularization methods in neural networks and analyzing their differences.

Note that you can still report the results with full-batch training if you skipped the minibatch SGD implementation in Task 1a.


Task 2a: L2-Regularization

The first method is L2-Regularization. It is a general method that works not only for neural networks but also for many other parameterized learning models. It has a hyperparameter that determines the weight of the L2 term, i.e., it is the coefficient of the L2 term; let's call it $\lambda$. In this task, you should implement L2-regularization and conduct hyperparameter tuning for $\lambda$. Think about what should be added to the gradients when you have an L2-Regularization term in the objective function.

Here we give the mathematical definition of the L2-regularized objective function that you should use:

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i(x_i; y_i; W) + R(W) \qquad (13)$$

$$R(W) = \lambda \sum_k \sum_j W_{k,j}^2, \qquad (14)$$

where $\lambda$ is a hyperparameter that determines the weight of the regularization part. You should think about how to change the gradients according to the added regularization term.
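Concretely, the regularizer in Eq. (14) adds $\lambda \sum_{k,j} W_{k,j}^2$ to the loss and $2\lambda W$ to the gradient of each weight matrix. A sketch with a finite-difference sanity check (helper names are illustrative; your network stores losses and gradients differently):

```python
import numpy as np

def l2_regularized_loss(data_loss, W, lam):
    """L = data_loss + lambda * sum_{k,j} W_{k,j}^2."""
    return data_loss + lam * np.sum(W ** 2)

def l2_regularized_grad(grad_data, W, lam):
    """The derivative of the L2 term is 2*lambda*W, added to the data gradient."""
    return grad_data + 2.0 * lam * W

# check the analytic gradient of the L2 term against a central finite difference
W = np.array([1.0, -2.0])
lam, h = 0.1, 1e-6
numeric = np.array([
    (lam * np.sum((W + h * e) ** 2) - lam * np.sum((W - h * e) ** 2)) / (2 * h)
    for e in np.eye(2)
])
analytic = l2_regularized_grad(np.zeros(2), W, lam)  # data gradient set to zero
```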

Related Modules:

my_neural_networks/networks.py

Action Items:

- There is a command-line option "--l2_lambda", which specifies the coefficient of the L2 term. By default, it is zero, which means no regularization.

- Add L2-Regularization to your BasicNeuralNetwork with respect to the command-line option. You need to make changes in all the methods to support L2-regularization, especially "train_one_batch(.)" and "loss(.)".

- Test your network with the following commands:

"python hw3_minibatch.py -n 20000 -m 300 --l2_lambda 1 data"

"python hw3_minibatch.py -n 20000 -m 300 --l2_lambda 0.1 data"

"python hw3_minibatch.py -n 20000 -m 300 --l2_lambda 0.01 data"

- Plot “Loss vs. Epochs” and “Accuracies vs. Epochs” for each lambda value, and include them in the report. You should have 6 plots.

- Describe your observations from the plots in the report. Make comparisons.


Task 3: Implement a CNN

In this task we will implement a CNN and its training task to better understand its inner workings.

Related Modules:

hw3_minibatch.py

my_neural_networks/optimizers.py

(to create) my_neural_networks/CNN_networks.py

(to create) my_neural_networks/shuffled_labels.py

Action Items:

- (in the code) Follow the PyTorch implementation provided at the end of the lecture (https://www.cs.purdue.edu/homes/ribeirob/courses/Spring2020/lectures/05/optimization.html) to create CNN_networks.py, using the default parameters given in the lecture.

Optimize using standard SGD. Modify hw3_minibatch.py so it will run the CNN code with the parameter --CNN.

(in the PDF) Describe the neural network architecture and its hyperparameters: layers, types of layers, number of neurons in each layer, the activation functions used, and how the layers connect.

- For a $k \times k$ filter, the CNN considers $k \times k$ image patches. These image patches overlap according to the stride, which is by how much each block is separated horizontally and vertically. If $k$ is not a multiple of the image height or width, we will need padding (increase the image height (width) by adding rows (columns) of zeros). Modify these filters to (a) $3 \times 3$ with stride 3, and (b) $14 \times 14$ with stride 1. In the PDF, show the test accuracy of the classifier over the training and test data for items (a) and (b) (you do not need to include your code). Discuss your findings. Specifically, what could be the issue of having (a) $3 \times 3$ filters with stride 3 and (b) $14 \times 14$ filters?

Hint: Zero-pad as necessary to keep the input and output of the convolutional layers of the same dimensions.
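For reference, with input size $n$, filter size $k$, stride $s$, and padding $p$, the output spatial size of a convolution is $\lfloor (n + 2p - k)/s \rfloor + 1$. A quick check for the $28 \times 28$ MNIST input:

```python
def conv_out_size(n, k, s, p=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# (a) 3x3 filter, stride 3, no padding: floor((28 - 3) / 3) + 1 = 9
# (b) 14x14 filter, stride 1, no padding: (28 - 14) + 1 = 15
# keeping a 3x3 filter's output at 28 (stride 1) requires padding p = 1
```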

- Deep neural networks' generalization performance is not related to the network's inability to overfit the data. Rather, it is related to the solutions found by SGD. In this part of the homework we will be testing that hypothesis. Please use 100 epochs in the following questions, and implement this part of the code in a separate file, shuffled_labels.py.

- Using the provided code, show a plot (in the PDF) with two curves: the training accuracy and validation accuracy, with the x-axis as the number of epochs.

- Now consider an alternative task: randomly shuffle the target labels, so that the image (hand-written digit) and its label are unrelated. Show a plot with two curves, the training accuracy and validation accuracy, with the x-axis as the number of epochs. Modify hw3_minibatch.py so it will run this shuffled-label CNN experiment with the parameter --shuffleCNN. What can you conclude about the ability of this neural network to overfit the data? Would you say that the inductive bias of a CNN can naturally "understand images" and will try to classify all hand-written "0"s as the same digit?
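The label shuffling itself can be a one-liner; a sketch (the function name is illustrative, not part of the skeleton):

```python
import numpy as np

def shuffle_labels(y, seed=0):
    """Return a permuted copy of y: label counts are preserved,
    but the image-label association is destroyed."""
    rng = np.random.default_rng(seed)
    return rng.permutation(y)

y = np.arange(100) % 10          # 100 examples with 10 balanced classes
y_shuffled = shuffle_labels(y)
```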


- Using the approach seen in class to interpolate between the initial (random) and final parameters (https://www.cs.purdue.edu/homes/ribeirob/courses/Spring2020/lectures/05/optimization.html), show the "flatness" plot of the original task and of the task with shuffled labels over the training data. Can you conclude anything from these plots about the possible generalization performance of the two models?
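The interpolation itself is simple: evaluate the loss at $\theta(\alpha) = (1-\alpha)\,\theta_{\mathrm{init}} + \alpha\,\theta_{\mathrm{final}}$ for $\alpha \in [0, 1]$. A sketch with a toy quadratic loss (in the homework, `loss_fn` would load the interpolated weights into the CNN and evaluate the training loss):

```python
import numpy as np

def loss_along_path(loss_fn, theta_init, theta_final, num_points=11):
    """Loss evaluated along the straight line between initial and final weights."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1 - a) * theta_init + a * theta_final)
                       for a in alphas])
    return alphas, losses

# toy example: a quadratic loss minimized exactly at theta_final
target = np.array([1.0, 2.0])
loss_fn = lambda th: float(np.sum((th - target) ** 2))
alphas, losses = loss_along_path(loss_fn, np.zeros(2), target)
```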

Submission Instructions

Please read the instructions carefully. Failure to follow any part may incur score deductions.

Naming convention: [firstname]_[lastname]_hw3

All your submission files, including a report, a ReadMe, and code, should be included in one folder. The folder should be named with the above naming convention. For example, if my first name is "Bruno" and my last name is "Ribeiro", then for Homework 3, my folder name should be "bruno_ribeiro_hw3".

Tar your folder: [firstname]_[lastname]_hw3.tar.gz

Remove any unnecessary files in your folder, such as training datasets. Make sure your folder is structured as the tree shown in the Overview section. Compress your folder with the command: tar czvf bruno_ribeiro_hw3.tar.gz bruno_ribeiro_hw3

Submit: TURNIN INSTRUCTIONS

Please submit your compressed file on data.cs.purdue.edu with the turnin command line, e.g., "turnin -c cs690dl -p hw3 bruno_ribeiro_hw3.tar.gz". Please make sure you didn't use any library/source explicitly forbidden. If such a library/source code is used, you will get 0 pts for the coding part of the assignment. If your code doesn't run on scholar.rcac.purdue.edu, then even if it compiles on another computer, your code will still be considered not running and the respective part of the assignment will receive 0 pts.

1.3 Grading Breakdown

Theoretical: 30%

Programming: 70%
