Description

The following languages are acceptable: Java, C/C++, Python, Matlab

You can work in a team of up to 3 people. Each team will only need to submit one copy of the source code and report. You need to explicitly state each member’s contribution in percentages (a rough estimate).

Your source code and report will be submitted through TEACH

You need to submit a readme file that contains the programming language version you use (e.g. python 2.7) and the command to run your code (e.g. python main.py).

Please make sure that your code can be run remotely on the server (i.e. babylon01), especially if you developed your code using C/C++ under Visual Studio.

Be sure to answer all the questions in your report. You will be graded based on your code as well as the report. In particular, the clarity and quality of the report will be worth 10 pts, so please write your report in a clear and concise manner. Clearly label your figures, legends, and tables.

In your report, the results should always be accompanied by discussions of the results. Do the results follow your expectation? Any surprises? What kind of explanation can you provide?
Perceptron algorithm for Optical Character Recognition (total points: 80 pts + 10 report pts + 10 result pts)
Task description. In Optical Character Recognition (OCR) we seek to predict a number (between 0 and 9) for a given image of a handwritten digit. In this assignment we simplify OCR into a binary classification task. Specifically, we consider predictions for only the two numbers 3 and 5. The goal in this assignment is to develop variations of the perceptron algorithm to classify handwritten digits of the numbers 3 and 5.
Data. All the handwritten digits are originally taken from http://www.kaggle.com/c/digitrecognizer/data. The original dataset contains sample digits suitable for OCR. We extract samples with only labels 3 and 5. Following a little preprocessing, we produce three datasets for this assignment as follows:


Train Set (pa2_train.csv): Includes 4888 rows (samples). Each sample is in fact a list of 785 values. The first number is the digit's label, which is 3 or 5. The other 784 floating-point values are the flattened grayscale values from a 2d handwritten digit image with shape 28 × 28.



Validation Set (pa2_valid.csv): Includes 1629 rows. Each row obeys the same format given for the train set. This set will be used to select your best trained model.



Test Set (pa2_test.csv): Includes 1629 rows. Each row contains only 784 numbers. The label column is omitted from each row.

Important Guidelines. For all parts of this assignment:

Please assign label +1 to the number 3 and label −1 to the number 5. In your produced predictions, please use only +1 and −1 as labels, not 3 and 5.

Please do not shuffle the given data. While in practice shuffling should be used to improve training convergence, for this assignment we ask you not to shuffle the data to ensure deterministic results for assessment purposes.

To simplify the notation in this assignment, your load function, which loads the train, validation, and test datasets, should add a bias feature to the dataset. The bias feature is a feature with a value of 1.0 for all of the samples. Therefore the feature size for each sample will become 785.
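As an illustration, such a load function might be sketched in Python/NumPy as follows. The function name `load_data` and the assumption that the CSV files contain no header row are ours, not prescribed by the assignment:

```python
import numpy as np

def load_data(path, has_label=True):
    """Load a pa2 CSV file, map labels {3 -> +1, 5 -> -1},
    and append the bias feature of 1.0 to every sample."""
    data = np.loadtxt(path, delimiter=",")
    if data.ndim == 1:                             # guard against a single-row file
        data = data.reshape(1, -1)
    if has_label:
        labels = np.where(data[:, 0] == 3, 1, -1)  # +1 for digit 3, -1 for digit 5
        features = data[:, 1:]
    else:
        labels = None                              # test set has no label column
        features = data
    bias = np.ones((features.shape[0], 1))         # bias feature: 1.0 for all samples
    return np.hstack([features, bias]), labels     # feature size becomes 785
```

The same function can then load pa2_train.csv, pa2_valid.csv, and (with `has_label=False`) pa2_test.csv.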
Part 1 (20 pts): Online Perceptron. In the online perceptron algorithm we train a linear classifier with parameter w to predict the label of a sample with the equation:

ŷ = sign(w^T x)    (1)

where ŷ ∈ {−1, 1}. Algorithm 1 describes the online perceptron.
Algorithm 1 Online Perceptron

procedure OnlinePerceptron
    w ← 0
    iter ← 0
    while iter < iters:
        for all samples x_i in the train set:  // no shuffling
            u_i ← w^T x_i
            if y_i u_i ≤ 0:
                w ← w + y_i x_i
        iter ← iter + 1
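Algorithm 1 can be sketched in Python/NumPy roughly as below. The function names are illustrative; X is assumed to already include the bias column and y to hold ±1 labels:

```python
import numpy as np

def online_perceptron(X, y, iters=15):
    """Online perceptron (Algorithm 1): one weight update per mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        for x_i, y_i in zip(X, y):        # no shuffling
            if y_i * (w @ x_i) <= 0:      # mistake (or exactly on the boundary)
                w = w + y_i * x_i         # perceptron update
    return w

def predict(w, X):
    """Predict +1/-1, mapping a score of 0 to +1 by convention."""
    return np.where(X @ w >= 0, 1, -1)
```

At the end of each pass, one would call `predict(w, X_valid)` against the validation labels to record the accuracy curve the experiments ask for.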
As we can see, the weight w which is used to predict a sample at time step t + 1 is equivalent to:

w = Σ_{x_i ∈ S_t} y_i x_i    (2)

where S_t is a list containing all the previous samples that have been incorrectly classified by the model (some samples may appear multiple times). Therefore the prediction at time step t + 1 can also be given by:

ŷ_{t+1} = sign((Σ_{x_i ∈ S_t} y_i x_i)^T x_{t+1})    (3)
In this part we are interested in the following experiments for the online perceptron algorithm:

Implement the online perceptron model as described in Algorithm 1. Set iters = 15. During training, at the end of each iteration use the current w to make predictions on the validation samples. Record the accuracies for the train and validation sets at the end of each iteration. Plot the recorded train and validation accuracies versus the iteration number.

Does the train accuracy reach 100%? Why or why not?

Use the validation accuracy to decide the best number for iters. Apply the resulting model to make predictions for the samples in the test set. Generate the prediction file oplabel.csv. Please note that your file should only contain +1 (for 3) and −1 (for 5), and the number of rows should be the same as in pa2_test.csv.
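One possible way to write the prediction file is sketched below; we assume one ±1 label per row with no header, since the exact expected layout is not spelled out here:

```python
import numpy as np

def save_predictions(preds, path):
    """Write one +1/-1 label per row, in the same order as the test rows."""
    np.savetxt(path, np.asarray(preds, dtype=int), fmt="%d")
```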
Part 2 (20 pts): Average Perceptron. In this part we are interested in utilizing the average perceptron to deal with some issues regarding the online perceptron. Algorithm 2 describes the average perceptron.
As shown in Algorithm 2, we compute a running average w̄ which is used to predict the label of any sample x as follows:

ŷ(x) = sign(w̄^T x)    (4)
We are interested in the following experiments for the average perceptron:

Please implement the average perceptron described in Algorithm 2.

Plot the train and validation accuracies versus the iteration number for iters = 1, ..., 15.

How has the average model affected the validation accuracy compared to the online perceptron?
Algorithm 2 Average Perceptron

procedure AveragePerceptron
    w ← 0
    c ← 0
    w̄ ← 0  // keeps running average weight
    s ← 0  // keeps sum of cs
    while iter < iters:
        for all samples x_t in the train set:  // no shuffling
            u_t ← sign(w^T x_t)
            if y_t u_t ≤ 0:
                if s + c > 0:
                    w̄ ← (s w̄ + c w) / (s + c)
                    s ← s + c
                w ← w + y_t x_t
                c ← 0
            else: c ← c + 1
        iter ← iter + 1
    if c > 0:
        w̄ ← (s w̄ + c w) / (s + c)
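A rough Python/NumPy sketch of Algorithm 2 follows; the names are illustrative, and we assume the final fold of the counter c into the average happens once training ends:

```python
import numpy as np

def average_perceptron(X, y, iters=15):
    """Average perceptron (Algorithm 2): returns the running-average weights."""
    w = np.zeros(X.shape[1])       # current weight vector
    w_avg = np.zeros(X.shape[1])   # running average weight
    c = 0                          # samples survived since the last mistake
    s = 0                          # sum of past counts c
    for _ in range(iters):
        for x_t, y_t in zip(X, y):             # no shuffling
            if y_t * np.sign(w @ x_t) <= 0:    # mistake
                if s + c > 0:
                    w_avg = (s * w_avg + c * w) / (s + c)
                    s += c
                w = w + y_t * x_t
                c = 0
            else:
                c += 1
    if c > 0:                                  # fold in the final stretch
        w_avg = (s * w_avg + c * w) / (s + c)
    return w_avg
```

Predictions then use w̄ in place of w, i.e. `np.where(X @ w_avg >= 0, 1, -1)`.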

Part 3 (40 pts): Polynomial Kernel Perceptron. The online/average perceptrons in Algorithms 1 and 2 are linear models. In order to increase the model's complexity, one can project the feature vectors into a high (or even infinite) dimensional space by applying a projection function φ. In this case the prediction at time t + 1 is given by:
ŷ_{t+1} = sign(w_{t+1}^T φ(x_{t+1}))    (5)

Or equivalently by:

ŷ_{t+1} = sign((Σ_{x_i ∈ S_t} y_i φ(x_i))^T φ(x_{t+1}))    (6)
where S_t is a list containing all the previous sample vectors for which the model has made a wrong prediction (there can be repetitions of a sample). Simplifying the above equation yields:

ŷ_{t+1} = sign(Σ_{x_i ∈ S_t} y_i φ(x_{t+1})^T φ(x_i))    (7)
As we see in equation 7, the prediction at each time step utilizes a similarity term, i.e. φ(x_{t+1})^T φ(x_i), between the current projected sample φ(x_{t+1}) and all the previous samples for which we made incorrect predictions.
Let's define a kernel function with parameter vectors x and y to be k(x, y) = φ(x)^T φ(y). Therefore the equation is simplified to:

ŷ_{t+1} = sign(Σ_{x_i ∈ S_t} y_i k(x_{t+1}, x_i))    (8)
There are different kernel functions. For example, to compute the similarity between vectors x and y in the polynomial space (with polynomial degree p) we utilize the kernel below:

k_p(x, y) = (1 + x^T y)^p    (9)
For p = 1 we have a linear kernel. In this part we are interested in the polynomial kernel perceptron as described in Algorithm 3.
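The polynomial kernel of equation 9, and the Gram matrix built from it, might be sketched as follows; the vectorized `gram_matrix` helper is our addition, one of several ways to fill K:

```python
import numpy as np

def polynomial_kernel(x1, x2, p):
    """k_p(x, y) = (1 + x^T y)^p, equation (9)."""
    return (1.0 + np.dot(x1, x2)) ** p

def gram_matrix(X, p):
    """Gram matrix K with K[i, j] = k_p(x_i, x_j), computed all at once."""
    return (1.0 + X @ X.T) ** p
```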
Please consider the experiments below:

Implement the polynomial kernel function k_p in Algorithm 3. This function takes two vectors x_1 and x_2 and an integer p for the polynomial degree, and returns a real value.
Algorithm 3 Kernel (polynomial) Perceptron

procedure KernelPerceptron
    N: number of train samples; F: number of features
    X_{N×F} ← train set
    α_{N×1} ← 0
    y_{N×1} ← train labels
    k_p(x, y) = (1 + x^T y)^p
    K_{N×N}(i, j) = k_p(x_i, x_j)  for all x_i, x_j ∈ X
    while iter < iters:
        for all samples x_i ∈ X:  // no shuffling
            u ← sign(Σ_j K[j, i] α[j] y[j])
            if y_i u ≤ 0:
                α[i] ← α[i] + 1
        iter ← iter + 1
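Algorithm 3 might translate to Python/NumPy roughly as below. The names `kernel_perceptron` and `kernel_predict` are illustrative, and a score of exactly zero is treated as a mistake, matching the y_i u ≤ 0 test:

```python
import numpy as np

def kernel_perceptron(X, y, p, iters=15):
    """Kernel (polynomial) perceptron (Algorithm 3).
    Returns the mistake counts alpha, one per training sample."""
    n = X.shape[0]
    K = (1.0 + X @ X.T) ** p           # Gram matrix, K[i, j] = k_p(x_i, x_j)
    alpha = np.zeros(n)
    for _ in range(iters):
        for i in range(n):             # no shuffling
            u = np.sign(K[:, i] @ (alpha * y))
            if y[i] * u <= 0:
                alpha[i] += 1
    return alpha

def kernel_predict(X_train, y_train, alpha, X_new, p):
    """Predict +1/-1 labels for new samples using the learned alpha."""
    K_new = (1.0 + X_new @ X_train.T) ** p     # kernel similarities to train set
    scores = K_new @ (alpha * y_train)
    return np.where(scores >= 0, 1, -1)
```

Since K is symmetric, `K[:, i]` and `K[i, :]` are interchangeable here; precomputing K once per p keeps each pass over the training set cheap.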


Define a Gram matrix K with size N × N, where N is the number of training samples. Fill the matrix as K(i, j) = k_p(x_i, x_j) for all of the pairs in the training set.

Implement the rest of the kernel perceptron in Algorithm 3. For each p in [1, 2, 3, 7, 15]:

Run the algorithm to compute α.

At the end of each iteration, use the current α to predict the validation set.

Record the train and validation accuracy for each iteration and plot the train and validation accuracies versus the iteration number.

Record the best validation accuracy achieved for each p over all iterations.

Plot the recorded best validation accuracies versus degrees. Please explain how p affects the train and validation performance.

Use your best α (the best you found over all degrees and iterations above) to predict the test dataset. Please name the prediction file kplabel.csv.
Submission. Your submission should include the following:

Your source code with a short instruction on how to run the code in a readme.txt.

Your report, in PDF only, which begins with a general introduction section, followed by one section for each part of the assignment.

Two prediction files, oplabel.csv and kplabel.csv. These prediction files will be scored against the ground truth y values, and 10% of the grade will be based on this score.

Please note that all the files should be in one folder and compressed only as a .zip file.