Handwritten Digits Classification Solution



1    Introduction


In  this  assignment, your  task  is to  implement a  Multilayer  Perceptron Neural  Network  and  evaluate  its performance  in classifying handwritten digits.  For  CSE574 students  only: You will use the same network  to analyze  a more challenging  face dataset and compare  the performance  of the neural  network  against  a deep neural  network  using the TensorFlow  library.

After completing this assignment, you should understand:

  • How a Neural Network works, and how to use Feed Forward and Back Propagation to implement one
  • How to set up a Machine Learning experiment on real data
  • How regularization plays a role in the bias-variance tradeoff
  • For CSE574 students only: How to use the TensorFlow library to deploy deep neural networks, and how having multiple hidden layers can improve the performance of the neural network


To get started with the exercise, you will need to download the supporting files and unzip their contents into the directory in which you want to complete this assignment.



Warning: In this project, you will have to handle many computing-intensive tasks such as training a neural network. Our suggestion is to use the CSE server Metallica (this server is dedicated to intensive computing tasks) and the CSE server Springsteen (this boss server is dedicated to running TensorFlow) to run your computation. YOU MUST USE PYTHON 3 FOR IMPLEMENTATION. In addition, training on such a big dataset will take a very long time, maybe many hours or even days to complete. Therefore, we suggest that you start this project as soon as possible so that the computer will have time to do the heavy computational jobs.



1.1     Files included in this exercise

  • mnist_all.mat : original dataset from MNIST. In this file, there are 10 matrices for the testing set and 10 matrices for the training set, which correspond to the 10 digits. You will have to split the training data into training and validation data.
  • face_all.pickle : sample of face images from the CelebA data set. In this file there is one data matrix and one corresponding labels vector. The preprocess routines in the script files will split the data into training and testing data.
  • nnScript.py : Python script for this programming project. Contains the following function definitions:

–  preprocess(): performs some preprocessing tasks and outputs the preprocessed train, validation and test data with their corresponding labels. You need to make changes to this function.

–  sigmoid(): computes the sigmoid function. The input can be a scalar value, a vector or a matrix. You need to make changes to this function.

–  nnObjFunction(): computes the error function of the Neural Network. You need to make changes to this function.

–  nnPredict(): predicts the label of data given the parameters of the Neural Network. You need to make changes to this function.

–  initializeWeights(): returns the random weights for the Neural Network given the number of units in the input layer and output layer.

  • facennScript.py : Python script for running your neural network implementation on the CelebA dataset. This script will call your implementations of the functions sigmoid(), nnObjFunction() and nnPredict(), which you will have to copy from your nnScript.py (For CSE574 students only). You need to make changes to this script.
  • deepnnScript.py : Python script for calling the TensorFlow library for running the deep neural network (For CSE574 students only). You need to make changes to this script.


1.2     Datasets


Two data sets will be provided. Both consist of images. See the notebook available here – http://nbviewer.jupyter.org/github/ubdsgroup/ubmlcourse/blob/master/notebooks/ProgrammingAssignment1.ipynb, for pointers about how to handle the data.


1.2.1  MNIST Dataset


The MNIST dataset [1] consists of a training set of 60000 examples and a test set of 10000 examples. All digits have been size-normalized and centered in a fixed image of 28 × 28 size. In the original dataset, each pixel in the image is represented by an integer between 0 and 255, where 0 is black, 255 is white and anything in between represents a different shade of gray.

You will need to split the training set of 60000 examples into two sets. The first set of 50000 randomly sampled examples will be used for training the neural network. The remaining 10000 examples will be used as a validation set to estimate the hyper-parameters of the network (regularization constant λ, number of hidden units).
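The split described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name split_train_validation, the array names train_all and labels_all, and the fixed seed are hypothetical and not part of the provided script, and a small stand-in array replaces the real 60000-row training matrix.

```python
import numpy as np

def split_train_validation(data, labels, n_validation=10000, seed=0):
    """Randomly split the rows of `data` into training and validation parts."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(data.shape[0])      # random ordering of the row indices
    val_idx, train_idx = perm[:n_validation], perm[n_validation:]
    return data[train_idx], labels[train_idx], data[val_idx], labels[val_idx]

# Small stand-in for the real data: 60 "images" with 4 pixels each.
train_all = np.arange(240, dtype=float).reshape(60, 4)
labels_all = np.arange(60) % 10
tr_x, tr_y, va_x, va_y = split_train_validation(train_all, labels_all, n_validation=10)
print(tr_x.shape, va_x.shape)
```

Because the rows are permuted once and then cut, every example ends up in exactly one of the two sets.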


1.2.2   CelebFaces Attributes  Dataset (CelebA)


CelebFaces Attributes Dataset (CelebA) [3] is a large-scale face attributes dataset with more than 200K celebrity images. CelebA has large diversities, large quantities, and rich annotations, including:

  • 10,177 number of identities,
  • 202,599 number of face images, and
  • 5 landmark locations, 40 binary attributes annotations per image.


For this programming assignment, we have provided a subset of the images. The subset consists of data for 26407 face images, split into two classes. One class contains images in which the individual is wearing glasses and the other class contains images in which the individual is not wearing glasses. Each image is a 54 × 44 matrix, flattened into a vector of length 2376.



2    Your tasks

  • Implement Neural Network (forward pass and back propagation)
  • Incorporate regularization on the weights (λ)




Figure  1: Neural network



  • Use validation set to tune hyper-parameters for Neural Network  (number of units  in the  hidden  layer and λ).
  • For CSE574 students  only:  Run  the  deep neural  network  code we provided  and  compare  the  results with normal  neural  network.  The code will be released by Feb 20th 2017.
  • Write a report to explain  the experimental results.



3    Some practical tips in  implementation


3.1     Feature selection


In the dataset, one can observe that there are many features whose values are exactly the same for all data points in the training set. With those features, the classification models cannot gain any more information about the difference (or variation) between data points. Therefore, we can ignore those features in the pre-processing step.

Later  on in this course, you will learn more sophisticated models to reduce the dimension  of dataset (but not for this assignment).

Note:  You will need  to  save the  indices  of the  features  that you use and  submit  them  as part  of the submission.
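One possible way to carry out this step is sketched below. It assumes the training data is a NumPy matrix with one row per example and one column per feature; the helper name select_features and the toy matrix are hypothetical, chosen only for illustration.

```python
import numpy as np

def select_features(train_data):
    """Return the indices of columns whose values vary across the training set."""
    # A column carries no information if its maximum equals its minimum.
    varies = train_data.max(axis=0) != train_data.min(axis=0)
    return np.flatnonzero(varies)

# Toy training matrix: columns 0 and 2 are constant, columns 1 and 3 vary.
train_data = np.array([[0., 1., 5., 0.],
                       [0., 2., 5., 0.],
                       [0., 3., 5., 1.]])
selected_features = select_features(train_data)
reduced = train_data[:, selected_features]
print(selected_features)   # these are the indices to save and submit
```

The same index array should then be applied to the validation and test matrices, so that all three sets keep identical columns.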


3.2     Neural Network


3.2.1   Neural Network Representation


The neural network can be represented graphically as in Figure 1.

As Figure 1 shows, the neural network has 3 layers in total:

  • The first layer comprises (d + 1) units, each of which represents one feature of the image (there is one extra unit representing the bias).
  • The second layer is called the hidden layer, and its units are the hidden units. In this document, we denote by m + 1 the number of units in the hidden layer: m hidden units plus an additional bias node. Hidden units can be considered as learned features extracted from the original data set. Since the number of hidden units determines the dimension of the learned features, it is our job to choose an appropriate number of hidden units. Too many hidden units may lead to a slow training phase, while too few hidden units may cause under-fitting.
  • The third layer is also called the output layer. The value of the lth unit in the output layer represents the probability that a given hand-written image belongs to digit l. Since there are 10 possible digits, there are 10 units in the output layer. In this document, we denote by k the number of output units.


The parameters of the Neural Network model are the weights associated with the hidden layer units and the output layer units. In our standard Neural Network with 3 layers (input, hidden, output), we use 2 matrices to represent the model parameters:

  • $W^{(1)} \in \mathbb{R}^{m \times (d+1)}$ is the weight matrix of connections from the input layer to the hidden layer. Each row in this matrix corresponds to the weight vector of one hidden layer unit.
  • $W^{(2)} \in \mathbb{R}^{k \times (m+1)}$ is the weight matrix of connections from the hidden layer to the output layer. Each row in this matrix corresponds to the weight vector of one output layer unit.


We also further  assume that there  are n training samples when performing  learning  task of Neural Network.

In the next  section,  we will explain  how to perform learning  in Neural Network.


3.2.2   Feedforward Propagation


In Feedforward Propagation, given the parameters of the Neural Network and a feature vector x, we want to compute the probability that this feature vector belongs to a particular digit.

Suppose that we have m hidden units in total. Let $a_j$ for $1 \le j \le m$ be the linear combination of the input data and let $z_j$ be the output from hidden unit j after applying an activation function (in this exercise, we use the sigmoid as the activation function). For each hidden unit j ($j = 1, 2, \dots, m$), we can compute its value as follows:

$$a_j = \sum_{p=1}^{d+1} w^{(1)}_{jp} x_p \tag{1}$$

$$z_j = \sigma(a_j) = \frac{1}{1 + \exp(-a_j)} \tag{2}$$

where $w^{(1)}_{jp} = W^{(1)}[j][p]$ is the weight of the connection from the pth input feature to unit j in the hidden layer. Note that we do not compute the output for the bias hidden node (m + 1); $z_{m+1}$ is directly set to 1.

The third layer in the neural network is called the output layer, where the learned features in the hidden units are linearly combined and a sigmoid function is applied to produce the output. Since in this assignment we want to classify a hand-written digit image into its corresponding class, we can use one-vs-all binary classification, in which each output unit l ($l = 1, 2, \dots, 10$) in the neural network represents the probability that an image belongs to a particular digit. For this reason, the total number of output units is k = 10. Concretely, for each output unit l ($l = 1, 2, \dots, 10$), we can compute its value as follows:

$$b_l = \sum_{j=1}^{m+1} w^{(2)}_{lj} z_j \tag{3}$$

$$o_l = \sigma(b_l) = \frac{1}{1 + \exp(-b_l)} \tag{4}$$

Now we have finished the Feedforward pass.
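Equations (1) to (4) can be evaluated for all examples at once with matrix products. The sketch below is one possible vectorized form, not the reference implementation: it assumes the data matrix X has one example per row and that the bias unit is appended as the last column, which is consistent with the bias hidden node being indexed m + 1 but is otherwise an assumption. The function name feedforward is hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function; accepts a scalar, a vector or a matrix."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def feedforward(W1, W2, X):
    """Return inputs with bias, hidden outputs with bias, and network outputs.

    X has shape (n, d); W1 has shape (m, d + 1); W2 has shape (k, m + 1).
    """
    n = X.shape[0]
    X_b = np.hstack([X, np.ones((n, 1))])     # append the bias input (assumed last)
    A = X_b.dot(W1.T)                          # equation (1)
    Z = sigmoid(A)                             # equation (2)
    Z_b = np.hstack([Z, np.ones((n, 1))])     # bias hidden node z_{m+1} = 1
    B = Z_b.dot(W2.T)                          # equation (3)
    O = sigmoid(B)                             # equation (4)
    return X_b, Z_b, O

X = np.array([[0.2, 0.7], [0.5, 0.1]])         # n = 2 examples, d = 2 features
W1 = np.zeros((3, 3))                          # m = 3 hidden units
W2 = np.zeros((4, 4))                          # k = 4 output units
X_b, Z_b, O = feedforward(W1, W2, X)
print(O.shape)
```

With all weights set to zero, every output equals σ(0) = 0.5, which makes a convenient sanity check.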


3.2.3   Error function and Backpropagation


The error function in this case is the negative log-likelihood error function, which can be written as follows:

$$J(W^{(1)}, W^{(2)}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{k} \left( y_{il} \ln o_{il} + (1 - y_{il}) \ln(1 - o_{il}) \right) \tag{5}$$

where $y_{il}$ indicates the lth target value in the 1-of-K coding scheme of input data i and $o_{il}$ is the output at the lth output node for the ith data example (see (4)).


Because of the form of the error function in equation (5), we can separate the error function into the error for each input data $x_i$:

$$J(W^{(1)}, W^{(2)}) = \frac{1}{n} \sum_{i=1}^{n} J_i(W^{(1)}, W^{(2)}) \tag{6}$$

$$J_i(W^{(1)}, W^{(2)}) = -\sum_{l=1}^{k} \left( y_{il} \ln o_{il} + (1 - y_{il}) \ln(1 - o_{il}) \right) \tag{7}$$
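As a small worked example of (5) and (7), the sketch below builds the 1-of-K targets from integer labels and evaluates the error for a given output matrix. The helper names one_hot and negative_log_likelihood are hypothetical; the output matrix O is assumed to hold one row per example with entries strictly between 0 and 1.

```python
import numpy as np

def one_hot(labels, k):
    """1-of-K coding: row i has a 1 in column labels[i] and 0 elsewhere."""
    Y = np.zeros((labels.shape[0], k))
    Y[np.arange(labels.shape[0]), labels] = 1.0
    return Y

def negative_log_likelihood(O, Y):
    """Equation (5): the mean over examples of the per-example error J_i in (7)."""
    J_i = -np.sum(Y * np.log(O) + (1.0 - Y) * np.log(1.0 - O), axis=1)
    return J_i.mean()

labels = np.array([0, 2])
Y = one_hot(labels, k=3)
O = np.full((2, 3), 0.5)      # a network that outputs 0.5 everywhere
J = negative_log_likelihood(O, Y)
print(J)                      # equals 3 * ln(2) for k = 3 outputs
```

When every output is 0.5, each of the k terms in (7) contributes ln 2 regardless of the target, so the error is k ln 2.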


One way to learn the model parameters in neural networks is to initialize the weights to some random numbers and compute the output value (feed-forward), then compute the error in prediction, transmit this error backward and update the weights accordingly (error backpropagation).

The feed-forward  step can be computed directly  using formula  (1), (2), (3) and (4).

On the  other  hand,  the  error  backpropagation step  requires  computing  the  derivative  of error  function with respect  to the weight.

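Consider the derivative of the error function with respect to the weight from hidden unit j to output unit l, where $j = 1, 2, \dots, m + 1$ and $l = 1, \dots, 10$:

$$\frac{\partial J_i}{\partial w^{(2)}_{lj}} = \frac{\partial J_i}{\partial o_l} \frac{\partial o_l}{\partial b_l} \frac{\partial b_l}{\partial w^{(2)}_{lj}} \tag{8}$$

$$= \delta_l z_j \tag{9}$$

where

$$\delta_l = \frac{\partial J_i}{\partial o_l} \frac{\partial o_l}{\partial b_l} = -\left( \frac{y_l}{o_l} - \frac{1 - y_l}{1 - o_l} \right) (1 - o_l) o_l = o_l - y_l$$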


Note that we are dropping the subscript i for simplicity. The error function (log loss) that we are using in (5) is different from the squared loss error function that we have discussed in class. Note that the choice of the error function has “simplified” the expressions for the error!

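On the other hand, the derivative of the error function with respect to the weight from the pth input feature to hidden unit j, where $p = 1, 2, \dots, d + 1$ and $j = 1, \dots, m$, can be computed as follows:

$$\frac{\partial J_i}{\partial w^{(1)}_{jp}} = \sum_{l=1}^{k} \frac{\partial J_i}{\partial o_l} \frac{\partial o_l}{\partial b_l} \frac{\partial b_l}{\partial z_j} \frac{\partial z_j}{\partial a_j} \frac{\partial a_j}{\partial w^{(1)}_{jp}} \tag{10}$$

$$= \sum_{l=1}^{k} \delta_l w^{(2)}_{lj} (1 - z_j) z_j x_p \tag{11}$$

$$= (1 - z_j) z_j \left( \sum_{l=1}^{k} \delta_l w^{(2)}_{lj} \right) x_p \tag{12}$$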



Note that we do not compute  the gradient for the weights at the bias hidden  node.

After computing the derivative of the error function with respect to the weight of each connection in the neural network, we can now write the formula for the gradient of the error function:

$$\nabla J(W^{(1)}, W^{(2)}) = \frac{1}{n} \sum_{i=1}^{n} \nabla J_i(W^{(1)}, W^{(2)}) \tag{13}$$
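The derivatives in (9) and (12), averaged over the examples as in (13), can be written compactly with matrix products. The following is a sketch under assumptions rather than the expected solution: the function name error_and_gradients is hypothetical, the bias units are assumed to be the last column of each layer, and no regularization is included. A finite-difference comparison at the end gives a quick check that the analytic gradient agrees with the error function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error_and_gradients(W1, W2, X, Y):
    """Return J from (5) and the averaged gradients from (13), unregularized."""
    n = X.shape[0]
    X_b = np.hstack([X, np.ones((n, 1))])                        # inputs plus bias
    Z_b = np.hstack([sigmoid(X_b.dot(W1.T)), np.ones((n, 1))])   # hidden plus bias
    O = sigmoid(Z_b.dot(W2.T))
    J = -np.sum(Y * np.log(O) + (1 - Y) * np.log(1 - O)) / n

    delta = O - Y                                   # delta_l = o_l - y_l, shape (n, k)
    grad_W2 = delta.T.dot(Z_b) / n                  # equation (9), averaged over i
    # Equation (12): (1 - z_j) z_j (sum_l delta_l w_lj) x_p
    hidden_err = (1 - Z_b) * Z_b * delta.dot(W2)    # shape (n, m + 1)
    grad_W1 = hidden_err[:, :-1].T.dot(X_b) / n     # no gradient for the bias hidden node
    return J, grad_W1, grad_W2

rng = np.random.RandomState(0)
X = rng.rand(5, 3)                                  # n = 5, d = 3
Y = np.eye(2)[rng.randint(0, 2, size=5)]            # k = 2, 1-of-K targets
W1 = rng.randn(4, 4) * 0.1                          # m = 4
W2 = rng.randn(2, 5) * 0.1
J, g1, g2 = error_and_gradients(W1, W2, X, Y)

# Finite-difference check on a single weight of W1.
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
numeric = (error_and_gradients(W1_shift, W2, X, Y)[0] - J) / eps
print(abs(numeric - g1[0, 0]) < 1e-4)
```

A check of this kind is cheap on tiny inputs and tends to catch sign or indexing mistakes before a long training run.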


We can again use gradient descent to update each weight (denoted in general as w) with the following rule:

$$w^{new} = w^{old} - \gamma \nabla J(w^{old}) \tag{14}$$
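To make the update rule (14) concrete, here is a minimal sketch applied to a toy quadratic objective instead of the neural network error. The function name gradient_descent, the learning rate 0.1 and the iteration count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_fn, w0, gamma=0.1, n_iter=100):
    """Repeatedly apply w_new = w_old - gamma * grad J(w_old), equation (14)."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iter):
        w = w - gamma * grad_fn(w)
    return w

# Toy objective J(w) = ||w - 3||^2 / 2, whose gradient is (w - 3).
w_final = gradient_descent(lambda w: w - 3.0, w0=[0.0, 10.0])
print(w_final)   # both entries approach 3
```

For this objective, each step shrinks the distance to the minimizer by a factor of (1 − γ), so a larger γ (below 1) converges in fewer iterations.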


3.2.4   Regularization  in  Neural Network


To avoid the overfitting problem (the learned model fits the training data well but generalizes poorly when tested on validation data), we can add a regularization term to our error function to control the magnitude of the parameters in the Neural Network. Our objective function can therefore be rewritten as follows:

$$\tilde{J}(W^{(1)}, W^{(2)}) = J(W^{(1)}, W^{(2)}) + \frac{\lambda}{2n} \left( \sum_{j=1}^{m} \sum_{p=1}^{d+1} (w^{(1)}_{jp})^2 + \sum_{l=1}^{k} \sum_{j=1}^{m+1} (w^{(2)}_{lj})^2 \right) \tag{15}$$

where λ is the regularization coefficient.


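With this new objective function, the partial derivative of the new objective function with respect to a weight from the hidden layer to the output layer can be calculated as follows:

$$\frac{\partial \tilde{J}}{\partial w^{(2)}_{lj}} = \frac{1}{n} \left( \sum_{i=1}^{n} \frac{\partial J_i}{\partial w^{(2)}_{lj}} + \lambda w^{(2)}_{lj} \right) \tag{16}$$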




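Similarly, the partial derivative of the new objective function with respect to a weight from the input layer to the hidden layer can be calculated as follows:

$$\frac{\partial \tilde{J}}{\partial w^{(1)}_{jp}} = \frac{1}{n} \left( \sum_{i=1}^{n} \frac{\partial J_i}{\partial w^{(1)}_{jp}} + \lambda w^{(1)}_{jp} \right) \tag{17}$$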



With these new formulas for computing the objective function (15) and its partial derivatives with respect to the weights (16) (17), we can again use gradient descent to find the minimum of the objective function.
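The regularized formulas (15), (16) and (17) amount to a small correction on top of the unregularized error and its averaged gradients. The sketch below shows that correction in isolation; the function name add_regularization is hypothetical, and the inputs J, grad_W1 and grad_W2 are assumed to be the unregularized error and the gradients already averaged over the n examples. Following (15) as written, all entries of both matrices are penalized, including the bias weights.

```python
import numpy as np

def add_regularization(J, grad_W1, grad_W2, W1, W2, lam, n):
    """Apply (15)-(17) to an unregularized error J and its averaged gradients."""
    penalty = (lam / (2.0 * n)) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    J_reg = J + penalty                              # equation (15)
    grad_W2_reg = grad_W2 + (lam / n) * W2           # equation (16)
    grad_W1_reg = grad_W1 + (lam / n) * W1           # equation (17)
    return J_reg, grad_W1_reg, grad_W2_reg

# Toy numbers: 6 + 3 = 9 unit weights, lambda = 2, n = 4.
W1 = np.ones((2, 3))
W2 = np.ones((1, 3))
J_reg, g1_reg, g2_reg = add_regularization(
    1.0, np.zeros_like(W1), np.zeros_like(W2), W1, W2, lam=2.0, n=4)
print(J_reg)   # 1.0 + (2 / 8) * 9 = 3.25
```

Setting lam to 0 returns the inputs unchanged, which recovers the unregularized case.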


3.2.5   Python implementation of Neural Network


In  the  supporting files, we have  provided  the  base  code for you  to  complete.   In  particular, you  have  to complete  the following functions  in Python:

  • sigmoid : computes the sigmoid function. The input can be a scalar value, a vector or a matrix.
  • nnObjFunction : computes the objective function of the Neural Network with regularization and the gradient of the objective function.
  • nnPredict : predicts the label of data given the parameters of the Neural Network.

Details of how to implement the required functions are explained in the Python code.
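Since the lth output unit represents the probability of digit l, a natural prediction is the index of the largest output. The sketch below illustrates that idea; it is an assumption about one reasonable approach, not the required implementation, and the function name predict_labels and the hand-picked toy weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_labels(W1, W2, X):
    """Return, for each row of X, the index of the output unit with the largest value."""
    n = X.shape[0]
    X_b = np.hstack([X, np.ones((n, 1))])                        # inputs plus bias
    Z_b = np.hstack([sigmoid(X_b.dot(W1.T)), np.ones((n, 1))])   # hidden plus bias
    O = sigmoid(Z_b.dot(W2.T))
    return np.argmax(O, axis=1)

# Toy network with d = 1, m = 1, k = 2 and hand-picked weights.
W1 = np.array([[10.0, 0.0]])
W2 = np.array([[5.0, -2.5],
               [-5.0, 2.5]])
X = np.array([[1.0], [-1.0]])
true_labels = np.array([0, 1])
predicted = predict_labels(W1, W2, X)
accuracy = 100.0 * np.mean(predicted == true_labels)
print(predicted, accuracy)
```

The accuracy line shows how a percentage of correctly classified examples can be computed once predictions and true labels are both integer vectors.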



Optimization: In general, the learning phase of a Neural Network consists of 2 tasks. The first task is to compute the value and gradient of the error function given the Neural Network parameters. The second task is to optimize the error function given its value and gradient. As explained earlier, we can use gradient descent to solve the optimization problem. In this assignment, you have to use the Python scipy function scipy.optimize.minimize (with the option method='CG'), which performs the conjugate gradient algorithm. In principle, the conjugate gradient method is similar to gradient descent, but it chooses its search direction and step size γ more carefully in each iteration so that it converges faster than gradient descent. Details of how to use minimize are provided here: http://docs.scipy.org/doc/scipy-0.
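The calling pattern for scipy.optimize.minimize can be tried out on a toy problem before plugging in the network objective. In the sketch below, the quadratic function objective is a hypothetical stand-in for nnObjFunction: like the real objective, it returns both the value and the gradient, which is what jac=True tells minimize to expect. The assumption that the network weights would be flattened into a single parameter vector is ours and should be checked against the provided script.

```python
import numpy as np
from scipy.optimize import minimize

def objective(params, target):
    """Return (value, gradient) for a toy quadratic, mimicking an objective function."""
    diff = params - target
    return 0.5 * np.sum(diff ** 2), diff

target = np.array([1.0, -2.0, 3.0])
initial = np.zeros(3)

# jac=True: `objective` returns the gradient together with the value.
result = minimize(objective, initial, args=(target,), jac=True,
                  method='CG', options={'maxiter': 50})
print(result.x)   # close to `target`
```

The optimized parameters are available as result.x, and the maxiter option bounds the number of conjugate gradient iterations.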



We use regularization in the Neural Network to avoid the overfitting problem (more about this will be discussed in class). You are expected to try different values of λ to see their effect on the prediction accuracy on the validation set. Your report should include diagrams that explain the relation between λ and the performance of the Neural Network. Moreover, by plotting the accuracy of the Neural Network against the value of λ, you should explain in your report how to choose an appropriate hyper-parameter λ to avoid both the underfitting and overfitting problems. You can vary λ from 0 (no regularization) to 60 in increments of 5 or 10.

You are also expected to try different numbers of hidden units to see their effect on the performance of the Neural Network. Since training a Neural Network is very slow, especially when the number of hidden units is large, you should start with a small number of hidden units and gradually increase it to see how it affects the training time. Your report should include some diagrams that explain the relation between the number of hidden units and the training time. Recommended values: 4, 8, 12, 16, 20.
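The experiments over λ and the number of hidden units can be organized as a simple grid, recording accuracy and training time for each setting. The sketch below shows only the bookkeeping: the function sweep and the callback train_and_validate are hypothetical names, and placeholder_train is a dummy that does no training and returns a meaningless value, so it must be replaced by your own training and evaluation code.

```python
import itertools
import time

def sweep(train_and_validate, hidden_values, lambda_values):
    """Run every (n_hidden, lambda) pair; record validation accuracy and time."""
    results = []
    for n_hidden, lam in itertools.product(hidden_values, lambda_values):
        start = time.perf_counter()
        accuracy = train_and_validate(n_hidden, lam)
        elapsed = time.perf_counter() - start
        results.append({'n_hidden': n_hidden, 'lambda': lam,
                        'accuracy': accuracy, 'seconds': elapsed})
    return results

def placeholder_train(n_hidden, lam):
    # Placeholder only: a real callback would train the network with these
    # settings and return its validation accuracy. This value is meaningless.
    return 0.0

results = sweep(placeholder_train, [4, 8, 12, 16, 20], range(0, 61, 10))
best = max(results, key=lambda r: r['accuracy'])
print(len(results))   # 5 hidden-unit settings x 7 lambda values
```

The resulting list of records can be fed directly into a plotting routine to produce the diagrams requested for the report.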



4    TensorFlow Library


In this assignment you will only implement a Neural Network with a single hidden layer. You will realize that implementing multiple layers can be a very cumbersome coding task. However, additional layers can provide a better modeling of the data set. The analysis of the challenging CelebA data set will show how adding more layers can improve the performance of the Neural Network. To experiment with Neural Networks with multiple layers, we will use Google’s TensorFlow library (https://www.tensorflow.org/).

Your experiments should include the following:

  • Evaluate the accuracy of single hidden  layer  Neural  Network  on CelebA  data  set  (test  data  only), to distinguish between two classes – wearing  glasses and  not  wearing  glasses.  Use facennScript.py to obtain  these results.
  • Evaluate the accuracy of deep Neural Network (try 3, 5, and 7 hidden layers) on CelebA data  set (test data  only).  Use deepnnScript.py to obtain  these results.
  • Compare the performance  of single vs.  deep Neural  Networks  in terms  of accuracy  on test  data  and learning  time.



5    Submission


You are required  to submit  a single file called proj1.zip  using UBLearns. File proj1.zip  must  contain  2 folders:  report  and code.

  • Folder report contains  your  report  file (in pdf format).  Please  indicate  the  team  members  and  your course number  on the top of the report.
  • Folder code must contain the following updated files: nnScript.py and params.pickle (see footnote 1). File params.pickle contains the learned parameters of the Neural Network. Concretely, file params.pickle must contain the following variables: the list of selected features obtained after the feature selection step (selected_features), the optimal n_hidden (number of units in the hidden layer), w1 (weight matrix W(1) as mentioned in section 3.2.1), w2 (weight matrix W(2) as mentioned in section 3.2.1), and the optimal λ (regularization coefficient λ as mentioned in section 3.2.4) (see footnote 2).
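A minimal sketch of writing and reading such a file with the standard pickle module is shown below. The exact layout that the grading script expects is not specified in this document, so storing the variables in a dictionary, the key name 'lambda', and the placeholder values are all assumptions made for illustration. The example writes to a temporary folder; for the submission the file would be saved as params.pickle inside the code folder.

```python
import os
import pickle
import tempfile

import numpy as np

# Placeholder values; in practice these come from your own training run.
params = {
    'selected_features': np.array([1, 3]),
    'n_hidden': 4,
    'w1': np.zeros((4, 3)),      # shape (n_hidden, number of selected features + 1)
    'w2': np.zeros((10, 5)),     # shape (number of classes, n_hidden + 1)
    'lambda': 10,
}

out_path = os.path.join(tempfile.mkdtemp(), 'params.pickle')
with open(out_path, 'wb') as f:
    pickle.dump(params, f)

# Reading the file back is a quick way to confirm that it is complete.
with open(out_path, 'rb') as f:
    loaded = pickle.load(f)
print(sorted(loaded.keys()))
```

Note that the file must be opened in binary mode ('wb' and 'rb') for pickle to work in Python 3.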



Using UBLearns  Submission: In the  groups  page of the  UBLearns  website  you will see groups called “4/574  Project  Group  x”.  Please choose any available  group number  for your group and join the group.   All project  group  members  must  join the  same group.   Please  do not  join any  other  group  on UBLearns  that you are not part  of. You should submit  one solution per group through the groups page.


1 Check  this  to  learn  how to  pickle  objects in Python:  https://wiki.python.org/moin/UsingPickle

2 If you  want  to  write  more  supporting functions to  complete the  required functions, you  should include these  supporting functions and  a README file which  explains your  supporting functions.


Project report: The hard copy of the report will be collected in class on the due date. Your report should include the following:

  • Explanation of how to choose  the hyper-parameters for Neural  Network  (number   of hidden  units, regularization term  λ).
  • For CSE574 students  only:  Compare  the results  of deep neural  network  and  neural  network  with one hidden  layer on the CelebA data  set.



6    Grading scheme


The TAs will deploy a testing script that will test the functionality of the individual functions that you submit within the nnScript.py file. Full points will be awarded if the output of the function exactly matches the expected output. The second grading script will load the params.pickle file that you submit and then test it on a small testing data set. You get full points (10) if the accuracy using your model parameters is within ±5% of the accuracy reported by our code. Note that this data set will not be made available to you.

  • For CSE474 students [Total 100 points]:

–  Successfully implement Neural Network: 60 points (preprocess() [10 points], sigmoid() [10 points], nnObjFunction() [30 points], nnPredict() [10 points]).

–  Project report: 40 points

∗ Explanation with supporting figures of how to choose the hyper-parameter for Neural Network: 30 points

∗ Accuracy of classification method on the handwritten digits test data: 10 points

  • For CSE574 students [Total 120 points]:

–  Successfully implement Neural Network: 60 points (preprocess() [10 points], sigmoid() [10 points], nnObjFunction() [30 points], nnPredict() [10 points]).

–  Project report: 60 points

∗ Explanation with supporting figures of how to choose the hyper-parameter for Neural Network: 30 points

∗ Accuracy of classification method on the handwritten digits test data: 10 points

∗ Accuracy of classification method on the CelebA data set: 10 points

∗ Comparison of your neural network with a deep neural network (using TensorFlow) in terms of accuracy and training time: 10 points

  • Students in the CSE474 section may attempt the CSE574 requirements (CelebA data analysis, comparison with deep neural networks) for extra credit.



7    Computing Resources


You  are  allowed  to  implement  the  project  on your  personal  computers  using  Python 3.4 or above.   You will need  numpy and  scipy libraries.    If you  need  to  use  departmental resources,  you  will need  to  use metallica.cse.buffalo.edu, which has Python 3.4.3 and the required  libraries  installed.

Students attempting to use the TensorFlow  library  have two options:


  1. Install TensorFlow on personal machines. Detailed installation information is here – https://www.tensorflow.org/. Note that, since TensorFlow is a relatively new library, you might encounter installation issues depending on your OS and other library versions. We will not be providing any detailed support regarding TensorFlow installation. If issues persist, we recommend using option 2.

  2. Use springsteen.cse.buffalo.edu. If you are registered in the class, you should have an account on that server. The server already has Python 3.4.3 and TensorFlow 0.12.1 installed. Please use /util/bin/python for Python 3. Note that TensorFlow will not work on metallica.cse.buffalo.edu.




[1] LeCun,  Yann;  Corinna  Cortes,  Christopher J.C.  Burges. “MNIST  handwritten digit database”.


[2] Bishop, Christopher M. “Pattern recognition  and machine  learning  (information science and statistics)” (2007).


[3] Liu, Ziwei; Luo, Ping;  Wang,  Xiaogang;  Tang,  Xiaoou.  “Deep  Learning  Face  Attributes in the  Wild”, Proceedings  of International Conference on Computer Vision (ICCV)  (2015).
