Description
Questions
1. (30 points) In this problem, you will implement a program to fit two multivariate Gaussian distributions to the 2-class data and classify the test data by computing the log odds log [P(C_1 | x) / P(C_2 | x)]. The priors P(C_1) and P(C_2) should be estimated from the training data. Three pairs of training data and test data are given. The parameters μ_1, μ_2, S_1 and S_2, the mean and covariance for class 1 and class 2, are learned in the following three models for each training data and test data pair:
Model 1: Assume independent S_1 and S_2 (the discriminant function is as equation (5.17) in the textbook).
Model 2: Assume S_1 = S_2; in other words, a shared S between the two classes (the discriminant function is as equation (5.22) in the textbook).
Model 3: Assume S_1 and S_2 are diagonal and the diagonal entries are identical within each of S_1 and S_2: S_1 = σ_1^2 I, S_2 = σ_2^2 I. (You need to derive the discriminant function yourself.)
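For reference, equation (5.17) in the textbook gives the model-1 discriminant (dropping the constant -(d/2) log 2π, which is common to both classes), and the log odds is the difference of the two discriminants:

```latex
g_i(x) = -\tfrac{1}{2}\log|S_i|
         - \tfrac{1}{2}(x-\mu_i)^{T} S_i^{-1} (x-\mu_i)
         + \log \hat{P}(C_i),
\qquad
\log\frac{P(C_1 \mid x)}{P(C_2 \mid x)} = g_1(x) - g_2(x).
```

A test point is assigned to class C_1 when the log odds is positive and to C_2 otherwise.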


(10 points) Write the likelihood function and derive S_1 and S_2 by maximum likelihood estimation for model 2 and model 3.
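As a starting point (not the full derivation, which is left to you), the log-likelihood of the N_i training samples of class C_i under a d-dimensional multivariate Gaussian is:

```latex
\log L(\mu_i, S_i)
 = -\frac{N_i d}{2}\log(2\pi)
   - \frac{N_i}{2}\log|S_i|
   - \frac{1}{2}\sum_{t\,:\,x^t \in C_i} (x^t - \mu_i)^{T} S_i^{-1} (x^t - \mu_i)
```

Setting the derivatives with respect to the covariance parameters to zero, under each model's constraint on S_1 and S_2, yields the maximum likelihood estimators.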



(10 points) Your program should return and print out the learned parameters P(C_1), P(C_2), μ_1 and μ_2 of each data pair to the MATLAB command window. Your implementation of model 1 and model 2 should return and print out the learned parameters S_1 and S_2. Your implementation of model 3 should return and print out σ_1^2 and σ_2^2.



(10 points) For each test set, print out the error rates of each model to the MATLAB command window (three models per test set). Match each data pair to one of the models and justify your answer. Also, explain the differences in your results in the report.

^{1}Instructor: Rui Kuang (kuan0009@umn.edu). TA: Jungseok Hong (jungseok@umn.edu) and Ujval Bangalore Umesh (banga038@umn.edu).

2. In this problem, you will apply dimension reduction and classification on the Optdigits dataset provided in optdigits_train.txt and optdigits_test.txt.


(5 points) Implement k-Nearest Neighbor (KNN) to classify the Optdigits dataset with k = {1, 3, 5, 7}. Print out the error rate on the test set for each value of k to the MATLAB command window.
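Submissions must be in MATLAB; purely as an illustration of the algorithm, a NumPy sketch of KNN with an error-rate computation (function names here are hypothetical, not part of the required interface) might look like:

```python
import numpy as np

def my_knn(train_X, train_y, test_X, k):
    """Classify each test row by majority vote among its k nearest
    training rows under Euclidean distance."""
    preds = []
    for x in test_X:
        # squared Euclidean distance to every training sample
        d = np.sum((train_X - x) ** 2, axis=1)
        nearest = train_y[np.argsort(d)[:k]]
        # majority vote; ties broken by the smallest label
        labels, counts = np.unique(nearest, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def error_rate(pred, truth):
    """Fraction of misclassified test samples."""
    return np.mean(pred != truth)
```

The same two steps (sort distances, take the majority label of the first k) translate directly to MATLAB with sort and mode.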



(10 points) Implement your own version of Principal Component Analysis (PCA) and apply it to the Optdigits training data. Generate a plot of the proportion of variance (see Figure 6.4 (b) in the main textbook), and select the minimum number K of eigenvectors that explain at least 90% of the variance. Show both the plot and K in the report. Project the training and test data onto the K principal components and run KNN on the projected data for k = {1, 3, 5, 7}. Print out the error rate on the test set for each value of k to the MATLAB command window.
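Again for illustration only (the submission must be MATLAB), a NumPy sketch of PCA via the eigendecomposition of the covariance matrix, plus selection of the minimum K explaining 90% of the variance:

```python
import numpy as np

def my_pca(data):
    """Return principal components (as columns) and eigenvalues,
    sorted by decreasing eigenvalue."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)      # eigh: ascending order
    order = np.argsort(vals)[::-1]        # re-sort, largest first
    return vecs[:, order], vals[order]

def min_components_for_variance(eigvals, threshold=0.9):
    """Smallest K whose leading eigenvalues explain at least
    `threshold` of the total variance (the proportion-of-variance
    curve of Figure 6.4 (b))."""
    prop = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(prop, threshold) + 1)
```

In MATLAB the analogous steps use cov and eig; note that eig does not guarantee sorted eigenvalues, so the same explicit sort is needed.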



(5 points) Next, project both the training and test data to R^{2} using only the first two principal components, plot all samples in the projected space, and label some data points with the corresponding digit in 10 different colors for the 10 types of digits for a good visualization (similar to Figure 6.5).



(10 points) Implement your own version of Linear Discriminant Analysis (LDA) and apply it, using only the Optdigits training data, to compute a projection into L dimensions (L = 2, 4, 9). Run KNN on the projected data for k = {1, 3, 5}. Print out the error rate on the test set for each combination of k and L to the MATLAB command window. (Hint: the MATLAB function pinv() can be used to invert a singular matrix as an approximation.)
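As a non-authoritative sketch of the scatter-matrix formulation of LDA (in NumPy rather than MATLAB; np.linalg.pinv plays the role of the pinv() mentioned in the hint):

```python
import numpy as np

def my_lda(X, y, L):
    """Return an L-column projection matrix: the eigenvectors of
    pinv(S_W) @ S_B with the largest eigenvalues."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))   # within-class scatter
    S_B = np.zeros((d, d))   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)
    # pinv handles a singular S_W, as the hint suggests
    vals, vecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[:L]].real
```

With 10 digit classes, S_B has rank at most 9, which is why L = 9 is the largest useful projection dimension.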



(10 points) Similarly, project both the training and test data to R^{2} with the LDA projections, plot all samples in the projected space, and label some data points with the corresponding digit in 10 different colors for the 10 types of digits.


3. In this problem, you will work on dimension reduction and classification on a Faces dataset from the UCI repository^{2}. We provide the processed files face_train_data_960.txt and face_test_data_960.txt with 500 and 124 images, respectively. Each image is of size 30 × 32 with the pixel values in a row in the files, and the last column identifies the label of the image: 1 (sunglasses) and 0 (open). You can visualize the ith image with the following MATLAB command line:
imagesc(reshape(faces_data(i,1:end-1),32,30)').

^{2}https://archive.ics.uci.edu/ml/datasets/CMU+Face+Images

(10 points) Implement PCA and apply it to find the principal components with the combined training and test sets. First, visualize the first 5 eigenfaces using a similar command line as above.

(10 points) Repeat what you did in Question 2 (b), using PCA and KNN on this Faces dataset.

(10 points) Use the first K = {10, 50, 100} principal components to approximate the first five images of the training set (first rows of the data matrix) by projecting the centered data using the first K principal components, then "back project" (a weighted sum of the components) to the original space and add the mean. For each K, plot the reconstructed images. Explain your observations in the report.
(Hint: Read section 6.3 on pages 126 and 127 of the textbook for the projection and "back projection" to the original space.)
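A minimal NumPy sketch of the projection and back-projection described above (assuming W holds the principal components as columns, sorted by decreasing eigenvalue; in MATLAB the same two matrix products apply):

```python
import numpy as np

def reconstruct(X, W, K, mean):
    """Approximate X using its first K principal components:
    z = W_K^T (x - m), then x_hat = W_K z + m (section 6.3)."""
    Wk = W[:, :K]
    Z = (X - mean) @ Wk        # forward projection to K dims
    return Z @ Wk.T + mean     # back projection, plus the mean
```

As K grows toward the data dimensionality, the reconstruction error shrinks, which is what the plots for K = 10, 50, 100 should show.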
Instructions
Solutions to all questions must be included in a report, including result explanations, learned parameter values, and all error rates and plots.
All programming questions must be written in MATLAB; no other programming languages will be accepted. The code must be executable from the MATLAB command window on the cselabs machines. Each function must take the inputs in the order specified and print/display the required output to the MATLAB command window. For each part, you can submit additional files/functions (as needed) which will be used by the main functions specified below. Put comments in your code so that one can follow the key parts and steps. Please follow the rules strictly. If we cannot run your code, you will receive no credit.
Question 1:
– MultiGaussian(training_data: file name of the training data, testing_data: file name of the testing data, Model: the model number). The function must output the learned parameters and error rates as required in Question 1.
Question 2:
– myKNN(training_data, test_data, k). The function returns the predictions for the test set.
– myPCA(data, num_principal_components). The function returns the principal components and the corresponding eigenvalues.
– myLDA(data, num_principal_components). The function returns the projection matrix and the corresponding eigenvalues.
– script_2a.m, script_2b.m and script_2c.m: script files that solve question 2 (a), (b), (c), (d) and (e), calling the appropriate functions, producing the plots, and printing the values asked for.
Question 3:
– script_3a.m, script_3b.m and script_3c.m: script files that solve question 3 (a), (b) and (c), calling the appropriate functions, producing the plots, and printing the values asked for.
For each dataset, rows are the samples and columns are the features with the last column containing the label.
You can use the eig function to calculate eigenvalues and eigenvectors. To visualize the projected data, you can use the text function. To specify the color, use the Color parameter in the text function. If the figure does not show all the data, you can use the axis function to scale the axes.
Submission
Things to submit:


hw2_sol.pdf: A document which contains the report with solutions to all questions.



MultiGaussian: Code for Question 1.



myKNN.m, myPCA.m, myLDA.m, script_2a.m, script_2b.m, script_2c.m: Code for Question 2.


script_3a.m, script_3b.m, script_3c.m: Code for Question 3.

Any other files, except the data, which are necessary for your code.
Instructions for Submission:
All material must be submitted electronically via canvas.
A zip file containing all the files mentioned above, except the report hw2_sol.pdf, which should be submitted separately and not included in the zip file.
Failure to follow these instructions may result in lost points.