Description
Notes:
There are 3 problems on 6 pages in this homework. This homework consists of two parts.
Implementation part: Problem 1(c) and 3 (you may work in groups of 2)
– You may work in pairs on the implementation part. Each group should turn in the source code only once. Please do not collaborate with anyone other than your partner. Indicate group work by adding both wustlkeys to the partners.txt file and commit this file to both repositories (partnerships must be mutually accepted!).
– Stubs for your implementations are in your SVN repository. Your implementations will be autograded. Follow the submission instructions exactly. Do not change the names or signatures of the functions, as this will cause the autograder to fail, in which case you cannot get credit!
– Comment your code properly.
Written part: Problem 1(ab) and 2 (no group work)
– Submit all written answers by committing a single pdf file named YOUR_WUSTL_KEY_hw5.pdf to the hw5 folder in your SVN repository.
Autograder: The autograder for this class is used for grading only. It is not meant to be used for debugging. We will run it after the submission deadline. Since this is the first time that your work will be autograded, we offer an autograder deadline prior to the submission deadline, where we grade your work and give you the autograder's feedback. This deadline is optional, but highly recommended!
THU 3rd of Nov 10am: deadline for optional autograder feedback
TUE 8th of Nov 10am: submission deadline
THU 10th of Nov 10am: deadline after automatic extension (no late days after this deadline!) If you have any questions with respect to the autograder, ask on Piazza using the autograder tag.
Problems:

(30 points) L2-Distances through Matrix Operations
Many machine learning algorithms access their input data primarily through pairwise distances. It is therefore important that we have a fast function that computes the pairwise (Euclidean) distances of input vectors. Assume we have $n$ data vectors $\vec{x}_1, \dots, \vec{x}_n \in \mathbb{R}^d$ and $m$ vectors $\vec{z}_1, \dots, \vec{z}_m \in \mathbb{R}^d$. And let us define two matrices $X = [\vec{x}_1, \dots, \vec{x}_n] \in \mathbb{R}^{d \times n}$, where the $i^{th}$ column is the vector $\vec{x}_i$, and similarly $Z = [\vec{z}_1, \dots, \vec{z}_m] \in \mathbb{R}^{d \times m}$. Our distance function takes as input the two matrices $X$ and $Z$ and outputs a matrix $D \in \mathbb{R}^{n \times m}$, where
$$D_{ij} = \sqrt{(\vec{x}_i - \vec{z}_j)^\top (\vec{x}_i - \vec{z}_j)}.$$
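For concreteness, the definition above can be evaluated directly with loops on a tiny made-up example (this is essentially what the provided l2distanceSlow.m does; the numbers here are illustrative only):

```matlab
% Columns of X and Z are the data vectors (d = 2, n = 3, m = 2).
X = [1 0 2; 0 1 2];          % three vectors in R^2
Z = [1 1; 1 0];              % two vectors in R^2
n = size(X,2); m = size(Z,2);
D = zeros(n, m);
for i = 1:n
    for j = 1:m
        diff = X(:,i) - Z(:,j);
        D(i,j) = sqrt(diff' * diff);   % D_ij exactly as defined above
    end
end
disp(D);                     % a 3-by-2 matrix of pairwise distances
```

Your vectorized implementation in part (c) must produce the same matrix without any loops.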
The key to efficient programming in MATLAB, and in machine learning in general, is to think about a problem in terms of mathematics, and not in terms of loops.
(a) Show that the Gram matrix $G$ (also known as the inner-product matrix) with
$$G_{ij} = \vec{x}_i^\top \vec{z}_j$$
can be expressed in terms of a pure matrix multiplication.
(b) Let us define two new matrices $S, R \in \mathbb{R}^{n \times m}$ with
$$S_{ij} = \vec{x}_i^\top \vec{x}_i, \qquad R_{ij} = \vec{z}_j^\top \vec{z}_j.$$
Show that the squared-Euclidean distance matrix $D^2 \in \mathbb{R}^{n \times m}$, defined as
$$D_{ij}^2 = (\vec{x}_i - \vec{z}_j)^\top (\vec{x}_i - \vec{z}_j),$$
can be expressed as a linear combination of the matrices $S$, $G$, $R$. (Hint: It might help to first express $D_{ij}^2$ in terms of inner products.) Further, think about and answer the following questions:
What mathematical property do all entries of D satisfy?
What are the diagonal entries of D assuming that X = Z (we want the distance among all data points in X)?
What do you need to do to obtain the true Euclidean distance matrix D?
Remember the answers to all of these questions and ensure that your implementation in the next part satisfies all of them.

Remember the slow function l2distanceSlow.m, which uses nested for-loops to compute the L2-distance. You can find it in the hw5 folder in your SVN repository. Read through the code carefully and make sure you understand it. It is perfectly correct and will produce the correct result … eventually. To see what is wrong, run the following program in the MATLAB console:


X=rand(100,10000);
Z=rand(100,2000);
tic;D=l2distanceSlow(X,Z);toc

This code defines some random data in X and Z and computes the corresponding distance matrix D. The tic, toc statements time how long this takes. On my laptop it
takes over a minute. This is way too slow! If I were to compute the distances between larger training and testing sets or higher dimensional data points (imagine for instance images with millions of pixels), it would take days! The problem is that the distance function uses a programming style that is well suited for C++ or Java, but not MATLAB!
Implement the function l2distance.m (a stub file is in your SVN repository), which computes the Euclidean distance matrix D without a single loop. Remember all your answers to the previous parts of this problem and test your implementation to make sure it incorporates all of them. Once you are sure that your implementation is correct, time the distance function again:

X=rand(100,10000);
Z=rand(100,2000);
tic;D=l2distance(X,Z);toc
How much faster is your code now? With your new implementation you should easily be able to compute the distances between many more vectors. Call the function l2dist_tictoc to see how many distance computations you can perform within one second. With the loopy implementation, the same computation might have taken you several days.
Submit your implementation by committing your function l2distance.m, and the partners.txt file to the hw5 folder in your SVN repository.

(20 points) k-Nearest Neighbor
In this problem, you are going to look at a small dataset to understand various properties of kNN better. Suppose there is a set of points on a two-dimensional plane from two different classes. Below are the coordinates of all points.
Points in class red: (0, 1), (2, 3), (4, 4)
Points in class blue: (2, 0), (5, 2), (6, 3)


Draw the k-nearest-neighbor decision boundary for k = 1. Remember that the decision boundary is defined as the line where the classification of a test point changes. Use the standard Euclidean distance between points to determine the nearest neighbors. Start by plotting the points on a two-dimensional graph. Please use the corresponding colors for the points of each class (i.e., red and blue).
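As a starting point, the points from the problem statement can be plotted in MATLAB as follows (a minimal sketch; marker styles and axis limits are of course up to you):

```matlab
% Plot the two classes; each row of the matrices is a point (x, y).
red  = [0 1; 2 3; 4 4];
blue = [2 0; 5 2; 6 3];
figure; hold on;
plot(red(:,1),  red(:,2),  'ro', 'MarkerFaceColor', 'r');
plot(blue(:,1), blue(:,2), 'bo', 'MarkerFaceColor', 'b');
axis([-1 7 -1 5]); grid on;
legend('class red', 'class blue');
hold off;
```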



If the y-coordinate of each point were multiplied by 5, what would happen to the k = 1 boundary (draw another picture)? Explain in at most two sentences how this effect might cause problems when working with real data.



The kNN decision boundary for k = 3 is shown below. Suppose now we have a test point at (1, 2). How would it be classified under 3NN? Given that you can modify the 3NN decision boundary by adding points to the training set in the diagram, what is the minimum number of points that you need to add to change the classification at (1, 2)? Show also where you need to add these points.




What is the testing complexity for one instance, i.e., how long does it take kNN to classify one point? Assume your data has dimensionality d, you have n training examples, and you use Euclidean distance. Assume also that you use a quickselect implementation which gives you the k smallest elements of a list of length m in O(m).


Suggest two strategies to decrease the testing complexity. Provide the average-case and worst-case testing complexities for both strategies. At what cost are you gaining this increased efficiency?

(50 points) Build a k-Nearest Neighbor Classifier
In this project, you will build a kNN classifier for handwritten digit classification and face recognition. The data for the experiments resides in the files faces.mat and digits.mat (download them from Piazza resources; do NOT add them to your SVN repositories).
Data description:
xTr are the training vectors with labels yTr. xTe are the testing vectors with labels yTe.
As a reminder, to predict the label or class of an image in xTe, we will look for the knearest neighbors in xTr and predict a label based on their labels in yTr. For evaluation, we will compare these labels against the true labels provided in yTe.
Data visualization:
You can visualize one of the faces by running


load faces
figure(1);
clf;
imagesc(reshape(xTr(:,1),38,31));
colormap gray;

Note that 38 × 31 is the size of the image. We convert it from the flat vector form in the dataset to a matrix which can be visualized. Note: If your images appear upside down, use imagesc(flipud(reshape(xTr(:,1),38,31))).
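If you want to look at more than one face, a loop over the first few columns works the same way (a small sketch, assuming faces.mat has been loaded as above):

```matlab
% Show the first 9 faces in a 3-by-3 grid.
figure(2); clf;
for i = 1:9
    subplot(3, 3, i);
    imagesc(reshape(xTr(:,i), 38, 31));
    colormap gray; axis off;
end
```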
Implementation:
The following parts will walk you through the project. Your development will work out best if you finish these functions in this order. Note that your project will also be graded based on efficiency, so avoid tight loops at all costs. Stub files for all functions are in the hw5 folder in your SVN repository.

Implement the function findknn.m, which should find the k-nearest neighbors of a set of vectors within a given training data set based on Euclidean distance (Hint: use your l2distance function here). The call of
>> [I,D]=findknn(xTr,xTe,k);
should result in two matrices I and D, both of dimensions k × n, where n is the number of input vectors in xTe. The entry I(i,j) is the index of the i^th nearest neighbor of the vector xTe(:,j). So, for example, if we set i=I(1,3), then xTr(:,i) is the first nearest neighbor of vector xTe(:,3). The second matrix D contains the corresponding Euclidean distances, i.e., D(i,j) is the distance of xTe(:,j) to its i^th nearest neighbor.
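Once your findknn.m is implemented, one way to sanity-check the relationship between I and D is the following (a hypothetical check for your own debugging, not part of the graded submission):

```matlab
% Assumes your findknn.m and the faces data (xTr, xTe) are available.
[I, D] = findknn(xTr, xTe, 3);
i = I(1, 3);                       % index of the nearest neighbor of xTe(:,3)
d = norm(xTr(:,i) - xTe(:,3));     % recompute that distance directly
% d should agree with D(1,3) up to floating-point round-off,
% and each column of D should be sorted in increasing order.
```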

The function analyze.m should compute various evaluation metrics. The call of
>> result=analyze(kind,truth,preds);
should output the accuracy or absolute loss in the variable result. The type of output required is specified in the input argument kind as the string 'abs' or 'acc'. The input variables truth and preds should contain vectors of true and predicted labels, respectively. For example, the call
>> analyze('acc',[1 2 1 2],[1 2 1 1])
should return an accuracy of 0.75. Here, the true labels are 1, 2, 1, 2 and the predicted labels are 1, 2, 1, 1. So the first three examples are classified correctly and the last one is wrong, giving 75% accuracy.