- This is a bonus assignment. If submitted, this can count for up to 5% of the final grade.
- The following languages are acceptable: Java, C/C++, Matlab, Python and R.
- You can work in team of up to 3 people. Each team will only need to submit one copy of the source code and report.
- Your source code and report will be submitted through the TEACH site https://secure.engr.oregonstate. edu:8000/teach.php?type=want_auth
Please clearly indicate your team members’ information.
- Be sure to answer all the questions in your report. You will be graded based on your code as well as the report. So please write your report in clear and concise manner. Clearly label your figures, legends, and tables.
- In your report, the results should always be accompanied by discussions of the results. Do the results follow your expectation? Any surprises? What kind of explanation can you provide?
Dimension Reduction & Clustering
In this assignment you will implement 1) Principle Component Analysis (PCA); 2) Linear Discriminant
Analysis (LDA); and 3) Kmeans clustering. You will run your implementation on the provided data set, which are sensor readings of a smart phone while the user is doing various activities (walking, standing, sitting, etc). For our purposes, we have filtered this data down to only two categories: walking up stairs and walking down stairs1 . The dimensionality is 477 different sensor readings.2
You will apply PCA and LDA on the “walking” data set to reduce the dimensions, and then K-Means clustering to the resulting reduced data, comparing the results. You will use the purity measure to measure the resulting clustering performance. Specifically, to measure purity using ground truth class labels, you first assign each cluster (and all of its instances) a class label based on the majority class it contains. Purity simply measures the accuracy of the resulting assignment. You need to submit source code of your implementations (PCA, LDA Kmean, and the experimental evaluation of them). Be sure that your code is properly commented to enhance its readability.
Specifically you need to conduct the following experiments with your implementation.
- (15 pts) Fix k = 2, apply kmeans to the original data (477 dimensions). Measure and report the class purity of the resulting clustering solution. Specifically, you will need to measure purity using ground truth class labels. First assign each cluster (and all of its instances) a class label based on the majority class it contains. Purity simply measures the accuracy of the resulting assignment. Because kmeans is sensitive to random initialization, you should randomly restart kmeans 10 times, then pick the solution with the best Sum Squared Error objective, and then measure its class purity.
- (10 pts) Compute the principal components of the data. To do this, you need to first compute the covariance matrix of your data using the equation on the dimension reduction slides. Then compute the eigen vectors of the covariance matrix and sort them according to the eigen values (in decreasing order). Note that you don’t need to implement your own eigen decomposition function. Feel free to use any numerical package for this purpose. For example, in matlab, function eig can be used. Use the results to answer the following question: what is the smallest dimension we can reduce to if we wish to retain at least 80% and 90% of the total variance respectively?
- (10 pts) Use the principal components computed in (2) to reduce the data from 477 to 1, 2 and 3 dimensions respectively. For each choice, apply kmeans with k = 2 to the resulting reduced data and report their purity measures. How do they compare to the results reported in (1)?
1 For our labels, we have denoted 0 as walking down and 1 as walking up. So yi = 0 means yi = down, and yi = 1 means
yi = up.
2 The original data set was downloaded from https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones.
We have processed this data and then filtered it down.
- (15 pts) Compute the project direction that best separates walking upstairs from walking downstairs by applying Linear discriminant analysis. For this part, you will use the class label as LDA is a supervised dimension reduction technique. First you compute the mean for each walking activity separately (say m1 and m2 ). You will then compute the within-class scatter matrix, assuming xi is a column vector of 477 dimensions representing the i-th activity in the data, using the following equation:
S = X
(xi − m1 ) ∗ (xi − m1 )T + X (xi − m2 ) ∗ (xi − m2 )T
The projection vector is then computed as w = S−1 ∗ (m1 − m2 ). Note that similarly you don’t need to implement the inversion function. Use existing numerical package (e.g., inv for matlab) for this purpose is fine. Project the data onto this direction, and then apply kmeans to the resulting 1-d data to find
2 clusters. Measure and report its class purity. How does this compare to the results you obtained in
- (10 pts) Provide a discussion on the suitability of PCA and LDA for this particular dataset.