Implementation Assignment 4 Solution




General Instructions.

  1. This is a bonus assignment. If submitted, it can count for up to 5% of the final grade.
  2. The following languages are acceptable: Java, C/C++, MATLAB, Python, and R.
  3. You can work in a team of up to 3 people. Each team only needs to submit one copy of the source code and report.
  4. Your source code and report will be submitted through the TEACH site: https://secure.engr.oregonstate.edu:8000/teach.php?type=want_auth

Please clearly indicate your team members’ information.

  5. Be sure to answer all the questions in your report. You will be graded on your code as well as the report, so please write your report in a clear and concise manner. Clearly label your figures, legends, and tables.
  6. In your report, the results should always be accompanied by a discussion of those results. Do the results follow your expectations? Any surprises? What kind of explanation can you provide?



Dimension Reduction & Clustering

In this assignment you will implement 1) Principal Component Analysis (PCA); 2) Linear Discriminant Analysis (LDA); and 3) k-means clustering. You will run your implementation on the provided data set, which consists of sensor readings of a smartphone while the user performs various activities (walking, standing, sitting, etc.). For our purposes, we have filtered this data down to only two categories: walking up stairs and walking down stairs.[1] The dimensionality is 477 different sensor readings.[2]

You will apply PCA and LDA to the “walking” data set to reduce the dimensions, then apply k-means clustering to the resulting reduced data, and compare the results. You will use the purity measure to evaluate the resulting clustering performance. Specifically, to measure purity using ground-truth class labels, you first assign each cluster (and all of its instances) a class label based on the majority class it contains. Purity simply measures the accuracy of the resulting assignment. You need to submit the source code of your implementations (PCA, LDA, k-means, and the experimental evaluation of them). Be sure that your code is properly commented to enhance its readability.
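As a minimal sketch of the purity measure described above (in Python/NumPy; the function name is our own, and integer class labels and cluster ids are assumed):

```python
import numpy as np

def purity(cluster_ids, labels):
    """Assign each cluster the majority class it contains, then
    return the accuracy of that assignment (the purity)."""
    correct = 0
    for c in np.unique(cluster_ids):
        members = labels[cluster_ids == c]
        correct += np.bincount(members).max()  # size of the majority class
    return correct / len(labels)

# e.g. two clusters over five points with ground-truth labels 0/1:
# cluster 0 holds labels {0, 1}, cluster 1 holds labels {1, 1, 0}
print(purity(np.array([0, 0, 1, 1, 1]), np.array([0, 1, 1, 1, 0])))  # 0.6
```

A perfectly pure clustering scores 1.0; with k = 2 equal-sized classes, a random assignment hovers near 0.5.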

Specifically, you need to conduct the following experiments with your implementation.


  1. (15 pts) Fix k = 2 and apply k-means to the original data (477 dimensions). Measure and report the class purity of the resulting clustering solution. Specifically, you will need to measure purity using ground-truth class labels: first assign each cluster (and all of its instances) a class label based on the majority class it contains; purity simply measures the accuracy of the resulting assignment. Because k-means is sensitive to random initialization, you should randomly restart k-means 10 times, pick the solution with the best sum-of-squared-errors (SSE) objective, and then measure its class purity.
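One way to implement k-means with random restarts is sketched below (a NumPy sketch, not a required implementation; function names and the empty-cluster handling are our own choices):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=None):
    """Lloyd's algorithm: returns cluster assignments and the SSE objective."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # recompute centers (keep the old center if a cluster goes empty)
        new = np.array([X[assign == c].mean(axis=0) if (assign == c).any()
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    sse = ((X - centers[assign]) ** 2).sum()
    return assign, sse

def best_of_restarts(X, k=2, restarts=10):
    """Restart k-means `restarts` times and keep the lowest-SSE solution."""
    return min((kmeans(X, k, seed=r) for r in range(restarts)),
               key=lambda pair: pair[1])
```

The purity of the returned assignment is then measured against the ground-truth labels as described above.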


  2. (10 pts) Compute the principal components of the data. To do this, first compute the covariance matrix of your data using the equation on the dimension-reduction slides. Then compute the eigenvectors of the covariance matrix and sort them according to the eigenvalues (in decreasing order). Note that you don’t need to implement your own eigendecomposition function; feel free to use any numerical package for this purpose (for example, MATLAB’s eig function). Use the results to answer the following question: what is the smallest dimension we can reduce to if we wish to retain at least 80% and 90% of the total variance, respectively?
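A possible sketch of these two steps in NumPy (here `np.linalg.eigh` plays the role of MATLAB's `eig` for the symmetric covariance matrix; function names are illustrative):

```python
import numpy as np

def sorted_pca(X):
    """Eigendecomposition of the sample covariance matrix,
    with eigenvalues/eigenvectors sorted in decreasing order."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)     # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def smallest_dim(eigvals, frac):
    """Smallest number of components retaining at least `frac` of variance."""
    ratio = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratio, frac)) + 1

# e.g. with eigenvalues 4, 3, 2, 1 the cumulative ratios are .4, .7, .9, 1.0,
# so retaining at least 80% of the variance needs 3 components
print(smallest_dim(np.array([4.0, 3.0, 2.0, 1.0]), 0.8))  # 3
```

Running `smallest_dim` with 0.8 and 0.9 on the sorted eigenvalues of the 477-dimensional covariance matrix answers the question directly.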


  3. (10 pts) Use the principal components computed in (2) to reduce the data from 477 to 1, 2, and 3 dimensions, respectively. For each choice, apply k-means with k = 2 to the resulting reduced data and report the purity measures. How do they compare to the results reported in (1)?
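The reduction step might look like the following (a self-contained NumPy sketch; for brevity it recomputes the eigendecomposition inside, rather than reusing the one from (2)):

```python
import numpy as np

def reduce_dim(X, d):
    """Project mean-centered data onto the top-d principal components."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = vecs[:, np.argsort(vals)[::-1][:d]]  # d leading eigenvectors
    return Xc @ top                            # reduced data, shape (n, d)
```

For the experiment, one would run the 10-restart k-means on `reduce_dim(X, d)` for d in (1, 2, 3) and report each purity.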


[1] For our labels, we have denoted 0 as walking down and 1 as walking up. So yi = 0 means yi = down, and yi = 1 means yi = up.

[2] The original data set was downloaded from
We have processed this data and then filtered it down.



  4. (15 pts) Compute the projection direction that best separates walking upstairs from walking downstairs by applying Linear Discriminant Analysis. For this part you will use the class labels, as LDA is a supervised dimension-reduction technique. First compute the mean of each walking activity separately (say m1 and m2). Then compute the within-class scatter matrix, assuming xi is a column vector of 477 dimensions representing the i-th instance in the data, using the following equation:


S = Σ_{yi=down} (xi − m1)(xi − m1)^T + Σ_{yi=up} (xi − m2)(xi − m2)^T



The projection vector is then computed as w = S^{-1}(m1 − m2). Note that, similarly, you don’t need to implement the matrix inversion yourself; using an existing numerical package (e.g., MATLAB’s inv) is fine. Project the data onto this direction, and then apply k-means to the resulting 1-d data to find 2 clusters. Measure and report its class purity. How does this compare to the results you obtained in the previous parts?
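A minimal NumPy sketch of this computation, assuming rows of X are instances and y holds 0 for "down" and 1 for "up" (we use `np.linalg.solve` rather than explicitly forming S^{-1}; if S happens to be singular, a pseudo-inverse such as `np.linalg.pinv` would be needed instead):

```python
import numpy as np

def lda_direction(X, y):
    """w = S^{-1} (m1 - m2), with S the within-class scatter matrix."""
    m1 = X[y == 0].mean(axis=0)    # mean of the 'down' class
    m2 = X[y == 1].mean(axis=0)    # mean of the 'up' class
    D0 = X[y == 0] - m1            # deviations from each class mean
    D1 = X[y == 1] - m2
    S = D0.T @ D0 + D1.T @ D1      # sum of outer products (xi - m)(xi - m)^T
    return np.linalg.solve(S, m1 - m2)  # solves S w = m1 - m2

def project_1d(X, w):
    """Project each instance onto the LDA direction."""
    return X @ w
```

The 1-d values returned by `project_1d` are then fed to the same k-means routine used in the earlier parts.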



  5. (10 pts) Provide a discussion of the suitability of PCA and LDA for this particular dataset.
