Description
Introduction
In this assignment you’ll work on clustering data.
You may not use any functions from machine learning library in your code, however you may use statistical functions. For example, if available you MAY NOT use functions like
kmeans
however you MAY use basic statistical functions like:
std mean cov eig
In addition, since the spirit of this assignment is not (necessarily) about feature reduction, you may use functions like pca.
Grading
Although all assignments will be weighed equally in computing your homework grade, below is the grading rubric we will use for this assignment:

Part 1
(Theory)
0pts
Part 2
(KMeans Clustering)
75pts
Part 3
(Purity)
25pts
TOTAL
100pts
Table 1: Grading Rubric
1
Pima Indians Diabetes Data Set In this dataset of 768 instances of testing Pima Indians for diabetes each row has the following information (1 class label, 8 features).

Class Label (1=negative,+1=positive)

Number of times pregnant

Plasma glucose concentration

Diastolic blood pressure (mm Hg)

Triceps skin fold thinkness (mm)

Insulin (mu U/ml)

Body mass index (kg=m^{2})

Diabetes pedigree function

Age (yrs)
Data obtained from: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
2
None
3
Let’s implement our own version of kmeans to cluster data!
Write a script that:

Reads in the data. For demonstrative purposes, you will be using the dataset mentioned in Datasets section (i.e. the diabetes dataset). However, given another dataset’s observable data matrix, X, your code should still work.

Separates the class label from the observable data to form matrices Y and X, respectively.

Standardizes the features.

Passes the observable data X to a (usercreated) function called myKMeans that takes two parameters:


The data to cluster, X



The number of clusters, k

The myKMeans function :
Your myKMeans function should run kmeans on the supplied data and value of k. Since we’ll visualize the clustering process, a few caveats:
Although theoretically your code should work for any value of 1 <= k <= N, we’ll constrain k to have a maximum value of 7, so that you will only need to de ne up to 7 colors.
Although your code should be written in such a way to be able to cluster in any dimensional space, for visualization purposes, if your data’s dimenionality is greater than 3, rst reduce its dimensionality to 3 using PCA. You may use the code you developed in HW1 to do this, or a PCA function from your linear algebra library.
In order to visualize the cluster process, you’ll generate a video. Choose a frame rate that is natural for observing the changes in the association. Each frame of the video will be a plot showing the current clustering. If X has only 2 features, this will be a 2D plot. If it has 3 features (either natively, or after PCA reduction), then this will be a 3D plot. If possible (that is, if the graphics library you are using allows it), try to have any 3D plots rotated so that it is easier to see what’s going on.
For each frame/plot you should:

Display the current reference vectors as solid circles and the observations as x’s.

The reference vectors and associated observations should have the same colors. For instance, if you use blue to represent cluster 1, then its reference vector should be a solid blue circle, and all observations associated with it should be blue x’s.

At the top of your graph you should state the iteration number.
Figures 12 show the initial and terminal clusterings when k = 2.
4

Seed your random number generator with zero (do this right before running your kmeans).

Randomly select k observations and use them for the initial reference vectors. I suggest you use randomize the indices and use the rst k randomized indices to select the observations.

Use the L2 distance (Euclidean) to measure the distance between observations and reference vectors.

Terminate the training process when the sum of magnitude of change of the cluster centers (from
the previous iteration to the current one) is less than = 2 ^{23}. That is, when ^{k}_{i=1} d(a_{i}(t
P
1); a_{i}(t)) < where k is the number of clusters, a_{i}(t) is the reference vector for cluster i at time t and d(x; y) is the L1 distance (Manhattan) between vectors x and y (as de ned in the Similarity and Distance Functions link on BBlearn).
Figure 1: Initial Clustering
5
Figure 2: Final Clustering
6
Augment your code from Part 1 to include the purity of the clusterings at each iteration on the plot. In order to compute the purity, use the class label information (1=negative for diabetes, +1=positive for diabetes). Add the current purity value to the text displayed on your gures.
7
For your submission, upload to Blackboard a single zip le containing:

Source Code – Including any necessary make les, etc..

readme.txt le – The readme.txt le should contain information on how to run your code to reproduce your results.

Sample videos of at least three di erent runs where you vary k and/or the features used. At least one of these must use more than two features. If you were able to get the purity part working, then these videos will include the purity in text within each frame. The video lenames
should be in the format
K [k] F [f]:[ext]
where:

[k] is the value of k passed to your function.

[f] is an underscoreseparated list of the features passed in to your kmeans function. If you used all the features, just state all.
(d) [ext] is the le extension.
For example, in the example in Part 2, the le name would be
K 2 F all:avi
Do not include spaces or special characters (other than the underscore character) in your le and directory names. Doing so may break our grading scripts.
Since there is no theory component to this assignment, and the visualization is a video, there is no PDF for this assignment.
8