It is recommended that you complete these exercises in Python 3 and submit your solutions as a Jupyter notebook.
You may use any other language, as long as you include a README with simple, clear instructions on how to run (and if necessary compile) your code.
Please upload all files (code, README, written answers, etc.) to Blackboard in a single zip file named {firstname}_{lastname}_DS5230/DS4220_HW_3.zip.
Exercise 1: Understanding clustering algorithms (40pts)
(10pts) Consider the density-based clustering algorithm DBSCAN with parameters Eps = 2, MinPts = 3, and the Euclidean distance measure. Given the following points
([0, 1], [5, 2], [2, 3], [6, 1], [10, 2], [0, 6], [3, 4], [6, 3], [0, 7], [7, 2], [1, 2])
List the clusters in terms of their points.
What are the density-connected points?
What points (if any) does DBSCAN consider as noise?
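If you want to sanity-check your by-hand answer, scikit-learn's reference DBSCAN can be run on the same eleven points (a cross-check only, not a substitute for the worked solution; note that scikit-learn's min_samples counts the query point itself, so confirm this matches the MinPts convention from class):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# The eleven points from the exercise
X = np.array([[0, 1], [5, 2], [2, 3], [6, 1], [10, 2], [0, 6],
              [3, 4], [6, 3], [0, 7], [7, 2], [1, 2]], dtype=float)

# Eps = 2, MinPts = 3; scikit-learn marks noise points with label -1
labels = DBSCAN(eps=2, min_samples=3, metric='euclidean').fit_predict(X)

for point, label in zip(X, labels):
    print(point, 'noise' if label == -1 else f'cluster {label}')
```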
(15pts) Given two clusters
C1 = {(5, 6), (8, 7), (7, 3)}    C2 = {(6, 5), (4, 5), (9, 2), (3, 5), (8, 4)}
compute the values in (a)–(f). Use the definition of the scattering criteria presented in class. Note that tr in the scattering criterion refers to the trace of the matrix.
The mean vectors m1 and m2
The total mean vector m
The scatter matrices S1 and S2
The within-cluster scatter matrix SW
The between-cluster scatter matrix SB
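To double-check the hand computations, the quantities above can be evaluated with NumPy. This sketch assumes the usual definitions Si = sum over x in Ci of (x - mi)(x - mi)^T, SW = S1 + S2, and SB = sum over i of ni (mi - m)(mi - m)^T; verify these match the definitions given in class before relying on the output:

```python
import numpy as np

C1 = np.array([[5, 6], [8, 7], [7, 3]], dtype=float)
C2 = np.array([[6, 5], [4, 5], [9, 2], [3, 5], [8, 4]], dtype=float)

m1, m2 = C1.mean(axis=0), C2.mean(axis=0)   # per-cluster mean vectors
m = np.vstack([C1, C2]).mean(axis=0)        # total mean vector

def scatter(X, mu):
    """Scatter matrix: sum of (x - mu)(x - mu)^T over the rows of X."""
    D = X - mu
    return D.T @ D

S1, S2 = scatter(C1, m1), scatter(C2, m2)
SW = S1 + S2                                 # within-cluster scatter
SB = (len(C1) * np.outer(m1 - m, m1 - m)     # between-cluster scatter
      + len(C2) * np.outer(m2 - m, m2 - m))

print('m1 =', m1, ' m2 =', m2, ' m =', m)
print('tr(SB)/tr(SW) =', np.trace(SB) / np.trace(SW))
```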
(15pts) Present three graphs (drawn by hand is fine) illustrating cases where agglomerative hierarchical clustering would produce different results depending on the distance measure used. In particular, consider the following distances: minimum (single linkage), maximum (complete linkage), and average (average linkage).
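For intuition (the exercise still asks for hand-drawn graphs), a small SciPy experiment shows the three distances producing different merge behavior on the same points; a chain of points plus an outlier, assumed here purely for illustration, is one configuration where they visibly disagree:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# A chain of close points plus a far outlier: single linkage happily
# chains the close points, while complete and average linkage resist
# merging the ends of the chain, so the merge distances differ.
X = np.array([[0.0, 0.0], [1.1, 0.0], [2.2, 0.0], [3.3, 0.0], [9.0, 0.0]])

Z_single = linkage(X, method='single')
Z_complete = linkage(X, method='complete')
Z_average = linkage(X, method='average')

# Column 2 of a linkage matrix holds the merge distances
for name, Z in [('single', Z_single), ('complete', Z_complete),
                ('average', Z_average)]:
    print(name, Z[:, 2])
```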
Exercise 2: Implementing and Comparing Clustering Algorithms (60pts)
In this section, you will implement two algorithms: K-means and DBSCAN. You will then use these algorithms to compare clusterings on three datasets: dataset1.txt, dataset2.txt, and dataset3.txt. Each dataset contains two columns of features and a third column with labels.
To evaluate your clustering results, consider three different metrics:
The normalized mutual information (NMI)
NMI(Y, Z) = I(Y; Z) / sqrt(H(Y) H(Z))
The Calinski-Harabasz (CH) index, tr(SB)/tr(SW). scikit-learn: sklearn.metrics.calinski_harabasz_score (spelled calinski_harabaz_score in older scikit-learn versions)
The Silhouette coefficient (SC) (see attached note). scikit-learn: sklearn.metrics.silhouette_score
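All three metrics are available in scikit-learn. This sketch shows the calls on a small synthetic example (the two-blob data is a hypothetical stand-in for your results); note that normalized_mutual_info_score needs average_method='geometric' to match the sqrt(H(Y)H(Z)) normalization above:

```python
import numpy as np
from sklearn.metrics import (normalized_mutual_info_score,
                             calinski_harabasz_score, silhouette_score)

# Hypothetical stand-in data: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
y_true = np.repeat([0, 1], 50)   # ground-truth labels
y_pred = y_true.copy()           # pretend the clustering recovered them

nmi = normalized_mutual_info_score(y_true, y_pred,
                                   average_method='geometric')
ch = calinski_harabasz_score(X, y_pred)  # calinski_harabaz_score in old versions
sc = silhouette_score(X, y_pred)
print(f'NMI={nmi:.3f}  CH={ch:.1f}  SC={sc:.3f}')
```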
(25pts) Implement the DBSCAN algorithm. Run the algorithm with the following Eps and MinPts values:
dataset1.txt: Eps ∈ {0.2, 0.3, 0.4}, MinPts ∈ {2, 3, 4}
dataset2.txt: Eps ∈ {0.8, 0.85, 0.9}, MinPts ∈ {6, 7, 8}
dataset3.txt: Eps ∈ {0.2, 0.3, 0.4}, MinPts ∈ {5, 6, 7}
Report the following information:
Make tables of the NMI, CH index, and SC values as a function of Eps and MinPts.
For the best result (based on NMI), make a scatter plot in which different shapes are used to show the true labels, and different colors represent the learned cluster assignments. Noise points should be given a special color.
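One way to organize the sweep is a dict keyed by (Eps, MinPts). The sketch below uses scikit-learn's DBSCAN as a stand-in for your own implementation and synthetic blobs as a stand-in for dataset1.txt; the real script would load the file with np.loadtxt instead:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import normalized_mutual_info_score

# Synthetic stand-in for dataset1.txt; for the real data use e.g.
#   data = np.loadtxt('dataset1.txt'); X, y = data[:, :2], data[:, 2]
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(1, 0.1, (100, 2))])
y = np.repeat([0, 1], 100)

results = {}
for eps in (0.2, 0.3, 0.4):
    for min_pts in (2, 3, 4):
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        results[(eps, min_pts)] = normalized_mutual_info_score(
            y, labels, average_method='geometric')

best = max(results, key=results.get)   # best setting by NMI
print('best (Eps, MinPts):', best, ' NMI: %.3f' % results[best])
```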
(25pts) Implement K-means. Your implementation should perform multiple random initializations and keep the best result. Make sure that your code does not crash when one of your initializations results in an empty cluster. Finally, make sure that your implementation accepts the initial value of means as an input. We will check your implementation against a set of predetermined initializations.
For each dataset, plot the sum of squared errors (SSE), the CH index, the SC, and the NMI for the best restart (based on SSE) as a function of the number of clusters K = 1, 2, 3, 4, 5. For each K value, again include a scatter plot in which different symbols are used to signify each label, and different colors represent the learned cluster assignments.
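A possible skeleton satisfying the stated requirements (multiple random restarts, empty-cluster handling, and an optional initial-means argument) is sketched below; the function and parameter names are illustrative, not prescribed:

```python
import numpy as np

def kmeans(X, k, init_means=None, n_restarts=10, max_iter=100, seed=0):
    """Basic K-means with restarts. If init_means is given it is used
    for a single run, so predetermined initializations can be tested."""
    rng = np.random.default_rng(seed)
    if init_means is not None:
        inits = [np.asarray(init_means, dtype=float)]
    else:
        inits = [X[rng.choice(len(X), size=k, replace=False)]
                 for _ in range(n_restarts)]
    best = None
    for means in inits:
        means = means.copy()
        for _ in range(max_iter):
            # Assign each point to its nearest mean
            dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            new_means = means.copy()
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:
                    new_means[j] = members.mean(axis=0)
                else:
                    # Empty cluster: re-seed from a random data point
                    new_means[j] = X[rng.integers(len(X))]
            if np.allclose(new_means, means):
                break
            means = new_means
        # Final assignments and SSE for this restart
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        sse = dists[np.arange(len(X)), labels].sum()
        if best is None or sse < best[0]:
            best = (sse, labels, means)
    return best  # (sse, labels, means)
```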
(10pts) From the implementations above, you may see that different models perform differently on these three datasets:
Which model performs best on dataset1, dataset2, and dataset3? Can you explain why?