Description
Submission Instructions
It is recommended that you complete these exercises in Python 3 and submit your solutions as a Jupyter notebook.
You may use any other language, as long as you include a README with simple, clear instructions on how to run (and if necessary compile) your code.
Please upload all files (code, README, written answers, etc.) to Blackboard in a single zip file named {firstname}_{lastname}_DS5230/DS4220_HW_3.zip.
Exercise 1 : Understanding clustering algorithms (40pts)

(10pts) Consider the density-based clustering algorithm DBSCAN with parameters Eps = 2, MinPts = 3, and the Euclidean distance measure. Given the following
points:
([0, 1], [5, 2], [2, 3], [6, 1], [10, 2], [0, 6], [3, 4], [6, 3], [0, 7], [7, 2], [1, 2])


List the clusters in terms of their points.



What are the density-connected points?



What points (if any) does DBSCAN consider as noise?
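The hand computations above can be sanity-checked numerically. The sketch below uses NumPy and a hypothetical helper `core_points` (not part of the assignment), assuming the neighborhood test is distance ≤ Eps and that a point counts itself toward MinPts, to flag which of the given points are core points:

```python
import numpy as np

# Hypothetical helper for checking DBSCAN hand-work: flags core points,
# i.e. points with at least min_pts neighbors (self included) within eps.
def core_points(X, eps, min_pts):
    X = np.asarray(X, dtype=float)
    # pairwise Euclidean distance matrix, shape (n, n)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (d <= eps).sum(axis=1) >= min_pts

points = [(0, 1), (5, 2), (2, 3), (6, 1), (10, 2), (0, 6),
          (3, 4), (6, 3), (0, 7), (7, 2), (1, 2)]
print(core_points(points, eps=2, min_pts=3))
```

Border and noise points then follow from whether a non-core point falls within Eps of some core point.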


(15pts) Given two clusters
C_{1} = {(5, 6), (8, 7), (7, 3)}    C_{2} = {(6, 5), (4, 5), (9, 2), (3, 5), (8, 4)}
compute the values in (a) – (f). Use the definition for the scattering criteria presented in class. Note that tr in the scattering criterion refers to the trace of the matrix.

(a) The mean vectors m_{1} and m_{2}

(b) The total mean vector m

(c) The scatter matrices S_{1} and S_{2}

(d) The within-cluster scatter matrix S_{W}

(e) The between-cluster scatter matrix S_{B}

(f) The scatter criterion tr(S_{B})/tr(S_{W})
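
The quantities in (a)–(f) can be verified with a short NumPy sketch. `scatter_stats` below is an illustrative helper (not required by the assignment), assuming the definitions S_i = Σ (x − m_i)(x − m_i)^T, S_W = Σ_i S_i, and S_B = Σ_i n_i (m_i − m)(m_i − m)^T from class:

```python
import numpy as np

def scatter_stats(clusters):
    # clusters: list of arrays of points, one array per cluster
    clusters = [np.asarray(c, dtype=float) for c in clusters]
    means = [c.mean(axis=0) for c in clusters]          # m_1, m_2, ...
    m = np.vstack(clusters).mean(axis=0)                # total mean m
    # per-cluster scatter matrices S_i = sum (x - m_i)(x - m_i)^T
    S = [(c - mi).T @ (c - mi) for c, mi in zip(clusters, means)]
    S_W = sum(S)                                        # within-cluster scatter
    # between-cluster scatter, weighted by cluster sizes n_i
    S_B = sum(len(c) * np.outer(mi - m, mi - m)
              for c, mi in zip(clusters, means))
    return means, m, S, S_W, S_B, np.trace(S_B) / np.trace(S_W)

C1 = [(5, 6), (8, 7), (7, 3)]
C2 = [(6, 5), (4, 5), (9, 2), (3, 5), (8, 4)]
means, m, S, S_W, S_B, ratio = scatter_stats([C1, C2])
```

A useful check on any hand computation is the identity S_T = S_W + S_B, where S_T is the total scatter about m.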

(15pts) Present three graphs (drawn by hand is fine) illustrating cases where agglomerative hierarchical clustering would produce different results depending on the distance metric used. In particular, consider the following distances: Minimum, Maximum, and Average.
Exercise 2 : Implementing and Comparing Clustering Algorithms (60pts)
In this section, you will implement two algorithms: K-means and DBSCAN. You will then use these algorithms to compare clustering on three datasets: dataset1.txt, dataset2.txt, and dataset3.txt. Each dataset contains two columns of features and a third column with labels.
To evaluate your clustering results, consider three different metrics:
The normalized mutual information (NMI)

NMI(Y; Z) = I(Y; Z) / sqrt(H(Y) H(Z))

scikit-learn: sklearn.metrics.normalized_mutual_info_score
The Calinski-Harabasz (CH) index tr(S_{B})/tr(S_{W}). scikit-learn: sklearn.metrics.calinski_harabasz_score (spelled calinski_harabaz_score in older scikit-learn versions)
The Silhouette coefficient (SC) (see attached note). scikit-learn: sklearn.metrics.silhouette_score
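As a cross-check on the library results, the NMI definition above can be implemented directly from a contingency table. The `nmi` helper below is an illustrative sketch using natural-log entropies; note that it normalizes by the geometric mean sqrt(H(Y)H(Z)) as in the formula, while recent scikit-learn versions default to a different normalization (see the average_method parameter of normalized_mutual_info_score):

```python
import numpy as np

# Illustrative re-implementation of NMI(Y;Z) = I(Y;Z) / sqrt(H(Y) H(Z)).
def nmi(y, z):
    y, z = np.asarray(y), np.asarray(z)
    n = len(y)
    _, yi = np.unique(y, return_inverse=True)
    _, zi = np.unique(z, return_inverse=True)
    # joint counts of (true label, cluster assignment)
    c = np.zeros((yi.max() + 1, zi.max() + 1))
    np.add.at(c, (yi, zi), 1)
    p = c / n                                  # joint distribution
    py, pz = p.sum(axis=1), p.sum(axis=0)      # marginals
    nzero = p > 0
    # mutual information I(Y;Z)
    mi = (p[nzero] * np.log(p[nzero] / np.outer(py, pz)[nzero])).sum()
    hy = -(py[py > 0] * np.log(py[py > 0])).sum()   # entropy H(Y)
    hz = -(pz[pz > 0] * np.log(pz[pz > 0])).sum()   # entropy H(Z)
    return mi / np.sqrt(hy * hz)
```

NMI is 1 for a clustering that matches the labels up to renaming, and 0 when labels and clusters are independent.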

(25pts) Implement the DBSCAN algorithm. Run the algorithm with the following Eps and MinPts values:

dataset1.txt: Eps ∈ {0.2, 0.3, 0.4}, MinPts ∈ {2, 3, 4}

dataset2.txt: Eps ∈ {0.8, 0.85, 0.9}, MinPts ∈ {6, 7, 8}

dataset3.txt: Eps ∈ {0.2, 0.3, 0.4}, MinPts ∈ {5, 6, 7}

Report the following information:

Make tables for the NMI, CH index, and SC values as a function of Eps and MinPts.
For the best result (based on NMI), make a scatter plot in which different shapes are used to show the true label and different colors represent the learned cluster assignments. Noise points should be given a special color.

(25pts) Implement K-means. Your implementation should perform multiple random initializations and keep the best result. Make sure that your code does not crash when one of your initializations results in an empty cluster. Finally, make sure that your implementation accepts the initial values of the means as an input. We will check your implementation against a set of predetermined initializations.
For each dataset, plot the sum of squared errors (SSE), the CH index, the SC, and the NMI for the best restart (based on SSE) as a function of the number of clusters K = 1, 2, 3, 4, 5. For each K value, again include a scatter plot in which different symbols are used to signify each label and different colors represent the learned cluster assignments.
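One possible skeleton for the requirements above (accepting initial means as input and surviving empty clusters) is sketched below. This is only a sketch, not a full solution: the multiple-restart loop and the metric bookkeeping are left out, and re-seeding an empty cluster with a random data point is one common fix among several:

```python
import numpy as np

def kmeans(X, init_means, n_iter=100, seed=0):
    # Lloyd's algorithm starting from the caller-supplied initial means.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    means = np.asarray(init_means, dtype=float).copy()
    k = len(means)
    for _ in range(n_iter):
        # assign each point to its nearest mean
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # update means; re-seed any empty cluster with a random data point
        for j in range(k):
            members = X[labels == j]
            means[j] = (members.mean(axis=0) if len(members)
                        else X[rng.integers(len(X))])
    sse = ((X - means[labels]) ** 2).sum()
    return labels, means, sse

labels, means, sse = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]],
                            init_means=[[0, 0], [10, 10]], n_iter=5)
```

Running this from each random initialization and keeping the result with the lowest SSE gives the "best restart" the plots ask for.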

(10pts) From the above implementations, you may see that different models perform differently on these three datasets:

Which model performs best on dataset1, dataset2, and dataset3? Can you explain why?