Description
Submission Instructions
It is recommended that you complete this exercises in Python 3 and submit your solutions as a Jupyter notebook.
You may use any other language, as long as you include a README with simple, clear instructions on how to run (and if necessary compile) your code.
Please upload all les (code, README, written answers, etc.) to blackboard in a single zip le named ff irstnameg flastnameg DS5230=DS4220 HW 4:zip.
Exercise 1 : Gaussian Mixture Model (70pts)
In the expectation maximization for Gaussian Mixture Model, we iterate between
Estep and Mstep:
Estep 

:= p(z 
= k x 
; ) 
/ 
p(x 
z 
= k; )p(z 
= k) 
/ 
p(x_{n}jz_{n} = k; )p(z_{n} = k) 

nk 
n 
j _{n} 
_{n}j 
n 
n 
P 
K 
p x 
z 
k; p z 
k) 

k=1 
( 
_{n}j 
_{n} = 
) ( _{n} = 
(1) 

Mstep 

N 

^{X}n 

N_{k} = 
nk 
(2) 

=1 

1 
N 

^{X}n 

k ^{=} _{N}_{k} 
nk^{x}n 
(3) 

=1 

_{k} = 
1 
N 
_{nk}(x_{n} 
_{k})(x_{n} 
_{k})^{T} 
(4) 

X_{n} 

N_{k} 

=1 


(10pts) Explain how the cluster assignment is di erence between cluster assignment (in Kmeans) and Estep (in GMM).
1

(10pts) Explain how the Mstep in GMM is di erent from the centriods update step in Kmeans.


(50pts) Now implement Expectation Maximization and apply it on 3 datasets, dataset1.txt, dataset2.txt, and dataset3.txt. Each dataset contains two columns of features and a third column with labels. To evaluate your clustering results, consider 3 di erent metrics:

Note that you may use functions from scikitlearn to compute metrics, but you need to implement GMM algorithm from scratch and are NOT allowed to directly call function from scikitlearn.
The normalized mutual information (NMI)
NMI(Y ; Z) = p
I(Y ; Z)
H(Y )H(Y )
scikitlearn: scklearn.metrics.normalized_mutual_info_score
The CalinskiHarabaz (CH) index tr(S_{B})=tr(S_{W} ). scikitlearn: sklearn.metrics.calinski_harabaz_score
The Silhouette coe cient (SC) (see attached note). scikitlearn: sklearn.metrics.calinski_harabaz_score
As a simplifying assumption, you may take the covariance matrix for each cluster to be diagonal. Once again, your implementation should perform multiple restarts and accept initial values for the mean and variance as inputs. Initialize the mean for each of your clusters by sampling from a Gaussian distribution centered on a random point in your data. Initialize the variance along each dimension to a random fraction of the total variance in the data.
For each dataset, plot the log likelihood, the CH index, the SC, and the NMI as a function of K = 2; 3; 4; 5. Also include the scatter plots that you made for kmeans, using z_{n} = arg max_{k nk} to determine the cluster assignments.
2
Hint: When implementing your Estep, you need to be careful about numerical under ow and over ow when calculating
X
_{nk} = p(x_{n}; z_{n} = k j )= p(x_{n}; z_{n} = l j ):
l
A common trick is to rst calculate the log probabilities

!
_{nk} = log
p
x
; z
k
j
;
_{!}
=
max !
:
(
n
_{n} =
)
n
k
nk
Before exponentiation you can now rst subtract !_{n}, which is equivalent to dividing both the numerator and denominator by exp !_{n}
X
_{nk} = exp(!_{nk} !_{n})= exp(!_{nl} !_{n}):
l
Explain in your code comments why this helps prevent under ow/over ow.
Exercise 3 : Learning Title Topics using Latent Dirichlet Allocation (40pts)
In this section you will apply Latents Dirichlet Allocation (LDA) to the academic publication titles dataset (publications.txt), and compare the results with what you got when doing Principle Component Analysis (PCA). By comparison, hopefully you could see the similarity between them in the sense that LDA can also be regarded as some form of matrix factortization.
For PCA and LDA, you may use packages from scikitlearn and don’t need to implement them from scratch. are going to use scikitlearn to implement both algorithms, including converting to texts to word count vectors.

(0pt) Firstly you need to clean up the raw dataset. I wrote a script prep.py which does tokenization, stemming, ltering nonalphabetsnumeric , numeric replacement, stop words removing. When it nishes, the texts are written into a le called titles prep.txt, so that you only need to do this preprocessing once and can import that le as the actual dataset.
In this question we only need to run the script prep.txt. You could either import the function in notebook or directly execute the python script in terminal. The purpose is to get you familiar with the basic text preprocessing if you are not. For the following questions please load titles prep.txt as the dataset.

(20pts) Apply LDA to the same dataset.
3
Step 1. Load the dataset and convert each title to word count vector using the the function CountVectorizer in scikitlearn. Set the parameter min df to be 800, then you vocabulary size will be 1023. (you can also try other methods to truncate the vocabulary and get a size roughly around 1000).
Step 2. Fit the word count vectors to lda and set the number of topics as 10, 20, 50, repsectively. For each experiment, print out the top 10 words for each topic in a descending order. Explain the trend and change as you change the number of principle components.
3. (5pts) Now apply PCA to the datset.
Step 1. Simiarly you need to convert the texts to word count vectors.
Step 2.By setting the number of principle components to 10, 20, 50, also list the top 10 words for each component (topic) in a descending order. Explain the trend and change as you change the number of principle components.
To be more speci c, you normalize a component eigenvector, and sort the absolute value of each element in it. Then you can retrive the top words with largest values using your vovabulary.

(5pts) Based on the results, do you think LDA outperforms pca in topic modeling task? Explain your reason for why or why not.
4