Description
Exercise 1: Bayesian Linear Regression
In this assignment, you will derive some identities to express ridge regression as a maximum a posteriori (MAP) estimate relative to an appropriate prior. You will also derive confidence intervals on the weights in Bayesian linear regression, as well as confidence intervals on predictions. Please complete this part of the exercise as a PDF file (preferably typed up in LaTeX).
a. As discussed in class, ridge regression (i.e. L2-regularized linear regression) minimizes the loss function:

L(w) = ||y − Xw||^2 + λ||w||^2
     = Σ_{n=1}^{N} (y_n − x_n^T w)^2 + λ w^T w.
The closed-form solution for the weights w* that minimize this loss is

w* = (X^T X + λI)^{-1} X^T y.

To demonstrate that this solution is indeed an extremum of the loss function, show that it satisfies

∇_w L(w) |_{w=w*} = 2 X^T (X w* − y) + 2 λ w* = 0.
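As a numerical sanity check (not a substitute for the derivation you are asked to write), the closed-form solution can be verified to zero the gradient on synthetic data; the data and the value of λ below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 5
lam = 0.1                       # regularization strength lambda (arbitrary)
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# closed-form ridge solution w* = (X^T X + lam I)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# gradient of the loss at w*: should vanish up to floating-point error
grad = 2 * X.T @ (X @ w_star - y) + 2 * lam * w_star
print(np.max(np.abs(grad)))     # ~ 0
```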

b. Show that ridge regression (i.e. L2-regularized linear regression) is equivalent to a MAP estimation problem in which a prior is placed on the regression weights:

w* = argmax_w p(w | y),
p(y | w) = N(y; Xw, σ^2 I),
p(w) = N(w; 0, s^2 I).

Express the regularization parameter λ in terms of the constants σ and s (note: this problem should be easy if you have reviewed the slides).
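One way to check your algebra is to minimize the negative log posterior numerically and compare against the ridge solution with your derived λ. The sketch below does this with plain gradient descent on synthetic data; the values of σ and s are arbitrary, and `lam` must be replaced by the expression you derive:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 40, 3
sigma, s = 0.5, 2.0             # noise std and prior std (arbitrary)
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + sigma * rng.normal(size=N)

def neg_log_posterior_grad(w):
    # gradient of ||y - Xw||^2 / (2 sigma^2) + ||w||^2 / (2 s^2)
    return X.T @ (X @ w - y) / sigma**2 + w / s**2

# plain gradient descent on the negative log posterior -> MAP estimate
w = np.zeros(D)
for _ in range(5000):
    w -= 1e-3 * neg_log_posterior_grad(w)

# ridge solution with your derived lambda; sigma**2 / s**2 is used here
# as a placeholder for the answer to part (b)
lam = sigma**2 / s**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(np.allclose(w, w_ridge, atol=1e-6))   # True if lambda is correct
```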
c. We will now consider the more general case where we place a multivariate Gaussian prior with parameters m_0 and S_0 on the weights:

w ∼ N(m_0, S_0).
Derive the marginal likelihood of y, which is defined as

p(y | X, m_0, S_0) = ∫ dw p(y | w, X) p(w | m_0, S_0),

by calculating the mean μ and covariance Σ such that

y | X, m_0, S_0 ∼ N(μ, Σ).
To do so, make use of the fact that for f = Xw, we can write

E[f] = E[Xw] = X E[w],
Cov[f] = Cov[Xw] = X Cov[w] X^T.

You will additionally need to make use of the fact that the covariance of the sum of two uncorrelated variables (which we will here conveniently denote f and ε) is

Cov[f + ε] = Cov[f] + Cov[ε].
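The two covariance facts above can be illustrated with a quick Monte Carlo check; everything below (shapes, prior parameters, noise scale) is arbitrary, and the agreement is only up to sampling error of order 1/√S:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, S = 4, 3, 200_000
X = rng.normal(size=(N, D))

m0 = rng.normal(size=D)
A = rng.normal(size=(D, D))
S0 = A @ A.T + np.eye(D)                    # a valid (PSD) prior covariance

w = rng.multivariate_normal(m0, S0, size=S)  # samples from p(w)
f = w @ X.T                                  # f = Xw for each sample
eps = rng.normal(scale=0.5, size=(S, N))     # independent noise, var 0.25

# empirical covariances vs. the identities Cov[Xw] = X S0 X^T and
# Cov[f + eps] = Cov[f] + Cov[eps]
err_f = np.max(np.abs(np.cov(f, rowvar=False) - X @ S0 @ X.T))
err_y = np.max(np.abs(np.cov(f + eps, rowvar=False)
                      - (X @ S0 @ X.T + 0.25 * np.eye(N))))
print(err_f, err_y)                          # both small (Monte Carlo error)
```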
d. Using the result from the previous exercise, derive the predictive distribution at a new test point x*, which is defined as

p(y* | y, X, x*) = ∫ dw p(y* | w, x*) p(w | y, X).

Calculate the values f* and σ*^2 such that

y* | y, X, x* ∼ N(f*, σ*^2).
To do so, use the standard identity on Gaussians (which we have already seen in class) that states that

α | β ∼ N( a + C B^{-1}(β − b),  A − C B^{-1} C^T )

when

[ α ]        ( [ a ]   [ A     C ] )
[ β ]  ∼  N ( [ b ] , [ C^T   B ] ).

To check your work, substitute m_0 = 0 and S_0 = s^2 I into the result of the previous exercise. Express the solution for f* = E[f(x*)] as a function of x*. What is this result equal to?
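A numerical check for this special case is sketched below, using the standard Gaussian posterior for Bayesian linear regression (see e.g. Bishop, PRML, Sec. 3.3) rather than your own derivation; all data and the values of σ and s are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 30, 4
sigma, s = 0.3, 1.5
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
x_star = rng.normal(size=D)

# posterior p(w | y, X) = N(m_N, S_N) for the prior N(0, s^2 I)
S_N = np.linalg.inv(np.eye(D) / s**2 + X.T @ X / sigma**2)
m_N = S_N @ (X.T @ y) / sigma**2

f_star = x_star @ m_N                        # predictive mean E[f(x*)]
var_star = x_star @ S_N @ x_star + sigma**2  # predictive variance

# the predictive mean should reduce to the ridge prediction with the
# lambda from part (b)
w_ridge = np.linalg.solve(X.T @ X + (sigma**2 / s**2) * np.eye(D), X.T @ y)
print(np.isclose(f_star, x_star @ w_ridge))  # True
```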
Can PCA uncover academic communities?
In this problem, you will apply principal component analysis (PCA) to a real-world bibliographic dataset from Aminer (https://aminer.org/). The dataset (publications.txt) is attached to this assignment. We will see if we can use latent semantic analysis (PCA on text) to discover structure in paper titles.
You may implement the following exercise using any language/system you choose. We recommend that you use the Jupyter docker instance with Spark (http://spark.apache.org/). The pyspark.ml.feature namespace implements many of the prerequisite operations, which are documented here:
http://spark.apache.org/docs/latest/ml-features.html
In the first part of the exercise, we will consider the dataset as a whole.

a. Transform titles to word count vectors. Truncate your (sparse) vectors to the 1000 most frequent words and perform PCA with 50 components on the counts.

b. Plot the eigenvalues of the principal components. Calculate how many components are needed to explain 50% of the total variance.

c. Identify which words are important in each of the principal components. To do so, take the sum of squares of each of the component vectors to check how they are normalized. For each component, then print out the words for which the absolute value of the component is larger than 0.20 of the norm.

d. Make a scatter plot of some reasonably sized sample (1k-10k titles). Explain the structure (or lack thereof) you see based on your results from items b-c.

e. Run a preprocessing step to remove stop words (a list of stop words is provided, identical to the list used in Spark). Rerun steps b-d and evaluate whether this has improved your representation.

f. One of the issues with text is that the variance in how often a word appears is often correlated with the base frequency with which the word appears. To account for this, the word counts are often reweighted using the term frequency-inverse document frequency (TF-IDF) scheme:

tfidf(t, d, D) := tf(t, d) · idf(t, D)

Here tf(t, d) is a measure of how often a term t (i.e. a vocabulary word) occurs in document d (i.e. a title). Here we will simply use the word counts calculated in part a. of this exercise:

tf(t, d) = |{t' ∈ d : t' = t}|

The inverse document frequency idf(t, D) is used to discount terms t that occur commonly across all documents D. Here we will use:

idf(t, D) := log( (|D| + 1) / (|{d ∈ D : t ∈ d}| + 1) )

Note that idf(t, D) = 0 for words that occur in all documents.

Calculate TF-IDF features for all titles and rerun the operations in parts b-d of this exercise. How have your results changed?
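The idf formula above can be implemented directly on a document-term count matrix; note that it differs from common library defaults (e.g. sklearn's TfidfTransformer adds 1 to the idf), so a direct implementation like this toy sketch avoids surprises:

```python
import numpy as np

counts = np.array([          # rows = titles (documents), cols = vocabulary
    [2, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
], dtype=float)

n_docs = counts.shape[0]                  # |D|
df = (counts > 0).sum(axis=0)             # |{d in D : t in d}|
idf = np.log((n_docs + 1) / (df + 1))     # the formula defined above
tfidf = counts * idf                      # tf(t, d) * idf(t, D)

# the second word occurs in all three documents, so its idf is 0
print(idf)
```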
In the second part of this exercise, we will look at subsets of the data. To construct these subsets, first construct two lists of titles:

- Titles for machine learning papers published at 'NIPS'
- Titles for database papers published at 'VLDB'

Merge the two sets of titles. Construct both word count vectors and TF-IDF features. Repeat steps b-d and compare the word count results to the TF-IDF results.

Now make a scatter plot of the first two principal components, showing the titles from each subset in different colors. Again compare word counts and TF-IDF. Did PCA succeed in uncovering the differences between the communities?
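The two-community comparison can be sketched as below, again with scikit-learn and matplotlib; the tiny `nips` and `vldb` lists are placeholders for the real title subsets, and the output filename is arbitrary:

```python
import matplotlib
matplotlib.use("Agg")        # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# placeholder titles; on the real data, filter publications.txt by venue
nips = ["learning deep networks", "bayesian inference with neural networks"]
vldb = ["efficient query processing", "indexing for database queries"]
titles = nips + vldb
labels = np.array([0] * len(nips) + [1] * len(vldb))

# word count features on the merged titles, projected onto 2 components
counts = CountVectorizer().fit_transform(titles).toarray()
proj = PCA(n_components=2).fit_transform(counts)

# one color per venue
fig, ax = plt.subplots()
for lab, color, name in [(0, "tab:blue", "NIPS"), (1, "tab:orange", "VLDB")]:
    pts = proj[labels == lab]
    ax.scatter(pts[:, 0], pts[:, 1], c=color, label=name)
ax.legend()
fig.savefig("communities_pca.png")
```

Swapping the `CountVectorizer` output for TF-IDF features lets you make the same plot for the reweighted representation.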