Description
Problem 1 (30 points) Naïve Bayes
In this problem, we explore why the naïve assumption is necessary.
After learning about using Bayes rule to make predictions (without the naïve assumption), you want to apply the method to a complex problem. Keep in mind that you want to use the rule:
Pr(X j Y ) Pr(Y )
Pr(Y j X) =
Pr(X)
and you want to estimate the parameters of Pr(X j Y ) and Pr(Y ). However, before applying the method to your problem, you want to apply to a toy problem rst.

(10 pts) In the toy problem, X You want to estimate Pr(X j write

[X_{1}; X_{2}] (so d = 2), where X_{i} is binary. Y is also binary. Y ) without the Naïve Bayes assumption, that is you cannot
d
Y
Pr(X j Y ) = Pr(X_{i} = x_{i} j Y = y);
i=1
instead, you must estimate
Pr(X j Y ) = Pr(X_{1} = x_{1}; : : : ; X_{d} = x_{d} j Y = y)
for all combinations of the values x1; : : : ; xd, and y. How many parameters do you have to estimate of for your toy problem?
(Here parameters refers to the estimate of Pr(X j Y ) = Pr(X1 = x1; : : : ; Xd = xd j Y = y), and Pr(Y = y) for some combination of xi’s and y. In our case where d = 2, examples of such parameters are Pr(X_{1} = 1; X_{2} = 0 j Y = 0) and Pr(Y = 1).)

(10 pts) After running the decision rule on your toy problem, you decide to apply it to the actual problem. However, in your problem, d = 100. How many parameters do you have to estimate now?

(10 pts) When is it necessary for us to make the naïve assumption? Explain by showing how the assumption will a ect one of the answers from above.
Problem 2 (30 points) Naïve Bayes, part II.

(10 pts) We will attempt to write the multinomial naïve Bayes classi er’s decision rule as a linear rule. Suppose that we have a document that is represented as a sequence d = (w_{1}; : : : ; w_{‘}) of length ‘. This can be classi ed as:

2f
g
‘
_{i}Y
h(d) = arg max Pr(Y = y)
Pr(W = w_{i} j Y = y)
y
+1; 1
=1
Assuming that we have estimates for all the appropriate probabilities and none of them are zero, rewrite h(d) as h(d) = sign(~v ~x + b). We assume that we have a vocabulary of words fa1 ; : : : ; amg that contains m words. Each word wi in the document is one of the aj, and ~x is a vector where each component xj is the count of the number of times the word aj shows up in the document. Provide ~v, and b.

(10 pts) Previous problems considered only those cases where X consists of discrete values. Now, we will also consider the case where X can take continuous values.
But rst, (i) write down the maximum likelihood estimates (MLE) for the parameters of the naïve Bayes classi er, Pr(X_{i} j Y ) and Pr(Y ), where X takes discrete, categorical values.
Now let’s look at naïve Bayes classi er that takes vectors of continuous values as input. In this case, we need a di erent formulation for Pr(XijY ) (we don’t need to worry about Pr(Y ) because Y still takes discrete values). One way is to assume that, for each discrete y, each Xi comes from a Gaussian.
For example, consider the simpli ed case above, where Xi takes a continuous value (a value along the xaxis) and Y can be either blue or green. As shown above, for each discrete value of Y (blue or green), Xi is a random variable from a Gaussian speci c to Xi (not some other Xj) and the value of Y . As a result, we see two Gaussians, each generating blue data points or green data points.
With this, we get a Gaussian Naïve Bayes classi er. Using the assumption, (ii) write down the MLE for the parameters of Pr(Xi j Y ) (Note: since Pr(Xi j Y = y ) is a Gaussian its parameters are its mean _{y;i} and standard deviation _{y;i}). (iii) What is the total number of parameters?
3

(10 pts) Using what we found in part (b), we will reformulate the classi er once more. Remember how a Gaussian naíve Bayes classi er is de ned:


X_{i} is continuous



Pr(X_{i} j Y = y) is a Gaussian with _{y;i}; _{i} (we will assume that each only depends on the feature, and not the label y)



Given the label, features are conditionally independent, just like the discrete version



Y is a Bernoulli random variable

Now, nd the weight vector w~ such that

Pr(Y = +1
j
X) =
Pr(XjY = +1) Pr(Y = +1)
=
1
1 + exp[ (w_{0} +
n
Pr(X)
^{P}_{i=1} w_{i}X_{i})]
Be sure to use your answers for part (b).
Note: The result you got it precisely the formulation for logistic regression. However, Gaussian naïve Bayes and logistic regression are di erent learning algorithms. They output the (asymptotically) same model only under special conditions. We will explore the relation further when we cover logistic regression.
Problem 3 (40 points) Valid Kernels, Kernel Construction

(10 pts) Consider x_{i} 2 R^{3} and (x_{i}) 2 R^{D} with

p
p
p
; (x )^{2}
; (x )^{2}
; (x )^{2}
p
p
p
]^{T} ;
(x
)=[1;
2(x ) ;
2(x )
;
2(x )
;
2(x )
(x ) ;
2(x )
(x )
;
2(x )
(x )
i
i 1
i 2
i 3
i _{1}
i _{2}
i _{3}
i 1
i 2
i 1
i 3
i 2
i 3
compute h (xi); (xj)i. What is D? Brie y argue why considering the inner products directly instead of explicitly computing the feature mappings is more e cient.

(10 pts)For each of the following matrices show whether it is strictly positive de nite, positive semide nite, or neither:

^{A}1^{=} _{1}
1
; A_{2}
=
^{2}1 2
_{0}3
; A_{3}=^{2}
1
1
1
3
:
1
2
1
1
2
1
1
1
0
2
1
1
2
1
4
5
4
5
(c) (10 pts) For positive semide nite (psd) matrices A,B show that A + B and A B ( is the Kronecker product) are positive semide nite.

(10 pts) Show that the RBF Kernel corresponds to an inner product in an in nte dimensional space. (HINT: use Taylor expansion.)
4
Problem 4 (30 points) Kernelize the Perceptron Algorithm
Kernelize the perceptron algorithm for binary classi cation y_{i} = f+1; 1g.

Initialize w~ = 0 REPEAT until convergence:

Pick (~x_{i}; y_{i}) randomly from D.

(2) If y_{i}w~^{T} ~x_{i} 0 then make the update w~ w~ + y_{i}~x_{i}, otherwise do nothing.

(10 pts) Show that w~ can be written as a linear combination of the training inputs.

(10 pts) Derive the kernelized classi er h(~x).

(10 pts) State the kernelized version of the learning algorithm.
5