Description

1. BDR and nearest neighbors Consider a classification problem with c classes and uniform class probabilities, i.e. P_{Y}(i) = 1/c, ∀i. Assume that the goal is to classify an iid sequence of observations

D = {x_{1}, . . . , x_{n}} as a whole (i.e. the samples are not classified one at a time).

Compute the BDR for this problem and show that it converges (in probability) to a nearest neighbor rule based on the class-conditional distributions and the distribution of the observations. Show that the distance function is the Kullback-Leibler divergence

D[p(x)||q(x)] = ∫ p(x) log [p(x)/q(x)] dx.

This proves that the BDR for the classification of sequences is really just a nearest neighbor rule.
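As a numerical illustration of this equivalence (a sketch, with hypothetical discrete distributions p1 and p2 standing in for the class-conditional densities), the BDR's sequence log-likelihood ranking and the KL nearest-neighbor ranking of the empirical histogram select the same class:

```python
import numpy as np

def kl(p, q):
    # D[p||q] = sum_k p_k log(p_k / q_k), discrete analogue of the integral
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
# two hypothetical class-conditional distributions over 4 symbols
p1 = np.array([0.7, 0.1, 0.1, 0.1])
p2 = np.array([0.1, 0.1, 0.1, 0.7])
x = rng.choice(4, size=2000, p=p1)  # iid sequence drawn from class 1

# BDR (uniform class probabilities): maximize the sequence log-likelihood
bdr_choice = int(np.argmax([np.log(p)[x].sum() for p in (p1, p2)]))

# nearest neighbor rule: minimize KL from the empirical histogram
p_hat = np.bincount(x, minlength=4) / len(x)
nn_choice = int(np.argmin([kl(p_hat, p) for p in (p1, p2)]))

print(bdr_choice, nn_choice)  # both select class 0
```

The agreement is not accidental: Σ_{j} log p_{i}(x_{j}) = n Σ_{k} p̂_{k} log p_{i}(k), so maximizing the sequence log-likelihood is exactly minimizing D[p̂||p_{i}] over classes i.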

Assuming that all densities are Gaussian with equal covariance Σ, where the class-conditional densities have mean μ_{i} and the observation density has mean μ, write down an expression for the decision rule as a function of the Gaussian parameters. Provide an interpretation of this new decision rule by stating which items are being compared and what the distance function is.
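This is not the requested derivation, but a numerical sanity check (with an arbitrary example covariance S and example means) of the identity the derivation should arrive at: for equal-covariance Gaussians the KL divergence reduces to half the squared Mahalanobis distance between the means.

```python
import numpy as np

rng = np.random.default_rng(0)
S = np.array([[2.0, 0.5], [0.5, 1.0]])       # example shared covariance
mu, mu_i = np.array([0.0, 0.0]), np.array([1.0, -2.0])

# Monte Carlo estimate of D = E_p[log p(x) - log q(x)] under p = G(mu, S)
L = np.linalg.cholesky(S)
x = mu + rng.standard_normal((200000, 2)) @ L.T
Sinv = np.linalg.inv(S)

def logg(x, m):
    d = x - m
    # log-density up to an additive constant (constants cancel in the diff)
    return -0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d)

kl_mc = float(np.mean(logg(x, mu) - logg(x, mu_i)))
kl_closed = float(0.5 * (mu - mu_i) @ Sinv @ (mu - mu_i))
print(kl_mc, kl_closed)  # the two agree up to Monte Carlo error
```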
2. Kernel methods Problem 4.3.3 in DHS

3. Multinomial EM In this problem we consider an example where there is a closed-form solution to ML estimation from incomplete data. The goal is to compare with the EM solution and get some insight on how the steps of the latter can be substantially easier to derive than the former.

Consider our bridge example and let U be the type of vehicle that crosses the bridge. U can take four values (compact, sedan, station wagon, and pickup truck), which we denote by U ∈ {1, 2, 3, 4}. On a given day, an operator collects an iid sample of size n from U and the number of vehicles of each type is counted and stored in a vector D = (x_{1}, x_{2}, x_{3}, x_{4}). The resulting random variable X (the histogram of vehicle classes) has a multinomial distribution
P_{X_{1},X_{2},X_{3},X_{4}}(x_{1}, x_{2}, x_{3}, x_{4}; Ψ) = (n! / (x_{1}! x_{2}! x_{3}! x_{4}!)) (1/2 + (1/4)Ψ)^{x_{1}} ((1/4)(1 − Ψ))^{x_{2}} ((1/4)(1 − Ψ))^{x_{3}} ((1/4)Ψ)^{x_{4}}.
However, it is later realized that the operator included motorcycles in the compact class. It is established that bikes have probability (1/4)Ψ, which leads to a new model

P_{X_{11},X_{12},X_{2},X_{3},X_{4}}(x_{11}, x_{12}, x_{2}, x_{3}, x_{4}; Ψ) = (n! / (x_{11}! x_{12}! x_{2}! x_{3}! x_{4}!)) (1/2)^{x_{11}} ((1/4)Ψ)^{x_{12}} ((1/4)(1 − Ψ))^{x_{2}} ((1/4)(1 − Ψ))^{x_{3}} ((1/4)Ψ)^{x_{4}}.
Determining the parameter Ψ from the available data is a problem of ML estimation with missing data, since we only have measurements for
x_{1} = x_{11} + x_{12}
but not for x_{11} and x_{12} independently.

a) Determine the value of Ψ that maximizes the likelihood of D, i.e.

Ψ^{∗}_{i} = arg max_{Ψ} P_{X_{1},X_{2},X_{3},X_{4}}(D; Ψ),

by using standard ML estimation procedures.
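A sketch of what the closed-form solution looks like numerically (the counts below are hypothetical example data, not part of the assignment): setting the derivative of the incomplete-data log-likelihood to zero leads to a quadratic in Ψ, whose root in (0, 1) is the ML estimate.

```python
import numpy as np

# hypothetical example counts for (x1, x2, x3, x4)
x1, x2, x3, x4 = 125, 18, 20, 34
n = x1 + x2 + x3 + x4

# d/dPsi log L = x1/(2+Psi) - (x2+x3)/(1-Psi) + x4/Psi = 0 rearranges to
#   n*Psi^2 - (x1 - 2*x2 - 2*x3 - x4)*Psi - 2*x4 = 0,
# whose larger root is the one in (0, 1).
a, b, c = n, -(x1 - 2*x2 - 2*x3 - x4), -2*x4
psi_ml = (-b + np.sqrt(b*b - 4*a*c)) / (2*a)

# sanity check against a brute-force grid search of the log-likelihood
grid = np.linspace(1e-6, 1 - 1e-6, 100001)
ll = (x1*np.log(0.5 + grid/4) + (x2 + x3)*np.log((1 - grid)/4)
      + x4*np.log(grid/4))
psi_grid = float(grid[np.argmax(ll)])
print(psi_ml, psi_grid)  # the two estimates agree
```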

b) Assume that we have the complete data, i.e. D_{c} = (x_{11}, x_{12}, x_{2}, x_{3}, x_{4}). Determine the value of Ψ that maximizes its likelihood, i.e.

Ψ^{∗}_{c} = arg max_{Ψ} P_{X_{11},X_{12},X_{2},X_{3},X_{4}}(D_{c}; Ψ),

by using standard ML estimation procedures. Compare the difficulty of obtaining this solution vs. that of obtaining the solution in a). Does this look like a problem where EM might be helpful?

c) Derive the E and M-steps of the EM algorithm for this problem.

d) Using the equations for the EM steps, determine the fixed point of the algorithm (i.e. the solution) by setting

Ψ^{k+1} = Ψ^{k},

where k is the iteration number. Compare to the solution obtained in a).
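To see the fixed point numerically, the EM recursion can be sketched in a few lines (the counts are hypothetical example data; the E-step splits x_{1} in proportion to the probabilities 1/2 and (1/4)Ψ, and the M-step is the complete-data ML estimate):

```python
# EM sketch for the incomplete-data multinomial model above
x1, x2, x3, x4 = 125, 18, 20, 34   # hypothetical example counts
psi = 0.5                          # arbitrary initialization in (0, 1)
for k in range(200):
    # E-step: expected number of bikes hidden inside the x1 count,
    # splitting x1 in proportion to 1/2 (compacts) vs (1/4)*psi (bikes)
    x12 = x1 * psi / (2.0 + psi)
    # M-step: complete-data ML estimate of psi
    psi = (x12 + x4) / (x12 + x2 + x3 + x4)
print(psi)  # converges to the closed-form ML estimate
```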

4. Mixtures of Gaussians The goal of this problem is to give you some “hands-on” experience with the very important case of EM as a tool for the estimation of the parameters of a mixture. Consider a mixture of two Gaussians

P_{X}(x) = Σ_{c=1}^{2} π_{c} G(x, μ_{c}, Σ_{c}),

where the covariance matrices are diagonal, i.e. Σ_{c} = diag(σ^{2}_{c,1}, σ^{2}_{c,2}), and a training sample of five points

D = {(−2.5, −1), (−2, 0.5), (−1, 0), (2.5, −1), (2, 1)}.


Assume that the following hold:

μ_{1} = −μ_{2},
Σ_{1} = Σ_{2} = σ^{2} I,
π_{1} = π_{2} = 1/2.

a) Plot the log-likelihood surface log P_{X}(D) as a function of the mean parameters (the entries of μ_{1}) for σ^{2} ∈ {0.1, 1, 2}. Let the coordinate axes cover the range [−5, 5]. What can you say about the local maxima of the likelihood surface, and how do they change with σ^{2}? How does the convergence to the optimal solution depend on the location of the initial parameter guess?
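The surface itself can be sketched as follows (a minimal sketch, assuming the constraints above; the grid resolution is arbitrary, and the plotting itself is left to your tool of choice):

```python
import numpy as np

# the training sample from the problem statement
D = np.array([(-2.5, -1), (-2, 0.5), (-1, 0), (2.5, -1), (2, 1)])

def loglik(mu1, sigma2):
    # log P_X(D) under the constrained mixture: mu2 = -mu1,
    # Sigma1 = Sigma2 = sigma2*I, pi1 = pi2 = 1/2 (log-sum-exp for stability)
    mu1 = np.asarray(mu1, float)
    def logg(mu):  # log of a 2-d spherical Gaussian at the points of D
        d2 = np.sum((D - mu)**2, axis=1)
        return -d2 / (2 * sigma2) - np.log(2 * np.pi * sigma2)
    a, b = logg(mu1), logg(-mu1)
    m = np.maximum(a, b)
    return float(np.sum(m + np.log(0.5*np.exp(a - m) + 0.5*np.exp(b - m))))

# evaluate on a grid over [-5, 5]^2 for each sigma^2
grid = np.linspace(-5, 5, 51)
for s2 in (0.1, 1.0, 2.0):
    surf = np.array([[loglik((a, b), s2) for b in grid] for a in grid])
    i, j = np.unravel_index(np.argmax(surf), surf.shape)
    print(s2, (grid[i], grid[j]))  # location of the best grid point
```

Note that the constraint μ_{2} = −μ_{1} makes the surface symmetric under μ_{1} → −μ_{1}, so maxima always come in pairs.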

b) Starting from the initial parameter estimate

π_{1}^{(0)} = π_{2}^{(0)} = 1/2,
μ_{1}^{(0)} = −μ_{2}^{(0)} = (−0.1, 0),
compute all the quantities involved in the first 3 iterations of the EM algorithm. For each iteration produce

plot 1: the posterior surface P_{Z|X}(1|x) for the first class as a function of x,

plot 2: the mean of each Gaussian, the contour where the Mahalanobis distance associated with it becomes 1, the points in D, and the means of the solutions obtained in the previous steps.
Let EM run until convergence, storing the mean estimates at each iteration. Produce the two plots above for the final solution. In plot 2, plot the values of the means as they progress from the initial to the final estimate.
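A simplified sketch of the iteration (illustrative only: it updates the means without the μ_{2} = −μ_{1} tie, and holds π and σ^{2} fixed at the assumed values; the constrained M-step of the assignment would instead estimate a single μ_{1} and set μ_{2} = −μ_{1}):

```python
import numpy as np

D = np.array([(-2.5, -1), (-2, 0.5), (-1, 0), (2.5, -1), (2, 1)], float)
mu = np.array([(-0.1, 0.0), (0.1, 0.0)])   # initial means mu1^(0), mu2^(0)
sigma2, pi = 1.0, np.array([0.5, 0.5])     # held fixed in this sketch

for it in range(50):
    # E-step: responsibilities P_{Z|X}(c|x_j) under the current means
    d2 = ((D[:, None, :] - mu[None, :, :])**2).sum(-1)   # (5, 2) sq. dists
    lik = pi * np.exp(-d2 / (2 * sigma2))
    r = lik / lik.sum(1, keepdims=True)
    # M-step (unconstrained): responsibility-weighted means
    mu = (r[:, :, None] * D[:, None, :]).sum(0) / r.sum(0)[:, None]

print(mu)  # component 1 captures the left points, component 2 the right
```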

5. EM and MAP estimates In this problem we use EM for the maximization of the posterior probability

Ψ^{∗} = arg max_{Ψ} P_{Ψ|X}(Ψ|x).

Consider the binomial distribution of problem 3 and a Beta prior

P_{Ψ}(Ψ) = (Γ(ν_{1} + ν_{2}) / (Γ(ν_{1})Γ(ν_{2}))) Ψ^{ν_{1}−1} (1 − Ψ)^{ν_{2}−1}.
Derive the equations of the EM algorithm for MAP estimation of the parameter Ψ.
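As a hint at the shape of the answer (a sketch, assuming the vehicle-count model of problem 3; x̄_{12} below denotes the E-step expectation E[x_{12} | x_{1}; Ψ^{k}], which is unchanged from the ML case), the log-prior simply adds two extra terms to the expected complete-data log-likelihood, so the M-step picks up pseudo-counts:

```latex
% MAP M-step sketch: maximize E[log P(D_c; \Psi)] + \log P_\Psi(\Psi);
% the prior contributes (\nu_1 - 1)\log\Psi + (\nu_2 - 1)\log(1 - \Psi)
\Psi^{k+1} =
  \frac{\bar{x}_{12} + x_4 + \nu_1 - 1}
       {\bar{x}_{12} + x_2 + x_3 + x_4 + \nu_1 + \nu_2 - 2}
```

With ν_{1} = ν_{2} = 1 the prior is uniform and the recursion reduces to the ML M-step, as it should.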

6. (Computer) This week we use the cheetah image to evaluate the performance of a classifier based on mixture models estimated with EM. Once again we use the decomposition into 8 × 8 image blocks, compute the DCT of each block, and zig-zag scan. For this (using the data in TrainingSamplesDCT_new_8.mat) we fit a mixture of Gaussians of diagonal covariance to each class, i.e.

P_{X|Y}(x|i) = Σ_{c=1}^{C} π_{c} G(x, μ_{c}, Σ_{c}),

where all Σ_{c} are diagonal matrices. We then apply the BDR based on these density estimates to the cheetah image and measure the probability of error as a function of the number of dimensions of the space (as before, use {1, 2, 4, 8, 16, 24, 32, . . . , 64} dimensions).
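The parts below all rely on a diagonal-covariance EM fit. A minimal sketch of such a fit (the function name, the initialization from random data points, and the synthetic usage data are illustrative assumptions, not part of the assignment):

```python
import numpy as np

def em_diag_gmm(X, C, iters=100, seed=0, eps=1e-6):
    """EM for a C-component Gaussian mixture with diagonal covariances.
    X is (n, d); returns (weights, means, per-dimension variances)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, C, replace=False)].astype(float)  # random init
    var = np.ones((C, d)) * X.var(0)
    pi = np.full(C, 1.0 / C)                   # weights sum to one
    for _ in range(iters):
        # E-step: log responsibilities, computed in the log domain
        logp = (-0.5 * np.log(2 * np.pi * var).sum(1)
                - 0.5 * (((X[:, None, :] - mu)**2) / var).sum(-1)
                + np.log(pi))
        logp -= logp.max(1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(1, keepdims=True)
        # M-step: weighted counts, means, and diagonal variances
        nc = r.sum(0) + 1e-12
        pi = nc / n
        mu = (r.T @ X) / nc[:, None]
        var = (r.T @ X**2) / nc[:, None] - mu**2 + eps  # variance floor
    return pi, mu, var

# example usage on synthetic 2-d data with two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-4, 1, (300, 2)), rng.normal(4, 1, (300, 2))])
pi, mu, var = em_diag_gmm(X, 2)
print(pi, mu)
```

The small variance floor `eps` is a common practical guard against components collapsing onto single points, which the 64-dimensional DCT data can otherwise provoke.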

a) For each class, learn 5 mixtures of C = 8 components, using a random initialization (recall that the mixture weights must add up to one). Plot the probability of error vs. dimension for each of the 25 classifiers obtained with all possible mixture pairs. Comment on the dependence of the probability of error on the initialization.

b) For each class, learn mixtures with C ∈ {1, 2, 4, 8, 16, 32}. Plot the probability of error vs. dimension for each number of mixture components. What is the effect of the number of mixture components on the probability of error?