Description
Notes:
There are 6 problems on 2 pages in this homework.
You may not use late days on this homework. Solutions will be distributed in class on December 1.
This homework has 50 points and a bonus problem worth 25 points.
Submit all answers by committing a single pdf file named YOUR_WUSTL_KEY_hw7.pdf to the hw7 folder in your SVN repository.
Problems:

(From Russell & Norvig) Suppose I pick some decision tree over functions from 5 Boolean variables to a Boolean output. Now, I generate all possible inputs x_{i} and run them through the decision tree to produce the corresponding classes y_{i}. Finally, I take this generated dataset of all (x_{i}, y_{i}) pairs and run a greedy decision tree learning algorithm using information gain as the splitting criterion. Am I guaranteed to get back the same tree that generated the data? If not, what kind of guarantee can I give about the tree that is returned?

Consider the following dataset: the class variable is whether or not a car gets good mileage, and the features are Cylinders (either 4 or 8), Displacement (High, Medium, or Low), and Horsepower (High, Medium, or Low).

Cylinders   Displacement   Horsepower   GoodMPG?
4           Low            Low          No
8           High           High         No
8           High           Medium       Yes
8           High           High         No
4           Low            Medium       Yes
4           Medium         Medium       No
Give a decision tree that classifies this dataset perfectly. If you were to split this dataset using information gain, what would the first feature chosen to split on be?
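For reference, entropy and information gain can be computed mechanically. A minimal sketch in Python (the function names and layout are illustrative, not something the assignment prescribes):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy reduction from splitting `labels` on `feature_values`."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Class column from the table above: 2 Yes, 4 No.
labels = ["No", "No", "Yes", "No", "Yes", "No"]
print(round(entropy(labels), 3))  # entropy of the class variable -> 0.918
```

Applying `information_gain` to each feature column of the table reproduces the greedy split choice asked for above.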

Consider a training set of size n, where each example has two continuous features, no two examples have exactly the same value for either feature, and the problem is a two-class problem. Suppose the training set is linearly separable. Can a decision tree correctly separate the data? What if the dataset is not linearly separable? In either case, if a decision tree can correctly separate the data, give the tightest bound that you can on the depth of that tree.

Suppose you apply bagging and boosting to a hypothesis space of linear separators. Will the hypothesis space of the ensemble still be linear for boosting? For bagging?

(From Russell & Norvig) Construct an SVM that computes the XOR function. Use values of +1 and −1 for the inputs and outputs. Map inputs (x_{1}, x_{2}) into a space consisting of x_{1} and x_{1}x_{2}. Draw the four input points in this space and the maximal margin separator. What is the margin? Now draw the separating line back in the original input space.

The key point of the so-called "kernel trick" in SVMs is to learn a classifier that effectively separates the training data in a higher-dimensional space without having to explicitly compute the representation φ(x) of every point x in the original input space. Instead, all the work is done through the kernel function, which computes dot products K(x_{i}, x_{j}) = φ(x_{i}) · φ(x_{j}).
Show how to compute the squared Euclidean distance in the projected space between any two points x_{i}, x_{j} in the original space without explicitly computing the mapping φ, instead using only the kernel function K.
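As an illustration of the kernel trick itself (not a solution to the problem), one can check numerically that a kernel computes a dot product in feature space. For example, for the quadratic kernel K(x, y) = (x · y)^2 on R^2, an explicit feature map is φ(x) = (x_1^2, √2·x_1·x_2, x_2^2):

```python
import math

def K(x, y):
    """Quadratic kernel on R^2: K(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map for K: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = K(x, y)                                     # computed in the input space
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # dot product in feature space
print(lhs, rhs)  # both equal (1*3 + 2*(-1))^2 = 1.0
```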
Bonus Problem (25 extra points):
Bagging reduces the variance of a classifier by averaging over several classifiers trained on subsets of the original training data. The subsets are obtained by uniform subsampling with replacement. That is, if your data is S = {(x_{1}, y_{1}), …, (x_{n}, y_{n})}, at each iteration you create a new data set S′ with n random picks, picking each example pair with probability 1/n each time. As a result you could end up with multiple identical pairs, or with some pairs not present at all.
Let p_{n}(m, k) be the probability that you have drawn m unique examples after k picks, with |S| = n. Clearly p_{n}(m, k) = 0 whenever m > k (because you cannot end up with more unique elements m than you have drawn), and also p_{n}(m, k) = 0 whenever m > n.
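To build intuition for this sampling process, a short simulation sketch (not part of the requested derivation; the function name and trial count are arbitrary choices):

```python
import random

def unique_fraction(n, trials=200, seed=0):
    """Average fraction of unique examples after n picks with replacement
    from a set of n items, estimated over `trials` simulated bootstraps."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # One bootstrap sample: n picks, each uniform over {0, ..., n-1}.
        total += len({rng.randrange(n) for _ in range(n)})
    return total / (trials * n)

print(unique_fraction(1000))  # empirically somewhat below 2/3 for large n
```

Comparing this estimate against the exact recursion derived below is a useful sanity check.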

What are the base-case values of p_{n}(1, 1), p_{n}(m, 1), and p_{n}(1, k)?

Assume you have already made k − 1 picks. What is the probability that the k^{th} pick will not increase your number of unique elements m? What is the probability that it will?

Can you express p_{n}(m, k) in terms of p_{n}(m, k − 1) and p_{n}(m − 1, k − 1)?

Write out the formula for E_{k=n}[m/n], the expected ratio of the number of unique elements m to the total number of elements n after n picks (i.e., k = n) from the set S with |S| = n.

Write a little recursive function (in the programming language of your choice) that evaluates E_{k=n}[m/n]. Plot its value as n increases. What value does it converge to?

If you average over M classifiers, trained on subsets S′_{1}, …, S′_{M} where |S′_{i}| = n, what is the probability that at least one example pair is never picked in any of the training data sets S′_{i}? Plot this function as M increases. (Assume that n is large enough for the convergence observed in the previous part.)