Description
Notes:
There are 6 problems on 2 pages in this homework.
You may not use late days on this homework. Solutions will be distributed in class on December 1.
This homework has 50 points and a bonus problem worth 25 points.
Submit all answers by committing a single PDF file named YOUR_WUSTL_KEY_hw7.pdf to the hw7 folder in your SVN repository.
Problems:
-
(From Russell & Norvig) Suppose I pick some decision tree that computes a function from 5 Boolean variables to a Boolean output. Now, I generate all possible inputs $x_i$ and run them through the decision tree to produce the corresponding classes $y_i$. Finally, I take this generated dataset of all $(x_i, y_i)$ pairs and run a greedy decision tree learning algorithm using information gain as the splitting criterion. Am I guaranteed to get back the same tree that generated the data? If not, what kind of guarantee can I give about the tree that is returned?
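For concreteness, here is a minimal sketch (in Python, chosen arbitrarily) of the data-generation step described above; the target tree is stood in for by a hypothetical Boolean function, since any fixed tree over 5 Boolean variables would play the same role.

    from itertools import product

    # Hypothetical stand-in for the generating decision tree: any fixed
    # Boolean function of 5 Boolean inputs serves as the target here.
    def target_tree(x1, x2, x3, x4, x5):
        return x1 if x2 else (x3 and not x5)

    # Enumerate all 2^5 = 32 possible inputs and label each one.
    dataset = [(x, target_tree(*x)) for x in product([False, True], repeat=5)]
    print(len(dataset))  # 32 (x_i, y_i) pairs, one per possible input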
-
Consider the following dataset: the class variable is whether or not a car gets good mileage, and the features are Cylinders (either 4 or 8), Displacement (High, Medium, or Low), and Horsepower (High, Medium, or Low).
Cylinders   Displacement   Horsepower   GoodMPG?
4           Low            Low          No
8           High           High         No
8           High           Medium       Yes
8           High           High         No
4           Low            Medium       Yes
4           Medium         Medium       No
Give a decision tree that classifies this dataset perfectly. If you were to split this dataset using information gain, which feature would be chosen for the first split?
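For reference, here is a minimal sketch of the information-gain criterion itself (in Python, chosen arbitrarily); it is written generically over a list of example dictionaries rather than encoding the table above, so the split computation is left to you.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy (in bits) of a list of class labels."""
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(rows, feature, target):
        """Information gain from splitting `rows` (a list of dicts) on `feature`."""
        base = entropy([r[target] for r in rows])
        remainder = 0.0
        for value in {r[feature] for r in rows}:
            subset = [r[target] for r in rows if r[feature] == value]
            remainder += (len(subset) / len(rows)) * entropy(subset)
        return base - remainder

    # Example call, with rows such as
    # {"Cylinders": 4, "Displacement": "Low", "Horsepower": "Low", "GoodMPG": "No"}:
    # information_gain(rows, "Cylinders", "GoodMPG")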
-
Consider a training set of size n, where each example has two continuous features, no two examples have the exact same value for either of the two features, and the problem is a two-class problem. Suppose the training set is linearly separable. Can a decision tree correctly separate the data? What if the dataset is not linearly separable? In either case, if a decision tree can correctly separate the data, give the tightest bound that you can on the depth of that tree.
-
Suppose you apply bagging and boosting to a hypothesis space of linear separators. Will the hypothesis space of the ensemble still be linear for boosting? For bagging?
-
(From Russell & Norvig) Construct an SVM that computes the XOR function. Use values of $+1$ and $-1$ for the inputs and outputs. Map the inputs $(x_1, x_2)$ into a space consisting of $x_1$ and $x_1 x_2$. Draw the four input points in this space and the maximal margin separator. What is the margin? Now draw the separating line back in the original input space.
-
The key point of the so-called “kernel trick” in SVMs is to learn a classifier that effectively separates the training data in a higher-dimensional space without having to explicitly compute the representation $\phi(x)$ of every point $x$ in the original input space. Instead, all the work is done through the kernel function, which computes dot products $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$.
Show how to compute the squared Euclidean distance in the projected space between any two points $x_i, x_j$ in the original space without explicitly computing the mapping, instead using the kernel function $K$.
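To illustrate the identity the trick relies on, here is a minimal sketch (in Python with NumPy, chosen arbitrarily) using the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ and its explicit feature map as an example; this particular kernel is only an illustration, not part of the problem.

    import numpy as np

    def phi(x):
        """Explicit degree-2 monomial feature map for a 2-D input."""
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

    def poly_kernel(x, z):
        """Degree-2 polynomial kernel: equals phi(x) . phi(z) without forming phi."""
        return float(np.dot(x, z)) ** 2

    x = np.array([1.0, -2.0])
    z = np.array([0.5, 3.0])
    print(np.dot(phi(x), phi(z)))  # dot product via the explicit mapping
    print(poly_kernel(x, z))       # same value via the kernel shortcut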
Bonus Problem (25 extra points):
Bagging reduces the variance of a classifier by averaging over several classifiers trained on subsets of the original training data. The subsets are obtained by uniform subsampling with replacement. I.e., if your data is $S = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)\}$, at each iteration you create a new data set $S'$ with $n$ random picks, picking each example pair with probability $1/n$ each time. As a result you could end up with multiple identical pairs, or some pairs not present at all.
Let $p_n(m, k)$ be the probability that you have drawn $m$ unique examples after $k$ picks with $|S| = n$. So clearly $p_n(m, k) = 0$ whenever $m > k$ (because you cannot end up with more unique elements $m$ than you have drawn), and also $p_n(m, k) = 0$ whenever $m > n$.
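As a concrete picture of the sampling procedure, here is a minimal sketch (in Python, chosen arbitrarily) that draws one bootstrap sample $S'$ from a hypothetical toy set $S$ and counts the unique pairs it contains; the analysis of $p_n(m, k)$ in the parts below is still yours to do.

    import random

    def bootstrap_sample(data):
        """One round of uniform subsampling with replacement, so |S'| = |S| = n."""
        n = len(data)
        return [random.choice(data) for _ in range(n)]

    # Hypothetical toy data set S of n = 10 (x, y) pairs, just for illustration.
    S = [((i, i + 1), i % 2) for i in range(10)]
    S_prime = bootstrap_sample(S)
    m = len(set(S_prime))  # number of unique pairs after k = n picks
    print(m, "unique pairs out of", len(S))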
-
What are the base-case values of $p_n(1, 1)$, $p_n(m, 1)$, and $p_n(1, k)$?
-
Assume you have already picked $k - 1$ elements. What is the probability that the $k$-th pick will not increase your number of unique elements $m$? What is the probability that it will?
-
Can you express $p_n(m, k)$ in terms of $p_n(m, k-1)$ and $p_n(m-1, k-1)$?
-
Write out the formula for $E_{k=n}[m/n]$, the expected ratio of unique elements ($m$) to the total number of elements ($n$) after $n$ picks (i.e. $k = n$) from set $S$ with $|S| = n$.
-
Write a little recursive function (in the programming language of your choice) that evaluates $E_{k=n}[m/n]$. Plot its value as $n$ increases. What value does it converge to?
-
If you average over $M$ classifiers, trained on subsets $S'_1, \ldots, S'_M$ where $|S'_i| = n$, what is the probability that at least one input pair is never picked in any of the training data sets $S'_i$? Plot this function as $M$ increases. (Assume that $n$ is large enough for the convergence observed in (c).)