STAT Homework # 1 Solution

Instructions: You may discuss the homework problems in small groups, but you must write up the final solutions and code yourself. Please turn in your code for the problems that involve coding. However, for the problems that involve coding, you must also provide written answers: you will receive no credit if you submit code without written answers.

  1. We will perform k-nearest-neighbors in this problem, in a setting with 2 classes, 25 observations per class, and p = 2 features. We will call one class the “red” class and the other class the “blue” class. The observations in the red class are drawn i.i.d. from a $N_p(\mu_r, I)$ distribution, and the observations in the blue class are drawn i.i.d. from a $N_p(\mu_b, I)$ distribution, where $\mu_r = (0, 0)^T$ is the mean in the red class, and where $\mu_b = (1.5, 1.5)^T$ is the mean in the blue class.

(a) Generate a training set, consisting of 25 observations from the red class and 25 observations from the blue class. (You will want to use the R function rnorm.) Plot the training set. Make sure that the axes are properly labeled, and that the observations are colored according to their class label.

(b) Now generate a test set consisting of 25 observations from the red class and 25 observations from the blue class. On a single plot, display both the training and test set, using one symbol to indicate training observations (e.g. circles) and another symbol to indicate the test observations (e.g. squares). Make sure that the axes are properly labeled, that the symbols for training and test observations are explained in a legend, and that the observations are colored according to their class label.

(c) Using the knn function in the library class, fit a k-nearest neighbors model on the training set, for a range of values of k from 1 to 20. Make a plot that displays the value of 1/k on the x-axis, and classification error (both training error and test error) on the y-axis. Make sure all axes and curves are properly labeled. Explain your results.

(d) For the value of k that resulted in the smallest test error in part (c) above, make a plot displaying the test observations as well as their true and predicted class labels. Make sure that all axes and points are clearly labeled.

(e) In this example, what is the Bayes error rate? Justify your answer.
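The workflow in parts (a) through (e) could be sketched roughly as follows. This is a minimal sketch, not a reference solution: it assumes the class means are (0, 0) for red and (1.5, 1.5) for blue, and the seed and the helper names `make_set` and `err_rate` are hypothetical choices made for illustration. The last line uses the standard result that, for two equally likely Gaussian classes with identity covariance, the Bayes rule assigns each point to the nearer mean.

```r
library(class)  # provides knn()

set.seed(1)  # arbitrary seed, used only for reproducibility

mu_red  <- c(0, 0)      # assumed red-class mean
mu_blue <- c(1.5, 1.5)  # assumed blue-class mean

# Draw n observations per class from N_2(mu, I): two independent
# unit-variance normal coordinates per observation.
make_set <- function(n) {
  red  <- cbind(rnorm(n, mean = mu_red[1]),  rnorm(n, mean = mu_red[2]))
  blue <- cbind(rnorm(n, mean = mu_blue[1]), rnorm(n, mean = mu_blue[2]))
  list(x = rbind(red, blue),
       y = factor(rep(c("red", "blue"), each = n), levels = c("red", "blue")))
}

# Fraction of predictions that disagree with the true labels.
err_rate <- function(pred, truth) mean(as.character(pred) != as.character(truth))

train <- make_set(25)
test  <- make_set(25)

ks <- 1:20
train_err <- sapply(ks, function(k) err_rate(knn(train$x, train$x, train$y, k = k), train$y))
test_err  <- sapply(ks, function(k) err_rate(knn(train$x, test$x,  train$y, k = k), test$y))

# Error curves against 1/k, so flexibility increases from left to right.
plot(1 / ks, test_err, type = "b", lty = 1,
     ylim = c(0, max(0.5, train_err, test_err)),
     xlab = "1/k", ylab = "Classification error")
lines(1 / ks, train_err, type = "b", lty = 2)
legend("topright", legend = c("Test error", "Training error"), lty = c(1, 2))

# Bayes error under the assumed means: Phi(-d / 2), where d is the
# Euclidean distance between the two class means.
bayes_err <- pnorm(-sqrt(sum((mu_red - mu_blue)^2)) / 2)
```

Plotting against 1/k rather than k puts the most flexible fit (k = 1) at the right edge, which makes the curves easier to compare with the usual flexibility diagrams.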

  2. We will once again perform k-nearest-neighbors in a setting with p = 2 features. But this time, we’ll generate the data differently: let $X_1 \sim \mathrm{Unif}[0, 1]$ and $X_2 \sim \mathrm{Unif}[0, 1]$, i.e. the observations for each feature are i.i.d. from a uniform distribution. An observation belongs to class “red” if $(X_1 - 0.5)^2 + (X_2 - 0.5)^2 > 0.15$ and $X_1 > 0.5$; to class “green” if $(X_1 - 0.5)^2 + (X_2 - 0.5)^2 > 0.15$ and $X_1 \le 0.5$; and to class “blue” otherwise.

(a) Generate a training set of n = 200 observations. (You will want to use the R function runif.) Plot the training set. Make sure that the axes are properly labeled, and that the observations are colored according to their class label.

(b) Now generate a test set consisting of another 200 observations. On a single plot, display both the training and test set, using one symbol to indicate training observations (e.g. circles) and another symbol to indicate the test observations (e.g. squares). Make sure that the axes are properly labeled, that the symbols for training and test observations are explained in a legend, and that the observations are colored according to their class label.

(c) Using the knn function in the library class, fit a k-nearest neighbors model on the training set, for a range of values of k from 1 to 50. Make a plot that displays the value of 1/k on the x-axis, and classification error (both training error and test error) on the y-axis. Make sure all axes and curves are properly labeled. Explain your results.

(d) For the value of k that resulted in the smallest test error in part (c) above, make a plot displaying the test observations as well as their true and predicted class labels. Make sure that all axes and points are clearly labeled.

(e) In this example, what is the Bayes error rate? Justify your answer, and explain how it relates to your findings in (c) and (d).
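One way to set up the data for this problem is sketched below. It is only an illustration under stated assumptions: the helper names `label_points` and `make_unif_set` and the seed are hypothetical, while the labeling rule itself is taken directly from the problem statement. Because the label here is a deterministic function of (X1, X2), the same function can be reused to obtain the true labels for both the training and the test set.

```r
library(class)  # provides knn()

set.seed(2)  # arbitrary seed, used only for reproducibility

# Assign class labels from the rule in the problem statement.
label_points <- function(x1, x2) {
  outside <- (x1 - 0.5)^2 + (x2 - 0.5)^2 > 0.15
  factor(ifelse(outside & x1 > 0.5, "red",
         ifelse(outside & x1 <= 0.5, "green", "blue")),
         levels = c("red", "green", "blue"))
}

# Draw n observations with both features i.i.d. Unif[0, 1].
make_unif_set <- function(n) {
  x <- cbind(X1 = runif(n), X2 = runif(n))
  list(x = x, y = label_points(x[, "X1"], x[, "X2"]))
}

train <- make_unif_set(200)
test  <- make_unif_set(200)

# Training points as circles, test points as squares, colored by class.
plot(train$x, col = as.character(train$y), pch = 1, xlab = "X1", ylab = "X2")
points(test$x, col = as.character(test$y), pch = 0)
legend("topright", legend = c("Training", "Test"), pch = c(1, 0))

# Test error for k = 1, ..., 50, and the k with the smallest test error.
ks <- 1:50
test_err <- sapply(ks, function(k) {
  pred <- knn(train$x, test$x, train$y, k = k)
  mean(as.character(pred) != as.character(test$y))
})
best_k <- ks[which.min(test_err)]
```

Note that `knn` breaks voting ties at random, so the error curve can change slightly between runs unless the seed is fixed.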

  3. For each scenario, determine whether it is a regression or a classification problem, determine whether the goal is inference or prediction, and state the values of n (sample size) and p (number of predictors).

(a) I want to predict each student’s final exam score based on his or her homework scores. There are 50 students enrolled in the course, and each student has completed 8 homeworks.

(b) I want to understand the factors that contribute to whether or not a student passes this course. The factors that I consider are (i) whether or not the student has previous programming experience; (ii) whether or not the student has previously studied linear algebra; (iii) whether or not the student has taken a previous stats/probability course; (iv) whether or not the student attends office hours; (v) the student’s overall GPA; (vi) the student’s year (e.g. freshman, sophomore, junior, senior, or grad student). I have data for all 50 students enrolled in the course.

  4. In each setting, would you generally expect a flexible or an inflexible statistical machine learning method to perform better? Justify your answer.

(a) Sample size n is very small, and number of predictors p is very large.

(b) Sample size n is very large, and number of predictors p is very small.

(c) Relationship between predictors and response is highly non-linear.

(d) The variance of the error terms, i.e. $\sigma^2 = \mathrm{Var}(\epsilon)$, is extremely high.

  5. This question has to do with the bias-variance decomposition.

(a) Make a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods to more flexible approaches. The x-axis should represent the amount of flexibility in the model, and the y-axis should represent the values of each curve. There should be five curves. Make sure to label each one.

(b) Explain why each of the five curves has the shape displayed in (a).
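For reference, the decomposition this question refers to is usually written as below. The notation is an assumption introduced here and is not taken from the assignment: $x_0$ is a fixed test point, $y_0 = f(x_0) + \epsilon$ is its response, and $\hat{f}$ is the fitted model, with the expectation taken over training sets and over $\epsilon$.

```latex
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\epsilon)
```

The three terms on the right correspond to the variance, squared bias, and irreducible error curves; the expected test error is their sum.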

  6. This exercise involves the Boston housing data set, which is part of the MASS library in R.

(a) How many rows are in this data set? How many columns? What do the rows and columns represent?

(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

(e) How many of the suburbs in this data set bound the Charles river?

(f) What are the mean and standard deviation of the pupil-teacher ratio among the towns in this data set?

(g) Which suburb of Boston has the highest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

(h) In this data set, how many of the suburbs average more than six rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
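The queries in parts (a) through (h) map onto a handful of base R calls, sketched below as a starting point rather than a complete answer. The column names (`crim`, `chas`, `ptratio`, `medv`, `rm`, `tax`, `lstat`) are those of the `Boston` data frame in MASS; the subset of columns passed to `pairs` and the intermediate variable names are arbitrary choices made for illustration.

```r
library(MASS)  # provides the Boston data frame

# (a) Number of rows and columns; see ?Boston for what each column means.
boston_dim <- dim(Boston)

# (b) Pairwise scatterplots for an illustrative subset of columns.
pairs(Boston[, c("crim", "rm", "lstat", "medv")])

# (c) Correlation of every column with per capita crime rate.
crim_cor <- sort(cor(Boston)[, "crim"], decreasing = TRUE)

# (d) Ranges of crime rate, tax rate, and pupil-teacher ratio.
ranges <- sapply(Boston[, c("crim", "tax", "ptratio")], range)

# (e) Suburbs that bound the Charles river (chas is a 0/1 indicator).
n_river <- sum(Boston$chas == 1)

# (f) Mean and standard deviation of the pupil-teacher ratio.
pt_summary <- c(mean = mean(Boston$ptratio), sd = sd(Boston$ptratio))

# (g) Rows attaining the maximum median home value; this keeps every
#     row that ties for the maximum rather than only the first one.
top_medv <- Boston[Boston$medv == max(Boston$medv), ]

# (h) Suburbs averaging more than six and more than eight rooms per dwelling.
n_over_6 <- sum(Boston$rm > 6)
n_over_8 <- sum(Boston$rm > 8)
over_8   <- Boston[Boston$rm > 8, ]
```

Selecting rows with `medv == max(medv)` instead of `which.max` matters if several suburbs share the top value, since `which.max` returns only the first match.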

