Assignment 4 Classification Solution

$30.00

Description

Introduction

In this assignment you will implement a Naive Bayes classi er for the purpose of binary classi cation.

You may not use any functions from an ML library in your code. And as always your code should work on any dataset that has the same general form as the provided one.

Grading

Although all assignments will be weighed equally in computing your homework grade, below is the grading rubric we will use for this assignment:

Part 1

(Theory)

55pts

Part 2

(Naive Bayes)

35pts

Report

10pts

Extra Credit

10pts

TOTAL

110 (of 100) pts

1

Datasets

Spambase Dataset (spambase.data) This dataset consists of 4601 instances of data, each with 57 features and a class label designating if the sample is spam or not. The features are real valued and are described in much detail here:

https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names

Data obtained from: https://archive.ics.uci.edu/ml/datasets/Spambase

2

  • Theory

    1. Consider the following set of training examples for an unknown target function: (x1; x2) ! y:

Y

x1

x2

Count

+

T

T

3

+

T

F

4

+

F

T

4

+

F

F

1

T

T

0

T

F

1

F

T

3

F

F

5

    1. What is the sample entropy, H(Y ) from this training data (using log base 2) (5pts)?

    1. What are the information gains for branching on variables x1 and x2 (5pts)?

    1. Draw the decision tree that would be learned by the ID3 algorithm without pruning from this training data. All leaf nodes should have a single class choice at them. If necessary use the mean class or, in the case of a tie, choose one at random.(10pts)?

  1. We decided that maybe we can use the number of characters and the average word length an essay to determine if the student should get an A in a class or not. Below are ve samples of this data:

# of Chars

Average Word Length

Give an A

216

5.68

Yes

69

4.78

Yes

302

2.31

No

60

3.16

Yes

393

4.2

No

    1. What are the class priors, P (A = Y es); P (A = N o)? (5pts)

    1. Find the parameters of the Gaussians necessary to do Gaussian Naive Bayes classi cation on this decision to give an A or not. Standardize the features rst over all the data together so that there is no unfair bias towards the features of di erent scales (5pts).

    1. Using your response from the prior question, determine if an essay with 242 characters and an average word length of 4.56 should get an A or not. Show the math to support your decision (10pts).

  1. Another common activation function for use in logistic regression or arti cial neural networks is the hyperbolic tangent function, tanh, which is de ned as:

tanh(z) =

ez

e z

(1)

ez

+ e z

3

(a) Since the hyperbolic tangent function outputs values in the range of 1 <= tanh(z) <= 1 we will have to augment our log likelihood objective function to deal with this range. If

we opt to use this function for logistic regression (as opposed to

1

), what will this

1+e

z

object function be? Show your work. (5pts)

  1. In order to compute the gradient of your previous answer with respect to j, we’ll need to compute the gradient of the hyperbolic tangent function itself. Use the exponential de nition of the hyperbolic tangent function provided at the top of this problem to show that

@@j (tanh(x )) = xj(1 tanh(x )2). (5pts)

  1. Using the fact that @@j (tanh(x )) = xj(1 tanh(x )2), what is the gradient of your log likelihood function in part (a) with respect to j? Show your work. (5pts)

4

  • Naive Bayes Classi er

For your rst programming task, you’ll train and test a Naive Bayes Classi er.

Download the dataset spambase.data from Blackboard. As mentioned in the Datasets area, this dataset contains 4601 rows of data, each with 57 continuous valued features followed by a binary class label (0=not-spam, 1=spam). There is no header information in this le and the data is comma separated. As always, your code should work on any dataset that lacks header information and has several comma-separated continuous-valued features followed by a class id 2 0; 1.

Write a script that:

  1. Reads in the data.

  1. Randomizes the data.

  1. Selects the rst 2/3 (round up) of the data for training and the remaining for testing

  1. Standardizes the data (except for the last column of course) using the training data

  1. Divides the training data into two groups: Spam samples, Non-Spam samples.

  1. Creates Normal models for each feature for each class.

  1. Classify each testing sample using these models and choosing the class label based on which class probability is higher.

  1. Computes the following statistics using the testing data results:

    1. Precision

    1. Recall

    1. F-measure

    1. Accuracy

Implementation Details

  1. Seed the random number generate with zero prior to randomizing the data

  1. Matlab interprets 0log0 as N aN (not a number). You should identify this situation and consider it to be a value of zero.

In your report you will need:

1. The statistics requested for your Naive Bayes classi er run.

5

Precision: Around 68%

Recall: Around 95%

F-Measure: Around 79%

Accuracy: Around 81%

Table 1: Evaluation for Naive Bayes classi er

  • Extra Credit: The Precision-Recall Tradeo

For 10 extra credit points, nd a data set of your choosing on which you can perform binary classi-cation. Now apply your Naive Bayes code to this dataset and vary the threshold required to make an observation be considered your positive class.

Write a script that:

  1. Reads in the data.

  1. Randomizes the data.

  1. Selects the rst 2/3 (round up) of the data for training and the remaining for testing

  1. Standardizes the data using the training data

  1. Divides the training data into two groups: Positive samples, Negative samples.

  1. Creates Normal models for each feature and each class.

  1. Computes the P (P ositivejdata) and P (N egativejdata) for each testing sample using Naive Bayes, normalizing them such that P (P ositivejdata) + P (N egativejdata) = 1.

  2. Vary the threshold from 0.0 to 1.0 in increments of 0.05, each time:

    1. Using the current threshold, label each testing sample as True Positive, True, Negative, False Positive, or False Negative

    1. Compute the Precision and Recall for this threshold level.

  1. Plot Precision vs Recall.

Implementation Details

  1. In computing Precision, Recall, F-Measure and Accuracy make sure the denominators don’t become zero. If they do, check the numerator. If that’s also zero then set the value to one.

6

Submission

For your submission, upload to Blackboard a single zip le (again no spaces or non-underscore special characters in le or directory names) containing:

  1. PDF Writeup

  1. Source Code

  1. If you did the extra credit, the dataset used.

  1. readme.txt le

The readme.txt le should contain information on how to run your code to reproduce results for each part of the assignment.

The PDF document should contain the following:

  1. Part 1:

    1. Answers to theory questions

  1. Part 2:

    1. Requested Classi cation Statistics

  1. Extra Credit:

    1. Citation and link to dataset.

    1. Plot of Precision-vs-Accuracy

7


error: Content is protected !!