Description
Introduction
In this assignment you will implement a Naive Bayes classifier for the purpose of binary classification.
You may not use any functions from an ML library in your code. And as always, your code should work on any dataset that has the same general form as the provided one.
Grading
Although all assignments will be weighed equally in computing your homework grade, below is the grading rubric we will use for this assignment:

Part 1 (Theory)        55pts
Part 2 (Naive Bayes)   35pts
Report                 10pts
Extra Credit           10pts
TOTAL                  110 (of 100) pts
Spambase Dataset (spambase.data)

This dataset consists of 4601 instances of data, each with 57 features and a class label designating whether the sample is spam or not. The features are real-valued and are described in much detail here:
https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names
Data obtained from: https://archive.ics.uci.edu/ml/datasets/Spambase


Consider the following set of training examples for an unknown target function (x_1, x_2) -> y:


Y   x_1   x_2   Count
+   T     T     3
+   T     F     4
+   F     T     4
+   F     F     1
-   T     T     0
-   T     F     1
-   F     T     3
-   F     F     5


What is the sample entropy H(Y) of this training data, using log base 2? (5pts)
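Recall that H(Y) = -sum_y P(y) log2 P(y). A minimal sketch of the computation in Python (illustrative only; the toy counts below are not the table above, and this is not a substitute for showing your work):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a label distribution given as raw counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # by convention 0*log(0) is treated as 0
            p = c / total
            h -= p * math.log2(p)
    return h

# An even split gives the maximum entropy of 1 bit:
print(entropy([6, 6]))  # -> 1.0
```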



What are the information gains for branching on variables x_1 and x_2? (5pts)
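Recall that IG(Y; X) = H(Y) - sum_v P(X = v) H(Y | X = v). A generic sketch of that computation on toy counts (Python, illustrative only; the `rows` format is an assumption made for this example):

```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(rows):
    """rows: list of (feature_value, label, count) triples.
    Returns IG(Y; X) = H(Y) - sum_v P(X=v) * H(Y | X=v)."""
    total = sum(c for _, _, c in rows)
    # marginal entropy of the label Y
    label_counts = {}
    for _, y, c in rows:
        label_counts[y] = label_counts.get(y, 0) + c
    h_y = entropy([c / total for c in label_counts.values()])
    # conditional entropy H(Y | X), weighted by P(X = v)
    h_y_given_x = 0.0
    for v in {x for x, _, _ in rows}:
        v_total = sum(c for x, _, c in rows if x == v)
        cond = [c / v_total for x, _, c in rows if x == v]
        h_y_given_x += (v_total / total) * entropy(cond)
    return h_y - h_y_given_x

# A feature that perfectly predicts the label has gain 1 bit:
print(information_gain([('T', '+', 5), ('F', '-', 5)]))  # -> 1.0
```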



Draw the decision tree that would be learned by the ID3 algorithm, without pruning, from this training data. All leaf nodes should have a single class choice at them. If necessary use the majority class or, in the case of a tie, choose one at random. (10pts)


We decided that maybe we can use the number of characters and the average word length of an essay to determine whether the student should get an A in a class or not. Below are five samples of this data:

# of Chars   Average Word Length   Give an A
216          5.68                  Yes
69           4.78                  Yes
302          2.31                  No
60           3.16                  Yes
393          4.2                   No


What are the class priors, P(A = Yes) and P(A = No)? (5pts)



Find the parameters of the Gaussians necessary to do Gaussian Naive Bayes classification on this decision to give an A or not. Standardize the features first over all the data together so that there is no unfair bias towards the features of different scales (5pts).



Using your response from the prior question, determine if an essay with 242 characters and an average word length of 4.56 should get an A or not. Show the math to support your decision (10pts).
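To illustrate the mechanics (a sketch only, not a substitute for showing your math), here is Python code that standardizes the five samples from the table above, fits per-class Gaussians per feature, and scores a new essay. It assumes the sample (n-1) standard deviation; using the population standard deviation is an equally defensible convention, so check what your course expects:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mean_std(values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)  # sample variance
    return mu, math.sqrt(var)

# The five samples from the table: (# of chars, avg word length, give an A)
data = [(216, 5.68, 'Yes'), (69, 4.78, 'Yes'), (302, 2.31, 'No'),
        (60, 3.16, 'Yes'), (393, 4.20, 'No')]

# Standardize each feature over all the data together
cols = list(zip(*[(c, w) for c, w, _ in data]))
feat_stats = [mean_std(col) for col in cols]

def standardize(raw):
    return [(v - m) / s for v, (m, s) in zip(raw, feat_stats)]

std_rows = [(standardize((c, w)), y) for c, w, y in data]

# Class priors and per-class, per-feature Gaussian parameters
by_class = {}
for feats, y in std_rows:
    by_class.setdefault(y, []).append(feats)
priors = {y: len(rows) / len(std_rows) for y, rows in by_class.items()}
params = {y: [mean_std(col) for col in zip(*rows)] for y, rows in by_class.items()}

def classify(raw):
    """Pick the class with the larger prior * product-of-feature-likelihoods."""
    x = standardize(raw)
    scores = {y: priors[y] * math.prod(gaussian_pdf(v, m, s)
                                       for v, (m, s) in zip(x, params[y]))
              for y in by_class}
    return max(scores, key=scores.get)

print(classify((242, 4.56)))
```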


Another common activation function for use in logistic regression or artificial neural networks is the hyperbolic tangent function, tanh, which is defined as:

tanh(z) = (e^z - e^{-z}) / (e^z + e^{-z})    (1)
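As a quick numeric sanity check, the exponential definition in Equation (1) can be compared against a library implementation (a sketch, assuming Python):

```python
import math

def tanh_from_exp(z):
    """Hyperbolic tangent built directly from its exponential definition, Eq. (1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

# Agrees with the standard library implementation:
print(tanh_from_exp(0.5), math.tanh(0.5))
```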
(a) Since the hyperbolic tangent function outputs values in the range -1 <= tanh(z) <= 1, we will have to augment our log likelihood objective function to deal with this range. If we opt to use this function for logistic regression (as opposed to 1/(1 + e^{-z})), what will this objective function be? Show your work. (5pts)

(b) In order to compute the gradient of your previous answer with respect to θ_j, we’ll need to compute the gradient of the hyperbolic tangent function itself. Use the exponential definition of the hyperbolic tangent function provided at the top of this problem to show that ∂/∂θ_j (tanh(xθ)) = x_j (1 - tanh(xθ)^2). (5pts)
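For reference, the derivation this question asks for is a quotient rule followed by the chain rule. Writing z = xθ (so ∂z/∂θ_j = x_j), the key steps can be sketched in LaTeX as:

```latex
\frac{d}{dz}\tanh(z)
  = \frac{d}{dz}\,\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}
  = \frac{(e^{z}+e^{-z})^{2}-(e^{z}-e^{-z})^{2}}{(e^{z}+e^{-z})^{2}}
  = 1 - \tanh^{2}(z),
\qquad
\frac{\partial}{\partial\theta_{j}}\tanh(x\theta)
  = x_{j}\bigl(1-\tanh^{2}(x\theta)\bigr).
```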

(c) Using the fact that ∂/∂θ_j (tanh(xθ)) = x_j (1 - tanh(xθ)^2), what is the gradient of your log likelihood function in part (a) with respect to θ_j? Show your work. (5pts)
For your first programming task, you’ll train and test a Naive Bayes Classifier.
Download the dataset spambase.data from Blackboard. As mentioned in the Datasets area, this dataset contains 4601 rows of data, each with 57 continuous-valued features followed by a binary class label (0 = not spam, 1 = spam). There is no header information in this file and the data is comma separated. As always, your code should work on any dataset that lacks header information and has several comma-separated continuous-valued features followed by a class id ∈ {0, 1}.
Write a script that:

Reads in the data.

Randomizes the data.

Selects the first 2/3 (round up) of the data for training and the remaining for testing

Standardizes the data (except for the last column of course) using the training data

Divides the training data into two groups: Spam samples, Non-Spam samples.

Creates Normal models for each feature for each class.

Classifies each testing sample using these models, choosing the class label based on which class probability is higher.

Computes the following statistics using the testing data results:


Precision



Recall



F-measure



Accuracy
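Once each test sample has a predicted label, the four statistics above follow directly from the confusion counts. A minimal sketch (illustrative Python; your submission may use whatever language the course allows):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the given positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    """Return (precision, recall, f_measure, accuracy)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

# Toy example: 3 actual positives, predictions miss one and add one false alarm
print(metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```

Note that the ratios here assume nonzero denominators; see the Implementation Details for how degenerate cases should be handled.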

Implementation Details

Seed the random number generator with zero prior to randomizing the data

Matlab interprets 0·log(0) as NaN (not a number). You should identify this situation and consider it to be a value of zero.
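If you work in Python instead, the same guard can be written explicitly (a minimal sketch; `safe_p_log_p` is a hypothetical helper name, not part of any required API):

```python
import math

def safe_p_log_p(p):
    """Compute p * log2(p), treating the 0 * log(0) case as 0 instead of NaN."""
    return 0.0 if p == 0 else p * math.log2(p)

print(safe_p_log_p(0.0))  # -> 0.0
print(safe_p_log_p(0.5))  # -> -0.5
```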
In your report you will need:
1. The statistics requested for your Naive Bayes classifier run.
Recall: Around 95%
F-Measure: Around 79%
Accuracy: Around 81%

Table 1: Evaluation for Naive Bayes classifier

Extra Credit: The Precision-Recall Tradeoff
For 10 extra credit points, find a dataset of your choosing on which you can perform binary classification. Now apply your Naive Bayes code to this dataset and vary the threshold required for an observation to be considered your positive class.
Write a script that:

Reads in the data.

Randomizes the data.

Selects the first 2/3 (round up) of the data for training and the remaining for testing

Standardizes the data using the training data

Divides the training data into two groups: Positive samples, Negative samples.

Creates Normal models for each feature and each class.

Computes P(Positive|data) and P(Negative|data) for each testing sample using Naive Bayes, normalizing them such that P(Positive|data) + P(Negative|data) = 1.

Varies the threshold from 0.0 to 1.0 in increments of 0.05, each time:


Using the current threshold, label each testing sample as True Positive, True Negative, False Positive, or False Negative



Compute the Precision and Recall for this threshold level.


Plot Precision vs Recall.
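The threshold sweep above can be sketched as follows (a minimal illustration with hypothetical posterior values; it applies the 0/0-means-1 rule stated in the Implementation Details):

```python
def precision_recall_at(threshold, p_positive, y_true):
    """Label samples positive when P(positive|data) >= threshold, then
    compute (precision, recall). A 0/0 ratio is taken to be 1."""
    preds = [p >= threshold for p in p_positive]
    tp = sum(pr and t for pr, t in zip(preds, y_true))
    fp = sum(pr and not t for pr, t in zip(preds, y_true))
    fn = sum((not pr) and t for pr, t in zip(preds, y_true))

    def ratio(num, den):
        return num / den if den else 1.0

    return ratio(tp, tp + fp), ratio(tp, tp + fn)

# Sweep thresholds 0.0 to 1.0 in 0.05 increments (21 points)
thresholds = [i * 0.05 for i in range(21)]
# Hypothetical normalized posteriors and true labels, just to exercise the sweep:
p_pos = [0.9, 0.8, 0.6, 0.4, 0.2]
y = [1, 1, 0, 1, 0]
curve = [precision_recall_at(t, p_pos, y) for t in thresholds]
```

Each (precision, recall) pair in `curve` becomes one point on the Precision-vs-Recall plot.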
Implementation Details

In computing Precision, Recall, F-Measure and Accuracy, make sure the denominators don’t become zero. If they do, check the numerator. If that’s also zero then set the value to one.
For your submission, upload to Blackboard a single zip file (again, no spaces or non-underscore special characters in file or directory names) containing:

PDF Writeup

Source Code

If you did the extra credit, the dataset used.

readme.txt file
The readme.txt file should contain information on how to run your code to reproduce the results for each part of the assignment.
The PDF document should contain the following:

Part 1:


Answers to theory questions


Part 2:


Requested Classification Statistics


Extra Credit:


Citation and link to dataset.



Plot of Precision vs. Recall
