Description
Instructions
Submit a soft copy of your homework covering all questions to your TA (o.karakahya@bilkent.edu.tr). Add your code at the end of your homework file and send it via email with the title “CS464 HW1” to your TA. Your report should be a .pdf file. Submitting a hard copy or scanned files is NOT allowed. You have to prepare your homework digitally (using Word, Excel, LaTeX, etc.).
You will be doing your homework individually. No groups are allowed for this assignment.
You may code in any programming language you prefer. When submitting the homework, please package your report and code files as a gzipped TAR file or a ZIP file with the name
CS464 HW1 Firstname Lastname. The code you submit should be in a format that is easy to run, with a main script that calls the other functions. You must also provide a README file that tells us how we can execute/call your program.
If you do not follow the submission routes, deadlines, and specifications (code, report, etc.), it will lead to a significant grade deduction.
- The Slot Machine [13 pts]
You and your friend are playing with slot machines in a casino. Having played on two separate machines for a while, you decide to switch machines to measure differences in luck. The wins/losses of you and your friend for each machine are tabulated below.
Machine 1 | Wins | Losses
You       | 40   | 60
Friend    | 25   | 75

Machine 2 | Wins | Losses
You       | 212  | 828
Friend    | 18   | 72
Assuming that the outcome of playing the slot machine is independent of its history and that of the other machine, answer the following questions:
Question 1.1 [2 points] If the outcome of a game is a loss, what’s the probability that it was played using machine 2?
Question 1.2 [5 points] What is the losing probability of you and your friend for each of the machines? Compare your losing probability with your friend’s on the different machines: who is more likely to lose on each machine?
Question 1.3 [3 points] Suppose you did not keep track of the wins/losses for each machine, but only of the total number of wins/losses over the two machines. In this case, estimate the overall winning probability of you and your friend in the casino (assume that there are only two slot machines in the casino). Who is more likely to win?
Question 1.4 [3 points] Assume you and your friend are playing on Machine 1 in turns (i.e. first you, then your friend). What is the probability of the following sequence: Loss, Win, Win, Loss?
- Rolling the dice [10 pts]
Let’s say we have two fair dice: one is blue and the other one is red. You roll the dice and try to predict the outcome. Let’s say the outcome of the blue die is b, and the outcome of the red die is r. Note that the outcomes of the dice are independent of each other.
Let’s say we have an oracle who helps you to predict the probability of any outcome.
Question 2.1 [4 pts] The oracle gives you the following information about the outcomes:
C = ‘b is not equal to 1 or 6, and r is not equal to 1 or 2’
What is the probability that b and r are both 5, i.e. P(b = 5, r = 5 | C)?
Question 2.2 [3 pts] On a new roll, the oracle gives you another piece of information:
D = ‘the multiplication of the outcomes (b*r) is an odd number’
What is the probability that b = 3 and r = 5, i.e. P(b = 3, r = 5 | D)?
Question 2.3 [3 pts] What is the difference between the information C and D above? Explain your reasoning in terms of conditional probability.
- MLE and MAP [12 pts]
The Poisson distribution is a useful discrete distribution which can be used to model the number of occurrences of some event per unit time. If X is Poisson distributed, i.e. X ~ Poisson(λ), its probability mass function takes the following form:

P(X = x | λ) = λ^x e^{-λ} / x!

Assume now we have n identically and independently drawn data points from Poisson(λ): D = {x_1, ..., x_n}.
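As a quick numerical sanity check of this pmf (an illustrative Python sketch, not part of the required derivation; the course allows any language):

```python
import math

def poisson_pmf(x, lam):
    # P(X = x | lambda) = lambda^x * e^(-lambda) / x!
    return lam ** x * math.exp(-lam) / math.factorial(x)

print(poisson_pmf(0, 1.0))  # e^{-1} ≈ 0.3679
print(sum(poisson_pmf(x, 2.5) for x in range(100)))  # pmf sums to ≈ 1.0
```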
[3 pts] Derive an expression for the maximum likelihood estimate (MLE) of λ.
[6 pts] Assume that the prior distribution for λ is Pareto(x | k, 1), where the Pareto distribution is defined as Pareto(x | k, m) = k m^k x^{-(k+1)} for x ≥ m. Derive an expression for the maximum a posteriori (MAP) estimate of λ. Additionally, find an interval for k such that the λ ≥ 1 condition holds for the posterior.
[3 pts] Show that the MLE estimate of λ and the MAP estimate of λ with a uniform prior U(a, b) are the same for any a and b where b > a.
- Building a Newsgroup Classifier with Naive Bayes [65 pts]
Your job is to build a newsgroup classifier that can accurately predict whether an email belongs to the medical newsgroup or the space newsgroup. The questions summarize the model; therefore, please read all questions before you start coding.
Dataset
Your dataset is a preprocessed and modified subset of the Twenty Newsgroups Data Set [1]. It is based on 2000 real email messages from a newsgroup mailing list. Emails have been preprocessed in the following ways:
Stop word removal: Words like “and”, “the”, and “of” are very common in all English sentences and are therefore not very predictive in deciding the newsgroup. These words have been removed from the emails.
Removal of non-words: Numbers and punctuation have been removed. All white space (tabs, newlines, spaces) has been trimmed to a single space character.
The data has already been split into two subsets: a 1600-email subset for training and a 400-email subset for testing (consider this your validation set and imagine there is another test set which is not given to you). The features have been generated for you. You will use the following files:
question-4-train-features.csv
question-4-train-labels.csv
question-4-test-features.csv
question-4-test-labels.csv
The files that end with features.csv contain the features, and the files ending with labels.csv contain the ground truth labels.
In the feature files, each row contains the feature vector for an email. The j-th term in row i is the number of occurrences of the j-th vocabulary word in the i-th email. The size of the vocabulary is 26507. The label files include the ground truth label for the corresponding email; the order of the emails (rows) is the same as in the features file. That is, the i-th row in both files corresponds to the same email document. The labels are indicated by 1 or 0: 1 stands for an email coming from the space newsgroup and 0 stands for an email belonging to the medical newsgroup.
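Loading such a split might look like the following sketch (assumes NumPy and plain comma-separated integer files; the StringIO stand-in below is only an in-memory illustration, not the real dataset):

```python
import numpy as np
from io import StringIO

def load_split(features_path, labels_path):
    # Rows are emails; column j holds the count of the j-th vocabulary word.
    X = np.loadtxt(features_path, delimiter=",", dtype=np.int64)
    y = np.loadtxt(labels_path, dtype=np.int64)
    return X, y

# Tiny in-memory stand-in for the real CSV files (3 emails, 4-word vocabulary).
X, y = load_split(StringIO("1,0,2,0\n0,3,0,1\n2,2,0,0"), StringIO("1\n0\n1"))
print(X.shape, y.shape)  # (3, 4) (3,)
```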
Bag-of-Words Representation and Multinomial Naive Bayes Model
Recall that the bag-of-words document representation makes the assumption that the probability that a word appears in an email is conditionally independent of the word position given the class of the email. If we have a particular email document D_{i} with n_{i} words in it, we can compute the probability that D_{i} comes from class y_{k} as:

P(D_{i} | Y = y_{k}) = P(X_{1} = x_{1}, X_{2} = x_{2}, ..., X_{n_i} = x_{n_i} | Y = y_{k}) = Π_{j=1}^{n_i} P(X_{j} = x_{j} | Y = y_{k})    (4.1)
In Eq. (4.1), X_{j} represents the j-th position in email D_{i} and x_{j} represents the actual word that appears in the j-th position in the email, whereas n_{i} represents the number of positions in the email. As a concrete example, we might have the first email document (D_{1}), which contains 200 words (n_{1} = 200). The document might be a space email (y_{k} = 1) and the 15-th position in the email might have the word “apollo” (x_{j} = “apollo”).
In the above formulation, the feature vector X has a length that depends on the number of words in the email, n_{i}. That means the feature vector for each email will be of a different size. Also, the above formal definition of a feature vector x for an email says that x_{j} = k if the j-th word in this email is the k-th word in the dictionary. This does not exactly match our feature files, where the j-th term in row i is the number of occurrences of the j-th dictionary word in email i. As shown in the lecture slides, we can slightly change the representation, which makes it easier to implement:

P(D_{i} | Y = y_{k}) = Π_{j=1}^{V} P(X_{j} | Y = y_{k})^{t_{w_j,i}}    (4.2)

where V is the size of the vocabulary, X_{j} represents the occurrence of the j-th vocabulary word, and t_{w_j,i} denotes how many times word w_{j} appears in email D_{i}. As a concrete example, we might have a vocabulary of size 1309, V = 1309. The first email (D_{1}) might be from the space newsgroup (y_{k} = 1), the 80-th word in the vocabulary, w_{80}, might be “moon”, and t_{w_80,1} = 2, which says the word “moon” appears 2 times in email D_{1}. Contemplate why these two models (Eq. (4.1) and Eq. (4.2)) are equivalent.
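Eq. (4.2) can be checked on a toy example (an illustrative sketch; the theta values below are made up, not estimated from the dataset):

```python
import numpy as np

def class_log_likelihood(counts, theta):
    # log P(D_i | Y = y_k) = sum_j t_{w_j,i} * log(theta_{j|y_k})  (log of Eq. 4.2)
    return float(np.sum(counts * np.log(theta)))

theta = np.array([0.5, 0.25, 0.25])   # made-up word probabilities for one class
counts = np.array([2, 0, 1])          # this email uses word 0 twice and word 2 once
print(np.exp(class_log_likelihood(counts, theta)))  # 0.5^2 * 0.25 = 0.0625
```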
In the classification problem, we are interested in the probability distribution over the email classes (in this case the space newsgroup and the medical newsgroup) given a particular email D_{i}. We can use Bayes’ rule to write:

P(Y = y_{k} | D_{i}) = [ P(Y = y_{k}) Π_{j=1}^{V} P(X_{j} | Y = y_{k})^{t_{w_j,i}} ] / [ Σ_{k} P(Y = y_{k}) Π_{j=1}^{V} P(X_{j} | Y = y_{k})^{t_{w_j,i}} ]
Note that, for the purposes of classification, we can actually ignore the denominator here and write:

P(Y = y_{k} | D_{i}) ∝ P(Y = y_{k}) Π_{j=1}^{V} P(X_{j} | Y = y_{k})^{t_{w_j,i}}

ŷ_{i} = argmax_{y_k} P(Y = y_{k} | D_{i}) = argmax_{y_k} P(Y = y_{k}) Π_{j=1}^{V} P(X_{j} | Y = y_{k})^{t_{w_j,i}}
Question 4.1 [2 points] Why is it that we can ignore the denominator?
Probabilities are floating point numbers between 0 and 1, so when you are programming it is usually not a good idea to use actual probability values, as this might cause numerical underflow issues. As the logarithm is a strictly monotonic function and all of the inputs are probabilities that must lie in [0, 1], it does not affect which of the classes achieves the maximum. Taking the logarithm gives us:
ŷ_{i} = argmax_{y_k} ( log P(Y = y_{k}) + Σ_{j=1}^{V} t_{w_j,i} log P(X_{j} | Y = y_{k}) )    (4.6)
where ŷ_{i} is the predicted label for the i-th example.
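A prediction rule following Eq. (4.6) might be sketched as below (toy probabilities, purely for illustration; note that np.argmax breaks ties in favor of the lower class index):

```python
import numpy as np

def predict(counts, log_prior, log_theta):
    # Score each class: log P(Y=y_k) + sum_j t_{w_j,i} * log theta_{j|y_k}  (Eq. 4.6)
    scores = log_prior + log_theta @ counts
    return int(np.argmax(scores))

log_prior = np.log([0.5, 0.5])            # made-up balanced prior
log_theta = np.log([[0.6, 0.3, 0.1],      # class-0 word probabilities (toy)
                    [0.1, 0.3, 0.6]])     # class-1 word probabilities (toy)
print(predict(np.array([0, 1, 4]), log_prior, log_theta))  # → 1
```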
Question 4.2 [3 points] If the ratio of the classes in a dataset is close to even, it is called a “balanced” class distribution; if not, it is skewed. What is the percentage of space emails in question-4-train-labels.csv? Is the training set balanced or skewed towards one of the classes?
The parameters to learn and their MLE estimators are as follows:
θ_{j|y=0} = T_{j,y=0} / Σ_{j=1}^{V} T_{j,y=0}

θ_{j|y=1} = T_{j,y=1} / Σ_{j=1}^{V} T_{j,y=1}

π_{y=1} = P(Y = 1) = N_{1} / N
T_{j,y=0} is the number of occurrences of word j in medical emails in the training set, including multiple occurrences of the word in a single email.
T_{j,y=1} is the number of occurrences of word j in space emails in the training set, including multiple occurrences of the word in a single email.
N_{1} is the number of space emails in the training set. N is the total number of emails in the training set.
π_{y=1} estimates the probability that any particular email will be a space email.
θ_{j|y=0} estimates the probability that a particular word in a medical email will be the j-th word of the vocabulary, P(X_{j} | Y = 0).
θ_{j|y=1} estimates the probability that a particular word in a space email will be the j-th word of the vocabulary, P(X_{j} | Y = 1).
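These estimators can be sketched as follows (a minimal NumPy illustration on made-up counts; T is the per-class word-count matrix defined above):

```python
import numpy as np

def mle_estimates(X, y):
    # T[c, j] = total occurrences of vocabulary word j in class-c training emails
    T = np.stack([X[y == c].sum(axis=0) for c in (0, 1)])
    theta = T / T.sum(axis=1, keepdims=True)  # theta_{j|y=c}, MLE (no smoothing)
    pi_1 = float((y == 1).mean())             # estimate of P(Y = 1) = N_1 / N
    return theta, pi_1

X = np.array([[2, 0, 1], [0, 1, 0], [1, 1, 2]])  # 3 toy emails, 3-word vocabulary
y = np.array([0, 1, 1])
theta, pi_1 = mle_estimates(X, y)
print(theta, pi_1)
```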
Question 4.3 [5 points] How many parameters do we need to estimate for this model?
Question 4.4 (Coding) [30 points] Train a Naive Bayes classifier using all of the data in the training set (question-4-train-features.csv and question-4-train-labels.csv). Test your classifier on the test data (question-4-test-features.csv and question-4-test-labels.csv), and report the testing accuracy as well as how many wrong predictions were made.
In estimating the model parameters, use the above MLE estimator. If it arises in your code, define 0 log 0 = 0 (note that a log 0 for a > 0 stays as it is, that is, -inf). In case of ties, you should predict “medical”. In the written part of your report, state what your test set accuracy is. What did your classifier end up predicting? Why is using the MLE estimate a bad idea in this situation?
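The 0 log 0 = 0 convention can be handled explicitly, for example (an illustrative sketch):

```python
import math

def safe_term(t, theta_j):
    # t * log(theta_j) with the convention 0 * log(0) = 0;
    # for t > 0 and theta_j = 0 the term stays -inf, as the handout specifies.
    if t == 0:
        return 0.0
    return t * (math.log(theta_j) if theta_j > 0 else -math.inf)

print(safe_term(0, 0.0))  # → 0.0
print(safe_term(3, 0.0))  # → -inf
```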
Question 4.5 (Coding) [5 points] Extend your classifier so that it can compute a MAP estimate of the
parameters using the add-one smoothing technique. This new prior “hallucinates” that each word appears one additional time in the training set. (In your final submission, your code shall run with these new parameters, not with the parameters in Question 4.4.)
θ_{j|y=0} = (T_{j,y=0} + 1) / (Σ_{j=1}^{V} T_{j,y=0} + V)

θ_{j|y=1} = (T_{j,y=1} + 1) / (Σ_{j=1}^{V} T_{j,y=1} + V)

π_{y=1} = P(Y = 1) = N_{1} / N
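The smoothed estimators might be sketched as below (made-up counts, for illustration only):

```python
import numpy as np

def map_estimates(X, y):
    # Add-one smoothing: theta_{j|y=c} = (T_{j,c} + 1) / (sum_j T_{j,c} + V)
    V = X.shape[1]
    T = np.stack([X[y == c].sum(axis=0) for c in (0, 1)])
    return (T + 1) / (T.sum(axis=1, keepdims=True) + V)

X = np.array([[2, 0, 1], [0, 1, 0], [1, 1, 2]])  # toy counts, 3-word vocabulary
y = np.array([0, 1, 1])
theta = map_estimates(X, y)
print(theta[0])  # the zero count becomes a small positive value (1/6)
```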
Train your classifier using all of the training set, have it classify all of the test set, and report the test-set classification accuracy using the add-one smoothing technique.
Question 4.6 (Coding) [10 points] Rank the features by calculating the mutual information between the class variable and each feature. Write the indices and mutual information scores of the features in descending order.
Question 4.7 (Coding) [10 points] Remove features from the full model one by one, starting from the least informative one. Keep removing until a single feature remains. Plot the test-set accuracy as a function of the number of removed features and report the maximum accuracy that you get.
References
- Twenty Newsgroups dataset. https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
- “On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes” by Andrew Ng and Michael I. Jordan.
- CMU Lecture Notes. http://www.cs.cmu.edu/~epxing/Class/10701-10s/Lecture/lecture5.pdf