## Description

Comparing Methods for Speed Dating Classification

In this programming assignment, you will implement Logistic Regression and Linear SVM for the classification task that you explored in Assignment 2, and then compare the performance of the different classifiers.

You should implement your solution in Python. You can use supporting libraries such as numpy and scipy as before, but DO NOT use any publicly available code, including but not limited to libraries such as sklearn. As before, you should submit your typed assignment report as a PDF along with your source code file.

In the following sections, we specify a number of steps you are asked to complete for this assignment. Note that all results in sample outputs are fictitious and for representation only.

- Preprocessing (4 pts)

Consider the data file dating-full.csv that you used in Assignment 2. For this assignment, we will only consider the first 6500 speed dating events in this file. That is, you can discard the last 244 lines of the file. Write a Python script named preprocess-assg3.py which reads the first 6500 speed dating events in dating-full.csv as input and performs the following operations.

- Repeat the preprocessing steps 1(i), 1(ii) and 1(iv) that you did in Assignment 2. (You can reuse the code there and you are not required to print any outputs.)

- For the categorical attributes gender, race, race_o and field, apply one-hot encoding. Sort the values of each categorical attribute lexicographically before you start the encoding process, and set the last value of that attribute as the reference (i.e., the last value of that attribute will be mapped to a vector of all zeros). Print as outputs the mapped vectors for ‘female’ in the gender column, for ‘Black/African American’ in the race column, for ‘Other’ in the race_o column, and for ‘economics’ in the field column.

Expected output lines:

Mapped vector for female in column gender: [vector-for-female].

Mapped vector for Black/African American in column race: [vector-for-Black/African American].

Mapped vector for Other in column race_o: [vector-for-other].

Mapped vector for economics in column field: [vector-for-economics].

- Use the sample function from pandas with the parameters initialized as random_state = 25, frac = 0.2 to take a random 20% sample from the entire dataset. This sample will serve as your test dataset, which you should output in testSet.csv; the rest will be your training dataset, which you should output in trainingSet.csv. (Note: The use of random_state ensures that all students have the same training and test datasets; incorrect or missing initialization of this parameter will lead to non-reproducible results.)


In summary, below are the sample inputs and outputs we expect to see. We expect 4 lines of output (the outputs below are fictitious) as well as two new .csv files (trainingSet.csv and testSet.csv):

$python preprocess-assg3.py

Mapped vector for female in column gender: [1]

Mapped vector for Black/African American in column race: [0 0 1 0 0]

Mapped vector for Other in column race_o: [0 0 0 1 0]

Mapped vector for economics in column field: [0 0 0 0 0 0 0 0]
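The encoding and split steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the expected solution: the helper name `one_hot_with_reference`, the toy DataFrame, and the `label` column are hypothetical, and the real script would run on the columns of dating-full.csv instead.

```python
import numpy as np
import pandas as pd


def one_hot_with_reference(df, column):
    """One-hot encode `column`; the lexicographically last value is the
    reference and maps to an all-zeros vector. (Hypothetical helper.)"""
    values = sorted(df[column].unique())  # lexicographic order
    kept = values[:-1]                    # every value except the reference
    mapping = {}
    for value in values:
        vec = np.zeros(len(kept), dtype=int)
        if value in kept:
            vec[kept.index(value)] = 1
        mapping[value] = vec
    encoded = pd.DataFrame(
        [mapping[v] for v in df[column]],
        columns=[f"{column}_{v}" for v in kept],
        index=df.index,
    )
    return df.drop(columns=[column]).join(encoded), mapping


# Toy data standing in for the real file; "label" is an assumed column name.
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female"],
    "label": [1, 0, 1, 0, 1],
})
encoded, mapping = one_hot_with_reference(toy, "gender")
print(f"Mapped vector for female in column gender: {mapping['female']}")

# 20% test sample with the fixed seed; the remaining rows form the training set.
test_set = encoded.sample(frac=0.2, random_state=25)
training_set = encoded.drop(test_set.index)
```

Dropping the sampled rows by index keeps the two sets disjoint, which assumes the DataFrame index is unique (true for a freshly read CSV).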

- Implement Logistic Regression and Linear SVM (16 pts)

Please put your code for this question in a file called lr_svm.py. This script should take three arguments as input:

- trainingDataFilename: the set of data that will be used to train your algorithms (e.g., trainingSet.csv).

- testDataFilename: the set of data that will be used to test your algorithms (e.g., testSet.csv).

- modelIdx: an integer to specify the model to use for classification (LR = 1 and SVM = 2).

- Write a function named lr(trainingSet, testSet) which takes the training dataset and the testing dataset as input parameters. The purpose of this function is to train a logistic regression classifier using the data in the training dataset, and then test the classifier’s performance on the testing dataset.

Use the following setup for training the logistic regression classifier: (1) Use L2 regularization with $\lambda = 0.01$. Optimize with gradient descent, using an initial weight vector of all zeros and a step size of 0.01. (2) Stop optimization after a maximum number of iterations max = 500, or when the L2 norm of the difference between new and old weights is smaller than the threshold tol = 1e-6, whichever is reached first. Print the classifier’s accuracy on both the training dataset and the testing dataset (rounded to two decimals).

- Write a function named svm(trainingSet, testSet) which takes the training dataset and the testing dataset as input parameters. The purpose of this function is to train a linear SVM classifier using the data in the training dataset, and then test the classifier’s performance on the testing dataset.

Use the following setup for training the SVM: (1) Use hinge loss. Optimize with subgradient descent, using an initial weight vector of all zeros, a step size of 0.5 and a regularization parameter of $\lambda = 0.01$. (2) Stop optimization after a maximum number of iterations max = 500, or when the L2 norm of the difference between new and old weights is smaller than the threshold tol = 1e-6, whichever is reached first. Print the classifier’s accuracy on both the training dataset and the testing dataset (rounded to two decimals).

The sample inputs and outputs we expect to see are as follows (the numbers are fictitious):

$python lr_svm.py trainingSet.csv testSet.csv 1

Training Accuracy LR: 0.71

Testing Accuracy LR: 0.68

$python lr_svm.py trainingSet.csv testSet.csv 2

Training Accuracy SVM: 0.75

Testing Accuracy SVM: 0.74
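The two training setups share the same descent loop and stopping rule, which can be sketched as below. This is an illustration under assumptions the assignment text does not fix: the loss is averaged over the training examples, the L2 penalty is taken as (λ/2)‖w‖², the bias is a constant feature column that is regularized with the rest, and labels are {0, 1} for LR and {-1, +1} for the SVM. All function names and the toy data are hypothetical.

```python
import numpy as np


def descend(grad_fn, n_features, step, max_iter=500, tol=1e-6):
    """(Sub)gradient descent from all-zero weights with the two stopping rules."""
    w = np.zeros(n_features)
    for _ in range(max_iter):
        w_new = w - step * grad_fn(w)
        if np.linalg.norm(w_new - w) < tol:  # weights barely moved: stop early
            return w_new
        w = w_new
    return w


def lr_gradient(X, y, lam):
    """Gradient of mean log loss plus L2 penalty; y in {0, 1} (assumed)."""
    def grad(w):
        p = 1.0 / (1.0 + np.exp(-X @ w))  # predicted probabilities
        return X.T @ (p - y) / len(y) + lam * w
    return grad


def svm_subgradient(X, y, lam):
    """Subgradient of mean hinge loss plus L2 penalty; y in {-1, +1} (assumed)."""
    def grad(w):
        violated = y * (X @ w) < 1  # examples inside the margin
        return lam * w - X[violated].T @ y[violated] / len(y)
    return grad


# Toy, linearly separable data: a constant bias column plus one feature.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, -2.0], [1.0, -3.0]])
y01 = np.array([1, 1, 0, 0])
ypm = 2 * y01 - 1  # same labels recoded as -1 / +1

w_lr = descend(lr_gradient(X, y01, lam=0.01), X.shape[1], step=0.01)
w_svm = descend(svm_subgradient(X, ypm, lam=0.01), X.shape[1], step=0.5)

acc_lr = float(np.mean((X @ w_lr > 0).astype(int) == y01))
acc_svm = float(np.mean(np.where(X @ w_svm >= 0, 1, -1) == ypm))
print(f"Training Accuracy LR: {acc_lr:.2f}")
print(f"Training Accuracy SVM: {acc_svm:.2f}")
```

Passing the gradient in as a closure keeps the stopping logic in one place, so the LR and SVM variants differ only in how the (sub)gradient is computed.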

- Learning Curves and Performance Comparison (10 pts)

In this part, you are asked to use incremental 10-fold cross validation to plot learning curves for different classifiers, with training sets of varying size but constant test set size. You are then asked to compare the performance of different classifiers given the learning curves. The only dataset you should use in this part is trainingSet.csv. Put your code for this question in a file named cv.py.

- Use the sample function from pandas with the parameters initialized as random_state = 18, frac = 1 to shuffle the training data (i.e., data in trainingSet.csv). Then partition the training data into 10 disjoint sets $S = [S_1, \ldots, S_{10}]$, where $S_1$ contains training samples with index from 1 to 520 (i.e., the first 520 lines of training samples after shuffling), $S_2$ contains samples with index from 521 to 1040 (i.e., the second 520 lines of training samples after shuffling), and so on. Each set has 520 examples.

- For each t_frac $\in$ {0.025, 0.05, 0.075, 0.1, 0.15, 0.2}:

  - For idx = [1..10]:

    - Let test_set = $S_{idx}$.

    - Let $S_C = \bigcup_{i \in [1..10],\, i \neq idx} S_i$.

    - Construct train_set by taking a random t_frac fraction of training examples from $S_C$. Use the sample function from pandas with the parameters initialized as random_state = 32, frac = t_frac to generate this training set.

    - Learn each model (i.e., NBC, LR, SVM) from train_set. Feel free to use your NBC implementation in Assignment 2 here (you do not need to apply Laplacian correction).

    - Apply each of the learned models to test_set and measure the model’s accuracy.

  - For each model (i.e., NBC, LR, SVM), compute the average accuracy over the ten trials and its standard error. Standard error is the standard deviation divided by the square root of the number of trials (in our case, 10). For example, for a sequence of numbers L = [0.16, 0.18, 0.19, 0.15, 0.19, 0.21, 0.21, 0.16, 0.18, 0.16], the standard deviation of L is $\sigma_L = 0.021$, and the standard error is:

    $$sterr_L = \frac{\sigma_L}{\sqrt{num\_trials}} = 0.007$$

- Plot the learning curves for each of the three models in the same plot based on the incremental 10-fold cross validation results you have obtained above. Use the x-axis to represent the size of the training data (i.e., t_frac $\cdot\, |S_C|$) and the y-axis to represent the model accuracy. Use error bars on the learning curves to indicate $\pm 1$ standard error.

- Formulate a hypothesis about the performance difference between at least two of the models.

- Test your hypothesis and discuss whether the observed data support your hypothesis (i.e., are the observed differences significant).
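The incremental cross-validation loop and the standard-error computation can be sketched as follows. This is a rough outline under assumptions: `incremental_cv`, `standard_error`, and `majority_class_score` are hypothetical names, the majority-class scorer only stands in for NBC, LR, and SVM, and the toy data uses 10 folds of 10 rows instead of 520. Note that the worked example's value of 0.021 matches the sample standard deviation (`ddof=1`); the population form would round to 0.020.

```python
import numpy as np
import pandas as pd


def standard_error(accuracies):
    """Sample standard deviation (ddof=1) over sqrt(number of trials)."""
    return np.std(accuracies, ddof=1) / np.sqrt(len(accuracies))


def majority_class_score(train_set, test_set, label="label"):
    """Placeholder model: predict the most common training label."""
    majority = train_set[label].mode().iloc[0]
    return float((test_set[label] == majority).mean())


def incremental_cv(data, t_fracs, fit_and_score, n_folds=10):
    """Return {t_frac: (train size, mean accuracy, standard error)}."""
    shuffled = data.sample(frac=1, random_state=18)
    fold_size = len(shuffled) // n_folds
    folds = [shuffled.iloc[i * fold_size:(i + 1) * fold_size]
             for i in range(n_folds)]
    results = {}
    for t_frac in t_fracs:
        accs = []
        for idx in range(n_folds):
            test_set = folds[idx]
            # S_C: every fold except the held-out one
            s_c = pd.concat([f for i, f in enumerate(folds) if i != idx])
            train_set = s_c.sample(frac=t_frac, random_state=32)
            accs.append(fit_and_score(train_set, test_set))
        results[t_frac] = (len(train_set), float(np.mean(accs)),
                           float(standard_error(accs)))
    return results


# Reproduce the worked example from the text.
L = [0.16, 0.18, 0.19, 0.15, 0.19, 0.21, 0.21, 0.16, 0.18, 0.16]
print(round(float(np.std(L, ddof=1)), 3), round(float(standard_error(L)), 3))

# Toy run: 100 rows, so each fold has 10 rows and S_C has 90.
toy = pd.DataFrame({"x": range(100), "label": [1, 0, 1, 1, 0] * 20})
results = incremental_cv(toy, [0.5, 1.0], majority_class_score)
```

The training-set sizes stored in `results` give the x-axis values for the learning curve, and the stored standard errors give the error-bar heights.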


Submission Instructions:

After logging into data.cs.purdue.edu, please follow these steps to submit your assignment:

- Make a directory named yourFirstName_yourLastName and copy all of your files to this directory.

- While in the upper level directory (if the files are in /homes/yin/ming_yin, go to /homes/yin), execute the following command:

turnin -c cs573 -p HW3 your_folder_name

(e.g., your professor would use: turnin -c cs573 -p HW3 ming_yin to submit her work)

Keep in mind that old submissions are overwritten with new ones whenever you execute this command.

You can verify the contents of your submission by executing the following command: turnin -v -c cs573 -p HW3

Do not forget the -v flag here, as otherwise your submission would be replaced with an empty one.

Your submission should include the following files:

- The source code in Python.

- Your evaluation & analysis in .pdf format. Note that your analysis should include visualization plots as well as a discussion of results, as described in detail in the questions above.

- A README file containing your name, instructions to run your code and anything you would like us to know about your program (like errors, special conditions, etc.).
