Description
You may complete this homework assignment either individually or in a team of up to 2 people.

Support Vector Machines: Implementation [40 points]: In this problem you will implement a class called SVM453X that supports both training and testing of a linear, hard-margin support vector machine (SVM). In particular, you should flesh out the two methods fit and predict so that they have the same API as the other machine learning tools in the sklearn package.


fit: Given a matrix X consisting of n rows (examples) by m columns (features)^{1}, as well as a vector y ∈ {-1, +1}^n consisting of labels, optimize the hyperplane normal vector w and bias term b to maximize the margin between the two classes. To do so, you should harness an off-the-shelf quadratic programming (QP) solver from the cvxopt Python package. The template file already includes code showing how to use it; just pass in the relevant matrices and vectors as numpy arrays. The trick is in setting up the variables to encode the SVM objective and constraints correctly. Then, store the optimized values in self.w and self.b because you'll need them later.
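As a hint for how the encoding might look: cvxopt's solvers.qp minimizes (1/2) z^T P z + q^T z subject to G z <= h, so one possible choice is to stack the bias and weights into a single variable z = [b, w_1, ..., w_m]. The sketch below (build_svm_qp is a hypothetical helper name, not part of the template) constructs the QP matrices in numpy; the actual solver call is shown only in comments, since the exact wiring is what the assignment asks you to work out.

```python
import numpy as np

def build_svm_qp(X, y, eps=1e-8):
    """Construct hard-margin SVM QP matrices in cvxopt's form:
       minimize (1/2) z^T P z + q^T z  subject to  G z <= h,
       where z = [b, w_1, ..., w_m]."""
    n, m = X.shape
    # Objective (1/2)||w||^2: identity block for w, (near-)zero for b.
    # A tiny eps on the diagonal keeps P strictly positive definite,
    # which some QP solvers require for numerical stability.
    P = np.zeros((m + 1, m + 1))
    P[1:, 1:] = np.eye(m)
    P += eps * np.eye(m + 1)
    q = np.zeros(m + 1)
    # Margin constraints y_i (w . x_i + b) >= 1, rewritten as
    # -y_i * [1, x_i] @ z <= -1 to match the G z <= h form.
    G = -y[:, None] * np.hstack([np.ones((n, 1)), X])
    h = -np.ones(n)
    return P, q, G, h

# With cvxopt installed, these would then be passed to the solver, e.g.:
#   from cvxopt import matrix, solvers
#   sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
#   z = np.array(sol['x']).flatten()
#   b, w = z[0], z[1:]
```

This is only one way to parameterize the problem; solving the dual instead of the primal is an equally valid route.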



predict: Given a matrix X of test examples, and given the pretrained self.w and self.b, compute and return the corresponding predicted test labels.

Your code will be evaluated based both on the accuracy of the predicted test labels in several tests (you can see the exact test code we'll be running in the template file), as well as on the difference between the hyperplane parameters and their ideal values.
Note: as shown in the template, you can calculate the ideal values using the off-the-shelf sklearn SVM solver. Note that, since the SVM in this package is, by default, a soft-margin SVM, we have to "force" it to be hard-margin by making the cost penalty C very large.
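For example, computing reference values this way might look like the sketch below (the toy data is illustrative, not from the assignment; the large-C trick is the one described above):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative values only).
X = np.array([[0., 0.], [0., 1.], [2., 0.], [2., 1.]])
y = np.array([-1, -1, 1, 1])

# A very large C makes the soft-margin penalty so severe that the
# solver effectively behaves like a hard-margin SVM.
clf = SVC(kernel='linear', C=1e10)
clf.fit(X, y)

w_ideal = clf.coef_.flatten()   # reference hyperplane normal
b_ideal = clf.intercept_[0]     # reference bias
```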

Support Vector Machines: Application [30 points]: Register for the following Kaggle competition about predicting which bank customers will make a transaction: https://www.kaggle.com/c/santander-customer-transaction-prediction Read the Overview section to learn what the competition is about. Download the training data and associated labels from the Data tab. Train SVMs using two different kernels: linear and polynomial (I suggest degree 3, but it's your choice). Use sklearn's implementation of the SVC class. Use half the training data (randomly selected from the entire training set) for training, and use the remaining half to estimate the performance of your classifiers. You are not required to submit your guesses for the test examples; however, if you want to, you can earn a spot on the competition's leaderboard.
Bootstrap aggregation (bagging): In order to train an SVM with a nonlinear kernel on a dataset of this size (the whole training dataset contains 200K examples; hence, your training data will consist of 100K examples) on an ordinary desktop/laptop computer (i.e., without terabytes of RAM), you will need to use an approach such as bagging: (a) split the data into multiple subsets; (b) train an SVM on each subset; and (c) create an ensemble classifier that averages the predictions of the individual classifiers in the ensemble. Report your estimated accuracy (AUC) for both your SVMs, using the sklearn.metrics.roc_auc_score function. You can just include the accuracy as a comment at the top of your Python file. We might run your code to reproduce this result; hence, make sure you only use sklearn, pandas, and numpy (no other packages).
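Steps (a)-(c) can be sketched as follows. The synthetic Gaussian data below stands in for the Kaggle data (which you must download yourself), and the subset count and kernel settings are illustrative choices, not requirements:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the competition data: two Gaussian blobs.
n = 400
X = np.vstack([rng.normal(-1, 1, (n // 2, 5)), rng.normal(1, 1, (n // 2, 5))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

# Shuffle, then hold out half of the data for evaluation.
perm = rng.permutation(n)
X, y = X[perm], y[perm]
X_tr, y_tr = X[: n // 2], y[: n // 2]
X_te, y_te = X[n // 2 :], y[n // 2 :]

# (a) split the training data into subsets, (b) train one SVM per subset,
# (c) average the real-valued decision scores across the ensemble.
n_bags = 4
scores = np.zeros(len(X_te))
for Xs, ys in zip(np.array_split(X_tr, n_bags), np.array_split(y_tr, n_bags)):
    clf = SVC(kernel='poly', degree=3).fit(Xs, ys)
    scores += clf.decision_function(X_te)
scores /= n_bags

auc = roc_auc_score(y_te, scores)
```

Averaging decision_function scores (rather than hard 0/1 predictions) gives roc_auc_score a real-valued ranking to work with, which is what AUC is defined over.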
Submit your Python code in a file called homework4_WPIUSERNAME1.py (or homework4_WPIUSERNAME1_WPIUSERNAME2.py for teams).

^{1} To be consistent with the notation I've used in the lecture slides, X in this assignment would actually need to be called X^T. The reason I call it X here is to be consistent with the sklearn package.