CS 109A/AC 209A/STAT 121A Data Science: Homework 8

Harvard University
Fall 2016
Instructors: W. Pan, P. Protopapas, K. Rader
Due Date: Wednesday, November 16th, 2016 at 11:59pm

Clear namespace

In [1]:
# Clear namespace
for name in dir():
    if not name.startswith('_'):
        del globals()[name]

Import libraries

In [2]:
# Data manipulation
import numpy as np
import pandas as pd

# Ploting
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Scientific computing
import scipy as sp

# Machine Learning
from sklearn import linear_model
from sklearn.decomposition import PCA
from sklearn.cross_validation import KFold
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import f1_score
from sklearn import tree

Problem 0: Basic Information

Fill in your basic information.

Part (a): Your name

[Hagmann, Tim]

Part (b): Course Number

[CS 109a]

Part (c): Who did you work with?

[-]

Problem 1: Image Processing Revisited

In this problem we revisit applications of classification, with the purpose of comparing the performance of support vector classifiers with other classifiers we have learned. We'll begin with the aeriel vegetation detection problem from Homework #7.

The data is contained in dataset_1.txt and dataset_2.txt (you are encouraged to use the datasets from Homework #7 as well). The first two columns of the data contains the latitude and longitudes of randomly sampled locations in the satellite image, and the last column contains a label indicating whether the location contains vegetation (1 denotes the presence of vegetation and 0 denotes otherwise). The task is to, again, identify the vegetation regions in the image.

Functions (necessary for the following calculations)

In [3]:
#--------  plot_decision_boundary
# A function that visualizes the data and the decision boundaries
# Input: 
#      x (predictors)
#      y (labels)
#      model (classifier)
#      poly_flag (fits quadratic model if true, otherwise linear)
#      title (title for plot)
#      ax (a set of axes to plot on)
# Returns: 
#      ax (axes with data and decision boundaries)

def plot_decision_boundary(x, y, model, title, ax, bounds=(0, 1), poly_flag=False):
    # Plot data
    ax.scatter(x[y == 1, 0], x[y == 1, 1], c='darkgreen', alpha=0.6, s=60, edgecolors="None")
    ax.scatter(x[y == 0, 0], x[y == 0, 1], c='black', alpha=0.6, s=60, edgecolors="None")

    # Create mesh
    interval = np.arange(bounds[0], bounds[1], 0.01)
    n = np.size(interval)
    x1, x2 = np.meshgrid(interval, interval)
    x1 = x1.reshape(-1, 1)
    x2 = x2.reshape(-1, 1)
    xx = np.concatenate((x1, x2), axis=1)

    # Predict on mesh points
    if(poly_flag):
        quad_features = preprocessing.PolynomialFeatures(degree=2)
        xx = quad_features.fit_transform(xx)
        
    yy = model.predict(xx)    
    yy = yy.reshape((n, n))

    # Plot decision surface
    x1 = x1.reshape(n, n)
    x2 = x2.reshape(n, n)
    ax.contourf(x1, x2, yy, alpha=0.2, cmap='Greens')
    
    # Label axes, set title
    ax.set_title(title)
    ax.set_xlabel('Latitude')
    ax.set_ylabel('Longitude')
    ax.grid()
   
    return ax
In [4]:
#--------  fit_and_plot_svm_for_c
# Fit and plot SVM model for value of 'C', overlayed on a scatter plot of data 
# (fit on train set and evaluate on test set)
#
# Input: 
#      x_train (array of train predictors)
#      y_train (array of train responses)#      
#      x_test (array of test predictors)
#      y_test (array of test responses)
#      bounds (tuple of bounds for plotting)
#      C  (value for parameter C)
#      ax (axes to plot on)

def fit_and_plot_svm_for_c(x_train, y_train, x_test, y_test, C, ax, bounds=(0, 1)):
    model = svm.SVC(C=C, kernel='linear')
    model.fit(x_train, y_train)
    
    # Train and test error
    tr_acc = model.score(x_train, y_train)
    ts_acc = model.score(x_test, y_test)

    # Plot decision boundary
    plot_decision_boundary(x_train, y_train, model, \
                           'C = ' + str(C)\
                           + ', ACC (Train) = ' + str(tr_acc)\
                           + ', ACC (Test) = ' + str(ts_acc), ax, bounds)
    
    # Plot support vectors
    sv_indices = model.support_
    ax.scatter(x_train[sv_indices, 0], x_train[sv_indices, 1], color='red', alpha=0.15, s=200)
    
    return ax
In [5]:
#--------  kernel_switcher
# A Function to varying the kernels 
#
# Input: 
#      df_train (array of train predictors)
#      df_test (array of test predictors)

def kernel_switcher(df_train, df_test):
    # Select x-variables
    x_train = df_train[:, 0:-1]
    x_test = df_test[:, 0:-1]

    # Select yvariables
    y_train = df_train[:, -1]
    y_test = df_test[:, -1]

    fig, ax = plt.subplots(1, 4, figsize = (30, 7))
    kernel_list = ['linear', 'poly', 'rbf', 'sigmoid']
    
    # Fit models
    for i, kernel in enumerate(kernel_list):
        
        # SVM
        svm_fit = svm.SVC(C=10000, kernel=kernel)
        svm_fit.fit(x_train, y_train)
        sv_indices = svm_fit.support_
        ax[i].scatter(x_train[sv_indices, 0], x_train[sv_indices, 1], color='red', alpha=0.15, s=100)
        
        # Train and test error
        tr_acc = svm_fit.score(x_train, y_train)
        ts_acc = svm_fit.score(x_test, y_test)
        
        # Plot boundaries
        ax[i] = plot_decision_boundary(x_train, y_train, svm_fit, 
                                       str(kernel) + ' Kernel'+', ACC (Train) = ' + str(tr_acc)\
                               + ', ACC (Test) = ' + str(ts_acc), ax[i])
In [6]:
#--------  fit_and_plot_dt
# Build a decision tree and plot the data
# Input: 
#      df_train (array of train predictors)
#      df_test (array of test predictors)   
#      depth (depth of tree)
#      ax (axis)

def fit_and_plot_dt(df_train, df_test, depth, ax):
    x_train = df_train[:, 0:-1]
    x_test = df_test[:, 0:-1]
    y_train = df_train[:, -1]
    y_test = df_test[:, -1]
    
    # Tree
    dt = tree.DecisionTreeClassifier(max_depth = depth)
    dt.fit(x_train, y_train)
    
    # Train and test error
    tr_acc = dt.score(x_train, y_train)
    ts_acc = dt.score(x_test, y_test)
    
    # Plotting
    ax = plot_tree_boundary(x_train, y_train, dt, 'Depth = ' + str(depth)\
                           + ', ACC (Train) = ' + str(tr_acc)\
                           + ', ACC (Test) = ' + str(ts_acc), ax)
    
    return ax
In [7]:
#--------  plot_tree_boundary
# A function that visualizes the data and the decision boundaries
# Input: 
#      x (predictors)
#      y (labels)
#      model (the classifier you want to visualize)
#      title (title for plot)
#      ax (a set of axes to plot on)
# Returns: 
#      ax (axes with data and decision boundaries)

def plot_tree_boundary(x, y, model, title, ax):
    # Plotting
    ax.scatter(x[y==1, 0], x[y==1, 1], c='darkgreen', alpha=0.6, s=60, edgecolors="None")
    ax.scatter(x[y==0, 0], x[y==0, 1], c='black', alpha=0.6, s=60, edgecolors="None")
    
    # Preparation
    interval = np.arange(0,1, 0.01)
    n = np.size(interval)
    x1, x2 = np.meshgrid(interval, interval)
    x1 = x1.reshape(-1, 1)
    x2 = x2.reshape(-1, 1)
    xx = np.concatenate((x1, x2), axis=1)
    yy = model.predict(xx)    
    yy = yy.reshape((n, n))

    # Plot surface
    x1 = x1.reshape(n, n)
    x2 = x2.reshape(n, n)
    ax.contourf(x1, x2, yy, alpha=0.1, cmap='Greens')
    
    # Astetics
    ax.set_title(title)
    ax.set_xlabel('Latitude')
    ax.set_ylabel('Longitude')
    ax.grid()
    
    return ax

Step 1: Load the data and explore

Let's load the two datasets and visualize them.

In [8]:
# Load dataset_1
data_1 = pd.read_csv('datasets/dataset_1_train.txt', delimiter=',')

# Load dataset_2
data_2 = pd.read_csv('datasets/dataset_2_train.txt', delimiter=',')

# Plot data
fig, ax = plt.subplots(1, 2, figsize = (20, 6))

#Plot dataset 1

x = data_1.values[:, :-1]
y = data_1.values[:, -1]

ax[0].scatter(x[y==0, 0], x[y==0, 1], label='non vegetation', c='black', alpha=0.6, s=60, edgecolors="None")
ax[0].scatter(x[y==1, 0], x[y==1, 1], label='vegetation', c='darkgreen', alpha=0.6, s=60, edgecolors="None")

ax[0].set_xlabel('Longitude')
ax[0].set_ylabel('Latitude')
ax[0].set_title('Dataset 1')
ax[0].legend(loc='best', scatterpoints = 1)
ax[0].grid()
#Plot dataset 2

x = data_2.values[:, :-1]
y = data_2.values[:, -1]

ax[1].scatter(x[y==0, 0], x[y==0, 1], label='non vegetation', c='black', alpha=0.6, s=60, edgecolors="None")
ax[1].scatter(x[y==1, 0], x[y==1, 1], label='vegetation', c='darkgreen', alpha=0.6, s=60, edgecolors="None")

ax[1].set_xlabel('Longitude')
ax[1].set_ylabel('Latitude')
ax[1].set_title('Dataset 2')
ax[1].legend(loc='best', scatterpoints = 1)
ax[1].grid()
plt.show()

It looks like the classes in dataset 1 is well separated, while the classes in dataset 2 is not. That means that while dataset 1 has no overlapp of the points dataset 2 has some.

In [9]:
# Load dataset_1
x = data_1.values[:, :-1]
y = data_1.values[:, -1]

# Fit SVM model with C = 1000, linear kernel
model = svm.SVC(C=1000, kernel='linear')
model.fit(x, y)

# Plot decision boundary
fig, ax = plt.subplots(1, 2, figsize = (20, 6))
ax[0] = plot_decision_boundary(x, y, model, 'dataset_1 (fit with entire data)', ax[0])

# Highlight the support vectors
sv_indices = model.support_ # retrieve the support vector indices
ax[0].scatter(x[sv_indices, 0], x[sv_indices, 1], color='red', alpha=0.15, s=200) # draw circles around SVs

# Isolate only support vectors and their labels
x_svs = x[sv_indices, :] 
y_svs = y[sv_indices]
model.fit(x_svs, y_svs)

# Plot decision boundary with only support vectors
ax[1] = plot_decision_boundary(x_svs, y_svs, model, 'dataset_1 (refit with only SVs)', ax[1])

# Highlight the support vectors
ax[1].scatter(x_svs[:, 0], x_svs[:, 1], color='red', alpha=0.15, s=200) # draw circles around SVs
plt.show()

The above plots show, that the support vectors are the points used for calculating the hyperplane decision boundary. Removing the other points (done in plot two) has no effect on the decision boundary.

In [10]:
# Load dataset_1
x = data_2.values[:, :-1]
y = data_2.values[:, -1]

# Fit SVM model with C = 1000, linear kernel
model = svm.SVC(C=1000, kernel='linear')
model.fit(x, y)

# Plot decision boundary
fig, ax = plt.subplots(1, 2, figsize = (20, 6))
ax[0] = plot_decision_boundary(x, y, model, 'dataset_2 (fit with entire data)', ax[0])

# Highlight the support vectors
sv_indices = model.support_ # retrieve the support vector indices
ax[0].scatter(x[sv_indices, 0], x[sv_indices, 1], color='red', alpha=0.15, s=200) # draw circles around SVs

# Isolate only support vectors and their labels
x_svs = x[sv_indices, :] 
y_svs = y[sv_indices]
model.fit(x_svs, y_svs)

# Plot decision boundary with only support vectors
ax[1] = plot_decision_boundary(x_svs, y_svs, model, 'dataset_2 (refit with only SVs)', ax[1])

# Highlight the support vectors
ax[1].scatter(x_svs[:, 0], x_svs[:, 1], color='red', alpha=0.15, s=200) # draw circles around SVs

plt.show()

It appears that in both cases, the decision boundary of an SVM is completely determined by a subset of data points - the support vectors. In Dataset 1, it's possible to find a subset of points from the two classes that are well separated, SVM chooses the subset in which the classes are maximally separated (the margin between points from different classes is maximized). In Dataset 2 it is not possible. Any decision boundary will have some errors (a mix of classes on either side). That means it the selection of the support vectors are very important to optimize the model.

Step 2: Play with values of the parameters of the model

In defining our support vector classifier, the parameter C and kernel have to be specified.

The parameter C is to tune the bounds of the sum of errors for the classification. That means it controls how much the of a violation of the hyperplane margin the model tolerates. It can be seen as a budget for how much the margin can be violated.

The value of C affects how many support vectors are chosen an also the variance of the model. The more support vectors are chosen the lower the variance. The bigger the C value the wider the margin and the model tollerates more violations of the margin.

This can be demonstrated with the simulation below:

In [11]:
# Load train data
data_train = np.loadtxt("datasets/dataset_2_train.txt", delimiter=',')
x_train = data_train[:, 0:-1]
y_train = data_train[:, -1]

# Load test data
data_test = np.loadtxt("datasets/dataset_2_test.txt", delimiter=',')
x_test = data_test[:, 0:-1]
y_test = data_test[:, -1]

# Fit and plot for different 'C' values
fig, ax = plt.subplots(1, 3, figsize = (15, 3))

ax[0] = fit_and_plot_svm_for_c(x_train, y_train, x_test, y_test, 0.1, ax[0], (-10, 10))

ax[1] = fit_and_plot_svm_for_c(x_train, y_train, x_test, y_test, 1, ax[1])

ax[2] = fit_and_plot_svm_for_c(x_train, y_train, x_test, y_test, 100, ax[2])

plt.tight_layout()

For dataset_2, any linear decision boundary would have some errors. In this case, SVM chooses a decision boundary by trading-off the errors for margin (i.e. balancing minimizing error with maximizing margin). It appears that the parameter $C$ controls the trade-off.

Increasing $C$ lays more emphasis on accuracy and less emphasis on the margin of separation. As a result, the model tends to overfit the train set, and perform poorly on the test set.

For $C=0.01$, the model clearly underfits the train set. For $C=100$, the model overfits the train set. For $C=1$, the fitted model misclassifies a point in the upper left corner (which could have othwerwise been classified correctly by a linear model), and achieves higher margin.

Step 3: Push your analysis further

1. How does SVM compare with the other classifiers we know

As a first step we're fitting a decision tree with different dept levels to the available datasets.

CART fitted on dataset 1

In [12]:
# Load train data
train_1 = np.loadtxt("datasets/dataset_1_train.txt", delimiter=',')
# Load test data
test_1 = np.loadtxt("datasets/dataset_1_test.txt", delimiter=',')

Plotting CART with different depth levels. The chosen levels are 1 split, 2 splits, 4 splits and 10 splits.

In [13]:
# Plot for dataset_1
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))

# Plot data and decision boundary
ax[0, 0] = fit_and_plot_dt(train_1, test_1, 1, ax[0, 0]) 
ax[0, 1] = fit_and_plot_dt(train_1, test_1, 2, ax[0, 1]) 
ax[1, 0] = fit_and_plot_dt(train_1, test_1, 4, ax[1, 0]) 
ax[1, 1] = fit_and_plot_dt(train_1, test_1, 10, ax[1, 1]) 

plt.tight_layout()
plt.show()

As can be seen above, the depth levels doesn't matter. The points are always perfectly split in the dataset 1.

CART fitted on dataset 2

In [14]:
# Load train data
train_2 = np.loadtxt("datasets/dataset_2_train.txt", delimiter=',')
# Load test data
test_2 = np.loadtxt("datasets/dataset_2_test.txt", delimiter=',')

Plotting CART with different depth levels. The chosen levels are 1 split, 2 splits, 4 splits and 10 splits.

In [15]:
# Plot for dataset_2
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))

# Plot data and decision boundary
ax[0, 0] = fit_and_plot_dt(train_2, test_2, 1, ax[0, 0]) 
ax[0, 1] = fit_and_plot_dt(train_2, test_2, 2, ax[0, 1]) 
ax[1, 0] = fit_and_plot_dt(train_2, test_2, 4, ax[1, 0]) 
ax[1, 1] = fit_and_plot_dt(train_2, test_2, 10, ax[1, 1]) 

plt.tight_layout()
plt.show()

The above plots show, that the depth of the tree matters for the decision boundary. The further down the split the closer the decision boundary to the training points.

2. Changing the shape of the decision boundarys

It looks like the only decision boundaries SVM draws, so far, are lines. This is pretty limiting. Is there any way for us to change the shape of the decision boundary (how did we get logistic regression to draw curvy boundaries)?

This can be done by transforming the feature space in a non linear fashion (polynomial). In order to do this kernels can be used. Using kernels means transforming the feature vector to fitting a support vector in a high-dimensional space.

Kernel switching on dataset 1

Using 4 different kernels for the SVM. Those are a linear kernel, a polynomial kernel, a rbf kernel and a sigmoid kernel

In [16]:
# Plot 4 different kernels
kernel_switcher(train_1, test_1)

Kernel switching on dataset 2

In [17]:
# Plot 4 different kernels
kernel_switcher(train_2, test_2)

3. Which model is better for remote senseing

In this lab, we've only explored SVM. How does SVM compare with other classifiers in terms of addressing the challenges of remote sensing for land cover analysis?

Dataset 1
The CART classification on dataset 1 achieves a maximum test accuracy of 0.89 with a depth level of 1 compared to an accuracy of 1.0 with a linear SV and rbf kernel. The SVM performs in this setting better.

Dataset 2
On the dataset 2 the CART classifier performs the best with a depth level of 4 where it achieves an accuracy of 0.825. This in contrast to 0.7625 of the polynomial kernel of the svm. The CART tree performs for this data better.

4. Comparing the results

Compare the result of using support vector classifiers to perform classification against results obtained from other models you have learned. Which model is more appropriate for the general task of vegetation detection in aerial images (do not restrict yourself to which model performs better on just these two datasets)?

The above results show, that when modeling vegetation it can't always be assumed that the modeled areas occure in distinct cluster. The problem with SVM's is, that they are only able to model a single decision boundary between vegetation and non-vegetation. The CART approach on the other hand can also model smaller patches of land inbetween the clusters. This is done with the depth of the tree.

5. What is the most appropriate model

Which model is more appropriate for other types of image processing (hand-writting digit classification for example)

In order to address the topic of hand-writting digit classification I would argue that SVM's are more appropriate than using CART. The reason for this is, that SVM's are spare, which means that they only require a very small subset of the handwritting points. That means that SVM's are not affected by large datasets and less prone to the course of dimensionalty, i.e., they are more able to handly high dimensionality problems.

6. Drawbacks of SVM's

Are there any obvious draw backs to support vector classifiers as we have presented them to you? What might be some intuitive ways to address these draw backs?

Classical SVM's have a linear kernel and can only cut the hyperplane in a linear fashion. This was for example addressed by Burgess (1998): "Perhaps the biggest limitation of the support vector approach lies in choice of the kernel." This can be addressed by using different kernels that are able to transform the hyperplane into complex non linear decision boundaries.

A further drawback is presented by categorical variables. In order to solve this issue, the categorical variables can be recoded with dummy variables.

Last but not least a further drawback is the algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks. This problem can nowadays be easierd solved trough the usage of cloud computing options like AWS, Azure or others.


Problem 2 (Optional): Classification Competition

I took part in the Harvard Data Science Competition competition. My rank is 6th out of 97. The link to my profile is the following:
https://www.kaggle.com/timhagmann


Challenge Problem: Meta Learning

In the problem, you are provided with 10 different previously trained prediction models for a spam classification task. The task is to investigate how can one combine these models into a single meta classification model (without retraining the individual models) that performs better than each of the individual ones?

Functions (necessary for the following calculations)

In [18]:
#--------  find_mode
# Generate for the test set a mode prediction. 
# Input: 
#      row (define the row)

def find_mode(row_index):
    counts = np.unique(row_index, return_counts=True)[1]

    if (len(counts) == 2 and counts[0] == counts[1]):
        if np.random.random() >0.5:
            return counts[0]
        return counts[1]
    
    return np.argmax(counts)

Load data and models

The data for this problem is provided in the files dataset_5_train.txt and dataset_5_test.txt. Each row of these files is an email described by 57 attributes, and the last column is 1 if the email is spam, and 0 otherwise.

The prediction models are provided in the file models.npy and can be loaded into an array by executing:

models = np.load('models.npy')

In [19]:
# Load models from the npy file
models = np.load('models.npy')

# Load train and test dataframe
train_5 = np.loadtxt('datasets/dataset_5_train.txt', delimiter=',')
test_5 =  np.loadtxt('datasets/dataset_5_test.txt', delimiter=',')

Extract the x and y variables

In [20]:
# Extract x and y variables
x_train = train_5[:, 0:-1]
y_train = train_5[:, -1]

x_test = test_5[:, 0:-1]
y_test = test_5[:, -1]

Make predictions

As before, you can make predictions using the $i^\text{th}$ using:

model[i].predict(x_test)

and score the model using:

model[i].score(x_test, y_test)

In [21]:
# Get a n x 10 prediction matrix of the training and testing testing set of the model
pred_train = np.zeros((x_train.shape[0], len(models)))
pred_test = np.zeros((x_test.shape[0], len(models)))

for i in range(len(models)):
    pred_train[:, i] = models[i].predict(x_train)
    pred_test[:, i] = models[i].predict(x_test)

Create a baseline model

The baseline for this task is a simple combination strategy that takes a majority vote from the individual prediction models.

In [22]:
# Find mode vectories with apply
majority_vote = np.apply_along_axis(find_mode, axis=1, arr=pred_test)
baseline_accuracy = np.mean(majority_vote == y_test)

Printing the accuracy score:

In [23]:
print 'Majority vote for the baseline accuracy:', round(baseline_accuracy * 100, 2), '%'
Majority vote for the baseline accuracy: 77.87 %

The majority vote baseline ensemble model reaches a accouracy of 77.87% that should be beaten by the following ensemble model.

Create an ensemble model

There are three very popular methods for combining predictions from different models. Those are:

  • Bagging: Building multiple models (typically of the same type) from different subsamples of the training dataset.
  • Boosting: Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain.
  • Stacking: Building multiple models (typically of differing types) and supervisor model that learns how to best combine the predictions of the primary models.

The here used approach falls into the category of stacking. A good approach to do this is to use a GLM or logistic regression on the predictions from the 10 models. The resulting response is the y variable that indicates if an email is classified as spam or not.

Building a glm on the training data leads to the coefficients for each of the predictors that gives "weights" to the prediction generadet by the seperated models.

In [24]:
# Logistic ensemble model
log_model = LogisticRegression().fit(pred_train, y_train)
test_y_pred = log_model.predict(pred_test)
log_model_accuracy = np.mean(test_y_pred == y_test)

Printing the accuracy score:

In [25]:
print 'Ensemble model accuracy score:', round(log_model_accuracy * 100, 2), '%'
Ensemble model accuracy score: 89.6 %

The combined ensemble model build with a GLM gives an accuracy score of 89.6% which is better that the baseline score.