CS 109A/AC 209A/STAT 121A Data Science: Homework 7

Harvard University
Fall 2016
Instructors: W. Pan, P. Protopapas, K. Rader
Due Date: Wednesday, November 9th, 2016 at 11:59pm

Download the IPython notebook as well as the data file from Vocareum and complete locally.

To submit your assignment, upload your solution to Vocareum (using the 'Upload' button on your Jupyter Dashboard) as a single notebook with the following file name format:

last_first_CourseNumber_HW7.ipynb

where CourseNumber is the course in which you're enrolled (CS 109a, Stats 121a, AC 209a). Submit your assignment in Vocareum using the 'Submit' button.

Avoid editing your file in Vocareum after uploading. If you need to make a change to a solution, delete your old solution file from Vocareum and upload a new one. Click submit only ONCE after verifying that you have uploaded the correct file. The assignment will CLOSE after you click the submit button.

Problems on homework assignments are equally weighted. The Challenge Question is required for AC 209A students and optional for all others. Students who complete the Challenge Problem as optional extra credit will receive +0.5% toward their final grade for each correct solution.

Clear namespace

In [1]:
# Clear namespace
for name in dir():
    if not name.startswith('_'):
        del globals()[name]

Install package

In [2]:
#!pip install pydotplus

Import libraries

In [57]:
# Data manipulation
import numpy as np
import pandas as pd

# Plotting
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline

# Scientific computing
import scipy as sp

# Visualization
import StringIO
from IPython.display import Image
import pydotplus

# Machine Learning
from sklearn import linear_model
from sklearn import discriminant_analysis
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn import tree
from sklearn import ensemble
from sklearn.cross_validation import KFold
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

Problem 0: Basic Information

Fill in your basic information.

Part (a): Your name

[Hagmann, Tim]

Part (b): Course Number

[CS 109a]

Part (c): Who did you work with?

[-]

All data sets can be found in the datasets folder and are in comma-separated value (CSV) format.

Problem 1: Monitoring Land Cover Changes Using Satellite Images

In the face of rapid urban development and climate change, it is now more urgent than ever for governments (and other organizations) to have a detailed, accurate and up-to-date picture of land use and land cover, as well as how the land use/cover is changing over time, in order to make effective policy decisions to manage and protect natural resources. Building such a comprehensive picture of land use/cover for a large region is extremely difficult.

Recent improvements in satellite imagery and image processing have allowed for new tools in land use/cover analysis. The following image shows the change in vegetation cover around Belize from 1975 to 2007:

In this problem, we will explore how to use classifiers to detect the presence and location of vegetation in satellite images.

Part 1(a): Detecting vegetation in satellite images

The following files contain sampled locations from satellite aerial images: dataset_1.txt, ... dataset_4.txt. The first two columns contain the normalized latitude and longitude values. The last column indicates whether or not the location contains vegetation, with 1 indicating the presence of vegetation and 0 indicating otherwise.

These small sets of labels are typically generated by hand (that is, locations might be classified based on field studies or by cross-referencing with government databases). Your task is to use the labeled locations to train a model that will predict whether a new location is vegetation or non-vegetation.

  • Suppose we were asked to write a computer program to automatically identify the vegetation regions on the landscape. How can we use the model fitting algorithms you have studied so far to identify the boundaries of the vegetation regions? In particular, discuss the suitability of the following algorithms for each of the four data sets (you do not need to evaluate your classifier; build your argument using data and decision boundary visualizations):

    • linear or polynomial regression
    • linear or polynomial logistic regression
    • linear or quadratic discriminant analysis
    • decision trees
  • By a quick visual inspection of each data set, what do you think is the smallest depth decision tree that would provide a good fit of the vegetation boundaries in each case? Does sklearn's decision tree fitting algorithm always provide a good fit for the proposed depth? If not, explain why. Support your answer with suitable visualization.

We provide you with a function plot_tree_boundary to visualize a decision tree model on the data set.

Functions (necessary for the following calculations)

In [4]:
#--------  plot_decision_boundary
# A function that visualizes the data and the decision boundaries
# Input: 
#      x (predictors)
#      y (labels)
#      model (classifier)
#      title (title for plot)
#      ax (a set of axes to plot on)
#      bounds (plot bounds, defaults to the unit square)
#      poly_flag (predicts on quadratic features if true, otherwise linear)
# Returns: 
#      ax (axes with data and decision boundaries)

def plot_decision_boundary(x, y, model, title, ax, bounds=(0, 1), poly_flag=False):
    # Plot data
    ax.scatter(x[y == 1, 0], x[y == 1, 1], c='darkgreen', alpha=0.6, s=40, label='vegetation', edgecolors="None")
    ax.scatter(x[y == 0, 0], x[y == 0, 1], c='black', alpha=0.6, s=40, label='non vegetation', edgecolors="None")
    
    # Create mesh
    interval = np.arange(bounds[0], bounds[1], 0.01)
    n = np.size(interval)
    x1, x2 = np.meshgrid(interval, interval)
    x1 = x1.reshape(-1, 1)
    x2 = x2.reshape(-1, 1)
    xx = np.concatenate((x1, x2), axis=1)

    # Predict on mesh points
    if(poly_flag):
        quad_features = preprocessing.PolynomialFeatures(degree=2)
        xx = quad_features.fit_transform(xx)
        
    yy = model.predict(xx)    
    yy = yy.reshape((n, n))

    # Plot decision surface
    x1 = x1.reshape(n, n)
    x2 = x2.reshape(n, n)
    ax.contourf(x1, x2, yy, alpha=0.1, cmap='Greens')
    
    # Label axes, set title
    ax.set_title(title)
    ax.set_xlabel('Latitude')
    ax.set_ylabel('Longitude')
    
    return ax
In [5]:
#--------  plot_tree_boundary
# A function that visualizes the data and the decision boundaries
# Input: 
#      x (predictors)
#      y (labels)
#      model (the classifier you want to visualize)
#      title (title for plot)
#      ax (a set of axes to plot on)
# Returns: 
#      ax (axes with data and decision boundaries)

def plot_tree_boundary(x, y, model, title, ax):
    # PLOT DATA
    ax.scatter(x[y==1,0], x[y==1,1], c='darkgreen', alpha=0.6, s=40, label='vegetation', edgecolors="None")
    ax.scatter(x[y==0,0], x[y==0,1], c='black', alpha=0.6, s=40, label='non vegetation', edgecolors="None")
    
    # CREATE MESH
    interval = np.arange(0,1,0.01)
    n = np.size(interval)
    x1, x2 = np.meshgrid(interval, interval)
    x1 = x1.reshape(-1,1)
    x2 = x2.reshape(-1,1)
    xx = np.concatenate((x1, x2), axis=1)

    # PREDICT ON MESH POINTS
    yy = model.predict(xx)    
    yy = yy.reshape((n, n))

    # PLOT DECISION SURFACE
    x1 = x1.reshape(n, n)
    x2 = x2.reshape(n, n)
    ax.contourf(x1, x2, yy, alpha=0.1, cmap='Greens')
    
    # LABEL AXIS, TITLE
    ax.set_title(title)
    ax.set_xlabel('Latitude')
    ax.set_ylabel('Longitude')
    ax.grid()
    
    return ax
In [6]:
#--------  fit_and_plot_dt
# Fit a decision tree of given depth on the given data set, and plot the data/model
# Input: 
#      x (predictors)
#      y (labels)
#      depth (depth of tree)
#      title (title for plot)
#      ax (a set of axes to plot on)
# Returns: 
#      ax (axes with data and decision boundaries)

def fit_and_plot_dt(x, y, depth, title, ax):
    # FIT DECISION TREE MODEL
    dt = tree.DecisionTreeClassifier(max_depth = depth)
    dt.fit(x, y)

    # PLOT DECISION TREE BOUNDARY
    ax = plot_tree_boundary(x, y, dt, title, ax)
    
    return ax
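
As a quick sanity check that these helpers behave as expected, here is a minimal sketch that calls fit_and_plot_dt on synthetic data. The square region below is an invented example, not one of the homework datasets.

In [ ]:
# Sanity check of fit_and_plot_dt on synthetic data (not a homework dataset)
np.random.seed(0)
x_demo = np.random.uniform(0, 1, size=(500, 2))

# Label points inside the square [0.3, 0.7] x [0.3, 0.7] as vegetation
y_demo = ((x_demo[:, 0] > 0.3) & (x_demo[:, 0] < 0.7) &
          (x_demo[:, 1] > 0.3) & (x_demo[:, 1] < 0.7)).astype(int)

fig, ax = plt.subplots(1, 1, figsize=(5, 5))
ax = fit_and_plot_dt(x_demo, y_demo, 4, 'Depth-4 tree on a synthetic square', ax)
plt.show()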

Step 1: Load the data and explore

Let's load the four datasets and visualize the data. That is, let's plot the data points by latitude and longitude. Let's also color code the points: green if the location is vegetation and black otherwise.

In [7]:
# Load the data from satellite image #1 
sat_img_1 = pd.read_csv('datasets/dataset_1.txt', delimiter=',', header=None)

# Check out the data - sanity check
sat_img_1.head()
Out[7]:
          0         1    2
0  0.566809  0.788130  1.0
1  0.400046  0.620933  1.0
2  0.458702  0.536935  1.0
3  0.474504  0.638224  1.0
4  0.558707  0.715527  1.0
In [8]:
# The data looks ok, so let's load the rest of the images
sat_img_2 = pd.read_csv('datasets/dataset_2.txt', delimiter=',', header=None)
sat_img_3 = pd.read_csv('datasets/dataset_3.txt', delimiter=',', header=None)
sat_img_4 = pd.read_csv('datasets/dataset_4.txt', delimiter=',', header=None)

# Make a list of the four dataframes so we can iterate through them later
sat_images = [sat_img_1, sat_img_2, sat_img_3, sat_img_4]
In [9]:
# Plot the data in each dataframe as a subplot of a single figure
fig, ax = plt.subplots(1, 4, figsize=(20, 5))

# Iterate through the four images/dataframes
for i in range(4):
    # Get the long/lat coords
    x = sat_images[i].values[:, :-1]
    # Get the class labels
    y = sat_images[i].values[:, -1]
    
    # Plot vegetation locations as green dots
    ax[i].scatter(x[y == 1, 0], x[y == 1, 1], c='darkgreen', alpha=0.6, s=40, label='vegetation', edgecolors="None")
    # Plot non-vegetation locations as white dots
    ax[i].scatter(x[y == 0, 0], x[y == 0, 1], c='black', alpha=0.6, s=40, label='non vegetation', edgecolors="None")
    
    # Label everything
    ax[i].set_xlabel('latitude')
    ax[i].set_ylabel('longitude')
    ax[i].set_title('satellite image {}'.format(i + 1))
    ax[i].legend()
    ax[i].grid()

plt.tight_layout()
plt.show() 
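
Before fitting any models, here is a quick sanity-check sketch of the sample size and class balance in each image:

In [ ]:
# Quick sanity check: number of points and vegetation fraction per image
for i in range(4):
    y = sat_images[i].values[:, -1]
    print('image {}: {} points, {:.1%} vegetation'.format(i + 1, len(y), y.mean()))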

For each image, based on the sampled locations, we want to train a classifier that will classify a new location as vegetation or non-vegetation.

Since the vegetation seems to be clustered in isolated regions in each image, classifying locations with vegetation involves learning a boundary around each region: points inside the boundary are classified as vegetation, points outside as non-vegetation.

Formally speaking, the vegetation regions in an image can be identified by treating the latitude and longitude values as predictors and the vegetation information as a binary response, and fitting a classification model. The decision boundaries of these classifiers then allow us to identify the vegetation regions.
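
The general pattern is sketched below on image 1, with logistic regression standing in for an arbitrary sklearn classifier; the accuracy printed is training accuracy only.

In [ ]:
# General pattern: lat/long as predictors, vegetation flag as binary response
x = sat_img_1.values[:, :-1]   # predictors: normalized latitude and longitude
y = sat_img_1.values[:, -1]    # response: 1 = vegetation, 0 = non-vegetation

clf = linear_model.LogisticRegression(C=1000000)  # any sklearn classifier fits here
clf.fit(x, y)
print('training accuracy on image 1: {:.3f}'.format(clf.score(x, y)))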

Step 2: Classify locations with vegetation

How can we use the models we have studied so far to identify the boundaries of the vegetation regions?

Let's consider:

1. linear or polynomial logistic regression
2. linear or quadratic discriminant analysis
3. decision trees

Question 1:

Looking at the data, can you intuitively tell which model will perform best on which data set?

Answer 1:

  • DF 1: A good decision boundary for the cluster would be a square; that means a method that can produce rectangles as its decision boundaries is going to perform best. The family of classification trees can do exactly this.
  • DF 2: There are two square-shaped clusters; by the same reasoning, a classification tree should perform best.
  • DF 3: By the same reasoning, a decision tree should perform very well on this data.
  • DF 4: The decision boundary is a circle, which means that a polynomial logistic regression or a QDA should perform very well; a quick LDA/QDA sketch follows below.
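
To back up the intuition about discriminant analysis, here is a minimal sketch that fits LDA and QDA on each image and draws their decision boundaries with the plot_decision_boundary helper defined above:

In [ ]:
# Fit LDA and QDA on each image and plot their decision boundaries
fig, ax = plt.subplots(2, 4, figsize=(20, 10))

for i in range(4):
    x = sat_images[i].values[:, :-1]
    y = sat_images[i].values[:, -1]

    lda = discriminant_analysis.LinearDiscriminantAnalysis().fit(x, y)
    qda = discriminant_analysis.QuadraticDiscriminantAnalysis().fit(x, y)

    ax[0, i] = plot_decision_boundary(x, y, lda, 'LDA: satellite image {}'.format(i + 1), ax[0, i])
    ax[1, i] = plot_decision_boundary(x, y, qda, 'QDA: satellite image {}'.format(i + 1), ax[1, i])

plt.tight_layout()
plt.show()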

Question 2:

What do you think is the smallest depth decision tree that would provide a good fit of the vegetation boundaries in each case?

Answer 2:

  • DF 1: A tree with a depth of four should do the trick, as there are four edges to cover.
  • DF 2: I would guess 8, or slightly below.
  • DF 3: A depth of 4 should be sufficient.
  • DF 4: Because of the rectangular decision boundaries of trees, a very large number of splits is necessary. My guess would be around 15+. These guesses are checked in the sketch below.
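
The guessed depths can be checked directly with the fit_and_plot_dt helper; the list below just collects the guesses from Answer 2.

In [ ]:
# Check the guessed depths from Answer 2 with the fit_and_plot_dt helper
guessed_depths = [4, 8, 4, 15]

fig, ax = plt.subplots(1, 4, figsize=(20, 5))
for i in range(4):
    x = sat_images[i].values[:, :-1]
    y = sat_images[i].values[:, -1]
    ax[i] = fit_and_plot_dt(x, y, guessed_depths[i],
                            'Depth-{} tree: satellite image {}'.format(guessed_depths[i], i + 1),
                            ax[i])
plt.tight_layout()
plt.show()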

Linear logistic regression with C = 1,000,000 (C is the inverse regularization strength, so this is effectively no regularization)

In [10]:
# Plot the data in each dataframe as a subplot of a single figure
fig, ax = plt.subplots(1, 4, figsize=(20, 5))

# Create a logistic regression model with linear boundary
np.random.seed(123) # Set random seed
logreg = linear_model.LogisticRegression(C=1000000)

# Iterate through the four images/dataframes
for i in range(4):
    # Get the long/lat coords
    x = sat_images[i].values[:, :-1]
    # Get the class labels
    y = sat_images[i].values[:, -1]    
    
    # Fit our logistic regression model
    logreg.fit(x, y)
    
    # Change the bounds on the scatter plot (the 4th dataframe needs a larger frame)
    if i == 3:
        bounds = (-0.5, 1.5)
    else:
        bounds = (0, 1)
    
    # Plot the data along with the decision boundary learned by our model
    ax[i] = plot_decision_boundary(x, y, logreg, 
                                   'Logistic Regression (linear): satellite image {}'.format(i + 1), 
                                   ax[i], bounds)
    ax[i].grid()    

plt.tight_layout()
plt.show()
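
A straight line cannot enclose an isolated region, so the linear model should struggle on these images. A quick sketch of the training accuracies makes this concrete (training accuracy only, compared against majority-class guessing as a baseline):

In [ ]:
# Training accuracy of the linear logistic model vs. the majority-class baseline
for i in range(4):
    x = sat_images[i].values[:, :-1]
    y = sat_images[i].values[:, -1]
    logreg.fit(x, y)
    print('image {}: training accuracy {:.3f} (baseline {:.3f})'.format(
        i + 1, logreg.score(x, y), max(y.mean(), 1 - y.mean())))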

Polynomial logistic regression with C = 1,000,000 (effectively no regularization)

In [11]:
# Plot the data in each dataframe as a subplot of a single figure
fig, ax = plt.subplots(1, 4, figsize=(20, 5))

# Logistic Regression with Quadratic Terms
np.random.seed(123) # Set random seed
logreg_poly = linear_model.LogisticRegression(C=1000000)

# Iterate through the four images/dataframes
for i in range(4):
    #Get the long/lat coords
    x = sat_images[i].values[:, :-1]
    #Get the class labels
    y = sat_images[i].values[:, -1]  
    
    #Expand our predictor array with quadratic terms
    quad_features = preprocessing.PolynomialFeatures(degree = 2)
    x_expanded = quad_features.fit_transform(x)
    
    #Fit logistic regression model with quadratic decision boundary
    logreg_poly.fit(x_expanded, y)
    print(logreg_poly.coef_)
    
    #Plot the data along with the decision boundary learned by our model
    ax[i] = plot_decision_boundary(x, y, logreg_poly, 'Logistic Regression (quadratic): satellite image {}'.format(i + 1), ax[i], poly_flag=True)
    ax[i].grid()

plt.tight_layout()
plt.show()
[[-19.50055616  55.91712532  71.55150986 -57.24851099  18.98698705
  -58.38296017]]
[[ -7.2650532   26.5708788   41.99911503 -13.02731188 -35.11273308
  -26.74273522]]
[[  250.33015399 -1000.11678659 -1003.06259068     4.69227591
   1995.20829885     4.46298734]]
[[ -8.68797958  47.14766547  41.92751815 -47.47130911   0.70758635
  -43.6331506 ]]
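
For comparison, the same training-accuracy sketch for the quadratic model, reusing the expanded features:

In [ ]:
# Training accuracy of the quadratic logistic model on each image
for i in range(4):
    x = sat_images[i].values[:, :-1]
    y = sat_images[i].values[:, -1]
    x_expanded = preprocessing.PolynomialFeatures(degree=2).fit_transform(x)
    logreg_poly.fit(x_expanded, y)
    print('image {}: training accuracy {:.3f}'.format(i + 1, logreg_poly.score(x_expanded, y)))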