**Harvard University**

**Fall 2016**

**Instructors: W. Pan, P. Protopapas, K. Rader**

**Due Date:** Wednesday, November 2nd, 2016 at 11:59pm

Download the `IPython` notebook as well as the data file from Vocareum and complete locally.

To submit your assignment, in Vocareum, upload (using the 'Upload' button on your Jupyter Dashboard) your solution to Vocareum as a single notebook with the following file name format:

`last_first_CourseNumber_HW6.ipynb`

where `CourseNumber` is the course in which you're enrolled (CS 109a, Stats 121a, AC 209a). Submit your assignment in Vocareum using the 'Submit' button.

**Avoid editing your file in Vocareum after uploading. If you need to make a change to a solution, delete your old solution file from Vocareum and upload a new one. Click submit only ONCE after verifying that you have uploaded the correct file. The assignment will CLOSE after you click the submit button.**

Problems on homework assignments are equally weighted. The Challenge Question is required for AC 209A students and optional for all others. Students who complete the Challenge Problem as optional extra credit will receive +0.5% towards their final grade for each correct solution.

Import libraries

In [1]:

```
# Clear namespace
for name in dir():
    if not name.startswith('_'):
        del globals()[name]
```

In [2]:

```
# Data manipulation
import numpy as np
import pandas as pd
# Plotting
import matplotlib
import matplotlib.pyplot as plt
# Scientific computing
import scipy as sp
from scipy.stats import mode
# Machine learning
from sklearn import linear_model
from sklearn.cross_validation import train_test_split
from sklearn import discriminant_analysis
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegressionCV as LogRegCV
from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import confusion_matrix
%matplotlib inline
```

[Hagmann, Tim]

[CS 109a]

[-]

**All data sets can be found in the datasets folder and are in comma separated value (CSV) format**

In this problem, you will build a model to recommend movies using ratings from users.

The dataset for this problem is contained in `dataset_4_ratings.txt`. This dataset contains ratings from 100 users for 1000 movies. The first two columns contain the user and movie IDs. The last column contains a 1 if the user liked the movie, and 0 otherwise. Not every movie is rated by every user (i.e. some movies have more ratings than others).

The names of the movies corresponding to the IDs are provided in `dataset_4_movie_names.txt`.

One way of recommending movies is to recommend movies that are generally agreed upon to be good. But how do we measure the "goodness" or "likability" of a movie?

**Implementation:** Suppose we measure the "goodness" of a movie by the probability that it will be liked by a user, $P(\textbf{label} = \text{like}|\textbf{movie}) = \theta_{\text{movie}}$, and assume that each user independently rates a given movie according to the probability $\theta_{\text{movie}}$. Use a reasonable estimate of $\theta_{\text{movie}}$ to build a list of the top 25 movies that you would recommend to a new user. **Hint:** What does the likelihood function, $P(\textbf{likes} = k | \theta_{\text{movie}}, n, \textbf{movie})$, look like? What $\theta_{\text{movie}}$ will maximize the likelihood?

**Analysis:** Why is using $\theta_{\text{movie}}$ to rank movies more appropriate than using the total number of likes? Explain why your estimate of $\theta_{\text{movie}}$ is reasonable, and explain the potential drawbacks of estimating $\theta_{\text{movie}}$ this way. **Hint:** Under what conditions may models that maximize the likelihood be suboptimal? Do those conditions apply here?
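To make the hint concrete: with $n$ independent ratings and $k$ likes, the likelihood is binomial, and it is maximized at $\hat{\theta}_{\text{movie}} = k/n$. A quick numerical sanity check with toy counts (not taken from the dataset):

```python
import numpy as np
from scipy.stats import binom

# Hypothetical counts, not from the dataset: 36 likes out of 50 ratings
k, n = 36, 50

# Evaluate the binomial likelihood P(likes = k | theta, n) on a grid of theta
thetas = np.linspace(0.01, 0.99, 981)
likelihood = binom.pmf(k, n, thetas)

# The grid maximizer matches the closed-form MLE k/n
theta_hat = thetas[np.argmax(likelihood)]
```

The grid search recovers $\hat{\theta} = 36/50 = 0.72$, i.e. exactly the like-percentage.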

Let's get acquainted with our data.

In [3]:

```
ratings_df = pd.read_csv('datasets/dataset_4_ratings.txt', delimiter=',')
ratings_df.head()
```

Out[3]:

In [4]:

```
names_df = pd.read_csv('datasets/dataset_4_movie_names.txt', delimiter='|')
names_df.head()
```

Out[4]:

Note that some movie IDs appear in the `names` dataframe but not in the `ratings` dataframe. So we foresee a challenge later in cross-referencing a movie's ID with its name. After all, it'd be more helpful to see the names of the movies in the top 25!


Suppose that we simply count the number of likes for each movie and take the 25 movies with the most likes. Would this be a good idea?

To help you extract the movie information from the two databases, we have built you a helper function.

In [5]:

```
#-------- movie_stats
# A function that extracts the number of likes and total number of ratings for a movie
# Input:
#   movie_name (an optional parameter containing the exact name of the movie)
#   movie_name_contains (an optional parameter containing a portion of the name of the movie)
# Returns:
#   total_ratings (the total number of ratings for a movie)
#   likes (the total number of likes for a movie)
def movie_stats(movie_name=None, movie_name_contains=None):
    #If given an exact movie name:
    if movie_name is not None:
        #Find the index of the movie, by name, in the "names" dataframe
        movie_index = names_df[names_df['movie_name'] == movie_name].index[0]
    #Otherwise, if given a partial movie name:
    elif movie_name_contains is not None:
        #Find the index of the first matching movie in the "names" dataframe
        movie_index = names_df[names_df['movie_name'].str.contains(movie_name_contains)].index[0]
    else:
        return 0., 0.
    #Get the id for the movie in the "names" dataframe
    movie_id = names_df.loc[movie_index, 'movie_id']
    #Get all ratings for the movie, by id, in the "ratings" dataframe
    ratings_for_movie = ratings_df[ratings_df['movie_id'] == movie_id]
    #Count the total number of ratings
    total_ratings = len(ratings_for_movie)
    #Count the likes (the 1's)
    likes = ratings_for_movie['rating'].sum()
    return float(total_ratings), likes
```

Using the helper function above, let's check out the stats for a couple of movies:

- Toy Story
- Shawshank Redemption
- French Twist

In [6]:

```
total_ratings, likes = movie_stats(movie_name_contains='Toy Story')
print 'total number of ratings for Toy Story:', total_ratings
print 'number of likes for Toy Story:', likes
```

In [7]:

```
total_ratings, likes = movie_stats(movie_name_contains="Shawshank Redemption")
print 'total number of ratings for Shawshank Redemption:', total_ratings
print 'number of likes for Shawshank Redemption:', likes
```

In [8]:

```
total_ratings, likes = movie_stats(movie_name_contains='French Twist')
print 'total number of ratings for French Twist:', total_ratings
print 'number of likes for French Twist:', likes
```

Now, instead of using the total number of likes, we use the percentage of likes to rank the movies. Why is like-percentage a more reasonable metric for likability than the total number of likes? There is an intuitive, layman's explanation for this choice, as well as a formal probabilistic justification.
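Here is a toy illustration (made-up counts, not from the dataset) of why raw like counts can mislead: a heavily rated, mediocre movie beats a little-rated but well-loved one on counts, while like-percentage gets the order right:

```python
# (name, likes, total_ratings) -- hypothetical toy data
movies = [('Blockbuster', 60, 120),  # 50% liked, but many ratings
          ('Hidden Gem',  18,  20)]  # 90% liked, few ratings

# Ranking by raw like count favors the widely watched movie
by_count = max(movies, key=lambda m: m[1])[0]
# Ranking by like-percentage favors the better-loved movie
by_fraction = max(movies, key=lambda m: m[1] / float(m[2]))[0]
```

With these numbers, `by_count` is 'Blockbuster' while `by_fraction` is 'Hidden Gem'.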

So what do the top 25 movies look like when we rank by like-percentage?

In [9]:

```
#Make a list of movie names and their ratings info
likability = []
#Iterate through all the movie names
for name in names_df['movie_name'].values:
    #Get ratings info for movie
    total_ratings, likes = movie_stats(movie_name=name)
    #Add movie info to our list
    likability.append((name, likes, total_ratings, likes / total_ratings))
#Sort our list of movie info by like-percentage, in descending order
sorted_likability = sorted(likability, key=lambda t: t[3], reverse=True)
#Get the movies with top 25 like-percentage
top_25_movies = sorted_likability[:25]
#Print results of ranking
print 'Top 25 Movies'
print '****************************'
for movie, likes, total_ratings, likable in top_25_movies:
    print movie, ':', likable, '({}/{})'.format(likes, total_ratings)
```

Let's add a prior, $p(\theta_{\text{movie}})$, to our probabilistic model for movie rating. To keep things simple, we will restrict ourselves to using beta priors.

**Analysis:** How might adding a prior to our model benefit us in our specific task? Why are beta distributions appropriate priors for our application? **Hint:** Try visualizing the beta priors $a = b = 1$, $a = b = 0.5$, $a = b = 2$ and $a = 4, b = 2$, for example. What kind of plain-English prior beliefs about the movie does each beta pdf encode?

**Implementation/Analysis:** How does the choice of prior affect the posterior distribution of the 'likability' for the movies *Toy Story, Star Wars, The Shawshank Redemption, Down Periscope* and *Chain Reaction*? **Hint:** Use our posterior sampling function to visualize the posterior distribution.

**Implementation/Analysis:** How does the effect of the prior on the posterior distribution vary with the number of user ratings? **Hint:** Visualize the posterior distribution for different sizes of subsamples of user ratings for the movie *Star Wars*.

In the following, we've provided a couple of functions for visualizing beta priors and approximating their associated posteriors.

Recall that we can bias our models (if we don't like their results) by building in prior beliefs about the model parameters. In this case, our model parameter is the "likability" of a movie. So let's encode some prior beliefs about what we think are reasonable values for likability. To keep things simple, we will restrict ourselves to beta priors.

We have built for you a few helpful functions: one to help you visualize the pdf for various beta priors, one for approximating the posterior using samples, and one for calculating the mode of a distribution (given by some sampled values).

In [10]:

```
#-------- plot_beta_prior
# A function to visualize a beta pdf on a set of axes
# Input:
# a (parameter controlling shape of beta prior)
# b (parameter controlling shape of beta prior)
# color (color of beta pdf)
# ax (axes on which to plot pdf)
# Returns:
# ax (axes with plot of beta pdf)
def plot_beta_prior(a, b, color, ax):
    #Create a beta-distributed random variable with shape a, b
    rv = sp.stats.beta(a, b)
    #Create values from 0 to 1
    x = np.linspace(0, 1, 100)
    #Plot the beta pdf for values from 0 to 1
    ax.plot(x, rv.pdf(x), '-', lw=2, color=color, label='a=' + str(a) + ', b=' + str(b))
    #Set title, legend etc
    ax.set_title('Beta prior with a=' + str(a) + ', b=' + str(b))
    ax.legend(loc='best')
    return ax
```

In [11]:

```
#-------- sample_posterior
# A function that samples points from the posterior over a movie's
# likability, given a binomial likelihood function and beta prior
# Input:
# a (parameter controlling shape of beta prior)
# b (parameter controlling shape of beta prior)
# likes (the number of likes in likelihood)
# ratings (total number of ratings in likelihood)
# n_samples (number of samples to take from posterior)
# Returns:
# post_samples (a array of points from the posterior)
def sample_posterior(a, b, likes, ratings, n_samples):
    #Sample points from a beta distribution
    #(the posterior of a binomial likelihood and a beta prior is a beta distribution!)
    post_samples = np.random.beta(a + likes, b + ratings - likes, n_samples)
    return post_samples
```

In [12]:

```
#-------- find_mode
# A function that approximates the mode of a distribution given a sample from the distribution
# Input:
# values (samples from the distribution)
# num_bins (number of bins to use in approximating histogram)
# Returns:
# mode (the approximate mode of the distribution)
def find_mode(values, num_bins):
    #Make an approximation (histogram) of the distribution using the samples
    counts, edges = np.histogram(values, bins=num_bins)
    #Find the bin in the histogram with the max height
    max_height_index = np.argmax(counts)
    #Take the center of the bin with the max height as the approximate mode
    mode = (edges[max_height_index] + edges[max_height_index + 1]) / 2.
    return mode
```

In [13]:

```
#A list of beta distribution shapes to try out
beta_shapes = [(1, 1), (0.5, 0.5), (2, 2), (4, 2), (2, 4)]
#Length of the list of shapes
n = len(beta_shapes)
#Plot all the beta pdfs in a row
fig, ax = plt.subplots(1, n, figsize=(20, 4))
#Start the index of the current subplot at 0
ax_ind = 0
#Iterate through all the shapes
for a, b in beta_shapes:
    #Plot the beta pdf for a particular shape
    plot_beta_prior(a, b, 'darkblue', ax[ax_ind])
    #Increment the subplot index
    ax_ind += 1
plt.tight_layout()
plt.show()
```

Each beta prior encodes a commonsense belief about likability, and controls how strongly the likelihood influences the posterior.

So, which beta prior encodes an appropriate belief about likability? If this is a difficult question to answer directly, let's look at the effect of each prior on the posterior (i.e. the distribution of a movie's likability given a set of observed ratings).

Pick some movies from your top 25 and experiment with combining their ratings information and different priors.

The posterior for a binomial likelihood with a beta prior is again a beta distribution, in which the $a$ parameter is increased by the number of likes and the $b$ parameter is increased by the number of dislikes.
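Since this conjugate update has a closed form, the sampling-plus-histogram approach can be sanity-checked against it: a Beta$(a, b)$ prior yields a Beta$(a + \text{likes},\, b + \text{dislikes})$ posterior, whose mode is $(a + \text{likes} - 1)/(a + b + \text{ratings} - 2)$ when both updated parameters exceed 1. A check with toy counts (not from the dataset):

```python
import numpy as np

np.random.seed(0)
a, b = 2, 2              # beta prior
likes, ratings = 40, 50  # hypothetical toy counts

# Posterior is Beta(a + likes, b + (ratings - likes))
samples = np.random.beta(a + likes, b + ratings - likes, 100000)

# Closed-form posterior mode
closed_form_mode = (a + likes - 1.0) / (a + b + ratings - 2.0)

# Histogram-based mode (same idea as the find_mode helper)
counts, edges = np.histogram(samples, bins=50)
i = np.argmax(counts)
approx_mode = (edges[i] + edges[i + 1]) / 2.0
```

The histogram estimate lands within a bin width or two of the exact mode $41/52 \approx 0.788$.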

In [14]:

```
#Get the name of the first movie in the top 25 list
movie_name = top_25_movies[0][0]
#Get the ratings info for the first movie in the top 25 list
likes = top_25_movies[0][1]
total_ratings = top_25_movies[0][2]
likability = top_25_movies[0][3]
#Print movie info
print '{}: {} ({}/{})'.format(movie_name, likability, likes, total_ratings)
#Number of samples to use when approximating our posterior
n_samples = 10000
#Plot the posterior corresponding to each prior
fig, ax = plt.subplots(1, n, figsize=(20, 4))
#Start the index of the current subplot at 0
ax_ind = 0
#Iterate through all the shapes
for a, b in beta_shapes:
    #Draw samples from the posterior corresponding to a particular beta prior
    post_samples = sample_posterior(a, b, likes, total_ratings, n_samples)
    #Approximate the posterior with a histogram of these samples
    ax[ax_ind].hist(post_samples, bins=30, color='darkred', alpha=0.8, edgecolor='white')
    #Find the approximate mode of the posterior
    mode = find_mode(post_samples, 30)
    #Plot the mode as a vertical line
    ax[ax_ind].axvline(x=mode, linewidth=3, label='Posterior mode', color="darkblue")
    #Set title, legends etc
    ax[ax_ind].set_title('Posterior, with Beta prior (a={}, b={})'.format(a, b))
    ax[ax_ind].legend(loc='best')
    #Increment the subplot index
    ax_ind += 1
plt.tight_layout()
plt.show()
```

So, how do we interpret the posterior? I.e. what is it telling us about the likability of a particular movie?

How do we interpret the posterior mode? What does the mode tell us about the likability of a particular movie?

Does the posterior mode tell us the same thing about a movie's likability as the like-percentage? Which metric is more realistic or reasonable?

**Implementation:** Choose a reasonable beta prior, choose a reasonable statistic to compute from the posterior, and then build a list of the top 25 movies that you would recommend to a new user based on your chosen posterior statistic.

**Analysis:** How does your top 25 list compare with the list you obtained in part (a)? Which method of ranking is better?

**Analysis:** So far, our estimates of the 'likability' of a movie were based on the ratings provided by all users. What can be the drawback of this method? How can we improve the recommender system for individual users (if you feel up to the challenge, implement your improved system and compare it to the one you built above)?

Let's choose a reasonable beta prior and build a list of top 25 movies that you would recommend to a new user based on the posterior mode.

Why is this reasonable to do? There is an intuitive, layman's explanation for this as well as a formal statistical justification.
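One formal justification: ranking by the posterior mode is ranking by the MAP estimate, which under a Beta$(a, b)$ prior has the closed form $(a + k - 1)/(a + b + n - 2)$ and shrinks small-sample like-percentages toward the prior's preferred value. A toy comparison with hypothetical counts:

```python
def map_likability(likes, ratings, a=2.0, b=2.0):
    # Posterior mode of Beta(a + likes, b + ratings - likes);
    # requires a + likes > 1 and b + (ratings - likes) > 1
    return (a + likes - 1.0) / (a + b + ratings - 2.0)

# 1 like out of 1 rating: like-percentage is 100%, but the MAP is pulled
# back toward the prior mode of 0.5
one_of_one = map_likability(1, 1)      # (2+1-1)/(2+2+1-2) = 2/3
# 90 likes out of 100 ratings: plenty of data, so the MAP stays near 90%
ninety_of_hundred = map_likability(90, 100)  # 91/102
```

A movie liked 1/1 times no longer outranks a movie liked 90/100 times, even though its raw like-percentage is higher.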

In [15]:

```
#Choose a beta prior that encodes a reasonable belief about likability
a = 2
b = 2
#Make a list of movie names and their ratings info
likability = []
#Iterate through all the movie names
for name in names_df['movie_name'].values:
    #Get ratings info for movie
    total_ratings, likes = movie_stats(movie_name=name)
    #Approximate the posterior given the ratings info and the prior
    post_samples = sample_posterior(a, b, likes, total_ratings, n_samples)
    #Approximate posterior mode
    mode = find_mode(post_samples, 30)
    #Add movie info to our list
    likability.append((name, likes, total_ratings, mode))
#Sort our list of movie info by posterior mode, in descending order
sorted_likability = sorted(likability, key=lambda t: t[3], reverse=True)
#Get the movies with top 25 posterior mode
top_25_movies = sorted_likability[:25]
#Print results of ranking
print 'Top 25 Movies'
print '****************************'
for movie, likes, total_ratings, likable in top_25_movies:
    print movie, ':', likable, '({}/{})'.format(likes, total_ratings)
```

So while "Shawshank Redemption" is still highly ranked, "French Twist" no longer appears on our top 25. This at least seems more intuitive.

So, in what important ways is our top 25 list using the posterior mode different from our top 25 list using the like-percentage? Which list is, generally speaking, more realistic and why?

In this problem you'll work with some neighborhood demographics of a region in Boston from the years 2000 to 2010.

The data you need are in the files `dataset_1_year_2000.txt`, ..., `dataset_1_year_2010.txt`. The first two columns of each dataset contain the adjusted latitude and longitude of some randomly sampled houses. The last column contains the economic status of a household:

0: low-income,

1: middle-class,

2: high-income

Due to the migration of people in and out of the city, the distribution of each economic group over this region changes over the years. The city of Boston estimates that in this region there is approximately a 25% yearly increase in high-income households; and a 25% decrease in the remaining population, with the decrease being roughly the same amongst both the middle class and lower income households.
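To get a feel for what these rates imply, we can roll the class proportions forward year by year. The even 2000 split, the per-class multipliers, and the renormalization below are one literal reading of the city's estimate (assumptions, not given in the data):

```python
import numpy as np

# Assumed even split across low / middle / high income in 2000
props = np.array([1 / 3.0, 1 / 3.0, 1 / 3.0])
# -25% low, -25% middle, +25% high per year (one reading of the stated rates)
growth = np.array([0.75, 0.75, 1.25])

proportions_by_year = [props]
for _ in range(10):  # 2001 .. 2010
    props = props * growth
    props = props / props.sum()  # renormalize to a probability vector
    proportions_by_year.append(props)
```

Under this reading, by 2010 the region is overwhelmingly high-income, which is why a classifier with fixed year-2000 class priors should degrade over time.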

Your task is to build a model for the city of Boston that is capable of predicting the economic status of a household based on its geographical location. Furthermore, your method of prediction must be accurate over time (through 2010 and beyond).

**Hint:** look at data only from 2000, and consider using both Linear Discriminant Analysis (LDA) and Logistic Regression. Is there a reason one method would be more suited than the other for this task?

**Hint:** how well do your two models do over the years? Is it possible to make use of the estimated yearly changes in the proportions of the three demographic groups to improve the predictive accuracy of each model over the years?
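One way to act on this hint: sklearn's `LinearDiscriminantAnalysis` accepts a `priors` argument, so a model can be refit (on the same year-2000 data) with class priors matching a later year's estimated mix. A minimal sketch on synthetic two-class data (the numbers here are illustrative only):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
# Synthetic 2-class stand-in for the location data
x = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Default: priors estimated from the (balanced) training proportions
lda_flat = LinearDiscriminantAnalysis().fit(x, y)

# Same training data, but priors set to a later year's estimated mix
lda_shifted = LinearDiscriminantAnalysis(priors=[0.2, 0.8]).fit(x, y)

# Shifting the prior toward class 1 moves the decision boundary away from
# class 1's mean, so a point at the midpoint is pulled toward class 1
midpoint = np.array([[1.0, 1.0]])
flat_prob = lda_flat.predict_proba(midpoint)[0, 1]
shifted_prob = lda_shifted.predict_proba(midpoint)[0, 1]
```

Shifting the prior toward a class moves the decision boundary away from that class's mean, which is exactly the adjustment the yearly proportion estimates suggest.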

To help you visually interpret and assess the quality of your classifiers, we are providing you a function to visualize a set of data along with the decision boundaries of a classifier.

In [16]:

```
#-------- scatter_plot
# A function that visualizes the data
# Input:
#   x (variable on the x-axis)
#   y (variable on the y-axis)
#   df (pandas data frame)
#   group (grouping variable, i.e., variable to plot)
#   col (vector of colors)
#   alpha (alpha value for colors)
#   size (size of the dots)
#   title (title of the plot)
#   marker (shape of the marker, defaults to 'o')
#   add_line (add a line plot)
#   linewidth (thickness of the added line)
#   figsize (size of the figure)
def scatter_plot(x, y, df, group, col, alpha, size, title, marker='o', add_line=False, linewidth=1, figsize=(15, 5)):
    # Define plot
    fig = plt.figure(figsize=figsize)
    ax = plt.axes()
    # Extract unique categories
    cat = df[group].unique()
    # Loop through categories
    for i in range(0, len(cat)):
        ax.scatter(df[x][df[group] == cat[i]].values,
                   df[y][df[group] == cat[i]].values,
                   c=col[i], alpha=alpha, edgecolors="None", s=size,
                   label=cat[i],
                   marker=marker)
    # Add line plot
    if add_line:
        for i in range(0, len(cat)):
            ax.plot(df[x][df[group] == cat[i]].values,
                    df[y][df[group] == cat[i]].values,
                    c=col[i], alpha=alpha, linewidth=linewidth)
    # Add legend, grid etc.
    ax.legend(loc=0, scatterpoints=1)  # Legend with just one dot
    ax.set_xlabel(x); ax.set_ylabel(y)
    ax.set_title(title)
    plt.grid()
```

In [17]:

```
#-------- plot_decision_boundary
# A function that visualizes the data and the decision boundaries
# Input:
#   x (predictors)
#   y (labels)
#   model (the fitted classifier)
#   poly_flag (a boolean parameter, uses quadratic features if true, otherwise linear)
#   title (title for plot)
#   ax (a set of axes to plot on)
# Returns:
#   ax (axes with data and decision boundaries)
def plot_decision_boundary(x, y, model, poly_flag, title, ax):
    # Plot data
    ax.scatter(x[y == 0, 0], x[y == 0, 1], c='darkred', alpha=0.4, edgecolors="None", s=20)
    ax.scatter(x[y == 1, 0], x[y == 1, 1], c='black', alpha=0.4, edgecolors="None", s=20)
    ax.scatter(x[y == 2, 0], x[y == 2, 1], c='darkblue', alpha=0.4, edgecolors="None", s=20)
    # Create mesh
    interval = np.arange(0, 1, 0.01)
    n = np.size(interval)
    x1, x2 = np.meshgrid(interval, interval)
    x1 = x1.reshape(-1, 1)
    x2 = x2.reshape(-1, 1)
    xx = np.concatenate((x1, x2), axis=1)
    # Predict on mesh points
    if poly_flag:
        quad_features = preprocessing.PolynomialFeatures(degree=2)
        xx = quad_features.fit_transform(xx)
    yy = model.predict(xx)
    yy = yy.reshape((n, n))
    # Plot decision surface
    x1 = x1.reshape(n, n)
    x2 = x2.reshape(n, n)
    ax.contourf(x1, x2, yy, alpha=0.1, colors=["darkred", "black", "darkblue", "black", "white"])
    # Label axes, set title
    ax.set_title(title)
    ax.set_xlabel('Latitude')
    ax.set_ylabel('Longitude')
    ax.grid()
    return ax
```

In [18]:

```
## Load the data
df_demo = None
years = range(2000, 2011)
# Loop through all the different years
for i in range(0, len(years)):
    path = 'datasets/dataset_1_year_' + str(years[i]) + '.txt'
    csv = pd.read_csv(path, delimiter=' ', header=None, names=['longitude', 'latitude', 'econ_id'])
    csv['year'] = years[i]
    # The first iteration is special
    if df_demo is None:
        df_demo = csv
    else:
        df_demo = df_demo.append(csv)
print "first 5 entries of the overall dataframe:"
df_demo[0:5]
```

Out[18]:

The next step is to add the label variable of the economic status to the id variable:

In [19]:

```
# Create income data frame with the corresponding labels
df_econ = pd.DataFrame({'econ_id': [0, 1, 2],
                        'economic_status': ['low-income', 'middle-class', 'high-income']})
print "created helper table of the economic status:"
df_econ
```

Out[19]:

In [20]:

```
df_econ
```

Out[20]:

In [21]:

```
# Left join
df_demo = pd.merge(df_demo, df_econ, how='left', on='econ_id')
# Show the data
print "first 5 entries of the new dataframe with the corresponding labels:"
df_demo[0:5]
```

Out[21]:

The next step is exploring the data at hand.

In [22]:

```
# Select the years 2000 and 2010
df_demo_2000 = df_demo[df_demo['year'] == 2000]
df_demo_2010 = df_demo[df_demo['year'] == 2010]
# Print the number of rows
print "The demographic data of Boston in the year 2000 consists of", df_demo_2000.shape[0], "observations."
```

In [23]:

```
# Plot the data
col = ["black", "darkred", "darkblue"]
scatter_plot(x='longitude', y='latitude', df=df_demo_2000, group='economic_status',
             col=col, alpha=0.6, size=40, marker='o',
             title="Plot I: Economic status in Boston in the year 2000")
```

In [24]:

```
# Aggregate
df_econ_years = pd.DataFrame(df_demo.groupby(['year', 'economic_status'], as_index=False)['longitude'].count())
df_econ_years.columns = ["year", "economic_status", "count"]
print "Aggregated economic status over the years:"
df_econ_years[0:4]
```

Out[24]:

In [25]:

```
# Plot the data over the years
scatter_plot(x='year', y='count', df=df_econ_years, group='economic_status',
             col=["black", "darkblue", "darkred"], alpha=0.6, size=80,
             title="Plot II: Economic status in Boston over the years", add_line=True, linewidth=4)
```

In order to analyse the data, three models are being fitted:

- Polynomial logistic regression (with L2 regularization)
- Linear Discriminant Analysis (LDA)
- Adjusted Linear Discriminant Analysis (LDA) with an estimate of the yearly changes

For models 1 and 2 the assumption is that there are the same numbers of households in each class and that the parameters stay stable over the years. This ceteris paribus assumption is going to be relaxed later on. Nevertheless, the first model is built on the year 2000. In order to be able to check for overfitting, this 2000 dataset is split into a training and a testing set.

In [26]:

```
# Extracting the y and x variables
y = df_demo_2000['econ_id'].values
x = df_demo_2000[['longitude', 'latitude']].values
# Create testing and training set (70% train, 30% test)
np.random.seed(123)  # Set random seed
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
# Create polynomial features
poly = PolynomialFeatures(degree=2)
x_poly_train = poly.fit_transform(x_train)
x_poly_test = poly.transform(x_test)
```

In [27]:

```
# Create logistic regression object
# C = 1/lambda --> tuning parameter for regularization --> large C --> weak regularization
np.random.seed(123)  # Set random seed
logitm_cv = LogRegCV(Cs=200, cv=10, penalty="l2")
logitm_cv.fit(x_poly_train, y_train)
```

Out[27]:

In [28]:

```
# Calculate scores
cv_scores = logitm_cv.scores_[1.0]
mean_score = np.mean(cv_scores, axis=0)
```

In [29]:

```
# Plotting
plt.figure(figsize=(15, 5))
plt.semilogx(logitm_cv.Cs_, mean_score, color='darkblue', linewidth=3)
plt.title('Plot III: CV Parameters')
plt.xlabel('Regularisation parameter'); plt.ylabel('Mean score')
plt.grid()
```

In [30]:

```
# Find the optimal c-value (regularisation parameter)
print 'Optimal c-value: ', logitm_cv.C_[0]
print 'Mean score: ', max(mean_score)
```

In [31]:

```
# Build logistic model with the above c-value
logitm_model = LogReg(C = logitm_cv.C_[0], penalty = "l2")
logitm_model.fit(x_poly_train, y_train)
# Create LDA model
lda_model = LDA()
lda_model.fit(x_train, y_train)
# Create QDA model
qda_model = QDA()
qda_model.fit(x_train, y_train)
```

Out[31]:

In [32]:

```
# Predicting on the test set
y_pred_log = logitm_model.predict(x_poly_test)
y_pred_lda = lda_model.predict(x_test)
y_pred_qda = qda_model.predict(x_test)
```

In [33]:

```
print "Confusion matrix (logistic model):"
pd.crosstab(y_test, y_pred_log, rownames=['True'], colnames=['Predicted'], margins=True)
```

Out[33]:

In [34]:

```
print "Confusion matrix (lda):"
pd.crosstab(y_test, y_pred_lda, rownames=['True'], colnames=['Predicted'], margins=True)
```

Out[34]:

In [35]:

```
print "Confusion matrix (qda):"
pd.crosstab(y_test, y_pred_qda, rownames=['True'], colnames=['Predicted'], margins=True)
```

Out[35]:

In [36]:

```
# Scoring
log_model_score = logitm_model.score(x_poly_test, y_test)
lda_model_score = lda_model.score(x_test, y_test)
qda_model_score = qda_model.score(x_test, y_test)
```

In [37]:

```
# Plotting the decision boundaries
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
plot_decision_boundary(x_test, y_test, logitm_model, True, 'Plot IV - Logistic score:' + str(round(log_model_score, 3)), ax1)
plot_decision_boundary(x_test, y_test, lda_model, False, 'Plot V - LDA score:' + str(round(lda_model_score, 3)), ax2)
plot_decision_boundary(x_test, y_test, qda_model, False, 'Plot VI - QDA score:' + str(round(qda_model_score, 3)), ax3)
plt.tight_layout()
print "Boston: Year 2000"
```

In [38]:

```
# Extracting the y and x variables
y = df_demo_2010['econ_id'].values
x = df_demo_2010[['longitude', 'latitude']].values
# Create polynomial effects
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
```

In [39]:

```
# Predicting on the test set
y_pred_log = logitm_model.predict(x_poly)
y_pred_lda = lda_model.predict(x)
y_pred_qda = qda_model.predict(x)
```

In [40]:

```
# Scoring
log_model_score = logitm_model.score(x_poly, y)
lda_model_score = lda_model.score(x, y)
qda_model_score = qda_model.score(x, y)
```

In [41]:

```
# Plotting the decision boundaries
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
print "Boston: Year 2010"
plot_decision_boundary(x, y, logitm_model, True, 'Plot VII - Logistic score:' + str(round(log_model_score, 3)), ax1)
plot_decision_boundary(x, y, lda_model, False, 'Plot VIII - LDA score:' + str(round(lda_model_score, 3)), ax2)
plot_decision_boundary(x, y, qda_model, False, 'Plot IX - QDA score:' + str(round(qda_model_score, 3)), ax3)
plt.tight_layout()
```

In [42]:

```
# Loop through the model scores over all the years
score = []
for year in range(2000, 2011):
    df = df_demo[df_demo['year'] == year]
    # Extracting the y and x variables
    y = df['econ_id'].values
    x = df[['longitude', 'latitude']].values
    # Create polynomial features
    poly = PolynomialFeatures(degree=2)
    x_poly = poly.fit_transform(x)
    # Scoring
    log_model_score = logitm_model.score(x_poly, y)
    lda_model_score = lda_model.score(x, y)
    qda_model_score = qda_model.score(x, y)
    score.append([year, log_model_score, lda_model_score, qda_model_score])
# Create np array
score = np.array(score)
```

In [43]:

```
# Plot
plt.figure(figsize = (18, 8))
width = 0.25
years = np.arange(2000, 2011)
plt.bar(score[:,0], score[:, 1], width, color='darkblue', label='Logistic regression', edgecolor='white', alpha = 0.8)
plt.bar(score[:,0] + width, score[:, 2], width, color='black', label='LDA', edgecolor='white', alpha = 0.8)
plt.bar(score[:,0] + 2 * width, score[:, 3], width, color='darkred', label='QDA', edgecolor='white', alpha = 0.8)
plt.legend(loc='best')
plt.xlabel('Year')
plt.ylabel('Prediction accuracy')
plt.xticks(years + 1.5*width , years);
plt.grid()
plt.title('Plot X: Predicting all the years')
plt.show()
```

In [44]:

```
# Starting even distribution:
start = np.array([1/3., 1/3., 1/3.])
multi = np.array([1, 1, 1.25])
sampling_probability = [start * multi**i / np.sum(start * multi**i) for i in range(0, 11)]
```

In [45]:

```
# Loop through the model scores over all the years
import copy
score = []
for year in range(2000, 2011):
    df = df_demo[df_demo['year'] == year]
    # Extracting the y and x variables
    y = df['econ_id'].values
    x = df[['longitude', 'latitude']].values
    # Create polynomial features
    poly = PolynomialFeatures(degree=2)
    x_poly = poly.fit_transform(x)
    # Copy the fitted models so the originals stay untouched, then update the
    # class priors for the current year
    sp = np.array(sampling_probability[year - 2000])
    lda_model_prop = copy.deepcopy(lda_model)
    # The prior enters LDA's discriminant through the intercept (log pi_k),
    # so shift the intercept relative to the fitted priors
    lda_model_prop.intercept_ = lda_model.intercept_ + np.log(sp) - np.log(lda_model.priors_)
    qda_model_prop = copy.deepcopy(qda_model)
    qda_model_prop.priors_ = sp
    # Scoring
    log_model_score = logitm_model.score(x_poly, y)
    lda_model_score = lda_model.score(x, y)
    lda_model_prop_score = lda_model_prop.score(x, y)
    qda_model_score = qda_model.score(x, y)
    qda_model_prop_score = qda_model_prop.score(x, y)
    score.append([year, log_model_score, lda_model_score, lda_model_prop_score, qda_model_score, qda_model_prop_score])
# Create np array
score = np.array(score)
```

In [46]:

```
# Plot
plt.figure(figsize = (18, 8))
width = 0.25
years = np.arange(2000, 2011)
plt.bar(score[:,0], score[:, 1], width, color='darkblue', label='Logistic regression', edgecolor='white', alpha = 0.8)
plt.bar(score[:,0] + width, score[:, 3], width, color='black', label='LDA (with priors)', edgecolor='white', alpha = 0.8)
plt.bar(score[:,0] + 2 * width, score[:, 5], width, color='darkred', label='QDA (with priors)', edgecolor='white', alpha = 0.8)
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('Prediction accuracy')
plt.xticks(years + 1.5 * width , years);
plt.grid()
plt.title('Plot X: Predicting all the years')
plt.show()
```