Homework 4: SVMs & Return of the Bayes
Advanced Topics in Data Science II
Harvard University, Spring 2017
In this problem, the task is to classify a celestial object into one of 4 categories using photometric measurements recorded about the object. The training and testing datasets are provided in dataset_1_train.txt and dataset_1_test.txt, respectively. Overall, there are a total of 1,379 celestial objects described by 61 attributes. The last column contains the object category we wish to predict, Class.
In the following code chunk all the necessary setup for the modelling environment is done.
## Options
options(scipen = 10) # Disable scientific notation
update_package <- FALSE # Use old status of packages
## Init files (always execute, eta: 10s)
source("scripts/01_init.R") # Helper functions to load packages
source("scripts/02_packages.R") # Load all necessary packages
## Set Java path to C:/Program Files (x86)/Java/jre7
source("scripts/03_functions.R") # Load project specific functions
## Read data
df_train <- data.frame(read_csv("data/dataset_1_train.txt"))
df_test <- data.frame(read_csv("data/dataset_1_test.txt"))
# Transform y variable to factor
df_train$Class <- as.factor(df_train$Class)
df_test$Class <- as.factor(df_test$Class)
Fit an RBF kernel to the training set with parameters gamma and cost both set to 1. Use the model to predict on the test set.
fit_svm <- svm(Class ~ ., data=df_train, gamma=1, cost=1, kernel="radial")
pred_svm_train <- predict(fit_svm, df_train)
pred_svm_test <- predict(fit_svm, df_test)
Look at the confusion matrices for both the training and testing predictions from the above model. What do you notice about the predictions from this model?
conf_train <- confusionMatrix(pred_svm_train, df_train$Class)
pander(conf_train$table)
 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
1 | 44 | 0 | 0 | 0 |
2 | 0 | 72 | 0 | 0 |
3 | 0 | 0 | 492 | 0 |
4 | 0 | 0 | 0 | 81 |
pander(conf_train$overall)
Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull |
---|---|---|---|---|
1 | 1 | 0.9947 | 1 | 0.7141 |
AccuracyPValue | McnemarPValue |
---|---|
1.702e-101 | NA |
The confusion matrix on the training set shows an accuracy of 100%, i.e., the model classifies every training observation into the correct one of the 4 classes. This is an indication of overfitting. To check this, let's look at the test data.
conf_test <- confusionMatrix(pred_svm_test, df_test$Class)
pander(conf_test$table)
 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 |
3 | 46 | 73 | 499 | 72 |
4 | 0 | 0 | 0 | 0 |
pander(conf_test$overall)
Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull |
---|---|---|---|---|
0.7232 | 0 | 0.6882 | 0.7563 | 0.7232 |
AccuracyPValue | McnemarPValue |
---|---|
0.5195 | NA |
rm(fit_svm, pred_svm_test, pred_svm_train)
The above table shows that the accuracy drops sharply from 100% to 72.3%, which supports the conclusion that the model is overfit. Furthermore, every test observation is classified as class 3, i.e., the majority class. The model therefore acts as a 'naive' majority-class classifier.
For the RBF kernel, make a figure showing the effect of the kernel parameter \(\gamma\) on the training and test errors. Consider some values of gamma between 0.001 and 0.3. Explain what you are seeing.
# Set values
gamma_values <- seq(0.001, 0.3, length.out=100)
test_errors <- rep(0., length(gamma_values))
train_errors <- rep(0., length(gamma_values))
# Loop through each gamma value
for (i in 1:length(gamma_values)) {
fit <- svm(Class ~ ., gamma=gamma_values[i], cost=1, data=df_train, kernel='radial')
pred_train <- predict(fit, newdata=df_train)
pred_test <- predict(fit, newdata=df_test)
train_errors[i] <- classError(pred_train, df_train$Class)$errorRate
test_errors[i] <- classError(pred_test, df_test$Class)$errorRate
}
# Wrap in a dataframe
df_err <- data.frame(gamma_values, test_errors, train_errors)
rm(fit, gamma_values, i, pred_test, pred_train, test_errors, train_errors, conf_test, conf_train)
# Identify the gamma with the lowest test error
best_gamma <- df_err$gamma_values[which(df_err$test_errors == min(df_err$test_errors))]
# Print output
cat('Gamma with lowest test error:', best_gamma)
## Gamma with lowest test error: 0.007040404
cat('Test error:', min(df_err$test_errors))
## Test error: 0.02898551
# Tidy data (long format: one row per gamma value and data set)
df_err <- gather(df_err, key="data", value="error", -gamma_values)
# Plot
ggplot(data=df_err, aes(x=gamma_values, y=error, color=data)) +
geom_point(size=0.9) +
geom_line() +
scale_colour_manual(values=c("darkred","black")) +
ylab("Error rate") +
xlab("Gamma value") +
ggtitle("Plot I: SVM train and test error rate by gamma value") +
theme_bw()
The above plot shows that initially, as \(\gamma\) increases, the error rate on both the training and the test data decreases. However, beyond a threshold of roughly \(\gamma = 0.007\) the test error starts to increase up to a plateau of around 30%, while the training error continues to descend until it reaches 0%. That means that beyond this threshold the SVM starts overfitting the data.
Technically speaking, \(\gamma\) is the free parameter of the Gaussian radial basis function (RBF) kernel.
\[K(x_i, x_j) = \exp\left(-\gamma \, \lVert x_i - x_j \rVert^2\right), \qquad \gamma > 0\]
A small \(\gamma\) corresponds to a Gaussian with a large variance, so the influence of \(x_j\) reaches further: if \(x_j\) is a support vector, a small \(\gamma\) implies that this support vector influences the classification of \(x_i\) even if the distance between them is large. If \(\gamma\) is large, the variance is small and each support vector has only a localised influence, which makes the decision boundary very flexible. In other words, a large \(\gamma\) leads to low-bias, high-variance models that are prone to overfitting, and vice versa.
On the available data, with cost set to 1, a small \(\gamma\) appears to perform better on out-of-sample data.
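To make the role of \(\gamma\) concrete, here is a minimal sketch (not part of the original analysis; the two feature vectors are made up purely for illustration) that evaluates the RBF kernel at a small and a large \(\gamma\):
# Minimal illustration of the RBF kernel on two hypothetical feature vectors
rbf_kernel <- function(x_i, x_j, gamma) exp(-gamma * sum((x_i - x_j)^2))
x_i <- c(1, 2, 3)
x_j <- c(2, 0, 4)
rbf_kernel(x_i, x_j, gamma=0.001) # ~0.99: even distant points still look "similar"
rbf_kernel(x_i, x_j, gamma=0.3)   # ~0.17: similarity decays quickly with distance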
For the RBF kernel, make a figure showing the effect of the cost parameter on the training and test errors. Consider some values of cost in the range of 0.1 to 20. Explain what you are seeing.
# Initiate list with costs and lists to store error rates
cost_values <- seq(0.1, 20, length.out=50)
test_errors <- rep(0., length(cost_values))
train_errors <- rep(0., length(cost_values))
# Loop through each cost value, fitting an SVM and calculating training and test error rates
for (i in 1:length(cost_values)) {
fit <- svm(Class ~ ., gamma=best_gamma, cost=cost_values[i], data=df_train, kernel='radial')
pred_train <- predict(fit, newdata=df_train)
pred_test <- predict(fit, newdata=df_test)
train_errors[i] <- classError(pred_train, df_train$Class)$errorRate
test_errors[i] <- classError(pred_test, df_test$Class)$errorRate
}
df_err <- data.frame(cost_values, test_errors, train_errors)
rm(fit, cost_values, i, pred_test, pred_train, test_errors, train_errors)
# Identify the cost with the lowest test error
best_cost <- df_err$cost_values[which(df_err$test_errors == min(df_err$test_errors))]
# Print output
cat('cost with lowest test error:', best_cost)
## cost with lowest test error: 2.130612
cat('Test error:', min(df_err$test_errors))
## Test error: 0.02028986
# Tidy data (long format: one row per cost value and data set)
df_err <- gather(df_err, key="data", value="error", -cost_values)
# Plot
ggplot(data=df_err, aes(x=cost_values, y=error, color=data)) +
geom_point(size=0.9) +
geom_line() +
scale_colour_manual(values=c("darkred","black")) +
ylab("Error rate") +
xlab("Cost value") +
ggtitle("Plot II: SVM train and test error rate by cost value") +
theme_bw()
Similar to Plot I, the error rate initially decreases on both the training and the test data. The test error reaches its minimum at a cost of about 2.13, plateaus, and starts increasing again at a cost of around 8, while the training error continues decreasing towards zero.
In general, a standard (hard-margin) SVM seeks a margin that separates all positive and negative examples. As this can lead to poorly fitting models, soft margins are used, which allow some examples to be ignored or placed on the wrong side of the margin. The cost parameter C controls how heavily such margin violations are penalised, i.e., it trades error penalty against model stability.
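As an illustration of this trade-off (a sketch reusing best_gamma from above and e1071::svm, not part of the original analysis), the number of support vectors can be compared for a very soft and a rather hard margin; with a small cost, margin violations are cheap and more points end up as support vectors:
# Sketch: compare the number of support vectors for a soft vs. a hard margin
fit_soft <- svm(Class ~ ., data=df_train, kernel="radial", gamma=best_gamma, cost=0.1)
fit_hard <- svm(Class ~ ., data=df_train, kernel="radial", gamma=best_gamma, cost=20)
fit_soft$tot.nSV # small cost: wide margin, typically many support vectors
fit_hard$tot.nSV # large cost: violations penalised heavily, typically fewer
rm(fit_soft, fit_hard)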
Fit SVM models with the linear, polynomial (degree 2) and RBF kernels to the training set, and report the misclassification error on the test set for each model. Do not forget to tune all relevant parameters using 5-fold cross-validation on the training set (tuning may take a while!).
The following parameter values are informed by the optimization already performed in 1-4. The gamma vector is used only for the RBF kernel.
# Create a cost vector
cost_vect <- seq(0.001, 5, length.out=20)
# Create a gamma vector (for RBF only)
gamma_vect <- seq(0.0001, 0.2, length.out=20)
rm(best_cost, best_gamma, df_err)
# 5-fold crossvalidation
set.seed(123)
fit_lin_cv <- tune(svm,
Class ~ .,
data=df_train,
kernel="linear",
tunecontrol=tune.control(sampling="cross", cross=5),
ranges=list(cost=cost_vect))
# Build model
fit_lin <- svm(Class ~ .,
data=df_train,
kernel="linear",
cost=fit_lin_cv$best.parameters$cost)
# Prediction
pred_lin_test <- predict(fit_lin, df_test)
# 5-fold crossvalidation to tune the cost
set.seed(123)
fit_poly_cv <- tune(svm,
Class ~ .,
data=df_train,
kernel="polynomial",
degree=2,
tunecontrol=tune.control(sampling="cross", cross=5),
ranges=list(cost=cost_vect))
# Build model
fit_poly <- svm(Class ~ .,
data=df_train,
kernel="polynomial",
cost=fit_poly_cv$best.parameters$cost)
# Prediction
pred_poly_test <- predict(fit_poly, df_test)
# 5-fold crossvalidation to tune the cost
set.seed(123)
fit_radial_cv <- tune(svm,
Class ~ .,
data=df_train,
kernel="radial",
tunecontrol=tune.control(sampling="cross", cross=5),
ranges=list(cost=cost_vect,
gamma=gamma_vect))
# Build model
fit_radial <- svm(Class ~ .,
data=df_train,
kernel="radial",
cost=fit_radial_cv$best.parameters$cost,
gamma=fit_radial_cv$best.parameters$gamma)
# Prediction
pred_radial_test <- predict(fit_radial, df_test)
# Radial kernel: for each cost, keep the gamma with the lowest CV error
df_rad_cv <- fit_radial_cv$performances
df_opt <- data.frame(cost=numeric(0), gamma=numeric(0), error=numeric(0))
for(i in 1:length(cost_vect)){
df_rad_cv_in <- df_rad_cv[df_rad_cv$cost == cost_vect[i], 1:3]
best_row <- which.min(df_rad_cv_in$error)
df_opt[i, "cost"] <- df_rad_cv_in$cost[best_row]
df_opt[i, "gamma"] <- df_rad_cv_in$gamma[best_row]
df_opt[i, "error"] <- df_rad_cv_in$error[best_row]
}
# Tidy dataframe
df_cv <- data.frame(cost=cost_vect,
linear_cv_error=fit_lin_cv$performances$error,
poly_cv_error=fit_poly_cv$performances$error,
radial_cv_error=df_opt$error)
# Long format: one row per cost value and kernel
df_cv <- gather(df_cv, key="model", value="cv_error", -cost)
rm(df_rad_cv, df_rad_cv_in, i)
ggplot(data=df_cv, aes(x=cost, y=cv_error, color=model)) +
geom_point(size=0.9) +
geom_line() +
ggtitle("Plot III: Crossvalidation - Cost vs. Error") +
scale_colour_manual(values=c("black","darkred","darkgrey")) +
ylab("Cross-validation error") +
xlab("Cost (for radial kernel with optimal gamma)") +
theme_bw()
The above plot shows the cross-validation error rate as a function of the cost parameter for the three kernels. For the radial kernel, the error shown at each cost is the one obtained with the optimal \(\gamma\) value.
What is the best model in terms of testing accuracy? How does your final model compare with a naive classifier that predicts the most common class (3) on all points?
# Confusion matrices (testing)
conf_lin <- confusionMatrix(pred_lin_test, df_test$Class)
conf_poly <- confusionMatrix(pred_poly_test, df_test$Class)
conf_radial <- confusionMatrix(pred_radial_test, df_test$Class)
# Output table
pander(data.frame("Linear_accuracy"=conf_lin$overall[1],
"Poly_accuracy"=conf_poly$overall[1],
"Radial_accuracy"=conf_radial$overall[1]))
 | Linear_accuracy | Poly_accuracy | Radial_accuracy |
---|---|---|---|
Accuracy | 0.9739 | 0.9275 | 0.9768 |
It looks like both the linear and the radial model achieve very high accuracy on the test data, with the radial model beating the linear model by a small margin (97.7% vs. 97.4%). However, taking the tuning time on this relatively small dataset into account, the linear kernel would be the preferred choice when modelling bigger data.
Building a majority-class naive classifier. The result is the same as that of the model built in part 2 above.
pred_base <- rep(3, nrow(df_test))
cat('Accuracy: ', 1- classError(pred_base, df_test$Class)$errorRate)
## Accuracy: 0.7231884
All SVM models perform better than the naive classifier (most-common-class model). The recommended model for this data set is an SVM with a radial kernel; for larger data with similar characteristics a linear kernel might be a good choice.
We’re going to continue working with the dataset introduced in Homework 3 about contraceptive usage by 1,934 Bangladeshi women. The data are in dataset_2.txt, which is now a merge of the training and test data that appeared in Homework 2.
In order to focus on the benefits of Hierarchical Modeling we’re going to consider a model with only one covariate (and intercept term).
# Read data
df_bang <- data.frame(read_csv("data/dataset_2.txt"))
# Renaming the variables (replace . with _)
names(df_bang) <- gsub("[.]", "_", names(df_bang))
# Create factor variables
df_bang$district <- factor(df_bang$district)
# Split data
set.seed(987)
sample_id <- sample(1:nrow(df_bang), floor(nrow(df_bang) * 0.85))
df_train <- df_bang[row.names(df_bang) %in% sample_id, ]
df_test <- df_bang[!row.names(df_bang) %in% sample_id, ]
# Create factor variables
df_train$contraceptive_use <- factor(df_train$contraceptive_use, labels=c("No", "Yes"))
df_test$contraceptive_use <- factor(df_test$contraceptive_use, labels=c("No", "Yes"))
rm(sample_id, df_bang)
A single logistic regression for contraceptive_use as a function of living.children. Do not include district information. You should use the glm function to fit this model. Interpret the estimated model.
set.seed(987)
fit_pooled <- glm(contraceptive_use ~ living_children,
data=df_train,
family=binomial(link="logit"))
pander(summary(fit_pooled), add.significance.stars=TRUE)
 | Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|---|
living_children | 0.1954 | 0.04111 | 4.752 | 0.000002012 | * * * |
(Intercept) | -0.9417 | 0.122 | -7.721 | 0.00000000000001151 | * * * |
(Dispersion parameter for binomial family taken to be 1)
Null deviance: | 2206 on 1642 degrees of freedom |
Residual deviance: | 2184 on 1641 degrees of freedom |
The above table shows that, in a simple model with only living children as the explanatory variable, the estimated association is highly significant. The coefficient of 0.195 means that each additional living child increases the log-odds of contraceptive use by about 0.195, i.e., multiplies the odds by roughly exp(0.195) ≈ 1.22.
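A quick sketch for translating the logit coefficients into more interpretable quantities (the values correspond to the estimates in the summary table above):
# Odds ratio per additional living child and implied baseline probability
exp(coef(fit_pooled)["living_children"]) # ~1.22: odds multiply by about 1.22 per child
plogis(coef(fit_pooled)["(Intercept)"])  # ~0.28: implied probability at zero living children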
pred_pooled_train <- ifelse(as.numeric(predict(fit_pooled,
df_train, type="response")) >= 0.5,
"Yes", "No")
pred_pooled_train <- factor(pred_pooled_train, levels=c("No", "Yes"))
pred_pooled_test <- ifelse(as.numeric(predict(fit_pooled,
df_test, type="response")) >= 0.5,
"Yes", "No")
pred_pooled_test <- factor(pred_pooled_test, levels=c("No", "Yes"))
conf_mat_test <- confusionMatrix(pred_pooled_test, df_test$contraceptive_use)
pander(conf_mat_test$table)
 | No | Yes |
---|---|---|
No | 183 | 108 |
Yes | 0 | 0 |
pander(conf_mat_test$overall)
Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull |
---|---|---|---|---|
0.6289 | 0 | 0.5706 | 0.6845 | 0.6289 |
AccuracyPValue | McnemarPValue |
---|---|
0.5263 | 7.339e-25 |
The above table shows that the overall accuracy is 62.9%, which is exactly the no-information rate (AccuracyNull): the model always predicts No as the response.
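This degenerate behaviour can be verified directly (a small sketch, not part of the original analysis): with a single, weak predictor the fitted probabilities never reach the 0.5 threshold on the test set.
# All predicted probabilities stay below 0.5, hence the constant "No" prediction
range(predict(fit_pooled, df_test, type="response"))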
A model that instead fits a separate logistic regression for each district. Use the glm function to fit this model. Hint: The separate logistic regression models can be fit using one application of glm by having the model formula be contraceptive_use ~ -1 + living.children * as.factor(district). Explain why this model formula is accomplishing the task of fitting separate models per district. Examine the summary output of the fitted model. Briefly explain the reason for many of the NA estimates of the coefficients.
The formula contraceptive_use ~ -1 + living.children * as.factor(district)
expands to a main effect for living children, a dummy variable for every district, and an interaction term between living children and each district. The district dummies act as district-specific intercepts, and the living-children main effect plus a district's interaction term gives that district's slope, so the fit is equivalent to fitting an individual model for every district. The -1 term ensures that no global intercept is being fitted. The many NA coefficient estimates appear because the corresponding columns of the design matrix are not estimable (they are aliased with other columns); this typically happens for districts with very few observations or with no variation in the number of living children, where the district dummy and its interaction carry the same information.
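As a sanity check, the per-district interpretation can be illustrated by fitting one district on its own (a sketch, assuming the chosen district has observations in the training split and some variation in living children): its coefficients should coincide, up to aliasing issues, with the district-specific intercept and slope implied by the interaction model fitted below.
# Sketch: separate fit for a single district (first factor level picked as an example)
d <- levels(df_train$district)[1]
fit_d <- glm(contraceptive_use ~ living_children,
             data=df_train[df_train$district == d, ],
             family=binomial(link="logit"))
coef(fit_d)
rm(d, fit_d)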
set.seed(987)
fit_unpooled <- glm(contraceptive_use ~ -1 + living_children * district,
data=df_train,
family=binomial(link="logit"))
sum_unpooled <- summary(fit_unpooled)$coefficients
pander(sum_unpooled[sum_unpooled[, 4] <= 0.1, ])
 | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
district101 | -1.223 | 0.5974 | -2.047 | 0.04064 |
district108 | -1.959 | 0.9187 | -2.133 | 0.03294 |
district113 | -2.213 | 1.096 | -2.02 | 0.04341 |
district125 | -2.68 | 0.8105 | -3.306 | 0.0009461 |
district128 | -3.083 | 1.321 | -2.333 | 0.01965 |
district129 | -1.965 | 0.9756 | -2.014 | 0.04403 |
district130 | -2.332 | 0.7284 | -3.201 | 0.001369 |
district143 | -1.832 | 0.8043 | -2.278 | 0.02273 |
district144 | -2.853 | 1.709 | -1.67 | 0.09499 |
district145 | -2.239 | 1.101 | -2.035 | 0.04189 |
district152 | -1.123 | 0.6736 | -1.667 | 0.09556 |
district153 | -2.086 | 1.262 | -1.653 | 0.09843 |
district159 | -2.631 | 1.18 | -2.229 | 0.0258 |
living_children:district113 | 0.7147 | 0.4198 | 1.702 | 0.08866 |
living_children:district121 | -1.176 | 0.5958 | -1.973 | 0.04848 |
living_children:district125 | 0.7863 | 0.3188 | 2.467 | 0.01364 |
living_children:district130 | 0.9027 | 0.3402 | 2.654 | 0.007963 |
living_children:district143 | 0.7896 | 0.3825 | 2.064 | 0.03899 |
The above table shows that for the unpooled model only a small portion of the coefficients for the different districts are statistically significant. Nevertheless, the model indicates that the variable living children has a different impact depending on the district.
pred_unpooled_train <- ifelse(as.numeric(predict(fit_unpooled,
df_train, type="response")) >= 0.5,
"Yes", "No")
pred_unpooled_train <- factor(pred_unpooled_train, levels=c("No", "Yes"))
pred_unpooled_test <- ifelse(as.numeric(predict(fit_unpooled,
df_test, type="response")) >= 0.5,
"Yes", "No")
pred_unpooled_test <- factor(pred_unpooled_test, levels=c("No", "Yes"))
conf_mat_test <- confusionMatrix(pred_unpooled_test, df_test$contraceptive_use)
pander(conf_mat_test$table)
 | No | Yes |
---|---|---|
No | 133 | 60 |
Yes | 50 | 48 |
pander(conf_mat_test$overall)
Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull |
---|---|---|---|---|
0.622 | 0.1745 | 0.5635 | 0.6779 | 0.6289 |
AccuracyPValue | McnemarPValue |
---|---|
0.621 | 0.3908 |
The above table shows that the overall accuracy for the unpooled model is 62.2%, i.e., it performs similarly to the pooled model in terms of accuracy. However, its predictions are more informative: it no longer always predicts No as the response.
A Bayesian hierarchical logistic regression model with district as the grouping variable. Make sure that both coefficients of the linear predictor are assumed to vary by district in the model specification. Describe briefly in words how the results of this model are different from the pooled and unpooled models of parts (a) and (b).
# Prepare
df_train2 <- df_train; df_test2 <- df_test
df_train2$contraceptive_use <- as.numeric(df_train2$contraceptive_use) - 1
df_test2$contraceptive_use <- as.numeric(df_test2$contraceptive_use) - 1
# Model
set.seed(987)
fit_hierarchical <- MCMChlogit(fixed=contraceptive_use ~ living_children,
random= ~living_children, group="district",
data=df_train2, r=2, R=diag(c(1, 0.1)), burnin=5000,
mcmc=10000, thin=1, verbose=1, beta.start=NA,
sigma2.start=NA, Vb.start=NA, FixOD=1, mubeta=c(0,0),
Vbeta=10000, nu=0.001, delta=0.001)
##
## Running the Gibbs sampler. It may be long, keep cool :)
##
## **********:10.0%, mean accept. rate=0.447
## **********:20.0%, mean accept. rate=0.444
## **********:30.0%, mean accept. rate=0.443
## **********:40.0%, mean accept. rate=0.442
## **********:50.0%, mean accept. rate=0.444
## **********:60.0%, mean accept. rate=0.443
## **********:70.0%, mean accept. rate=0.448
## **********:80.0%, mean accept. rate=0.444
## **********:90.0%, mean accept. rate=0.444
## **********:100.0%, mean accept. rate=0.445
In the glm case, (a) and (b), we maximised the likelihood function (MLE), i.e., we found the coefficients that make the observed data most likely. In the Bayesian hierarchical logistic regression, the starting point is a prior belief about the coefficients, which is updated with the data to obtain the posterior, i.e., a full distribution over the coefficients rather than a single point estimate. The hierarchical model therefore returns, for each district, posterior draws of the intercept and the living-children slope, as well as estimated probabilities for each observation. In the above example 10,000 MCMC draws are kept for each coefficient. In order to compare these draws with the pooled and unpooled models, summary measures such as the posterior mean have to be used.
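For instance, point estimates and uncertainty can be read off the posterior draws (a short sketch using the coda summary of the stored chain; the column names follow the usual coda summary output):
# Posterior means and 95% credible intervals from the stored MCMC draws
post_sum <- summary(fit_hierarchical$mcmc)
head(post_sum$statistics[, "Mean"])
head(post_sum$quantiles[, c("2.5%", "97.5%")])
rm(post_sum)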
In class we discussed that one of the benefits of using Bayesian hierarchical models is that it naturally shares information across the groupings. In this case, information is shared across districts. This is generally known as shrinkage. To explore the degree of shrinkage, we are going to compare coefficients across models and districts based on your results from part 1 above.
Create a single figure that shows the estimated coefficient to living.children as a function of district in each of the three models above. The horizontal axis should be the districts, and the vertical axis should be the estimated coefficient value (generally three estimated coefficients at each district corresponding to the three models). Make sure that the points plotted for each model are distinct (different colors and/or plotting characters), and that you create a legend identifying the model-specific points on the figure. You may want to consider adjusting the vertical axis if some estimated coefficients are so large (positively or negatively) that they obscure the general pattern of the bulk of points. Be sure to explain your decision.
# Districts
districts <- unique(df_train$district)
n_districts <- length(districts)
# Pooled coefficients
coeff_pooled <- rep(fit_pooled$coefficients[2], n_districts)
# Unpooled coeffficents
coeff_unpooled_all <- fit_unpooled$coefficient
n_coeff_unpooled <- length(coeff_unpooled_all)
coeff_unpooled <- coeff_unpooled_all[(n_coeff_unpooled -
n_districts + 2):n_coeff_unpooled]
coeff_unpooled <- c(0, coeff_unpooled) + coeff_unpooled_all[1]
# Bayesian coefficients
coeff_hierarchical_all <- summary(fit_hierarchical$mcmc)$statistics[, 1]
n_coeff_hierarchical <- length(coeff_hierarchical_all)
coeff_hierarchical <- coeff_hierarchical_all[(n_coeff_hierarchical -
5 - n_districts):(n_coeff_hierarchical
- 6)]
# Dataframe
df_coeff <- data.frame(districts=districts,
observations=as.numeric(table(df_train$district)),
pooled=coeff_pooled,
unpooled=coeff_unpooled,
hierarchical=coeff_hierarchical)
rm(districts, coeff_pooled, coeff_unpooled, coeff_hierarchical, n_districts,
n_coeff_unpooled, n_coeff_hierarchical, sum_unpooled, coeff_hierarchical_all,
coeff_unpooled_all, conf_mat_test)
ggplot(df_coeff, aes(x=as.numeric(as.character(districts)))) +
geom_point(aes(y=unpooled, color="unpooled", size = observations)) +
geom_point(aes(y=hierarchical, color="hierarchical", size=observations)) +
geom_point(aes(y=pooled, color="pooled")) +
scale_colour_manual(values=c("black","darkred","darkgrey")) +
ylab("Coefficient value") +
xlab("District number") +
ggtitle("Plot IV: Estimated coefficient as a function of district") +
theme_bw() +
ylim(c(-1.5, 1.5))
The vertical axis is restricted to the range -1.5 to 1.5. The reason is that a few coefficients with much larger absolute values have little practical relevance here and would otherwise obscure the general pattern of the bulk of the points.
As can be seen in the plot above, the pooled model has, by design, no variability in the coefficient for living children across districts. The hierarchical Bayes and the unpooled model, in contrast, vary in their coefficient values (both positive and negative) across districts.
The coefficients from the hierarchical model appear to be more stable across districts than those from the unpooled model. Furthermore, the number of observations in a district appears to affect the variability of the unpooled coefficients: the smaller the number, the further the estimate tends to lie from the pooled baseline. This could indicate overfitting due to the small per-district sample size.
Overall, the hierarchical Bayes model appears to be a compromise between the two: in the pooled model all districts share the same coefficient, while in the unpooled model the coefficients vary heavily.
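A rough way to express this compromise is the textbook partial-pooling (shrinkage) formula for a group-level estimate, stated here only for intuition under a normal approximation rather than derived from this specific logistic model:
\[\hat{\beta}_j^{\,hier} \approx \lambda_j \, \hat{\beta}_j^{\,unpooled} + (1-\lambda_j)\, \hat{\beta}^{\,pooled}, \qquad \lambda_j = \frac{n_j/\sigma^2}{n_j/\sigma^2 + 1/\tau^2}\]
where \(n_j\) is the number of observations in district \(j\), \(\sigma^2\) the within-district variance and \(\tau^2\) the between-district variance. As \(n_j\) shrinks, \(\lambda_j\) goes to zero and the district estimate is pulled toward the pooled value, which is exactly the pattern visible in Plot IV.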
Another benefit of shrinkage is how it affects probability estimates (recall the lucky, drunk friend from lecture whose classical estimate for the probability of guessing correctly was 100%). Extract the estimated probabilities from each model applied to the training data. That is, for the pooled and unpooled analyses, use the predict function applied to the fitted object, using the argument type="response". For the hierarchical model, the $theta.pred component of the fitted model contains the estimated probabilities.
Plot histograms of the vectors of probability estimates for each model separately. Make sure you standardize the horizontal axis so that the scales are the same. How does the distribution of estimated probabilities compare across the three models?
# Predictions
pred_pooled <- predict(fit_pooled, df_train, type='response')
pred_unpooled <- predict(fit_unpooled, df_train, type='response')
# Prepare
n_obs <- nrow(df_train)
model_type <- c(rep('pooled', n_obs), rep('unpooled', n_obs), rep('hierarchical', n_obs))
# Probabilities
model_predictions <- c(scale(pred_pooled),
scale(pred_unpooled),
scale(fit_hierarchical$theta.pred))
# Dataframe
df_pred <- data.frame(prediction=model_predictions, model=model_type)
# Create histogram of probabilities from each model
ggplot(df_pred, aes(x=prediction, fill=model)) +
geom_histogram(alpha=0.8, color="white", bins=40, position="identity") +
scale_fill_manual(values = c("black","darkred","darkgrey")) +
theme_bw() +
ylab("Count") +
xlab("Pr(Contraceptive usage)") +
ggtitle("Plot V: Predicted probabilities")
The above plot shows that the pooled model produces only four distinct values, at around 0, 0.3, 0.6 and 1. As already observed in 2.1 (a), this model is not very good at distinguishing the different classes. The probabilities from the hierarchical Bayes and the unpooled model are more evenly distributed, with three visible clusters at around 0.3, 0.5 and 0.6.
Create a scatter plot comparing predicted values from Unpooled and Hierarchical Models, making sure that the scale of the horizontal and vertical axes are the same, and that the plotting region is square rather than rectangular. Include on the plot the line \(y=x\) (why do you think this is a useful line to superimpose?). Briefly interpret the relationship between the probability estimates for these two models. Are there particular features of the plot that highlight the intended benefits of using a hierarchical model over the unpooled analysis? Briefly explain.
# Dataframe
df_mod <- data.frame(district=df_train$district,
unpooled=scale(pred_unpooled),
hierarchical=scale(fit_hierarchical$theta.pred))
df_mod <- merge(df_mod, data.frame(table(df_train$district)),
by.x="district", by.y="Var1")
# Plot
ggplot(df_mod, aes(x=hierarchical, y=unpooled)) +
geom_abline(intercept=0, slope=1, color="darkred", size=1) +
geom_point(size=df_mod$Freq/20) +
coord_fixed() +
theme_bw() +
ggtitle("Plot VI: Hierarchical vs. Unpooled Model")
The above plot confirms the picture already seen in the histograms: the two sets of probability estimates are positively correlated. The 45-degree line helps in assessing the relationship between the two models; it shows where the two models differ in their predictions and whether the relationship is linear. Furthermore, the plot shows that the hierarchical model is less extreme for districts where the sample size is small, while the unpooled model often pushes those estimates towards zero or one. This supports the observation made in 2.2 (b). The reason is that the hierarchical model incorporates the population-level (pooled) information as a prior, which helps prevent it from overfitting each district's data (shrinkage). In other words, as a district's sample size decreases, its estimate is weighted more heavily toward the overall (pooled) mean.