Problem 1: Celestial Object Classification

In this problem, the task is to classify a celestial object into one of 4 categories using photometric measurements recorded for the object. The training and testing datasets are provided in dataset_1_train.txt and dataset_1_test.txt, respectively. Overall, there are 1,379 celestial objects described by 61 attributes. The last column, Class, contains the object category we wish to predict.

Initialize

The following code chunk performs all the necessary setup of the modelling environment.

## Options
options(scipen = 10)                          # Disable scientific notation
update_package <- FALSE                       # Do not update packages (keep installed versions)

## Init files (always execute, eta: 10s)
source("scripts/01_init.R")                   # Helper functions to load packages
source("scripts/02_packages.R")               # Load all necessary packages
## Set Java path to C:/Program Files (x86)/Java/jre7
source("scripts/03_functions.R")              # Load project specific functions

Load the data

## Read data
df_train <- data.frame(read_csv("data/dataset_1_train.txt"))
df_test <- data.frame(read_csv("data/dataset_1_test.txt"))

Preprocess data

# Transform y variable to factor
df_train$Class <- as.factor(df_train$Class)
df_test$Class <- as.factor(df_test$Class)

1. RBF Kernel (fixed parameters)

Fit an RBF kernel to the training set with parameters gamma and cost both set to 1. Use the model to predict on the test set.

fit_svm <- svm(Class ~ ., data=df_train, gamma=1, cost=1, kernel="radial")

pred_svm_train <- predict(fit_svm, df_train)
pred_svm_test <- predict(fit_svm, df_test)

2. Confusion Matrix

Look at the confusion matrices for both the training and testing predictions from the above model. What do you notice about the predictions from this model?

Training data

conf_train <- confusionMatrix(pred_svm_train, df_train$Class)
pander(conf_train$table)
            Reference
Prediction     1     2     3     4
         1    44     0     0     0
         2     0    72     0     0
         3     0     0   492     0
         4     0     0     0    81
pander(conf_train$overall)
 Accuracy   Kappa   AccuracyLower   AccuracyUpper   AccuracyNull   AccuracyPValue   McnemarPValue
        1       1          0.9947               1         0.7141       1.702e-101              NA

The confusion matrix on the training set shows an accuracy of 100%, i.e., the model classifies every training observation in all 4 classes correctly. This is a strong indication of overfitting. To check this, let us look at the test data.

Test data

conf_test <- confusionMatrix(pred_svm_test, df_test$Class)
pander(conf_test$table)
            Reference
Prediction     1     2     3     4
         1     0     0     0     0
         2     0     0     0     0
         3    46    73   499    72
         4     0     0     0     0
pander(conf_test$overall)
 Accuracy   Kappa   AccuracyLower   AccuracyUpper   AccuracyNull   AccuracyPValue   McnemarPValue
   0.7232       0          0.6882          0.7563         0.7232           0.5195              NA
rm(fit_svm, pred_svm_test, pred_svm_train)

The above table shows that the accuracy drops heavily from 100% to 72.3%, which supports the conclusion that the model is overfit. Furthermore, every observation is classified as class 3, i.e., the majority class. In other words, the model acts as a 'naive' majority-class classifier.

3. The \(\gamma\) parameter for an RBF kernel

For the RBF kernel, make a figure showing the effect of the kernel parameter \(\gamma\) on the training and test errors. Consider some values of gamma between 0.001 and 0.3. Explain what you are seeing.

Gamma simulation

# Set values
gamma_values <- seq(0.001, 0.3, length.out=100)
test_errors <- rep(0., length(gamma_values))
train_errors <- rep(0., length(gamma_values))

# Loop through each gamma value
for (i in 1:length(gamma_values)) {
  fit <- svm(Class ~ ., gamma=gamma_values[i], cost=1, data=df_train, kernel='radial')
  pred_train <- predict(fit, newdata=df_train)
  pred_test <- predict(fit, newdata=df_test)
  train_errors[i] <- classError(pred_train, df_train$Class)$errorRate
  test_errors[i] <- classError(pred_test, df_test$Class)$errorRate
}

# Wrap in a dataframe
df_err <- data.frame(gamma_values, test_errors, train_errors)
rm(fit, gamma_values, i, pred_test, pred_train, test_errors, train_errors, conf_test, conf_train)

Output

# Calculate best gamma
best_gamma <- df_err$gamma_values[which(df_err$test_errors == min(df_err$test_errors))]

# Print output
cat('Gamma with lowest test error:', best_gamma)
## Gamma with lowest test error: 0.007040404
cat('Test error:', min(df_err$test_errors))
## Test error: 0.02898551

Visualization

# Tidy data 
df_err <- gather(df_err, key="data", value="error", -gamma_values)

# Plot
ggplot(data=df_err, aes(x=gamma_values, y=error, color=data)) +
  geom_point(size=0.9) + 
  geom_line() +
  scale_colour_manual(values=c("darkred","black")) +
  ylab("Error rate") +
  xlab("Gamma value") +
  ggtitle("Plot I: SVM train and test error rate by gamma value") +
  theme_bw()

The above plot shows that initially, as \(\gamma\) increases, the error rate on both the training and the test data decreases. However, beyond a threshold value of \(\gamma \approx 0.007\) the test error starts to increase up to a plateau of around 30%, while the training error continues to descend until it reaches 0%. That means that beyond this threshold the SVM starts overfitting the training data.

Technically speaking, \(\gamma\) is the free parameter of the Gaussian radial basis function (RBF) kernel:

\[K(x_i, x_j) = \exp\!\left(-\gamma \, \lVert x_i - x_j \rVert^2\right), \qquad \gamma > 0\]

A small \(\gamma\) corresponds to a Gaussian with a large variance, so the influence of \(x_j\) reaches further: if \(x_j\) is a support vector, a small \(\gamma\) implies that the class of this support vector influences the classification of \(x_i\) even if the distance between them is large. If \(\gamma\) is large, the kernel width is small and each support vector has only local influence, i.e., a large \(\gamma\) leads to low-bias, high-variance models that are prone to overfitting, and vice versa.
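As a quick numerical illustration (a hedged sketch, not part of the original analysis; the two points and \(\gamma\) values are arbitrary), the kernel value between two fixed points decays much faster with distance when \(\gamma\) is large:

# Sketch: RBF kernel value for two points at squared distance 4
rbf_kernel <- function(x1, x2, gamma) exp(-gamma * sum((x1 - x2)^2))
x1 <- c(0, 0); x2 <- c(2, 0)
rbf_kernel(x1, x2, gamma=0.01)   # ~0.96: distant points still strongly influence each other
rbf_kernel(x1, x2, gamma=1)      # ~0.02: influence drops off quickly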

On the available data, with cost fixed at 1, a small \(\gamma\) appears to perform better on out-of-sample data.

4. The \(cost\) parameter

For the RBF kernel, make a figure showing the effect of the cost parameter on the training and test errors. Consider some values of cost in the range of 0.1 to 20. Explain what you are seeing.

Cost simulation

# Initiate list with costs and lists to store error rates
cost_values <- seq(0.1, 20, length.out=50)
test_errors <- rep(0., length(cost_values))
train_errors <- rep(0., length(cost_values))

# Loop through each cost value, fitting an SVM and calculating training and test error rates
for (i in 1:length(cost_values)) {
  fit <- svm(Class ~ ., gamma=best_gamma, cost=cost_values[i], data=df_train, kernel='radial')
  pred_train <- predict(fit, newdata=df_train)
  pred_test <- predict(fit, newdata=df_test)
  train_errors[i] <- classError(pred_train, df_train$Class)$errorRate
  test_errors[i] <- classError(pred_test, df_test$Class)$errorRate
}

df_err <- data.frame(cost_values, test_errors, train_errors)
rm(fit, cost_values, i, pred_test, pred_train, test_errors, train_errors)

Output

# Calculate best cost
best_cost <- df_err$cost_values[which(df_err$test_errors == min(df_err$test_errors))]

# Print output
cat('cost with lowest test error:', best_cost)
## cost with lowest test error: 2.130612
cat('Test error:', min(df_err$test_errors))
## Test error: 0.02028986

Visualization

# Tidy data 
df_err <- gather(df_err, key="data", value="error", -cost_values)

# Plot
ggplot(data=df_err, aes(x=cost_values, y=error, color=data)) +
  geom_point(size=0.9) + 
  geom_line() +
  scale_colour_manual(values=c("darkred","black")) +
  ylab("Error rate") +
  xlab("Cost value") +
  ggtitle("Plot II: SVM train and test error rate by cost value") +
  theme_bw()

Similar to Plot I, the error rate initially decreases on both the training and the test data. The test error reaches its minimum at a cost of about 2.13, then plateaus and starts increasing again at a cost of around 8, while the training error continues decreasing towards zero.

In general, a standard SVM seeks a margin that separates all positive and negative examples. As this can lead to poorly fit models, soft margins are used, which allow some examples to be ignored or placed on the wrong side of the margin. The cost parameter C controls this soft margin: it sets the penalty for margin violations and thereby the influence of each individual support vector, trading error penalty against model stability.
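One way to see this trade-off on the data at hand is to compare the number of support vectors for a small and a large cost; a smaller cost typically leaves a softer margin with more support vectors. The snippet below is a sketch (it reuses best_gamma from part 3; the two cost values are arbitrary):

# Sketch: effect of the cost parameter on the number of support vectors
fit_soft <- svm(Class ~ ., data=df_train, kernel="radial", gamma=best_gamma, cost=0.1)
fit_hard <- svm(Class ~ ., data=df_train, kernel="radial", gamma=best_gamma, cost=20)
fit_soft$tot.nSV   # soft margin: many support vectors
fit_hard$tot.nSV   # hard margin: fewer support vectors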

5. SVM Kernel tuning (linear, polynomial, RBF)

Fit SVM models with the linear, polynomial (degree 2) and RBF kernels to the training set, and report the misclassification error on the test set for each model. Do not forget to tune all relevant parameters using 5-fold cross-validation on the training set (tuning may take a while!).

Parameter values

The following parameter values are informed by the optimization already performed in parts 1-4. The gamma vector is only used for the RBF kernel.

# Create a cost vector
cost_vect <- seq(0.001, 5, length.out=20)

# Create a gamma vector (for RBF only)
gamma_vect <- seq(0.0001, 0.2, length.out=20)
rm(best_cost, best_gamma, df_err)

Linear Kernel

# 5-fold crossvalidation 
set.seed(123)
fit_lin_cv <- tune(svm,
                   Class ~ .,
                   data=df_train,
                   kernel="linear",
                   tunecontrol=tune.control(sampling="cross", cross=5),
                   ranges=list(cost=cost_vect)) 

# Build model
fit_lin <- svm(Class ~ ., 
               data=df_train, 
               kernel="linear",
               cost=fit_lin_cv$best.parameters$cost)

# Prediction
pred_lin_test <- predict(fit_lin, df_test)

2nd degree polynomial kernel

# 5-fold crossvalidation to tune the cost 
set.seed(123)
fit_poly_cv <- tune(svm,
                    Class ~ .,
                    data=df_train,
                    kernel="polynomial", 
                    degree=2,
                    tunecontrol=tune.control(sampling="cross", cross=5),
                    ranges=list(cost=cost_vect)) 

# Build model
fit_poly <- svm(Class ~ ., 
                data=df_train, 
                kernel="polynomial",
                degree=2,                     # match the degree used in cross-validation
                cost=fit_poly_cv$best.parameters$cost)

# Prediction
pred_poly_test <- predict(fit_poly, df_test)

RBF kernel

# 5-fold crossvalidation to tune the cost 
set.seed(123)
fit_radial_cv <- tune(svm,
                      Class ~ .,
                      data=df_train,
                      kernel="radial", 
                      tunecontrol=tune.control(sampling="cross", cross=5),
                      ranges=list(cost=cost_vect,
                                  gamma=gamma_vect)) 

# Build model
fit_radial <- svm(Class ~ ., 
                  data=df_train, 
                  kernel="radial",
                  cost=fit_radial_cv$best.parameters$cost,
                  gamma=fit_radial_cv$best.parameters$gamma)

# Prediction
pred_radial_test <- predict(fit_radial, df_test)

Data preprocessing for visualization

# Radial kernel: for each cost, keep the gamma with the lowest cross-validation error
df_rad_cv <- fit_radial_cv$performances
df_opt <- data.frame(cost=cost_vect, gamma=NA, error=NA)
for(i in 1:length(cost_vect)){
  df_rad_cv_in <- df_rad_cv[df_rad_cv$cost == cost_vect[i], ]
  best_row <- which.min(df_rad_cv_in$error)
  df_opt[i, "gamma"] <- df_rad_cv_in$gamma[best_row]
  df_opt[i, "error"] <- df_rad_cv_in$error[best_row]
}

# Tidy dataframe
df_cv <- data.frame(cost=cost_vect,
                    linear_cv_error=fit_lin_cv$performances$error,
                    poly_cv_error=fit_poly_cv$performances$error,
                    radial_cv_error=df_opt$error)
df_cv <- gather(df_cv, key="model", value="cv_error", -cost)

rm(df_rad_cv, df_rad_cv_in, i)

Visualization

ggplot(data=df_cv, aes(x=cost, y=cv_error, color=model)) +
  geom_point(size=0.9) + 
  geom_line() +
  ggtitle("Plot III: Crossvalidation - Cost vs. Error") +
  scale_colour_manual(values=c("black","darkred","darkgrey")) +
  ylab("Cross-validation error") +
  xlab("Cost (for radial kernel with optimal gamma)") +
  theme_bw()

The above plot shows the 5-fold cross-validation error as a function of the cost parameter. For the radial kernel, the \(\gamma\) value with the lowest error at each cost is used.

6. Model accuracy

What is the best model in terms of testing accuracy? How does your final model compare with a naive classifier that predicts the most common class (3) on all points?

Testing

# Confusion matrix (testing)
conf_lin <- confusionMatrix(pred_lin_test, df_test$Class)
conf_poly <- confusionMatrix(pred_poly_test, df_test$Class)
conf_radial <- confusionMatrix(pred_radial_test, df_test$Class)

# Output table
pander(data.frame("Linear_accuracy"=conf_lin$overall[1],
                  "Poly_accuracy"=conf_poly$overall[1],
                  "Radial_accuracy"=conf_radial$overall[1]))
  Linear_accuracy Poly_accuracy Radial_accuracy
Accuracy 0.9739 0.9275 0.9768

It looks like both the linear and the radial model achieve very high accuracy on the test data, with the radial model beating the linear model by a few tenths of a percentage point. However, taking the tuning time on this relatively small dataset into account, the linear kernel would be the preferred choice for modelling larger data.

Naive classifier accuracy

Building a majority-class naive classifier. The result is the same as for the model built in part 2 above.

pred_base <- rep(3, nrow(df_test))
cat('Accuracy: ', 1- classError(pred_base, df_test$Class)$errorRate)
## Accuracy:  0.7231884

All SVM models perform better than the naive classifier (most-common-class model). The recommended model for this dataset is an SVM with a radial kernel; for larger data with similar characteristics, a model with a linear kernel might be a good choice.

Problem 2: Return of the Bayesian Hierarchical Model

We’re going to continue working with the dataset introduced in Homework 3 about contraceptive usage by 1,934 Bangladeshi women. The data are in dataset_2.txt, which is now a merge of the training and test data that appeared in Homework 2.

In order to focus on the benefits of Hierarchical Modeling we’re going to consider a model with only one covariate (and intercept term).

Load the data

# Read data
df_bang <- data.frame(read_csv("data/dataset_2.txt"))

# Renaming the variables (replace . with _)
names(df_bang) <- gsub("[.]", "_", names(df_bang))

# Create factor variables
df_bang$district <- factor(df_bang$district)

# Split data
set.seed(987)
sample_id <- sample(1:nrow(df_bang), floor(nrow(df_bang) * 0.85)) 
df_train <- df_bang[row.names(df_bang) %in% sample_id, ]
df_test <- df_bang[!row.names(df_bang) %in% sample_id, ]

# Create factor variables
df_train$contraceptive_use <- factor(df_train$contraceptive_use, labels=c("No", "Yes"))
df_test$contraceptive_use <- factor(df_test$contraceptive_use, labels=c("No", "Yes"))

rm(sample_id, df_bang)

1. Fit three models

(a) Pooled Model

A single logistic regression for contraceptive_use as a function of living.children. Do not include district information. You should use the glm function to fit this model. Interpret the estimated model.

Model

set.seed(987)
fit_pooled <- glm(contraceptive_use ~ living_children,
                  data=df_train,
                  family=binomial(link="logit"))

Summary

pander(summary(fit_pooled), add.significance.stars=TRUE)
                   Estimate   Std. Error   z value    Pr(>|z|)
living_children      0.1954      0.04111     4.752   2.012e-06 ***
(Intercept)         -0.9417      0.122      -7.721   1.151e-14 ***

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 2206 on 1642 degrees of freedom
Residual deviance: 2184 on 1641 degrees of freedom

The above table shows that in a simple model with only living_children as the explanatory variable, the estimated association with contraceptive use is positive and highly significant.
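Since the coefficient is on the log-odds scale, exponentiating it gives a more interpretable effect size (a small sketch using the fitted object above):

# Odds ratio for one additional living child in the pooled model
exp(coef(fit_pooled)["living_children"])   # about 1.22: the odds of use rise by roughly 22% per child
exp(confint.default(fit_pooled))           # Wald confidence intervals on the odds-ratio scale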

Prediction

pred_pooled_train <- ifelse(as.numeric(predict(fit_pooled,
                                               df_train, type="response")) >= 0.5,
                            "Yes", "No")
pred_pooled_train <- factor(pred_pooled_train, levels=c("No", "Yes"))
pred_pooled_test <- ifelse(as.numeric(predict(fit_pooled,
                                              df_test, type="response")) >= 0.5,
                           "Yes", "No")
pred_pooled_test <- factor(pred_pooled_test, levels=c("No", "Yes"))

conf_mat_test <- confusionMatrix(pred_pooled_test, df_test$contraceptive_use)
pander(conf_mat_test$table)
  No Yes
No 183 108
Yes 0 0
pander(conf_mat_test$overall)
 Accuracy   Kappa   AccuracyLower   AccuracyUpper   AccuracyNull   AccuracyPValue   McnemarPValue
   0.6289       0          0.5706          0.6845         0.6289           0.5263       7.339e-25

The above table shows that the overall accuracy is 62.9%. However, the model always predicts No as the response.

(b) Unpooled Model

A model that instead fits a separate logistic regression for each district. Use the glm function to fit this model. Hint: The separate logistic regression models can be fit using one application of glm by having the model formula be contraceptive_use ~ -1 + living.children * as.factor(district). Explain why this model formula is accomplishing the task of fitting separate models per district. Examine the summary output of the fitted model. Briefly explain the reason for many of the NA estimates of the coefficients.

The formula contraceptive_use ~ -1 + living.children * as.factor(district) gives each district its own intercept (the district main effects) and its own slope for living children (the living_children coefficient plus that district's interaction term). Each district therefore effectively gets its own logistic regression; a short sketch after the model summary below verifies this numerically. The -1 term ensures that no global intercept is fitted.

Model

set.seed(987)
fit_unpooled <- glm(contraceptive_use ~ -1 + living_children * district,
                  data=df_train,
                  family=binomial(link="logit"))

Summary (Of significant values)

sum_unpooled <- summary(fit_unpooled)$coefficients
pander(sum_unpooled[sum_unpooled[, 4] <= 0.1, ])
  Estimate Std. Error z value Pr(>|z|)
district101 -1.223 0.5974 -2.047 0.04064
district108 -1.959 0.9187 -2.133 0.03294
district113 -2.213 1.096 -2.02 0.04341
district125 -2.68 0.8105 -3.306 0.0009461
district128 -3.083 1.321 -2.333 0.01965
district129 -1.965 0.9756 -2.014 0.04403
district130 -2.332 0.7284 -3.201 0.001369
district143 -1.832 0.8043 -2.278 0.02273
district144 -2.853 1.709 -1.67 0.09499
district145 -2.239 1.101 -2.035 0.04189
district152 -1.123 0.6736 -1.667 0.09556
district153 -2.086 1.262 -1.653 0.09843
district159 -2.631 1.18 -2.229 0.0258
living_children:district113 0.7147 0.4198 1.702 0.08866
living_children:district121 -1.176 0.5958 -1.973 0.04848
living_children:district125 0.7863 0.3188 2.467 0.01364
living_children:district130 0.9027 0.3402 2.654 0.007963
living_children:district143 0.7896 0.3825 2.064 0.03899

The above table shows that, for the unpooled model, only a small portion of the district coefficients is statistically significant. Nevertheless, the model indicates that the variable living children has a different impact depending on the district.
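As a check on the claim in 1 (b) that the interaction formula fits one model per district, a single district can be refit on its own and its slope compared to the sum of the living_children main effect and that district's interaction term (a sketch; the chosen district is arbitrary, must not be the reference level, and should be one whose interaction coefficient is not among the NA estimates):

# Sketch: per-district glm vs. the combined -1 + interaction formula
d <- levels(df_train$district)[2]   # any district other than the first (reference) level
fit_d <- glm(contraceptive_use ~ living_children,
             data=df_train[df_train$district == d, ],
             family=binomial(link="logit"))
coef(fit_d)["living_children"]
coef(fit_unpooled)["living_children"] +
  coef(fit_unpooled)[paste0("living_children:district", d)]   # should agree (up to separation/convergence issues)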

Prediction

pred_unpooled_train <- ifelse(as.numeric(predict(fit_unpooled,
                                                 df_train, type="response")) >= 0.5,
                              "Yes", "No")
pred_unpooled_train <- factor(pred_unpooled_train, levels=c("No", "Yes"))
pred_unpooled_test <- ifelse(as.numeric(predict(fit_unpooled,
                                                df_test, type="response")) >= 0.5,
                             "Yes", "No")
pred_unpooled_test <- factor(pred_unpooled_test, levels=c("No", "Yes"))

conf_mat_test <- confusionMatrix(pred_unpooled_test, df_test$contraceptive_use)
pander(conf_mat_test$table)
  No Yes
No 133 60
Yes 50 48
pander(conf_mat_test$overall)
 Accuracy    Kappa   AccuracyLower   AccuracyUpper   AccuracyNull   AccuracyPValue   McnemarPValue
    0.622   0.1745          0.5635          0.6779         0.6289            0.621          0.3908

The above table shows that the overall accuracy of the unpooled model is 62.2%, i.e., it performs similarly to the pooled model in terms of accuracy. However, its predictions are more useful: it no longer always predicts No as the response.

(c) Bayesian Hierarchical Logistic Model

A Bayesian hierarchical logistic regression model with district as the grouping variable. Make sure that both coefficients of the linear predictor are assumed to vary by district in the model specification. Describe briefly in words how the results of this model are different from the pooled and unpooled models of parts (a) and (b).
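Written out, the hierarchical specification fitted below lets both the intercept and the living_children slope vary by district \(j\) (this is the standard random-intercept/random-slope form; MCMChlogit additionally places priors on \(\beta\) and on the covariance matrix \(V_b\)):

\[\text{logit}\, P(y_{ij} = 1) = (\beta_0 + b_{0j}) + (\beta_1 + b_{1j})\, x_{ij}, \qquad (b_{0j}, b_{1j}) \sim \mathcal{N}(0, V_b)\]

where \(x_{ij}\) is the number of living children of woman \(i\) in district \(j\).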

Model

# Prepare
df_train2 <- df_train; df_test2 <- df_test
df_train2$contraceptive_use <- as.numeric(df_train2$contraceptive_use) - 1
df_test2$contraceptive_use <- as.numeric(df_test2$contraceptive_use) - 1

# Model
set.seed(987)
fit_hierarchical <- MCMChlogit(fixed=contraceptive_use ~ living_children,
                               random= ~living_children, group="district",
                               data=df_train2, r=2, R=diag(c(1, 0.1)), burnin=5000,
                               mcmc=10000, thin=1, verbose=1, beta.start=NA,
                               sigma2.start=NA, Vb.start=NA, FixOD=1, mubeta=c(0,0),
                               Vbeta=10000, nu=0.001, delta=0.001)
## 
## Running the Gibbs sampler. It may be long, keep cool :)
## 
## **********:10.0%, mean accept. rate=0.447
## **********:20.0%, mean accept. rate=0.444
## **********:30.0%, mean accept. rate=0.443
## **********:40.0%, mean accept. rate=0.442
## **********:50.0%, mean accept. rate=0.444
## **********:60.0%, mean accept. rate=0.443
## **********:70.0%, mean accept. rate=0.448
## **********:80.0%, mean accept. rate=0.444
## **********:90.0%, mean accept. rate=0.444
## **********:100.0%, mean accept. rate=0.445

In the glm case, (a) and (b), we maximized the likelihood function, i.e., we found the coefficients that make the observed data most likely. In the Bayesian hierarchical logistic regression, the starting point is a prior belief about the coefficients, which is updated with the observed data to obtain the posterior: a full distribution over the coefficients rather than a single point estimate. The model therefore returns posterior distributions of the intercept and the living_children slope for each district, as well as estimated probabilities for each observation. In the example above, 10,000 posterior draws are kept for each coefficient. To compare these results with the pooled and unpooled models, summary measures of the posterior (e.g., posterior means) have to be used.
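For example, the posterior sample of the fixed living_children slope can be summarized directly from the MCMC output. The sketch below assumes the corresponding column of the mcmc object is named beta.living_children, following MCMChlogit's beta.<variable> naming convention:

# Sketch: posterior mean and 95% credible interval for the fixed living_children slope
draws <- as.matrix(fit_hierarchical$mcmc)[, "beta.living_children"]   # assumed column name
mean(draws)                              # posterior mean (analogue of a point estimate)
quantile(draws, probs=c(0.025, 0.975))   # 95% credible interval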

2. Benefits of hierarchical models

In class we discussed that one of the benefits of using Bayesian hierarchical models is that it naturally shares information across the groupings. In this case, information is shared across districts. This is generally known as shrinkage. To explore the degree of shrinkage, we are going to compare coefficients across models and districts based on your results from part 1 above.

(a) Single figure for living children

Create a single figure that shows the estimated coefficient to living.children as a function of district in each of the three models above. The horizontal axis should be the districts, and the vertical axis should be the estimated coefficient value (generally three estimated coefficients at each district corresponding to the three models). Make sure that the points plotted for each model are distinct (different colors and/or plotting characters), and that you create a legend identifying the model-specific points on the figure. You may want to consider adjusting the vertical axis if some estimated coefficients are so large (positively or negatively) that they obscure the general pattern of the bulk of points. Be sure to explain your decision.

Coefficients

# Districts
districts <- unique(df_train$district)
n_districts <- length(districts)

# Pooled coefficients
coeff_pooled <- rep(fit_pooled$coefficients[2], n_districts)

# Unpooled coeffficents
coeff_unpooled_all <- fit_unpooled$coefficient
n_coeff_unpooled <- length(coeff_unpooled_all)
coeff_unpooled <- coeff_unpooled_all[(n_coeff_unpooled -
                                        n_districts + 2):n_coeff_unpooled]
coeff_unpooled <- c(0, coeff_unpooled) + coeff_unpooled_all[1]

# Bayesian coefficients
coeff_hierarchical_all <- summary(fit_hierarchical$mcmc)$statistics[, 1]
n_coeff_hierarchical <- length(coeff_hierarchical_all)
coeff_hierarchical <- coeff_hierarchical_all[(n_coeff_hierarchical -
                                                5 - n_districts):(n_coeff_hierarchical
                                                                  - 6)]

Create a dataframe

# Dataframe
df_coeff <- data.frame(districts=districts,
                       observations=as.numeric(table(df_train$district)),
                       pooled=coeff_pooled,
                       unpooled=coeff_unpooled,
                       hierarchical=coeff_hierarchical)
rm(districts, coeff_pooled, coeff_unpooled, coeff_hierarchical, n_districts,
   n_coeff_unpooled, n_coeff_hierarchical, sum_unpooled, coeff_hierarchical_all,
   coeff_unpooled_all, conf_mat_test)

Plot

ggplot(df_coeff, aes(x=as.numeric(as.character(districts)))) +
  geom_point(aes(y=unpooled, color="unpooled", size = observations)) + 
  geom_point(aes(y=hierarchical, color="hierarchical", size=observations)) +
  geom_point(aes(y=pooled, color="pooled")) +
  scale_colour_manual(values=c("black","darkred","darkgrey")) +
  ylab("Coefficient value") +
  xlab("District number") +
  ggtitle("Plot IV: Estimated coefficient as a function of district") +
  theme_bw() +
  ylim(c(-1.5, 1.5))

The vertical axis is restricted to the range -1.5 to 1.5 because a few estimated coefficients are so large (in absolute value) that they would otherwise obscure the general pattern of the bulk of the points.

(b) Short summary

As can be seen in the plot above, the pooled model has, by design, no variability in the coefficient for living children across districts. The hierarchical Bayes model and the unpooled model, in contrast, both show district-to-district variation in the coefficient (positive as well as negative).

The coefficients from the hierarchical model appear to be more stable across districts than those from the unpooled model. Furthermore, the number of observations in a district appears to affect the variability of the unpooled coefficients: the smaller the district, the further its estimate tends to lie from the baseline (pooled) value. This could indicate overfitting due to the small per-district sample sizes.

Overall, the hierarchical Bayes model is a compromise between the two extremes: in the pooled model all districts share the same coefficient, while in the unpooled model the coefficients vary heavily.
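This compromise can also be quantified with the coefficient data frame from part (a): on average, the hierarchical district slopes should lie closer to the pooled slope than the unpooled ones do (a small sketch using df_coeff; na.rm guards against any NA unpooled estimates):

# Sketch: average absolute distance of the district slopes from the pooled slope
mean(abs(df_coeff$unpooled - df_coeff$pooled), na.rm=TRUE)       # unpooled: large deviations expected
mean(abs(df_coeff$hierarchical - df_coeff$pooled), na.rm=TRUE)   # hierarchical: shrunk towards the pooled value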

3. Model shrinkage

Another benefit of shrinkage is how it affects probability estimates (recall the lucky, drunk friend from lecture whose classical estimate for the probability of guessing correctly was 100%). Extract the estimated probabilities from each model applied to the training data. That is, for the pooled and unpooled analyses, use the predict function applied to the fitted object, using the argument type="response". For the hierarchical model, the $theta.pred component of the fitted model contains the estimated probabilities.

(a) Probabilty estimates

Plot histograms of the vectors of probability estimates for each model separately. Make sure you standardize the horizontal axis so that the scales are the same. How does the distribution of estimated probabilities compare across the three models?

Predictions

# Predictions
pred_pooled <- predict(fit_pooled, df_train, type='response')
pred_unpooled <- predict(fit_unpooled, df_train, type='response')

# Prepare
n_obs <- nrow(df_train)
model_type <- c(rep('pooled', n_obs), rep('unpooled', n_obs), rep('hierarchical', n_obs))

# Standardized (z-scored) predicted probabilities, so all three histograms share the same scale
model_predictions <- c(scale(pred_pooled),
                       scale(pred_unpooled),
                       scale(fit_hierarchical$theta.pred))

# Dataframe
df_pred <- data.frame(prediction=model_predictions, model=model_type)

Visualization

# Create histogram of the predicted probabilities from each model
ggplot(df_pred, aes(x=prediction, fill=model)) +
  geom_histogram(alpha=0.8, color="white", bins=40, position="identity") +
  scale_fill_manual(values = c("black","darkred","darkgrey")) +
  theme_bw() +
  ylab("Count") +
  xlab("Pr(Contraceptive usage)") +
  ggtitle("Plot V: Predicted probabilities")

The above plot shows that the pooled model takes on only four distinct values (at roughly 0, 0.3, 0.6 and 1 on the plotted scale), one for each observed value of living_children. As was already observed in 2.1 (a), this model is not very good at distinguishing the classes. The probabilities for the hierarchical Bayes and the unpooled model are more evenly distributed, with three visible clusters at around 0.3, 0.5 and 0.6.
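The small number of distinct pooled predictions can be checked directly, since the pooled model's fitted probability depends only on the value of living_children (a quick sketch using objects defined above):

# Sketch: distinct fitted probabilities of the pooled model
sort(unique(round(pred_pooled, 4)))   # one value per observed value of living_children
table(df_train$living_children)       # the predictor takes only a handful of distinct values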

(b) Comparing predictions

Create a scatter plot comparing predicted values from Unpooled and Hierarchical Models, making sure that the scale of the horizontal and vertical axes are the same, and that the plotting region is square rather than rectangular. Include on the plot the line \(y=x\) (why do you think this is a useful line to superimpose?). Briefly interpret the relationship between the probability estimates for these two models. Are there particular features of the plot that highlight the intended benefits of using a hierarchical model over the unpooled analysis? Briefly explain.

Prepare dataframe

# Dataframe
df_mod <- data.frame(district=df_train$district,
                     unpooled=scale(pred_unpooled),
                     hierarchical=scale(fit_hierarchical$theta.pred))

df_mod <- merge(df_mod, data.frame(table(df_train$district)),
                by.x="district", by.y="Var1")

Visualization

# Plot
ggplot(df_mod, aes(x=hierarchical, y=unpooled)) +
  geom_abline(intercept=0, slope=1, color="darkred", size=1) +
  geom_point(size=df_mod$Freq/20) +
  coord_fixed() +
  theme_bw() +
  ggtitle("Plot VI: Hierarchical vs. Unpooled Model")

The above plot confirms the picture from the histograms: the two sets of probability estimates are positively correlated. The 45-degree line \(y=x\) is useful because points on it are predictions on which the two models agree exactly, so deviations from the line show directly where and how the models differ. Furthermore, the plot shows that the hierarchical model is less extreme for districts with small sample sizes, whereas the unpooled model often pushes those predictions to the extremes (probabilities near zero or one). This supports the observation made in 2.2 (b). The reason is that the hierarchical model incorporates information shared across districts through the common prior, which prevents it from overfitting each district's data (shrinkage). In other words, as a district's sample size decreases, its estimate is weighted more heavily towards the overall (pooled) estimate.
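For intuition, the classic partial-pooling formula for a group-level estimate (a textbook sketch, not the exact estimator MCMChlogit uses; the numbers below are made up) weights a district's own estimate against the pooled estimate in proportion to the district's sample size:

# Sketch: partial pooling of a district estimate towards the pooled estimate
# theta_j = w_j * district estimate + (1 - w_j) * pooled estimate, with w_j = n_j / (n_j + k)
shrink <- function(district_est, pooled_est, n_j, k) {
  w <- n_j / (n_j + k)
  w * district_est + (1 - w) * pooled_est
}
shrink(district_est=0.9, pooled_est=0.4, n_j=5,   k=20)   # small district: 0.5, pulled towards 0.4
shrink(district_est=0.9, pooled_est=0.4, n_j=500, k=20)   # large district: ~0.88, stays near its own data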