Wheelchair Ramps Classification

On 25 April I took the Data Science Professional certification from DataCamp. After passing several skill assessments in Python, Statistics, SQL, and Machine Learning, along with a coding challenge in data management and exploratory analysis using Python, I was given this case study as my final project, with 24 hours to complete it and present my findings in a video. I’ve made some improvements after submitting the project, particularly in the hyperparameter tuning section.

Problem Statement

Congratulations, you have landed your first job as a data scientist at National Accessibility! National Accessibility currently installs wheelchair ramps for office buildings and schools. However, the marketing manager wants the company to start installing ramps for event venues as well. According to a new survey, approximately 40% of event venues are not wheelchair accessible. However, it is not easy to know whether a venue already has a ramp installed.

The marketing manager would like to know whether you can develop a model to predict whether an event venue has a wheelchair ramp. To help you with this, he has provided you with a dataset of London venues. This data includes whether the venue has a ramp.

It is a waste of time to contact venues that already have a ramp installed, and it also looks bad for the company. Therefore, it is especially important to exclude locations that already have a ramp. Ideally, at least two-thirds of venues predicted to be without a ramp should not have a ramp.

The data you will use for this analysis can be accessed here: "data/event_venues.csv"

Load Data

# Use this cell to begin, and add as many cells as you need to complete your analysis!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from statistics import mean

plt.style.use('seaborn')
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv("data/event_venues.csv")
df.head(10)
| | venue_name | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events | Wheelchair accessible |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | techspace aldgate east | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | False |
| 1 | green rooms hotel | True | 1 | True | False | 40.000000 | 120 | 80.000000 | True | False |
| 2 | 148 leadenhall street | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | False |
| 3 | conway hall | False | 0 | True | False | 35.045455 | 60 | 60.000000 | False | False |
| 4 | gridiron building | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | False |
| 5 | kimpton fitzroy london | True | 1 | True | False | 6.000000 | 0 | 112.715867 | True | False |
| 6 | lloyds avenue | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | False |
| 7 | public space \| members-style bar & dining | True | 1 | True | False | 35.045455 | 200 | 112.715867 | False | False |
| 8 | 16 old queen street | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | False |
| 9 | siorai bar | True | 1 | True | False | 35.045455 | 180 | 20.000000 | True | False |
# No missing values
df.isnull().sum()
venue_name                    0
Loud music / events           0
Venue provides alcohol        0
Wi-Fi                         0
supervenue                    0
U-Shaped_max                  0
max_standing                  0
Theatre_max                   0
Promoted / ticketed events    0
Wheelchair accessible         0
dtype: int64
df['venue_name'] = df['venue_name'].astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3910 entries, 0 to 3909
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   venue_name                  3910 non-null   category
 1   Loud music / events         3910 non-null   bool    
 2   Venue provides alcohol      3910 non-null   int64   
 3   Wi-Fi                       3910 non-null   bool    
 4   supervenue                  3910 non-null   bool    
 5   U-Shaped_max                3910 non-null   float64 
 6   max_standing                3910 non-null   int64   
 7   Theatre_max                 3910 non-null   float64 
 8   Promoted / ticketed events  3910 non-null   bool    
 9   Wheelchair accessible       3910 non-null   bool    
dtypes: bool(5), category(1), float64(2), int64(2)
memory usage: 192.6 KB
len(df)
3910

Before we do anything else, let’s split our data into training and testing sets. We will leave the test data alone until the very end, once we’ve settled on the model we think is best. This prevents overfitting to the test data and gives an unbiased view of the model’s ability to generalize.

df_train, df_test = train_test_split(df, test_size=0.2, random_state = 0)
# Work on explicit copies so later column assignments don't raise SettingWithCopyWarning
df_train, df_test = df_train.copy(), df_test.copy()
target = 'Wheelchair accessible'

EDA

# First let's try to spot any potential outliers in our numerical data using histograms and summary statistics.
df_train.hist(figsize=(8,8), xrot=45, bins=20)
plt.show()

[Figure: histograms of the numeric features in the training set]

df_train.describe()
| | Venue provides alcohol | U-Shaped_max | max_standing | Theatre_max |
|---|---|---|---|---|
| count | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 |
| mean | 0.715473 | 34.469905 | 111.485934 | 111.876338 |
| std | 0.451261 | 20.041665 | 249.709427 | 119.055118 |
| min | 0.000000 | 2.000000 | 0.000000 | 2.000000 |
| 25% | 0.000000 | 35.045455 | 0.000000 | 80.000000 |
| 50% | 1.000000 | 35.045455 | 50.000000 | 112.715867 |
| 75% | 1.000000 | 35.045455 | 120.000000 | 112.715867 |
| max | 1.000000 | 900.000000 | 7500.000000 | 2500.000000 |

Looking at the summary statistics, we can see that U-Shaped_max, max_standing, and Theatre_max are spread over very wide ranges, which is one consideration for whether we should perform feature scaling. The maximum values of these features are also far above their means, which could suggest outliers. However, calling a value an outlier requires a good understanding of the data source, and in this case I would say these are not outliers: it is plausible that some venues simply have a far larger capacity than the rest.
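
To back up that visual impression with numbers, here is a small check added after the fact (not part of the original submission) that counts how many values sit beyond the usual 1.5 × IQR fence for each capacity feature:

# Rough IQR rule (added for illustration): count points far above the upper quartile
for col in ['U-Shaped_max', 'max_standing', 'Theatre_max']:
    q1, q3 = df_train[col].quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    n_extreme = (df_train[col] > upper_fence).sum()
    print(f"{col}: {n_extreme} values above {upper_fence:.1f}")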

Class Imbalance?

# Let's check if there is any imbalances in our target variable
sns.countplot(x = target, data = df_train)
<AxesSubplot:xlabel='Wheelchair accessible', ylabel='count'>

[Figure: count plot of the Wheelchair accessible classes]

We can see that the number of venues accessible to wheelchairs and the number that are not are fairly balanced, so we don’t have to perform any resampling.
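
For a concrete figure rather than a plot, the class proportions can also be printed directly; this is a small check added after the fact, not part of the original submission:

# Absolute counts and proportions of the target classes in the training set
print(df_train[target].value_counts())
print(df_train[target].value_counts(normalize=True).round(3))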

Segment and group by the target feature

df_train[target] = df_train[target].astype('category')
num_cols = ['U-Shaped_max', 'max_standing', 'Theatre_max']
for col in num_cols:
    ax = sns.boxplot(y = target, x = col, data=df_train)
    ax.set_xlim(0,500)
    plt.show()

[Figures: box plots of U-Shaped_max, max_standing and Theatre_max, split by wheelchair accessibility]

From the box plots we can see that there are many so-called outliers in our numerical features, and that the distributions of Theatre_max and U-Shaped_max are widely dispersed. More importantly, at a glance, venues with larger seated and standing capacities tend to be wheelchair accessible. Keeping this in mind, these capacity features might be important for predicting wheelchair accessibility.
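
To put a number on that impression, the medians of the capacity features can be compared across the two classes; a small sketch added after the fact (not in the original notebook):

# Median capacity per target class; accessible venues should show larger typical capacities
print(df_train.groupby(target)[num_cols].median())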

Segment Categorical features by the target classes

categorical = ['Venue provides alcohol', 'Loud music / events', 'Wi-Fi', 'supervenue', 'Promoted / ticketed events']
for col in categorical:
    g = sns.catplot(x = col, kind='count', col = target, data=df_train, sharey=False)

[Figures: count plots of each categorical feature, split by wheelchair accessibility]

We can see some differences in how the target variable is distributed across venues that provide alcohol and venues that host promoted/ticketed events: venues that host promoted/ticketed events tend to be wheelchair accessible, while venues that don’t provide alcohol tend not to be. This suggests these features might be good indicators of the target.
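
The same pattern can be read off numerically with a row-normalized cross-tabulation; again a sketch added after the fact, not part of the original notebook:

# Share of wheelchair-accessible venues within each level of two candidate features
for col in ['Venue provides alcohol', 'Promoted / ticketed events']:
    print(pd.crosstab(df_train[col], df_train[target], normalize='index').round(2), "\n")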

Correlation Matrix

df_train[target] = df_train[target].astype('bool')
corr = df_train.corr()
plt.figure(figsize=(6,6))
sns.heatmap(corr, cmap='RdBu_r', annot=True, vmax=1, vmin=-1)
plt.show()

[Figure: correlation heatmap of the training features]

From the correlation matrix we can see that there isn’t any strong positive or negative linear correlation between the features and our target variable, Wheelchair accessible.

Feature Pre-Processing

df_train.dtypes
venue_name                    category
Loud music / events               bool
Venue provides alcohol           int64
Wi-Fi                             bool
supervenue                        bool
U-Shaped_max                   float64
max_standing                     int64
Theatre_max                    float64
Promoted / ticketed events        bool
Wheelchair accessible             bool
dtype: object
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
X = df_train.iloc[:, 1:-1]
y = df_train[target].replace({False: 1, True: 0})

You might be wondering why I mapped the False target value, which indicates a venue that is not wheelchair accessible, to 1 instead of 0. The reason is simple: one of the main requirements the marketing manager gave for this project is to minimize predicting a venue as lacking a ramp when it in fact has one, which means we want to optimize precision for the False class. To do that, we make False the positive class (denoted as 1), since scikit-learn’s built-in 'precision' scoring computes precision for the positive class.
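
Relabelling the target is one way to do this. An equivalent alternative, shown here only as a sketch (it is not what the notebook uses), is to keep the original boolean labels and tell scikit-learn which class to treat as positive via make_scorer:

# Sketch of an alternative: keep the boolean target and score precision of the False ("no ramp") class
from sklearn.metrics import make_scorer, precision_score

precision_no_ramp = make_scorer(precision_score, pos_label=False)
# usage: cross_val_score(model, X, df_train[target], scoring=precision_no_ramp, cv=cv)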

Scaling

As we saw in the exploratory analysis, the features in our data have very different scales and show signs of so-called “outliers”. This can hurt the performance of distance-based algorithms (KNN, SVM) and of models fitted by gradient-based optimization (such as logistic regression). To address this it is worth scaling the data before feeding it to those models. I will try two scaling methods: StandardScaler and MinMaxScaler. Standard scaling centers each feature and rescales it to unit standard deviation, which limits how much a handful of extreme values can dominate a feature’s scale. MinMaxScaler, on the other hand, does nothing to dampen outliers; it is less disruptive to the original information in the data and simply rescales each feature to a default range of 0 to 1.

We will simply try both of these scaling methods and see which yields the best results for our models.

# Standard Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)

# Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
X_scale = pd.DataFrame(X_scale, index = X.index, columns = X.columns)
X_scale.head(3)
| | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events |
|---|---|---|---|---|---|---|---|---|
| 788 | -0.752041 | -1.585751 | 0.270315 | -0.263442 | 0.028722 | -0.446534 | 0.007053 | -0.788928 |
| 1101 | -0.752041 | -1.585751 | 0.270315 | -0.263442 | -0.223066 | -0.326375 | -0.687828 | -0.788928 |
| 519 | -0.752041 | -1.585751 | 0.270315 | 3.795901 | 0.028722 | -0.406481 | 0.007053 | -0.788928 |
X_scale.describe()
| | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events |
|---|---|---|---|---|---|---|---|---|
| count | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 |
| mean | -2.661980e-17 | -1.171271e-18 | -3.510264e-17 | -2.413528e-16 | -1.001900e-15 | -2.259488e-16 | -1.323450e-15 | 1.426821e-17 |
| std | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 |
| min | -7.520409e-01 | -1.585751e+00 | -3.699385e+00 | -2.634420e-01 | -1.620379e+00 | -4.465340e-01 | -9.230507e-01 | -7.889275e-01 |
| 25% | -7.520409e-01 | -1.585751e+00 | 2.703152e-01 | -2.634420e-01 | 2.872223e-02 | -4.465340e-01 | -2.677872e-01 | -7.889275e-01 |
| 50% | -7.520409e-01 | 6.306160e-01 | 2.703152e-01 | -2.634420e-01 | 2.872223e-02 | -2.462693e-01 | 7.052730e-03 | -7.889275e-01 |
| 75% | 1.329715e+00 | 6.306160e-01 | 2.703152e-01 | -2.634420e-01 | 2.872223e-02 | 3.410135e-02 | 7.052730e-03 | 1.267544e+00 |
| max | 1.329715e+00 | 6.306160e-01 | 2.703152e-01 | 3.795901e+00 | 4.319344e+01 | 2.959318e+01 | 2.006218e+01 | 1.267544e+00 |
X_norm = pd.DataFrame(X_norm, index = X.index, columns = X.columns)
X_norm.head(3)
| | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events |
|---|---|---|---|---|---|---|---|---|
| 788 | 0.0 | 0.0 | 1.0 | 0.0 | 0.036799 | 0.000000 | 0.044322 | 0.0 |
| 1101 | 0.0 | 0.0 | 1.0 | 0.0 | 0.031180 | 0.004000 | 0.011209 | 0.0 |
| 519 | 0.0 | 0.0 | 1.0 | 1.0 | 0.036799 | 0.001333 | 0.044322 | 0.0 |
X_norm.describe()
| | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events |
|---|---|---|---|---|---|---|---|---|
| count | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 |
| mean | 0.361253 | 0.715473 | 0.931905 | 0.064898 | 0.036158 | 0.014865 | 0.043986 | 0.383632 |
| std | 0.480441 | 0.451261 | 0.251948 | 0.246385 | 0.022318 | 0.033295 | 0.047660 | 0.486348 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.036799 | 0.000000 | 0.031225 | 0.000000 |
| 50% | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.036799 | 0.006667 | 0.044322 | 0.000000 |
| 75% | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.036799 | 0.016000 | 0.044322 | 1.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |

Model Training

K-Fold Cross Validation

To evaluate model performance on the training set we will use K-Fold cross-validation, specifically 5-fold, since our data is quite limited. K-Fold validation gives a fairly accurate picture of a model’s performance and generalization ability, because the model is trained and validated on different parts of the data in turn, which works well for a smaller dataset.

cv = KFold(n_splits = 5, random_state = 0, shuffle=True)
def get_score(model, X, y, metric):
    return cross_val_score(model, X = X, y = y, scoring = metric, cv = cv, n_jobs = -1)

Next I’ll fit a range of different algorithms with their default parameters and see which one works best.

Logistic Regression

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 0)
lr.fit(X, y)

logreg_score = mean(get_score(lr, X, y, 'precision'))
print(logreg_score)
c:\Users\David\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.6084089246120791
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 0)
lr.fit(X_scale, y)

scale_logreg_score = mean(get_score(lr, X_scale, y, 'precision'))
print(scale_logreg_score)
0.6063055862836805
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 0)
lr.fit(X_norm, y)

norm_logreg_score = mean(get_score(lr, X_norm, y, 'precision'))
print(norm_logreg_score)
0.6056049667216648

Decision Tree

dt = DecisionTreeClassifier(random_state = 0)
dt.fit(X, y)

dt_score = mean(get_score(dt, X, y, 'precision'))
print(dt_score)
0.6321141459006742
dt.get_n_leaves(), len(X)
(825, 3128)
# Let's add some constraints to our tree to prevent overfitting
dt = DecisionTreeClassifier(random_state = 0, min_samples_leaf = 25)
dt.fit(X, y)

dt_score = mean(get_score(dt, X, y, 'precision'))
print(dt_score)
0.6282740022143144
dt.get_n_leaves(), len(X)
(83, 3128)

Feature Importance

Feature importance gives us a good idea of which features are most useful to the tree when it splits nodes to improve performance.

def feat_importance(m, df):
    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
                       ).sort_values('imp', ascending=False)
fi = feat_importance(dt, X)
fi[:10]
| | cols | imp |
|---|---|---|
| 6 | Theatre_max | 0.528387 |
| 5 | max_standing | 0.241195 |
| 7 | Promoted / ticketed events | 0.070997 |
| 1 | Venue provides alcohol | 0.061862 |
| 0 | Loud music / events | 0.045346 |
| 4 | U-Shaped_max | 0.041549 |
| 2 | Wi-Fi | 0.005378 |
| 3 | supervenue | 0.005286 |
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(fi)
<AxesSubplot:ylabel='cols'>

[Figure: decision tree feature importances]

# Let's try removing the features with low importance scores, as they may not be that relevant to our predictions.
to_keep = fi[fi.imp>0.01].cols
X_imp_dt = X[to_keep]
dt_imp = DecisionTreeClassifier(random_state = 0, min_samples_leaf = 25)
dt_imp.fit(X_imp_dt, y)

dt_score = mean(get_score(dt_imp, X_imp_dt, y, 'precision'))
print(dt_score)
0.6281904981143176
fi = feat_importance(dt_imp, X_imp_dt)
fi[:10]
| | cols | imp |
|---|---|---|
| 0 | Theatre_max | 0.532250 |
| 1 | max_standing | 0.243995 |
| 2 | Promoted / ticketed events | 0.072290 |
| 3 | Venue provides alcohol | 0.062989 |
| 4 | Loud music / events | 0.046171 |
| 5 | U-Shaped_max | 0.042305 |

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 500, min_samples_leaf = 0.1, random_state = 0)
rf.fit(X, y)

rf_score = mean(get_score(rf, X, y, 'precision'))
print(rf_score)
0.6546381919484513
rf_fi = feat_importance(rf, X)
rf_fi[:10]
| | cols | imp |
|---|---|---|
| 5 | max_standing | 0.289665 |
| 6 | Theatre_max | 0.217849 |
| 7 | Promoted / ticketed events | 0.212952 |
| 1 | Venue provides alcohol | 0.174762 |
| 4 | U-Shaped_max | 0.067801 |
| 0 | Loud music / events | 0.036971 |
| 2 | Wi-Fi | 0.000000 |
| 3 | supervenue | 0.000000 |

Let’s remove Wi-Fi and supervenue since they contribute nothing to the random forest.

rf_to_keep = rf_fi[rf_fi.imp>0.05].cols
X_imp_rf = X[rf_to_keep]
rf_imp = RandomForestClassifier(n_estimators = 500, min_samples_leaf = 0.1, random_state = 0)
rf_imp.fit(X_imp_rf, y)

rf_score = mean(get_score(rf_imp, X_imp_rf, y, 'precision'))
print(rf_score)
0.6496156104808377
rf_fi = feat_importance(rf_imp, X_imp_rf)
rf_fi[:10]
| | cols | imp |
|---|---|---|
| 0 | max_standing | 0.342606 |
| 1 | Theatre_max | 0.287633 |
| 2 | Promoted / ticketed events | 0.203867 |
| 3 | Venue provides alcohol | 0.136621 |
| 4 | U-Shaped_max | 0.029273 |

Boosting

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
adb = AdaBoostClassifier(base_estimator = dt, n_estimators = 400, random_state = 0)

ab_score = mean(get_score(adb, X, y, 'precision'))
print(ab_score)
0.6279814689432012
# Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators = 400, random_state = 0)

gb_score = mean(get_score(gbc, X, y, 'precision'))
print(gb_score)
0.638820254532681
# Stochastic Gradient Boosting
sgb = GradientBoostingClassifier(n_estimators = 400, subsample = 0.8, max_features = 0.2, random_state = 0)
sgb.fit(X, y)

sgb_score = mean(get_score(sgb, X, y, 'precision'))
print(sgb_score)
0.6410017296544348
fi = feat_importance(sgb, X)
fi[:10]
| | cols | imp |
|---|---|---|
| 6 | Theatre_max | 0.329313 |
| 5 | max_standing | 0.257683 |
| 4 | U-Shaped_max | 0.164669 |
| 7 | Promoted / ticketed events | 0.083803 |
| 1 | Venue provides alcohol | 0.065942 |
| 3 | supervenue | 0.036859 |
| 2 | Wi-Fi | 0.033934 |
| 0 | Loud music / events | 0.027796 |

It seems that all of the features are relevant when using gradient boosting, so we won’t remove any of them as we did for the decision tree and random forest classifiers, since doing so would only lower performance.

KNN

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)

knn_score = mean(get_score(knn, X, y, 'precision'))
print(knn_score)
0.6198340928379169
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)

scale_knn_score = mean(get_score(knn, X_scale, y, 'precision'))
print(scale_knn_score)
0.6327969183534807
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)

norm_knn_score = mean(get_score(knn, X_norm, y, 'precision'))
print(norm_knn_score)
0.6317485643334988

SVM

from sklearn.svm import SVC, LinearSVC
svc = SVC(random_state = 0)

svc_score = mean(get_score(svc, X, y, 'precision'))
print(svc_score)
0.6269802103034781
from sklearn.svm import SVC, LinearSVC
svc = SVC(random_state = 0)

scale_svc_score = mean(get_score(svc, X_scale, y, 'precision'))
print(scale_svc_score)
0.6180157484965274
from sklearn.svm import SVC, LinearSVC
svc = SVC(random_state = 0)

norm_svc_score = mean(get_score(svc, X_norm, y, 'precision'))
print(norm_svc_score)
0.6227870276502987
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 
              'Ada Boost', 'Gradient Boosting', 'Stochastic Gradient Boosting', 
              'Scaled KNN', 'SVM'],
    'Precision Score': [logreg_score, dt_score , rf_score, 
              ab_score, gb_score, sgb_score, 
              scale_knn_score, svc_score]})
models.sort_values(by='Precision Score', ascending=False)
| | Model | Precision Score |
|---|---|---|
| 2 | Random Forest | 0.649616 |
| 5 | Stochastic Gradient Boosting | 0.641002 |
| 4 | Gradient Boosting | 0.638820 |
| 6 | Scaled KNN | 0.632797 |
| 1 | Decision Tree | 0.628190 |
| 3 | Ada Boost | 0.627981 |
| 7 | SVM | 0.626980 |
| 0 | Logistic Regression | 0.608409 |

Based on these comparisons, tree-based models such as Random Forest and Gradient Boosting yield the highest precision scores. Now let’s take our two best models and try to optimize them with some hyperparameter tuning.

Hyperparameter Tuning

Since this is a fairly small dataset we won’t use any advanced informed-search algorithms such as Bayesian optimization or genetic algorithms. We’ll simply use the trusty old grid search and randomized search.

# Import the necessary module.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import validation_curve

Random Forest

Random forests are quite resilient to hyperparameter choices and should not overfit even with a large number of trees, since the trees are independent of one another. To tune the model I first run a randomized search (RandomizedSearchCV) to get an estimate of the optimal hyperparameters, then narrow down the range of values for each hyperparameter and run a grid search (GridSearchCV).

rs_param_grid = {
    "n_estimators": list((range(300, 500))),
    "max_depth": list((range(4, 20, 2))),
    "min_samples_leaf": list((range(4, 16, 2))),
    "min_samples_split": list((range(10, 50, 5))),
    "max_features": ['auto', 'sqrt']
}

rf = RandomForestClassifier(random_state = 0)

rf_rs = RandomizedSearchCV(
    estimator=rf,
    param_distributions=rs_param_grid,
    cv=cv,  # Number of folds
    n_iter=100,  # Number of parameter candidate settings to sample
    verbose=0,  # The higher this is, the more messages are printed
    scoring="precision",  # Metric to evaluate performance
    random_state=0,
    n_jobs= -1
)

rf_rs.fit(X, y)
RandomizedSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
                   estimator=RandomForestClassifier(random_state=0), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'max_depth': [4, 6, 8, 10, 12, 14, 16,
                                                      18],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [4, 6, 8, 10, 12,
                                                             14],
                                        'min_samples_split': [10, 15, 20, 25,
                                                              30, 35, 40, 45],
                                        'n_estimators': [300, 301, 302, 303,
                                                         304, 305, 306, 307,
                                                         308, 309, 310, 311,
                                                         312, 313, 314, 315,
                                                         316, 317, 318, 319,
                                                         320, 321, 322, 323,
                                                         324, 325, 326, 327,
                                                         328, 329, ...]},
                   random_state=0, scoring='precision')
rf_rs.best_params_, rf_rs.best_score_
({'n_estimators': 303,
  'min_samples_split': 30,
  'min_samples_leaf': 4,
  'max_features': 'sqrt',
  'max_depth': 16},
 0.6592529494144459)
rs_param_grid = {
    "n_estimators": list((range(200, 450, 50))),
    "max_depth": list((range(10, 22, 2))),
    "min_samples_leaf": list((range(2, 14, 2))),
    "min_samples_split": list((range(10, 50, 5))),
    "max_features": ['sqrt']
}

rf = RandomForestClassifier(random_state = 0)

rf_rs = GridSearchCV(
    estimator=rf,
    param_grid=rs_param_grid,
    cv=cv,  # Number of folds 
    verbose=0,  # The higher this is, the more messages are printed
    scoring="precision",  # Metric to evaluate performance
    n_jobs= -1
)

rf_rs.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [10, 12, 14, 16, 18, 20],
                         'max_features': ['sqrt'],
                         'min_samples_leaf': [2, 4, 6, 8, 10, 12],
                         'min_samples_split': [10, 15, 20, 25, 30, 35, 40, 45],
                         'n_estimators': [200, 250, 300, 350, 400]},
             scoring='precision')
rf_rs.best_params_, rf_rs.best_score_
({'max_depth': 16,
  'max_features': 'sqrt',
  'min_samples_leaf': 2,
  'min_samples_split': 25,
  'n_estimators': 300},
 0.6656854437117669)
rf_tuned = RandomForestClassifier(n_estimators= 300, min_samples_split= 25, min_samples_leaf = 2, max_features= 'sqrt', max_depth= 16, random_state=0)
rf_tuned.fit(X, y)

rf_prec_score = mean(get_score(rf_tuned, X, y, 'precision'))
rf_acc_score = mean(get_score(rf_tuned, X, y, 'accuracy'))
print("Precision: {}, Accuracy: {}".format(rf_prec_score, rf_acc_score))
Precision: 0.6656854437117667, Accuracy: 0.6704030670926517
rf_fi = feat_importance(rf_tuned, X)
rf_fi[:10]
| | cols | imp |
|---|---|---|
| 6 | Theatre_max | 0.337579 |
| 5 | max_standing | 0.302497 |
| 4 | U-Shaped_max | 0.137787 |
| 7 | Promoted / ticketed events | 0.072131 |
| 1 | Venue provides alcohol | 0.061466 |
| 0 | Loud music / events | 0.030885 |
| 3 | supervenue | 0.029704 |
| 2 | Wi-Fi | 0.027951 |
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(rf_fi)
<AxesSubplot:ylabel='cols'>

[Figure: feature importances of the tuned random forest]

Stochastic Gradient Boosting

Optimizing an SGB model is trickier than tuning a random forest: it is far more sensitive to the choice of hyperparameters, and nothing stops it from overfitting as the number of trees increases. The following steps are based on a useful article that gives a comprehensive guide to tuning a gradient boosting model.

param_test1 = {'n_estimators':range(20,110,10)}
sgb = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=30, min_samples_leaf=4, max_depth=8, max_features='sqrt', subsample=0.8, random_state=0)
gsearch1 = GridSearchCV(estimator = sgb , param_grid = param_test1, scoring='precision', n_jobs=-1, cv=cv)
gsearch1.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=GradientBoostingClassifier(max_depth=8,
                                                  max_features='sqrt',
                                                  min_samples_leaf=4,
                                                  min_samples_split=30,
                                                  random_state=0,
                                                  subsample=0.8),
             n_jobs=-1, param_grid={'n_estimators': range(20, 110, 10)},
             scoring='precision')
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
({'mean_fit_time': array([0.12240252, 0.18079839, 0.22800064, 0.29360037, 0.31699891,
         0.42240057, 0.47680054, 0.53859844, 0.50699825]),
  'std_fit_time': array([0.00831587, 0.00875006, 0.00384716, 0.02402232, 0.02019837,
         0.02303587, 0.01146052, 0.02033394, 0.03014674]),
  'mean_score_time': array([0.02039647, 0.00740061, 0.00860014, 0.0087996 , 0.0084013 ,
         0.00959821, 0.00859933, 0.00939989, 0.0066009 ]),
  'std_score_time': array([0.01032626, 0.00135684, 0.00206077, 0.00222705, 0.00102076,
         0.00320091, 0.0008004 , 0.00101937, 0.00101976]),
  'param_n_estimators': masked_array(data=[20, 30, 40, 50, 60, 70, 80, 90, 100],
               mask=[False, False, False, False, False, False, False, False,
                     False],
         fill_value='?',
              dtype=object),
  'params': [{'n_estimators': 20},
   {'n_estimators': 30},
   {'n_estimators': 40},
   {'n_estimators': 50},
   {'n_estimators': 60},
   {'n_estimators': 70},
   {'n_estimators': 80},
   {'n_estimators': 90},
   {'n_estimators': 100}],
  'split0_test_score': array([0.67630058, 0.66959064, 0.66951567, 0.66666667, 0.67422096,
         0.68091168, 0.67705382, 0.67323944, 0.66946779]),
  'split1_test_score': array([0.63380282, 0.63043478, 0.62933333, 0.63487738, 0.6398892 ,
         0.63858696, 0.6344086 , 0.6284153 , 0.62942779]),
  'split2_test_score': array([0.64179104, 0.64371257, 0.63988095, 0.64477612, 0.64011799,
         0.64094955, 0.64450867, 0.64222874, 0.64431487]),
  'split3_test_score': array([0.69496855, 0.70253165, 0.69811321, 0.69811321, 0.69349845,
         0.6863354 , 0.68965517, 0.68847352, 0.6875    ]),
  'split4_test_score': array([0.60422961, 0.61261261, 0.6105919 , 0.6125    , 0.60625   ,
         0.609375  , 0.61370717, 0.61419753, 0.60869565]),
  'mean_test_score': array([0.65021852, 0.65177645, 0.64948701, 0.65138668, 0.65079532,
         0.65123172, 0.65186669, 0.64931091, 0.64788122]),
  'std_test_score': array([0.03205719, 0.03145706, 0.03090932, 0.02913855, 0.03029704,
         0.02874288, 0.02784757, 0.02766231, 0.02801565]),
  'rank_test_score': array([6, 2, 7, 3, 5, 4, 1, 8, 9])},
 {'n_estimators': 80},
 0.6518666869112405)
param_test2 = {'max_depth':range(4,14,2), 'min_samples_split':range(5, 35, 5)}
sgb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_features='sqrt', subsample=0.8, random_state=0)
gsearch2 = GridSearchCV(estimator = sgb, param_grid = param_test2, scoring='precision', n_jobs=-1, cv=cv)
gsearch2.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=GradientBoostingClassifier(max_features='sqrt',
                                                  n_estimators=80,
                                                  random_state=0,
                                                  subsample=0.8),
             n_jobs=-1,
             param_grid={'max_depth': range(4, 14, 2),
                         'min_samples_split': range(5, 35, 5)},
             scoring='precision')
gsearch2.best_params_, gsearch2.best_score_
({'max_depth': 10, 'min_samples_split': 20}, 0.6539445149040549)
param_test3 = {'max_depth':range(6,16,2), 'min_samples_split':range(5, 35, 5), 'min_samples_leaf':range(2, 20, 2)}
sgb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_features='sqrt', subsample=0.8, random_state=0)
gsearch3 = GridSearchCV(estimator = sgb, param_grid = param_test3, scoring='precision', n_jobs=-1, cv=cv)
gsearch3.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=GradientBoostingClassifier(max_features='sqrt',
                                                  n_estimators=80,
                                                  random_state=0,
                                                  subsample=0.8),
             n_jobs=-1,
             param_grid={'max_depth': range(6, 16, 2),
                         'min_samples_leaf': range(2, 20, 2),
                         'min_samples_split': range(5, 35, 5)},
             scoring='precision')
gsearch3.best_params_, gsearch3.best_score_
({'max_depth': 14, 'min_samples_leaf': 18, 'min_samples_split': 5},
 0.6587001254347038)
param_test4 = {"max_features": range(1, 9, 1)}
sgb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_depth=14, min_samples_leaf=18, min_samples_split=5, subsample=0.8, random_state=0)
gsearch4 = GridSearchCV(estimator = sgb, param_grid = param_test4, scoring='precision', n_jobs=-1, cv=cv)
gsearch4.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=GradientBoostingClassifier(max_depth=14,
                                                  min_samples_leaf=18,
                                                  min_samples_split=5,
                                                  n_estimators=80,
                                                  random_state=0,
                                                  subsample=0.8),
             n_jobs=-1, param_grid={'max_features': range(1, 9)},
             scoring='precision')
gsearch4.best_params_, gsearch4.best_score_
({'max_features': 2}, 0.6587001254347038)
param_test5 = {"subsample": np.arange(0.6, 0.9, 0.05)}
sgb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_depth=14, min_samples_leaf=18, min_samples_split=5, max_features = 2, random_state=0)
gsearch5 = GridSearchCV(estimator = sgb, param_grid = param_test5, scoring='precision', n_jobs=-1, cv=cv)
gsearch5.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=GradientBoostingClassifier(max_depth=14, max_features=2,
                                                  min_samples_leaf=18,
                                                  min_samples_split=5,
                                                  n_estimators=80,
                                                  random_state=0),
             n_jobs=-1,
             param_grid={'subsample': array([0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 ])},
             scoring='precision')
gsearch5.best_params_, gsearch5.best_score_
({'subsample': 0.8000000000000002}, 0.6587001254347038)
sgb_tuned = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_depth=14, min_samples_leaf=18, min_samples_split=5, max_features = 2, subsample=0.8, random_state=0)
sgb_tuned.fit(X, y)

sgb_prec_score = mean(get_score(sgb_tuned, X, y, 'precision'))
sgb_acc_score = mean(get_score(sgb_tuned, X, y, 'accuracy'))
print("Precision: {}, Accuracy: {}".format(sgb_prec_score, sgb_acc_score))
Precision: 0.6587001254347037, Accuracy: 0.669128178913738
sgb_fi = feat_importance(sgb_tuned, X)
sgb_fi[:10]
| | cols | imp |
|---|---|---|
| 5 | max_standing | 0.329886 |
| 6 | Theatre_max | 0.321755 |
| 4 | U-Shaped_max | 0.155595 |
| 7 | Promoted / ticketed events | 0.066759 |
| 1 | Venue provides alcohol | 0.049638 |
| 0 | Loud music / events | 0.032210 |
| 2 | Wi-Fi | 0.024035 |
| 3 | supervenue | 0.020122 |
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(sgb_fi)
<AxesSubplot:ylabel='cols'>

[Figure: feature importances of the tuned stochastic gradient boosting model]

result = pd.DataFrame({
    'Model': ['Fine-tuned Random Forest', 'Fine-tuned Stochastic Gradient Boosting'],
    'Precision Score': [rf_prec_score, sgb_prec_score],
    'Accuracy': [rf_acc_score,sgb_acc_score]})
    
result
| | Model | Precision Score | Accuracy |
|---|---|---|---|
| 0 | Fine-tuned Random Forest | 0.665685 | 0.670403 |
| 1 | Fine-tuned Stochastic Gradient Boosting | 0.658700 | 0.669128 |

The Random Forest model yields the highest precision and accuracy scores, so let’s pick it as our final model.

Final Evaluation

For the final evaluation we will use our fine-tuned Random Forest model to make predictions on the test set we set aside at the start.

df_test.head(5)
| | venue_name | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events | Wheelchair accessible |
|---|---|---|---|---|---|---|---|---|---|---|
| 3538 | the great hall and chambers leyton | False | 1 | False | False | 35.045455 | 80 | 112.715867 | False | True |
| 192 | dock street studios | True | 0 | False | False | 35.045455 | 15 | 112.715867 | False | False |
| 2065 | clayton crown hotel | False | 1 | True | False | 80.000000 | 380 | 400.000000 | False | True |
| 2490 | techspace aldgate east | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | True |
| 598 | the long acre | True | 1 | True | False | 35.045455 | 200 | 112.715867 | False | False |
X_test = df_test.iloc[:, 1:-1]
y_test = df_test[target].replace({False: 1, True: 0})
# Predict our test set using our trained Random Forest model.
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

rf_tuned = RandomForestClassifier(n_estimators= 300, min_samples_split= 25, min_samples_leaf = 2, max_features= 'sqrt', max_depth= 16, random_state=0)
rf_tuned.fit(X, y)

y_pred = rf_tuned.predict(X_test)
print("Precision:", precision_score(y_test, y_pred), ", Accuracy:", accuracy_score(y_test, y_pred))
Precision: 0.6577669902912622 , Accuracy: 0.6636828644501279

As we can see, the model achieves decent performance on the test set, with only a slight drop from the training scores, which is to be expected. This means the model generalizes well to data it has not seen before. The test precision of roughly 66% comes in right around the two-thirds target from the original brief (ideally, at least two-thirds of venues predicted to be without a ramp should indeed not have one).
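
Since confusion_matrix and classification_report were already imported at the top of the notebook, the test-set result can be broken down a little further; a quick sketch of that check (class 1 is the venue-without-a-ramp class):

# Breakdown of the test-set predictions; the precision of class 1 ("no ramp")
# is the number that matters for the two-thirds requirement.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['has ramp', 'no ramp']))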

Outcome

Future Works