Wheelchair Ramps Classification

On 25 April I took the Data Science Professional certification from DataCamp. After passing several skill assessments in Python, Statistics, SQL, and Machine Learning, along with a coding challenge in data management and exploratory analysis using Python, I was given this case study as my final project, with 24 hours to complete it and present my findings in a video. I’ve made some improvements after submitting the project, particularly in the hyperparameter tuning section.

Problem Statement

Congratulations, you have landed your first job as a data scientist at National Accessibility! National Accessibility currently installs wheelchair ramps for office buildings and schools. However, the marketing manager wants the company to start installing ramps for event venues as well. According to a new survey, approximately 40% of event venues are not wheelchair accessible. However, it is not easy to know whether a venue already has a ramp installed.

The marketing manager would like to know whether you can develop a model to predict whether an event venue has a wheelchair ramp. To help you with this, he has provided you with a dataset of London venues. This data includes whether the venue has a ramp.

It is a waste of time to contact venues that already have a ramp installed, and it also looks bad for the company. Therefore, it is especially important to exclude locations that already have a ramp. Ideally, at least two-thirds of venues predicted to be without a ramp should not have a ramp.

The data you will use for this analysis can be accessed here: "data/event_venues.csv"

Load Data

# Use this cell to begin, and add as many cells as you need to complete your analysis!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from statistics import mean

plt.style.use('seaborn')
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
df = pd.read_csv("data/event_venues.csv")
df.head(10)
| | venue_name | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events | Wheelchair accessible |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | techspace aldgate east | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | False |
| 1 | green rooms hotel | True | 1 | True | False | 40.000000 | 120 | 80.000000 | True | False |
| 2 | 148 leadenhall street | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | False |
| 3 | conway hall | False | 0 | True | False | 35.045455 | 60 | 60.000000 | False | False |
| 4 | gridiron building | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | False |
| 5 | kimpton fitzroy london | True | 1 | True | False | 6.000000 | 0 | 112.715867 | True | False |
| 6 | lloyds avenue | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | False |
| 7 | public space \| members-style bar & dining | True | 1 | True | False | 35.045455 | 200 | 112.715867 | False | False |
| 8 | 16 old queen street | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | False |
| 9 | siorai bar | True | 1 | True | False | 35.045455 | 180 | 20.000000 | True | False |
# No missing values
df.isnull().sum()
venue_name                    0
Loud music / events           0
Venue provides alcohol        0
Wi-Fi                         0
supervenue                    0
U-Shaped_max                  0
max_standing                  0
Theatre_max                   0
Promoted / ticketed events    0
Wheelchair accessible         0
dtype: int64
df['venue_name'] = df['venue_name'].astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3910 entries, 0 to 3909
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   venue_name                  3910 non-null   category
 1   Loud music / events         3910 non-null   bool    
 2   Venue provides alcohol      3910 non-null   int64   
 3   Wi-Fi                       3910 non-null   bool    
 4   supervenue                  3910 non-null   bool    
 5   U-Shaped_max                3910 non-null   float64 
 6   max_standing                3910 non-null   int64   
 7   Theatre_max                 3910 non-null   float64 
 8   Promoted / ticketed events  3910 non-null   bool    
 9   Wheelchair accessible       3910 non-null   bool    
dtypes: bool(5), category(1), float64(2), int64(2)
memory usage: 192.6 KB
len(df)
3910

Before we do anything else, let’s split our data into training and testing sets. We will leave the test data alone until the very end, once we’ve settled on the model we think is best. This prevents overfitting to the test data and gives an unbiased view of the model’s ability to generalize.

df_train, df_test = train_test_split(df, test_size=0.2, random_state = 0)
# Work on explicit copies so later column assignments don't raise SettingWithCopyWarning
df_train, df_test = df_train.copy(), df_test.copy()
target = 'Wheelchair accessible'

EDA

# First let's try to spot any potential outliers in our numerical data using histograms and summary statistics.
df_train.hist(figsize=(8,8), xrot=45, bins=20)
plt.show()

[Figure: histograms of the numeric features in the training set]

df_train.describe()
| | Venue provides alcohol | U-Shaped_max | max_standing | Theatre_max |
|---|---|---|---|---|
| count | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 |
| mean | 0.715473 | 34.469905 | 111.485934 | 111.876338 |
| std | 0.451261 | 20.041665 | 249.709427 | 119.055118 |
| min | 0.000000 | 2.000000 | 0.000000 | 2.000000 |
| 25% | 0.000000 | 35.045455 | 0.000000 | 80.000000 |
| 50% | 1.000000 | 35.045455 | 50.000000 | 112.715867 |
| 75% | 1.000000 | 35.045455 | 120.000000 | 112.715867 |
| max | 1.000000 | 900.000000 | 7500.000000 | 2500.000000 |

Looking at the summary statistics, we can see that U-Shaped_max, max_standing, and Theatre_max are spread over very wide ranges, which is one consideration for whether we should perform feature scaling. The maximum values of these features are also far above their means, which could suggest outliers. However, calling a value an outlier requires a good understanding of the data source, and in this case I would say these are not outliers: it is plausible that some venues simply have a far larger capacity than the rest.
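
To back up that visual impression with numbers, here is a small check added after the fact (not part of the original submission) that counts how many values sit beyond the usual 1.5 × IQR fence for each capacity feature:

# Rough IQR rule (added for illustration): count points far above the upper quartile
for col in ['U-Shaped_max', 'max_standing', 'Theatre_max']:
    q1, q3 = df_train[col].quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    n_extreme = (df_train[col] > upper_fence).sum()
    print(f"{col}: {n_extreme} values above {upper_fence:.1f}")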

Class Imbalance?

# Let's check if there is any imbalances in our target variable
sns.countplot(x = target, data = df_train)
<AxesSubplot:xlabel='Wheelchair accessible', ylabel='count'>

[Figure: count plot of the Wheelchair accessible classes]

We can see that the number of venues accessible to wheelchairs and the number that are not are fairly balanced, so we don’t have to perform any resampling.
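
For a concrete figure rather than a plot, the class proportions can also be printed directly; this is a small check added after the fact, not part of the original submission:

# Absolute counts and proportions of the target classes in the training set
print(df_train[target].value_counts())
print(df_train[target].value_counts(normalize=True).round(3))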

Segment and group by the target feature

df_train[target] = df_train[target].astype('category')
num_cols = ['U-Shaped_max', 'max_standing', 'Theatre_max']
for col in num_cols:
    ax = sns.boxplot(y = target, x = col, data=df_train)
    ax.set_xlim(0,500)
    plt.show()

[Figures: box plots of U-Shaped_max, max_standing and Theatre_max, split by wheelchair accessibility]

From the box plots we can see that there are many so-called outliers in our numerical features, and that the distributions of Theatre_max and U-Shaped_max are widely dispersed. More importantly, at a glance, venues with larger seated and standing capacities tend to be wheelchair accessible. Keeping this in mind, these capacity features might be important for predicting wheelchair accessibility.
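
To put a number on that impression, the medians of the capacity features can be compared across the two classes; a small sketch added after the fact (not in the original notebook):

# Median capacity per target class; accessible venues should show larger typical capacities
print(df_train.groupby(target)[num_cols].median())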

Segment Categorical features by the target classes

categorical = ['Venue provides alcohol', 'Loud music / events', 'Wi-Fi', 'supervenue', 'Promoted / ticketed events']
for col in categorical:
    g = sns.catplot(x = col, kind='count', col = target, data=df_train, sharey=False)

[Figures: count plots of each categorical feature, split by wheelchair accessibility]

We can see some differences in how the target variable is distributed across venues that provide alcohol and venues that host promoted/ticketed events: venues that host promoted/ticketed events tend to be wheelchair accessible, while venues that don’t provide alcohol tend not to be. This suggests these features might be good indicators of the target.
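
The same pattern can be read off numerically with a row-normalized cross-tabulation; again a sketch added after the fact, not part of the original notebook:

# Share of wheelchair-accessible venues within each level of two candidate features
for col in ['Venue provides alcohol', 'Promoted / ticketed events']:
    print(pd.crosstab(df_train[col], df_train[target], normalize='index').round(2), "\n")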

Correlation Matrix

df_train[target] = df_train[target].astype('bool')
corr = df_train.corr()
plt.figure(figsize=(6,6))
sns.heatmap(corr, cmap='RdBu_r', annot=True, vmax=1, vmin=-1)
plt.show()

[Figure: correlation heatmap of the training features]

From the correlation matrix we can see that there isn’t any strong positive or negative linear correlation between the features and our target variable, Wheelchair accessible.

Feature Pre-Processing

df_train.dtypes
venue_name                    category
Loud music / events               bool
Venue provides alcohol           int64
Wi-Fi                             bool
supervenue                        bool
U-Shaped_max                   float64
max_standing                     int64
Theatre_max                    float64
Promoted / ticketed events        bool
Wheelchair accessible             bool
dtype: object
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
X = df_train.iloc[:, 1:-1]
y = df_train[target].replace({False: 1, True: 0})

You might be wondering why I mapped the False target value, which indicates a venue that is not wheelchair accessible, to 1 instead of 0. The reason is simple: one of the main requirements the marketing manager gave for this project is to minimize predicting a venue as lacking a ramp when it in fact has one, which means we want to optimize precision for the False class. To do that, we make False the positive class (denoted as 1), since scikit-learn’s built-in 'precision' scoring computes precision for the positive class.
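
Relabelling the target is one way to do this. An equivalent alternative, shown here only as a sketch (it is not what the notebook uses), is to keep the original boolean labels and tell scikit-learn which class to treat as positive via make_scorer:

# Sketch of an alternative: keep the boolean target and score precision of the False ("no ramp") class
from sklearn.metrics import make_scorer, precision_score

precision_no_ramp = make_scorer(precision_score, pos_label=False)
# usage: cross_val_score(model, X, df_train[target], scoring=precision_no_ramp, cv=cv)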

Scaling

As we saw in the exploratory analysis, the features in our data have very different scales and show signs of so-called “outliers”. This can hurt the performance of distance-based algorithms (KNN, SVM) and of models fitted by gradient-based optimization (such as logistic regression). To address this it is worth scaling the data before feeding it to those models. I will try two scaling methods: StandardScaler and MinMaxScaler. Standard scaling centers each feature and rescales it to unit standard deviation, which limits how much a handful of extreme values can dominate a feature’s scale. MinMaxScaler, on the other hand, does nothing to dampen outliers; it is less disruptive to the original information in the data and simply rescales each feature to a default range of 0 to 1.

We will simply try both of these scaling methods and see which yields the best results for our models.

# Standard Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)

# Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
X_scale = pd.DataFrame(X_scale, index = X.index, columns = X.columns)
X_scale.head(3)
| | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events |
|---|---|---|---|---|---|---|---|---|
| 788 | -0.752041 | -1.585751 | 0.270315 | -0.263442 | 0.028722 | -0.446534 | 0.007053 | -0.788928 |
| 1101 | -0.752041 | -1.585751 | 0.270315 | -0.263442 | -0.223066 | -0.326375 | -0.687828 | -0.788928 |
| 519 | -0.752041 | -1.585751 | 0.270315 | 3.795901 | 0.028722 | -0.406481 | 0.007053 | -0.788928 |
X_scale.describe()
| | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events |
|---|---|---|---|---|---|---|---|---|
| count | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 | 3.128000e+03 |
| mean | -2.661980e-17 | -1.171271e-18 | -3.510264e-17 | -2.413528e-16 | -1.001900e-15 | -2.259488e-16 | -1.323450e-15 | 1.426821e-17 |
| std | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 | 1.000160e+00 |
| min | -7.520409e-01 | -1.585751e+00 | -3.699385e+00 | -2.634420e-01 | -1.620379e+00 | -4.465340e-01 | -9.230507e-01 | -7.889275e-01 |
| 25% | -7.520409e-01 | -1.585751e+00 | 2.703152e-01 | -2.634420e-01 | 2.872223e-02 | -4.465340e-01 | -2.677872e-01 | -7.889275e-01 |
| 50% | -7.520409e-01 | 6.306160e-01 | 2.703152e-01 | -2.634420e-01 | 2.872223e-02 | -2.462693e-01 | 7.052730e-03 | -7.889275e-01 |
| 75% | 1.329715e+00 | 6.306160e-01 | 2.703152e-01 | -2.634420e-01 | 2.872223e-02 | 3.410135e-02 | 7.052730e-03 | 1.267544e+00 |
| max | 1.329715e+00 | 6.306160e-01 | 2.703152e-01 | 3.795901e+00 | 4.319344e+01 | 2.959318e+01 | 2.006218e+01 | 1.267544e+00 |
X_norm = pd.DataFrame(X_norm, index = X.index, columns = X.columns)
X_norm.head(3)
| | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events |
|---|---|---|---|---|---|---|---|---|
| 788 | 0.0 | 0.0 | 1.0 | 0.0 | 0.036799 | 0.000000 | 0.044322 | 0.0 |
| 1101 | 0.0 | 0.0 | 1.0 | 0.0 | 0.031180 | 0.004000 | 0.011209 | 0.0 |
| 519 | 0.0 | 0.0 | 1.0 | 1.0 | 0.036799 | 0.001333 | 0.044322 | 0.0 |
X_norm.describe()
| | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events |
|---|---|---|---|---|---|---|---|---|
| count | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 | 3128.000000 |
| mean | 0.361253 | 0.715473 | 0.931905 | 0.064898 | 0.036158 | 0.014865 | 0.043986 | 0.383632 |
| std | 0.480441 | 0.451261 | 0.251948 | 0.246385 | 0.022318 | 0.033295 | 0.047660 | 0.486348 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.036799 | 0.000000 | 0.031225 | 0.000000 |
| 50% | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.036799 | 0.006667 | 0.044322 | 0.000000 |
| 75% | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.036799 | 0.016000 | 0.044322 | 1.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |

Model Training

K-Fold Cross Validation

To evaluate model performance on the training set we will use K-Fold cross-validation, specifically 5-fold, since our data is quite limited. K-Fold validation gives a fairly accurate picture of a model’s performance and generalization ability, because the model is trained and validated on different parts of the data in turn, which works well for a smaller dataset.

cv = KFold(n_splits = 5, random_state = 0, shuffle=True)
def get_score(model, X, y, metric):
    return cross_val_score(model, X = X, y = y, scoring = metric, cv = cv, n_jobs = -1)

Next I’ll fit a range of different algorithms with their default parameters and see which one works best.

Logistic Regression

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 0)
lr.fit(X, y)

logreg_score = mean(get_score(lr, X, y, 'precision'))
print(logreg_score)
c:\Users\David\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.6084089246120791
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 0)
lr.fit(X_scale, y)

scale_logreg_score = mean(get_score(lr, X_scale, y, 'precision'))
print(scale_logreg_score)
0.6063055862836805
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 0)
lr.fit(X_norm, y)

norm_logreg_score = mean(get_score(lr, X_norm, y, 'precision'))
print(norm_logreg_score)
0.6056049667216648

Decision Tree

dt = DecisionTreeClassifier(random_state = 0)
dt.fit(X, y)

dt_score = mean(get_score(dt, X, y, 'precision'))
print(dt_score)
0.6321141459006742
dt.get_n_leaves(), len(X)
(825, 3128)
# Let's add some constraints to our tree to prevent overfitting
dt = DecisionTreeClassifier(random_state = 0, min_samples_leaf = 25)
dt.fit(X, y)

dt_score = mean(get_score(dt, X, y, 'precision'))
print(dt_score)
0.6282740022143144
dt.get_n_leaves(), len(X)
(83, 3128)

Feature Importance

Feature importance gives us a good idea of which features are most useful to the tree when it splits nodes to improve performance.

def feat_importance(m, df):
    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
                       ).sort_values('imp', ascending=False)
fi = feat_importance(dt, X)
fi[:10]
| | cols | imp |
|---|---|---|
| 6 | Theatre_max | 0.528387 |
| 5 | max_standing | 0.241195 |
| 7 | Promoted / ticketed events | 0.070997 |
| 1 | Venue provides alcohol | 0.061862 |
| 0 | Loud music / events | 0.045346 |
| 4 | U-Shaped_max | 0.041549 |
| 2 | Wi-Fi | 0.005378 |
| 3 | supervenue | 0.005286 |
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(fi)
<AxesSubplot:ylabel='cols'>

[Figure: decision tree feature importances]

# Let's try removing the features with low importance scores, as they may not be that relevant to our predictions.
to_keep = fi[fi.imp>0.01].cols
X_imp_dt = X[to_keep]
dt_imp = DecisionTreeClassifier(random_state = 0, min_samples_leaf = 25)
dt_imp.fit(X_imp_dt, y)

dt_score = mean(get_score(dt_imp, X_imp_dt, y, 'precision'))
print(dt_score)
0.6281904981143176
fi = feat_importance(dt_imp, X_imp_dt)
fi[:10]
| | cols | imp |
|---|---|---|
| 0 | Theatre_max | 0.532250 |
| 1 | max_standing | 0.243995 |
| 2 | Promoted / ticketed events | 0.072290 |
| 3 | Venue provides alcohol | 0.062989 |
| 4 | Loud music / events | 0.046171 |
| 5 | U-Shaped_max | 0.042305 |

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 500, min_samples_leaf = 0.1, random_state = 0)
rf.fit(X, y)

rf_score = mean(get_score(rf, X, y, 'precision'))
print(rf_score)
0.6546381919484513
rf_fi = feat_importance(rf, X)
rf_fi[:10]
| | cols | imp |
|---|---|---|
| 5 | max_standing | 0.289665 |
| 6 | Theatre_max | 0.217849 |
| 7 | Promoted / ticketed events | 0.212952 |
| 1 | Venue provides alcohol | 0.174762 |
| 4 | U-Shaped_max | 0.067801 |
| 0 | Loud music / events | 0.036971 |
| 2 | Wi-Fi | 0.000000 |
| 3 | supervenue | 0.000000 |

Let’s remove Wi-Fi and supervenue since they contribute nothing to the random forest.

rf_to_keep = rf_fi[rf_fi.imp>0.05].cols
X_imp_rf = X[rf_to_keep]
rf_imp = RandomForestClassifier(n_estimators = 500, min_samples_leaf = 0.1, random_state = 0)
rf_imp.fit(X_imp_rf, y)

rf_score = mean(get_score(rf_imp, X_imp_rf, y, 'precision'))
print(rf_score)
0.6496156104808377
rf_fi = feat_importance(rf_imp, X_imp_rf)
rf_fi[:10]
| | cols | imp |
|---|---|---|
| 0 | max_standing | 0.342606 |
| 1 | Theatre_max | 0.287633 |
| 2 | Promoted / ticketed events | 0.203867 |
| 3 | Venue provides alcohol | 0.136621 |
| 4 | U-Shaped_max | 0.029273 |

Boosting

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
adb = AdaBoostClassifier(base_estimator = dt, n_estimators = 400, random_state = 0)

ab_score = mean(get_score(adb, X, y, 'precision'))
print(ab_score)
0.6279814689432012
# Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators = 400, random_state = 0)

gb_score = mean(get_score(gbc, X, y, 'precision'))
print(gb_score)
0.638820254532681
# Stochastic Gradient Boosting
sgb = GradientBoostingClassifier(n_estimators = 400, subsample = 0.8, max_features = 0.2, random_state = 0)
sgb.fit(X, y)

sgb_score = mean(get_score(sgb, X, y, 'precision'))
print(sgb_score)
0.6410017296544348
fi = feat_importance(sgb, X)
fi[:10]
| | cols | imp |
|---|---|---|
| 6 | Theatre_max | 0.329313 |
| 5 | max_standing | 0.257683 |
| 4 | U-Shaped_max | 0.164669 |
| 7 | Promoted / ticketed events | 0.083803 |
| 1 | Venue provides alcohol | 0.065942 |
| 3 | supervenue | 0.036859 |
| 2 | Wi-Fi | 0.033934 |
| 0 | Loud music / events | 0.027796 |

It seems that all of the features are relevant when using gradient boosting, so we won’t remove any of them as we did for the decision tree and random forest classifiers, since doing so would only lower performance.

KNN

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)

knn_score = mean(get_score(knn, X, y, 'precision'))
print(knn_score)
0.6198340928379169
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)

scale_knn_score = mean(get_score(knn, X_scale, y, 'precision'))
print(scale_knn_score)
0.6327969183534807
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)

norm_knn_score = mean(get_score(knn, X_norm, y, 'precision'))
print(norm_knn_score)
0.6317485643334988

SVM

from sklearn.svm import SVC, LinearSVC
svc = SVC(random_state = 0)

svc_score = mean(get_score(svc, X, y, 'precision'))
print(svc_score)
0.6269802103034781
from sklearn.svm import SVC, LinearSVC
svc = SVC(random_state = 0)

scale_svc_score = mean(get_score(svc, X_scale, y, 'precision'))
print(scale_svc_score)
0.6180157484965274
from sklearn.svm import SVC, LinearSVC
svc = SVC(random_state = 0)

norm_svc_score = mean(get_score(svc, X_norm, y, 'precision'))
print(norm_svc_score)
0.6227870276502987
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 
              'Ada Boost', 'Gradient Boosting', 'Stochastic Gradient Boosting', 
              'Scaled KNN', 'SVM'],
    'Precision Score': [logreg_score, dt_score , rf_score, 
              ab_score, gb_score, sgb_score, 
              scale_knn_score, svc_score]})
models.sort_values(by='Precision Score', ascending=False)
| | Model | Precision Score |
|---|---|---|
| 2 | Random Forest | 0.649616 |
| 5 | Stochastic Gradient Boosting | 0.641002 |
| 4 | Gradient Boosting | 0.638820 |
| 6 | Scaled KNN | 0.632797 |
| 1 | Decision Tree | 0.628190 |
| 3 | Ada Boost | 0.627981 |
| 7 | SVM | 0.626980 |
| 0 | Logistic Regression | 0.608409 |

Based on these comparisons, tree-based models such as Random Forest and Gradient Boosting yield the highest precision scores. Now let’s take our two best models and try to optimize them with some hyperparameter tuning.

Hyperparameter Tuning

Since this is a fairly small dataset we won’t use any advanced informed-search algorithms such as Bayesian optimization or genetic algorithms. We’ll simply use the trusty old grid search and randomized search.

# Import the necessary module.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import validation_curve

Random Forest

Random forests are quite resilient to hyperparameter choices and should not overfit even with a large number of trees, since the trees are independent of one another. To tune the model I first run a randomized search (RandomizedSearchCV) to get an estimate of the optimal hyperparameters, then narrow down the range of values for each hyperparameter and run a grid search (GridSearchCV).

rs_param_grid = {
    "n_estimators": list((range(300, 500))),
    "max_depth": list((range(4, 20, 2))),
    "min_samples_leaf": list((range(4, 16, 2))),
    "min_samples_split": list((range(10, 50, 5))),
    "max_features": ['auto', 'sqrt']
}

rf = RandomForestClassifier(random_state = 0)

rf_rs = RandomizedSearchCV(
    estimator=rf,
    param_distributions=rs_param_grid,
    cv=cv,  # Number of folds
    n_iter=100,  # Number of parameter candidate settings to sample
    verbose=0,  # The higher this is, the more messages are printed
    scoring="precision",  # Metric to evaluate performance
    random_state=0,
    n_jobs= -1
)

rf_rs.fit(X, y)
RandomizedSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
                   estimator=RandomForestClassifier(random_state=0), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'max_depth': [4, 6, 8, 10, 12, 14, 16,
                                                      18],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [4, 6, 8, 10, 12,
                                                             14],
                                        'min_samples_split': [10, 15, 20, 25,
                                                              30, 35, 40, 45],
                                        'n_estimators': [300, 301, 302, 303,
                                                         304, 305, 306, 307,
                                                         308, 309, 310, 311,
                                                         312, 313, 314, 315,
                                                         316, 317, 318, 319,
                                                         320, 321, 322, 323,
                                                         324, 325, 326, 327,
                                                         328, 329, ...]},
                   random_state=0, scoring='precision')
rf_rs.best_params_, rf_rs.best_score_
({'n_estimators': 303,
  'min_samples_split': 30,
  'min_samples_leaf': 4,
  'max_features': 'sqrt',
  'max_depth': 16},
 0.6592529494144459)
rs_param_grid = {
    "n_estimators": list((range(200, 450, 50))),
    "max_depth": list((range(10, 22, 2))),
    "min_samples_leaf": list((range(2, 14, 2))),
    "min_samples_split": list((range(10, 50, 5))),
    "max_features": ['sqrt']
}

rf = RandomForestClassifier(random_state = 0)

rf_rs = GridSearchCV(
    estimator=rf,
    param_grid=rs_param_grid,
    cv=cv,  # Number of folds 
    verbose=0,  # The higher this is, the more messages are printed
    scoring="precision",  # Metric to evaluate performance
    n_jobs= -1
)

rf_rs.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [10, 12, 14, 16, 18, 20],
                         'max_features': ['sqrt'],
                         'min_samples_leaf': [2, 4, 6, 8, 10, 12],
                         'min_samples_split': [10, 15, 20, 25, 30, 35, 40, 45],
                         'n_estimators': [200, 250, 300, 350, 400]},
             scoring='precision')
rf_rs.best_params_, rf_rs.best_score_
({'max_depth': 16,
  'max_features': 'sqrt',
  'min_samples_leaf': 2,
  'min_samples_split': 25,
  'n_estimators': 300},
 0.6656854437117669)
rf_tuned = RandomForestClassifier(n_estimators= 300, min_samples_split= 25, min_samples_leaf = 2, max_features= 'sqrt', max_depth= 16, random_state=0)
rf_tuned.fit(X, y)

rf_prec_score = mean(get_score(rf_tuned, X, y, 'precision'))
rf_acc_score = mean(get_score(rf_tuned, X, y, 'accuracy'))
print("Precision: {}, Accuracy: {}".format(rf_prec_score, rf_acc_score))
Precision: 0.6656854437117667, Accuracy: 0.6704030670926517
rf_fi = feat_importance(rf_tuned, X)
rf_fi[:10]
| | cols | imp |
|---|---|---|
| 6 | Theatre_max | 0.337579 |
| 5 | max_standing | 0.302497 |
| 4 | U-Shaped_max | 0.137787 |
| 7 | Promoted / ticketed events | 0.072131 |
| 1 | Venue provides alcohol | 0.061466 |
| 0 | Loud music / events | 0.030885 |
| 3 | supervenue | 0.029704 |
| 2 | Wi-Fi | 0.027951 |
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(rf_fi)
<AxesSubplot:ylabel='cols'>

[Figure: feature importances of the tuned random forest]

Stochastic Gradient Boosting

Optimizing an SGB model is trickier than tuning a random forest: it is far more sensitive to the choice of hyperparameters, and nothing stops it from overfitting as the number of trees increases. The following steps are based on a useful article that gives a comprehensive guide to tuning a gradient boosting model.

param_test1 = {'n_estimators':range(20,110,10)}
sgb = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=30, min_samples_leaf=4, max_depth=8, max_features='sqrt', subsample=0.8, random_state=0)
gsearch1 = GridSearchCV(estimator = sgb , param_grid = param_test1, scoring='precision', n_jobs=-1, cv=cv)
gsearch1.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=GradientBoostingClassifier(max_depth=8,
                                                  max_features='sqrt',
                                                  min_samples_leaf=4,
                                                  min_samples_split=30,
                                                  random_state=0,
                                                  subsample=0.8),
             n_jobs=-1, param_grid={'n_estimators': range(20, 110, 10)},
             scoring='precision')
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
({'mean_fit_time': array([0.12240252, 0.18079839, 0.22800064, 0.29360037, 0.31699891,
         0.42240057, 0.47680054, 0.53859844, 0.50699825]),
  'std_fit_time': array([0.00831587, 0.00875006, 0.00384716, 0.02402232, 0.02019837,
         0.02303587, 0.01146052, 0.02033394, 0.03014674]),
  'mean_score_time': array([0.02039647, 0.00740061, 0.00860014, 0.0087996 , 0.0084013 ,
         0.00959821, 0.00859933, 0.00939989, 0.0066009 ]),
  'std_score_time': array([0.01032626, 0.00135684, 0.00206077, 0.00222705, 0.00102076,
         0.00320091, 0.0008004 , 0.00101937, 0.00101976]),
  'param_n_estimators': masked_array(data=[20, 30, 40, 50, 60, 70, 80, 90, 100],
               mask=[False, False, False, False, False, False, False, False,
                     False],
         fill_value='?',
              dtype=object),
  'params': [{'n_estimators': 20},
   {'n_estimators': 30},
   {'n_estimators': 40},
   {'n_estimators': 50},
   {'n_estimators': 60},
   {'n_estimators': 70},
   {'n_estimators': 80},
   {'n_estimators': 90},
   {'n_estimators': 100}],
  'split0_test_score': array([0.67630058, 0.66959064, 0.66951567, 0.66666667, 0.67422096,
         0.68091168, 0.67705382, 0.67323944, 0.66946779]),
  'split1_test_score': array([0.63380282, 0.63043478, 0.62933333, 0.63487738, 0.6398892 ,
         0.63858696, 0.6344086 , 0.6284153 , 0.62942779]),
  'split2_test_score': array([0.64179104, 0.64371257, 0.63988095, 0.64477612, 0.64011799,
         0.64094955, 0.64450867, 0.64222874, 0.64431487]),
  'split3_test_score': array([0.69496855, 0.70253165, 0.69811321, 0.69811321, 0.69349845,
         0.6863354 , 0.68965517, 0.68847352, 0.6875    ]),
  'split4_test_score': array([0.60422961, 0.61261261, 0.6105919 , 0.6125    , 0.60625   ,
         0.609375  , 0.61370717, 0.61419753, 0.60869565]),
  'mean_test_score': array([0.65021852, 0.65177645, 0.64948701, 0.65138668, 0.65079532,
         0.65123172, 0.65186669, 0.64931091, 0.64788122]),
  'std_test_score': array([0.03205719, 0.03145706, 0.03090932, 0.02913855, 0.03029704,
         0.02874288, 0.02784757, 0.02766231, 0.02801565]),
  'rank_test_score': array([6, 2, 7, 3, 5, 4, 1, 8, 9])},
 {'n_estimators': 80},
 0.6518666869112405)
param_test2 = {'max_depth':range(4,14,2), 'min_samples_split':range(5, 35, 5)}
sgb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_features='sqrt', subsample=0.8, random_state=0)
gsearch2 = GridSearchCV(estimator = sgb, param_grid = param_test2, scoring='precision', n_jobs=-1, cv=cv)
gsearch2.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=GradientBoostingClassifier(max_features='sqrt',
                                                  n_estimators=80,
                                                  random_state=0,
                                                  subsample=0.8),
             n_jobs=-1,
             param_grid={'max_depth': range(4, 14, 2),
                         'min_samples_split': range(5, 35, 5)},
             scoring='precision')
gsearch2.best_params_, gsearch2.best_score_
({'max_depth': 10, 'min_samples_split': 20}, 0.6539445149040549)
param_test3 = {'max_depth':range(6,16,2), 'min_samples_split':range(5, 35, 5), 'min_samples_leaf':range(2, 20, 2)}
sgb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_features='sqrt', subsample=0.8, random_state=0)
gsearch3 = GridSearchCV(estimator = sgb, param_grid = param_test3, scoring='precision', n_jobs=-1, cv=cv)
gsearch3.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=GradientBoostingClassifier(max_features='sqrt',
                                                  n_estimators=80,
                                                  random_state=0,
                                                  subsample=0.8),
             n_jobs=-1,
             param_grid={'max_depth': range(6, 16, 2),
                         'min_samples_leaf': range(2, 20, 2),
                         'min_samples_split': range(5, 35, 5)},
             scoring='precision')
gsearch3.best_params_, gsearch3.best_score_
({'max_depth': 14, 'min_samples_leaf': 18, 'min_samples_split': 5},
 0.6587001254347038)
param_test4 = {"max_features": range(1, 9, 1)}
sgb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_depth=14, min_samples_leaf=18, min_samples_split=5, subsample=0.8, random_state=0)
gsearch4 = GridSearchCV(estimator = sgb, param_grid = param_test4, scoring='precision', n_jobs=-1, cv=cv)
gsearch4.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=GradientBoostingClassifier(max_depth=14,
                                                  min_samples_leaf=18,
                                                  min_samples_split=5,
                                                  n_estimators=80,
                                                  random_state=0,
                                                  subsample=0.8),
             n_jobs=-1, param_grid={'max_features': range(1, 9)},
             scoring='precision')
gsearch4.best_params_, gsearch4.best_score_
({'max_features': 2}, 0.6587001254347038)
param_test5 = {"subsample": np.arange(0.6, 0.9, 0.05)}
sgb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_depth=14, min_samples_leaf=18, min_samples_split=5, max_features = 2, random_state=0)
gsearch5 = GridSearchCV(estimator = sgb, param_grid = param_test5, scoring='precision', n_jobs=-1, cv=cv)
gsearch5.fit(X, y)
GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=True),
             estimator=GradientBoostingClassifier(max_depth=14, max_features=2,
                                                  min_samples_leaf=18,
                                                  min_samples_split=5,
                                                  n_estimators=80,
                                                  random_state=0),
             n_jobs=-1,
             param_grid={'subsample': array([0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 ])},
             scoring='precision')
gsearch5.best_params_, gsearch5.best_score_
({'subsample': 0.8000000000000002}, 0.6587001254347038)
sgb_tuned = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_depth=14, min_samples_leaf=18, min_samples_split=5, max_features = 2, subsample=0.8, random_state=0)
sgb_tuned.fit(X, y)

sgb_prec_score = mean(get_score(sgb_tuned, X, y, 'precision'))
sgb_acc_score = mean(get_score(sgb_tuned, X, y, 'accuracy'))
print("Precision: {}, Accuracy: {}".format(sgb_prec_score, sgb_acc_score))
Precision: 0.6587001254347037, Accuracy: 0.669128178913738
sgb_fi = feat_importance(sgb_tuned, X)
sgb_fi[:10]
| | cols | imp |
|---|---|---|
| 5 | max_standing | 0.329886 |
| 6 | Theatre_max | 0.321755 |
| 4 | U-Shaped_max | 0.155595 |
| 7 | Promoted / ticketed events | 0.066759 |
| 1 | Venue provides alcohol | 0.049638 |
| 0 | Loud music / events | 0.032210 |
| 2 | Wi-Fi | 0.024035 |
| 3 | supervenue | 0.020122 |
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(sgb_fi)
<AxesSubplot:ylabel='cols'>

[Figure: feature importances of the tuned stochastic gradient boosting model]

result = pd.DataFrame({
    'Model': ['Fine-tuned Random Forest', 'Fine-tuned Stochastic Gradient Boosting'],
    'Precision Score': [rf_prec_score, sgb_prec_score],
    'Accuracy': [rf_acc_score,sgb_acc_score]})
    
result
| | Model | Precision Score | Accuracy |
|---|---|---|---|
| 0 | Fine-tuned Random Forest | 0.665685 | 0.670403 |
| 1 | Fine-tuned Stochastic Gradient Boosting | 0.658700 | 0.669128 |

The Random Forest model yields the highest precision and accuracy scores, so let’s pick it as our final model.

Final Evaluation

For the final evaluation we will use our fine-tuned Random Forest model to make predictions on the test set we set aside at the start.

df_test.head(5)
| | venue_name | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events | Wheelchair accessible |
|---|---|---|---|---|---|---|---|---|---|---|
| 3538 | the great hall and chambers leyton | False | 1 | False | False | 35.045455 | 80 | 112.715867 | False | True |
| 192 | dock street studios | True | 0 | False | False | 35.045455 | 15 | 112.715867 | False | False |
| 2065 | clayton crown hotel | False | 1 | True | False | 80.000000 | 380 | 400.000000 | False | True |
| 2490 | techspace aldgate east | False | 0 | True | False | 35.045455 | 0 | 112.715867 | False | True |
| 598 | the long acre | True | 1 | True | False | 35.045455 | 200 | 112.715867 | False | False |
X_test = df_test.iloc[:, 1:-1]
y_test = df_test[target].replace({False: 1, True: 0})
# Predict our test set using our trained Random Forest model.
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

rf_tuned = RandomForestClassifier(n_estimators= 300, min_samples_split= 25, min_samples_leaf = 2, max_features= 'sqrt', max_depth= 16, random_state=0)
rf_tuned.fit(X, y)

y_pred = rf_tuned.predict(X_test)
print("Precision:", precision_score(y_test, y_pred), ", Accuracy:", accuracy_score(y_test, y_pred))
Precision: 0.6577669902912622 , Accuracy: 0.6636828644501279

As we can see, the model achieves decent performance on the test set, with only a slight drop from the training scores, which is to be expected. This means the model generalizes well to data it has not seen before. The test precision of roughly 66% comes in right around the two-thirds target from the original brief (ideally, at least two-thirds of venues predicted to be without a ramp should indeed not have one).
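
Since confusion_matrix and classification_report were already imported at the top of the notebook, the test-set result can be broken down a little further; a quick sketch of that check (class 1 is the venue-without-a-ramp class):

# Breakdown of the test-set predictions; the precision of class 1 ("no ramp")
# is the number that matters for the two-thirds requirement.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['has ramp', 'no ramp']))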

Outcome

Future Works