Predicting No-Shows at Clinics
The goal of this assignment is to help you understand the concepts of classification through hands-on
experience training and applying Logistic Regression models.
You are given a dataset of information on patients who scheduled medical appointments at clinics and whether they
showed up for those appointments or not. Your task is to build a classification model that can predict the likelihood
that a given patient will not show up for a scheduled medical appointment.
The dataset contains the records of 300,000 patients of different ages and medical conditions. The data also
includes whether the patient showed up for the appointment or not. The columns in the dataset are as follows:
- Age: Patient Age
- Gender: Male (M) or Female (F)
- DayOfWeek: The day of week on which the appointment is scheduled to happen
- Diabetes: Whether the patient has diabetes
- Alcoolism: Whether the patient consumes alcohol
- Hyper Tension: Whether the patient has high blood pressure
- Handicap: Whether the patient is handicapped
- Smokes: Whether the patient is a smoker
- SMS_Reminder: Whether the clinic has sent an SMS reminder to the patient before the appointment
- AwaitingTime: The number of days between the date when the patient contacted the clinic to schedule an
appointment and the day of the appointment.
- Status: Whether the patient showed up for the appointment or not. This is the label that we want the
model to learn to predict.
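To make the column layout concrete, here is a minimal sketch that builds a tiny made-up table with the columns listed above. Only the column names come from the list; the values, the 0/1 encoding of the yes/no columns, the day names, and the exact label strings are assumptions that should be checked against the actual CSV file. The variable names `sample` and `categorical` are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only; the real file may encode values differently.
sample = pd.DataFrame({
    "Age": [34, 61, 8],
    "Gender": ["F", "M", "F"],
    "DayOfWeek": ["Monday", "Friday", "Wednesday"],
    "Diabetes": [0, 1, 0],
    "Alcoolism": [0, 0, 0],
    "Hyper Tension": [0, 1, 0],
    "Handicap": [0, 0, 0],
    "Smokes": [1, 0, 0],
    "SMS_Reminder": [1, 0, 1],
    "AwaitingTime": [5, 30, 2],
    "Status": ["Show-up", "No-Show", "Show-up"],
})

# The two text-valued feature columns are the ones that need binary encoding later.
categorical = ["Gender", "DayOfWeek"]

print(sample.dtypes)
print(sample[categorical].nunique())
```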
• Create a Jupyter Notebook that shows how you do the following in Python:
1. Load the data from the CSV file using Pandas.
2. Preview/print the top 10 rows of the data.
3. Print the label distribution using collections.Counter. This should print the number of
instances that have a “Show-up” label and the number of instances that have a “No-Show” label.
4. Create the Features matrix (every column listed above except the Status column, which is the label).
5. Create the Labels vector (the Status column).
6. Convert the categorical features to multiple binary features. For example, the Gender feature
should be transformed to “Gender_M” and “Gender_F”, each of which takes a value of 0 or 1. The same
applies to the DayOfWeek feature, which should be broken into 7 binary features. This can be done using the
7. Split the data into a training set (80% of the data) and a test set (20% of the data) using the
train_test_split function from the sklearn.model_selection module (called sklearn.cross_validation in
older scikit-learn releases).
8. Train a logistic regression model using the training set.
9. Apply the trained model from step 8 to the test set.
10. Print the confusion matrix of the predictions from step 9 and the actual labels of the test set.
11. Compute and print the accuracy, precision, recall, F1 score, PR AUC, and ROC AUC for the
trained model on the test set (based on the output from step 9).
12. Plot the Precision-Recall Curve and the ROC Curve for the trained model based on the test set
(the output of step 9).
13. Remove the “SMS_Reminder” feature from both the training and test sets and repeat steps 8-12.
When you plot the PR and ROC curves, also plot the curves from step 12 (when all features
were used) on the same corresponding figure. So, the ROC Curve at this step should contain two
curves: the curve from step 12 and the curve from this step (after removing the “SMS_Reminder”
feature).
14. Re-train the model on all features using 10-fold cross validation on the full dataset (the original
set before splitting into train and test).
15. Compute and print the average accuracy, average precision, average recall, and average F1 score
from the 10 folds.
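The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference solution: the data is a small synthetic stand-in with only some of the columns, the label rule is invented, "No-Show" is treated as the positive class, `pandas.get_dummies` is one possible way to create the binary features, and PR AUC is summarized with `average_precision_score`. The names `evaluate`, `m_all`, `m_no_sms`, and `cv` are hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in for the real CSV (assumption); in the notebook this block
# would be replaced by pd.read_csv(...) on the provided file.
rng = np.random.default_rng(0)
n = 2000
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df = pd.DataFrame({
    "Age": rng.integers(0, 90, n),
    "Gender": rng.choice(["M", "F"], n),
    "DayOfWeek": rng.choice(days, n),
    "Diabetes": rng.integers(0, 2, n),
    "SMS_Reminder": rng.integers(0, 2, n),
    "AwaitingTime": rng.integers(0, 60, n),
})
# Invented rule so that the label depends weakly on two of the features.
logit = 0.03 * df["AwaitingTime"] - 0.5 * df["SMS_Reminder"] - 1.0
df["Status"] = np.where(rng.random(n) < 1 / (1 + np.exp(-logit)), "No-Show", "Show-up")

print(df.head(10))             # preview the top 10 rows
print(Counter(df["Status"]))   # label distribution

# Features with the categorical columns expanded into 0/1 columns, and labels.
X = pd.get_dummies(df.drop(columns="Status"), columns=["Gender", "DayOfWeek"], dtype=int)
y = (df["Status"] == "No-Show").astype(int)   # 1 = No-Show (assumed positive class)


def evaluate(X, y, name):
    """Split 80/20, train, predict, and return metrics plus curve points."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]   # estimated P(No-Show)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "pr_auc": average_precision_score(y_test, y_score),
        "roc_auc": roc_auc_score(y_test, y_score),
    }
    fpr, tpr, _ = roc_curve(y_test, y_score)
    prec, rec, _ = precision_recall_curve(y_test, y_score)
    return metrics, (fpr, tpr), (rec, prec)


# All features, then the same split (same random_state) without SMS_Reminder.
m_all, roc_all, pr_all = evaluate(X, y, "all features")
X_no_sms = X.drop(columns="SMS_Reminder")
m_no_sms, roc_no_sms, pr_no_sms = evaluate(X_no_sms, y, "without SMS_Reminder")
print(pd.DataFrame({"all": m_all, "no_sms": m_no_sms}))

# 10-fold cross validation on the full data, averaging each metric over the folds.
cv = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=10,
                    scoring=["accuracy", "precision", "recall", "f1"])
for key in ["test_accuracy", "test_precision", "test_recall", "test_f1"]:
    print(key, cv[key].mean())
```

To draw the comparison figures, the `(fpr, tpr)` pairs from both runs can be passed to matplotlib's `plt.plot` on one figure, and the `(recall, precision)` pairs on another. Because `evaluate` re-splits with the same `random_state`, both runs use the same test rows, which has the same effect as dropping the column from an existing train/test split.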
What to submit
- Submit the Jupyter Notebook that shows all your work exactly as described above. Your notebook should
include section headers and descriptive text that explains what you are doing at each step (follow the
style of the notebooks we develop in class).
- Submit a document in PDF format that shows the results of your experiments. The results should be
shown in tables similar to the following