Assignment 3
Logistic Regression
Predicting No-Shows at Clinics
The goal of this assignment is to help you understand the concepts of classification through hands-on
experience with training and applying Logistic Regression models.
You are given a dataset describing patients who scheduled medical appointments at clinics and whether they
showed up for those appointments. Your task is to build a classification model that predicts the likelihood
that a given patient will not show up for a scheduled medical appointment.
Dataset Description:
The dataset contains information on 300,000 patients of different ages and medical conditions. The data also
records whether each patient showed up for the appointment. The columns in the dataset are as follows:

  1. Age: Patient Age
  2. Gender: Male (M) or Female (F)
  3. DayOfWeek: The day of week on which the appointment is scheduled to happen
  4. Diabetes: Whether the patient has diabetes
  5. Alcoolism: Whether the patient consumes alcohol
  6. Hyper Tension: Whether the patient has high blood pressure
  7. Handicap: Whether the patient is handicapped
  8. Smokes: Whether the patient is a smoker
  9. Scholarship
  10. Tuberculosis
  11. SMS_Reminder: Whether the clinic has sent an SMS reminder to the patient before the appointment
  12. AwaitingTime: The number of days between the date when the patient contacted the clinic to schedule an
    appointment and the day of the appointment.
  13. Status: Whether the patient showed up for the appointment or not. This is the label that we want the
    model to learn to predict.
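
To make the column layout concrete, here is a minimal sketch of a toy frame that uses a handful of the column names listed above. The three rows, the 0/1 encoding of the yes/no columns, and the exact label strings are invented for illustration and are not taken from the real CSV file.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Three invented rows; only the column names come from the list above.
toy = pd.DataFrame({
    "Age": [34, 61, 7],
    "Gender": ["F", "M", "F"],
    "DayOfWeek": ["Monday", "Friday", "Wednesday"],
    "Diabetes": [0, 1, 0],          # assumption: flags are stored as 0/1
    "SMS_Reminder": [1, 0, 1],
    "AwaitingTime": [12, 3, 45],
    "Status": ["Show-up", "No-Show", "Show-up"],  # the label column
})

# Separate the label, then sort the remaining columns by kind.
feature_cols = [c for c in toy.columns if c != "Status"]
categorical = [c for c in feature_cols if not is_numeric_dtype(toy[c])]
numeric = [c for c in feature_cols if is_numeric_dtype(toy[c])]

print("categorical:", categorical)
print("numeric:", numeric)
```

In this toy frame only Gender and DayOfWeek hold text values, so they are the columns that would need to be converted to binary features before a model can be trained.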
• Create a Jupyter Notebook that shows how you do the following in Python:
  1. Load the data from the CSV file using Pandas.
  2. Preview/print the top 10 rows of the data.
  3. Print the label distribution using the collections.Counter class. This should print the number of
    instances that have a “Show-up” label and the number of instances that have a “No-Show” label.
  4. Create the Features matrix (columns 1-12 above, i.e. exclude the Status column, which is the label).
  5. Create the Labels vector (the Status column).
  6. Convert the categorical features to multiple binary features. For example, the Gender feature
    should be transformed to “Gender_M” and “Gender_F”, each taking a value of 0 or 1. The same applies to
    the DayOfWeek feature, which should be broken into 7 binary features. This can be done using the
    pandas.get_dummies() function.
  7. Split the data into a training set (80% of the data) and a test set (20% of the data) using the
    train_test_split function of the sklearn package (found in the cross_validation module in old
    versions and in sklearn.model_selection since version 0.18).
  8. Train a logistic regression model using the training set.
  9. Apply the trained model from step 8 to the test set.
  10. Print the confusion matrix of the predictions from step 9 and the actual labels of the test set.
  11. Compute and print the accuracy, precision, recall, F1_score, PR AUC, and ROC_AUC for the
    trained model on the test set (based on the output from step 9).
  12. Plot the Precision-Recall Curve and the ROC Curve for the trained model based on the test set
    (the output of step 9).
  13. Remove the “SMS_Reminder” feature from both the training and test sets and repeat steps 8-12.
    When you plot the PR and ROC curves, plot the same curves from step 12 (when all features
    were used) on the same corresponding figure. So, the ROC Curve at this step should contain two
    curves: the curve from step 12 and the curve from this step (after removing the “SMS_Reminder”
    feature).
  14. Re-train the model on all features using 10-fold cross validation on the full data set (the original
    set before splitting into train and test).
  15. Compute and print the average accuracy, average precision, average recall, and average F1_score
    from the 10 folds.
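
The steps above can be sketched roughly as follows. This is a hedged outline, not a reference solution. It runs on a small synthetic frame because the CSV file name is not given here, so the data, the rule that generates `Status`, the helper name `evaluate`, and the choice of `average_precision_score` as the PR AUC are all assumptions. With the real data, the synthetic frame would be replaced by a `pd.read_csv(...)` call.

```python
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score,
                             precision_recall_curve, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in for the CSV file (assumption: invented data).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "Age": rng.integers(0, 90, n),
    "Gender": rng.choice(["M", "F"], n),
    "DayOfWeek": rng.choice(["Monday", "Tuesday", "Wednesday"], n),
    "SMS_Reminder": rng.integers(0, 2, n),
    "AwaitingTime": rng.integers(0, 60, n),
})
# Invented rule: a longer wait makes a no-show more likely.
p_no_show = 1 / (1 + np.exp(-(df["AwaitingTime"] - 30) / 10))
df["Status"] = np.where(rng.random(n) < p_no_show, "No-Show", "Show-up")

print(df.head(10))              # step 2
print(Counter(df["Status"]))    # step 3

# Steps 4-6: features, labels (1 = no-show), one-hot encoding as 0/1 columns.
X = pd.get_dummies(df.drop(columns="Status"),
                   columns=["Gender", "DayOfWeek"], dtype=int)
y = (df["Status"] == "No-Show").astype(int)

# Step 7: 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)


def evaluate(X_tr, X_te):
    """Steps 8-12: fit, predict, and collect metrics and curve points."""
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]  # probability of a no-show
    return {
        "confusion": confusion_matrix(y_test, pred),
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
        "pr_auc": average_precision_score(y_test, prob),
        "roc_auc": roc_auc_score(y_test, prob),
        "pr_curve": precision_recall_curve(y_test, prob),
        "roc_curve": roc_curve(y_test, prob),
    }


all_features = evaluate(X_train, X_test)
# Step 13: same procedure without the SMS_Reminder column.
no_sms = evaluate(X_train.drop(columns="SMS_Reminder"),
                  X_test.drop(columns="SMS_Reminder"))
print(all_features["confusion"])

# Steps 14-15: 10-fold cross validation on the full data set.
cv = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=10,
                    scoring=["accuracy", "precision", "recall", "f1"])
cv_means = {k: v.mean() for k, v in cv.items() if k.startswith("test_")}
print(cv_means)
```

The `roc_curve` entry holds `(fpr, tpr, thresholds)` and the `pr_curve` entry holds `(precision, recall, thresholds)`, so the two runs can be drawn on one figure by plotting both pairs of arrays on the same axes, for example with matplotlib. The trapezoidal `sklearn.metrics.auc` over the precision-recall points is another common way to report PR AUC.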
What to submit
  1. Submit the Jupyter Notebook that shows all your work exactly as described above. Your notebook should
    include section headers and descriptive text that explains what you are doing at each step (follow the
    style of the notebooks we develop in class).
  2. Submit a document in PDF format that shows the results of the experiments you ran. The results should be
    shown in tables similar to the following