Applying the right Machine Learning model for accurate statistics of Lobectomy Patients

We compared more than ten classification methods, including Logistic Regression, Random Forest, and XGBoost, across different feature combinations, scoring each against our target classification metrics to choose an optimal model.
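A comparison of this kind can be sketched with scikit-learn's `cross_validate`. The snippet below is a minimal illustration, not our actual pipeline: the data is synthetic (generated with `make_classification` with an assumed 85/15 class split), only two of the candidate models are shown, and all hyperparameters are placeholder assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (assumption: ~15% positive class).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)

# Two illustrative candidates; the real comparison covered more than ten.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

metrics = ["recall", "precision", "accuracy"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

summary = {}
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=metrics)
    # Keep mean and spread per metric: a small spread across folds
    # indicates that the model scores consistently in validation.
    summary[name] = {
        m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std())
        for m in metrics
    }

for name, result in summary.items():
    line = ", ".join(f"{m}={mean:.2f}±{std:.2f}" for m, (mean, std) in result.items())
    print(f"{name}: {line}")
```

Reporting the standard deviation next to the mean makes it easy to see which models stay within a narrow range of scores across folds.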

We shortlisted the models whose scores consistently stayed within a narrow range during validation. The best-performing models were then tuned for high recall through cross-validation and grid search, while keeping precision and accuracy in an acceptable range. We chose an XGBoost model trained on a combination of socioeconomic and medical code groups as the final model because of its 75% recall, interpretability, high efficiency, and fast scoring time.
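The tuning step can be sketched roughly as follows. This is an assumption-laden illustration rather than our production code: it uses scikit-learn's `GradientBoostingClassifier` as a stand-in so the example runs without extra dependencies (XGBoost's `XGBClassifier` exposes the same scikit-learn estimator interface and could be swapped in), the data is synthetic, and the parameter grid is made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in data (assumption: ~15% positive class).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; the values are placeholders, not the ones we used.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
    param_grid,
    # Track all three metrics, but select the best model by recall.
    scoring={"recall": "recall", "precision": "precision", "accuracy": "accuracy"},
    refit="recall",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

best = search.best_index_
print("best params:", search.best_params_)
print("cv recall:   ", round(search.cv_results_["mean_test_recall"][best], 3))
print("cv precision:", round(search.cv_results_["mean_test_precision"][best], 3))
print("cv accuracy: ", round(search.cv_results_["mean_test_accuracy"][best], 3))
```

Tracking precision and accuracy alongside recall lets you reject a parameter set that wins on recall only by flagging almost every patient as a readmission.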

XGBoost is an implementation of the gradient boosting framework of machine learning algorithms. It has been a consistent, highly efficient problem solver and can run in major distributed environments.

Recall is the ability of a model to find all relevant cases within a dataset. In our case, true positives (TP) were the correctly classified readmitted patients, and false negatives (FN) were the readmitted patients who were incorrectly classified as not readmitted.

We specifically aimed for higher recall scores, TP / (TP + FN), since accuracy is not a good measure of model performance on an imbalanced dataset, and we had to focus on identifying the readmitted patients in order to target them and analyze their underlying features properly.
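To make the point about imbalanced data concrete, here is a small worked example in plain Python. The numbers are invented for illustration only (100 patients, 10 of them readmitted) and are not taken from our dataset.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN, where 1 = readmitted and 0 = not readmitted."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(tp, fn):
    # Share of actual readmissions that the model found.
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    # Share of predicted readmissions that were real.
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


# Invented data: 10 readmitted patients followed by 90 non-readmitted ones.
y_true = [1] * 10 + [0] * 90

# A trivial "model" that never predicts a readmission.
y_trivial = [0] * 100

# A model that finds 8 of the 10 readmissions but raises 20 false alarms.
y_model = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 70

for label, y_pred in [("trivial", y_trivial), ("model", y_model)]:
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    print(
        f"{label}: accuracy={accuracy(tp, fp, fn, tn):.2f}, "
        f"recall={recall(tp, fn):.2f}, precision={precision(tp, fp):.2f}"
    )
```

The trivial model reaches 90% accuracy while finding none of the readmitted patients (recall 0), whereas the second model has lower accuracy (78%) but a recall of 80%. This is why accuracy alone is misleading here.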

Feature importance of the final XGBoost model and recall/accuracy curve

The final model showed that socioeconomic features, such as a Medicare pay category, patient age, gender, wage index, and the population category of patients, together with diagnosis code groups and many other features, contribute to the classification for readmission.
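A feature importance ranking like the one described above can be read from a fitted tree ensemble. The sketch below is illustrative only: the column names are hypothetical labels loosely based on the features mentioned in this post, the data is synthetic, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, so the resulting ranking says nothing about real patients.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical column names; the real feature set was much larger.
feature_names = [
    "pay_category_medicare",
    "age",
    "gender",
    "wage_index",
    "population_category",
    "diagnosis_code_group",
]

# Synthetic data with one column per hypothetical feature.
X, y = make_classification(
    n_samples=400, n_features=len(feature_names),
    n_informative=3, n_redundant=1, random_state=0,
)

# Stand-in for the XGBoost model.
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:25s} {score:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a feature's relative share of the model's splits-based gain.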

Follow us on LinkedIn so you do not miss our final blog post on Machine Learning for Lung Cancer.