Exploratory Analysis of Mental Illness Data amongst Lung Cancer Patients

During our study, we focused on lung cancer patients who had undergone lobectomy (a lung cancer surgery) and analyzed whether any specific mental illness or psychiatric diagnoses, or groups of diagnoses, increase the risk of perioperative death.

Data Analysis

As per the Project Lifecycle above, the data came from HCUP (the Healthcare Cost and Utilization Project in the United States), specifically the NIS (National Inpatient Sample) database, which derives its data from billing records submitted by hospitals across the U.S. The NIS represents a 20% sample of all hospitalizations in the U.S. This study used the NIS 2016 and NIS 2017 datasets, which contain patient diagnosis and procedure codes in ICD-10-CM/PCS format. Below is a snapshot of the sample data:


The initial dataset, covering all lobectomy cases for 2016 and 2017 in the HCUP NIS database, contained 13,892 cases. Filtering these cases for severe mental illnesses (SMIs) resulted in a dataset of 5,581 cases. The following groups of diagnosis codes were considered in our study.

Mental Illness Diagnoses Code Groups

At Allwyn, we use big data techniques to extract, transform, and load the data, arriving at a meaningful dataset that is quality controlled and engineered to handle missing values, outliers, and the grouping of mental illness codes. We also balance the datasets so that the results are not skewed. Together with our subject matter expert (SME), we explored hundreds of data elements to understand the significance of diagnosis codes, procedure codes, and discharge information. We focused on the categorical variables and performed additional research based on our SME's input and on feedback from HCUP.
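The filtering and grouping of mental illness codes described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline we ran: the field names (`I10_DX1` and so on) are assumed to follow the NIS diagnosis-column convention, the records are toy values, and the `CODE_GROUPS` mapping is a hypothetical example built from a few well-known ICD-10-CM categories rather than the groups used in the study.

```python
# Hypothetical mapping from 3-character ICD-10-CM categories to group labels.
# Illustrative only; these are NOT the code groups used in the study.
CODE_GROUPS = {
    "F20": "schizophrenia",
    "F31": "bipolar disorder",
    "F32": "depressive disorders",
    "F33": "depressive disorders",
    "F41": "anxiety disorders",
}


def mental_illness_groups(record, dx_fields):
    """Return the sorted group labels matched by any diagnosis code in a record."""
    groups = set()
    for field in dx_fields:
        # Normalize the code: tolerate missing values, whitespace, and decimal points.
        code = (record.get(field) or "").strip().upper().replace(".", "")
        label = CODE_GROUPS.get(code[:3])
        if label:
            groups.add(label)
    return sorted(groups)


def filter_cases(records, dx_fields):
    """Keep records with at least one matching diagnosis and attach their groups."""
    kept = []
    for record in records:
        groups = mental_illness_groups(record, dx_fields)
        if groups:
            kept.append({**record, "mi_groups": groups})
    return kept


# Toy records (assumed field names, made-up values) to show the behavior.
DX_FIELDS = ["I10_DX1", "I10_DX2", "I10_DX3"]
toy_records = [
    {"I10_DX1": "C3411", "I10_DX2": "F329", "I10_DX3": "F411"},
    {"I10_DX1": "C3412", "I10_DX2": "", "I10_DX3": "I10"},
    {"I10_DX1": "C3431", "I10_DX2": "F200", "I10_DX3": ""},
]
flagged = filter_cases(toy_records, DX_FIELDS)
print(len(flagged), "of", len(toy_records), "toy records matched")
```

Matching on the 3-character category keeps the mapping short, but a real grouping would likely need finer-grained codes and clinical review by a subject matter expert.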

Our core data file included patient details such as age, race, and urban/rural location; hospital data such as date of admission and hospital location; and medical information such as procedure codes and diagnosis codes. Some of the important features of the dataset used for this study are included below:

These features (attributes) of the dataset allowed us to identify and explore different aspects of the data.
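As one example of exploring a categorical feature against the outcome of interest, the sketch below tabulates in-hospital deaths per category. It is an assumption-laden illustration: the field names `location` and `died` are hypothetical stand-ins for the dataset's actual columns, and the records are toy values, not study data.

```python
from collections import defaultdict


def death_rate_by_category(records, category_field, outcome_field="died"):
    """Return {category: (deaths, cases, rate)} for records with a known outcome."""
    counts = defaultdict(lambda: [0, 0])  # category -> [deaths, cases]
    for record in records:
        outcome = record.get(outcome_field)
        if outcome not in (0, 1):
            continue  # skip missing or invalid outcomes rather than guessing
        category = record.get(category_field, "missing")
        counts[category][0] += outcome
        counts[category][1] += 1
    return {cat: (d, n, d / n) for cat, (d, n) in counts.items()}


# Toy records with hypothetical field names and made-up values.
toy = [
    {"location": "urban", "died": 0},
    {"location": "urban", "died": 1},
    {"location": "rural", "died": 0},
    {"location": "rural", "died": None},  # unknown outcome, excluded
]
rates = death_rate_by_category(toy, "location")
for category, (deaths, cases, rate) in sorted(rates.items()):
    print(f"{category}: {deaths}/{cases} deaths ({rate:.0%})")
```

Records with an unknown outcome are dropped from both the numerator and the denominator here; how such records are handled is a design choice that should be decided and documented per analysis.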

We will be sharing our experience of how we leveraged machine learning algorithms to determine the correlation between lung cancer and severe mental illness data. Watch this space or follow us on LinkedIn.