Determining Correlation between Mental Illness and Lung Cancer using Machine Learning


Before designing any machine learning algorithm, it is important to understand the variability and skewness of the data, as well as the assumptions we can make when building models. Here are some key distributions from the dataset we used for our study:

Percentage of male and female lung cancer patients

Number of patients with each mental illness

After filtering for lobectomy procedure codes and grouping the mental illness diagnosis codes, we observed that the majority of records fall under F1, which covers codes F10-F19.
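As a minimal sketch of this grouping step, the snippet below maps each ICD-10 diagnosis code to its two-character group (for example, F10 through F19 all become F1) and counts the records per group. The function name `icd10_f_group` and the sample codes are hypothetical, invented for illustration; they are not taken from the study's dataset or pipeline.

```python
from collections import Counter


def icd10_f_group(code):
    """Return the two-character group ('F0'..'F9') for an F-chapter code, else None."""
    code = code.strip().upper()
    if len(code) >= 2 and code[0] == "F" and code[1].isdigit():
        return code[:2]
    return None


# Hypothetical diagnosis codes, invented for illustration only.
sample_codes = ["F10.20", "F17.210", "F32.9", "F41.1", "F03.90", "J44.9", "F11.20"]

# Count records per group, skipping codes outside the F chapter.
group_counts = Counter(
    group for group in map(icd10_f_group, sample_codes) if group is not None
)

for group, count in group_counts.most_common():
    print(group, count)
```

Codes outside the F chapter (such as J44.9 in the sample) are simply skipped, so the counts reflect mental illness diagnoses only.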

Correlation between Mental Illness, Lung Cancer, and Death

From the plots above, there is a slight correlation between death and two groups: F0 (mental disorders due to known physiological conditions) and F7 (mental retardation).
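One way to quantify such a relationship is the Pearson correlation between a 0/1 indicator for a diagnosis group and a 0/1 death flag (for two binary variables this is the phi coefficient). The sketch below assumes that approach; the post does not state how the correlation was actually computed, and the records are hypothetical values invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical records: (mental illness group, 1 if the patient died else 0).
records = [
    ("F0", 1), ("F0", 0), ("F1", 0), ("F1", 0),
    ("F3", 0), ("F7", 1), ("F4", 0), ("F1", 0),
]

# Binary indicator for membership in the F0 group, paired with the death flag.
has_f0 = [1 if group == "F0" else 0 for group, _ in records]
died = [flag for _, flag in records]

phi_f0 = pearson(has_f0, died)
print(f"Correlation between F0 and death: {phi_f0:.3f}")
```

A value near zero suggests little linear association, while values closer to 1 or -1 suggest a stronger one. With few deaths in a group, the estimate is noisy and should be read with caution.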

Data distribution of Length of Stay in the hospital for patients who have undergone lobectomy and have a mental illness

We observed that patients with mental illness codes F0, F7, and F8 stayed in the hospital longer than patients in the other groups. Patients with codes F0 and F2 incurred higher hospital charges than those in the other mental illness groups.
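A per-group comparison like this can be sketched by averaging length of stay and charges within each diagnosis group. The record layout, the function name `summarize_by_group`, and all values below are hypothetical and chosen only to illustrate the aggregation; they are not the study's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group, length of stay in days, total charges in dollars).
stays = [
    ("F0", 9, 82000.0),
    ("F0", 11, 90000.0),
    ("F1", 5, 60000.0),
    ("F1", 6, 58000.0),
    ("F2", 6, 88000.0),
    ("F7", 10, 61000.0),
]


def summarize_by_group(rows):
    """Return the mean length of stay and mean charges for each diagnosis group."""
    grouped = defaultdict(list)
    for group, los_days, charges in rows:
        grouped[group].append((los_days, charges))
    return {
        group: {
            "mean_los": mean(los for los, _ in values),
            "mean_charges": mean(charge for _, charge in values),
        }
        for group, values in grouped.items()
    }


summary = summarize_by_group(stays)

# List groups from longest to shortest average stay.
for group in sorted(summary, key=lambda g: summary[g]["mean_los"], reverse=True):
    print(group, summary[group])
```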

F0 – Mental disorders due to known physiological conditions

F1 – Mental and behavioral disorders due to psychoactive substance use

F2 – Schizophrenia, schizotypal, delusional, and other non-mood psychotic disorders

F3 – Mood [affective] disorders

F4 – Anxiety, dissociative, stress-related, somatoform and other nonpsychotic mental disorders

F5 – Behavioral syndromes associated with physiological disturbances and physical factors

F6 – Disorders of adult personality and behavior

F7 – Mental retardation

F8 – Pervasive and specific developmental disorders

F9 – Behavioral and emotional disorders with onset usually occurring in childhood and adolescence

Modeling and Algorithm

To obtain models with lower variability and higher accuracy, the data is generally split into two segments: training and testing. The algorithm is trained on the training set and, once the results are satisfactory, it is run on the testing set. The performance of each type of machine learning model is then measured on the testing data, which the model has not seen during training.
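The split-and-evaluate workflow can be sketched as follows. This is an illustration under stated assumptions, not the models used in the study: the helper `train_test_split`, the 25% test fraction, the mean-value baseline, and the sample records are all hypothetical.

```python
import random
from statistics import mean


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (group, length of stay in days) pairs, invented for illustration.
rows = [
    ("F0", 9), ("F0", 11), ("F1", 5), ("F1", 6),
    ("F1", 4), ("F2", 6), ("F3", 5), ("F7", 10),
]

train, test = train_test_split(rows)

# Baseline "model": always predict the mean length of stay seen in training.
baseline_prediction = mean(los for _, los in train)

# Evaluate on the held-out testing set using mean absolute error.
test_mae = mean(abs(los - baseline_prediction) for _, los in test)

print(f"train size={len(train)}, test size={len(test)}")
print(f"baseline prediction={baseline_prediction:.2f} days, test MAE={test_mae:.2f} days")
```

A fixed seed keeps the split reproducible, so different model types can be compared on exactly the same testing records. Any real model would replace the mean baseline while keeping the same evaluation step.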

Next week's final post in this series will share our conclusive analysis and our experience using data to predict the length of stay of lobectomy patients with severe mental illness. Watch this space or follow us on LinkedIn to stay tuned.