Determining Correlation between Mental Illness and Lung Cancer using Machine Learning
DATA DISTRIBUTION
Before designing any machine learning model, it is important to understand the variability and skewness of the data, as well as the assumptions we can reasonably make when building the model. Here are some key distributions from the dataset we used for our study:
Percentage of male and female lung cancer patients
Number of patients by mental illness group
After filtering for lobectomy procedure codes, we grouped the mental illness diagnosis codes and observed that the majority of records fall under F1, which covers codes F10-F19.
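As a rough illustration of this grouping step, the sketch below maps each ICD-10 mental illness code to its two-character block (so F10 through F19 all map to F1) and counts records per block. The sample codes and the helper name `icd_block` are hypothetical and are not taken from the study dataset.

```python
from collections import Counter


def icd_block(code: str) -> str:
    """Return the two-character ICD-10 block, e.g. 'F10.20' -> 'F1'."""
    return code.strip().upper()[:2]


# Hypothetical diagnosis codes, for illustration only.
sample_codes = ["F10.20", "F17.210", "F32.9", "F41.1", "F17.210", "F03.90"]

# Count how many records fall into each block.
block_counts = Counter(icd_block(code) for code in sample_codes)

for block, count in block_counts.most_common():
    print(block, count)
```

In this made-up sample, F1 is the largest block simply because three of the six codes start with F1.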
Correlation between Mental Illness and Lung Cancer and Death
From the above plots, there is a slight correlation between death and both F0 (mental disorders due to known physiological conditions) and F7 (mental retardation).
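One common way to put a number on this kind of association (not necessarily the method behind the plots) is the Pearson correlation between a 0/1 indicator for a diagnosis block and a 0/1 indicator for death. The sketch below assumes hypothetical records and a hand-written `pearson` helper; none of the values come from the study data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical records: (ICD-10 block, 1 if the patient died else 0).
records = [
    ("F0", 1), ("F0", 0), ("F1", 0), ("F1", 0),
    ("F3", 0), ("F7", 1), ("F1", 0), ("F0", 1),
]

died = [flag for _, flag in records]
has_f0 = [1 if block == "F0" else 0 for block, _ in records]

r = pearson(has_f0, died)
print(f"Correlation between F0 and death: {r:.3f}")
```

A value near 0 suggests little linear association and a value near 1 or -1 a strong one. With only a handful of rows, as here, the number is purely illustrative.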
Data distribution of Length of Stay in the hospital for patients who have undergone lobectomy and have a mental illness
We observed that patients with mental illness codes F0, F7, and F8 stayed longer in the hospital, and that patients with F0 and F2 incurred higher hospital charges than the other mental illness groups. The groups are defined as follows:
F0 – Mental disorders due to known physiological conditions
F1 – Mental and behavioral disorders due to psychoactive substance use
F2 – Schizophrenia, schizotypal, delusional, and other non-mood psychotic disorders
F3 – Mood [affective] disorders
F4 – Anxiety, dissociative, stress-related, somatoform and other nonpsychotic mental disorders
F5 – Behavioral syndromes associated with physiological disturbances and physical factors
F6 – Disorders of adult personality and behavior
F7 – Mental retardation
F8 – Pervasive and specific developmental disorders
F9 – Behavioral and emotional disorders with onset usually occurring in childhood and adolescence
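The length-of-stay comparison across these groups can be sketched as a per-group median. The snippet below is a minimal example on hypothetical (block, days) pairs; the values are invented for illustration and do not reproduce the study results.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (ICD-10 block, length of stay in days) pairs.
stays = [
    ("F0", 9), ("F0", 12), ("F1", 4), ("F1", 5),
    ("F1", 6), ("F7", 11), ("F8", 10), ("F3", 5),
]

# Collect the lengths of stay for each block.
los_by_block = defaultdict(list)
for block, days in stays:
    los_by_block[block].append(days)

# Median is less sensitive to a few very long stays than the mean.
median_los = {block: median(days) for block, days in los_by_block.items()}

# Print blocks from longest to shortest median stay.
for block, value in sorted(median_los.items(), key=lambda kv: kv[1], reverse=True):
    print(block, value)
```

The same pattern works for hospital charges by swapping the second element of each pair.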
Modeling and Algorithm
To reduce variability and achieve a higher level of accuracy, the data is generally split into two segments: training and testing. The algorithm is trained on a subset of the data called the training set and, once the results are satisfactory, it is then run on the testing set. The performance of each model is measured on the testing set and compared across the various types of machine learning models.
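The split described above can be sketched in a few lines. This is a minimal standard-library version under stated assumptions: the 80/20 ratio, the fixed seed, and the helper name `split_train_test` are all illustrative choices, not the settings used in the study.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical patient row identifiers.
patients = list(range(100))

train_rows, test_rows = split_train_test(patients)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting avoids a test set that only contains, say, the most recent admissions. In practice a library routine such as scikit-learn's `train_test_split` does the same job and can also stratify by outcome.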
Next week's final post in this series will share our conclusive analysis and our experience in using data to predict the length of stay of lobectomy patients with severe mental illness. Watch this space or follow us on LinkedIn to stay tuned.