Machine Learning to Improve Outcomes by Analyzing Lung Cancer Data

Finding a suitable dataset for predicting readmission with machine learning was the first challenge we had to overcome, since the datasets presently available in the healthcare world tend to be either dirty and unstructured or clean but lacking information.

Most patient-level data are not publicly available for research due to privacy reasons.

With these limitations in mind, and after researching multiple data sources, including SEER-MEDICARE, HCUP, and public repositories, we chose the Nationwide Readmissions Database (NRD) from the Healthcare Cost and Utilization Project (HCUP). The Agency for Healthcare Research and Quality (AHRQ) creates the HCUP databases through a Federal-State-Industry partnership, and the NRD is a unique database designed to support various types of analyses of national readmission rates for all patients, regardless of the expected payer for the hospital stay.

Our research involved using machine learning and statistical methods to analyze the NRD. Data understanding, preparation, and engineering were the most time-consuming and complex phases of this data science project, taking nearly seventy percent of the overall time.

Using big data processing and extraction technologies like Spark and Python, we filtered 40 million patient records down to those of patients who had undergone a lobectomy procedure at least once. The filtered data was then put through data quality checks and cleaned, with missing values imputed. More than 100 input variables were explored: we analyzed their correlations with the outcome, used them to understand our target group's demographics, and identified those that were redundant.
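The filtering step can be illustrated with a minimal plain-Python sketch. The record layout, the field names (patient_id, procedure_codes), and the procedure codes below are hypothetical placeholders, not the actual NRD schema or the codes used in the study. At the scale described, the same logic would run in Spark.

```python
# Placeholder codes standing in for the real lobectomy procedure codes (assumption).
LOBECTOMY_CODES = {"LOBE_A", "LOBE_B"}


def has_lobectomy(record, codes=LOBECTOMY_CODES):
    """Return True if any procedure code on the record is a lobectomy code."""
    return any(code in codes for code in record.get("procedure_codes", []))


def filter_lobectomy_patients(records):
    """Keep every record of patients who had at least one lobectomy."""
    patient_ids = {r["patient_id"] for r in records if has_lobectomy(r)}
    return [r for r in records if r["patient_id"] in patient_ids]


# Hypothetical records: patient 1 has a lobectomy stay and a later stay.
records = [
    {"patient_id": 1, "procedure_codes": ["LOBE_A", "X1"]},
    {"patient_id": 1, "procedure_codes": ["X2"]},
    {"patient_id": 2, "procedure_codes": ["X3"]},
    {"patient_id": 3, "procedure_codes": []},
]

kept = filter_lobectomy_patients(records)
print([r["patient_id"] for r in kept])  # [1, 1]
```

Keeping all stays of a qualifying patient, rather than only the lobectomy stay itself, preserves the later admissions needed to determine whether a readmission occurred.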

Many of these features were categorical and required additional research and feature engineering.

The NRD dataset consists of three main files: Core, Hospital, and Severity.

The Core file included patient-level medical and non-medical factors. The non-medical, socioeconomic factors include age, gender, payment category, the urban/rural location of the patient, and many more. The medical factors include detailed information about every diagnosis code, procedure code, their respective diagnosis-related groups (DRG), the time of those procedures, the yearly quarter of the admission, etc.

Allwyn's data engineering practices included analyzing and researching every single feature, creating data dictionaries, and applying feature transformations to see which features contribute to our prediction algorithms. The average age of lobectomy patients was 65, and the data showed that although women had more lobectomies than men, more men were readmitted than women.

The Severity file provided the summarized severity level of the diagnosis codes. The Hospital file provided hospital-level information such as bed size, control/ownership of the hospital, urban/rural designation, and teaching status of urban hospitals.

We consulted subject matter experts in the lung cancer field and, on their advice, added features such as the Elixhauser and Charlson comorbidity indices to enrich our dataset. By delving deep into the clinical features, we also ensured that the chosen variables contain only pre-procedure information and verified that there is no information leakage from post-operative variables or variables that would only be known in the future.

The features were then analyzed to check whether they were statistically significant for our selection of predictive models by looking at correlation matrices and feature importance charts.
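As a rough illustration of the correlation part of this check, the sketch below computes the Pearson correlation between each feature and a binary readmission outcome and ranks the features by absolute correlation. The feature names and values are made up for the example, and this is only one simple screening approach, not necessarily the exact analysis we ran.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: 1 = readmitted, 0 = not readmitted.
outcome = [0, 0, 0, 1, 0, 1, 1, 0]
features = {
    "age": [50, 55, 60, 72, 58, 70, 75, 62],
    "length_of_stay": [3, 4, 3, 6, 4, 8, 7, 5],
    "hospital_flag": [1, 1, 1, 1, 1, 1, 1, 1],  # constant, carries no signal
}

# Rank features from most to least correlated with the outcome.
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], outcome)),
    reverse=True,
)
for name in ranked:
    print(name, round(pearson(features[name], outcome), 3))
```

A constant column such as hospital_flag ends up at the bottom of the ranking, which is one quick way to spot redundant inputs.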

Analyzing the initial data distribution for many of the features required us to remove outliers, transform skewed distributions, and scale the majority of the features for algorithms that were particularly sensitive to non-normalized variables. Diagnosis codes were grouped into 22 categories to reduce dimensionality and improve interpretation.
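The three preparation steps named above can be sketched with the Python standard library. The specific choices here (the 1.5 x IQR rule for outliers, a log1p transform for right-skewed values, and z-score standardization) are common defaults assumed for illustration; the source does not state which methods were actually used, and the sample values are invented.

```python
import math
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def log_transform(values):
    """Compress a right-skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical length-of-stay values with one extreme entry.
length_of_stay = [1, 2, 3, 4, 5, 6, 7, 8, 100]

cleaned = remove_outliers_iqr(length_of_stay)
scaled = standardize(log_transform(cleaned))
print(cleaned)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

In practice the outlier bounds and scaling statistics would be computed on the training split only and then reused on validation and test data.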

The resulting dataset was highly imbalanced between the readmitted and not readmitted classes, at 8% and 92%, respectively. Most classification models are extremely sensitive to imbalanced datasets, so multiple data balancing techniques, such as oversampling the minority class, under-sampling the majority class, and the Synthetic Minority Oversampling Technique (SMOTE), were used to train our algorithms and compare the outcomes.
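The two simplest balancing techniques mentioned above, random oversampling and random under-sampling, can be sketched as follows. The sample layout of (features, label) pairs is an assumption for the example. SMOTE, which synthesizes new minority points by interpolating between neighbors, is not shown here.

```python
import random


def split_by_class(samples, minority_label=1):
    """Split (features, label) pairs into minority and majority lists."""
    minority = [s for s in samples if s[1] == minority_label]
    majority = [s for s in samples if s[1] != minority_label]
    return minority, majority


def random_oversample(samples, minority_label=1, seed=0):
    """Duplicate random minority samples until both classes are the same size."""
    rng = random.Random(seed)
    minority, majority = split_by_class(samples, minority_label)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


def random_undersample(samples, minority_label=1, seed=0):
    """Keep a random subset of the majority class equal in size to the minority."""
    rng = random.Random(seed)
    minority, majority = split_by_class(samples, minority_label)
    kept = rng.sample(majority, min(len(minority), len(majority)))
    balanced = kept + minority
    rng.shuffle(balanced)
    return balanced


# Toy data mirroring the 8% / 92% split: label 1 = readmitted.
samples = [([i], 1) for i in range(8)] + [([i], 0) for i in range(92)]

over = random_oversample(samples)
under = random_undersample(samples)
print(len(over), sum(label for _, label in over))    # 184 92
print(len(under), sum(label for _, label in under))  # 16 8
```

Balancing should be applied to the training data only; validation and test data keep the original class distribution so that the scores reflect real-world conditions.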

Initial machine learning models had both low precision and low recall scores. Although this could be due to many different reasons, the Allwyn team focused mainly on additional feature engineering to reduce the high dimensionality of the initial input variables while also comparing different data balancing methods. This was a time-consuming iterative process and required training more than a thousand different models on different combinations or groupings of diagnosis codes (shown in Table 2) along with other non-medical factors.
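For reference, precision and recall for the readmitted class can be computed as in the sketch below. The labels are toy values, not results from our models. The second example shows why accuracy alone is misleading on an 8% / 92% split: a model that never predicts a readmission is 92% accurate but has zero recall.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example: 3 readmitted patients, the model finds 1 and raises 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, round(recall, 3))  # 0.5 0.333

# A model that always predicts "not readmitted" on an 8% / 92% split.
imbalanced_true = [1] * 8 + [0] * 92
always_negative = [0] * 100
accuracy = sum(t == p for t, p in zip(imbalanced_true, always_negative)) / 100
print(accuracy, precision_recall(imbalanced_true, always_negative))  # 0.92 (0.0, 0.0)
```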

K-fold cross-validation was also used during training and validation to ensure that the training results are representative of performance on the test data. We also weighted the readmitted and not readmitted classes, training models with different weights and comparing their validation scores, to better classify the readmitted patients.
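A minimal sketch of both ideas follows. It uses a stratified variant of k-fold, which keeps the class ratio the same in every fold; that variant is an assumption made here because of the class imbalance, since the text above only says "K-fold". The class-weight function uses the common "balanced" heuristic, n_samples / (n_classes * class_count), as a starting point; the weights we actually used were chosen by comparing validation scores.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose folds preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal each class's shuffled indices round-robin across the k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(
            index for j, fold in enumerate(folds) if j != i for index in fold
        )
        yield train_idx, val_idx


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels mirroring the 8% / 92% split: 1 = readmitted.
labels = [1] * 8 + [0] * 92

splits = list(stratified_k_fold(labels, k=4))
for train_idx, val_idx in splits:
    positives = sum(labels[i] for i in val_idx)
    print(len(train_idx), len(val_idx), positives)  # 75 25 2

weights = balanced_class_weights(labels)
print(weights[1])  # 6.25
```

Each validation fold contains the same share of readmitted patients as the full dataset, so the validation scores are comparable across folds.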

We also collaborated with George Mason University through their DAEN Capstone program. The team, led by Dr. James Baldo and including several participants from the graduate program, analyzed the underlying data and developed predictive models using various technologies, including AWS SageMaker Autopilot. The resulting models and their respective hyperparameters were further analyzed and tuned to achieve high recall.

After choosing the best model, we designed and implemented this workflow in Alteryx Designer to automate our process and put it into a feedback and re-evaluation phase, following the Cross-Industry Standard Process for Data Mining (CRISP-DM), so that our model can evolve and be deployed in production.

To know more about how we decided on the best model and associated classification methods, follow us on LinkedIn.



Predicting hospital readmissions and underlying risk factors of Lung Cancer with Machine Learning

Readmission after pulmonary lobectomy is a frequent challenge for hospitals, healthcare plans, and insurance providers. A readmission occurs when a patient is admitted to a hospital for any reason within 30 days of being discharged from a hospital stay. Recurring problems and readmissions have been a major issue in the healthcare system. Readmissions are often costly; however, what can be learned from studying them can be incredibly beneficial for both the public and the healthcare industry. With this in mind, and to improve Americans' healthcare, the Centers for Medicare & Medicaid Services (CMS) set the Hospital Readmissions Reduction Program (HRRP) in motion. This program penalizes hospitals with excessive readmissions.
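The 30-day definition above can be turned into a label with a short sketch like the one below. The record layout and field names (patient_id, admit, discharge) are hypothetical, and real datasets may encode admission timing differently, for example as day offsets instead of calendar dates.

```python
from datetime import date


def flag_readmissions(stays, window_days=30):
    """Mark each stay that is followed by another admission within the window."""
    by_patient = {}
    for stay in stays:
        by_patient.setdefault(stay["patient_id"], []).append(stay)

    labeled = []
    for patient_id in sorted(by_patient):
        ordered = sorted(by_patient[patient_id], key=lambda s: s["admit"])
        for i, stay in enumerate(ordered):
            readmitted = False
            if i + 1 < len(ordered):
                # Days between this discharge and the patient's next admission.
                gap = (ordered[i + 1]["admit"] - stay["discharge"]).days
                readmitted = 0 <= gap <= window_days
            labeled.append({**stay, "readmitted": readmitted})
    return labeled


# Hypothetical stays: patient 1 returns after 15 days, patient 2 after 58 days.
stays = [
    {"patient_id": 1, "admit": date(2019, 1, 1), "discharge": date(2019, 1, 5)},
    {"patient_id": 1, "admit": date(2019, 1, 20), "discharge": date(2019, 1, 22)},
    {"patient_id": 2, "admit": date(2019, 3, 1), "discharge": date(2019, 3, 4)},
    {"patient_id": 2, "admit": date(2019, 5, 1), "discharge": date(2019, 5, 3)},
]

labeled = flag_readmissions(stays)
print([s["readmitted"] for s in labeled])  # [True, False, False, False]
```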

Allwyn is developing a machine learning-based approach to reduce readmissions by recommending data-driven preventive actions prior to a lobectomy procedure. This approach can be used by various organizations, such as hospitals or healthcare companies, to take proactive measures and circumvent readmissions by predicting:

  • The probability of a patient’s readmission
  • Underlying risk factors

We will be sharing the challenges with Data Exploration and Engineering, followed by our Strategy and its impact. Follow us on LinkedIn as we share our approach in the coming weeks.


5 ways to fight cyber attacks using AI

The last few years have seen an increase in cyber attacks, whether it is hacking into personal data, bringing down electric grids, or tampering with federal data. According to the State of Cyber 2019 report, the breach rate is increasing exponentially and has reached 232 records/sec. This trend is only going to continue upward as the number of connected devices increases, widening the exposure to cyber attacks.

Source: Wipro State of Cyber Report 2019

It is humanly impossible to handle the terabytes of data that are vulnerable to such attacks. Automation is the only answer to this challenge of defending our data. Unlike traditional software, artificial intelligence tools like machine learning can plough through vast quantities of data to find vulnerabilities, hacking patterns, and response mechanisms. Machine learning is a discipline of AI in which an algorithm learns from vast quantities of data and makes predictions without being explicitly programmed for an output.

Here we take a look at five ways to use AI and machine learning to fight cyber attacks.

1. Intrusion detection:

Typical intrusion detection and defense software uses monitors based on previously classified intruders and malicious attributes. Using deep learning, a machine learning technique, intrusion detection can identify previously unrecognized patterns. Deep learning can learn from highly unstructured data coming from heterogeneous environments. Deep learning models are better suited to this task than other forms of machine learning due to their ability to learn incrementally and extrapolate new features from a limited data set.

2. Multi-entity response:

With the advent of machine learning, a new form of intelligent threat response is being used to respond to threats rapidly and accurately. Based on the results obtained by threat detection, threat responses can be driven by machine learning algorithms. These responses are typically undertaken based on recommendations by the users. Depending on the type of threat, AI programs can block the source automatically or outmaneuver the attacker by sending false signals to gather additional information. As threat volume increases, it is increasingly useful to deploy automated responses to cyber attacks in order to reduce security incident response times.

3. Tracing the dark web:

The dark web is content on the Internet that requires specific software, configurations, or authorization to access. It is usually a nesting ground for illegal activities and can be a source of emerging cyber threats. Machine learning can be used in two ways to monitor activities on the dark web: (1) to identify potential threats and keep you abreast of upcoming trends in attacks or patterns of detection, and (2) to identify any information pertaining to your organization, your employees, or your products. It can also be used to identify whether company assets like software source code are being openly developed or traded. The exploits identified on the dark web will help accelerate your responses to any attacks. As most hackers constantly change their IPs and domain infrastructure, it is almost impossible to track their activities using traditional mechanisms. Machine learning helps gather insights into these chaotic patterns. Another feature of the dark web is the use of local languages; machine learning and natural language processing can be used to transcend these linguistic and geospatial barriers.


Source: Kali Tutorials DarkWeb Statistics

4. Endpoint and network monitoring:

Cyber security teams are often challenged with reduced budgets and an increase in security activities such as detection and response. Automating the monitoring of networks and device endpoints is crucial to ensure compliance with your security governance rules. Machine learning/AI provides you with the tools to automate the monitoring process. Machine learning can also help you break down data silos and authenticate all users accessing the various sources of data, whether they are transactional or reporting systems. With the help of machine learning, you can monitor new variants of malware by understanding and learning from various aspects and attributes of malware or viruses. You can also use machine learning to simplify your multitude of endpoint and network monitoring tools and consolidate them into a single dashboard.

5. Third party detection:

While in-house systems, applications, and devices are already vast in a huge organization, it is almost impossible to keep track of third-party systems, such as those of vendors and suppliers, that often integrate with your systems. Your ecosystem multiplies your risk and exposes your systems if its members do not take security as seriously as you do. Recent research shows that organizations are way behind on instituting governance and technology around third-party risks, across the software supply chain, access governance, and data handling.

Machine learning can be used to detect, monitor, and alert on data coming in and out of third-party systems by learning the patterns of data or breaches that occur. To effectively manage the security of third-party data, you need additional monitoring, controls, and governance in place. Machine learning can help you automate the monitoring process across a wide variety of unstructured data. It can also be used to enforce system controls and security policies.
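To make the idea of learning a normal pattern and alerting on deviations concrete, here is a deliberately simple statistical baseline rather than a trained machine learning model. It flags transfer volumes that deviate strongly from history using a robust z-score based on the median and the median absolute deviation (MAD). The traffic numbers and the 3.5 threshold are illustrative assumptions.

```python
import statistics


def flag_anomalies(history, new_values, threshold=3.5):
    """Flag values whose robust z-score against the history exceeds the threshold."""
    median = statistics.median(history)
    mad = statistics.median([abs(v - median) for v in history])

    flags = []
    for value in new_values:
        if mad == 0:
            # History has no spread: anything different from the median is unusual.
            flags.append(value != median)
        else:
            robust_z = 0.6745 * abs(value - median) / mad
            flags.append(robust_z > threshold)
    return flags


# Hypothetical hourly data volumes (MB) exchanged with a vendor system.
history = [100, 102, 98, 101, 99, 103, 97]
incoming = [101, 250, 96]

print(flag_anomalies(history, incoming))  # [False, True, False]
```

A production system would learn richer patterns (time of day, destination, payload type), but the structure is the same: build a baseline from past behavior, score new activity against it, and alert when the score crosses a threshold.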

In conclusion, we are in an age of data proliferation, increased cyber attacks, and cyber security incidents. The only way to manage data protection, reduce risk, and increase security is to automate the process. Artificial intelligence mechanisms like machine learning can help by sifting through vast quantities of data and using intelligent algorithms to learn and detect patterns of vulnerability, so that cyber threats are thwarted and your organization is protected.
