Predicting Length of Stay of Patients with Lung Cancer and Mental Illness

To predict the length of stay (LOS) of lung cancer patients who have undergone lobectomy and have a mental illness, the team developed multiple machine learning models. For this study, we split the data into 80% training data (4,464 samples) and 20% test data (1,117 samples).
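An 80/20 split like the one described above can be sketched with the standard library alone. This is a minimal illustration, not the team's actual pipeline: the function name, the fixed seed, and the total of 5,581 rows (the sum of the two sample counts quoted above) are assumptions made for the example.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * (1 - test_fraction))
    return indices[:n_train], indices[n_train:]

# 5581 = 4464 + 1117, the sample counts quoted in the text.
train_idx, test_idx = train_test_split_indices(5581)
print(len(train_idx), len(test_idx))  # 4464 1117
```

Splitting on shuffled indices rather than on the rows themselves makes it easy to apply the same partition to the features and the LOS target.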

We divided this problem statement into two areas of evaluation:

  • Predicting LOS of a patient with both lung cancer and mental illness using only Diagnosis codes.
  • Predicting LOS of a patient with both lung cancer and mental illness using both Diagnosis codes and Socio-demographic features.

We then trained models using the following algorithms and frameworks:

  • SGDRegressor
  • GradientBoostingRegressor
  • LinearRegression
  • KNeighborsRegressor
  • RandomForestRegressor
  • SVR
  • TensorFlow



Determining Correlation between Mental Illness and Lung Cancer using Machine Learning


For any machine learning algorithm to be designed, it is important to understand the variability of the data and skewness, as well as the assumptions that we can make to build machine learning models. Here are some key statistical distribution models of the dataset we used for our study:



Exploratory Analysis of Mental Illness Data amongst Lung Cancer Patients

During the course of our study, we specifically focused on lung cancer patients who have undergone lobectomy (lung cancer surgery) and analyzed whether any specific mental illness or psychiatric diagnoses, or groups of diagnoses, increase perioperative death risk.



Analyzing Severe Mental Illness in Lung Cancer Patients

Lung cancer is the number one cause of cancer-related deaths worldwide. Patients with severe mental illness (SMI) are a group who are overrepresented in the lung cancer population. SMI refers to psychological problems, including mood disorders, major depression, schizophrenia, bipolar disorder, and substance abuse disorders, that inhibit a person’s ability to engage in functional and occupational activities.

Cancer patients diagnosed with SMI may not adhere to treatment plans and may have reduced access to healthcare. Individuals with SMI may have advanced tumor growth at diagnosis due to factors such as limited access to healthcare and healthcare systems. The aggregation of inadequate healthcare and increased risk for somatic disorders in patients with SMI can explain higher mortality rates. Many research papers have indicated that cancer represents a significant proportion of excess mortality for people with mental illness. Mental illness is typically associated with suicide, but much of the excess mortality rates associated with mental illness are due to cardiovascular or respiratory diseases and cancer.


The Perfect Data Strategy for Improved Business Analytics

Advancements in AI and Machine Learning have increased the importance of data analytics and, therefore, of data itself. Unless you have established the prerequisite steps of data collection, data storage, and data preparation, it is impossible to move on to the data science process.

At Allwyn, we believe that the journey towards improved operations and decision-making starts with establishing a good data strategy and establishing the tools and processes required to easily analyze your enterprise data. This involves starting with your data discovery and data collection, organizing the data in a data warehouse or a data lake, and finally using Machine Learning to perform deep data analytics to enhance productivity, launch new business models or establish a strong competitive advantage. We have an established data life cycle process that starts with data discovery and ends with reaching business outcomes through Data Analysis, Machine Learning, and AI. We employ a two-phased approach to data transformation and operational transformation, as shown below.

In the first phase of data transformation, our goal is to design, build and maintain an enterprise data warehouse or a data lake. This helps make the most of an organization’s valuable data assets, break down data silos, and create a data maturity model that helps deliver accurate, near-real-time data for the next phase. During this phase, we also establish data governance that focuses on the privacy and security of the data.

The second phase focuses on data analytics – predictive, prescriptive or diagnostic analytics that can help the various departments of your business with actionable insights. In this phase, we also help with rapid prototyping and experimenting with advanced analytics such as machine learning and AI. We help you adopt machine learning into your data analytics to support your product innovation and offer you a competitive edge in the marketplace.

Our data management strategy provides an enterprise with quick and complete access to the data and the analytics it needs through the four steps described below.

  1. Collect: Ingestion/Data Prep/Data Quality/Transformation

In this step, we access and analyze both real-time and stationary data to reliably determine data quality and extract, transform, and blend data from multiple sources.  We then map and prepare the data to load into a target Data Lake. It is important to identify all your data sources and data streams to determine your data acquisition and establish the frequency of your batch process. This also involves establishing your infrastructure to help with the high volume of data streams and supporting a distributed environment.

When multiple systems exist in silos, members of the organization do not operate off the same data, which makes data-driven decisions difficult. To overcome this challenge, businesses are moving towards a single source of truth model.

With a single source of truth (SSOT), data is aggregated from many systems within an organization to a single location. This reduces duplication and hence enhances data quality. An SSOT is not a system, tool, or strategy, but rather a state of being for a company’s data in that it can all be found via a single reference point.

  2. Store

We use a scalable, reliable cloud-based data lake comprising various data repositories for both structured and unstructured formats to ensure reliable data storage. In this step, you cleanse, categorize, and store the data as per your business functions. For example, you can establish separate functional areas for sales, marketing, finance, and procurement-related data. This will help you establish a functional unit while identifying the need for data integrators across functions.

  3. Process/Analyze

Once the data is identified, organized, and stored, your data is ready for data analysis, building machine learning models, or statistical analysis. Data analysts or data scientists can run multiple queries or develop algorithms to analyze trends, discover business intelligence, and present outcomes to make smart decisions.

  4. Visualize

The output of the data analysis needs to be presented in a visual dashboard to provide meaningful answers to key questions driving business decisions.  Here, we provide not only insightful visual dashboards but also search-driven “Google-like” products with natural language processing capabilities that present answers in an easy-to-understand form for all levels of data users and the public. With products like ThoughtSpot, users can type a simple Google-like search in natural language to instantly analyze billions of rows of data.  Users can also easily converse with data using casual, everyday language and get precise answers to questions instantly.


Getting your data strategy in place is the first step to start with data analytics, data science and the AI journey. As the marketplace continues to rattle business models, adopting newer data analytics tools such as machine learning can help you not only stay ahead of competition but also continue to operate your business successfully in uncertain times. This can lead to a data-driven value cycle that can help pave the way for accomplishing the transformational change that is essential to become an AI-enabled organization.

Watch this space or follow us on LinkedIn to stay tuned to the latest digital trends and technology advancements.


Eliminating Major Barriers for Data Insights

The lifecycle of Data, Data Analytics, and Data Science starts with collecting data from relevant data sources, performing ETL (Extract, Transform, Load) functions, cleaning the data, and putting it into a machine-readable format. Once the data is ready, statistical analysis or machine learning algorithms can identify patterns, predict outcomes, or even perform functions using Natural Language Processing (NLP). Since data is at the core of data analytics, it is imperative to understand the challenges we might face in implementing it successfully. Here we present the top four data challenges:

Complexity: Data spread across various sources

Merging data from multiple sources is a major challenge for most enterprise organizations. According to McAfee, an enterprise with an average of 500 employees can deploy more than 20 applications. Larger enterprises with more than 50,000 employees run more than 700 applications. Unifying the data from these applications is a complicated task that can lead to duplication, inconsistency, discrepancies, and errors. With the help of data integration and profiling, the accuracy, completeness, and validity of the data can be determined.

Quality: Quality of incoming Data

One of the common data quality issues in the merging process is duplicate records. Multiple copies of the same record can lead to inaccurate insights as well as computation and storage overuse.

What if the collected data is missing, inconsistent, or out of date? Data verification and matching methods need to be implemented at each collection point to prevent flawed insights and biased results.
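The duplicate and missing-value checks described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name, the record layout, and the sample rows are hypothetical, and a real collection point would add rules for inconsistent or stale values.

```python
def quality_report(records, key_field, required_fields):
    """Count duplicate keys and records with missing required values."""
    seen = set()
    duplicates = 0
    incomplete = 0
    for record in records:
        key = record.get(key_field)
        # A key seen before marks this record as a duplicate copy.
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
        # Treat None and the empty string as missing values.
        if any(record.get(field) in (None, "") for field in required_fields):
            incomplete += 1
    return {"total": len(records), "duplicates": duplicates, "incomplete": incomplete}

# Hypothetical merged records from two source systems.
records = [
    {"id": 1, "name": "Ann", "email": "ann@example.com"},
    {"id": 2, "name": "Bob", "email": ""},
    {"id": 1, "name": "Ann", "email": "ann@example.com"},
]
report = quality_report(records, "id", ["name", "email"])
print(report)
```

Running such a report at each collection point gives an early signal before flawed records reach the analysis stage.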

Volume: Volume of data available

To find relationships and correlations, a successful machine learning algorithm depends on large volumes of data. Data collected from multiple sources and multiple time frames is essential in creating machine learning models during training, validation, and deployment phases. More data does not necessarily mean gathering more records but can mean adding more features to the existing data from different sources that can improve the algorithm.

Algorithm: Conscious effort to remove confirmation bias from the approach

A major advantage of AI over human decision-making is that an algorithm’s decision-making process can be inspected (using explainable AI). Furthermore, algorithms can be analyzed for biases, and their outcomes verified for unfair advantages to protected classes. Although AI, at the outset, can be viewed as perpetuating human biases, it offers better insights into the data and decision-making process.

Over the last decade, Allwyn has overcome these common Data challenges with the proven experience of its seasoned Data professionals. We will share our own Data Management Strategy in next week’s post. Watch this space or follow us on LinkedIn to stay tuned.


Opportunities and Challenges of Driving Value from Data Analytics

Over the next few posts, we will be talking about the progression of Data Analytics: where we are today and where we are headed next. But first, some history. With basic statistics as its foundation, the use of Analytics dates back to the 1900s, and it began receiving significant attention in the late 1960s when computers became decision-making support systems.

Data analytics has dominated almost all the industries of the world, and data collection has become an integral part of any organization. These days every click or scroll you do, and every time you open an app, huge amounts of data are being generated and stored for business intelligence and data mining.

Various industries like finance, banking, transportation, manufacturing, e-commerce, and healthcare, use this data to make smarter decisions, gain meaningful insights and predict outcomes. Today, businesses are increasingly using data science to uncover patterns, build predictions using data, and employ machine learning and AI techniques.

For example, the Banking industry uses data analytics in credit risk modeling, fraud detection, and customer lifetime value evaluation. Erica, the virtual assistant of Bank of America, gets smarter with every transaction made by studying customers’ banking habits and suggests relevant financial advice. Finance industries use machine learning algorithms to segment their customers, personalize relationships with them, and increase their businesses’ profitability.


Predictive analytics is another aspect of data science that has become necessary for the transportation and logistics industry. Public and private transportation providers use statistical data analysis to map customer journeys and provide people with personalized experiences during normal and unexpected circumstances. Logistics companies use artificial intelligence to optimize their operations in distribution networks, anticipate demand, and allocate resources accordingly.

Data science and AI in biomedical and healthcare data are modernizing the healthcare industry by providing public health solutions. From medical image analysis and drug discovery to personalized medicine, data analytics is revolutionizing patient outcomes.  Data science and machine learning have revealed that there are solutions to the most difficult problems in different industries, and the future success of companies relies on their adoption of data-centric approaches to discover actionable insights. By automating the analytic process, the time value of unlocking insights can be accelerated to provide rapid forecasting and decision making.

“By 2020, 50% of analytic queries will be generated using search, natural-language processing or voice, or will be auto-generated.” – Gartner Analytics Magic Quadrant, 2019

We will discuss major challenges and opportunities that businesses face in adopting various Data Analytics techniques in next week’s post. Watch this space or follow us on LinkedIn to stay tuned.


Expert Analysis on Implementation of Machine Learning on Lobectomy Data.

Our research has enabled us to develop models suitable for targeting and capturing nearly eight readmitted patients out of every 10. Our final model revealed a combination of demographic and diagnosis related features. These combinations further allowed us to analyze the likelihood of someone being readmitted when going through a lobectomy procedure.

This has helped us understand which variables contribute the most to the model.

Circulatory system diseases (I00-I99), certain infectious and parasitic diseases (A00-B99), neoplasms (C00-D49), musculoskeletal system and connective tissue diseases (M00-M99) were among the top contributing factors to the predictive ability of our model in the medical factors.
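Grouping individual diagnosis codes into the ICD-10 ranges named above can be sketched as a simple lookup. This is an illustrative sketch only: it covers just the four ranges mentioned in this post, the function name is hypothetical, and the sample codes are used purely as examples.

```python
# Only the four ranges named in the text; every other code maps to "Other".
ICD10_GROUPS = [
    ("A00", "B99", "Certain infectious and parasitic diseases"),
    ("C00", "D49", "Neoplasms"),
    ("I00", "I99", "Circulatory system diseases"),
    ("M00", "M99", "Musculoskeletal system and connective tissue diseases"),
]

def icd10_group(code):
    """Map an ICD-10 code such as 'C34.11' to its code-range group."""
    # The first three characters identify the category, e.g. 'C34'.
    category = code.replace(".", "").upper()[:3]
    for start, end, label in ICD10_GROUPS:
        # Fixed-width categories compare correctly as plain strings.
        if start <= category <= end:
            return f"{label} ({start}-{end})"
    return "Other"

for code in ["C34.11", "I25.10", "J18.9"]:
    print(code, "->", icd10_group(code))
```

Collapsing thousands of codes into a handful of groups reduces the number of input features and makes feature importance easier to read.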

By understanding the likelihood of a patient’s readmission, pre/post-operative interventions such as weight loss, home monitoring programs, or additional medical procedures can be introduced into a patient’s hospital care cycle, which would improve their outcome and reduce the relative costs for the patient, the healthcare provider, and the hospital.

Likewise, our approach can target different medical procedures for any dataset with similar information but not necessarily all the features used in our models.


One of the key limitations we faced in our research was that the ICD-10 data was available only from Q4 2015 to Q4 2017. This limited our research to the existing data from a two-year period.

Similar research done on readmission cases covers a decade’s worth of data.

Acquiring more data would enable us to further optimize the models based on the desired target metric and help with class imbalance. The study is limited to the non-medical factors that are being collected in the NRD, and depending on healthcare information providers, the final model is subject to change.

Next Steps

  • Refine the readmission predictive analysis model on a smaller subset of medical and non-medical features and perform more real-world data validation.
  • Refine the model by applying to more massive data sets from other sources.
  • Work with the medical community on possible preventive actions to reduce readmissions.

The Healthcare industry has been one of the primary adopters of Machine Learning initiatives in the past decade. Applications of ML go beyond this prescriptive analysis and can even contribute to highly sensitive AI operations.

Follow us on LinkedIn to stay tuned with the latest technology trends. Or connect with our experts on


Applying the right Machine Learning model for accurate statistics of Lobectomy Patients

More than ten different classification methods, such as Logistic Regression, Random Forest, and XGBoost, were applied to different feature combinations to compare our target classification metrics and choose an optimum model.

Models whose scores consistently stayed within a close range during validation were chosen. The best performing models were further optimized for high recall scores through cross-validation and grid search methods while keeping precision and accuracy in an acceptable range.  We chose an XGBoost model with a combination of socioeconomic and medical code groups as the final model due to its 75% recall, interpretability, high efficiency, and fast scoring time.

XGBoost, which falls into the gradient boosting framework of machine learning algorithms, has been a consistent, highly efficient problem solver and can run in major distributed environments.

Recall is the ability of a model to find all relevant cases within a dataset. In our case, true positives (TP) were the correctly classified readmitted patients, and false negatives (FN) were the readmitted patients who were incorrectly classified as not readmitted.

We specifically aimed for higher recall scores (TP / (TP + FN)) since accuracy for an imbalanced dataset would not be a good measure to assess model performance, and we had to focus on identifying the readmitted patients to target and further analyze their underlying features properly.
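A small worked example shows why accuracy misleads on an imbalanced dataset while recall does not. It uses the standard definitions, recall = TP / (TP + FN) and precision = TP / (TP + FP). The 100-patient cohort and both sets of predictions are made up for illustration and are not results from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means readmitted."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def scores(y_true, y_pred):
    """Compute recall, precision, and accuracy, returning 0.0 on empty denominators."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return {"recall": recall, "precision": precision, "accuracy": accuracy}

# Hypothetical cohort: 8 readmitted patients, 92 not readmitted.
y_true = [1] * 8 + [0] * 92
always_negative = [0] * 100                        # never flags anyone
screening = [1] * 6 + [0] * 2 + [1] * 20 + [0] * 72  # finds 6 of 8, 20 false alarms

print(scores(y_true, always_negative))  # accuracy 0.92, recall 0.0
print(scores(y_true, screening))        # accuracy 0.78, recall 0.75
```

The model that never flags a readmission scores higher on accuracy yet finds none of the patients of interest, which is why recall was the metric to optimize.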

Feature importance of the final XGBoost model and recall/accuracy curve

The final model showed that socioeconomic features such as the pay category being Medicare, patient age, gender, wage index, and the population category of patients, together with their diagnosis code groups and many other features, contribute to classification for readmission.

Follow us on LinkedIn and do not miss our final blog post on Machine Learning for Lung Cancer.


Machine Learning to Improve Outcomes by Analyzing Lung Cancer Data

Finding a suitable dataset for machine learning to predict readmission was the first challenging task we had to overcome, since presently available datasets in the healthcare world tend to be either dirty and unstructured or clean but lacking information.

Most patient-level data are not publicly available for research due to privacy reasons.

With these limitations in mind, after researching multiple data sources, including SEER-MEDICARE, HCUP, and public repositories, we decided to choose the Nationwide Readmissions Database (NRD) from the Healthcare Cost and Utilization Project (HCUP). The Agency for Healthcare Research and Quality (AHRQ) sponsors the HCUP databases through a Federal-State-Industry partnership, and NRD is a unique database designed to support various types of analyses of national readmission rates for all patients, regardless of the expected payer for the hospital stay.

Our research involved using machine learning and statistical methods to analyze NRD. Data understanding, preparation, and engineering were the most time-consuming and complex phases of this data science project, which took nearly seventy percent of the overall time.

Using big data processing and extraction technologies like Spark and Python, 40 million patient records were filtered to keep only patients who had undergone a lobectomy procedure at least once. The filtered data was then put through data quality checks and cleaned, and missing values were imputed.  More than 100 input variables were explored: we analyzed their correlations with the outcome, used them to understand our target group’s demographics, and identified those that were redundant.

Many of these features were categorical and required additional research and feature engineering.

The NRD dataset consists of three main files: Core, Hospital, and Severity.

The Core file included the patient-level medical and non-medical factors. The socioeconomic factors include age, gender, payment category, the urban/rural location of a patient, and many more. The medical factors include detailed information about every diagnosis code, procedure code, their respective diagnosis-related groups (DRG), the time of those procedures, the yearly quarter of the admission, etc.

Allwyn data engineering practices included analyzing every single feature, researching and creating data dictionaries, and transforming features to see which ones contribute to our prediction algorithms.  With an average age of 65 for lobectomy patients, the data showed that women had more lobectomies than men, but more men were readmitted than women.

The Severity file further provided us with the summarized severity level of the diagnosis codes. The Hospital file provided hospital-level information such as bed size, control/ownership of the hospital, urban/rural designation, and teaching status of urban hospitals, etc.

We consulted subject matter experts in the lung cancer field and, through their advice, added additional features such as Elixhauser and Charlson comorbidity indices to enrich our existing dataset. By delving deep into the clinical features, we also ensured the chosen variables are pre-procedure information and verified no information leakage from post-operative or known future level variables.

The features were then analyzed to check whether they had statistical significance with our selection of predictive models by looking at correlation matrices and feature importance charts.

Analyzing the initial data distribution for many of the features required us to remove outliers, transform skewed distributions, and scale the majority of the features for algorithms that were particularly sensitive to non-normalized variables. Diagnosis codes were grouped into 22 categories to reduce dimensionality and improve interpretation.
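The skew transformation and scaling steps described above might look like the following sketch. The post does not say which transforms were used, so the log transform and z-score scaling here are assumptions, as are the function names and the sample values.

```python
import math
import statistics

def log_transform(values):
    """Compress a right-skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]

def standardize(values):
    """Scale a feature to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical right-skewed feature with one extreme value.
raw = [1, 2, 2, 3, 4, 30]
scaled = standardize(log_transform(raw))
print([round(v, 2) for v in scaled])
```

Applying the log transform before scaling keeps a single extreme value from dominating the standardized feature, which matters for distance-based and gradient-based algorithms.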

The resulting dataset was highly imbalanced in terms of the readmitted and not readmitted classes, 8% and 92%, respectively. Most classification models are extremely sensitive to imbalanced datasets, and multiple data balancing techniques such as oversampling the minority class, under-sampling the majority class, and Synthetic Minority Oversampling Technique (SMOTE) were used to train our algorithms and compare the outcomes.
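Of the balancing techniques listed above, random oversampling of the minority class is the simplest to sketch. The function name, seed, and toy rows below are assumptions for illustration. SMOTE differs in that it synthesizes new minority points by interpolating between neighbors rather than copying existing rows.

```python
import random

def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [row for row, y in zip(rows, labels) if y == minority_label]
    n_majority = len(labels) - len(minority)
    n_extra = max(0, n_majority - len(minority))
    if not minority or n_extra == 0:
        # Nothing to sample from, or the classes are already balanced.
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(rows) + extra, list(labels) + [minority_label] * n_extra

# Hypothetical data with the 8% / 92% class split described in the text.
rows = [[i] for i in range(100)]
labels = [1] * 8 + [0] * 92
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(1), balanced_labels.count(0))  # 92 92
```

Resampling is normally applied to the training portion only, so that validation and test scores still reflect the original class distribution.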

Initial machine learning models had both low precision and recall scores. Although this could be due to many different reasons, the Allwyn team focused mainly on additional feature engineering to remove the high dimensionality of initial input variables while also comparing different data balancing methods. This was a time-consuming iterative process and required training more than a thousand different models on different combinations or groupings of diagnosis codes (shown in Table 2) along with other non-medical factors.

K-fold cross-validation was also used during training and validation to ensure that training results were representative of test performance. We also weighted the admission and readmission classes, training models and comparing their validation scores, to better classify the readmitted patients.
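The mechanics of K-fold cross-validation can be sketched as an index generator. This is a minimal, unshuffled version with a hypothetical function name. On imbalanced data one would typically shuffle and stratify the folds so that each one keeps the class ratio, which this sketch does not do.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, k=3))
for train, validation in folds:
    print(len(train), validation)
```

Each sample lands in exactly one validation fold, so averaging the k validation scores uses every row once for evaluation.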

We also collaborated with George Mason University through their DAEN Capstone program.  The team led by Dr. James Baldo and several participants from the graduate program analyzed the underlying data and developed predictive models using various technologies, including AWS SageMaker Autopilot. The resulting models and their respective hyperparameters were further analyzed and tuned to achieve high recall.

After choosing the best model, we designed and implemented this workflow in Alteryx Designer to automate our process and put it into a feedback and re-evaluation phase, following the Cross-Industry Standard Process for Data Mining (CRISP-DM), to enable our model to evolve and be deployed in production.

To know more about how we decided on the best model and associated classification methods, follow us on LinkedIn.

