Business Problem
Train arrival delays are common and must be managed efficiently to minimize waiting time for passengers. The primary objective of this project was to analyze existing delays in train schedules in a time-bound manner and then predict the delay for any given time and route, taking into account the factors that influence a train's arrival at a station. Our client, a large railway transportation company in the US, provides public commuter train services and manages the scheduling and operation of trains on various routes. Delays on the routes the client operates must be kept as small and as predictable as possible. Passengers and the client are the main stakeholders in the project, and any operational improvement has a direct impact on their participation in the railroad service.
Keeping train delays predictable affects passengers' commute patterns and the frequency of train services on all routes.
The following technologies were used in developing our Predictive Analytics solution:
● Data engineering using time series and metadata
● Machine Learning Operations (MLOps)
● DeepAR+ on Amazon Web Services
● Professional visualization using Tableau
● Training and testing of ML models using the DeepAR+ algorithm
● Amazon S3 for storing the training dataset
The complexity of this project stems from the following factors:
● Prediction accuracy depends heavily on the historical data set and on identifying past patterns in delay time.
● A clean data set must be maintained at all times for accurate prediction.
● Feature engineering and identification of the most influential parameters for the model.
● Training and testing of ML models using the DeepAR+ algorithm.
● Controlling the error between the forecasted and actual schedule in real time is challenging.
● Visualizing the 14-day predictions in Tableau.
Methodology/approach used for executing the work
Data extraction
Ingesting large volumes of train delay data in real time is necessary to understand the existing pattern of delays. The ingested data comprises delay data, route data, and schedule data. Data is ingested in CSV format using the AWS Glue data integration platform. The ingested data is split and used for training and testing the models, which are developed after the MLOps pipeline has processed the data and extracted information from it.
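As a minimal sketch of the split described above, the following assumes a chronological holdout, which is a common choice for time series but is not confirmed by the project description. The record fields (`route_id`, `date`, `delay_minutes`) and the 14-day holdout length are hypothetical names and values chosen for illustration.

```python
from datetime import date


def chronological_split(records, holdout_days=14):
    """Split records by date, keeping the most recent `holdout_days` days for testing.

    Assumption: each record is a dict with a `date` field (datetime.date).
    """
    if not records:
        return [], []
    last_day = max(r["date"] for r in records)
    cutoff = last_day.toordinal() - holdout_days
    # Everything up to and including the cutoff day is used for training.
    train = [r for r in records if r["date"].toordinal() <= cutoff]
    # The final `holdout_days` days are held back for testing.
    test = [r for r in records if r["date"].toordinal() > cutoff]
    return train, test


# Hypothetical sample: one route, 30 consecutive days of delay data.
sample_records = [
    {"route_id": "R1", "date": date(2023, 1, d), "delay_minutes": float(d % 7)}
    for d in range(1, 31)
]
train_rows, test_rows = chronological_split(sample_records, holdout_days=14)
```

Splitting by time rather than at random keeps future observations out of the training set, so the test error is closer to what a real forecast would see.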
Data Analysis
Data analysis is crucial to understanding how train delays arise in real time, and it helps identify the key parameters that shape the train delay data set. The analysis starts with the target data set, which is uploaded from the AWS Glue data integration platform into an Amazon S3 bucket. The ingested target data is then cleansed with basic sanity checks and transformed into a valid dataset. AWS Step Functions or AWS Lambda functions perform the basic data wrangling operations and clean the data for the subsequent machine learning steps.
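The kind of basic sanity check described above could look like the following sketch. The column names (`route_id`, `timestamp`, `delay_minutes`), the timestamp format, and the specific rules (drop unparseable rows, rows without a route, negative delays, and duplicates) are assumptions for illustration, not the project's actual cleansing logic.

```python
import csv
import io
from datetime import datetime


def clean_delay_rows(csv_text):
    """Parse raw CSV text and keep only rows that pass basic sanity checks."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            ts = datetime.strptime(row["timestamp"].strip(), "%Y-%m-%d %H:%M:%S")
            delay = float(row["delay_minutes"])
        except (KeyError, ValueError, TypeError, AttributeError):
            continue  # missing or unparseable field
        route = (row.get("route_id") or "").strip()
        if not route or delay < 0:
            continue  # assumed rule: a route is required and delays are non-negative
        key = (route, ts)
        if key in seen:
            continue  # duplicate observation for the same route and time
        seen.add(key)
        cleaned.append({"route_id": route, "timestamp": ts, "delay_minutes": delay})
    return cleaned


# Hypothetical raw input containing a duplicate, a bad number, a missing route,
# and a negative delay.
raw_csv = """route_id,timestamp,delay_minutes
R1,2023-01-01 08:00:00,5
R1,2023-01-01 08:00:00,5
R2,2023-01-01 09:00:00,abc
,2023-01-01 10:00:00,3
R3,2023-01-01 11:00:00,-2
R3,2023-01-01 12:00:00,7.5
"""
clean_rows = clean_delay_rows(raw_csv)
```

In a Lambda-based setup, a function like this would typically be called from the handler after the object is read from S3; that wiring is omitted here.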
Training and testing
The training set is also extracted from the target data set during the data ingestion phase and is used to train the model built with machine learning algorithms. The AWS Glue data integration platform ingests the target data in CSV format into Amazon S3 as training data and passes the transformed data to the ML Lambda function. This is a part of the MLOps workflow that is automated so that the model continues to learn from new data in real time.
Predicting delay time
The training data set extracted from the target data set is fed into AWS Step Functions or AWS Lambda. Amazon Forecast is then used to apply supervised machine learning that combines time series modeling with hyperparameter optimization. The algorithm used, DeepAR+, is a supervised time series algorithm available in Amazon Forecast; it keeps the delay time predictions up to date as data is ingested in real time.
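As a rough sketch of how a DeepAR+ predictor might be requested from Amazon Forecast, the helper below builds the parameters for the boto3 `create_predictor` call without sending them. The predictor name, the dataset group ARN, the daily forecast frequency, and the choice to enable hyperparameter optimization are all assumptions made for illustration; the project's actual configuration is not given.

```python
# ARN of the built-in DeepAR+ algorithm in Amazon Forecast.
DEEP_AR_PLUS_ARN = "arn:aws:forecast:::algorithm/Deep_AR_Plus"


def build_predictor_request(predictor_name, dataset_group_arn, horizon_days=14):
    """Build keyword arguments for the Amazon Forecast `create_predictor` call."""
    return {
        "PredictorName": predictor_name,
        "AlgorithmArn": DEEP_AR_PLUS_ARN,
        "ForecastHorizon": horizon_days,  # number of future steps to predict
        "PerformAutoML": False,           # the algorithm is chosen explicitly
        "PerformHPO": True,               # assumed: hyperparameter optimization on
        "InputDataConfig": {"DatasetGroupArn": dataset_group_arn},
        "FeaturizationConfig": {"ForecastFrequency": "D"},  # assumed: daily data
    }


# Hypothetical names and ARN, for illustration only.
predictor_request = build_predictor_request(
    "train_delay_predictor",
    "arn:aws:forecast:us-east-1:123456789012:dataset-group/train_delays",
)

# With boto3 installed and AWS credentials configured, the request could be
# submitted as:
#   import boto3
#   boto3.client("forecast").create_predictor(**predictor_request)
```

Keeping the request as a plain dictionary makes it easy to review or version the configuration before any AWS resources are created.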
Pertinent results or outcomes
The project aims to develop an MLOps workflow on the Amazon Web Services (AWS) cloud computing platform that stores and retrieves data sets in real time. The workflow intended to achieve better delay time prediction using MLOps is as follows:
● Create a target dataset
● Ingest the data set into the data integration platform
● Import the data as training and test data sets in the pre-processing stage
● Create a predictor model using the DeepAR+ algorithm and train the model
● Create a forecast of the expected delay time (for the next 14 days)
● Update data sources and other model parameters in real time
● Reduce the error of the forecasted schedule relative to the actual schedule
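The last step of the workflow depends on measuring how far the forecast is from the actual delays. The sketch below computes three common error measures (MAE, RMSE, and WAPE) on paired actual and predicted delay minutes. Which metric the project tracked is not stated, so the choice here is an assumption, and the sample values are made up.

```python
import math


def forecast_errors(actual, predicted):
    """Return MAE, RMSE, and WAPE for paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
    mae = sum(abs_err) / len(abs_err)
    rmse = math.sqrt(sum(e * e for e in abs_err) / len(abs_err))
    total_actual = sum(abs(a) for a in actual)
    # WAPE is undefined when every actual value is zero.
    wape = sum(abs_err) / total_actual if total_actual else float("nan")
    return {"mae": mae, "rmse": rmse, "wape": wape}


# Hypothetical delay minutes for four train arrivals.
error_metrics = forecast_errors([10.0, 0.0, 5.0, 5.0], [8.0, 1.0, 5.0, 9.0])
```

WAPE is often more robust than a per-row percentage error for delay data, because many arrivals have an actual delay of zero.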
The Architecture
Our solution for train delay prediction is illustrated below. The technology stack is primarily AWS, including S3, Forecast, and Athena.
Visualization of the data – Tableau
The dashboard compares actual and predicted delay minutes, grouped by delay code. In production, it is updated every two weeks based on the latest forecast.
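The aggregation behind such a dashboard can be sketched as follows. The field names (`delay_code`, `actual_minutes`, `predicted_minutes`) and the sample delay codes are hypothetical; in the project, this grouping would be done in the data feeding Tableau or in Tableau itself.

```python
from collections import defaultdict


def delay_by_code(rows):
    """Sum actual and predicted delay minutes for each delay code."""
    totals = defaultdict(lambda: {"actual": 0.0, "predicted": 0.0})
    for row in rows:
        bucket = totals[row["delay_code"]]
        bucket["actual"] += row["actual_minutes"]
        bucket["predicted"] += row["predicted_minutes"]
    return dict(totals)


# Hypothetical rows joining actual delays with forecasted delays.
dashboard_rows = [
    {"delay_code": "WEATHER", "actual_minutes": 12.0, "predicted_minutes": 10.0},
    {"delay_code": "SIGNAL", "actual_minutes": 4.0, "predicted_minutes": 6.0},
    {"delay_code": "WEATHER", "actual_minutes": 3.0, "predicted_minutes": 5.0},
]
delay_summary = delay_by_code(dashboard_rows)
```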