Karen Cheng, Principal Investigator, Data Scientist
Ron Keesing, Division Manager
Mark Clark, Program Manager
Caitlin Burgess, Program Management Support
Tifani O’Brien, Pilot Project Lead and Concept Initiator
Coleen Davis, Data Scientist
David Morgenthaler, Data Scientist
Jevon Spivey, Architecture Administrator
Leidos, formerly known as Science Applications International Corporation (SAIC), is an American defense, aviation, information technology (Lockheed Martin IS&GS), and biomedical research company headquartered in Reston, Virginia, that provides scientific, engineering, systems integration, and technical services. The Leidos Innovations Center (LInC) rapidly prototypes and fields solutions in areas such as Artificial Intelligence/Machine Learning, big data, cyber, surveillance systems, autonomy, sensors, applied biology, and directed energy. This project is a Machine Learning and data analytics web-based deployment that analyzes project execution data for continuous process evaluation and improvement using the full Dataiku lifecycle pipeline: 1) data preparation, 2) data exploration and visualization, 3) AutoML machine learning, and 4) web-based user dashboard deployment.
Software development teams often lack sufficient actionable information and analysis to reliably forecast effort, as well as the real-time metrics needed to monitor and assess their own productivity. Our goal in this effort is to use analytics to improve agile-based software project execution processes by identifying key drivers of success and predicting various outcomes.
The Software Development Analytics project creates data-mining and visualization approaches that Leidos will use to identify and analyze software best practices. The team will use predictive machine learning classification approaches that incorporate the identified key performance indicators to accurately forecast software development success probabilities. Predictive analytics will learn from historical performance data to predict and quality-check the anticipated levels of effort for successful task completion. Lastly, the visualizations will be deployed via a web-accessible dashboard to support ongoing program performance tracking and to make the data-mined visualizations and predictive analytics accessible to interested parties.
This research analyzes various data produced during the agile software development process that indicate measurable business activity impacting the quality and delivery of software code. Efficient data Extraction, Transformation, and Loading (ETL), along with data cleaning, aggregation, and joining, is required to assemble and store the data.
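The aggregation and joining steps above can be sketched as follows. This is a minimal illustration in pandas; the file layout, column names, and values are hypothetical stand-ins, not the project's actual schema.

```python
import pandas as pd

# Hypothetical example: roll issue-level data up to the sprint level and
# join it with planning data. All names and numbers are illustrative.
sprints = pd.DataFrame({
    "sprint_id": [1, 2, 3],
    "planned_points": [40, 35, 50],
})
issues = pd.DataFrame({
    "sprint_id": [1, 1, 2, 3, 3],
    "completed_points": [8, 30, 35, 20, 25],
})

# Aggregate, then join on the shared key.
completed = issues.groupby("sprint_id", as_index=False)["completed_points"].sum()
merged = sprints.merge(completed, on="sprint_id", how="left")

# Simple derived metric: fraction of planned work delivered per sprint.
merged["delivery_ratio"] = merged["completed_points"] / merged["planned_points"]
print(merged)
```

In practice these steps run against many input files at once, which is why a repeatable pipeline matters.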
Our project plan was to initially analyze pilot software programs that could scale in the future to support evaluation of multiple programs. Therefore, an understandable and reproducible pipeline is ideal.
We desire to use state-of-the-art machine learning and Bayesian analytics to identify the key drivers for successful software execution, as well as discover pitfalls. We will also identify the best technical approaches for classification and supervised predictive learning approaches. This requires extensive data analysis and an iterative model exploration approach.
Lastly, as the insights discovered will also be used for process monitoring and evaluation, a dashboard will enable our technical development team to make the results accessible to various stakeholders.
This project involves the full data analysis lifecycle from data wrangling to an interactive dashboard that showcases the resulting visualizations and analytics as depicted below.
We employed Dataiku in all phases of our pipeline:
1. Repeatable pipelines and workflow analysis
Dataiku greatly facilitates the organization and visualization of the pipeline workflows. Dataiku’s DSS pipeline allows us to easily scale the project to evaluate additional software programs because we are able to quickly identify the single point within the pipeline that needs modification, without disturbing the common components of the pipeline. The clean workflow presentation helps our team keep the code more maintainable and understandable. The sequential and modularized organizational approach of the pipeline steps supports an easier transition when adding new developers to the project, as the flow visualization is inherently self-documenting since the processing steps are more apparent.
2. Data acquisition and storage
Dataiku was used to assemble, store, and “data wrangle” the various input files. Dataiku’s built-in file system and database solutions allowed us to quickly access the data and utilize SQL on the resulting datasets, without requiring us to spend our time on building a data lake.
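As a stand-in for how SQL is applied to the assembled datasets, the sketch below uses an in-memory SQLite database; in DSS the same kind of query runs against managed datasets, and the table and values here are purely illustrative.

```python
import sqlite3

# Illustrative stand-in for querying assembled project data with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (program TEXT, status TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [("A", "done", 5), ("A", "open", 3), ("B", "done", 8)],
)

# Aggregate completed story points per program.
rows = conn.execute(
    "SELECT program, SUM(points) FROM tasks "
    "WHERE status = 'done' GROUP BY program ORDER BY program"
).fetchall()
print(rows)  # [('A', 5), ('B', 8)]
```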
3. Data processing
Dataiku’s visual recipes supported rapid data transformations in data joining, column manipulation and data pruning. Dataiku’s ability to combine pre-packaged analyses with our own customized scripts gave us the significant flexibility we required to accomplish all of our data transformation needs.
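A customized script step of the kind described above might look like the following pandas sketch, covering column renaming, pruning, and a derived feature; the column names and values are hypothetical.

```python
import pandas as pd

# Illustrative custom transformation step: normalize column names,
# prune unused columns, and derive a feature.
raw = pd.DataFrame({
    "Task ID": [101, 102, 103],
    "Est Hours": [10, 4, 16],
    "Actual Hours": [12, 4, 9],
    "Legacy Field": ["x", "y", "z"],
})

df = (
    raw.rename(columns={"Task ID": "task_id",
                        "Est Hours": "estimated_hours",
                        "Actual Hours": "actual_hours"})
       .drop(columns=["Legacy Field"])           # prune an unused column
       .assign(effort_error=lambda d:            # derive a new feature
               d["actual_hours"] - d["estimated_hours"])
)
print(df)
```

In DSS, steps like these can live either in a visual prepare recipe or in a Python recipe within the same flow.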
4. Data visualization and analysis
Dataiku’s rapid visualization of the raw and processed data was invaluable in allowing us to gain a quick understanding of the data distributions and data integrity. Dataiku greatly facilitates identification of missing data, invalid data, and outliers, allowing us to have confidence in the data we are processing. Dataiku’s built-in graphics were intuitive, allowing us to quickly look at the composition of the data and the relationships between datasets and enabling us to gain rapid understanding of the value within the data.
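The kinds of integrity checks described above, missing-value counts and outlier flagging, can be reproduced in a few lines of pandas. The data below is synthetic and the interquartile-range rule is one common outlier heuristic, not necessarily the one DSS applies internally.

```python
import pandas as pd

# Synthetic effort data with one missing value and one gross outlier.
df = pd.DataFrame({"hours": [8.0, 7.5, None, 9.0, 250.0, 8.5]})

missing = df["hours"].isna().sum()

# Flag outliers with the interquartile-range rule.
q1, q3 = df["hours"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["hours"] < q1 - 1.5 * iqr) | (df["hours"] > q3 + 1.5 * iqr)]

print(missing, len(outliers))
```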
5. Auto-ML Machine Learning
We deployed Dataiku’s Auto-ML approaches to verify performance of our candidate machine learning classification and predictive models, as well as identify additional candidate models that we should consider. Dataiku’s metrics evaluation interfaces allowed us to quickly look at performance trade-offs using multiple industry-standard metrics, and to identify overfitting conditions when training a model. DSS’s model Evaluation Recipe allows us to ascertain performance on a given test set.
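An open-source analogue of this screening process is sketched below with scikit-learn: compare candidate classifiers on the same folds with an industry-standard metric, and watch the train/test gap as an overfitting signal. The dataset is synthetic and the two candidates are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for historical project-outcome data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(random_state=0))]:
    scores = cross_validate(model, X, y, cv=5, scoring="roc_auc",
                            return_train_score=True)
    # A large train/test gap suggests overfitting.
    gap = scores["train_score"].mean() - scores["test_score"].mean()
    print(f"{name}: AUC={scores['test_score'].mean():.3f} gap={gap:.3f}")
```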
6. Web-based deployment
We took advantage of Dataiku’s ability to integrate web-based applications into the workflow. We were pleased that Dataiku supported current leading-edge web-based deployment technologies, thus allowing us to maintain our entire deployment implementation within the DSS workflow and to host it from Dataiku’s web application services.
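As a rough sketch of the backend side of such a web application, the minimal Flask endpoint below serves dashboard metrics as JSON; the route, payload, and values are hypothetical, and in DSS the backend would read from a managed dataset or saved model rather than a hard-coded dictionary.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder metrics; a real backend would compute these from live data.
METRICS = {"on_time_rate": 0.87, "open_tasks": 42}

@app.route("/api/metrics")
def metrics():
    return jsonify(METRICS)

# Exercise the endpoint with Flask's built-in test client.
client = app.test_client()
print(client.get("/api/metrics").get_json())
```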
7. Amazon Elastic Container Service for Kubernetes (EKS) architecture
We instantiated Dataiku’s EKS capabilities, which allow us to integrate with AWS security and to scale our future development efforts.
Dataiku had a great impact on numerous aspects of this project throughout the entire pipeline; the most important are highlighted below.
1. Deployment efficiency
Significant time-saving was achieved in the combination, manipulation, and storage of data. We were able to implement the data processing pipeline in days, as opposed to months.
2. Ability to focus on our area of expertise in Machine Learning
Not having to invest time in database setup and file system organization allowed us to focus on our core research interests that address our machine learning challenges. By taking advantage of Dataiku’s web deployment capabilities, we saved a significant amount of time by avoiding the need to set up additional web servers. Consequently, our team did not require a web application specialist.
3. More robust organization and maintainability
While this benefit can be overlooked, the impact on an organization can be tremendous. Dataiku provided us with additional version control, a framework for team contribution, and process-step readability and maintainability.
4. Rapid Machine Learning exploration and performance assessment
Dataiku allowed us to search the algorithmic space and assess performance efficiently. We were able to consider additional models we might not have originally considered and to rapidly perform model tradeoffs. It would normally be time-consuming to consider a large number of models, but Dataiku makes this process efficient, enabling us to look at tradeoffs between candidate approaches such as neural network versus decision-tree implementations. The model building process also allows us to fine-tune and compare hyperparameter settings.
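The hyperparameter fine-tuning mentioned above can be sketched with a scikit-learn grid search on synthetic data; the parameter grid and dataset here are illustrative, standing in for the sweeps DSS runs during model building.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, random_state=0)

# Cross-validated sweep over an illustrative hyperparameter grid.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    cv=5, scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```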
5. Excellence in research category only
While predictive approaches such as neural networks and decision trees are often used to model how various data influence a dependent variable, we are interested in more than just the predictive results. One of our key research areas in this project is identifying the key drivers of the dependent variable, which in this case is software project implementation planning and timeliness success. This capability provides us the ability to learn from our data to guide actionable software process improvement.
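One common way to surface key drivers of this kind is permutation importance, sketched below with scikit-learn; the features and labels are synthetic stand-ins, and this is one illustrative technique rather than the project's specific method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for features and a project-success label.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in score:
# larger drops indicate stronger drivers of the outcome.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print(ranking)  # feature indices, most to least influential
```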
The other research aspect of this project is the identification of the best classification and predictive approaches when predicting performance. Dataiku’s AutoML feature has greatly helped us to rapidly identify and assess candidate algorithms, explore hyperparameter settings, and consider additional algorithms we may not have thought of. We are also quickly able to retrain models using different optimization goals. Since we can explore the algorithmic space quickly, we gain confidence that our final model is well suited to our problem set.
6. Alan Turing category only
In addition to the above, our project innovations include combining the web deployment pipeline with the overall data preparation and modeling pipeline. Historically, these project steps are performed by different teams and require web developer support. The combined pipeline approach was made possible by the latest version of Dataiku dashboard capabilities, which include state-of-the-art web development libraries. This end-to-end pipeline capability is visionary and leading-edge, enabling us to deploy the latest models in near real-time to the end users of our dashboard.