NYU Student - A New Perspective on the Status of Clinical Trials

Lyndsey · August 2023

Name: Lyndsey Miyakado

Country: United States

Organization: New York University

I discovered Dataiku in my last semester at NYU, supervised by Stavros Zervoudakis. During this time I was an intern at Scientist.com under Sherman Tang. I also worked closely with Eleftheria Pissadaki to gain knowledge of the pharmaceutical industry.

Awards Categories:

Best Positive Impact Use Case

Business Challenge:

As an undergraduate student at NYU, I learned how researchers use machine learning models for drug discovery. As an intern at Scientist.com, I was introduced to how AI-powered tools can connect researchers and suppliers more efficiently.

The most time-consuming aspect of drug development is conducting clinical trials, which can span more than 10 years and cost billions. It could benefit investigators or sponsors to have a tool that estimates how likely a study is to succeed based on past and existing studies. For example, while reviewing past diabetes clinical trials, we examined the enrollment size, phase, title, summary, condition, start month, start year, and sponsor.

A dataset compiled by Aero Data Lab was used in combination with Dataiku to offer a new perspective on how we examine a disease in the clinical research process. If the majority of the clinical trials for a disease did not complete phase 3, it could indicate that study design is not evolving. For decades, pharmaceutical companies have been using the same processes, which contributes to the difficulty of navigating the maze of data in the clinical trial data life cycle. Some sponsors elect not to invest in a type of trial if it is repeatedly terminated.

If a tool were incorporated into the process of selecting services or suppliers, it could help investigators better execute their studies. Kevin Lustig, CEO of Scientist.com, says, “pharmaceutical companies have built out their own massive procurement departments, with internal experts assigned to setting up and managing a trading platform for purchasing products from approved suppliers. As a result, many other scientists and procurement managers are entrenched in the old procurement processes.”

Learning from developers and researchers at Scientist.com, I was able to see that there are many opportunities to use historical data to customize workflows and outsource projects faster. If we were to incorporate clinical trial data into how researchers design experiments, it could change how investigators select a supplier to work with.

Business Solution:

I selected a dataset from Aero Data Lab containing 10 large pharmaceutical companies (AbbVie, Bayer, GSK, Gilead, Johnson and Johnson, Merck, Novartis, Pfizer, Roche, Sanofi) with a total of 13,748 studies from 1984 to 2020. The original 10 clinical trial features listed in the dataset are only a sample of the list of characteristics used to identify a study in databases such as clinicaltrials.gov.

To begin, it was important to find out why studies had low enrollment and were being terminated before phase 3. With Dataiku, I focused on the status of a study. I also used Python and Microsoft Excel to predict whether a study will end early based on its characteristics.

I faced an obstacle when trying to predict a study’s status. The status of a study is dynamic and can take one of nine values within a phase ('not yet recruiting', 'recruiting', 'active not recruiting', 'completed', 'terminated', 'enrolling by invitation', 'suspended', 'withdrawn', 'unknown'). I was able to compare three algorithms to see which model performed best. However, to offer more information or reveal irregularities, the model’s prediction would also have to specify when the status will occur.

To overcome this challenge, I converted the multi-class classification into a binary classification. Instead of predicting which of the nine statuses a study will have, the model predicts whether a clinical trial has stopped early, meaning its status is ‘withdrawn’, ‘suspended’, or ‘terminated’. A study with any other status is labeled ‘on track’.
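The relabeling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the function name `to_binary_status`, the label text 'stopped early', and the sample rows are all hypothetical, and only the three early-stop statuses and the 'on track' label come from the description above.

```python
# Statuses that the write-up treats as a study having stopped early.
STOPPED_EARLY = {"withdrawn", "suspended", "terminated"}


def to_binary_status(status: str) -> str:
    """Collapse one of the nine study statuses into a binary label.

    The label 'stopped early' is an assumed name; 'on track' follows the text.
    """
    normalized = status.strip().lower()
    return "stopped early" if normalized in STOPPED_EARLY else "on track"


# Hypothetical sample rows; these are not taken from the Aero Data Lab dataset.
sample_statuses = ["Completed", "Terminated", "recruiting", "withdrawn", "unknown"]
binary_labels = [to_binary_status(s) for s in sample_statuses]
print(binary_labels)
```

Normalizing case and whitespace before the lookup is a defensive choice in case the status strings are not stored consistently; it may not be needed for a clean export.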

With Dataiku, I generated a new feature to increase the performance of the model. I learned that when faced with an obstacle, I can find a solution by reevaluating the problem from a different perspective and adapting the original problem statement.

Business Area Enhanced: Analytics

Use Case Stage: In Progress

Value Generated:

Deconstructing and exploring the 'status' of a study revealed the need for more data to improve the performance of the machine learning algorithms. One issue was that features could be too similar to one another, causing the models to overfit. The results could be improved by adding features from the clinicaltrials.gov site, such as country or patient data. The 'enrollment' feature could kick-start future machine learning projects to increase efficiency when recruiting participants.

This use case opened my eyes to the opportunities for incorporating machine learning into clinical trials. If an investigator could easily know, before beginning a clinical trial, how likely it is to succeed, it could change the types of diseases an organization chooses to research. If a study's cost or length could be reduced by even the smallest margin, it would change how we produce life-changing vaccines.

Value Brought by Dataiku:

Dataiku helped me produce the analysis within a spring semester at NYU. By creating a workflow of datasets and algorithms, I was able to stay organized when experimenting with different techniques. I discovered the benefits of adding participant data such as age, gender, residence, medical records, feedback, or even social media to help speed up the recruiting process for a clinical trial. I was able to use Dataiku to experiment with sentiment analysis and ultimately find which variables are important to the predictive models.
