
InfoCepts - An End-to-end Data Workflow to Conduct Clinical Research at Scale

Team members:
Nilesh Lahoti
Anil Kumar M.S.
Mohit A. Jichkar
Ananth Kumar Chamarthi

United States


InfoCepts, a global leader of end-to-end data and analytics, enables customers to become data-driven and stay modern. We bring people, process, and technology together the InfoCepts way to deliver predictable outcomes with guaranteed ROI. Working in partnership with you, we help businesses modernize data platforms, advance data-driven capabilities, build augmented business applications, create data products, and support systems.
Founded in 2004, InfoCepts is headquartered in Tysons Corner, VA, with offices throughout North America, Europe, and Asia. Every day, more than 160,000 users rely on solutions powered by InfoCepts to make smarter decisions and achieve better business outcomes. For more information, visit the InfoCepts website or follow @InfoCepts on Twitter.

Awards Categories:

  • Value at Scale
  • Excellence in Research


The client is a leading pharma company that wanted to analyze the market before deciding which drug research to invest in, with the goals of reducing risk, saving time, and remaining profitable in the near term.

The lack of both qualitative and quantitative data made it difficult for the client to analyze the current market and plan future business. Developing a predictive model or drawing business insights with machine learning algorithms requires real, high-quality data about the health symptoms that users are experiencing.

The purpose of this research project was to collect real-time data from end users, store it, perform analytics, and build a predictive model on top of it. The challenges with the previous approach are summarized below:

  • Collecting data from individuals was difficult because people are reluctant to reveal their identity when disclosing health information, so the personal identity of each user had to be anonymized.
  • Manual data collection was tedious: it involved sending emails and gathering the details in an Excel spreadsheet.
  • There was no central storage mechanism or process to save and update the data regularly.
  • Extensive coding was needed to prepare, clean, and aggregate the collected data before it could be analyzed.
  • Analytics and predictive modeling relied heavily on a third-party application.
  • Purchasing data from third-party sources for analytics was costly.
  • Reports produced by multiple tools had to be integrated to create a single view of insights.
  • The setup relied on a custom-built web graphical user interface, a standalone app running on a server, and separate BI reporting tools.
  • Data integration and pipeline orchestration across multiple technologies and scripts were time-consuming.
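
The anonymization requirement in the first challenge can be illustrated with a keyed hash. The sketch below is only an illustration, not the project's actual implementation: it assumes respondents are identified by an email address and that a secret key is stored outside the dataset. The function names and field names (`email`, `respondent_id`) are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a stable token for a personal identifier using HMAC-SHA256."""
    # Normalize so that " Jane@Example.com " and "jane@example.com" match.
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


def anonymize_response(response: dict, secret_key: bytes) -> dict:
    """Drop the hypothetical 'email' field and add a pseudonymous ID instead."""
    cleaned = {k: v for k, v in response.items() if k != "email"}
    cleaned["respondent_id"] = pseudonymize(response["email"], secret_key)
    return cleaned


if __name__ == "__main__":
    # Assumption: in practice the key would be managed as a secret.
    key = b"example-key-stored-outside-the-dataset"
    raw = {"email": "Jane.Doe@example.com", "age": 42, "symptom": "fatigue"}
    print(anonymize_response(raw, key))
```

Strictly speaking, this is pseudonymization: the same person always maps to the same token, so repeat submissions can be linked, but the token cannot be traced back to the email address without the key.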



To meet these objectives, our team built a web-based survey form to collect data, a storage mechanism for both temporary and permanent data, and a processing engine that runs advanced analytics on existing and newly collected data.

In addition, we built a business intelligence dashboard that presents insights, plots, and analytics back to the end user, along with predictions derived from the user's inputs. The dashboard explains the user's current disease condition and the predicted future condition.

The following steps summarize the activities carried out to solve the business case:

1. Real-time data collection

  • Dataiku's web application capability was used to create a survey form for end users. Our team used the R Shiny templates in Dataiku, which made creating the form simpler and faster.

  • The web app was made public (within the intranet) so that all users in the organization could access it.
  • In addition to user input from the form, data was fetched from internet sources such as Google Trends to augment the data science models. Dataiku time-based scenarios were used to automate the collection of the latest trends.

2. Data preparation

A mix of visual and code-based recipes in Dataiku was used to perform the data cleaning and preprocessing activities.

3. Model development

The following models were developed using Python and RStudio within Dataiku:

  • Disease prediction: A classification model that predicts whether the end user is disease-free or affected by the disease.
  • Survival analysis model: Predicts the age at which the disease condition is expected to develop under different given medical conditions.
  • Sales forecasting: A predictive model that forecasts sales based on user-provided inputs.
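
The write-up does not say which survival method was used. As one common option, the Kaplan-Meier estimator gives the probability of remaining disease-free beyond each age, using both observed onset ages and censored records (people who were still disease-free at their last recorded age). The sketch below is a plain-Python illustration under that assumption; the function names and the sample data are made up.

```python
from collections import Counter


def kaplan_meier(ages, observed):
    """Kaplan-Meier estimate of staying disease-free beyond each onset age.

    ages: age at disease onset, or age at last follow-up if still disease-free
    observed: True if onset was observed, False if the record is censored
    Returns a list of (age, survival_probability) at each observed onset age.
    """
    events = Counter(a for a, o in zip(ages, observed) if o)
    exits = Counter(ages)  # every record leaves the risk set at its age
    at_risk = len(ages)
    survival = 1.0
    curve = []
    for age in sorted(exits):
        d = events.get(age, 0)
        if d:
            # Records censored at this age still count as at risk here.
            survival *= 1 - d / at_risk
            curve.append((age, survival))
        at_risk -= exits[age]
    return curve


def median_onset_age(curve):
    """First age at which the disease-free probability is 0.5 or lower."""
    for age, prob in curve:
        if prob <= 0.5:
            return age
    return None  # median not reached in the observed data


if __name__ == "__main__":
    # Made-up sample: seven respondents, two of them censored.
    ages = [45, 50, 50, 55, 60, 62, 70]
    observed = [True, True, False, True, False, True, True]
    curve = kaplan_meier(ages, observed)
    for age, prob in curve:
        print(f"age {age}: P(disease-free) = {prob:.3f}")
    print("median onset age:", median_onset_age(curve))
```

To compare "different given medical conditions", the same function could be run separately on each subgroup of respondents and the resulting curves plotted side by side.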

4. Automation and end-user reports

  • Real-time predictions and analytics, based on the inputs provided, are presented to the end user via an R Shiny web app:

[Image: Screenshot 2021-07-17 at 13.27.58.png]

  • The output includes the predicted disease condition, a survival analysis graph showing the age at which the disease is expected under different given medical conditions, and a segmentation showing similar medical symptoms across age groups, among other views.
  • Python-based models were invoked from the R Shiny web app using the APIs provided by Dataiku. The entire workflow (screenshot 1.3) was automated using a mix of scenario-based triggers and API-based calls from the web apps (screenshot 1.4):

[Image: Screenshot 2021-07-17 at 13.27.33.png]
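
The write-up does not specify which Dataiku API the web app called. One plausible setup is a model deployed as a prediction endpoint on a Dataiku API node, which accepts a POST request to `/public/api/v1/<service_id>/<endpoint_id>/predict` with a JSON body of the form `{"features": {...}}`. The sketch below builds such a request with the Python standard library; the host, service ID, endpoint ID, and feature names are hypothetical, and the URL pattern should be checked against the Dataiku documentation for the version in use. The project's web app was written in R Shiny, so an equivalent HTTP call would be issued from R there.

```python
import json
import urllib.request


def build_predict_request(base_url, service_id, endpoint_id, features):
    """Build (but do not send) a prediction request for an API-node endpoint."""
    url = (
        f"{base_url.rstrip('/')}/public/api/v1/"
        f"{service_id}/{endpoint_id}/predict"
    )
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(request, timeout=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical host, IDs, and survey fields, for illustration only.
    req = build_predict_request(
        "http://localhost:12000",
        "clinical_research",
        "disease_prediction",
        {"age": 52, "symptom": "fatigue"},
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # result = predict(req)  # requires a reachable API node
```

Keeping request construction separate from sending makes the payload easy to inspect and test without a running API node.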



1. Cost savings

The solution enabled $300k in cost savings through optimized infrastructure, improved process orchestration, and avoiding third-party data purchases.

2. Time savings

The solution reduced effort by 50% compared with the earlier manual process.

3. Bridging the gap between technology and business

Business users were closely involved in the iterative development, review, and continuous research. The visual recipes in Dataiku helped business stakeholders understand the technology and the general challenges in the process. This increased adoption by 2x.

4. Real-time ingestion and analytics

The solution reduced the time spent on collecting and integrating data from end users: as soon as a user fills in the form, the rest of the process, including data preprocessing and analytics, runs automatically within Dataiku.

5. Opportunities for innovation 

Real-time data collection enabled additional avenues to understand the current pharmaceutical market conditions better.

6. Improved decision-making process

Central access across all departments helped users make data-driven decisions based on current market conditions, avoid risks, and be more profitable.


Version history
Publication date:
05-07-2022 11:36 AM
Last update:
07-27-2021 10:18 AM