Tatras Data Services Pvt Ltd. - Creating a Classification Model to Categorize Medical Literature



Rushil Sharma, Junior Data Scientist
Bhaskar Dev Goel, Junior Data Scientist

Country: India

Organization: Tatras Data Services Pvt Ltd.

Dr. Sarabjot Singh and Noah Gresham launched Tatras Data ten years ago. They recognized that AI/ML would force systemic change to existing business models, and that organizations would struggle to find the personnel and implementation skills required to use it. Tatras was launched to provide these AI/ML skills to clients, and our team has delivered over $100M in business value and IP development, with projects executed in almost two dozen industries. We continue to engage in talent development through our cooperation with the Sabudh Foundation, employing a large number of graduates each year.

Awards Categories:

  • Data Science for Good
  • Excellence in Research
  • Partner Acceleration

Business Challenge:

The medical literature has been growing exponentially, and its sheer size has become a barrier for physicians trying to locate and extract clinically useful information. Clinicians need tools that help them monitor and prioritize research in order to understand the clinical implications of pathogenic genetic variants. We therefore built a classification model that identifies which papers in the literature concern the penetrance of a mutation.

Business Solution:

Dataiku is Tatras' go-to platform for any use case that needs advanced visualization and analysis, since it offers efficient preprocessing, data analysis, and quick, simple AutoML.

Starting with data preparation: after the data was imported, it was preprocessed with the Prepare recipe, and the Text Cleaning plugin was then used to clean and lemmatize the text. The plugin also lets us choose the cleaning parameters, such as removing punctuation, stopwords, etc.
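To make the cleaning step concrete, here is a minimal sketch of parameterized text cleaning in plain Python. It is not the Text Cleaning plugin's implementation: the function name `clean_text`, its flags, the tiny stopword list, and the sample sentence are all assumptions made for illustration, and lemmatization is left out because it needs a language model.

```python
import string

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}

# Map every punctuation character to a space so that words stay separated.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, remove_punctuation=True, remove_stopwords=True):
    """Lowercase the text, optionally drop punctuation and stopwords, collapse whitespace."""
    text = text.lower()
    if remove_punctuation:
        text = text.translate(PUNCT_TO_SPACE)
    tokens = text.split()
    if remove_stopwords:
        tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return " ".join(tokens)


# Made-up sample sentence, not taken from the project's corpus.
print(clean_text("The variant is pathogenic, and inherited in the germline."))
```

Exposing each cleaning step as a flag mirrors the idea of choosing which parameters to clean on, so the same function can produce lightly or heavily cleaned text.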

Because we are working with text data, word embeddings are necessary, and Dataiku's macros provide access to many pre-trained word embedding models, including GloVe. Where coding was necessary, a Python recipe was used to apply various vectorizing techniques to the cleaned data and to perform topic modeling using Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA).
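The post does not say which vectorizing techniques were tried. As one common example, the sketch below computes TF-IDF vectors using only the standard library. The function name `tfidf_vectors`, the smoothed IDF formula, and the toy documents are assumptions for illustration, not the project's code.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one TF-IDF vector per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency, so no term gets a zero or infinite weight.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, vectors


# Made-up, already-cleaned documents.
vocab, vecs = tfidf_vectors(["germline variant carrier", "penetrance variant risk"])
print(vocab)
```

Terms that occur in every document (here "variant") get a lower weight than terms that distinguish one document from another.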

This is what the project flow looks like:



Numerous trials were carried out to identify and develop the best method for categorizing these research papers. The first method involved turning the text data into numerical vectors using NLP techniques. In our case, we represented the research papers numerically and used that representation to train an ML model.
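As an illustration of training a model on numeric document vectors, the following sketch fits a small logistic regression by gradient descent in plain Python. In the project the models were trained in Dataiku, so this is only a stand-in: the function names, hyperparameters, and two-dimensional toy vectors are assumptions.

```python
import math


def predict_proba(x, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(vectors, labels, learning_rate=0.5, epochs=500):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(vectors[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            error = predict_proba(x, weights, bias) - y
            for j, value in enumerate(x):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / len(vectors) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(vectors)
    return weights, bias


# Toy 2-D "document vectors": label 1 when the first feature dominates.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, 0, 0]
w, b = train_logistic_regression(X, y)
print(predict_proba([1.0, 0.0], w, b) > 0.5)
```

Any vectorization (TF-IDF, averaged word embeddings, topic proportions) could feed the same training loop, since the model only sees lists of numbers.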


Feature Handling

Both deep learning and classical machine learning algorithms were used as modeling techniques. Features were handled in the Design tab of Dataiku before the data was passed to the model.



Various machine learning models were trained in Dataiku's Lab using the AutoML capability, which lets us choose between ML and DL models as well as between several ML algorithms, from logistic regression to neural networks. For model interpretability, Dataiku offers useful features such as subpopulation analysis and interactive scoring. We were able to understand the model better by visualizing its performance with ROC curves, lift charts, and the confusion matrix.
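For readers unfamiliar with the confusion matrix mentioned above, here is a minimal sketch of how its four counts, and the precision and recall derived from them, can be computed for a binary classifier. Dataiku produces these views itself; the function names and toy labels below are assumptions for illustration.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def precision_recall(counts):
    """Precision and recall from confusion-matrix counts (0.0 when undefined)."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up labels and predictions.
cm = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(cm, precision_recall(cm))
```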

Business Area: Other - Social (Healthcare)

Use Case Stage: Built & Functional

Value Generated:

The number of publications in the medical sciences has been steadily increasing. It is critical for health practitioners to be up to date on innovations in their field of practice since they can have a significant impact on the lives of their patients. Looking for different research papers and publications online can be a time-consuming task, but with the help of Dataiku, we were able to resolve this challenge by making use of machine learning and AI.

A large corpus of research papers focusing on five different gene mutations was used, of which only two were of particular concern to us: Germline and Penetrance. Because a Penetrance mutation can only occur where a Germline mutation is possible, it was reasonable for the final product to focus first on detecting Germline-related research papers and then move on to Penetrance. Dataiku's various built-in recipes helped in cleaning, processing, and visualizing the data so that insights could be drawn from it. For model training, testing, and tuning, we leveraged the Lab functionality in Dataiku, which let us focus on the results rather than investing a large amount of time in building models from scratch.
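One way to read the Germline-then-Penetrance ordering is as a two-stage cascade, where the penetrance check runs only on papers already flagged as germline-related. The post does not confirm that the final product is wired this way, so the sketch below is an assumption. `classify_paper` and the two keyword stubs are hypothetical stand-ins for the trained models.

```python
def classify_paper(text, is_germline, is_penetrance):
    """Run the penetrance check only on papers that pass the germline check."""
    if not is_germline(text):
        return "other"
    return "penetrance" if is_penetrance(text) else "germline"


# Hypothetical keyword rules standing in for the two trained classifiers.
def germline_stub(text):
    return "germline" in text.lower()


def penetrance_stub(text):
    return "penetrance" in text.lower()


# Made-up paper titles.
for title in [
    "Germline variants and penetrance estimates",
    "Germline variants in families",
    "Somatic mutations in tumors",
]:
    print(title, "->", classify_paper(title, germline_stub, penetrance_stub))
```

Passing the two classifiers in as functions keeps the cascade logic independent of how each model is trained or served.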

Building an ML model in the Lab takes just a few seconds, which is an outstanding feature of Dataiku. After comparing the performance of different models, the best model achieved a ROC AUC score of 0.928. During product development, we realized that the product could be generalized so that students and research scholars could also use it for educational purposes, which is easy to achieve with Dataiku.
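To show what comparing models by ROC score involves, the sketch below computes the area under the ROC curve as the fraction of positive/negative pairs that a model ranks correctly, then picks the model with the highest value. The function names, model names, and scores are made up for illustration and are unrelated to the 0.928 result.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def best_model(y_true, scores_by_model):
    """Return the name of the model with the highest ROC AUC, plus all AUC values."""
    aucs = {name: roc_auc(y_true, s) for name, s in scores_by_model.items()}
    return max(aucs, key=aucs.get), aucs


# Hypothetical model names and predicted probabilities.
labels = [1, 0, 1, 0]
candidates = {
    "model_a": [0.9, 0.1, 0.8, 0.3],
    "model_b": [0.6, 0.7, 0.8, 0.2],
}
print(best_model(labels, candidates))
```

The pairwise loop is quadratic in the number of examples, which is fine for a small illustration; larger datasets would normally use a rank-based computation.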


Value Brought by Dataiku:

Dataiku provides a centralized data platform that moves businesses along their data journey from traditional analytics at scale to Enterprise AI, empowering self-service analytics while also enabling the operationalization of machine learning models in digital environments.

Prior to Dataiku, project flow management, progress monitoring, and analysis required third-party software. Dataiku has replaced these tools because it enables teams to collaborate on the same project simultaneously, so that deliverables, tangible or intangible, are produced on schedule and project efficiency improves from the moment the project starts.

Along the way, it is simple to keep everyone informed and organized. We can create and automate complex data pipelines quickly and preprocess the data 10 times faster than before. With the aid of AutoML, we are able to generate and develop numerous models in search of the most effective ones for conducting in-depth statistical analysis.

Last but not least, every action is documented, which makes the work repeatable and governed. Elastic distributed computing guarantees the performance required for model training. The tool's agility, transparency, and ability to visualize project flows make it widely used in the organization.

Value Type:

  • Improve customer/employee satisfaction
  • Reduce cost
  • Save time
  • Increase trust
  • Other - Healthcare

Value Range: Hundreds of thousands of $
