Alice Smith, Platform Engineer, with:
AstraZeneca is a global pharmaceutical company with a major UK presence. Our purpose is to push the boundaries of science to deliver life-changing medicines. The best way we can help patients is to be science-led and share this passion with the scientific, healthcare, and business communities of the UK.
AstraZeneca has never attempted to solve the full landscape of data pipeline, machine learning, and data visualization within a single tool due to the inherent complexity required in building and maintaining the broad spectrum of capabilities that would be required.
Due to this, the lifecycle of a project from data wrangling through cleaning, manipulation, data science, visualization, and deployment could see a user working across multiple tools and platforms for each stage of their pipeline.
This process causes a challenge for AstraZeneca, as it increases the following aspects:
Another challenge faced at AstraZeneca lies in the hiring tension for data science roles, and the sudden introduction of roles such as Citizen Data Scientist, ML engineer, and Data Analyst into the scope of the technology sector. The requirement for AstraZeneca to have the best available data scientists working on their problems is paramount due to the complexity of the work that AstraZeneca produces. However, in a competitive landscape where these roles are in high demand, it cannot be assumed that employing a full stack data science team is the optimum solution.
Being a pharmaceutical company, rather than a traditional data science company, can limit the ability to compete for data scientists, and makes it difficult to ensure that data scientists have the required understanding of the pharmaceutical industry. The current scientists and SMEs are integral to the work being completed at AstraZeneca, but they do not have the experience to create or manage machine learning models, even though their current projects could benefit from the introduction of data engineering and data science techniques.
Dataiku has allowed AstraZeneca to provide core data science capabilities to 120+ users and 90+ data science projects within the space of a year. These projects span R&D, Operations, and Commercial teams at a global level. The users range from experienced data scientists and ML engineers, who are utilizing the automation and deployment aspects of Dataiku, to business analysts with no previous experience of data science who, with the aid of Dataiku, have been able to produce their first ML model.
Dataiku has enabled teams to use a centralized platform to perform all stages of their project lifecycle, and has enabled collaboration across teams that was previously not possible. The multi-disciplinary teams at AstraZeneca are now able to work with a single source that meets the needs of all skill sets, enabling large-scale projects to be completed effectively and efficiently.
In the past year, Dataiku has been essential to AstraZeneca, as it has enabled teams to work on business-critical projects that have driven huge value within the business and for the world. These were made possible because the necessary teams were able to start leveraging Dataiku quickly, rather than needing weeks of training as with their usual tools.
One of the key aims with Dataiku is to democratize AI and create a self-service capability that puts the power of AI and analytics into the hands of employees. To enable this, we have created macros that automate all of the steps in the project creation process, including group and connection creation, and we have produced project templates and new data connections for sources that are not currently supported.
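To illustrate the idea of automating project creation, here is a minimal sketch in Python. It is not the macro code used at AstraZeneca. The `provision_project` helper, the `_contributors` naming rule, the project key, and the owner login are all hypothetical. The two calls it makes, `create_group` and `create_project`, are named after the corresponding methods on the public Dataiku Python client (`dataikuapi.DSSClient`), and their arguments should be checked against the client version in use. A stand-in client records the calls so that the sketch runs without a Dataiku server.

```python
from typing import Any, Dict, List, Tuple


def provision_project(client: Any, project_key: str, name: str, owner: str) -> Dict[str, str]:
    """Create a contributor group and a project in a single step.

    `client` is anything exposing create_group and create_project,
    for example a dataikuapi.DSSClient (assumption: verify signatures
    against the installed client before use).
    """
    # Hypothetical naming rule: one contributor group per project.
    group_name = f"{project_key.lower()}_contributors"
    client.create_group(group_name, description=f"Contributors to {name}")
    client.create_project(project_key, name, owner)
    return {"project_key": project_key, "group": group_name}


class RecordingClient:
    """Stand-in client that records calls instead of contacting a server."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, ...]] = []

    def create_group(self, name: str, description: str = "") -> None:
        self.calls.append(("create_group", name))

    def create_project(self, project_key: str, name: str, owner: str) -> None:
        self.calls.append(("create_project", project_key, owner))


# Dry run with hypothetical values; no Dataiku instance is contacted.
dry_run = RecordingClient()
result = provision_project(dry_run, "SUPPLY_FORECAST", "Supply forecast", "asmith")
```

Wrapping the individual calls in one function is what makes the process self-service: a requester supplies a project key, a name, and an owner, and the group and project are created consistently every time.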
Our goal is for SMEs and non-technical users to be able to produce valuable insights from their data, regardless of their technical ability. For instance, Dataiku has allowed one team to quickly stitch together disparate data sets to create a holistic view of the value chain and rapidly develop predictive forecasting capabilities, which consider lead time and yield to better understand our ability to fulfill commitments. Dataiku has enabled unparalleled visibility for this project and all upcoming work.
At AstraZeneca, time to value is a key metric when assessing any project. Due to the nature of drug development, manufacturing, and supply, the speed at which these life-changing drugs can be provided to patients is our most important priority.
Multiple capabilities provided by Dataiku had a direct positive impact:
1. Connection to multiple data sources in a matter of minutes, enabling more insights
Our users were able to set up the necessary connections to complete their projects in the space of minutes. AstraZeneca utilizes data lakes to enable global access to data, allowing a single connection on Dataiku that could provide multiple users with the data they need to access for their work. Combined with the ability to create new default connections, this enabled our teams to quickly and simply cross multiple sources and data stores in a matter of minutes, when previously it would have taken hours, and involved manual data ingestion and manipulation.
2. Central data access for collaboration and scalability, shortening the time-to-value
Previously, multiple teams at AstraZeneca only had the option to access their data from their local machines, whether via a code IDE (Integrated Development Environment) or a business analysis tool such as Excel. This was not a collaborative environment, and the available compute and scalability were very limited. Moving projects onto Dataiku has allowed the teams to utilize the automation nodes, as well as clusters for running large processing jobs.
This helped significantly reduce the time-to-value for data projects, regardless of the size of the data. One project in R&D was able to perform exploratory data analysis on a large dataset in minutes using the available sampling tools and Spark cluster capability. Another project in Commercial was able to reduce its data preparation from one week to two days using the scalability available within Dataiku, as the team could refactor their Python code into blocks of recipes, which will be scalable across different areas.
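The refactoring described above, splitting one long Python script into separate blocks, can be sketched in plain Python. This is an illustration under stated assumptions and not the Commercial team's code: the column names (`region`, `units`, `unit_price`), the three step functions, and the sample rows are invented for the example, and no Dataiku API is used. Each function plays the role a single recipe would play, taking a dataset in and returning a new dataset.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Step = Callable[[List[Row]], List[Row]]


def drop_incomplete(rows: List[Row]) -> List[Row]:
    """Keep only rows in which every value is present."""
    return [r for r in rows if all(v is not None for v in r.values())]


def normalise_region(rows: List[Row]) -> List[Row]:
    """Trim whitespace and upper-case the region code."""
    return [{**r, "region": str(r["region"]).strip().upper()} for r in rows]


def add_revenue(rows: List[Row]) -> List[Row]:
    """Derive revenue from units and unit price."""
    return [{**r, "revenue": r["units"] * r["unit_price"]} for r in rows]


def run_pipeline(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        rows = step(rows)
    return rows


# Each entry corresponds to one independent, reusable block.
PREPARATION_STEPS: List[Step] = [drop_incomplete, normalise_region, add_revenue]

# Hypothetical sample data; the second row is dropped for its missing value.
sample = [
    {"region": " emea ", "units": 10, "unit_price": 2.5},
    {"region": "apac", "units": None, "unit_price": 4.0},
]
prepared = run_pipeline(sample, PREPARATION_STEPS)
```

Because each step depends only on its input rows, a step can be tested, reused in another project, or moved to a larger compute engine without touching the others.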
3. Improved version control and governance to foster innovation
Additionally, Dataiku has enabled improved and quicker version control and governance for all projects on the platform. In our instance, Dataiku is connected to a Bitbucket repository by default, which provides seamless version control for all projects. This allows for quick change management at the click of a button, and it allows users to branch their project and work on the same project simultaneously without any concerns about code or data loss.