Advanced Analytics Office, Logistics COE & Western Digital - Using NLP-driven Email Categorization t



Samruda Pobbathi
Crystal Zhu
Yanhai Yang
Jayakumar Ramakrishna
Sameera Kodagoda
Joseph Hodges
Leslie He

Country: United States

Organization: Advanced Analytics Office, Logistics COE & Western Digital

The Advanced Analytics Office is tasked with accelerating analytics solutions at scale across the enterprise to rapidly capture business value. These solutions target key business metrics such as reducing manufacturing costs, improving capital efficiency, reducing time-to-market for new products, improving operational efficiency, and improving customer experience. The Logistics Center of Excellence (COE) focuses on identifying complex supply chain problems and delivering value and scalable solutions across the supply chain by researching, innovating, prototyping, changing mindsets, and upgrading skills. The COE steps outside its operational role and works on prototyping new scalable solutions to support supply chain priorities and long-term business goals.

Awards Categories:

  • Most Impactful Transformation Story
  • Partner Acceleration

Business Challenge:

Western Digital’s Logistics Control Tower team uses a global PDL email address for both internal and external communications. Senders write to this PDL about shipping reports, shipment location queries, loss and damage, delivery and invoicing issues, and more. Traffic reaches 8,000 to 10,000 emails every week and climbs even higher at quarter-end. This volume has created several issues:

  • Going through all emails is time-consuming, and working efficiency is low without proper email categories in place.
  • Critical emails are often neglected or answered late.
  • Response time has not been quantified, so it is hard to assess the true impact.
  • There are no customer behavior profiles that team members can use to prioritize follow-up actions.

Previously, we tried to analyze those thousands of emails manually, which took two to three employees more than two weeks to sort, categorize, annotate, and evaluate. We urgently needed a solution that could handle such a large and complex data set, sort the emails by topic (category) with high accuracy, and produce results promptly.

Once we understand the email categories and sender profiles, we can identify hot or critical issues faster so that corrective actions can be developed in time, which helps reduce response time and raise customer satisfaction.

Business Solution:

This solution was built as a collaboration between data scientists from Western Digital’s Advanced Analytics and Logistics teams. The project has the following components built on Dataiku:

  • Pipeline for extraction of data.
  • Calculation of metrics such as time, length of the conversation, etc., from the data.
  • Webapp for end users to label the data into 22 categories.
  • Pipeline for data cleansing such as stemming, tokenizing, etc.
  • Classification model.
  • Weekly inference with scenarios.
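To make the data cleansing step more concrete, here is a minimal sketch of tokenizing, stop-word removal, and stemming in plain Python. It is not the Dataiku processors the team used: the stop-word list, the suffix rules, and the function names are all assumptions chosen for illustration, and the stemmer is a naive suffix stripper rather than a full stemming algorithm.

```python
import re

# Illustrative stop-word list (an assumption, far smaller than a real one).
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "and", "or",
              "to", "of", "for", "in", "on", "my", "this"}

# Suffixes tried in order by the naive stemmer (hypothetical rules).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]


print(preprocess("The shipments were delayed and invoices are missing."))
# → ['shipment', 'delay', 'invoic', 'miss']
```

Stems such as "invoic" are not real words, which is fine here: the goal is only to map variants like "invoice" and "invoices" to the same token before vectorization.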

We built the project on a design node with the following benefits from Dataiku:

  1. Effective collaboration across various teams: Annotating large datasets is a highly challenging and time-consuming task and often requires collaboration among multiple subject matter experts. Dataiku’s ML-assisted labeling plugin provided a way to easily accomplish collaboration among multiple external users to label the data.
  2. Easy data clean-up and analysis: The built-in NLP preprocessing functions, such as tokenize text, simplify text, and clear stop words, helped normalize the text data in a few clicks. Plugins such as NER, which come with pre-trained spaCy models, were helpful in extracting insights and understanding the data. These readily available features reduce development time and let data scientists focus on analyzing data.
  3. Reduced model development time: This is accomplished through the MLOps AutoML feature, which helps develop and compare models quickly and easily. Reproducibility of experiments is made possible by model and data versioning, which is highly effective during the model development phase. For our solution, we used TF-IDF vectorization for feature handling and built a logistic regression model to classify the emails into 22 categories. We also took advantage of Dataiku MLOps features, configuring scenarios to run the data extraction and model inference every week. This saved a few weeks’ worth of development time that building an inference pipeline would otherwise have required.
  4. Visualization: Using the dashboard feature, we were able to build charts for the extracted metrics. Plugins such as the Tableau Hyper format, which allows easy export of data to Tableau Server for future scaling, were a bonus.
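The modeling approach named in item 3, TF-IDF features fed into a logistic regression classifier, can be sketched from scratch as follows. This is an illustration under stated assumptions, not the model the team trained in Dataiku: the three category labels, the toy emails, the learning rate, and every function name are hypothetical, and the IDF formula is one common smoothed variant.

```python
import math
from collections import Counter


def fit_tfidf(docs):
    """Learn a vocabulary and smoothed IDF weights from tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vocab = {tok: i for i, tok in enumerate(sorted(doc_freq))}
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    return vocab, idf


def to_tfidf(doc, vocab, idf):
    """Turn one tokenized document into an L2-normalized TF-IDF vector."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(doc).items():
        if tok in vocab:  # tokens unseen at fit time are ignored
            vec[vocab[tok]] = count * idf[tok]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def class_scores(x, weights, biases):
    """Linear score of vector x for every class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Convert scores to probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(X, y, n_classes, lr=0.5, epochs=200):
    """Multinomial logistic regression fitted with per-sample gradient steps."""
    weights = [[0.0] * len(X[0]) for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = softmax(class_scores(x, weights, biases))
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == label else 0.0)
                biases[k] -= lr * grad
                for j, xi in enumerate(x):
                    weights[k][j] -= lr * grad * xi
    return weights, biases


def predict(x, weights, biases):
    """Index of the highest-scoring class."""
    scores = class_scores(x, weights, biases)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy, already-tokenized emails; labels are illustrative, not the real 22.
LABELS = ["shipment_location", "loss_damage", "invoicing"]
train_docs = [
    (["where", "shipment", "location", "tracking"], 0),
    (["shipment", "location", "update", "eta"], 0),
    (["package", "damaged", "lost", "claim"], 1),
    (["damaged", "pallet", "lost", "carton"], 1),
    (["invoice", "amount", "incorrect", "billing"], 2),
    (["billing", "invoice", "missing", "payment"], 2),
]
docs = [doc for doc, _ in train_docs]
y = [label for _, label in train_docs]
vocab, idf = fit_tfidf(docs)
X = [to_tfidf(doc, vocab, idf) for doc in docs]
weights, biases = train(X, y, n_classes=len(LABELS))

new_email = ["invoice", "payment", "overdue"]
predicted = LABELS[predict(to_tfidf(new_email, vocab, idf), weights, biases)]
print(predicted)
```

In a weekly scenario, the fitted vocabulary, IDF weights, and model parameters would be reused as-is on each new batch of emails; only the transform and predict steps run at inference time.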

Business Area: Internal Operations

Use Case Stage: Built & Functional

Value Generated:

With this all-in-one text analysis and data visualization studio, we can sort emails by topic, quantify the average response time for each category, and identify major internal and external service requestors per customer profile. We can also dig into the data at greater granularity, quickly re-label and continuously retrain the model for new categories or updates, and create custom charts and visualizations quickly.
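As a rough illustration of how average response time per category might be quantified, the sketch below aggregates hypothetical email threads. The record fields (`category`, `received`, `first_reply`), the sample timestamps, and the choice to skip unanswered threads are all assumptions, not details from the project.

```python
from collections import defaultdict
from datetime import datetime


def average_response_hours(threads):
    """Mean hours from receipt to first reply, grouped by category.

    Threads without a reply are skipped (an assumption for this sketch).
    """
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum of hours, count]
    for thread in threads:
        if thread["first_reply"] is None:
            continue
        received = datetime.fromisoformat(thread["received"])
        replied = datetime.fromisoformat(thread["first_reply"])
        hours = (replied - received).total_seconds() / 3600
        totals[thread["category"]][0] += hours
        totals[thread["category"]][1] += 1
    return {cat: total / count for cat, (total, count) in totals.items()}


# Hypothetical sample data, invented purely for illustration.
sample_threads = [
    {"category": "invoicing", "received": "2022-06-01T08:00:00",
     "first_reply": "2022-06-01T12:00:00"},
    {"category": "invoicing", "received": "2022-06-02T09:00:00",
     "first_reply": "2022-06-02T19:00:00"},
    {"category": "loss_damage", "received": "2022-06-03T10:00:00",
     "first_reply": "2022-06-04T10:00:00"},
    {"category": "loss_damage", "received": "2022-06-05T10:00:00",
     "first_reply": None},
]

print(average_response_hours(sample_threads))
# → {'invoicing': 7.0, 'loss_damage': 24.0}
```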

As a result, actionable insights can be drawn from the data and team hours saved. Automated email analysis has empowered WD’s Logistics team by:

  • Automatically extracting, analyzing, and labeling 10,000 emails per week.
  • Achieving categorization accuracy of over 80%.
  • Reducing email traffic by 17% through actions learned from data insights.
  • Saving 100 employee hours per month through data-insight actions.
  • Cutting response time by 20+ hours through data-insight actions.

Value Brought by Dataiku:

A couple of new data scientists joined the project along the way. The Dataiku Academy, together with the ease of use of visual flows and components, enabled us to onboard them quickly. Dataiku streamlined the processes to clean and label data and to visualize email categories, all in one place.

Prior to implementing the solution in Dataiku, we used CSV files to label data. For this project, more than 2,000 emails had to be labeled, and labeling in CSV files was tedious and error-prone. The ML-assisted labeling plugin was a very useful collaborative tool: it allowed easy access to and visualization of the data and cut the time needed for labeling by at least half. Furthermore, we were able to consult multiple subject matter experts and label in a collaborative and iterative process.

The Quick lab section was quite handy for feature handling and for comparing various model architectures, which made model development faster. In addition, an important aspect of data science projects is presenting results visually. Dataiku can export the confusion matrix, features used, training information, and so on, making it easy to present the solution.

The dashboards and charts were also quite useful in this regard. Dataiku enabled the data scientists to build solutions quickly and allowed non-technical users to take over and maintain the project without much coding experience, making it possible to hand off the solution to its users. Since NLP use cases are common across the board, re-using the project’s design to build another solution is an efficient way to leverage the existing architecture.

Value Type:

  • Improve customer/employee satisfaction
  • Reduce cost
  • Save time

Value Range: Hundreds of thousands of $
