FPT Software - Leveraging Dataiku to Solve Natural Language Processing Issues

Name:

Nhan Do Van
Tan Tran Quang Minh
Duy Huynh Le
Khanh Le Hoang 
Nguyen Le Nguyen Cao

Country: Vietnam

Organization: FPT Software

FPT Corporation is a leading global technology and IT services provider headquartered in Vietnam, with nearly US$1.3 billion in revenue and 30,000 employees in 26 countries. As a pioneer in digital transformation, FPT delivers world-class services in Smart Factory, Digital platforms, RPA, AI, IoT, Enterprise Mobility, Cloud, AR/VR, Business Applications, Application Services, BPO, and more. The company has served more than 700 customers worldwide, a hundred of which are Fortune Global 500 companies in industries such as Aerospace & Aviation, Automotive, Banking and Finance, Logistics & Transportation, and Utilities.

Awards Categories:

  • Data Science for Good
  • Responsible AI
  • Value at Scale
  • Most Impactful Transformation Story
  • Most Extraordinary AI Maker(s)
  • Partner Acceleration

 

Business Challenge:

Although FPT Software is a technology company, it has always concentrated on solving problems with software and applications. Consequently, we had done little work in the field of language processing.

Recently, we began accepting data projects, including some from clients whose data is extensive (more than 50 folders and almost 600 datasets), contains many misspellings, or contains text (descriptions, product information, or instructions) in various languages (English, German, Italian, Portuguese, and even Vietnamese).

To deal with this, we initially needed a team of almost 20 people and divided the work into steps, including:

  • Removing stopwords (the words that are often used but that a search engine has been trained to disregard when indexing entries for searching and when returning them as the result of a search query).
  • Date-time processing.
  • Removing invalid digits and characters from the product-id.
  • Acronym handling.
  • Lemmatization & POS (part-of-speech) tagging.
  • Spell checking.
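
As a rough illustration of what several of these steps involve, here is a minimal Python sketch using only the standard library. It is not the code our team or Dataiku used: the word lists, date formats, and function names are hypothetical and chosen only for the example. Lemmatization, POS tagging, and spell checking are left out because they normally rely on dedicated NLP libraries.

```python
import re
from datetime import datetime

# Hypothetical sample tables; a real project would load much larger lists.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "is", "in"}
ACRONYMS = {"qty": "quantity", "rx": "prescription"}
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")  # assumed input formats


def remove_stopwords(text):
    """Lowercase the text and drop tokens found in STOPWORDS."""
    return " ".join(t for t in text.lower().split() if t not in STOPWORDS)


def expand_acronyms(text):
    """Replace known acronyms with their long form, token by token."""
    return " ".join(ACRONYMS.get(t, t) for t in text.split())


def normalize_date(raw):
    """Try each known format; return an ISO date string, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_product_id(raw):
    """Keep only ASCII letters, digits, and hyphens, then uppercase."""
    return re.sub(r"[^A-Za-z0-9-]", "", raw).upper()
```

Each function handles one step, so the steps can be applied column by column and in any order a given dataset needs.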

After that, the data must be stored, transferred, and sometimes designed into dashboards to provide rough statistics (such as how many prescription medications are sold in this store, which prescription medications or medical supplies are the best selling in the summer, etc.) derived from the textual data we have already processed.
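
To make the kind of "rough statistics" concrete, the following is a small sketch that counts best sellers per season from already cleaned records. The sample records, field layout, and function name are invented for illustration and do not come from any client data.

```python
from collections import Counter

# Hypothetical cleaned records: (product name, season) pairs.
sales = [
    ("paracetamol", "summer"),
    ("bandage", "summer"),
    ("paracetamol", "summer"),
    ("paracetamol", "winter"),
]


def best_sellers(records, season, top_n=1):
    """Count sales per product for one season and return the top entries."""
    counts = Counter(name for name, s in records if s == season)
    return counts.most_common(top_n)
```

A dashboard tile would typically display the result of such an aggregation rather than the raw text rows.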

Our team found these projects difficult and demoralizing. Then we discovered Dataiku as a solution to this issue.

 

Business Solution:

Dataiku proved to be a good fit for this situation.

Firstly, we used Dataiku's Prepare recipes to solve simple NLP (natural language processing) issues with English textual data, date/time processing, etc. Dataiku does not yet support more advanced textual data problems out of the box; it only supports tokenizing for popular languages.

Next, for several other widely used languages, the approach would have been fairly laborious, because the Prepare recipe does not natively process these languages. We finally overcame the language barriers because Dataiku lets you add processors and plugins directly to the recipe (or even create new ones if you have a deeper understanding of Dataiku).

In the Flow of each project, Dataiku provides a "Check consistency" option under the "Flow Actions" button to save, unify, and synchronize the entire process (useful when more than three users work on it simultaneously). Additionally, users can build steps with Dataiku's "Scenario" feature to test or run specific requests and queries (such as computing metrics, updating from Hive tables, propagating and reloading schemas, executing SQL/Python code, setting up/stopping/destroying clusters, using the Deployer, etc.).

Finally, users can use Dataiku's "Dashboard" to present individual analyses and perspectives, making it easier to view the data examined with the features mentioned above. Readers understand the charts better when they can link directly to the data recorded and processed in the Flow.
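
The logic inside such a custom step can be quite small. Below is a hedged sketch of a plain Python function that removes stopwords according to a row's language. The column names, the tiny stopword tables, and the function name are assumptions made for this example, and how the function is wired into a Dataiku processor or plugin is not shown.

```python
# Hypothetical, deliberately tiny stopword tables keyed by language code.
STOPWORDS_BY_LANGUAGE = {
    "en": {"the", "and", "of"},
    "de": {"der", "die", "das", "und"},
    "it": {"il", "la", "e", "di"},
    "pt": {"o", "a", "e", "de"},
    "vi": {"và", "của", "là"},
}


def process_row(row, text_column="description", language_column="language"):
    """Return a copy of the row with stopwords removed for its language.

    Rows whose language has no table are only lowercased and re-joined.
    """
    stopwords = STOPWORDS_BY_LANGUAGE.get(row.get(language_column), set())
    tokens = row[text_column].lower().split()
    cleaned = dict(row)  # leave the caller's row untouched
    cleaned[text_column] = " ".join(t for t in tokens if t not in stopwords)
    return cleaned
```

Whitespace splitting is a simplification: it is too crude for languages such as Vietnamese, where one word often spans several syllables, so a real step would swap in a proper tokenizer per language.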

 

Business Area: Communication/Strategy/Competitive Intelligence

Use Case Stage: In Production

 

Value Generated:

Dataiku has greatly helped us reduce the project's workload. Originally estimated to need 20 members over two months, it now requires only three to four people and six weeks to finish (including five to seven days to train and grasp the basics).
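
A back-of-the-envelope calculation shows the scale of this saving in person-weeks. It assumes that a month counts as roughly four weeks, which is our simplification for the example and not a measured figure.

```python
WEEKS_PER_MONTH = 4  # assumption: one month is counted as about four weeks


def person_weeks(people, weeks):
    """Total effort as head count multiplied by duration in weeks."""
    return people * weeks


before = person_weeks(20, 2 * WEEKS_PER_MONTH)  # 20 people for two months
after_low = person_weeks(3, 6)                  # three people for six weeks
after_high = person_weeks(4, 6)                 # four people for six weeks

# Fraction of the original effort saved, for the larger and smaller team.
reduction_min = 1 - after_high / before
reduction_max = 1 - after_low / before
```

Under that assumption the effort drops from 160 person-weeks to between 18 and 24, a reduction of roughly 85 to 89 percent.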

Additionally, the resources provided during the Dataiku training are quite helpful. They have enabled us to utilize the platform's built-in features directly, create recipes and plugins, and reuse them across other related projects. This is undoubtedly fantastic, and we feel reenergized and confident using them in future projects.

 

Value Brought by Dataiku:

Speed and resource savings. The Dataiku UI allows our Data Analysts to perform data cleaning, transformation, and database administration quickly and efficiently. Instead of employing a mixed team of Data Analysts and Engineers, we saved time and human resources because, after two weeks of Dataiku training, the team could immediately start building what they had in mind.

 

Value Type:

  • Increase revenue
  • Reduce cost
  • Save time
  • Increase trust

Value Range: Hundreds of thousands of $

Publication date:
02-09-2023 08:44 AM
Last update:
09-19-2022 10:29 AM