FPT Software - Leveraging Dataiku to Solve Natural Language Processing Issues
Nhan Do Van Tan Tran Quang Minh Duy Huynh Le Khanh Le Hoang Nguyen Le Nguyen Cao
Organization: FPT Software
FPT Corporation is a leading global technology and IT services provider headquartered in Vietnam, with nearly US$1.3 billion in revenue and 30,000 employees in 26 countries. As a pioneer in digital transformation, FPT delivers world-class services in Smart Factory, Digital platforms, RPA, AI, IoT, Enterprise Mobility, Cloud, AR/VR, Business Applications, Application Services, BPO, and so on. The company has served more than 700 customers worldwide, a hundred of which are Fortune Global 500 companies in the industries of Aerospace & Aviation, Automotive, Banking and Finance, Logistics & Transportation, Utilities, and more.
Data Science for Good
Value at Scale
Most Impactful Transformation Story
Most Extraordinary AI Maker(s)
Although FPT Software is a technology company, it has traditionally focused on solving problems through software and applications. As a result, it had done relatively little work in the field of language processing.
Recently, we began taking on data projects, including some from clients whose data is extensive (more than 50 folders and almost 600 datasets), contains many misspellings, or includes free text (descriptions, product information, or instructions) in several languages (English, German, Italian, Portuguese, and even Vietnamese).
To deal with this, we initially needed a team of almost 20 people, who divided up the steps, including:
Removing stopwords (the words that are often used but that a search engine has been trained to disregard when indexing entries for searching and when returning them as the result of a search query).
Removing invalid digits and characters from the product-id.
Lemmatization and POS (part-of-speech) tagging.
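The first two steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code our team ran: the stopword list is a tiny hypothetical sample, and the rule that a valid product ID contains only ASCII letters and digits is an assumption. Lemmatization and POS tagging are left out because they need a dedicated NLP library.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real project would use a
# fuller list for each language.
ENGLISH_STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "is", "to", "with"}


def remove_stopwords(text, stopwords=ENGLISH_STOPWORDS):
    """Lowercase the text, split it into runs of letters, and drop stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in stopwords]


def clean_product_id(raw_id):
    """Strip everything except ASCII letters and digits, then uppercase.

    Assumption: valid product IDs are purely alphanumeric.
    """
    return re.sub(r"[^A-Za-z0-9]", "", raw_id).upper()
```

For example, `remove_stopwords("The pain relief tablets for adults")` keeps only the content words, and `clean_product_id(" ab-12#3 4x ")` strips the punctuation and spaces from the ID.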
After that, the data must be stored, transferred, and sometimes presented in dashboards to provide rough statistics (such as how many prescription medications are sold in a given store, or which prescription medications or medical supplies sell best in the summer) derived from the textual data we have already processed.
Our team found working on these projects miserable, challenging, and difficult. Then we discovered Dataiku as the solution for this issue.
Dataiku is a perfectly reasonable solution in this situation.
First, we used Dataiku's Prepare recipes to handle simple natural language processing (NLP) tasks on English text, along with date/time processing and similar cleanup. Beyond that, the built-in text support was limited for our needs: it only covered tokenization for popular languages.
Next, for several other widely used languages, the work would have been fairly laborious because the Prepare recipe does not natively process them. We finally overcame the language barrier because Dataiku lets you add processors and plugins directly to the recipe (or even create new ones if you have a deeper understanding of Dataiku).
When working in a project's Flow, Dataiku provides a "Check consistency" option under the "Flow Actions" button to save, unify, and synchronize the entire process (useful when more than three users work on it simultaneously). Additionally, users can build steps with Dataiku's "Scenario" functionality to test or run specific requests and queries (such as computing metrics, updating from Hive tables, propagating and reloading schemas, executing SQL/Python code, establishing/stopping/destroying clusters, using the Deployer, etc.).
Finally, users can use Dataiku's "Dashboard" to present analyses and individual perspectives, making it easier to view the data examined with the features mentioned above. Readers understand the charts better when they can link directly to the data recorded and processed in the Flow.
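As a rough illustration of the kind of custom processing this makes possible, the sketch below cleans a text column according to the row's language. It is a sketch under stated assumptions rather than our production plugin: the `process(row)` shape assumes each row arrives as a dictionary of column names to values, the column names `lang`, `description`, and `description_clean` are hypothetical, and the stopword lists are tiny samples.

```python
import re
import unicodedata

# Hypothetical sample stopword lists keyed by ISO 639-1 language code.
STOPWORDS_BY_LANG = {
    "en": {"the", "and", "of", "for"},
    "de": {"der", "die", "das", "und", "für"},
    "it": {"il", "la", "e", "di", "per"},
    "pt": {"o", "a", "e", "de", "para"},
    "vi": {"và", "của", "là", "cho"},
}


def process(row):
    """Add a cleaned copy of the description, using the row's language.

    Assumes `row` is a dict with hypothetical columns "lang" and "description".
    Unknown languages are tokenized but no stopwords are removed.
    """
    lang = (row.get("lang") or "en").lower()
    stopwords = STOPWORDS_BY_LANG.get(lang, set())
    # Normalize to NFC so accented letters are single code points before tokenizing.
    text = unicodedata.normalize("NFC", row.get("description") or "").lower()
    tokens = re.findall(r"[^\W\d_]+", text)
    row["description_clean"] = " ".join(t for t in tokens if t not in stopwords)
    return row
```

Keeping the language-specific data in one dictionary means that supporting another language is a matter of adding an entry, not changing the function.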
Business Area: Communication/Strategy/Competitive Intelligence
Use Case Stage: In Production
Dataiku has greatly reduced the project's workload. Originally estimated to need 20 people over two months, the project now requires only three to four people and six weeks to finish (including five to seven days to train and grasp the basics).
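To put those figures on a common scale, the effort can be expressed in person-weeks. The short calculation below is a back-of-the-envelope sketch; it assumes a month counts as four working weeks, which is not stated in our original estimate.

```python
WEEKS_PER_MONTH = 4  # assumption: one month is counted as four working weeks


def effort_comparison():
    """Compare the original and revised staffing estimates in person-weeks."""
    before = 20 * 2 * WEEKS_PER_MONTH   # 20 people for two months
    after_low = 3 * 6                   # three people for six weeks
    after_high = 4 * 6                  # four people for six weeks
    return {
        "before": before,
        "after_low": after_low,
        "after_high": after_high,
        # Fraction of the original effort saved, from worst case to best case.
        "saving_min": 1 - after_high / before,
        "saving_max": 1 - after_low / before,
    }


if __name__ == "__main__":
    for name, value in effort_comparison().items():
        print(name, value)
```

Under that assumption, the estimate drops from 160 person-weeks to between 18 and 24, a reduction of roughly 85 to 89 percent.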
Additionally, the resources provided during the Dataiku training are quite helpful. They have enabled us to utilize the platform's built-in features directly, create recipes and plugins, and reuse them across other related projects. This is undoubtedly fantastic, and we feel reenergized and confident using them in future projects.
Value Brought by Dataiku:
Speed and resource savings. The Dataiku UI allows our Data Analysts to perform data cleaning, transformation, and database administration quickly and efficiently. Instead of employing a composite team of Data Analysts and Engineers, we saved time and human resources: after two weeks of Dataiku training, the team could immediately get started on creating their vision.