Translating column Russian to English with Textblob using Python Dataiku Recipe

UserBird
Dataiker

Hi There,



 



I am trying to make use of the Textblob package within a Dataiku recipe.



More specifically, I'm trying to create a Python recipe which translates a column "Description" from Russian to English using this package.



I'm basing this on a script which I found in the context of a Kaggle competition:



https://www.kaggle.com/gunnvant/russian-to-english-translate-with-progress-bar



I wanted to see how I could incorporate this into a Dataiku recipe (I took out the references to the progress bar, which I don't need here).



 



--------------------------------------------------------------



My input is "translate_2", which consists of two columns:



-"ID": Integers



-"Description": Russian words with a few missing values



My output is "output"



----------------------------------------------------------------------





 



 



I have reworked the code into the result below to integrate it into Dataiku:



 



# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import sys
import textblob


# Read recipe inputs
train_Raw_filtered = dataiku.Dataset("translate_2")
x = train_Raw_filtered.get_dataframe()


# Takes data frame as input, then searches and fills missing description with недостающий (Russian for "missing")
def desc_missing(x):
    if x['Description'].isnull().sum()>0:
        x['Description'].fillna("недостающий",inplace=True)
        return x
    else:
        return x

x=desc_missing(x)


# Translate
def translate(x):
    try:
        return textblob.TextBlob(x).translate(to="en")
    except:
        return x

x=translate(x)


# Map to new column
def map_translate(x):
    x['en_desc']=x['Description']
    return x

x=map_translate(x)


# Write recipe outputs to dataiku
train_Raw_Translated = dataiku.Dataset("output")
train_Raw_Translated.write_with_schema(x)



 



 



The code runs without error. It does impute the "missing" value, but I cannot get the actual translation written to the Dataiku recipe output. The output just inherits the original values.





 



When I take a look at the logs I find this line which I don't know how to interpret at this point:





 



Bottom line:




  • I would expect the en_desc column to contain the translation, but it does not.

  • Do you have any idea what I'm doing wrong here? I can't figure out what is going wrong.



Any help would be appreciated.



Thanks a million.



 



Kind Regards,



Tim



 

2 Replies
Alex_Combessie
Dataiker Alumni
Hi Tim,

This is a Python question, not linked to Dataiku DSS. Actually, the log is fine, and the way you read and write through the dataiku package is correct.

Then it is a matter of debugging your code.

We advise prototyping in a Jupyter notebook first so you can execute the code block by block interactively. Some advice: prototype on a smaller sample, add print statements, and never use an except clause without reporting the error. Otherwise your code could be wrong and you would not be able to see it.

In particular I would inspect the behaviour of your translate function.
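
As a rough illustration of that advice, here is a minimal sketch of a translate wrapper that logs the exception instead of hiding it. `stub_translate` and `safe_translate` are hypothetical names, and the stub only stands in for a real translation call (it uppercases the text and, like a string-only translator, rejects anything that is not a string). A plain list plays the role of the whole table here.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("translate_debug")


def stub_translate(text):
    """Hypothetical stand-in for a real translation call; accepts only str."""
    if not isinstance(text, str):
        raise TypeError("text must be a string, not %s" % type(text).__name__)
    return text.upper()  # placeholder "translation", not a real one


def safe_translate(value, translate_fn=stub_translate):
    """Translate one value; log the failure instead of silently hiding it."""
    try:
        return translate_fn(value)
    except Exception:
        # logger.exception writes the message plus the full traceback
        logger.exception("translation failed for %r", value)
        return value


# Passing a whole table at once (a list stands in for a DataFrame here)
# fails, and the failure now shows up in the log.
table = ["привет", "мир"]
result_table = safe_translate(table)

# Passing one string at a time works.
result_values = [safe_translate(v) for v in table]
```

With the error logged, it becomes visible when the function receives a whole table rather than a single string, which a bare `except: return x` would hide.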

Cheers,

Alex
Khushi100
Level 1

Hey Tim,

It seems your code handles missing values but encounters issues with the translation part, inheriting original values instead of the translated text.

Have you checked whether the 'textblob' library or the translation function is applied correctly within Dataiku? You could also consider AI translation tools for a more reliable and accurate translation process. Reviewing the logs and exploring alternative translation methods might help resolve the issue.
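
One way to check whether the translation is applied correctly is to call it once per cell of the "Description" column rather than once on the whole DataFrame. The sketch below assumes pandas is available; `stub_translate` and `add_translation` are hypothetical names, and the stub only prefixes the text so the example runs without a translation service. In the recipe, `translate_fn` could wrap `textblob.TextBlob(text).translate(to="en")`, assuming the installed TextBlob version still provides `translate` (recent releases have dropped it, so check the installed version).

```python
import pandas as pd


def stub_translate(text):
    """Hypothetical placeholder; swap in the real translation call."""
    return "[en] " + text


def add_translation(df, translate_fn=stub_translate,
                    source_col="Description", target_col="en_desc"):
    """Return a copy of df with target_col holding per-row translations."""
    out = df.copy()
    # Fill missing descriptions with the placeholder word, then let
    # Series.apply call translate_fn once per cell so it receives a str.
    filled = out[source_col].fillna("недостающий")
    out[target_col] = filled.apply(translate_fn)
    return out


# Small sample mirroring the input described in the question
sample = pd.DataFrame({"ID": [1, 2], "Description": ["привет", None]})
translated = add_translation(sample)
```

Running this on a small sample first makes it easy to compare `Description` and `en_desc` side by side before writing the full output dataset.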

Good luck!
