Translating a column from Russian to English with TextBlob in a Dataiku Python recipe
Hi There,
I am trying to use the TextBlob package within a Dataiku recipe.
More specifically, I'm trying to create a Python recipe which translates a column "Description" from Russian to English using this package.
I'm basing this on a script I found here, written for a Kaggle competition:
https://www.kaggle.com/gunnvant/russian-to-english-translate-with-progress-bar
I wanted to see how I could incorporate this into a Dataiku recipe (I took out the references to the progress bar, which I don't need here).
--------------------------------------------------------------
My input is "translate_2", which consists of two columns:
-"ID": Integers
-"Description": Russian words with a few missing values
My output is "output"
----------------------------------------------------------------------
I have reworked the code into the result below to integrate it into Dataiku:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import sys
import textblob
# Read recipe inputs
train_Raw_filtered = dataiku.Dataset("translate_2")
x = train_Raw_filtered.get_dataframe()
#Takes data frame as input, then searches and fills missing description with недостающий (russian for "missing")
def desc_missing(x):
    if x['Description'].isnull().sum()>0:
        x['Description'].fillna("недостающий",inplace=True)
        return x
    else:
        return x
x=desc_missing(x)
#Translate
def translate(x):
    try:
        return textblob.TextBlob(x).translate(to="en")
    except:
        return x
x=translate(x)
#Map to new column
def map_translate(x):
    x['en_desc']=x['Description']
    return x
x=map_translate(x)
# Write recipe outputs to dataiku
train_Raw_Translated = dataiku.Dataset("output")
train_Raw_Translated.write_with_schema(x)
The code runs without error. It does impute the "missing" value, but the actual translation never makes it into the Dataiku recipe output. The en_desc column just inherits the original values:
When I take a look at the logs I find this line which I don't know how to interpret at this point:
Bottom line:
- I would expect the en_desc to contain the translation but it does not.
- Do you have any idea what I'm doing wrong here? I can't figure it out myself.
Any help would be appreciated.
Thanks a million.
Kind Regards,
Tim
Answers
-
Hi Tim,
This is a Python question rather than a Dataiku DSS one. Actually, the log is fine, and the way you read and write through the dataiku package is correct.
Then it is a matter of debugging your code.
We advise prototyping in a Jupyter notebook first so you can execute the code block by block interactively. Some advice: prototype on a smaller sample, add print statements, and never use an except clause that hides the error. Otherwise your code could be wrong and you would have no way of seeing it.
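As a minimal sketch of that last point: the snippet below catches the exception and logs it before falling back to the input. Note that fake_translate is a hypothetical stand-in for the real translation call (it is not part of TextBlob) and only assumes that the real call expects a single string.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("translate_debug")


def fake_translate(text):
    """Hypothetical stand-in for a translation call that only accepts a str."""
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    return text.upper()  # placeholder "translation"


def translate_verbose(value):
    """Same shape as translate() in the question, but the failure is reported."""
    try:
        return fake_translate(value)
    except Exception as exc:
        # Log the error instead of silently swallowing it.
        logger.warning("translation failed for %s: %r", type(value).__name__, exc)
        return value


sample = pd.DataFrame({"ID": [1, 2], "Description": ["привет", "мир"]})

# Passing the whole DataFrame, as the question does, now leaves a warning
# in the log and returns the input unchanged.
result = translate_verbose(sample)
```

With a bare except the same call returns the DataFrame untouched and leaves no trace, which would match the symptom described in the question.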
In particular I would inspect the behaviour of your translate function.
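A sketch of one likely fix, under some assumptions: translate receives the whole DataFrame rather than one string at a time, and map_translate only copies "Description" into "en_desc". Applying the translation cell by cell and assigning the result to the new column addresses both. Here translate_text and FAKE_DICTIONARY are hypothetical placeholders so the example runs offline. With a TextBlob version that still provides translate, the body of translate_text would be the call from the question wrapped in str(); I believe recent TextBlob releases have removed that method, so check the installed version.

```python
import pandas as pd

# Hypothetical lookup table standing in for a real translation service.
FAKE_DICTIONARY = {"привет": "hello", "мир": "world", "недостающий": "missing"}


def translate_text(text):
    """Translate ONE string; unknown words are returned unchanged.

    Assumption: with TextBlob this would be something like
    str(textblob.TextBlob(text).translate(to="en")), version permitting.
    """
    return FAKE_DICTIONARY.get(text, text)


def add_translation(df, source_col="Description", target_col="en_desc"):
    """Fill missing descriptions, then translate each cell into a new column."""
    out = df.copy()
    out[source_col] = out[source_col].fillna("недостающий")
    # apply() calls translate_text once per cell, so it always receives a str.
    out[target_col] = out[source_col].apply(translate_text)
    return out


df = pd.DataFrame({"ID": [1, 2, 3], "Description": ["привет", None, "мир"]})
translated = add_translation(df)
print(translated)
```

In the recipe, df would come from get_dataframe() and translated would go to write_with_schema(), exactly as in the question.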
Cheers,
Alex
-
Hey Tim,
It seems your code handles missing values but encounters issues with the translation part, inheriting original values instead of the translated text.
Have you checked if the 'textblob' library or the translation function is applied correctly within Dataiku? Consider using AI translation tools like this for a more reliable and accurate translation process. Reviewing logs and exploring alternative translation methods might help resolve the issue.
Good luck!