Write a pandas dataframe to a dataset
I work on Dataiku and I have a Jupyter notebook that works. Now I want to include its code in a Python recipe.
`data_df` is the name of my dataframe and `output_gen_python` is the name of my dataset in Dataiku.
I get this error:
> Job failed: Error in Python process: At line 158: <class 'NameError'>: name 'data_df' is not defined
Here is my code:
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
from datetime import datetime, timedelta

# Read recipe inputs
batches_types_copy = dataiku.Dataset("batches_types_copy")
batches_types_copy_df = batches_types_copy.get_dataframe()
Last_hour_extract = dataiku.Dataset("Last_hour_extract")
last_hour_extract_df = Last_hour_extract.get_dataframe()

class OutputMode(object):
    ...

class IDCalculation_I:
    def _preGenerateID(self,outputMode,data_df):
        ...
    def generateID(self,outputMode,data_df):
        pass

class IDCase1(IDCalculation_I):
    def generateID(self,outputMode,data_df):
        ...
        return data_df

class IDCase2(IDCalculation_I):
    def generateID(self,outputMode,data_df):
        ...
        return data_df

class Fingerprinter(object):
    def __init__(self,outputMode):
        self._outputMode = outputMode
    def _generateID(self,data_df):
        return self._outputMode.getCaseID().generateID(self._outputMode,data_df)
    def run(self,data_df):
        # GenerateID
        data_df = self._generateID(data_df)
        return data_df
    def __str__(self):
        return str(self._outputMode)

outputMode = OutputMode('EEA','06:00:00','08:00:00',pytz.timezone('Europe/Paris'),CONST_MODE_CONT,IDCase1())
fp_calculator = Fingerprinter(outputMode)
output_gen_python_df = data_df # Compute a Pandas dataframe to write into output_gen_python

# Write recipe outputs
output_gen_python = dataiku.Dataset("output_gen_python")
output_gen_python.write_with_schema(output_gen_python_df)
Answers
-
Can we use object-oriented programming with Python classes in a Python recipe, or only functions?
-
Hello @Data_ing_solv,
You can definitely use object-oriented programming in a Python recipe. From the error and your code, I guess the problem you encountered is at this line:
output_gen_python_df = data_df # Compute a Pandas dataframe to write into
"data_df" is not defined here. Maybe you wanted to use one of your input dataframes: "batches_types_copy_df" or "last_hour_extract_df".
Hope this helps
-
Thank you for your answer. In fact, I merge my two dataframes "batches_types_copy_df" and "last_hour_extract_df" to create a new one named "data_df". Can't I do that?
-
Hi,
You can do that of course, but I didn't see any code that does this merge. You can have a look at the pandas documentation on how to merge or concatenate two dataframes: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
Hope this helps
-
I do this in my _preGenerateID function:
data_df = batches_types_df.merge(right=last_hour_extract_df, how='left', on='equipement')
But Dataiku does not recognize my dataframe, unlike the Jupyter notebook.
-
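Without the whole code it's hard for me to help you. From what I understand, you should create the "data_df" variable in the global scope and not in the scope of the method "_preGenerateID". A variable assigned inside a method is local to that method, so the top-level line that reads "data_df" cannot see it.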
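To illustrate the scoping point, here is a minimal sketch and not the original recipe. The two small dataframes stand in for the recipe inputs, every column except `equipement` is made up, and the class `Merger` with its methods `merge_without_return` and `merge_with_return` is hypothetical. It assumes the merge result has to be returned from the method and assigned at the top level before it can be written out.

```python
import pandas as pd

# Stand-ins for the recipe inputs. In the recipe these would come from
# dataiku.Dataset(...).get_dataframe(); the values here are invented.
batches_types_copy_df = pd.DataFrame(
    {"equipement": ["A", "B"], "batch_type": ["x", "y"]}
)
last_hour_extract_df = pd.DataFrame({"equipement": ["A"], "value": [1.5]})


class Merger:  # hypothetical class, for illustration only
    def merge_without_return(self, left_df, right_df):
        # data_df exists only inside this call and is discarded afterwards
        data_df = left_df.merge(right=right_df, how="left", on="equipement")

    def merge_with_return(self, left_df, right_df):
        # Returning the result lets the caller bind it to a top-level name
        return left_df.merge(right=right_df, how="left", on="equipement")


merger = Merger()

# Calling the first method does not create a top-level data_df
merger.merge_without_return(batches_types_copy_df, last_hour_extract_df)
try:
    data_df
    visible_at_top_level = True
except NameError:
    visible_at_top_level = False
print(visible_at_top_level)

# Assigning the returned value defines data_df where the output line needs it
data_df = merger.merge_with_return(batches_types_copy_df, last_hour_extract_df)
output_gen_python_df = data_df
print(output_gen_python_df)
```

In a notebook, a `data_df` created in an earlier cell stays in the kernel's global namespace, which may be why the same code appeared to work there. A recipe runs the script from top to bottom, so the name has to be assigned at the top level before it is used.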
-
I do all my processing in my method. I can try to join my two dataframes first and store the result in an input named "data_df". I think that could solve the problem.
-
Alexandru (Dataiker)
Hi,
The error means that your code is referencing data_df before it is defined.
If you could share the full stack trace, and point out which line of your code is actually line 158, it may help us understand where and why this is happening.
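As a rough sketch of what a full stack trace gives you, the snippet below triggers the same kind of NameError on purpose and captures the trace with the standard `traceback` module. The function `build_output` and the name `merged_df` are hypothetical and only serve the illustration; this is not how the recipe itself reports errors.

```python
import traceback


def build_output():
    # merged_df is deliberately never assigned anywhere in this script,
    # so reading it raises NameError, as in the job log above
    output_df = merged_df
    return output_df


try:
    build_output()
except NameError as exc:
    # Short message, similar to the one shown in the job failure
    error_message = str(exc)
    # Full trace: names the file, line number, and function of each frame
    stack_trace = traceback.format_exc()
    print(error_message)
    print(stack_trace)
```

The last frame of the trace points at the line that reads the undefined name, which is the line to compare against line 158 of the recipe.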
-
Here are the logs, in case they help you understand my problem: