Write a pandas dataframe to a dataset
I work on Dataiku and I have a Jupyter notebook that works. Now I want to include it in a Python recipe.
`data_df` is the name of my dataframe and `output_gen_python` is the name of my dataset in Dataiku.
I get this error:
> Job failed: Error in Python process: At line 158: <class 'NameError'>: name 'data_df' is not defined
Here is my code:
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    from datetime import datetime, timedelta

    # Read recipe inputs
    batches_types_copy = dataiku.Dataset("batches_types_copy")
    batches_types_copy_df = batches_types_copy.get_dataframe()
    Last_hour_extract = dataiku.Dataset("Last_hour_extract")
    last_hour_extract_df = Last_hour_extract.get_dataframe()

    class OutputMode(object):
        ...

    class IDCalculation_I:
        def _preGenerateID(self,outputMode,data_df):
            ...
        def generateID(self,outputMode,data_df):
            pass

    class IDCase1(IDCalculation_I):
        def generateID(self,outputMode,data_df):
            ...
            return data_df

    class IDCase2(IDCalculation_I):
        def generateID(self,outputMode,data_df):
            ...
            return data_df

    class Fingerprinter(object):
        def __init__(self,outputMode):
            self._outputMode = outputMode
        def _generateID(self,data_df):
            return self._outputMode.getCaseID().generateID(self._outputMode,data_df)
        def run(self,data_df):
            # GenerateID
            data_df = self._generateID(data_df)
            return data_df
        def __str__(self):
            return str(self._outputMode)

    outputMode = OutputMode('EEA','06:00:00','08:00:00',pytz.timezone('Europe/Paris'),CONST_MODE_CONT,IDCase1())
    fp_calculator = Fingerprinter(outputMode)

    output_gen_python_df = data_df # Compute a Pandas dataframe to write into output_gen_python

    # Write recipe outputs
    output_gen_python = dataiku.Dataset("output_gen_python")
    output_gen_python.write_with_schema(output_gen_python_df)
Answers
-
Can we use object-oriented programming with Python classes in a Python recipe, or only plain functions?
-
Hello @Data_ing_solv,
You can definitely use object-oriented programming in a Python recipe. From the error and your code, I guess the problem is at this line:
output_gen_python_df = data_df # Compute a Pandas dataframe to write into
"data_df" is not defined here. Maybe you wanted to use one of your input dataframes: "batches_types_copy_df" or "last_hour_extract_df".
Hope this helps
-
Thank you for your answer. In fact, I merge my two dataframes "batches_types_copy_df" and "last_hour_extract_df" to create a new one named "data_df". Can't I do that?
-
Hi,
You can do that of course, but I didn't see any code that does this merge. You can have a look at the Pandas documentation on how to merge or concatenate two dataframes: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
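As a rough illustration of such a merge, here is a minimal sketch on toy data. The column names and values are hypothetical stand-ins, not the real contents of the two input datasets; only `DataFrame.merge` with `how` and `on` is standard pandas.

```python
import pandas as pd

# Toy stand-ins for the two recipe inputs (hypothetical columns and values).
batches_types_copy_df = pd.DataFrame(
    {"equipement": ["A", "B", "C"], "batch_type": ["x", "y", "z"]}
)
last_hour_extract_df = pd.DataFrame(
    {"equipement": ["A", "B"], "value": [1.0, 2.0]}
)

# Left join: keep every row of the left frame and add the matching
# columns from the right frame; unmatched rows get NaN.
data_df = batches_types_copy_df.merge(
    right=last_hour_extract_df, how="left", on="equipement"
)
print(data_df)
```

Because the merge here runs at the top level of the script, `data_df` is a module-level name and is available to any later line.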
Hope this helps
-
I do this in my _preGenerateID function:
data_df = batches_types_df.merge(right=last_hour_extract_df, how='left', on='equipement')
But Dataiku does not recognize my dataframe, unlike the Jupyter notebook.
-
Without the whole code it's hard for me to help you. From what I understand, you should create the "data_df" variable in the global scope and not in the scope of the method "_preGenerateID".
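The scoping point can be sketched as follows. This is only an illustration under assumptions: the function names and toy frames are hypothetical and do not come from the original recipe. A name assigned inside a function or method is local to it, so it has to be returned and assigned at module level before later lines can use it. One possible reason the notebook behaved differently is that a `data_df` left over from an earlier cell was still defined in the kernel, but that is a guess.

```python
import pandas as pd

# Hypothetical toy inputs.
left_df = pd.DataFrame({"equipement": ["A", "B"], "batch_type": ["x", "y"]})
right_df = pd.DataFrame({"equipement": ["A"], "value": [1.0]})

def merge_local_only(left, right):
    # data_df exists only while this function runs; it is discarded on return.
    data_df = left.merge(right=right, how="left", on="equipement")

def merge_and_return(left, right):
    # Returning the result lets the caller bind it to a module-level name.
    return left.merge(right=right, how="left", on="equipement")

merge_local_only(left_df, right_df)
try:
    data_df  # still undefined at module level
    defined_after_local = True
except NameError:
    defined_after_local = False
print("data_df defined after merge_local_only:", defined_after_local)

# Binding the returned frame at module level makes it usable afterwards.
data_df = merge_and_return(left_df, right_df)
print(data_df)
```

In a recipe, the same idea means the line that assigns `output_gen_python_df` should use a value returned from the method rather than a name that only exists inside it.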
-
I do all my processing in my method. I can try joining my two dataframes first and storing the result as an input named "data_df". I think that could resolve the problem.
-
Alexandru (Dataiker)
Hi,
The error means that your code is referencing `data_df` before it is defined.
If you could share the full stack trace, and ideally which line of your code is actually line 158, it may help us understand where and why this is happening.
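For reference, a minimal sketch of how such an error and its stack trace look in plain Python. The function and variable names below are made up for the example; `traceback.format_exc()` is from the standard library and returns the trace, including file and line number, as text.

```python
import traceback

def build_output():
    # Deliberately references a name that was never assigned (hypothetical name).
    return data_df_missing

try:
    build_output()
except NameError:
    # Capture the full trace as a string; it names the function and the
    # line where the undefined name was used.
    trace_text = traceback.format_exc()
    print(trace_text)
```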
-
Here are the logs, in case they help you understand my problem: