Issue with Python script appending data in Dataiku project

Hello,
I have an issue with my Dataiku project. I wrote a Python script that appends new data from the input dataset to the output dataset.
I think the problem may be related to recursion in Dataiku. Could you please suggest a solution?
Thank you in advance!
Best Answer
-
Turribeach (Neuron)
So there are two ways of doing what you want. The first is to use the "Append instead of overwrite" checkbox in your recipe. Depending on the recipe type, this option appears in either the Inputs/Outputs tab or the Advanced tab. As you are using a Python recipe, it should appear in the Inputs/Outputs tab.
In your case it is not being shown, either because your recipe uses inputs from different connections or because the connection type doesn't support append-mode inserts. So try using a Sync recipe to move your "Lost_Path_Batiment…" dataset to the same connection as "Listed_Capteurs_Batiment…" and see if that enables the option. If you still don't see the append option, you need to move your datasets to a connection / data technology that supports appending data (most SQL databases do). You could then use another Sync recipe to bring the data back to HDFS.
Do keep in mind that the "Append instead of overwrite" checkbox does not guarantee the output dataset will never be dropped. In previous versions of Dataiku, a schema change would result in append output datasets being dropped and recreated, which means your historical data was lost. Only in the recent v13.1 release does Dataiku default to failing the recipe if the output schema needs to be updated, leaving you with the task of fixing it manually. So even when append output datasets don't get dropped, they make project maintenance harder, as you can't propagate schema changes automatically through the flow on them.
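For reference, here is a minimal sketch of what the Python recipe body can look like once append mode is available and ticked; the dataset names are placeholders, and the appending itself comes from the recipe setting rather than from the code:

# -*- coding: utf-8 -*-
import dataiku

# Read the new rows from the input dataset
# ("input_dataset" is a placeholder name)
input_df = dataiku.Dataset("input_dataset").get_dataframe()

# With "Append instead of overwrite" ticked on the recipe,
# this write adds the rows to the existing output data
# instead of replacing it ("output_dataset" is a placeholder)
output = dataiku.Dataset("output_dataset")
output.write_with_schema(input_df)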
The second way is going to surprise @tgb417. In v13 Dataiku added support for recursive flows, which means it is now possible to have a recipe that reads from and writes to its own output, among other designs. How to build it:
- Create a simple Python recipe that takes an input_dataset and writes it to an output_dataset
- Run the recipe so that the output_dataset is populated with the input_dataset contents
- Now go to the Python recipe Inputs/Outputs tab and add the output dataset as an input
- Modify the recipe code to use the output_dataset as required
Below is a sample Python recipe that reads the output dataset, concatenates it with the input dataset, and writes the result back to the output dataset:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
input_dataset = dataiku.Dataset("input_dataset")
input_dataset_df = input_dataset.get_dataframe()
output_dataset = dataiku.Dataset("output_dataset")
output_dataset_df = output_dataset.get_dataframe()

# Concat both input and output datasets
df_merged = pd.concat([input_dataset_df, output_dataset_df], ignore_index=True, sort=False)

# Write recipe outputs
output_dataset.write_with_schema(df_merged)
This second way is not very efficient, since you end up rewriting all the data in the output dataset every time the recipe runs, rather than just inserting the new rows. However, it is safer than the first option, since it can handle schema changes gracefully. So choose the option that best fits your use case.
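If rewriting the whole output every run becomes too costly, one variation on this recursive pattern is to append only the rows that are not already present. A minimal sketch, assuming your data has a unique key column (hypothetically named "id" here; dataset names are placeholders):

import dataiku
import pandas as pd

# Read both datasets (names are placeholders)
input_df = dataiku.Dataset("input_dataset").get_dataframe()
output_ds = dataiku.Dataset("output_dataset")
output_df = output_ds.get_dataframe()

# Keep only input rows whose key is not already in the output
# ("id" is a hypothetical unique key column)
new_rows = input_df[~input_df["id"].isin(output_df["id"])]

# Append the new rows and rewrite the output
df_merged = pd.concat([output_df, new_rows], ignore_index=True, sort=False)
output_ds.write_with_schema(df_merged)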
Answers
-
tgb417 (Neuron)
@ELACHAR ,
Welcome to the dataiku community. We are glad to have you join us.
It has been my experience that Dataiku does not want a dataset to be both an input and an output of a recipe. The system seems to want you to create a new dataset with the appended data.
It is interesting to me that you can even save what you have shown. So this may be something you can do. I’ve just never done this.