Creating a folder structure in Managed folder

jaisharma1210 · June 2021

Hi,

I am working on a Python recipe whose output is a managed folder connected to Azure Data Lake Storage. I want to write the recipe's output into this managed folder so that each run stores the output file in a date-wise folder structure. For example, if I run it today, the output parquet file should be stored under 2021>06>29; tomorrow's file should be saved under 2021>06>30.

Is there a way in Dataiku to save the output file in such a dynamic folder structure?

fchataigner2 · June 2021

Hi,

Typically, in this kind of use case you should partition the output folder (by day). The Python recipe then gets the "partition to build" (in this case, a day) as a variable that you can use however you see fit in the code. To create a folder structure, you simply pass the subpath inside the folder to the upload_xxx() calls, for example (here with CSV):

import dataiku

# Read recipe inputs
kaggle_titanic_train = dataiku.Dataset("kaggle_titanic_train")
df = kaggle_titanic_train.get_dataframe()
data = df.to_csv().encode("utf8")

# Write recipe outputs
output_folder = dataiku.Folder("H5s2NLcx")
partition = dataiku.dku_flow_variables["DKU_DST_DATE"]
output_folder.clear_partition(partition)
partition_root_path = output_folder.get_partition_folder(partition)
output_folder.upload_data(partition_root_path + "/data.csv", data)
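If you would rather build the year/month/day subpath yourself, a minimal sketch could look like the following. It assumes the day arrives as a "YYYY-MM-DD" string (the usual form of a day partition identifier) or as a Python date; the helper name dated_subpath is hypothetical and not part of the Dataiku API.

```python
from datetime import date, datetime


def dated_subpath(day, filename):
    """Return a '/YYYY/MM/DD/filename' path inside a managed folder.

    `day` is either a date object or a 'YYYY-MM-DD' string
    (assumed format of the day partition identifier).
    """
    if isinstance(day, str):
        day = datetime.strptime(day, "%Y-%m-%d").date()
    return "/{:04d}/{:02d}/{:02d}/{}".format(day.year, day.month, day.day, filename)


if __name__ == "__main__":
    # Show the path that would be used for a run on 29 June 2021
    print(dated_subpath("2021-06-29", "data.parquet"))
```

The resulting string can be passed as the path argument of output_folder.upload_data(), in place of partition_root_path + "/data.csv". With a partitioned folder, the actual on-disk layout is determined by the folder's partitioning settings, so this helper is mostly useful when you control the path yourself.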

Note that if you want to write parquet files to Azure, and the storage account is a StorageV2 account that you can use abfs on, it is probably simpler to create an Azure dataset in DSS pointing to the desired location and write to that dataset instead of writing to a managed folder.

jaisharma1210 · June 2021

Hi

I tested your code, but I am unable to access dku_flow_variables even though I am not running inside a notebook; I am building the recipe.

It gives me this error:

"Error in Python process: At line 35: <class 'KeyError'>: DKU_DST_DATE"

fchataigner2 · June 2021

dku_flow_variables is indeed recipe-only. And my example does use a partitioned folder, with settings like:

[Screenshot: partitioning settings of the output folder]
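The KeyError above appears when DKU_DST_DATE is not defined, which is what happens if the output folder is not partitioned by day. As a hedged sketch, a recipe could fall back to the current date in that case, which matches the "if I run today" behaviour asked for. The helper name resolve_day is hypothetical, and the sketch assumes dku_flow_variables behaves like a plain dict, as the KeyError suggests.

```python
from datetime import date


def resolve_day(flow_variables, today=None):
    """Return the day to write to, as a 'YYYY-MM-DD' string.

    Uses DKU_DST_DATE when the flow provides it (partitioned output),
    otherwise falls back to today's date (an assumption of this sketch).
    """
    day = flow_variables.get("DKU_DST_DATE")
    if day:
        return day
    return (today or date.today()).isoformat()


if __name__ == "__main__":
    # An empty mapping stands in for a recipe with no day partition defined
    print(resolve_day({}))
```

Inside a recipe this would be called as resolve_day(dataiku.dku_flow_variables). Note that without partitioning, clear_partition() and get_partition_folder() from the earlier example do not apply, so the date would have to be turned into a subpath manually.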
