Creating a folder structure in a Managed folder
Hi,
I am working on a Python recipe whose output is a managed folder connected to Azure Data Lake Storage. I want the recipe to write its output into this managed folder so that each run stores the output file in a date-based folder structure. For example, if I run it today, the output Parquet file should be stored under 2021>06>29. Similarly, tomorrow's file should be saved under 2021>06>30.
Is there a way in Dataiku to write the output file to a dynamic folder structure like this?
Answers
-
Hi,
Typically, in this kind of use case you should partition the output folder (by day). The Python recipe then gets the "partition to build" (in this case, a day) as a variable that you can use however you see fit in the code. To create a folder structure, you simply pass the subpath inside the folder to the upload_xxx() calls, for example (here with CSV):
import dataiku

# Read recipe inputs
kaggle_titanic_train = dataiku.Dataset("kaggle_titanic_train")
df = kaggle_titanic_train.get_dataframe()
data = df.to_csv().encode("utf8")

# Write recipe outputs
output_folder = dataiku.Folder("H5s2NLcx")
partition = dataiku.dku_flow_variables["DKU_DST_DATE"]
output_folder.clear_partition(partition)
partition_root_path = output_folder.get_partition_folder(partition)
output_folder.upload_data(partition_root_path + "/data.csv", data)
Note that if you want to write Parquet files to Azure, and it's a StorageV2 account that you can use abfs on, it's probably simpler to create an Azure dataset in DSS pointing to the desired location and write to that dataset, instead of writing to a managed folder.
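If you want to build the year/month/day subpath yourself rather than rely on the folder's partitioning pattern, a minimal sketch could look like the following. It assumes the day partition identifier has the form YYYY-MM-DD (for example "2021-06-29"); date_subpath is a hypothetical helper, not part of the dataiku API, and it falls back to today's date when no partition identifier is given.

```python
from datetime import date, datetime
import posixpath


def date_subpath(filename, partition_id=None):
    """Build a 'YYYY/MM/DD/<filename>' subpath.

    partition_id is assumed to look like '2021-06-29' (an assumption about
    the day partition format); when it is None, today's date is used.
    """
    if partition_id is None:
        day = date.today()
    else:
        day = datetime.strptime(partition_id, "%Y-%m-%d").date()
    # posixpath keeps forward slashes regardless of the OS running the recipe
    return posixpath.join(
        day.strftime("%Y"), day.strftime("%m"), day.strftime("%d"), filename
    )


print(date_subpath("data.parquet", "2021-06-29"))  # prints 2021/06/29/data.parquet
```

The resulting string could then be passed as the path argument of an upload_xxx() call, possibly prefixed with "/" depending on how your folder paths are laid out.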
-
Hi
I tried to test your code, but I am unable to access dku_flow_variables even though I am not running it inside a notebook; I am building the recipe.
It gives me this error:
"Error in Python process: At line 35: <class 'KeyError'>: DKU_DST_DATE"
-
flow_variables is indeed recipe-only. Also, my example uses a partitioned folder, with settings like: