I am working on a Python recipe whose output is a managed folder connected to Azure Data Lake Storage. I want to write the recipe's output to this managed folder so that each run stores the output file in a date-wise folder structure. For example, if I run it today, the output Parquet file should be stored under 2021>06>29; if I run it tomorrow, it should be saved under 2021>06>30.
Is there a way in Dataiku to save the output file in a dynamic folder structure like this?
Typically, for this kind of use case you should partition the output folder (by day). The Python recipe then gets the "partition to build" (in this case, a day) as a variable that you can use however you see fit in the code. To create a folder structure, you simply pass the subpath inside the folder to the upload_xxx() calls, for example (here with CSV):
    import dataiku

    # Read recipe inputs
    kaggle_titanic_train = dataiku.Dataset("kaggle_titanic_train")
    df = kaggle_titanic_train.get_dataframe()
    data = df.to_csv().encode("utf8")

    # Write recipe outputs
    output_folder = dataiku.Folder("H5s2NLcx")
    partition = dataiku.dku_flow_variables["DKU_DST_DATE"]
    output_folder.clear_partition(partition)
    partition_root_path = output_folder.get_partition_folder(partition)
    output_folder.upload_data(partition_root_path + "/data.csv", data)
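If you would rather build the year/month/day subpath yourself, for example to get exactly 2021/06/29, here is a minimal sketch. It assumes the day partition identifier (the value of DKU_DST_DATE) is formatted as YYYY-MM-DD. The helper names dated_path and upload_by_date are made up for illustration and are not part of the Dataiku API. The only folder method used is upload_data(path, data), as in the recipe above.

```python
from datetime import datetime


def dated_path(partition_id: str, filename: str) -> str:
    """Turn a day partition id such as '2021-06-29' into '/2021/06/29/<filename>'."""
    # Assumption: the partition identifier is formatted as YYYY-MM-DD.
    # strptime raises ValueError if it is not, which surfaces a wrong format early.
    day = datetime.strptime(partition_id, "%Y-%m-%d")
    return day.strftime("/%Y/%m/%d/") + filename


def upload_by_date(folder, partition_id: str, filename: str, data: bytes) -> str:
    """Upload data under a year/month/day subpath and return the path used.

    `folder` is expected to be a dataiku.Folder (or any object exposing
    upload_data(path, data)).
    """
    path = dated_path(partition_id, filename)
    folder.upload_data(path, data)
    return path
```

In the recipe this could be called as upload_by_date(output_folder, partition, "data.csv", data). If the folder is partitioned, the hand-built path most likely has to match the folder's partitioning pattern for DSS to attach the file to the right partition, which is why get_partition_folder() is usually the safer choice.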
Note that if you want to write Parquet files to Azure, and the target is a StorageV2 account that you can use ABFS on, it is probably simpler to create an Azure dataset in DSS pointing to the desired location and write to that dataset instead of writing to a managed folder.
I tried to test your code, but I am unable to access dku_flow_variables, even though I am not running it in a notebook; I am building the recipe.
It gives me error: