Creating a folder structure in Managed folder

jaisharma1210 Registered Posts: 2 ✭✭✭

Hi,

I am working on a Python recipe whose output is a managed folder connected to Azure Data Lake Storage. I want to write the recipe's output into this managed folder so that each run stores the output file in a date-based folder structure. For example, if I run it today, the output Parquet file should be stored under 2021>06>29; if I run it tomorrow, the file should be saved under 2021>06>30.

Is there a way in Dataiku to save the output file into a dynamic folder structure like this?

Answers

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    edited July 17

    Hi,

    Typically, for this kind of use case, you should partition the output folder (by day). The Python recipe then gets the "partition to build" (in that case, a day) as a variable that you can use however you see fit in the code. To create a folder structure, you simply pass the subpath inside the folder to the upload_xxx() calls, for example (here with CSV):

    import dataiku
    
    # Read recipe inputs
    kaggle_titanic_train = dataiku.Dataset("kaggle_titanic_train")
    df = kaggle_titanic_train.get_dataframe()
    data = df.to_csv().encode("utf8")
    
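    # Write recipe outputs
    output_folder = dataiku.Folder("H5s2NLcx")
    # Partition being built (a day, since the folder is partitioned by day)
    partition = dataiku.dku_flow_variables["DKU_DST_DATE"]
    # Remove files left over from a previous build of this partition
    output_folder.clear_partition(partition)
    # Subpath of this partition inside the folder
    partition_root_path = output_folder.get_partition_folder(partition)
    output_folder.upload_data(partition_root_path + "/data.csv", data)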

    Note that if you want to write Parquet files to Azure, and the storage account is a StorageV2 account that you can use ABFS on, it's probably simpler to create an Azure dataset in DSS pointing to the desired location and write to that dataset instead of writing to a managed folder.
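
If the folder is not partitioned, the 2021>06>29 layout from the question can also be built by hand and passed as the subpath to upload_data(). This is only a minimal sketch: dated_subpath and write_dated are hypothetical helper names, and folder is assumed to be a dataiku.Folder (only its upload_data(path, data) method is used, as in the example above).

```python
from datetime import date


def dated_subpath(day, filename):
    """Build a subpath such as /2021/06/29/output.parquet (hypothetical helper)."""
    return day.strftime("/%Y/%m/%d/") + filename


def write_dated(folder, payload, filename, day=None):
    """Upload payload (bytes) under a year/month/day subpath of the folder.

    folder is assumed to expose upload_data(path, data), like dataiku.Folder.
    When day is omitted, today's date is used.
    """
    if day is None:
        day = date.today()
    path = dated_subpath(day, filename)
    folder.upload_data(path, payload)
    return path
```

In a recipe this might be called as write_dated(dataiku.Folder("H5s2NLcx"), data, "data.csv"). For Parquet, the bytes would have to be produced first, for instance with pandas' df.to_parquet() called without a path, which assumes a Parquet engine such as pyarrow is installed in the code environment.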

  • jaisharma1210
    jaisharma1210 Registered Posts: 2 ✭✭✭

    Hi

    I tried to test your code, but I am unable to access dku_flow_variables, even though I am not running it inside a notebook; I am building the recipe.

    It gives me this error:

    "Error in Python process: At line 35: <class 'KeyError'>: DKU_DST_DATE"

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker

    dku_flow_variables is indeed only available when the code runs as a recipe. Also, my example relies on the output folder being partitioned, with settings like:

    [Image: Screenshot 2021-06-29 at 11.51.48.png]
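
To avoid the KeyError while the folder is not yet partitioned, the variable can be read defensively. This is a rough sketch, not an official pattern: partition_day and day_subpath are hypothetical helpers, flow_vars is assumed to behave like a dict, and DKU_DST_DATE is assumed to hold a day formatted as YYYY-MM-DD.

```python
from datetime import date, datetime


def partition_day(flow_vars, fallback=None):
    """Return the day to write to.

    Uses DKU_DST_DATE when present (assumed format: YYYY-MM-DD); otherwise
    falls back to the given date, or to today's date if none is given.
    """
    raw = flow_vars.get("DKU_DST_DATE")
    if raw is None:
        return fallback if fallback is not None else date.today()
    return datetime.strptime(raw, "%Y-%m-%d").date()


def day_subpath(day):
    """Turn a date into a year/month/day subpath, e.g. /2021/06/29."""
    return day.strftime("/%Y/%m/%d")
```

In a recipe, the first argument would be dataiku.dku_flow_variables; the resulting subpath can then be prepended to the file name passed to upload_data().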
