Dataiku to store file into custom s3 folder

shuvankarm Dataiku DSS Core Designer, Registered Posts: 6 ✭✭✭

Hi Team,

I have been trying to store our data under a sub-path of an S3 connection defined in Dataiku.

Say the connection points to bucket_1 and my project name is dummy_1_project.

Whenever we create a recipe that stores a file in a Dataiku managed folder on S3, it is stored by default in:

bucket_1/dummy_1_project/<folder_id_autogenerated_by dataiku>/

But I want my file to be stored at bucket_1/dummy_1_project/current_data/.

Is there any way we can store it to some custom place without getting the autogenerated folder created?

Regards,

Shuvankar Mondal

Answers

  • EliasH
    EliasH Dataiker, Registered Posts: 34 Dataiker

    Hi @shuvankarm,

    By going to Settings > Connection of your output dataset, you can modify the "Path in bucket" to relocate your file. Please note that changing the path could lead to overlapping datasets.


    DSS decides where managed datasets and folders are located, and how they map to paths, based on the "Naming rules for new datasets/folders" section of your S3 connection. These rules are applied only when a new managed dataset or folder is created; afterwards, the location can be changed in the settings of the dataset. More information can be found here: https://doc.dataiku.com/dss/9.0/connecting/relocation.html

    Best,

    Elias

  • shuvankarm
    shuvankarm Dataiku DSS Core Designer, Registered Posts: 6 ✭✭✭

    Thank you Elias.

    Yes, I am aware of this setting, where I can specify the desired path, and that is how we have been doing it. I was wondering whether we could do it in code without modifying the folder settings. I tried passing a different folder id, like

    path = dataiku.Folder("current_data")

    instead of path = dataiku.Folder("ascd1234")

    But this gives an error saying that "current_data" is not recognized.

    All I want is to avoid going into the folder and changing the setting; I would like to achieve the same thing through code.

  • EliasH
    EliasH Dataiker, Registered Posts: 34 Dataiker
    edited July 17

    Hi @shuvankarm,

    What you need to do is use the Python API for datasets, not the one for managed folders. The two are completely different.

    import dataiku

    # Connect to the DSS instance and look up the dataset
    client = dataiku.api_client()
    project = client.get_project('YOUR_PROJECT_KEY')
    dataset = project.get_dataset('NAME_OF_DATASET')

    # Edit the storage path in the raw settings, then persist the change
    settings = dataset.get_settings()
    raw_settings = settings.get_raw()
    raw_settings['params']['path'] = '/YOUR/DESIRED/PATH'
    settings.save()

    Please note that even though you are not changing the settings of the dataset through the UI you are still changing the settings of the dataset through the API.

    A full list of the Python APIs can be found here: https://doc.dataiku.com/dss/latest/python-api/index.html
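As a minimal sketch, the snippet above could be wrapped in a small helper that also reports the path being replaced. The function names `set_dataset_path` and `relocate_dataset` are hypothetical and not part of the Dataiku API; only the calls already shown in the snippet above are used against DSS. The dictionary edit is kept separate so it can be tried on a plain dict outside DSS.

```python
def set_dataset_path(raw_settings, new_path):
    """Set params.path in a raw settings dict and return the previous value."""
    params = raw_settings.setdefault("params", {})
    previous = params.get("path")
    params["path"] = new_path
    return previous


def relocate_dataset(project_key, dataset_name, new_path):
    """Apply the change to a real dataset. Needs a DSS environment to run."""
    import dataiku  # imported here so the helper above also works without DSS

    client = dataiku.api_client()
    dataset = client.get_project(project_key).get_dataset(dataset_name)
    settings = dataset.get_settings()
    # get_raw() hands back the dict that save() later persists
    previous = set_dataset_path(settings.get_raw(), new_path)
    settings.save()
    return previous


# Offline check on a plain dict standing in for settings.get_raw().
# The paths are the illustrative ones from the question, not real locations.
example = {"params": {"path": "/dummy_1_project/ascd1234"}}
old = set_dataset_path(example, "/dummy_1_project/current_data")
```

Inside DSS, the call would look like `relocate_dataset('YOUR_PROJECT_KEY', 'NAME_OF_DATASET', '/YOUR/DESIRED/PATH')`. Returning the previous path makes it easy to log what changed, or to revert it.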

  • shuvankarm
    shuvankarm Dataiku DSS Core Designer, Registered Posts: 6 ✭✭✭

    Thanks for the info. I was wondering where I should be putting this.

    This is what I tried.

    1. Creating a Python recipe where the source is input_abc, one of the datasets created earlier. For this recipe I provided the output dataset name as output_abc.

    2. At the beginning, after the default imports, I put the code that you mentioned and changed the path, project key, and dataset name. The dataset name I am providing is the output dataset of the Python recipe. The path I am using is similar to: '/${PROJECT_KEY}/current_data'

    The thing is, on the first run, it creates the dataset in the output_abc folder. But on the second run, it creates the dataset in the current_data folder.

    Is this the correct behavior? Did I miss anything, or put anything in the wrong place?

    And thanks again for the info though.
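One plausible reading of this behavior, not confirmed anywhere in this thread, is that the output location is already resolved when the job starts, so a settings change made inside the recipe only takes effect on the next run. Under that assumption, the change would be applied once, before the build, from somewhere outside the recipe (for example a notebook). The sketch below makes that step safe to repeat by saving only when the path actually differs; `path_needs_update` and `ensure_dataset_path` are hypothetical names, not Dataiku functions.

```python
def path_needs_update(raw_settings, desired_path):
    """Return True when params.path differs from the desired path."""
    return raw_settings.get("params", {}).get("path") != desired_path


def ensure_dataset_path(project_key, dataset_name, desired_path):
    """Save the new path only if it differs. Needs a DSS environment to run."""
    import dataiku  # imported here so the check above also works without DSS

    client = dataiku.api_client()
    dataset = client.get_project(project_key).get_dataset(dataset_name)
    settings = dataset.get_settings()
    raw = settings.get_raw()
    if not path_needs_update(raw, desired_path):
        return False  # already pointing at the desired location
    raw.setdefault("params", {})["path"] = desired_path
    settings.save()
    return True


# Offline check on plain dicts standing in for settings.get_raw()
already_set = path_needs_update({"params": {"path": "/a/current_data"}}, "/a/current_data")
not_yet_set = path_needs_update({"params": {"path": "/a/output_abc"}}, "/a/current_data")
```

With this in place, the recipe itself would no longer need to touch the dataset settings at all.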
