
Dataiku to store file into custom S3 folder

shuvankarm
Level 1

Hi Team,

I have been trying to store our data under an extended path of an S3 connection present in Dataiku.

Say the connection that was created points to bucket_1, and my project name is dummy_1_project.

Hence, whenever we create a recipe that stores the file through Dataiku's folder creation in S3, by default it stores into:

bucket_1/dummy_1_project/<folder_id_autogenerated_by dataiku>/

But I want my file to be stored at bucket_1/dummy_1_project/current_data/.

Is there any way we can store it in a custom location, without the autogenerated folder being created?

 

Regards,

Shuvankar Mondal

EliasH
Dataiker

Hi @shuvankarm ,

By going to Settings > Connection of your output dataset, you can modify the "Path in bucket" to relocate your file. Please note that changing the path could lead to overlapping datasets.

[Screenshot: dataset Settings > Connection, showing the "Path in bucket" field]

DSS defines how managed datasets and folders are located and mapped to paths based on the "Naming rules for new datasets/folders" section of your S3 connection. These settings are only applied when a new managed dataset or folder is created, and can later be modified in the settings of the dataset. Further information can be found here: https://doc.dataiku.com/dss/9.0/connecting/relocation.html
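To make the naming rules concrete, here is a small illustration (plain Python, not DSS code) of how a rule like ${projectKey}/${id} expands into a path inside the bucket; the helper function and example values are invented for illustration only.

```python
def resolve_naming_rule(rule: str, project_key: str, object_id: str) -> str:
    """Expand the ${projectKey} and ${id} placeholders of a naming rule.

    Mirrors the placeholder names used in DSS naming rules, but this
    helper is not part of any Dataiku API.
    """
    return rule.replace("${projectKey}", project_key).replace("${id}", object_id)

# The default rule produces the autogenerated layout described above:
path = resolve_naming_rule("${projectKey}/${id}", "DUMMY_1_PROJECT", "ascd1234")
# path == "DUMMY_1_PROJECT/ascd1234"
```

Changing the rule (or the dataset's own "Path in bucket") is what moves the data to a location like dummy_1_project/current_data instead.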

Best,

Elias

shuvankarm
Level 1
Author

Thank you Elias.

Yes, I am aware of this setting, where I can specify the desired path; that is how we have been doing it. I was wondering whether we could achieve the same in code, without modifying the folder settings. I tried passing a different folder id, like

path = dataiku.folder("current_data")

instead of

path = dataiku.folder("ascd1234")

But this raises an error saying "current_data" cannot be identified.

All I want is to avoid going into the folder settings and changing them manually; I would like to achieve the same through code.

EliasH
Dataiker

Hi @shuvankarm ,

What you need to do is use the Python API for datasets rather than for managed folders; those are two completely different things.

import dataiku

# Get a handle on the public API client and your project
client = dataiku.api_client()
project = client.get_project('YOUR_PROJECT_KEY')

# Load the dataset's settings, edit the raw "path" parameter, and save
dataset = project.get_dataset('NAME_OF_DATASET')
settings = dataset.get_settings()
raw_settings = settings.get_raw()
raw_settings['params']['path'] = '/YOUR/DESIRED/PATH'
settings.save()

 

Please note that even though you are not changing the settings of the dataset through the UI, you are still changing the settings of the dataset through the API.

A full list of the Python APIs can be found here: https://doc.dataiku.com/dss/latest/python-api/index.html 
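For reuse, the snippet above can be wrapped in a small helper. The dataiku calls are the same as in the reply; only the wrapper function (and its name) is new, and this sketch has not been tested against a live DSS instance.

```python
def relocate_dataset(project, dataset_name, new_path):
    """Point a managed dataset at a new path inside its connection.

    `project` is a DSSProject handle from dataiku.api_client();
    this mirrors the get_settings() / get_raw() / save() calls above.
    get_raw() returns a plain dict, so mutating it in place and then
    calling save() pushes the change back to DSS.
    """
    settings = project.get_dataset(dataset_name).get_settings()
    settings.get_raw()['params']['path'] = new_path
    settings.save()
```

You would call it as, e.g., relocate_dataset(project, 'output_abc', '/dummy_1_project/current_data') before building the dataset.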

shuvankarm
Level 1
Author

Thanks for the info. I was wondering where I should put this.

This is what I tried.

1. Creating a Python recipe whose source is input_abc, one of the datasets created earlier. For this recipe I provided the output dataset name output_abc.

2. At the beginning, after the default imports, I put the code you mentioned, with my own path, project key, and dataset. The dataset name I am providing is the output dataset of the Python recipe, and the path I am mentioning is similar to '/${PROJECT_KEY}/current_data'.

The thing is, on the first run it creates the dataset in the output_abc folder, but on the second run it creates it in the current_data folder.
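One possible reading of this one-build lag (an assumption about ordering, not something confirmed in this thread): if DSS resolves the output location before the recipe body runs, then a path edit made inside the recipe can only take effect on the next build. A toy sketch of that ordering, with all names invented:

```python
# Simulated dataset settings; nothing here touches real DSS objects.
settings = {"path": "output_abc"}

def run_build(new_path):
    """Simulate a build that captures the output path first,
    then lets the recipe body edit the settings mid-run."""
    build_path = settings["path"]   # path captured at build start
    settings["path"] = new_path     # recipe edits settings while running
    return build_path               # where this build actually wrote

first = run_build("current_data")   # still writes to "output_abc"
second = run_build("current_data")  # now writes to "current_data"
```

If that ordering holds, changing the path in a separate step (e.g., before the build is launched) rather than inside the recipe itself would avoid the lag.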

Is this the correct behavior? Did I miss anything, or put something in the wrong place?

And thanks again for the info though.
