
Dataiku to store file into custom S3 folder

shuvankarm
Level 1

Hi Team,

I have been trying to store our data under an extended path of an S3 connection present in Dataiku.

Say the connection that was created points to bucket_1, and my project name is dummy_1_project.

Hence, whenever we create a recipe that stores the file through Dataiku's folder creation in S3, by default it stores into:

bucket_1/dummy_1_project/<folder_id_autogenerated_by dataiku>/

But I want my file to be stored at bucket_1/dummy_1_project/current_data/.

Is there any way we can store it in a custom location, without the autogenerated folder being created?

 

Regards,

Shuvankar Mondal

EliasH
Dataiker

Hi @shuvankarm ,

By going to Settings > Connection of your output dataset, you can modify the "Path in bucket" to relocate your file. Please note that changing the path could lead to overlapping datasets.

[Screenshot: dataset Settings > Connection, showing the "Path in bucket" field]

DSS defines how managed datasets and folders are located and mapped to paths based on the "Naming rules for new datasets/folders" section of your S3 connection. These settings are only applied when a new managed dataset or folder is created, and can later be modified in the settings of the dataset. Further information can be found here: https://doc.dataiku.com/dss/9.0/connecting/relocation.html
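To make the naming rules concrete, here is a small illustration (plain Python, not DSS code) of how a rule like ${projectKey}/${id} expands into a path inside the bucket; the helper function and example values are invented for illustration only.

```python
def resolve_naming_rule(rule: str, project_key: str, object_id: str) -> str:
    """Expand the ${projectKey} and ${id} placeholders of a naming rule.

    Mirrors the placeholder names used in DSS naming rules, but this
    helper is not part of any Dataiku API.
    """
    return rule.replace("${projectKey}", project_key).replace("${id}", object_id)

# The default rule produces the autogenerated layout described above:
path = resolve_naming_rule("${projectKey}/${id}", "DUMMY_1_PROJECT", "ascd1234")
# path == "DUMMY_1_PROJECT/ascd1234"
```

Changing the rule (or the dataset's own "Path in bucket") is what moves the data to a location like dummy_1_project/current_data instead.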

Best,

Elias

shuvankarm
Level 1
Author

Thank you Elias.

Yes, I am aware of this setting, where I can specify the desired path; that is how we have been doing it. I was wondering whether we could achieve the same in code, without modifying the folder settings. I tried passing a different folder id, like

path = dataiku.folder("current_data")

instead of

path = dataiku.folder("ascd1234")

But this raises an error saying "current_data" cannot be identified.

All I want is to avoid going into the folder settings and changing them manually; I would like to achieve the same through code.

EliasH
Dataiker

Hi @shuvankarm ,

What you need to do is use the Python API for datasets rather than for managed folders; those are two completely different things.

import dataiku

# Get a handle on the public API client and your project
client = dataiku.api_client()
project = client.get_project('YOUR_PROJECT_KEY')

# Load the dataset's settings, edit the raw "path" parameter, and save
dataset = project.get_dataset('NAME_OF_DATASET')
settings = dataset.get_settings()
raw_settings = settings.get_raw()
raw_settings['params']['path'] = '/YOUR/DESIRED/PATH'
settings.save()

 

Please note that even though you are not changing the settings of the dataset through the UI, you are still changing the settings of the dataset through the API.

A full list of the Python APIs can be found here: https://doc.dataiku.com/dss/latest/python-api/index.html 
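For reuse, the snippet above can be wrapped in a small helper. The dataiku calls are the same as in the reply; only the wrapper function (and its name) is new, and this sketch has not been tested against a live DSS instance.

```python
def relocate_dataset(project, dataset_name, new_path):
    """Point a managed dataset at a new path inside its connection.

    `project` is a DSSProject handle from dataiku.api_client();
    this mirrors the get_settings() / get_raw() / save() calls above.
    get_raw() returns a plain dict, so mutating it in place and then
    calling save() pushes the change back to DSS.
    """
    settings = project.get_dataset(dataset_name).get_settings()
    settings.get_raw()['params']['path'] = new_path
    settings.save()
```

You would call it as, e.g., relocate_dataset(project, 'output_abc', '/dummy_1_project/current_data') before building the dataset.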

shuvankarm
Level 1
Author

Thanks for the info. I was wondering where I should put this.

This is what I tried.

1. Creating a Python recipe whose source is input_abc, one of the datasets created earlier. For this recipe I provided the output dataset name output_abc.

2. At the beginning, after the default imports, I put the code you mentioned, with my own path, project key, and dataset. The dataset name I am providing is the output dataset of the Python recipe, and the path I am mentioning is similar to '/${PROJECT_KEY}/current_data'.

The thing is, on the first run it creates the dataset in the output_abc folder, but on the second run it creates it in the current_data folder.
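One possible reading of this one-build lag (an assumption about ordering, not something confirmed in this thread): if DSS resolves the output location before the recipe body runs, then a path edit made inside the recipe can only take effect on the next build. A toy sketch of that ordering, with all names invented:

```python
# Simulated dataset settings; nothing here touches real DSS objects.
settings = {"path": "output_abc"}

def run_build(new_path):
    """Simulate a build that captures the output path first,
    then lets the recipe body edit the settings mid-run."""
    build_path = settings["path"]   # path captured at build start
    settings["path"] = new_path     # recipe edits settings while running
    return build_path               # where this build actually wrote

first = run_build("current_data")   # still writes to "output_abc"
second = run_build("current_data")  # now writes to "current_data"
```

If that ordering holds, changing the path in a separate step (e.g., before the build is launched) rather than inside the recipe itself would avoid the lag.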

Is this the correct behavior? Did I miss anything, or put something in the wrong place?

And thanks again for the info though.
