Changing managed folder path dynamically.

raghutej
Level 2

Hi, 

I have a managed folder that points to an S3 bucket.
The path in the bucket is /Images/
Is it possible to change that path dynamically from a Python program before accessing the managed folder?

Example: I would like to change the path in the bucket to /Images/0001/
Is it possible to do that using Python code?

Thanks,
Raghu

5 Replies
Turribeach

What do you mean by "accessing the managed folder"? If you want to read files under a subfolder of a managed folder, the post you created earlier already has an answer on how to do that:

https://community.dataiku.com/t5/Using-Dataiku/Accessing-files-in-a-subfolder-of-a-managed-folder-th...

If not, please explain exactly what you are trying to achieve. Thanks
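
For reference, reading a file from a subfolder does not require changing the folder definition, because paths passed to the folder API are relative to the folder root. Below is a minimal sketch, assuming a dataiku.Folder handle and its get_download_stream(path) method. The helper name read_from_subfolder, the folder name "images" and the file name are hypothetical.

```python
import posixpath


def read_from_subfolder(folder, subfolder, filename):
    """Read one file located under `subfolder` of a managed folder.

    `folder` is expected to behave like dataiku.Folder, i.e. to expose
    get_download_stream(path) returning a readable stream.
    """
    # Paths inside a managed folder are relative to its root and start with "/".
    path = posixpath.join("/", subfolder.strip("/"), filename)
    with folder.get_download_stream(path) as stream:
        return stream.read()


# Usage inside a DSS Python recipe or notebook (names are hypothetical):
#   import dataiku
#   folder = dataiku.Folder("images")
#   data = read_from_subfolder(folder, "0001", "example.jpg")
```

The subfolder is passed as an argument, so it can come from user input without touching the managed folder settings.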

raghutej
Level 2
Author

Thanks for the response.
I am trying to build an image search engine POC in Dataiku at scale.
I know that we need a lot more for a real-time deployment of this model with performance under 10 seconds.

I have a large dataset of images, around 700k.
Image annotation: images are tagged with metadata that can then be used to sort, filter, classify, segment, and group them to narrow down the target search.

-----------
In a simple POC, I placed the images in a folder called images_by_brand_id, with many subfolders named by brand_id.
When I used the code shown in the attached image, it took 25 minutes just to find the names of the matching images, without even reading the image files. I am trying to see if there is a better way to go to a subfolder directly instead of matching every file against a pattern.

I would like to point a managed folder to /images_by_brand_id/0001/ or /images_by_brand_id/1425/ or another subfolder dynamically, based on input.

 

[Attached screenshot: 2023-09-07_12-49-39.jpg]

Thanks,
Raghu


You are tackling the problem from the wrong angle. Dataiku is not the problem; S3 is. You are making things worse by adding the Dataiku API on top of S3 and using a for loop to search for files. This is not an AWS forum, but here are some pointers:

https://www.linkedin.com/pulse/how-scan-millions-files-aws-s3-jishnu-kinwar/

https://alukach.com/posts/tips-for-working-with-a-large-number-of-files-in-s3/

https://stackoverflow.com/questions/3337912/quick-way-to-list-all-files-in-amazon-s3-bucket

Once you find an approach that meets your performance requirements, look at how to implement it in Dataiku. If you need to use the AWS CLI, you can use a Dataiku Shell recipe. Otherwise, you could use the AWS Python library boto3 to interact with the AWS API from Python.
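
As an illustration of the boto3 route, here is a minimal sketch that lists only the keys under one prefix, so S3 filters on the server side instead of the client looping over every file. The bucket name and prefix in the usage comment are hypothetical. boto3.client("s3") is called without explicit keys, so it relies on boto3's default credential lookup (environment variables, shared config files, or an attached IAM role), which is an assumption about how the DSS host is set up.

```python
def list_keys_under_prefix(s3_client, bucket, prefix):
    """Return every object key in `bucket` whose key starts with `prefix`."""
    # list_objects_v2 returns at most 1000 keys per call; the paginator
    # follows the continuation tokens for us.
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from the page when nothing matches.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def make_default_client():
    """Create an S3 client using boto3's default credential lookup."""
    import boto3  # imported lazily so the helper above has no hard dependency
    return boto3.client("s3")


# Usage (bucket and prefix are hypothetical):
#   client = make_default_client()
#   keys = list_keys_under_prefix(client, "my-bucket", "images_by_brand_id/1425/")
```

Note that S3 prefixes do not start with a leading slash, unlike the paths shown in the managed folder settings.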

raghutej
Level 2
Author

Thanks for the input.
Using boto is an option, but we prefer not to place keys explicitly in code to establish the connection:

from boto.s3.connection import S3Connection

conn = S3Connection('access-key','secret-access-key')


See the screenshot below: when I point the Dataiku folder to /Coop_lens/images_by_brand_id/1425/ instead of /Coop_lens/images_by_brand_id/, I get the results in 884 ms instead of 23 min 15 s.

My question was: can I change the path in the bucket dynamically in the managed folder definition using Python code?

[Attached screenshot: 2023-09-11_14-13-13.jpg]


Thanks,
Raghu


As I said, your problem is on S3: different folders give different timings because they have different numbers of children. You can use the get_path_details() API method to get the children of a specific folder path:

https://developer.dataiku.com/latest/api-reference/python/managed-folders.html#dataiku.Folder.get_pa...
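
A minimal sketch of that approach follows, assuming get_path_details(path) returns a dict with a "children" list whose entries carry "fullPath" and "directory" fields. Those field names are taken from my reading of the API reference and should be checked against your DSS version. The helper name list_files_in_subfolder and the folder name are hypothetical.

```python
def list_files_in_subfolder(folder, subpath):
    """Return the full paths of the files directly under `subpath`.

    `folder` is expected to behave like dataiku.Folder, i.e. to expose
    get_path_details(path).
    """
    details = folder.get_path_details(subpath)
    # Assumed response shape: {"children": [{"fullPath": ..., "directory": ...}, ...]}
    children = details.get("children") or []
    return [c["fullPath"] for c in children if not c.get("directory")]


# Usage inside a DSS Python recipe or notebook (names are hypothetical):
#   import dataiku
#   folder = dataiku.Folder("images_by_brand_id")
#   paths = list_files_in_subfolder(folder, "/1425")
```

This asks only for one subfolder's children rather than listing every path in the folder and filtering afterwards.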

 
