
Folder connected to S3 bucket is enumerating all files by default - memory leak

Solved!
Talb27
Level 1

Hi,

I have a folder in my pipeline that is connected to an S3 bucket containing millions of files.

I noticed an odd behaviour while running a Python recipe: whenever this folder is an input of a recipe, all of its files are enumerated by default (even if I never create the folder object in code). Since there are millions of files, the build takes ages before running into a memory leak!

Does anyone know how to suppress this default behaviour for S3 buckets?

[Screenshot attached: read_bucket_ng.png]

Best regards,
Talb27


Operating system used: Windows 10

2 Replies
AlexT
Dataiker

Hi @Talb27 ,

To avoid enumerating the items under this S3 bucket path, pass ignore_flow=True when creating the folder handle:

dataiku.Folder("folder_name", ignore_flow=True)

and do not add the folder as an input of the recipe.

If the folder was the recipe's only input, you will need to add at least a dummy input dataset/folder in its place.
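For illustration, here is a minimal sketch of what the recipe code could look like, assuming you only need to read one known object; the folder name and file path below are placeholders:

import dataiku

# ignore_flow=True opens the folder even though it is not declared as a
# recipe input, so DSS does not enumerate the bucket's contents.
folder = dataiku.Folder("folder_name", ignore_flow=True)

# Stream a single, known object instead of listing millions of files.
with folder.get_download_stream("path/to/one_file.csv") as stream:
    data = stream.read()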

https://doc.dataiku.com/dss/latest/python-api/managed_folders.html#dataiku.Folder

Let me know if that works for you.

Talb27
Level 1
Author

Thank you so much @AlexT!
This perfectly solved our problem 🙂

Best regards,

Talb27
