Folder connected to S3 bucket is enumerating all files by default - memory leak

Solved!
Talb27
Level 1

Hi,

I have a folder in my pipeline that is connected to an S3 bucket containing millions of files.

I noticed an odd behaviour while running a Python recipe: whenever this folder is an input of a recipe, DSS enumerates all of its files by default (even if I never create the folder object in code). Since there are millions of files, the build takes ages and eventually runs out of memory!

Does anyone know how to suppress this default behaviour for S3 buckets?

[Screenshot attachment: read_bucket_ng.png]

Best regards,
Talb27


Operating system used: Windows 10


3 Replies
AlexT
Dataiker

Hi @Talb27 ,

To avoid enumerating the items in this particular S3 bucket path, pass ignore_flow=True when creating the folder handle:

dataiku.Folder("folder_name", ignore_flow=True)

and do not add the folder as an input to the recipe.

If this folder is the recipe's only output, you would need to add at least a dummy output dataset/folder.
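
For illustration, here is a minimal sketch of how this might look inside a Python recipe. The folder name and file path are placeholders, and it assumes you already know the path of the object you want, so get_download_stream() can fetch it without listing the bucket:

import dataiku

# Get a handle on the managed folder without declaring it as a recipe
# input, so DSS does not enumerate the millions of objects in the bucket.
# "folder_name" is a placeholder for your folder's name or id.
folder = dataiku.Folder("folder_name", ignore_flow=True)

# Read a single object by its path inside the folder; no listing occurs.
# "path/to/file.csv" is a hypothetical path for illustration.
with folder.get_download_stream("path/to/file.csv") as stream:
    data = stream.read()

One trade-off to keep in mind: since the folder is no longer a flow input, DSS will not track the dependency, so the recipe will not be rebuilt automatically when the bucket contents change.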

https://doc.dataiku.com/dss/latest/python-api/managed_folders.html#dataiku.Folder

Let me know if that works for you.

Talb27
Level 1
Author

Thank you so much @AlexT!
This perfectly solved our problem 🙂

Best regards,

Talb27

tanguy

Thanks for preventing me from crashing our Dataiku server 🙂
