Hi,
I have a Folder in my pipeline that is connected to an S3 bucket containing millions of files.
I noticed an odd behaviour while running a Python recipe: whenever this folder is an input of the recipe, it enumerates all of its files by default (even if I never create the folder object in my code). Since there are millions of files, the build takes ages and eventually runs out of memory!
Does anyone know how to suppress this default behaviour for S3-backed folders?
Best regards,
Talb27
Operating system used: Windows 10
Hi @Talb27 ,
To avoid enumerating the items in this particular S3 bucket path, open the folder with ignore_flow=True, i.e. dataiku.Folder("folder_name", ignore_flow=True), and don't add the folder as an input of the recipe.
If the folder is a recipe's only output, you would still need to add at least a dummy output dataset/folder.
https://doc.dataiku.com/dss/latest/python-api/managed_folders.html#dataiku.Folder
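A minimal sketch of what this could look like inside the Python recipe, assuming a folder named "s3_folder" and a known object path (both are placeholders for your own names):

import dataiku

# Open the managed folder outside of the Flow so DSS does not
# enumerate its (millions of) files as a recipe input.
folder = dataiku.Folder("s3_folder", ignore_flow=True)

# Access only the specific object you need instead of listing everything.
with folder.get_download_stream("path/to/one/file.csv") as stream:
    contents = stream.read()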
Let me know if that works for you.
Thank you so much @AlexT !
This perfectly solved our problem!
Best regards,
Talb27
Thanks for preventing me from crashing our Dataiku server!