Folder connected to an S3 bucket enumerates all files by default - memory blow-up
Hi,
I have a Folder in my pipeline that is connected to an S3 bucket containing millions of files.
I noticed an odd behaviour while running a Python recipe: every time this folder is an input to a recipe, it enumerates all of its files by default (even if I never create the Folder object in code). Since there are millions of files, the build takes ages before running out of memory!
Does anyone know how to suppress this default behaviour for S3 buckets?
Best regards,
Talb27
Operating system used: Windows 10
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi @Talb27,
To avoid enumerating the items in this particular S3 bucket path, you can pass ignore_flow=True when opening the folder, i.e. dataiku.Folder("folder_name", ignore_flow=True), and not declare the folder as an input of the recipe.
If this folder was the recipe's only output, you would at least need to add a dummy output dataset/folder.
https://doc.dataiku.com/dss/latest/python-api/managed_folders.html#dataiku.Folder
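For illustration, here is a minimal sketch of what this could look like inside the Python recipe (the folder name and file path are placeholders, not from your project):

import dataiku

# Open the managed folder without declaring it in the recipe's
# inputs/outputs; ignore_flow=True tells DSS not to resolve the folder
# through the flow, so the millions of files are not enumerated.
folder = dataiku.Folder("my_s3_folder", ignore_flow=True)

# Read one specific file directly instead of listing the whole bucket.
# (Hypothetical path, replace with a real key in your bucket.)
with folder.get_download_stream("path/to/one_file.csv") as stream:
    contents = stream.read()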
Let me know if that works for you.
Answers
-
Thank you so much @AlexT!
This perfectly solved our problem.
Best regards,
Talb27
-
Tanguy Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2023 Posts: 119 Neuron
Thanks for preventing me from crashing our Dataiku server!