Trigger on new file addition on S3 path
Hi Team,
Is there any way to define a trigger so that whenever a new file is added to my S3 path, my workflow runs for that particular file only, instead of for all the files in the specified path?
Thanks in Advance
Answers
Alexandru, Dataiker
It would help to know a bit more about your use case here.
At what frequency and how many new files do you expect? Daily? Hourly?
Are files timestamped, or put in a path that could perhaps be used as a partition? Can old files be updated after they were initially created?
There is a scenario trigger for this, the "Dataset modified" trigger.
This trigger performs an S3 enumeration and detects whether anything has changed since the last enumeration, based on a calculated hash. However, the exact names of the files that changed are not available to the scenario, so you would need some logic of your own, for example storing the timestamp of the last processed file in a project variable. Depending on the structure of the files, partitioning could perhaps be used instead, e.g. an hourly partition, building only the last hour each time.
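To make the "last processed timestamp" idea concrete, here is a minimal sketch in plain Python. It is only an illustration, not Dataiku's own API: the function name `select_new_files`, the sample paths and the integer timestamps are all hypothetical. In a real scenario step, the listing would come from enumerating the S3 path, and the watermark would be read from and written back to a project variable.

```python
from typing import Iterable, List, Tuple


def select_new_files(
    files: Iterable[Tuple[str, int]], last_processed: int
) -> Tuple[List[str], int]:
    """Return the paths modified after `last_processed` (oldest first)
    and the updated watermark to store for the next run."""
    # Keep only files strictly newer than the stored watermark.
    newer = sorted(
        (f for f in files if f[1] > last_processed), key=lambda f: f[1]
    )
    paths = [path for path, _ in newer]
    # The watermark never moves backwards, even if nothing is new.
    watermark = max([last_processed] + [ts for _, ts in newer])
    return paths, watermark


# Hypothetical listing of (path, last-modified timestamp) pairs.
listing = [("in/a.csv", 1000), ("in/b.csv", 2000), ("in/c.csv", 3000)]

# Hypothetical watermark, e.g. the value previously saved in a project variable.
new_paths, watermark = select_new_files(listing, last_processed=1500)
print(new_paths, watermark)  # ['in/b.csv', 'in/c.csv'] 3000
```

Note that with this approach an old file that is updated later gets a newer timestamp and is picked up again, while a file whose timestamp exactly equals the stored watermark is skipped. Whether that is acceptable depends on the answers to the questions above.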