UPDATE: A built-in DSS solution is described in an answer below
Hi @Turribeach,
It can be done using a Custom scenario trigger:
In this case, you can control when your scenario is triggered using the python API.
If the directory you're monitoring is a Dataiku managed folder, you can use the managed folder API to list its contents (otherwise just use plain Python and provide a path yourself):
Then, once the list of files is retrieved, you can store it in a project variable (it may also be convenient to store a hash of the concatenated file names rather than the whole list).
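To make this concrete, here is a minimal sketch of such a custom trigger. The managed folder name "incoming" and the project variable "files_hash" are assumptions for illustration; substitute your own names.

```python
# Sketch of a custom Python trigger: hash the folder listing, compare it
# with the value saved in a project variable, and fire on any change.
import hashlib

def files_fingerprint(paths):
    """Hash the sorted, concatenated file paths so only a short digest
    needs to be stored in a project variable (not the whole list)."""
    joined = "\n".join(sorted(paths))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def check_and_fire():
    # These imports only resolve inside DSS, so they are kept local here.
    import dataiku
    from dataiku.scenario import Trigger

    folder = dataiku.Folder("incoming")  # hypothetical managed folder name
    current = files_fingerprint(folder.list_paths_in_partition())

    project = dataiku.api_client().get_default_project()
    variables = project.get_variables()
    previous = variables["standard"].get("files_hash")

    if current != previous:
        # Remember the new state, then trigger the scenario.
        variables["standard"]["files_hash"] = current
        project.set_variables(variables)
        Trigger().fire()
```

Storing only the digest keeps the project variable small regardless of how many files the folder accumulates.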
Thanks Andrey. It's a pity there isn't a built-in way to trigger a scenario when a file arrives; this is a very common use case for data pipelines. The way we want to implement this is to move the incoming file to a processing folder, and then to a processed folder once loaded. This has 3 benefits:
How much overhead does running a trigger very frequently add (say, every 5 seconds)? How does the "Run every" trigger work? Is it a daemon-style process, or does it get kicked off by some internal Dataiku scheduler? The reason I ask is that if it is not a daemon, it might be better for us to run a daemon process on the OS, which would be more efficient than Dataiku starting up a thread to check a directory.
We need to score files within a minute at most, and it takes us around 10 seconds to score each file, so we need to load them pretty quickly to meet our SLA.
Another approach, which I think fits better, is to transfer the file to a GCP bucket (Dataiku is running on a GCP VM). Google Cloud Storage supports running Cloud Functions when new files are added to a bucket, so we could write a small Cloud Function that uses the Dataiku REST API to trigger the scenario in Dataiku. However, our Cloud Engineering team has not yet enabled Cloud Functions in our environment, so this will have to wait.
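For reference, such a Cloud Function could be quite small. This is a hedged sketch, not a tested implementation: `DSS_URL`, `API_KEY`, the project key `MYPROJECT` and the scenario id `score_files` are all placeholder assumptions, and the scenario-run endpoint shown is the standard DSS public REST API route.

```python
# Hypothetical GCS-triggered Cloud Function that starts a DSS scenario
# through the public REST API when a new object is finalized in a bucket.

DSS_URL = "https://dss.example.com"  # placeholder DSS base URL
API_KEY = "your-api-key"             # placeholder DSS API key

def scenario_run_url(base_url, project_key, scenario_id):
    """Build the public-API endpoint that triggers a scenario run."""
    return f"{base_url}/public/api/projects/{project_key}/scenarios/{scenario_id}/run"

def on_new_file(event, context):
    """Background Cloud Function entry point for a GCS 'finalize' event."""
    import requests  # available in the Cloud Functions Python runtime

    print(f"New file: gs://{event['bucket']}/{event['name']}")
    resp = requests.post(
        scenario_run_url(DSS_URL, "MYPROJECT", "score_files"),
        auth=(API_KEY, ""),  # DSS API key as basic-auth user, empty password
    )
    resp.raise_for_status()
```

This pushes the notification to DSS instead of polling, which suits the tight SLA described above.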
Turribeach, I actually completely forgot about an existing trigger functionality that we have. You can choose "On dataset change" option and then add a folder you want to monitor.
The way it works is that it checks the list of files, their sizes and modified dates, and if anything differs from the previous run it triggers the scenario.
DSS runs the monitoring in a separate daemon thread with an infinite loop (as long as the trigger is enabled). On each iteration, this thread checks the managed folder and, if nothing has changed, sleeps for the number of seconds defined in the "Run every" field.
The filesystem enumeration is fast because it doesn't read the contents of the files, just their metadata. Usually the trigger overhead is negligible compared to the scenario execution time.