
Trigger a scenario when a file arrives


We want to trigger a scoring scenario when a file arrives to a directory in the Dataiku server. Any ideas on how to achieve that? Thanks

3 Replies
Dataiker

UPDATE: A built-in DSS solution is described in an answer below

 

Hi @Turribeach,

It can be done using a custom scenario trigger.

 

In this case, you control when your scenario is triggered using the Python API.

If the directory you're monitoring is a Dataiku managed folder, you can use the following API to list its contents (otherwise, just use plain Python and provide the path yourself):

https://doc.dataiku.com/dss/latest/python-api/managed_folders.html

Then, once the list of files is retrieved, you can store it in a project variable (it may be more convenient to store a hash of the concatenated file names rather than the whole list).
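
As a minimal sketch of the hashing idea, assuming the file names are already available as a list of strings (the helper name `fingerprint_file_names` is hypothetical, not part of the Dataiku API):

```python
import hashlib


def fingerprint_file_names(file_names):
    """Return a stable SHA-256 hex digest for a collection of file names."""
    # Sort so the digest does not depend on listing order, and join with a
    # newline so ["ab", "c"] and ["a", "bc"] produce different digests.
    joined = "\n".join(sorted(file_names))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

The resulting string is short enough to keep in a project variable and compare on the next trigger run.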

Use this to get the current state of the project variables:
dataiku.api_client().get_project(dataiku.default_project_key()).get_variables()
and ... set_variables(...) to overwrite them if the file list differs from the last run.
To trigger the scenario, call:
 
from dataiku.scenario import Trigger
t = Trigger()
t.fire()
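
Putting these pieces together, a custom trigger might look roughly like the sketch below. It assumes the fingerprint of the current file list has already been computed, that project variables are returned as a dict with a "standard" section, and that `get_default_project()` is available in your DSS version. The variable name `incoming_files_hash` and both function names are hypothetical. The Dataiku imports are kept inside the function so the comparison logic can be exercised outside DSS.

```python
def update_if_changed(variables, fingerprint, key="incoming_files_hash"):
    """Store the fingerprint in the variables dict; return True if it changed."""
    standard = variables.setdefault("standard", {})
    if standard.get(key) == fingerprint:
        return False
    standard[key] = fingerprint
    return True


def fire_if_changed(fingerprint):
    """Fire the scenario trigger when the stored fingerprint is out of date."""
    # Imported here because these modules only exist inside DSS.
    import dataiku
    from dataiku.scenario import Trigger

    project = dataiku.api_client().get_default_project()
    variables = project.get_variables()
    if update_if_changed(variables, fingerprint):
        # Persist the new fingerprint before firing, so the next check
        # compares against the state that triggered this run.
        project.set_variables(variables)
        Trigger().fire()
```

In the trigger body you would then call `fire_if_changed(...)` with the fingerprint of the current folder listing.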
 
Andrey Avtomonov
R&D Engineer @ Dataiku
Level 3
Author

Thanks Andrey. It's a pity there isn't a built-in way to trigger a scenario when a file arrives, as this is a very common use case for data pipelines. We want to move each incoming file to a processing folder, and then to a processed folder once it has been loaded. This has three benefits:

  1. We don't need to maintain a list of files as any new incoming files should be processed
  2. There is a clear workflow for files in each stage
  3. It's easy to monitor the incoming and processing directories to make sure files are not piling up due to processing errors
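
The folder workflow described above could be sketched in plain Python as follows. The directory names (`incoming`, `processing`, `processed`) and the `handle_file` callback are assumptions for illustration. If the callback raises, the file is left in `processing`, which is what makes stuck files easy to spot.

```python
import shutil
from pathlib import Path


def process_incoming(base_dir, handle_file):
    """Move each incoming file through processing and into processed."""
    base = Path(base_dir)
    incoming = base / "incoming"
    processing = base / "processing"
    processed = base / "processed"
    for directory in (incoming, processing, processed):
        directory.mkdir(parents=True, exist_ok=True)

    finished = []
    for source in sorted(incoming.iterdir()):
        if not source.is_file():
            continue
        # Claim the file first so it is not picked up again by a later pass.
        in_progress = processing / source.name
        shutil.move(str(source), str(in_progress))
        handle_file(in_progress)  # e.g. load and score the file
        final = processed / source.name
        shutil.move(str(in_progress), str(final))
        finished.append(final)
    return finished
```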

How much overhead does running a trigger very frequently (say, every 5 seconds) add? How does the trigger's "Run every" setting work? Is it a daemon-like process, or does it get kicked off by some internal Dataiku scheduler? I ask because, if it is not a daemon, it might be more efficient for us to run a daemon process on the OS than to have Dataiku start up a thread to check a directory.

We need to score files within a minute at most, and it takes us around 10 seconds to score each file, so we need to load them quickly to meet our SLA.

Another option, which I think is a better fit, is to transfer the file to a GCP bucket (Dataiku is running on a GCP VM). Google Cloud Storage supports running Cloud Functions when new files are added to a bucket. I am guessing we could write a small Cloud Function that uses the Dataiku REST API to trigger the scenario in Dataiku. However, our Cloud Engineering team has not enabled Cloud Functions in our environment, so this will have to wait until they do.
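
A rough sketch of such a Cloud Function, assuming the `dataikuapi` client package is installed in the function's environment and that the function is deployed as a background function receiving `(event, context)` for new objects in the bucket. The environment variable names (`DSS_URL`, `DSS_API_KEY`, `DSS_PROJECT_KEY`, `DSS_SCENARIO_ID`) and the function names are hypothetical.

```python
import os


def trigger_scenario(client, project_key, scenario_id):
    """Start a scenario run through a Dataiku API client."""
    scenario = client.get_project(project_key).get_scenario(scenario_id)
    return scenario.run()


def on_file_uploaded(event, context):
    """Entry point: called when a new object is finalized in the bucket."""
    # Imported here so the helper above can be tested without the package.
    import dataikuapi

    client = dataikuapi.DSSClient(os.environ["DSS_URL"], os.environ["DSS_API_KEY"])
    print("New object:", event.get("name"))
    trigger_scenario(
        client,
        os.environ["DSS_PROJECT_KEY"],
        os.environ["DSS_SCENARIO_ID"],
    )
```

The DSS instance would need to be reachable from the function, and the API key would need permission to run the scenario.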

Dataiker

Turribeach, I completely forgot about an existing trigger that we have. You can choose the "On dataset change" option and then add the folder you want to monitor.

It checks the list of files, their sizes, and their modification dates; if any of these differ from the previous run, it triggers the scenario.

DSS runs the monitoring in a separate daemon thread with an infinite loop (for as long as the trigger is enabled). On each iteration, this thread checks the managed folder and, if nothing has changed, sleeps for the number of seconds defined in the "Run every" field.
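
To illustrate the idea (this is not the actual DSS implementation, only a plain-Python sketch of the same check-and-sleep pattern), a folder watcher that compares file names, sizes, and modification times could look like this. The function names and the `max_iterations` parameter are hypothetical.

```python
import os
import time


def snapshot(folder):
    """Map each file name in the folder to its (size, modification time)."""
    entries = {}
    with os.scandir(folder) as it:
        for entry in it:
            if entry.is_file():
                info = entry.stat()  # metadata only, file content is not read
                entries[entry.name] = (info.st_size, info.st_mtime)
    return entries


def watch(folder, on_change, run_every=5, max_iterations=None):
    """Call on_change whenever the folder snapshot differs from the last one."""
    previous = snapshot(folder)
    iterations = 0
    # Loops forever unless max_iterations is given (useful for testing).
    while max_iterations is None or iterations < max_iterations:
        time.sleep(run_every)
        current = snapshot(folder)
        if current != previous:
            on_change(current)
            previous = current
        iterations += 1
```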

The filesystem enumeration is fast because it doesn't read the content of the files, only their metadata. Usually, the trigger overhead is negligible compared to the scenario execution time.

Andrey Avtomonov
R&D Engineer @ Dataiku