Hi dataiku users and experts,
I need to check very quickly, across all projects (around 300), whether any job was executed or any scenario was triggered and run in the last couple of minutes (the window should be configurable). I did not find anything other than the project list_jobs method from the public API.
However, iterating through all projects and calling list_jobs takes too much time (2-3 sec).
I am thinking of writing a watcher on top of the scenarios and jobs folders in the DSS data dir, to catch and cache the latest job/scenario execution per project.
Or does anybody have a better idea?
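To make the watcher idea concrete, here is a minimal polling sketch in plain Python. It assumes a hypothetical layout in which a root folder contains one sub-folder per project and each run creates or touches an entry inside it; the actual layout of the DSS data dir is not confirmed here, and the names latest_activity_per_project, recently_active and jobs_root are made up for illustration.

```python
import os
import time


def latest_activity_per_project(jobs_root):
    """Map each project sub-folder to the newest mtime among its direct entries.

    Assumption: jobs_root contains one directory per project (hypothetical layout).
    """
    latest = {}
    with os.scandir(jobs_root) as projects:
        for project in projects:
            if not project.is_dir():
                continue
            newest = 0.0
            with os.scandir(project.path) as entries:
                for entry in entries:
                    newest = max(newest, entry.stat().st_mtime)
            if newest:  # skip projects that have no entries at all
                latest[project.name] = newest
    return latest


def recently_active(latest, window_seconds, now=None):
    """Return the sorted project names whose newest entry is within the window."""
    now = time.time() if now is None else now
    return sorted(name for name, ts in latest.items() if now - ts <= window_seconds)
```

The result of latest_activity_per_project can be cached and refreshed on a timer, so that the "was anything run recently?" check only has to read an in-memory dict.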
We'd advise creating an "internal stats" dataset on the "jobs" or "scenarios" view.
Then in your Python code, use get_dataframe() on this dataset, and filter the rows by the time.
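import dataiku
from datetime import datetime, timedelta, timezone

scenario_runs = dataiku.Dataset("dss_scenario_runs")
sf = scenario_runs.get_dataframe()
# The view has the start time in UTC, so compare against the current UTC time
# instead of hard-coding a +2h offset (assumes time_start loads as naive UTC datetimes)
last_1m = datetime.now(timezone.utc).replace(tzinfo=None) - timedelta(seconds=60)
recent = sf[sf['time_start'] >= last_1m]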
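The filter step can also be written with a configurable window. The sketch below uses a small hand-made DataFrame in place of the internal stats dataset so that it runs without DSS; the time_start column name comes from the snippet above, while recent_runs, project_key and the sample rows are hypothetical. It normalises the column to UTC before comparing, so it should work whether the timestamps arrive timezone-naive or timezone-aware.

```python
import pandas as pd


def recent_runs(df, window, time_col="time_start", now=None):
    """Return the rows whose start time falls within `window` before `now` (UTC)."""
    now = pd.Timestamp.now(tz="UTC") if now is None else pd.Timestamp(now)
    if now.tzinfo is None:
        # Treat a naive "now" as UTC, matching the view's timestamps
        now = now.tz_localize("UTC")
    # Naive values are interpreted as UTC; aware values are converted to UTC
    starts = pd.to_datetime(df[time_col], utc=True)
    return df[starts >= now - pd.Timedelta(window)]


# Stand-in for the dataframe returned by get_dataframe() (made-up rows)
runs = pd.DataFrame({
    "project_key": ["A", "B", "C"],
    "time_start": pd.to_datetime([
        "2024-01-01 11:59:30",
        "2024-01-01 11:50:00",
        "2024-01-01 12:00:00",
    ]),
})

recent = recent_runs(runs, "1min", now="2024-01-01 12:00:00")
```

Here recent keeps the rows for projects A and C, which started within the last minute of the reference time, and drops B.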
This query already takes 2, sometimes 3, seconds. Is it possible to make it faster?