Hi Dataiku users and experts,
I need to check very quickly, across all projects (around 300), whether any job was executed or any scenario was triggered and executed in the last couple of minutes (the window should be configurable). I did not find anything other than the project list_jobs method from the public API.
However, iterating through all projects and calling list_jobs takes too much time (2-3 sec).
I am thinking of writing a watcher on top of the scenarios and jobs folders in the DSS data dir, to catch and cache the latest job/scenario execution per project.
Or does anybody have a better idea?
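To make the watcher idea concrete, here is a rough sketch of a polling variant based on modification times. It is only an illustration under assumptions: the function name recently_active_projects and the directory layout (one subdirectory per project key under a root folder, with one child entry per run) are hypothetical and would need to be checked against the actual DSS data dir.

```python
import os
import time

def recently_active_projects(root, window_seconds=120, now=None):
    """Return names of subdirectories of `root` touched within the window.

    Assumes (hypothetically) one subdirectory per project key, with one
    child entry created per job or scenario run.
    """
    now = time.time() if now is None else now
    cutoff = now - window_seconds
    active = []
    with os.scandir(root) as projects:
        for project in projects:
            if not project.is_dir():
                continue
            # A directory's mtime changes when an entry is added or removed in it.
            latest = project.stat().st_mtime
            with os.scandir(project.path) as runs:
                for run in runs:
                    latest = max(latest, run.stat().st_mtime)
            if latest >= cutoff:
                active.append(project.name)
    return sorted(active)
```

This only looks one level below each project folder, so it stays cheap for a few hundred projects, but it would miss activity that only modifies files deeper in the tree.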
Thanks
Hi,
We'd advise creating an "internal stats" dataset on the "jobs" or "scenarios" view.
Then in your Python code, use get_dataframe() on this dataset, and filter the rows by the time.
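import dataiku
from datetime import datetime, timedelta

scenario_runs = dataiku.Dataset("dss_scenario_runs")
sf = scenario_runs.get_dataframe()
# The view stores start times in UTC. This assumes the server's local time is UTC+2,
# so the cutoff is shifted back by 2h on top of the 60-second window.
last_1m = datetime.now() - timedelta(seconds=2*3600 + 60)
recent = sf[sf['time_start'] >= last_1m]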
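The hard-coded 2-hour shift only holds while the server runs at UTC+2. A possible variant that compares in UTC directly is sketched below. It assumes the time_start column holds UTC timestamps, as stated above; the helper name recent_rows and its parameters are made up for illustration.

```python
import pandas as pd

def recent_rows(df, window_seconds=60, time_col="time_start", now=None):
    """Return the rows of `df` whose start time falls within the last window."""
    # Current time as a timezone-aware UTC timestamp (overridable for testing).
    now = pd.Timestamp.now(tz="UTC") if now is None else now
    # Naive values are interpreted as UTC; aware values are converted to UTC.
    starts = pd.to_datetime(df[time_col], utc=True)
    cutoff = now - pd.Timedelta(seconds=window_seconds)
    return df[starts >= cutoff]
```

With the dataframe loaded above, this would be called as recent_rows(sf, window_seconds=60), which keeps the window configurable without depending on the server's timezone.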
This already takes 2, sometimes 3, seconds to query. Is it possible to make it faster?
Hi,
No, it isn't possible to get this any faster. I hadn't understood that your earlier figure of 2-3 seconds was the total time, not the time per project.