Log table to capture job status and steps for every dataset build
Hi, I want to create a log table in Snowflake that captures the details of each scenario run once the dataset build has finished, e.g. the scenario name, the user who triggered it, the build status, the time taken for execution, the build environment, and the date. Can you please share any existing code, or guide me through the steps I should follow to develop it?
Answers
-
Alexandru (Dataiker)
Hi @Curious_Analyst,
Most of this data should already be available in the "internal stats datasets": https://doc.dataiku.com/dss/latest/connecting/internal-stats.html. A DSS admin can create an internal stats dataset from +Dataset > Internal, then select "Scenario runs" and/or "Jobs".
Click Preview and you will be able to create a dataset with columns covering most of what you listed. You can of course calculate the duration later in a Prepare recipe, since you have the start/end times.
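If you prefer to do that duration calculation in code rather than in a visual Prepare recipe, a minimal pandas sketch could look like the one below; the dataset and column names are assumptions, so adjust them to whatever your internal stats dataset actually exposes.

import dataiku
import pandas as pd

# Minimal sketch: compute run duration from start/end timestamps.
# "scenario_runs", "run_start" and "run_end" are illustrative names only.
df = dataiku.Dataset("scenario_runs").get_dataframe()
df["run_start"] = pd.to_datetime(df["run_start"])
df["run_end"] = pd.to_datetime(df["run_end"])
df["duration_seconds"] = (df["run_end"] - df["run_start"]).dt.total_seconds()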
Let us know if that helps. Additionally, if you need even more granular compute information, you can use the CRU solution: https://knowledge.dataiku.com/latest/solutions/governance/solution-cru.html
Thanks -
Hey! Creating a log table in Snowflake is a good idea. You can start by defining your table structure, like log_id, scenario_name, user_name, build_status, execution_time, build_environment, and execution_date. Then, use SQL to create it. Happy coding!
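If it helps as a starting point, here is a minimal sketch of what writing one row with that structure could look like from Python in DSS. Everything here is an assumption for illustration: the "SCENARIO_LOG" dataset (which you would create on your Snowflake connection first) and all column names.

import datetime
import dataiku
import pandas as pd

# One illustrative log row matching the suggested structure.
row = pd.DataFrame([{
    "log_id": 1,
    "scenario_name": "MY_SCENARIO",
    "user_name": "some.user",
    "build_status": "SUCCESS",
    "execution_time_seconds": 42.0,
    "build_environment": "design",
    "execution_date": datetime.datetime.now(),
}])

# write_with_schema replaces the dataset contents; to append a row per run,
# write from a recipe or scenario step configured to append instead of overwrite.
dataiku.Dataset("SCENARIO_LOG").write_with_schema(row)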
-
Hi @AlexT
Every time the scenario is triggered, will the details automatically be appended to the dataset that gets created, or to the table we can create using a Sync recipe?
-
Alexandru (Dataiker)
The internal stats dataset is refreshed every time you read it, e.g. when you update the sample or build a dependent dataset. It makes a direct query to the runtime database to fetch the latest data.
However, if your jobs/scenarios table in the runtime database is very large and you are not using an external runtime database, it is advisable to run this direct query less frequently.
Ideally, have a Sync recipe to a Snowflake dataset run every x hours, then use that Snowflake dataset as needed: https://doc.dataiku.com/dss/latest/operations/runtime-databases.html#the-runtime-databases
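As a rough sketch, the refresh scenario can simply be a time trigger (e.g. every few hours) plus a build step on the Sync recipe's output; the equivalent in a Python scenario step would be something like the following, where the dataset name is an assumption.

from dataiku.scenario import Scenario

# Rebuild the Snowflake copy of the internal stats data (the Sync recipe output).
# "scenario_runs_snowflake" is an illustrative dataset name.
scenario = Scenario()
scenario.build_dataset("scenario_runs_snowflake")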
-
Hi @AlexT
In case my scenario fails, I also want to capture that information in my output, i.e. a "Failed_Step" column recording which build step failed, and log it. How can I fetch that information as well?
-
Alexandru (Dataiker)
Thanks for clarifying. Indeed, in this case the internal stats datasets will not suffice; you need to code this within the scenario. You can use a reporter to a dataset, or log this information to a dataset directly from your scenario.
-
Can you please help me with how I can do that?
-
Hi @AlexT
I am trying to do it like this: I have created an empty column "FAILED_STEP" in my log table, which I am reading from the internal datasets using a Prepare recipe.
I am using the built-in DSS variable to capture the first failed step, but I am getting an error.
Is this correct, or am I missing something?
-
Alexandru (Dataiker)
So indeed, this information is not directly available in the runtime database tables (see below).
You need to extract the information you are looking for within the scenario itself and write it to a dataset. You can do this as follows:
Create the variables you want via code, or use the built-in variables available from the scenario.
To print all the scenario variables available, you can use this Python step in the scenario to check whether the information you are looking for is already there (in most cases it should be):

from dataiku.scenario import Scenario
import json

print(json.dumps(Scenario().get_all_variables(), indent=2))
To obtain more granular information about failed steps and their summaries, you can use code:

import dataiku
from dataiku.scenario import Scenario

# This alternative works only with code steps:
# def get_failed_steps(step_results):
#     return [f"{step}: {data['thrown']['message']}" for step, data in step_results.items() if data["outcome"] == "FAILED"]
# variables = Scenario().get_all_variables()
# step_results = variables["stepResults"]
# failed_steps = get_failed_steps(step_results)
# failed_steps_str = ", ".join(failed_steps)
# scenario = Scenario()
# scenario.set_scenario_variables(failed_steps=failed_steps_str)
# print(failed_steps_str)

p = dataiku.api_client().get_default_project()
s = p.get_scenario('TEST')  # the scenario id, from the location bar
r = s.get_current_run()

# Map each step id to the report items produced by that step.
id_to_report_items = {}
for sr in r.get_details()["stepRuns"]:
    id_to_report_items[sr.get("step", {}).get("id", "n/a")] = sr.get('additionalReportItems') or []

# Collect the error message of every failed job-execution step.
failed_steps = []
for report_items in id_to_report_items.values():
    for item in report_items:
        if item['type'] == 'JOB_EXECUTED' and item['outcome'] == 'FAILED':
            failed_steps.append(str(item['thrown']))

# Expose the result as a scenario variable so reporters or later steps can use it.
scenario = Scenario()
scenario.set_scenario_variables(failed_steps='\n'.join(failed_steps))
Now that you have set the new "failed_steps" variable, you can append it to a dataset using a reporter step from the scenario, or directly with code, e.g. a SQL executor or Python code.
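For the code route, here is a minimal sketch (not an official Dataiku recipe) of a final Python scenario step that reads the variable set above and appends one row to a Snowflake log table through the SQL executor. The connection name, table name and columns are assumptions; use the earlier variable-printing step to check which built-in variables exist on your instance.

from dataiku import SQLExecutor2
from dataiku.scenario import Scenario

scenario = Scenario()
variables = scenario.get_all_variables()
failed_steps = variables.get("failed_steps", "")
outcome = "FAILED" if failed_steps else "SUCCESS"

# "SCENARIO_LOG" and its columns are illustrative; the table must already exist in Snowflake.
insert_sql = """
INSERT INTO SCENARIO_LOG (SCENARIO_NAME, BUILD_STATUS, FAILED_STEP, EXECUTION_DATE)
VALUES ('{0}', '{1}', '{2}', CURRENT_TIMESTAMP())
""".format("MY_SCENARIO", outcome, failed_steps.replace("'", "''"))

# query_to_df needs a SELECT as its main query, so the INSERT runs as a pre-query.
executor = SQLExecutor2(connection="snowflake_conn")  # connection name is an assumption
executor.query_to_df("SELECT 1 AS OK", pre_queries=[insert_sql])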
The UI option is to use a reporter: first manually create a dataset, e.g. upload an empty CSV with the column names you need (note that the timestamp column has to match and have the correct type), as explained here: https://community.dataiku.com/t5/Using-Dataiku/How-to-use-available-variables-in-reporters-section-Send-to/m-p/21538
Hope that helps!