Log table to capture job status and steps for every dataset build

Curious_Analyst
Curious_Analyst Registered Posts: 9

Hi, I want to create a log table in Snowflake to capture the detailed steps after each scenario build for dataset execution happens, e.g. the name of the scenario, the user who triggered it, the build status, the time taken for execution, the build environment, and the date. Can you please share any existing code, or guide me through the steps I should follow to develop it?

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,227 Dataiker

    Hi @Curious_Analyst
    ,
    This data should already be mostly available in the "internal stats datasets": https://doc.dataiku.com/dss/latest/connecting/internal-stats.html
    A DSS admin can create internal stats datasets from +Dataset > Internal > select "Scenario runs" and/or "Jobs".
    Click preview and you will be able to create a dataset with the following columns. [Screenshot showing the internal stats dataset columns]

    You can of course later calculate the duration in prepare as you have the start/end time.
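    For reference, the duration calculation mentioned above can also be done in a Python recipe. This is a minimal sketch, assuming ISO-8601 start/end timestamps; the exact column names and timestamp format in the internal stats dataset may differ:

```python
from datetime import datetime

def run_duration_seconds(start_iso: str, end_iso: str) -> float:
    """Compute elapsed seconds between two ISO-8601 timestamps,
    e.g. the scenario run start/end times from the internal stats dataset."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    start = datetime.strptime(start_iso, fmt)
    end = datetime.strptime(end_iso, fmt)
    return (end - start).total_seconds()

print(run_duration_seconds("2023-09-05T21:00:00", "2023-09-05T21:05:30"))  # 330.0
```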

    Let us know if that helps. Additionally, if you need even more granular compute information you can use the CRU solution: https://knowledge.dataiku.com/latest/solutions/governance/solution-cru.html

    Thanks

  • ChrisWalter
    ChrisWalter Registered Posts: 10 ✭✭✭✭

    Hey! Creating a log table in Snowflake is a good idea. You can start by defining your table structure, like log_id, scenario_name, user_name, build_status, execution_time, build_environment, and execution_date. Then, use SQL to create it. Happy coding!
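    As an illustration of that suggestion, here is a small Python sketch that assembles the CREATE TABLE statement for such a log table. The column names and Snowflake types are assumptions based on the fields listed above; you would run the resulting statement through your Snowflake connection or a DSS SQL recipe:

```python
def build_log_table_ddl(table_name: str) -> str:
    """Assemble a CREATE TABLE statement for a scenario log table.
    Column names/types are illustrative, not a fixed schema."""
    columns = [
        ("log_id", "NUMBER AUTOINCREMENT"),
        ("scenario_name", "VARCHAR"),
        ("user_name", "VARCHAR"),
        ("build_status", "VARCHAR"),
        ("execution_time_seconds", "NUMBER"),
        ("build_environment", "VARCHAR"),
        ("execution_date", "TIMESTAMP_NTZ"),
    ]
    column_defs = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n    {column_defs}\n)"

print(build_log_table_ddl("SCENARIO_BUILD_LOG"))
```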

  • Curious_Analyst
    Curious_Analyst Registered Posts: 9

    Hi @AlexT

    Every time the scenario is triggered automatically, will the details be appended to the dataset that gets created, or to a table we can create using a Sync recipe?

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,227 Dataiker

    The dataset will refresh every time you read it, e.g. when you update the sample or build a dependent dataset. Each read makes a direct query to the runtime DB to fetch the latest data.

    However, if the jobs/scenarios table in your runtime DB is very large and you are not using an external runtime DB, it is advisable to run this direct query less frequently.
    Ideally, have a Sync recipe to a Snowflake dataset run every X hours. Then you can use that Snowflake dataset as needed.

    https://doc.dataiku.com/dss/latest/operations/runtime-databases.html#the-runtime-databases

  • Curious_Analyst
    Curious_Analyst Registered Posts: 9

    Hi @AlexT

    In case my scenario fails, I also want to capture that information in my output: a "Failed_Step" column recording which build step failed. How can I fetch that information as well?

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,227 Dataiker

    Thanks for clarifying. Indeed, in this case the internal stats database will not suffice; you need to code this within the scenario. You can use a reporter to a dataset and log this information directly from your scenario.
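    A rough sketch of that approach: a Python scenario step can assemble the log record and then append it to the log dataset. The field names below are assumptions mirroring the columns discussed earlier, and the `dataiku` write call is left commented out so the helper stays self-contained:

```python
from datetime import datetime, timezone

def build_log_row(scenario_name, user_name, build_status, failed_step=None):
    """Assemble one log record for a scenario run; field names are illustrative."""
    return {
        "scenario_name": scenario_name,
        "user_name": user_name,
        "build_status": build_status,
        "failed_step": failed_step or "",
        "execution_date": datetime.now(timezone.utc).isoformat(),
    }

row = build_log_row("DAILY_BUILD", "curious_analyst", "FAILED",
                    failed_step="Build step 2")
# In a scenario code step you would then append it, e.g.:
# import dataiku
# with dataiku.Dataset("SCENARIO_BUILD_LOG").get_writer() as w:
#     w.write_row_dict(row)
print(row["build_status"])  # FAILED
```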

  • Curious_Analyst
    Curious_Analyst Registered Posts: 9

    Can you please help me on how can I do that?

  • Curious_Analyst
    Curious_Analyst Registered Posts: 9

    Hi @AlexT

    I am trying to do it like this: I have created an empty column "FAILED_STEP" in my log table, which I am reading using a Prepare recipe from the internal datasets.

    I am using the built-in DSS variable to capture the first failed step, but I am getting an error.

    Is this correct, or am I missing something?

    [Screenshot of the reporter step configuration]

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,227 Dataiker
    edited July 17

    So indeed, if the data you need is not in the runtime DB table, see below: you need to extract the information you are looking for from each step and write it to a dataset.

    [Screenshot of the scenario step configuration]

    You can do this as follows:
    Create the variables you want via code, or use the built-in variables available from the scenario.

    To print all scenario variables, you can use this Python step in the scenario to see if the information you are looking for is already available (in most cases it should be):

    from dataiku.scenario import Scenario
    import json
    print(json.dumps(Scenario().get_all_variables(), indent=2))


    To obtain more granular information about failed steps and their summaries, you can use code:

    import json
    import dataiku
    from dataiku.scenario import Scenario

    # Option 1 (works only inside a scenario code step): read the step results
    # from the scenario variables.
    # variables = Scenario().get_all_variables()
    # failed_steps = [
    #     f"{step}: {data['thrown']['message']}"
    #     for step, data in variables["stepResults"].items()
    #     if data["outcome"] == "FAILED"
    # ]

    # Option 2: query the scenario run details through the public API.
    p = dataiku.api_client().get_default_project()
    s = p.get_scenario('TEST')  # the scenario id, from the location bar
    r = s.get_current_run()

    # Map each step id to its report items (guard against steps with none)
    id_to_report_items = {}
    for sr in r.get_details()["stepRuns"]:
        step_id = sr.get("step", {}).get("id", "n/a")
        id_to_report_items[step_id] = sr.get("additionalReportItems") or []

    # Collect the error of every failed job-execution step
    failed_steps = []
    for step_id, report_items in id_to_report_items.items():
        for item in report_items:
            if item.get("type") == "JOB_EXECUTED" and item.get("outcome") == "FAILED":
                failed_steps.append(str(item.get("thrown")))

    # Store the result in a scenario variable so reporters can use it
    scenario = Scenario()
    scenario.set_scenario_variables(failed_steps="\n".join(failed_steps))


    Now that you have set the new "failed_steps" variable, you can append it to a dataset using a reporter step from the scenario, or via code directly, e.g. a SQL executor or Python code.
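    For the SQL route, here is a sketch of composing the INSERT statement; the table and column names are assumptions, and in DSS you would execute the resulting string through a SQL executor (not shown here):

```python
def build_insert_sql(table_name: str, row: dict) -> str:
    """Compose an INSERT statement from a row dict.
    Values are single-quote escaped; prefer bind variables in production."""
    cols = ", ".join(row.keys())
    vals = ", ".join("'" + str(v).replace("'", "''") + "'" for v in row.values())
    return f"INSERT INTO {table_name} ({cols}) VALUES ({vals})"

sql = build_insert_sql("SCENARIO_BUILD_LOG",
                       {"scenario_name": "DAILY_BUILD", "build_status": "FAILED"})
print(sql)
# INSERT INTO SCENARIO_BUILD_LOG (scenario_name, build_status) VALUES ('DAILY_BUILD', 'FAILED')
```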

    The UI option is to use a reporter. First, manually create a dataset (e.g. upload an empty CSV with the column names you need); note that the timestamp column must match and have the correct type, as explained here: https://community.dataiku.com/t5/Using-Dataiku/How-to-use-available-variables-in-reporters-section-Send-to/m-p/21538

    [Screenshot of the reporter configuration]

    Hope that helps!
