Log table to capture job status and steps for every dataset build

Curious_Analyst
Curious_Analyst Registered Posts: 9

Hi, I want to create a log table in Snowflake to capture the detailed steps after each scenario build for dataset execution happens, e.g. the name of the scenario, the user who triggered it, the build status, the time taken for execution, the build environment, and the date. Can you please share any existing code, or guide me through the steps I should follow to develop it?

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,227 Dataiker

    Hi @Curious_Analyst
    ,
    This data should already be mostly available in the "internal stats datasets": https://doc.dataiku.com/dss/latest/connecting/internal-stats.html
    A DSS admin can create internal stats datasets from +Dataset > Internal > select "Scenario runs" and/or "Jobs".
    Click preview and you will be able to create a dataset with the following columns. [Screenshot showing the internal stats dataset columns]

    You can of course later calculate the duration in prepare as you have the start/end time.
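    For reference, the duration calculation mentioned above can also be done in a Python recipe. This is a minimal sketch, assuming ISO-8601 start/end timestamps; the exact column names and timestamp format in the internal stats dataset may differ:

```python
from datetime import datetime

def run_duration_seconds(start_iso: str, end_iso: str) -> float:
    """Compute elapsed seconds between two ISO-8601 timestamps,
    e.g. the scenario run start/end times from the internal stats dataset."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    start = datetime.strptime(start_iso, fmt)
    end = datetime.strptime(end_iso, fmt)
    return (end - start).total_seconds()

print(run_duration_seconds("2023-09-05T21:00:00", "2023-09-05T21:05:30"))  # 330.0
```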

    Let us know if that helps. Additionally, if you need even more granular compute information you can use the CRU solution: https://knowledge.dataiku.com/latest/solutions/governance/solution-cru.html

    Thanks

  • ChrisWalter
    ChrisWalter Registered Posts: 10 ✭✭✭✭

    Hey! Creating a log table in Snowflake is a good idea. You can start by defining your table structure, like log_id, scenario_name, user_name, build_status, execution_time, build_environment, and execution_date. Then, use SQL to create it. Happy coding!
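    As an illustration of that suggestion, here is a small Python sketch that assembles the CREATE TABLE statement for such a log table. The column names and Snowflake types are assumptions based on the fields listed above; you would run the resulting statement through your Snowflake connection or a DSS SQL recipe:

```python
def build_log_table_ddl(table_name: str) -> str:
    """Assemble a CREATE TABLE statement for a scenario log table.
    Column names/types are illustrative, not a fixed schema."""
    columns = [
        ("log_id", "NUMBER AUTOINCREMENT"),
        ("scenario_name", "VARCHAR"),
        ("user_name", "VARCHAR"),
        ("build_status", "VARCHAR"),
        ("execution_time_seconds", "NUMBER"),
        ("build_environment", "VARCHAR"),
        ("execution_date", "TIMESTAMP_NTZ"),
    ]
    column_defs = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n    {column_defs}\n)"

print(build_log_table_ddl("SCENARIO_BUILD_LOG"))
```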

  • Curious_Analyst
    Curious_Analyst Registered Posts: 9

    Hi @AlexT

    Every time the scenario is triggered automatically, will the details be appended to the dataset that gets created, or to a table we can create using a Sync recipe?

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,227 Dataiker

    The dataset will refresh every time you read it, e.g. when you update the sample or build a dependent dataset. Each read makes a direct query to the runtime DB to fetch the latest data.

    However, if the jobs/scenarios table in your runtime DB is very large and you are not using an external runtime DB, it is advisable to run this direct query less frequently.
    Ideally, have a Sync recipe to a Snowflake dataset run every X hours. Then you can use that Snowflake dataset as needed.

    https://doc.dataiku.com/dss/latest/operations/runtime-databases.html#the-runtime-databases

  • Curious_Analyst
    Curious_Analyst Registered Posts: 9

    Hi @AlexT

    In case my scenario fails, I also want to capture that information in my output: a "Failed_Step" column recording which build step failed. How can I fetch that information as well?

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,227 Dataiker

    Thanks for clarifying. Indeed, in this case the internal stats database will not suffice; you need to code this within the scenario. You can use a reporter to a dataset and log this information directly from your scenario.
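    A rough sketch of that approach: a Python scenario step can assemble the log record and then append it to the log dataset. The field names below are assumptions mirroring the columns discussed earlier, and the `dataiku` write call is left commented out so the helper stays self-contained:

```python
from datetime import datetime, timezone

def build_log_row(scenario_name, user_name, build_status, failed_step=None):
    """Assemble one log record for a scenario run; field names are illustrative."""
    return {
        "scenario_name": scenario_name,
        "user_name": user_name,
        "build_status": build_status,
        "failed_step": failed_step or "",
        "execution_date": datetime.now(timezone.utc).isoformat(),
    }

row = build_log_row("DAILY_BUILD", "curious_analyst", "FAILED",
                    failed_step="Build step 2")
# In a scenario code step you would then append it, e.g.:
# import dataiku
# with dataiku.Dataset("SCENARIO_BUILD_LOG").get_writer() as w:
#     w.write_row_dict(row)
print(row["build_status"])  # FAILED
```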

  • Curious_Analyst
    Curious_Analyst Registered Posts: 9

    Can you please help me on how can I do that?

  • Curious_Analyst
    Curious_Analyst Registered Posts: 9

    Hi @AlexT

    I am trying to do it like this: I have created an empty column "FAILED_STEP" in my log table, which I am reading using a Prepare recipe from the internal datasets.

    I am using the built-in DSS variable to capture the first failed step, but I am getting an error.

    Is this correct, or am I missing something?

    [Screenshot of the reporter step configuration]

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,227 Dataiker
    edited July 17

    So indeed, if the data you need is not in the runtime DB table, see below: you need to extract the information you are looking for from each step and write it to a dataset.

    [Screenshot of the scenario step configuration]

    You can do this as follows:
    Create the variables you want via code, or use the built-in variables available from the scenario.

    To print all scenario variables, you can use this Python step in the scenario to see if the information you are looking for is already available (in most cases it should be):

    from dataiku.scenario import Scenario
    import json
    print(json.dumps(Scenario().get_all_variables(), indent=2))


    To obtain more granular information about failed steps and their summaries, you can use code:

    import json
    import dataiku
    from dataiku.scenario import Scenario

    # Option 1 (works only inside a scenario code step): read the step results
    # from the scenario variables.
    # variables = Scenario().get_all_variables()
    # failed_steps = [
    #     f"{step}: {data['thrown']['message']}"
    #     for step, data in variables["stepResults"].items()
    #     if data["outcome"] == "FAILED"
    # ]

    # Option 2: query the scenario run details through the public API.
    p = dataiku.api_client().get_default_project()
    s = p.get_scenario('TEST')  # the scenario id, from the location bar
    r = s.get_current_run()

    # Map each step id to its report items (guard against steps with none)
    id_to_report_items = {}
    for sr in r.get_details()["stepRuns"]:
        step_id = sr.get("step", {}).get("id", "n/a")
        id_to_report_items[step_id] = sr.get("additionalReportItems") or []

    # Collect the error of every failed job-execution step
    failed_steps = []
    for step_id, report_items in id_to_report_items.items():
        for item in report_items:
            if item.get("type") == "JOB_EXECUTED" and item.get("outcome") == "FAILED":
                failed_steps.append(str(item.get("thrown")))

    # Store the result in a scenario variable so reporters can use it
    scenario = Scenario()
    scenario.set_scenario_variables(failed_steps="\n".join(failed_steps))


    Now that you have set the new "failed_steps" variable, you can append it to a dataset using a reporter step from the scenario, or via code directly, e.g. a SQL executor or Python code.
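    For the SQL route, here is a sketch of composing the INSERT statement; the table and column names are assumptions, and in DSS you would execute the resulting string through a SQL executor (not shown here):

```python
def build_insert_sql(table_name: str, row: dict) -> str:
    """Compose an INSERT statement from a row dict.
    Values are single-quote escaped; prefer bind variables in production."""
    cols = ", ".join(row.keys())
    vals = ", ".join("'" + str(v).replace("'", "''") + "'" for v in row.values())
    return f"INSERT INTO {table_name} ({cols}) VALUES ({vals})"

sql = build_insert_sql("SCENARIO_BUILD_LOG",
                       {"scenario_name": "DAILY_BUILD", "build_status": "FAILED"})
print(sql)
# INSERT INTO SCENARIO_BUILD_LOG (scenario_name, build_status) VALUES ('DAILY_BUILD', 'FAILED')
```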

    The UI option is to use a reporter. First, manually create a dataset (e.g. upload an empty CSV with the column names you need); note that the timestamp column must match and have the correct type, as explained here: https://community.dataiku.com/t5/Using-Dataiku/How-to-use-available-variables-in-reporters-section-Send-to/m-p/21538

    [Screenshot of the reporter configuration]

    Hope that helps!
