Setting up Scenario to Run a step once a day

tgb417 · ‎02-22-2024

All,

I have a data flow that starts with a flow zone that takes a long time to run. I'd like to run this first flow zone just once a day.

Is anyone aware of a good simple way to run that zone the first time the scenario runs each day the first step will run. Then every other time the scenario runs the step is checked and then bypass the scenario step.

Operating system used: Sonoma 14.3.1

--Tom

Turribeach · ‎02-22-2024

Hi Tom,

I would do it this way. You obviously need to keep a state somewhere to determine you already run for the day. I am not a fan of project variables because they can be manipulated and left in an inconsistent state different from the actual scenario run so I want some certainty the step did complete in full. I would also prefer a more visual/code less solution if possible. So my solution would be to add a column to your last dataset in your zone which simply outputs today's date as a string. In a prepare recipe you could do this in a formula:

concat(datePart(now(), "year"), "-", datePart(now(), "month"), "-", datePart(now(), "day"))

Then create a metric on it (you can use Max() as it will be the same value for all rows so it doesn't really matter). Then define a scenario variable with today's date using the same formula above. Then add another step to recalculate metrics of your output dataset. Then use this tecnique to add one more step to fetch the metric value into another scenario variable. At this stage you have the two scenario variables defined: today's date and the max() of the column of your dataset. You then add a logical condition (if condition is statisfied, as per the link above) to your build zone step comparing the two scenario variables and only run the step if they are different.

tgb417 · ‎02-22-2024

@Turribeach

Thank you for a quick reply! 🙂

First of all in my case the data is in a PostgreSQL database.

Regarding retaining a state. I've been thinking about using a SQL statement query like this to calculate the state on the fly. I also don't really like putting this kind of state in project variables, because every time the scenario changes the variable, a git commit is generated. It makes my git repositories really crazy to follow.

SELECT DATE_PART('days', max(response_headers_date) - now()) <> 0 AS old_data FROM ${DKU_DATASET_TABLE_NAME}

In this SQL query above the responce_headers_date comes from the AI Connect plugin. This SQL will result in True if the max of these dates is today, and False if there is any other results. Right now I have this query as a Custom Dataset Metric that I created using the "New SQL Query Probe".

Your suggestion seems to add extra space to the dataset in such a way that every one of the 1/4 million records in my dataset currently gets a new field.

I see over in your other post that your are defining a scenario variable. I last looked at these question with you in early 2023 and remember now that there was no way to view the scenario variables, and I had a miserable time figuring out exactly how the variable was stored in Dataiku DSS. Is that still the case? I remember this sort of being a nightmare.

In fact I found our previous conversation on these points.

https://community.dataiku.com/t5/What-s-New/Want-to-Control-the-Execution-of-Scenario-Steps-With-Con...

I've not heard anything about this product idea.
https://community.dataiku.com/t5/Product-Ideas/Enhanced-UI-UX-in-the-Senario-Step-Builder/idi-p/3102...
have you?

If anyone else comes here and finds this challanging. I invite you to up vote this product suggestions.

--Tom

tgb417 · ‎02-22-2024

@Turribeach

I've got it working. I had forgotten how tedious this processes is to get to work.

--Tom

Turribeach · ‎02-22-2024

Hi Tom,

Glad you got it working. Yes working with secenario variables can be a little pain until you work out the correct expression. Indeed I do agree with you that if you already have your latest date in your SQL database then it makes sense to use it from there. I wasn't sure if that was the case so I thought of solution that could work either way. It's also worth noting that both the scenario variable and it's type are always available in the Scenario log (in the View Scenario log link) you just have to search for it. The log is very busy but the data will always be there so you will always know what the value of a scenario variable was when the scenario runs. Finally I just like to add that given that you have a SQL that returns the condition to whether you need to run a step or not you could even run the SQL in the scenario itself in a "Execute SQL" scenario step. This will allow you to fetch the value in a value directly without having to calculate a metrics in the scenario and removing the Custom Dataset Metric/SQL Query Probe. The following post shows how to do that, and even also how to retrieve the value of the SQL output either using a variable expression or in Python:

https://community.dataiku.com/t5/General-Discussion/Retrieve-result-of-an-execute-sql-step-of-a-scen...

I think moving the SQL to a scenario step makes a lot of sense because anyone looking at the scenario will be able to understand what that SQL is doing with having to dig into a metric improving into readability and explainability. In fact I would argue that you should change the SQL and return the two dates separately, the max date from the table and the now() date, and store them in separate scenario variables and compare them in the step conditions. This will improve your debug capabilities since the scenario will log both values giving you full visibility in weird cases where maybe your database value is higher than now(), off by many dates, etc. of course you might be able to know what happened from the database side but again it’s better to have this information at the scenario log.

tgb417 · ‎03-01-2024

@Turribeach

I still do not see how to get this apparent chunk of magic out of the View scenario log screen.

stepOutput_check_day_old_data.getPath(".rows[0][0]").join(',')

I see this reference in the file. There is a .JSON object that seem to be called "step" but not something like "stepOutput_check_day_old_data". This latter formulation seems to be called for in the post you referenced.

[2024/02/29-19:45:40.656] [FT-ScenarioThread-P9um6IHM-4126396] [INFO] [dip.scenario.step.sql] scenario GATHER2.TEST#2024-02-29-20-45-40-496 - [ct: 106] Exited the exec sql step {
  "endedOn": 0,
  "success": true,
  "updatedRows": 0,
  "warningMessages": [],
  "totalRowsClipped": false,
  "totalRows": 1,
  "log": "",
  "columns": [
    {
      "name": "old_data",
      "type": "bool",
      "dssType": "BOOLEAN",
      "sqlType": -7
    }
  ],
  "rows": [
    [
      "false"
    ]
  ],
  "hasResultset": true
}

So, I've got things working. Interested in trying your SQL idea. However, there appear to be lots of documentation / training details missing in order to create scenario variables.

--Tom

Setting up Scenario to Run a step once a day

Setting up Scenario to Run a step once a day

Labels

Scenarios

Setup info

Sign up to take part

Setting up Scenario to Run a step once a day

Setting up Scenario to Run a step once a day

Labels

Scenarios

Setup info