I have a Dataiku scenario with multiple steps. One of these steps includes downloading some files from a third-party website. In some instances, there won't be any files downloaded. As a result, the next step fails as there are no files to process.
Since the absence of downloaded files on some days is a normal situation, I wanted to prevent any “warning” or “failed” alerts from being triggered. I also wanted a "clicker" solution with minimal or no code. (Note: Check out the original user question and solution on the Dataiku Community to see an alternative solution that uses Python code.)
With the help of Dataiku Support, I came to this conclusion: by using the "If condition is satisfied" option on a scenario step alongside scenario variables, you can conditionally control whether a step is executed based on the value of a metric (in this case, the number of files in a folder).
Here are the step-by-step instructions to follow:
1. Create a scenario step to "Compute metrics" for the folder (let's call this step Compute_Metrics).
2. Next, create a scenario step to "Define scenario variables".
3. On the Define scenario variables step, toggle the "Evaluated variables" ON.
4. Then, define a new variable (let's call it number_of_files) with this formula:
toNumber(filter(parseJson(stepOutput_Compute_Metrics)['ProjectID.FolderID_NP']['computed'], x, x["metricId"]=="basic:COUNT_FILES")[0].value)
5. Replace ProjectID.FolderID with your corresponding values. Note that "Compute_Metrics" refers to the name of the previous step where you computed metrics for the folder.
6. Finally, in your conditional step, set "Run this Step" to "If condition is satisfied" and the condition to “number_of_files >= 1.”
That's it! The step will conditionally execute based on the metric value of a folder — no more failures or warnings.
One last thing to add: using the "If condition is satisfied" option may also have the unwanted side effect of overriding the default behavior that scenario steps only execute "if no prior step failed."
As per the Step Flow Control documentation, the possible values for outcome (which holds the current outcome of the scenario) are 'SUCCESS', 'WARNING', 'FAILED', and 'ABORTED'. So if any previous step failed, outcome equals 'FAILED', and checking it in the condition keeps the conditional step from executing. In my case, I still wanted to ensure that no steps were executed if a prior step had failed, including those that used the "If condition is satisfied" option.
As a result, my actual condition ended up being this:
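number_of_files >= 1 && outcome == 'SUCCESS'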
My key takeaway? If you use the "If condition is satisfied" option to evaluate a variable, remember to take this side effect into consideration.
Enjoy!
Read more about this user question and the solution here: Conditional execute of scenario step without steps failing or giving warnings
To learn more about scenarios in Dataiku, visit this tutorial in the Knowledge Base: Concept: Scenarios
Great tip.
How did you figure out the 'Shaker' Formula Code that you ended up using in the step?
toNumber(filter(parseJson(stepOutput_Compute_Metrics)['ProjectID.FolderID_NP']['computed'], x, x["metricId"]=="basic:COUNT_FILES")[0].value)
Do you have any tricks to suggest if there were other conditions you wanted to monitor?
Thanks @tgb417. That is a great question, and one that I should have documented along with the trick, but better late than never! There are several options for working with the resulting JSON; this is the one I used. The first step is to get the full JSON output so you can start playing with it. So in a scenario, we add a step to compute the metrics we want to extract values from. In my test below, I am computing the metrics of a folder:
It's important to give the step a meaningful name without spaces. Next, we create a Define variables step, toggle the Evaluated variables setting, and define a variable as: parseJson(stepOutput_Compute_Metrics)
Note that the suffix after stepOutput_ is the name of the step you want to get the metrics from:
If you now run this scenario, you will find the value of the variable in the scenario execution logs:
[2023/01/27-10:03:22.599] [FT-ScenarioThread-onZHv2rB-5396] [INFO] [dip.scenario.step.definevars] scenario SECFILLINGS.TEST#2023-01-27-10-03-22-500 - [ct: 61] Update variable initial_json = parseJson(stepOutput_Compute_Metrics)
[2023/01/27-10:03:22.600] [FT-ScenarioThread-onZHv2rB-5396] [INFO] [dip.scenario.step.definevars] scenario SECFILLINGS.TEST#2023-01-27-10-03-22-500 - [ct: 62] --> Evaluated to {"SECFILLINGS.CaMoYxZE_NP":{"partition":"NP","computed":[{"metricId":"basic:SIZE","metric":{"metricType":"SIZE","dataType":"BIGINT","id":"basic:SIZE","type":"basic"},"dataType":"BIGINT","value":"0"},{"metricId":"basic:COUNT_FILES","metric":{"metricType":"COUNT_FILES","dataType":"BIGINT","id":"basic:COUNT_FILES","type":"basic"},"dataType":"BIGINT","value":"0"},{"metricId":"reporting:METRICS_COMPUTATION_DURATION","metric":{"metricType":"METRICS_COMPUTATION_DURATION","dataType":"BIGINT","id":"reporting:METRICS_COMPUTATION_DURATION","type":"reporting"},"dataType":"BIGINT","value":"5"}],"startTime":1674813802547,"endTime":1674813802552,"runs":[{"engine":"Basic"}],"target":
The value of the metrics output is the JSON shown after "Evaluated to" in the log line, up until the (class org.json.JSONObject) part (which you need to exclude). Once you have the full JSON value, you can play with it to start constructing your formula. Usually I copy/paste the JSON into a JSON formatter to be able to quickly understand the structure. If you don't have one on your machine, you can use https://jsonformatter.org/.
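For reference, here is the same output pretty-printed (and truncated to the fields that matter here), which makes the structure much easier to follow:
{
  "SECFILLINGS.CaMoYxZE_NP": {
    "partition": "NP",
    "computed": [
      { "metricId": "basic:SIZE", "dataType": "BIGINT", "value": "0", ... },
      { "metricId": "basic:COUNT_FILES", "dataType": "BIGINT", "value": "0", ... },
      { "metricId": "reporting:METRICS_COMPUTATION_DURATION", "dataType": "BIGINT", "value": "5", ... }
    ],
    "startTime": 1674813802547,
    "endTime": 1674813802552,
    ...
  }
}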
In my case I was interested in the COUNT_FILES metric value. To build the formula, I usually go to any Prepare recipe, add a Formula step, and open the editor panel. Now paste the JSON output from the log and enclose it in the parseJson() function; you need to wrap the whole JSON in single quotes, as parseJson() expects a string. Now you can look at the Sample output of the formula and start working out how to extract the desired values:
To get to the value I want, I first need to limit the output to the ["ProjectID.FolderID_NP"]["computed"] section, by adding ["SECFILLINGS.CaMoYxZE_NP"]["computed"] at the end of the formula:
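parseJson('<the JSON pasted from the log>')["SECFILLINGS.CaMoYxZE_NP"]["computed"]
(The pasted JSON is abbreviated to a placeholder here; in the formula editor it is the full string from the log, wrapped in single quotes.)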
If we look at the resulting output in the JSON formatter, we have now narrowed the JSON down to its three metrics:
The next step is to filter for the desired metric section. This can be done with the very handy filter() function:
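filter(parseJson('<the JSON pasted from the log>')["SECFILLINGS.CaMoYxZE_NP"]["computed"], x, x["metricId"]=="basic:COUNT_FILES")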
Unfortunately, Dataiku seems to throw some validation errors even though the syntax is correct (I have reported this to Dataiku Support). You can still save, though, and the function will work fine and extract the desired values. The output with the filter is:
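[
  {
    "metricId": "basic:COUNT_FILES",
    "metric": { "metricType": "COUNT_FILES", "dataType": "BIGINT", "id": "basic:COUNT_FILES", "type": "basic" },
    "dataType": "BIGINT",
    "value": "0"
  }
]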
Here I used the "Show complete value" option on the new dummy column to see the output, rather than using the JSON formatter. We finally have the desired metric selected, so we can simply add [0].value at the end to get its value, and we enclose everything in a toNumber() function to make sure we get a number data type that we can use in our conditional expressions:
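toNumber(filter(parseJson('<the JSON pasted from the log>')["SECFILLINGS.CaMoYxZE_NP"]["computed"], x, x["metricId"]=="basic:COUNT_FILES")[0].value)
Back in the scenario, the pasted JSON is simply replaced with stepOutput_Compute_Metrics, which gives the formula shown at the top of this thread.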
Obviously if you need the output to be a string you can skip the toNumber() function.
I would be interested to know if anyone has a better way of doing this in a "clicker" way (i.e., without Python). Enjoy!
Great post, thank you. I use a few variants on your process. It's not clear if mine are any better or worse than your approach, but I thought I'd share in case they are helpful to you or others reading this post.
Two other points you might be interested in:
Hope some of that is helpful to you or others reading this thread. And thank you for sharing this detailed set of tips and tricks.
PS: Dataiku Support has confirmed that the "unknown tokens" issue has been fixed in v11.2.1 or higher. Here is how the Formula Editor and Preview Sample output looks in 11.2.1:
Now there is no validation error and the Sample output is displayed properly. I have upvoted your Idea and I will add a reference to this post as well, to give Dataiku more context and different use cases.
This was a great, informative post. Using this post I created a similar notification, but it required one small change in syntax.
For the Scenario Condition:
number_of_files >= 1 && outcome == 'SUCCESS'
I had to modify it to read this:
number_of_files >= '1' && outcome == 'SUCCESS'
in order for the run condition to work. Without the quotes, the condition failed even though it should have passed.
Thank you!
You must be missing the final toNumber() when you define your variable, so your variable is a string, not a number. If you look at the log of the Define scenario variables step, you should see the variable evaluated and its data type. You should make sure it's defined as a number so it can be compared properly.
You are absolutely spot on! I originally had the toNumber() but wanted to find a way to convert to an integer. Then fast forward a few weeks and here we are. I'm going to modify the condition so it remains text and not '0'. Thank you for the quick reply!
The toNumber() function handles the conversion for you so you don't need to worry about that. In fact you are getting an Integer. If you look at the step log when the variable is defined you will see something like this:
--> Evaluated to 999 (class java.lang.Long)
The Dataiku backend runs in Java so the data types are of course from Java. "The long data type is a 64-bit two's complement integer. The signed long has a minimum value of -2^63 and a maximum value of 2^63-1". This data type is also called BigInt.
As a test I just did a custom metric and returned a number with decimals in the metric value. Then I converted it using the toNumber() function and this is what the step log showed:
--> Evaluated to 42.123456789 (class java.lang.Double)
As you can see the toNumber() function correctly defined the variable as a double.
Thank you! Knowing how to read the logs is very helpful for my future projects.
The only comment I would add: I did not see this in the step log, but I did find it in the scenario log.