Detecting if a dataset value changes
Given a very simple 1 row dataset which has a column value X can Dataiku detect if the value changes on load compared to the previous value without storing the previous value and in a no-code solution so that a scenario step can be conditionally executed (see this post for a sample of conditional step execution)?
I had some backs and forwards with support (ticket 50196) and we discussed several solutions none of which was ideal and matched all the requirements:
- Using a SQL dataset this can be done with "trigger on sql query change" but this only works with SQL datasets and we wanted something that would work on all dataset types
- Another option is to store the previous value on project variable or somehwere else so it can be compared to the new value but again this would usually need code and ends storing the previous value somewhere
- A solution that meets the no storing requirement works using a metric (max of the column value) and a custom Python Check that can compare the previous value of the metric against the newly loaded dataset value and be set the Check to WARNING or FAIL. Aside from using code I didn't like the fact that there is not much clarify in that it works because the metric value was not recalculated. I also don't like the fact that WARNING or FAIL checks are generated when in reality we just want to conditionaly fire a scenario step, so there is no need to mark anything as WARNING or FAIL.
- In the end I implemented a solution that with two datasets as follows: Load dataset1, calculate max col metric on dataset1, retrieve metric and set variable in scenario step for the current_value, then calculate metrics on dataset2, retrieve metric for the previous_value and use them to run a conditional step. Finally I can load dataset2 to load current_value and calculate metrics on dataset2 ready for the next load. So this works in no code but obviously needs two datasets which breaks the storing the previous value requirement
So how could Dataiku be improved to be able to do the above in no code and without storing the previous value? I see two ways this could be done:
- Allow access to metrics within a scenario step without having to compute them within the scenario in no code way. A new "Retrieve Metrics" scenario step type would be perfect for this use case. By accessing the existing metric values before a dataset is loaded I could retrieve the previous metric value, set a scenario variable, reload the dataset, compute metrics, retrieve the new metric value and compare them to have the conditional execution
- Allow users to compute a specific metric for a dataset in a scenario step in no code way, without having to compute all dataset metrics. A new "Compute Specific Metric" scenario step type would be perfect for this use case. In this case I will have two different metrics, I could reload the dataset, compute new metric value, retrieve the new and old metric values and compare them to have the conditional execution
Hope it makes sense. In the end this can be done relatevely easily with code but I always value the ability to have a no code solution.
Thanks