I'm facing some challenges with a scenario involving 10 datasets. These datasets are refreshed every week after undergoing data cleaning and manipulation through recipes (reading, cleaning, and writing back to the same dataset). Once all 10 datasets are refreshed, they are supposed to be joined together, step by step, by a 'join scenario'.
The triggering condition for this 'join scenario' is set to 'Dataset modified,' and I have listed all 10 datasets that are refreshed via the recipes mentioned above (screenshots attached). However, I'm encountering two issues:
To provide some context, I have selected specific options within the scenario settings, but it seems that there might be some misconfiguration or other factors causing these problems.
Could anyone kindly offer insights or suggestions on how to troubleshoot and resolve these issues with the scenario launch? I would greatly appreciate your assistance in resolving this matter.
Thank you in advance for your help!
Operating system used: Windows
If you trigger the scenario manually, are the steps launched? Are all the steps successful if launched manually? If not, what errors are you receiving? Please provide any relevant scenario logs.
Regarding the occasions where the scenario successfully triggered on dataset modification, I would suggest grabbing those logs and looking for error messages to see if there is any indication why the steps are not being run. Please feel free to share them here as well.
Hi @JordanB ,
Thank you for your response.
Yes, I can confirm that when I trigger the scenario manually, all the steps are launched and completed successfully.
Regarding Screenshot 3, I noticed that the scenario was last triggered on 2023/07/24 at 12:42. In theory, this should have initiated the steps automatically. However, as you can see in Screenshot 4, I had to trigger the scenario manually at 2023/07/24 14:28, and upon doing so, all the steps completed successfully.
I have not yet observed a scenario that was successfully triggered on 'dataset modification,' so unfortunately, I don't have those logs to compare.
To further troubleshoot, I tried changing the 'Trigger when' option from 'All are modified' to 'Any is modified' and manually triggered one of the dependent dataset scenarios. This time, the scenario was triggered and launched automatically. But it's important to note that this does not work consistently when I select 'All are modified.'
Additionally, I took care to ensure that all 10 dependent dataset scenarios complete before the main scenario is expected to run.
I would like to pose a couple of questions to help us troubleshoot the issue more effectively:
Could this issue be related to the "Run as" option? I noticed that some of the 10 dependent scenarios, on which the 'join scenario' relies, show 'last author of the scenario,' while others show my colleague's ID. Is this distinction significant for scenario triggering/launching?
Is it possible that the trigger only looks back over the 10 datasets within a specific time window? I've observed that the completion times of some of these datasets are 3-4 hours apart (10:00 PM - 1:00 AM), and I wonder if this time difference could be affecting the scenario triggering/launching.
I appreciate any insights or suggestions that forum members can provide to help us resolve this issue. Thank you for your assistance!
The most likely cause for this is that one or more of your 10 datasets are invalid under the following dataset modification trigger guidelines:
Dataset modification triggers, which start a scenario whenever a change is detected in the dataset. This type of trigger is used for filesystem-based datasets. For SQL-based datasets, however, changes to the data are not detected.
We would need to see a scenario diagnostic to verify this and provide further guidance. Please open a support ticket with us and attach a scenario diagnostic ("Last runs" > "Download diagnostic").
What version of Dataiku are you using?
Earlier today I saw another post about potential challenges with scenarios in Dataiku 12.x. As a regular user of scenarios, I'm trying to understand if I might be affected by a 12.x defect. (I've not noticed anything at the moment.) But I was still interested in case there is something out there that may affect our use case.
Hi, can you confirm two things:
As Jordan has clarified, changes to SQL datasets that are NOT managed by DSS are not detected by 'Dataset modified' triggers; please see the documentation below. I presume your datasets are not managed, otherwise this should work. Note that 'Dataset modified' triggers still don't detect data changes on managed SQL datasets either, but because the whole dataset is reloaded, the dataset settings change, which is why the triggers work in that case.
If your non-managed datasets are all in the same connection, then a single 'Trigger on SQL query change' would work, provided you can write a SQL query whose result changes once all 10 of your datasets have been updated according to your criteria.
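To illustrate the mechanics, a 'SQL query change' trigger polls a probe query and fires when the result differs from the previous poll. The sketch below is a simplified model of that comparison logic, not Dataiku's implementation; the table names (`table_1` ... `table_10`) and the `load_ts` column in the sample query are hypothetical placeholders, not taken from your project.

```python
import hashlib

# Hypothetical probe query: its result changes only once every table
# has received a new batch, because MIN() tracks the slowest table.
PROBE_QUERY = """
SELECT MIN(last_load) FROM (
    SELECT MAX(load_ts) AS last_load FROM table_1
    UNION ALL
    SELECT MAX(load_ts) AS last_load FROM table_10
) t
"""

def result_fingerprint(rows):
    """Hash the query result so any change in it can be detected."""
    return hashlib.sha256(repr(rows).encode()).hexdigest()

def should_fire(previous_fp, rows):
    """A 'SQL query change' style trigger fires when the fingerprint
    differs from the one recorded on the previous poll. Returns the
    firing decision plus the fingerprint to store for next time."""
    current_fp = result_fingerprint(rows)
    return current_fp != previous_fp, current_fp
```

The key point is that only the query result is compared between polls; as long as the slowest of the 10 tables has not loaded, `MIN(last_load)` stays the same and the trigger does not fire.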
There might be other options, but I need to know more about your datasets before suggesting possible solutions.
Launch a scenario whenever a SQL dataset changes
The dataset change triggers do not read the data, only the dataset’s settings, and in the case of datasets based on files, the files’ size and last modified date. These triggers will thus not see changes on a SQL dataset, for example on a SQL table not managed by DSS. For these cases, a SQL query change trigger is needed.
Hello @Turribeach ,
Please find my responses below.
1. I have added 10 datasets, shown in screenshot 5 attached here.
2. These are filesystem-based datasets, created using custom Python recipes via scenarios within Dataiku.
3. These are not SQL datasets. Please refer to screenshot 6 for the file icon.
4. Yes, all these datasets are created under the same connection.
I have a follow-up question on the statement below:
" Note that Dataset Changed triggers still don't detect data changes on managed datasets but because the whole dataset is reloaded then the dataset settings change, hence why the triggers work."
When you say 'dataset settings change', what exactly does it detect: file size? Number of records?
OK, so I think the issue must be that these are S3 datasets, for which the documentation doesn't say whether dataset change triggers are supported or not. I think at this stage it might be better to move to a custom Python trigger, because I don't think you are going to get what you want with the built-in functionality. Some more questions:
Answering these questions matters for understanding how a custom trigger could work. For instance, suppose your 10 datasets refresh at different times of the day, perhaps even multiple times. How would you know when all 10 datasets have been refreshed for a single daily batch load (assuming the load is daily)? Would you check from midnight to midnight (i.e., based on the 24 hours of the day), or would you need to span multiple days, and if so, how many?
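To make the batching question concrete, here is a minimal sketch of the decision a custom trigger would have to make. It assumes you can obtain each dataset's last-modified timestamp (in a real Dataiku custom trigger you would fetch these through the API; here they are passed in as a plain dict so the logic stands alone). The 24-hour window is an assumed default, not a recommendation.

```python
from datetime import datetime, timedelta

def all_refreshed_since(last_fire, modified_times, max_span=timedelta(hours=24)):
    """Decide whether a 'join scenario' should fire.

    last_fire      -- datetime when the trigger last fired
    modified_times -- dict of dataset name -> last-modified datetime
    max_span       -- how far apart the earliest and latest refresh may be
                      and still count as one batch (assumed 24h here)

    Fires only when every dataset was modified after the last firing AND
    all modifications fall within max_span of each other, so a straggler
    from a previous day's batch does not cause a premature launch.
    """
    times = list(modified_times.values())
    if any(t <= last_fire for t in times):
        return False  # at least one dataset has not been refreshed yet
    return max(times) - min(times) <= max_span
```

Note how this sidesteps the 'midnight to midnight' question: instead of a fixed calendar window, it measures the spread between the earliest and latest refresh, which tolerates your 2-3 hour intervals and loads that cross midnight (10:00 PM - 1:00 AM).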
Ideally it should be supported, because when I used 'Any is modified', the scenario got triggered. The issue arises only when I select 'All are modified.'
My responses to the questions:
These datasets are refreshed at different times of the day, with intervals of around 2-3 hours. However, it's not a fixed schedule, so predicting exact timing is challenging. Generally, it should not span over multiple days unless there are issues with receiving files from the data source, which is rare and hard to predict.
So my educated guess is that S3 is not a proper filesystem, and that's the reason why the dataset change trigger is failing to detect a change. There is one test you could do to confirm my theory:
Just noting here that source data changes are detectable on datasets that are based on filesystem-like datasets (filesystem, upload, HDFS, S3, Azure, GCS, SFTP, ...). For the trigger to fire, all 10 datasets must change (data-wise or settings-wise*) after the trigger has been set/saved.
* a change means a discrepancy in the file names, lengths or last modification time.