Troubleshooting Scenario Launch Issues with Refreshed Datasets

jyosmitha
Level 2

Hello community,

I'm facing some challenges with a scenario involving 10 datasets. These datasets are refreshed every week after undergoing data cleaning and manipulation through recipes (reading, cleaning, and writing back to the same dataset). Once all 10 datasets are refreshed, they are supposed to be joined together, step by step, by a 'join scenario'.

The triggering condition for this 'join scenario' is set to 'Dataset modified,' and I have added all 10 datasets that are refreshed by the recipes mentioned above (screenshots attached). However, I'm encountering two issues:

  1. The scenario does not launch even after all the datasets have been modified.
  2. In some cases, the scenario is triggered, but the steps within the scenario are not launched.

To provide some context, I have selected specific options within the scenario settings, but it seems that there might be some misconfiguration or other factors causing these problems.

Could anyone kindly offer insights or suggestions on how to troubleshoot and resolve these issues with the scenario launch? I would greatly appreciate your assistance in resolving this matter.

Thank you in advance for your help!


Operating system used: Windows

JordanB
Dataiker

Hi @jyosmitha,

If you trigger the scenario manually, are the steps launched? Are all the steps successful if launched manually? If not, what errors are you receiving? Please provide any relevant scenario logs. 

Regarding the occasions where the scenario successfully triggered on dataset modification, I would suggest grabbing those logs and looking for error messages to see if there is any indication why the steps are not being run. Please feel free to share them here as well. 

Thanks!

Jordan

jyosmitha
Level 2
Author

Hi @JordanB ,

Thank you for your response.

Yes, I can confirm that when I trigger the scenario manually, all the steps are launched and completed successfully.

Regarding Screenshot 3, I noticed that the scenario was last triggered on 2023/07/24 at 12:42. In theory, this should have initiated the steps automatically. However, as you can see in Screenshot 4, I had to trigger the scenario manually on 2023/07/24 at 14:28, and upon doing so, all the steps completed successfully.

I have not yet observed a scenario that was successfully triggered on 'dataset modification,' so unfortunately, I don't have those logs to compare.

To further troubleshoot, I attempted changing the 'Trigger when' option from 'all are modified' to 'Any is modified' and manually triggered one of the dependent dataset scenarios. Surprisingly, the scenario was triggered and launched automatically this time. But it's important to note that this doesn't seem to work consistently when I select 'All are modified.'

Additionally, I took care to ensure that all the dependent 10 dataset scenarios are completed before proceeding with the main scenario.

I would like to pose a couple of questions to help us troubleshoot the issue more effectively:

  1. Could this issue be related to the "Run as" option? I noticed that for some of the 10 dependent scenarios, on which the 'join scenario' relies, the 'Run as' shows the last author of the scenario, while for others it shows my colleague's ID. Is this distinction significant in the scenario triggering/launching process?

  2. Is it possible that the scenario looks back on the 10 datasets with a specific lead time? I've observed that the completion of some of these datasets is 3-4 hours apart (10:00 PM - 1:00 AM), and I wonder if this time difference could be affecting the scenario triggering/launching.

I appreciate any insights or suggestions that forum members can provide to help us resolve this issue. Thank you for your assistance!

JordanB
Dataiker

Hi @jyosmitha,

The most likely cause for this is that one or more of your 10 datasets are invalid under the following dataset modification trigger guidelines:

  • Dataset modification triggers, which start a scenario whenever a change is detected in the dataset. This type of trigger is used for filesystem-based datasets. For SQL-based datasets, however, changes to the data are not detected.

We would need to see a scenario diagnostic to verify this and provide further guidance. Please open a support ticket with us and attach a scenario diagnostic ("Last runs" > "Download diagnostic"). 

Thanks!
Jordan

tgb417

@jyosmitha ,

What version of Dataiku are you using?

Earlier today I saw another post about potential challenges with scenarios in Dataiku 12.x. As a regular user of scenarios, I'm trying to understand if I might be affected by a 12.x defect. (I've not noticed anything at the moment.) But I was still interested in case there is something out there that may affect our use case.

--Tom
jyosmitha
Level 2
Author

Hi @tgb417,

We are using version 11.4.2.

Turribeach

Hi, can you confirm a few things:

  1. In screenshot 1.PNG you didn't show any of the datasets added to the trigger. I presume you have added the 10 datasets you expect to change to the trigger, is that right?
  2. Can you please clarify on what technologies are all these 10 datasets stored on?
  3. If using SQL datasets, are these datasets managed by DSS? (In other words, who created the SQL table: DSS or some other system?)
  4. Are all these 10 datasets all created from the same connection?

As Jordan has clarified, SQL datasets that are NOT managed by DSS do not detect data changes when used for Dataset Changed triggers; please see the documentation below. I presume your datasets are not managed, otherwise this should work. Note that Dataset Changed triggers still don't detect data changes on managed datasets, but because the whole dataset is reloaded, the dataset settings change, which is why the triggers work.

If your non-managed datasets are all in the same connection, then a single "SQL query change" trigger would work, provided you can write a SQL query that results in some field changing once all your 10 datasets have been updated according to your criteria.

There might be other options, but I need to know more about your datasets before suggesting possible solutions.

https://doc.dataiku.com/dss/latest/scenarios/triggers.html

Launch a scenario whenever a SQL dataset changes
The dataset change triggers do not read the data, only the dataset’s settings, and in the case of datasets based on files, the files’ size and last modified date. These triggers will thus not see changes on a SQL dataset, for example on a SQL table not managed by DSS. For these cases, a SQL query change trigger is needed.

  • add a SQL query change trigger
  • write a query that will return a value which changes when the data changes. For example, a row count, or the maximum of some column in the dataset’s table.
  • set the check periodicity
jyosmitha
Level 2
Author

Hello @Turribeach ,

Please find my responses below.

1. I have added all 10 datasets, as shown in screenshot 5 attached here.

2. These are filesystem-based datasets, created using custom Python recipes via scenarios within Dataiku.

3. These are not SQL datasets. Please refer to screenshot 6 for the file icon.

4. Yes, all these datasets are created under the same connection.

I have a follow-up question on the statement below:

" Note that Dataset Changed triggers still don't detect data changes on managed datasets but because the whole dataset is reloaded then the dataset settings change, hence why the triggers work." 

When you say 'dataset settings change', does it check the file size? The number of records?


Turribeach

OK, so I think the issue must be that these are S3 datasets, and the documentation doesn't say whether those are supported for dataset changed triggers or not. I think at this stage it might be better to move to a custom Python trigger, because I don't think you are going to get what you want with the built-in functionality. Some more questions:

  1. Do these 10 datasets refresh as part of the same scenario or do they use different scenarios?
  2. How frequently do these datasets refresh?
  3. What's the time frame that you will need to be looking at them to detect that all 10 changed? In the GUI Dataiku calls this the "observation time frame". 

It's important to understand these questions in order to work out how a custom trigger could work. For instance, suppose your 10 datasets refresh at different times of the day, perhaps even multiple times. How would you know when the 10 datasets have been refreshed for a single day's batch load (assuming the load is daily)? Will you check from midnight to midnight (i.e. based on the 24 hours of the day), or will you need to span multiple days, and if so, how many days?
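
To make this concrete, here is a minimal sketch of what such a custom Python trigger could look like (Scenario > Settings > Triggers > Custom trigger). It assumes, purely for illustration, that each of the 10 refresh scenarios records a completion timestamp in a project variable such as last_refresh_<dataset> when it finishes, and that the join scenario updates join_last_fire as its first step; all of these variable and dataset names are hypothetical:

    import dataiku
    from dataiku.scenario import Trigger

    # Illustrative dataset names; list all 10 here.
    DATASETS = ["dataset_01", "dataset_02"]

    t = Trigger()
    variables = dataiku.get_custom_variables()  # project variables as a dict of strings

    last_fire = variables.get("join_last_fire", "")

    # ISO-8601 timestamps compare correctly as strings, so fire only when every
    # dataset has been refreshed since the last time the join scenario ran.
    if all(variables.get("last_refresh_" + name, "") > last_fire for name in DATASETS):
        t.fire()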

jyosmitha
Level 2
Author

@Turribeach ,

Ideally it should be supported because, when I used 'Any is modified', the scenario got triggered. The issue arises only when I select 'All are modified'.

My responses to the questions:

  1. These 10 datasets are associated with 10 different scenarios.
  2. They are refreshed weekly (every Sunday).
  3. Currently, I have the "Observation time frame" set to the default value, which is 'unlimited.'

These datasets are refreshed at different times of the day, with intervals of around 2-3 hours. However, it's not a fixed schedule, so predicting exact timing is challenging. Generally, it should not span over multiple days unless there are issues with receiving files from the data source, which is rare and hard to predict.


Turribeach

So my educated guess is that S3 not being a proper filesystem is the reason why the dataset change trigger is failing to detect a change. There is one test that you could do to confirm my theory:

  • For each of your 10 datasets, add a Group recipe and do a row count or a max on a datetime column, etc. Make sure the output of the recipe is on a filesystem connection on the Dataiku server, i.e. not on S3 (see the sketch after this list for a Python-recipe equivalent). This dataset won't take much space since it only stores the row count.
  • Make sure the new datasets get built as part of your existing scenarios 
  • Change your trigger so that the check on the 10 datasets now points to the new filesystem datasets.
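
If a visual Group recipe is inconvenient for any of these datasets, a small Python recipe can achieve the same thing; here is a minimal sketch, with purely illustrative dataset names:

    import dataiku
    import pandas as pd

    # Read one of the 10 source datasets and write a one-row "row count" dataset onto a
    # filesystem connection, so the dataset-changed trigger has a local file whose size
    # and modification time change on every refresh.
    src = dataiku.Dataset("source_dataset")        # illustrative input name
    out = dataiku.Dataset("source_dataset_count")  # illustrative output name
    df = src.get_dataframe()
    out.write_with_schema(pd.DataFrame({"row_count": [len(df)]}))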

 

JordanB
Dataiker

Hi @Turribeach @jyosmitha,

Just noting here that source data changes are detectable on datasets based on filesystem-like connections (filesystem, upload, HDFS, S3, Azure, GCS, SFTP, ...). For the trigger to fire, all 10 datasets must change (data-wise or settings-wise*) after the trigger has been set/saved.

* a change means a discrepancy in the file names, lengths or last modification time.

Thanks!

Jordan
