I was wondering if you could kindly help me.
I have a PySpark recipe with two input datasets, and I want to use only one of them, depending on a flag I set in the project variables. So even reading the input data happens in an if/else block where the variable is checked. There are two custom Python scenarios that run this recipe on particular partitions. For instance, if the flag is set to True, the partition of the appropriate input dataset needs to be yesterday's date, and I do not need yesterday's data from the other input dataset. In fact, there is no partition for yesterday in the other, unwanted dataset.
The problem is that some sort of validation is done by default before the recipe is run, and because of the missing partitions in the unused dataset the recipe fails immediately. Is there any way to change the configuration so that the validity of the input datasets (or of specific partitions) is not checked by default?
This is the error I get:
Checked source readiness MY_PROJECT.my_dataset -> false
[14:29:37] [ERROR] [dku.flow.jobrunner] running my_recipe - Activity unexpectedly failed
com.dataiku.dip.exceptions.SourceDatasetNotReadyException: Error while connecting to dataset MY_PROJECT.my_dataset (partition 2021-..-..)
This check cannot be disabled. If you can compute the partitions to read upfront (for example, in a scenario), you can store the partition values in variables and use those variables in the partition dependencies through a custom Python dependency (in the original example, the variable is named "a").
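A minimal sketch of that approach, assuming the dataset is partitioned by day with `YYYY-MM-DD` identifiers and that the scenario has stored the partition to read in the variable "a". The helpers `yesterday_partition` and `partitions_from_variables` are hypothetical names introduced for illustration. `get_dependencies(target_partition_id)` is assumed to match the function that the custom Python dependency editor expects, and `dataiku.get_custom_variables()` is assumed to expose the variable at that point, so check both against your DSS version.

```python
from datetime import date, timedelta


def yesterday_partition(today=None):
    """Return yesterday's date as a day-partition id (YYYY-MM-DD)."""
    today = today or date.today()
    return (today - timedelta(days=1)).isoformat()


def partitions_from_variables(variables, name="a"):
    """Turn the value stored under `name` into a list of partition ids."""
    value = variables.get(name)
    if not value:
        raise ValueError("variable %r is not set" % name)
    # Accept several comma-separated partitions, e.g. "2021-01-01,2021-01-02".
    return [p.strip() for p in str(value).split(",") if p.strip()]


def get_dependencies(target_partition_id):
    # Body of the custom Python dependency: the target partition is ignored
    # and the partitions to read come from the variable "a" instead.
    import dataiku  # imported here because it is only available inside DSS
    return partitions_from_variables(dataiku.get_custom_variables(), "a")
```

In the scenario, a custom Python step would compute `yesterday_partition()` and store the result under "a" before the build step runs, so that the dependency resolves to a partition that actually exists.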
If the dataset has an empty partition (if file based, that means a partition with empty files, as opposed to a partition with no files at all), then you can select that partition when you don't want to read from that dataset, so that the readiness evaluation doesn't complain.
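The empty-partition fallback could be sketched as follows. Everything named here is an assumption made for illustration: the flag name `use_this_input`, the variable `wanted_partition`, and the placeholder partition id `1900-01-01`, which would have to exist as an empty partition in the dataset. As above, `get_dependencies` and `dataiku.get_custom_variables()` are assumed to behave this way in your DSS version.

```python
EMPTY_PARTITION = "1900-01-01"  # hypothetical id of a partition with empty files


def is_true(flag):
    """Interpret a variable that may be a bool or a string such as "True"."""
    return str(flag).strip().lower() == "true"


def choose_partition(flag, wanted_partition, empty_partition=EMPTY_PARTITION):
    """Read the real partition when the flag is set, else the empty one."""
    return [wanted_partition] if is_true(flag) else [empty_partition]


def get_dependencies(target_partition_id):
    # Custom Python dependency for the dataset that is only sometimes needed.
    import dataiku  # imported here because it is only available inside DSS
    variables = dataiku.get_custom_variables()
    return choose_partition(
        variables.get("use_this_input"),   # hypothetical flag variable
        variables.get("wanted_partition"),  # hypothetical partition variable
    )
```

With this in place the readiness check always finds a partition to inspect, while the if/else block in the recipe still decides whether the data is actually read.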