We have a PySpark recipe with ~20 input datasets and one output dataset where we write our final data. The inputs are created by different scenarios, and we serially write each input to the output table via the PySpark recipe mentioned above.
Now, when we trigger the PySpark recipe for a specific input, it fails with the error "Input dataset is not ready" because some other inputs are still being loaded, even though those datasets are not used by the recipe at that point in time.
E.g. let's say I have Dataset A and Dataset B as inputs of the PySpark recipe and Dataset C as the output. Dataset A is loaded with new data and I want to run the recipe to load A into C, but B is still being built. The recipe therefore fails because B is not ready. How can we avoid this and load A as soon as it completes?
Since DSS cannot infer from your code that not all input datasets are necessary to execute it, it throws an error whenever a recipe is run while any of its inputs is not ready. You can solve your issue using one of the following approaches:
run the PySpark recipe only after all 20 scenarios have finished running
refactor your code to run those independent tasks as independent jobs. You can create one PySpark recipe per input dataset; each of those recipes writes to its own output dataset. The trick here is to make all those output datasets point to the same underlying files