Let me first describe the situation: we have set up a flow within DSS (mainly SQL scripts/queries running on a SQL Server) and need to run it for every month. The month indicator is a combination of year and month (so not a date format). The total input dataset is very large (a few GB) and read-only.
The SQL scripts are set up to use a global variable as the month indicator and produce the output.
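For context, DSS expands `${variable_name}` references inside SQL recipes before the query is sent to the database, which is how a global variable can drive the month selection. A minimal sketch, where the table `sales`, the column `yearmonth`, and the variable name `Month` are all illustrative assumptions:

```sql
-- ${Month} is replaced by the current value of the project (or local)
-- variable before the query runs on SQL Server.
SELECT *
FROM sales
WHERE yearmonth = '${Month}'
```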
The steps inside the flow are:
- First, set the running month in the global variable;
- Then, run several of the SQL scripts.
Once the flow is complete, the global variable is set to a new month and the process is repeated.
How can we automate this? We were thinking about using scenarios, but how can we loop through the different months as global variables?
We also did some experiments with partitioning (see How to automate multiple runs of a flow with different source file each time (discrete values)), but this forced the scripts to run in the DSS engine, which performs much more slowly than SQL Server.
Yes, you should be able to accomplish this by leveraging a custom Python step in your scenario. If your SQL script is already configured to use a global variable, you can use our Python APIs to programmatically retrieve the variable, update it, and then rerun the SQL scripts (by rebuilding the output dataset).
As an example, let's say you defined two global variables: one containing the full list of months that you want to iterate through and the other being the actual month variable that's used in your recipes. Something like:
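For illustration (the variable names `list_of_months` and `Month` match the scenario code in this thread, but the "YYYYMM" month format is an assumption), the dict returned by `get_variables()` would then look like this, with the month list under the project's standard variables and the current month under its local variables:

```python
# Sketch of the structure returned by project.get_variables():
# "standard" holds project-wide variables, "local" holds instance-local ones.
variables = {
    "standard": {"list_of_months": ["202101", "202102", "202103"]},
    "local": {"Month": "202101"},
}
```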
Then, the following code (used as a custom Python step in your scenario) should retrieve your project variables and rebuild the output dataset while iterating through the months provided.
```python
import dataiku
from dataiku.scenario import Scenario

# Instantiate client
client = dataiku.api_client()
this_proj = client.get_project('YOUR_PROJECT_KEY')

# Create the main handle to interact with the scenario
scenario = Scenario()

# Retrieve variables
variables = this_proj.get_variables()
months = variables['standard']['list_of_months']

# Iterate through months and rebuild the output dataset
for m in months:
    variables['local']['Month'] = m
    this_proj.set_variables(variables)
    scenario.build_dataset('YOUR_OUTPUT_DATASET', build_mode="NON_RECURSIVE_FORCED_BUILD")
```
Please use this sample code as a reference, since you will need to adjust it for your specific use case, but the general concept should still apply.
Also, I've gone ahead and attached a sample project that you can reference locally as well. The use case is slightly different, but it illustrates how you can leverage scenarios to programmatically update global variables that are used in recipes and rebuild the corresponding output datasets in your Flow. In this sample project, we use these same concepts to create a scenario workflow that sends emails with attachments to a list of recipients, where each attachment contains data filtered for the target recipient. Just as in your situation, global variables capture the list of recipients, a recipe uses them, and a custom Python step in a scenario iterates through the list of recipients and rebuilds an output dataset accordingly.
Again, the concepts are similar, but I wanted to provide some examples that you can reference and adapt to your own needs. Hopefully this proves helpful to you!
Would it be possible to run these jobs in parallel for any kind of recipe?
The idea is that, for different sets of parameters, we would like to train multiple models overnight using one main file that reads the data and writes output to a folder.
How could we run the same Python recipe (main file) in parallel for different parameters or lists?
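One generic way to fan a single entry point out over several parameter sets is a worker pool. A minimal sketch using Python's `concurrent.futures` (the `train_model` stub and its parameter names are hypothetical stand-ins for the "main file"; this is not a DSS-specific answer):

```python
from concurrent.futures import ThreadPoolExecutor

def train_model(params):
    """Hypothetical stand-in for the 'main file': read the data, train a
    model with the given parameters, and write results to an output folder."""
    return {"params": params, "score": params["lr"] * params["depth"]}

param_sets = [
    {"lr": 0.1, "depth": 2},
    {"lr": 0.01, "depth": 4},
    {"lr": 0.001, "depth": 8},
]

# Run the same entry point concurrently, one worker per parameter set.
with ThreadPoolExecutor(max_workers=len(param_sets)) as pool:
    results = list(pool.map(train_model, param_sets))
```

Note that threads only overlap usefully when the heavy work releases the GIL (e.g. it runs on SQL Server or in native libraries); for CPU-bound Python, `ProcessPoolExecutor` is the drop-in analogue. Within DSS itself, another option is to start several jobs programmatically via the public API.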