Application-as-recipes and SQL scripts reusable for different connections

pvannies

While turning a flow of SQL script recipes into an application-as-recipe, we encountered some difficulties making it generalisable to new input SQL tables that have a different connection/schema/table name than the original input table. We are interested in learning how other users dealt with this and what the best practices are according to the Dataiku developers.

Specifically, here are some of our questions:

- In the application designer's advanced options, there is the possibility to add connections for remapping. We understood this remapping happens when the application is instantiated -> is there a (future) possibility for the remapping to point automatically to the connection of the input dataset of the application-as-recipe? Since we do not know at design time which input dataset/connection the app-as-recipe will be used with, we had to find a workaround to do the remapping.

- To use the table name and schema of the input table as variables in the SQL scripts, we wrote a custom Python script that finds the input of the first recipe in the instantiated project, looks up its properties using the Dataset class, and writes them as project variables. To have the 'Execute SQL' steps in our scenario run with the correct connection, we went over every step of the scenario and, if it had a connection, changed the connection accordingly. Is there an easier way to do this?

- Will there be documentation available on these topics? We could only find the documentation for file-based recipes and scenarios.

Thanks for your input!
This is a follow-up to @akshaykatre's post: How do you use SQL Script in recipe as application - Dataiku Community (cc @fchataigner2)

Best Answer

  • pvannies
    edited July 17 Answer ✓

    Will post our (@akshaykatre) workaround here, hope this may be of help to someone in the community!

    Background: When the application-as-recipe runs, a temporary project instance is created that performs the input replacements and executes the flow defined in a scenario.
    After the scenario has run, the output is copied back from the instantiated project to the project in which the app-as-recipe is called.
    For SQL datasets this means that every dataset in the instantiated project keeps the connection it had when you defined the flow in the project that was turned into the application-as-recipe.
    Because you will likely use the app-as-recipe on a dataset on a different SQL server, you have to remap the connections dynamically within the scenario when designing your application.
    If you forget to do this, you will get an error that the SQL recipe cannot work on multiple connections simultaneously.
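
    To check which connection each dataset in the instantiated project actually points to before remapping, here is a minimal inspection sketch (it only uses the same public API calls as the script below):

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # print each dataset together with the connection it currently uses
    for ds in project.list_datasets():
        params = project.get_dataset(ds['name']).get_settings().get_raw_params()
        print(ds['name'], params.get('connection'))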

    Write two scenarios:

    • The first scenario is a Python script that sets all variables needed by the SQL scripts and remaps the connections of the second scenario.
    • The second scenario deals with building the actual output dataset. Its connections are remapped before it is run at the end of the first scenario.
    Depending on your actual use case, you can reuse (parts of) the Python code below for the first scenario 'SettingVariables'.

    # Python script for SettingVariables Scenario
     
    import dataiku
    from dataiku.scenario import Scenario
     
    client = dataiku.api_client()
     
    # the project we are in now - the temporary application-as-recipe project
    project = client.get_default_project()
     
    # retrieve necessary information from the input dataset via the name of the first recipe
    # (replace 'name_recipe_linked_to_input_dataset' with the actual name of that recipe)
    first_recipe = project.get_recipe('name_recipe_linked_to_input_dataset')
    name_input_dataset = first_recipe.get_settings().get_recipe_inputs().get('main').get('items')[0].get('ref')
     
    # get parameters via this dataset reference, which has the format ProjectKey.dataset_name
    # note that the ProjectKey can also be obtained via the scenario as 'scenarioTriggerParam_projectKey'
     
    def datasetinfo(dataset_name):
        # read connection, schema and table name from the dataset's configuration
        config = dataiku.Dataset(dataset_name).get_config()
        params = config.get('params')
        connection = params.get('connection')
        schema = params.get('schema')
        table_name = params.get('table')
        projectkey = config['projectKey']
        return connection, schema, table_name, projectkey
     
    connection, schema, tablename, projectkey = datasetinfo(name_input_dataset)
     
    # set global (or local) variables in temporary application-as-recipe project if variables are used outside this scenario
    vars = project.get_variables()
    vars['standard']['connection'] = connection
    vars['standard']['schema'] = schema
     
     
    # Change the connection of every scenario step that has one (e.g. 'Execute SQL' steps)
    scenario = project.get_scenario('BuildDatasets')
    settings = scenario.get_settings()
    for step in settings.raw_steps:
        if 'connection' in step['params']:
            step['params']['connection'] = connection
    settings.save()
     
     
    # Change connection and schema of each dataset in the instantiated project
    for ds in project.list_datasets():
        data_settings = project.get_dataset(ds['name']).get_settings()
        data_settings.get_raw_params()['connection'] = connection
        data_settings.get_raw_params()['schema'] = schema
        data_settings.save()
     
     
    # Adjust table names that contain DSS variables: stripping the "$" turns
    # "${var}" into "{var}", which str.format can then expand with the
    # variables of the calling project (local variables take precedence)
    if "$" in tablename:
        tablename = tablename.replace("$", "")
        localvars = dataiku.Project(project_key=projectkey).get_variables().get('local')
        standardvars = dataiku.Project(project_key=projectkey).get_variables().get('standard')
        standardvars.update(localvars)
        tablename = tablename.format(**standardvars)
     
    # example of a local variable that can be used by the SQL scripts
    vars["local"]["input_table"] = schema + "." + tablename
     
    # save the new variables
    project.set_variables(vars)
     
    # Call the scenario that builds the actual datasets; it now has remapped connections and can use the variables transferred from the calling project
    Scenario().run_scenario('BuildDatasets')
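
    With the variables in place, the SQL scripts run by BuildDatasets can pick up the input table through DSS variable expansion (project variables are referenced as ${...} in SQL recipes). For example, with input_table being the local variable set above, a SQL script could simply read:

    -- example SQL script using the variable set by SettingVariables
    SELECT * FROM ${input_table}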
