Python script to export any kind of recipe to SQL?

Pizarro75 (Partner, Registered Posts: 7)

I am in charge of migrating a whole Dataiku project to GCP BigQuery.

In my Dataiku project I have around 90 input tables and 3 output tables, with around 500 different jobs between them (mostly JOIN and GROUP BY) and a single Python job.


To export my whole project, I opened a support ticket to ask whether there is currently an easy way to export it all to SQL. They told me it depends on the recipe type: it is possible if recipe['type'] == 'sql_query'.
However, if I list the types of my recipes, they are mostly shaker, grouping, and join. I do not have any custom SQL.
I guess it is possible to export every type of recipe with a Python script. Has anyone ever done this?
import dataiku
import pprint

client = dataiku.api_client()
project = client.get_project('MDMPLANIF')
recipes = project.list_recipes()
for recipe in recipes:
    if recipe['type'] != 'sql_query':  # only consider SQL query recipes
        print(recipe['type'])
        continue
    item = project.get_recipe(recipe["name"])
    settings = item.get_settings()
    sql = settings.data["payload"]
    pprint.pprint(sql)

Answers

  • Keiji (Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 52)
    edited July 17

    Hello @Pizarro75,

    Thank you so much for posting the question on Community.

    For recipes whose type is other than SQL (such as Join and Prepare) and whose engine is "In-database (SQL)", would you please check whether you can retrieve their SQL queries as follows?

    import dataiku
    import pprint
    
    client = dataiku.api_client()
    project = client.get_default_project()
    recipe = project.get_recipe('your_recipe_name')
    status = recipe.get_status()
    sql = status.data['sql']
    pprint.pprint(sql)


    I hope this helps.

    Sincerely,
    Keiji, Dataiku Technical Support

  • Pizarro75 (Partner, Registered Posts: 7)
    edited July 17

    Thank you Keiji, this is working perfectly. I actually submitted a ticket on the Dataiku support page and they were not able to find a way to help me. I just updated your script and I am getting the results I was expecting.

    client = dataiku.api_client()
    project = client.get_project('MDMPLANIF')
    recipe = project.get_recipe('compute_MAST_STKO_STPO')
    status = recipe.get_status()
    sql = status.data['sql']
    print(sql)

    However, when I try to loop over all my recipes (over 500), it does not work. My two cents is that I cannot get the status for every type of recipe.

    client = dataiku.api_client()
    project = client.get_project('MDMPLANIF')
    recipes = project.list_recipes()
    for recipe in recipes:
        recipe = project.get_recipe(recipe)
        status = recipe.get_status()  # <-- line that returns an error

    Error log

    DataikuException: java.lang.IllegalArgumentException: "{u'inputs': {u'main': {u'items': [{u'ref': u'BOM_SQL', u'deps': []}]}}, u'maxRunningActivities': 0, u'name': u'compute_BOM', u'versionTag': {u'lastModifiedBy': {u'login': u'KarenR'}, u'lastModifiedOn': 1618396290950, u'versionNumber': 13}, u'optionalDependencies': False, u'customMeta': {u'kv': {}}, u'outputs': {u'main': {u'items': [{u'appendMode': False, u'ref': u'BOM'}]}}, u'tags': [u'Input_sync_recipes'], u'creationTag': {u'lastModifiedBy': {u'login': u'amaury_Lethy'}, u'lastModifiedOn': 1580286049898, u'versionNumber': 0}, u'params': {u'engineParams': {u'tdchParams': {u'splitMode': u'DEFAULT', u'numberOfExecutors': 2}, u'maxThreads': 4, u'hive': {u'skipPrerunValidate': False, u'hiveconf': [], u'executionEngine': u'HIVECLI_GLOBAL', u'inheritConf': u'default', u'addDkuUdf': False}, u'sparkSQL': {u'overwriteOutputSchema': False, u'useGlobalMetastore': False, u'executionEngine': u'DATABRICKS', u'readParams': {u'map': {}, u'mode': u'AUTO', u'autoModeRepartitionInto': 10}, u'sparkConfig': {u'inheritConf': u'default', u'conf': []}, u'pipelineAllowMerge': True, u'pipelineAllowStart': True, u'skipPrerunValidate': False}, u'lowerCaseSchemaIfEngineRequiresIt': False, u'sqlPipelineParams': {u'pipelineAllowStart': True, u'pipelineAllowMerge': True}, u'impala': {u'forceStreamMode': False}}, u'forcePipelineableForTests': False, u'schemaMode': u'STRICT_SYNC'}, u'checklists': {u'checklists': []}, u'redispatchPartitioning': False, u'neverRecomputeExistingPartitions': False, u'type': u'sync', u'customFields': {}, u'projectKey': u'MDMPLANIF'}.json" is not a valid file/directory name (forbidden characters or too long)

    client = dataiku.api_client()
    project = client.get_project('MDMPLANIF')
    recipes = project.list_recipes()
    for recipe in recipes:
        print(recipe['type'])

    Result:

    sync
    grouping
    shaker
    grouping
    join
    shaker
    shaker
    sync
    shaker
    join
    shaker
    shaker
    sync
    shaker
    sync
    sampling
    shaker
    sync
    join
    vstack
    grouping
    shaker
    sampling
    shaker
    sync
    sync
    python
    join
    shaker
    join
    shaker
    grouping
    shaker
    grouping
    shaker
    grouping
    shaker
    join
    shaker
    grouping
    sync
    vstack
    join
    shaker
    sampling
    shaker
    grouping
    grouping
    shaker
    sync
    shaker
    shaker
    sync
    grouping
    shaker
    shaker
    join
    join
    join
    shaker
    grouping
    grouping
    sync
    grouping
    shaker
    shaker
    grouping
    shaker
    grouping
    shaker
    sync
    grouping
    shaker
    join
    sync
    sync

  • Keiji (Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 52)
    edited July 17

    Hello @Pizarro75,

    Thank you so much for the confirmation and the response.

    Would you please try the following code?

    client = dataiku.api_client()
    project = client.get_project('MDMPLANIF')
    recipes = project.list_recipes()
    for recipe in recipes:
        recipe = project.get_recipe(recipe.name)
        status = recipe.get_status()

    DSSProject.list_recipes() returns a list of DSSRecipeListItem objects, and DSSProject.get_recipe() must take a string value of a recipe name as its parameter.
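    Since the list items turn out to be plain dicts on some DSS versions, a version-tolerant sketch for getting the name can help. The recipe_name helper below is my own, not part of the dataikuapi library; it assumes list_recipes() yields either plain dicts (older DSS versions) or items exposing a .name attribute (newer ones):

```python
def recipe_name(item):
    """Return the recipe name whether list_recipes() yielded a plain dict
    (older DSS versions) or a DSSRecipeListItem-style object (newer ones)."""
    return item['name'] if isinstance(item, dict) else item.name
```

    Either shape of list item then works with project.get_recipe(recipe_name(item)).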

    I hope this helps.

    Sincerely,
    Keiji, Dataiku Technical Support

  • Pizarro75 (Partner, Registered Posts: 7)
    edited July 17

    Thanks. I am getting:

    AttributeErrorTraceback (most recent call last)
    <ipython-input-77-230328bed0fc> in <module>()
          3 recipes = project.list_recipes()
          4 for recipe in recipes:
    ----> 5     recipe = project.get_recipe(recipe.name)
          6     status = recipe.get_status()
    
    AttributeError: 'dict' object has no attribute 'name'
  • Keiji (Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 52)
    edited July 17

    Thank you so much for checking. Then, would you please try the following code to retrieve the recipe?

    recipe = project.get_recipe(recipe['name'])

  • Pizarro75 (Partner, Registered Posts: 7)
    edited July 17

    Thank you @KeijiY! This is working using this code snippet. I updated my code following your suggestions:

    client = dataiku.api_client()
    project = client.get_project('MDMPLANIF')
    recipes = project.list_recipes()
    for recipe in recipes:
        recipe = project.get_recipe(recipe['name'])
        status = recipe.get_status()
        sql = status.data['sql']

    Error log:
    KeyErrorTraceback (most recent call last)
    <ipython-input-81-91641dd714fa> in <module>()
          5     recipe = project.get_recipe(recipe['name'])
          6     status = recipe.get_status()
    ----> 7     sql = status.data['sql']
    
    KeyError: 'sql'

    I also just figured out that I am not using a recent version of Dataiku DSS, which is probably why your code sometimes does not work. Thanks.

  • Keiji (Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 52)
    edited July 17

    Hello @Pizarro75,

    Thank you so much for the confirmation.

    As not all recipes have a SQL query (e.g. a recipe running on the DSS engine does not have one), you will need to check for the existence of the 'sql' key before retrieving the query, as follows:

    if 'sql' in status.data:
        sql = status.data['sql']
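    Putting the fixes from this thread together, here is a minimal sketch of the full export loop. The extract_sql helper is my own name, not part of the DSS API; the commented-out dataiku calls assume the same project as above and can only run inside a DSS environment:

```python
import pprint

def extract_sql(status_data):
    """Return the SQL query from a recipe status payload, or None when the
    recipe has no SQL (e.g. it runs on the DSS engine)."""
    return status_data.get('sql')

# Inside DSS, the loop from this thread then becomes:
#
#   import dataiku
#   client = dataiku.api_client()
#   project = client.get_project('MDMPLANIF')
#   for item in project.list_recipes():
#       recipe = project.get_recipe(item['name'])  # items are dicts on older DSS
#       status = recipe.get_status()
#       sql = extract_sql(status.data)
#       if sql is not None:
#           pprint.pprint(sql)
```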
  • Pizarro75 (Partner, Registered Posts: 7)

    Thank you @KeijiY, this is working! Honestly, I was not expecting that. I have a support ticket in progress and they weren't able to figure it out. Can you keep this conversation open in case I have any other questions about that script in the next couple of hours?
