
Python script to export any kind of recipe into SQL?

Pizarro75
Level 2

I am in charge of migrating a whole Dataiku project to GCP BigQuery.

In my Dataiku project, I have around 90 input tables and 3 output tables, and between those tables around 500 different recipes (mostly JOIN and GROUP BY) plus a single Python recipe.

[Screenshot: flow overview]
To export my whole project, I opened a ticket with support to ask whether there is currently an easy way to export my whole project into SQL. They told me it depends on the recipe type: it is possible if recipe['type'] == 'sql_query'.
 
However, if I list the types of my recipes, they are mostly shaker, grouping, and join. I do not have any custom SQL.
 
I guess it is possible to export every type of recipe with a Python script; has anyone ever done this?
 
 
 
 
import dataiku
import pprint

client = dataiku.api_client()
project = client.get_project('MDMPLANIF')
recipes = project.list_recipes()
for recipe in recipes:
    if recipe['type'] != 'sql_query':  # only consider SQL query recipes
        print(recipe['type'])
        continue
    item = project.get_recipe(recipe['name'])
    settings = item.get_settings()
    sql = settings.data['payload']
    pprint.pprint(sql)
 
8 Replies
KeijiY
Dataiker

Hello @Pizarro75,

Thank you so much for posting the question on Community.

For the recipes whose types are other than SQL (such as Join and Prepare) and whose engine is "In-database (SQL)", would you please check whether you can retrieve their SQL queries as follows?

 

import dataiku
import pprint

client = dataiku.api_client()
project = client.get_default_project()
recipe = project.get_recipe('your_recipe_name')
status = recipe.get_status()
sql = status.data['sql']
pprint.pprint(sql)

 

[Screenshots: recipe status showing the generated SQL query]

I hope this would help.

Sincerely,
Keiji, Dataiku Technical Support

Pizarro75
Level 2
Author

Thank you Keiji, this is working perfectly. I actually submitted a ticket on the Dataiku support page and they were not able to find a way to help me. I just updated your script and I am getting the results I was expecting.

 

client = dataiku.api_client()
project = client.get_project('MDMPLANIF')
recipe = project.get_recipe('compute_MAST_STKO_STPO')
status = recipe.get_status()
sql = status.data['sql']
print(sql)

 

However, when I am trying to loop over all my recipes (over 500), this is not working. My 2 cents is that I cannot get the status for every type of recipe.

client = dataiku.api_client()
project = client.get_project('MDMPLANIF')
recipes = project.list_recipes()
for recipe in recipes:
    recipe = project.get_recipe(recipe)
    status = recipe.get_status()  # <-- line that raises an error

 

Error log

 

DataikuException: java.lang.IllegalArgumentException: "{u'inputs': {u'main': {u'items': [{u'ref': u'BOM_SQL', u'deps': []}]}}, u'maxRunningActivities': 0, u'name': u'compute_BOM', u'versionTag': {u'lastModifiedBy': {u'login': u'KarenR'}, u'lastModifiedOn': 1618396290950, u'versionNumber': 13}, u'optionalDependencies': False, u'customMeta': {u'kv': {}}, u'outputs': {u'main': {u'items': [{u'appendMode': False, u'ref': u'BOM'}]}}, u'tags': [u'Input_sync_recipes'], u'creationTag': {u'lastModifiedBy': {u'login': u'amaury_Lethy'}, u'lastModifiedOn': 1580286049898, u'versionNumber': 0}, u'params': {u'engineParams': {u'tdchParams': {u'splitMode': u'DEFAULT', u'numberOfExecutors': 2}, u'maxThreads': 4, u'hive': {u'skipPrerunValidate': False, u'hiveconf': [], u'executionEngine': u'HIVECLI_GLOBAL', u'inheritConf': u'default', u'addDkuUdf': False}, u'sparkSQL': {u'overwriteOutputSchema': False, u'useGlobalMetastore': False, u'executionEngine': u'DATABRICKS', u'readParams': {u'map': {}, u'mode': u'AUTO', u'autoModeRepartitionInto': 10}, u'sparkConfig': {u'inheritConf': u'default', u'conf': []}, u'pipelineAllowMerge': True, u'pipelineAllowStart': True, u'skipPrerunValidate': False}, u'lowerCaseSchemaIfEngineRequiresIt': False, u'sqlPipelineParams': {u'pipelineAllowStart': True, u'pipelineAllowMerge': True}, u'impala': {u'forceStreamMode': False}}, u'forcePipelineableForTests': False, u'schemaMode': u'STRICT_SYNC'}, u'checklists': {u'checklists': []}, u'redispatchPartitioning': False, u'neverRecomputeExistingPartitions': False, u'type': u'sync', u'customFields': {}, u'projectKey': u'MDMPLANIF'}.json" is not a valid file/directory name (forbidden characters or too long)

 

 

client = dataiku.api_client()
project = client.get_project('MDMPLANIF')
recipes = project.list_recipes()
for recipe in recipes:
    print(recipe['type'])

Result:

sync
grouping
shaker
grouping
join
shaker
shaker
sync
shaker
join
shaker
shaker
sync
shaker
sync
sampling
shaker
sync
join
vstack
grouping
shaker
sampling
shaker
sync
sync
python
join
shaker
join
shaker
grouping
shaker
grouping
shaker
grouping
shaker
join
shaker
grouping
sync
vstack
join
shaker
sampling
shaker
grouping
grouping
shaker
sync
shaker
shaker
sync
grouping
shaker
shaker
join
join
join
shaker
grouping
grouping
sync
grouping
shaker
shaker
grouping
shaker
grouping
shaker
sync
grouping
shaker
join
sync
sync

 

KeijiY
Dataiker

Hello @Pizarro75,

Thank you so much for the confirmation and the response.

Would you please try the following code?

client = dataiku.api_client()
project = client.get_project('MDMPLANIF')
recipes = project.list_recipes()
for recipe in recipes:
    recipe = project.get_recipe(recipe.name)
    status = recipe.get_status()

DSSProject.list_recipes() returns a list of DSSRecipeListItem objects, and DSSProject.get_recipe() must take a recipe name string as its parameter.

I hope this would help.

Sincerely,
Keiji, Dataiku Technical Support

Pizarro75
Level 2
Author

Thanks. I am getting:

 

AttributeErrorTraceback (most recent call last)
<ipython-input-77-230328bed0fc> in <module>()
      3 recipes = project.list_recipes()
      4 for recipe in recipes:
----> 5     recipe = project.get_recipe(recipe.name)
      6     status = recipe.get_status()

AttributeError: 'dict' object has no attribute 'name'
KeijiY
Dataiker

Thank you so much for checking. Then, would you try the following code to retrieve the recipe?

recipe = project.get_recipe(recipe['name'])

 

Pizarro75
Level 2
Author

Thank you @KeijiY ! This is working using this code snippet. I updated my code following your suggestions:

 
client = dataiku.api_client()
project = client.get_project('MDMPLANIF')
recipes = project.list_recipes()
for recipe in recipes:
    recipe = project.get_recipe(recipe['name'])
    status = recipe.get_status()
    sql = status.data['sql']

Error log:
 
KeyErrorTraceback (most recent call last)
<ipython-input-81-91641dd714fa> in <module>()
      5     recipe = project.get_recipe(recipe['name'])
      6     status = recipe.get_status()
----> 7     sql = status.data['sql']

KeyError: 'sql'

 

I also just figured out that I am not using a recent version of Dataiku DSS, which is probably why your code sometimes does not work. Thanks!

KeijiY
Dataiker

Hello @Pizarro75,

Thank you so much for the confirmation.

As not all recipes have an SQL query (e.g. a recipe using the DSS engine does not have one), you will need to check for the existence of the 'sql' key before retrieving the query, as follows:

if 'sql' in status.data:
    sql = status.data['sql']
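[Editor's note] Putting the fixes from this thread together, here is a minimal sketch of the full export loop. It assumes the MDMPLANIF project key from the posts above, and tolerates both return styles of list_recipes() seen in the thread (DSSRecipeListItem objects on recent DSS builds, plain dicts on the older build used here), and skips recipes whose status has no 'sql' key:

```python
import pprint


def recipe_name(item):
    # list_recipes() returns DSSRecipeListItem objects on recent DSS
    # versions, but plain dicts on older ones; support both.
    return item['name'] if isinstance(item, dict) else item.name


def extract_sql(status_data):
    # Recipes that do not run in-database have no 'sql' key in their
    # status payload, so return None instead of raising KeyError.
    return status_data.get('sql')


def export_project_sql(project_key):
    # Collect the generated SQL for every in-database recipe in the project.
    import dataiku  # available inside a DSS Python environment
    client = dataiku.api_client()
    project = client.get_project(project_key)
    queries = {}
    for item in project.list_recipes():
        name = recipe_name(item)
        status = project.get_recipe(name).get_status()
        sql = extract_sql(status.data)
        if sql is not None:
            queries[name] = sql
    return queries


# Usage inside DSS, with the project key from this thread:
# pprint.pprint(export_project_sql('MDMPLANIF'))
```

The two small helpers are the parts this thread actually debugged; the rest is the loop from the posts above, with the results gathered into a dict keyed by recipe name.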
Pizarro75
Level 2
Author

Thank you @KeijiY, this is working! Honestly, I was not expecting that. I have a support ticket in progress and they weren't able to figure it out. Can you keep this conversation open in case I have any other questions about that script in the next couple of hours?
