Extending the DSS Python API to Handle Shaker Steps + More
Hello!
Please excuse my long post, but I recommend reading it in full to see what I am trying to accomplish!
TLDR:
- I am looking to first extend the Dataiku Python API so that visual prepare recipe steps (shakers) are easy to call. My vision is to eventually create a wrapper library around the Dataiku Python API/REST API that caters more to data engineers. Imagine a simpler pandas, but built only on Dataiku visual recipes. The problem I am currently stuck on is similar to this one: https://community.dataiku.com/t5/Using-Dataiku-DSS/Prepare-recipe-creator/m-p/3279/highlight/true#M2383
My Goals:
- Extend the Dataiku external Python API library to make visual prepare steps (shakers) easily callable. (I am not sure whether this is already possible or needs to be implemented in the back end of Dataiku DSS.)
- Once goal 1 is complete, I want to make a library that provides very simple syntax to fully engineer a DSS flow using all the visual recipes in DSS. Imagine a simpler pandas, but built only on Dataiku visual recipes (via the Dataiku API).
Here is a taste of how I imagine the library for goal 2 would work. This could change, and I am open to any suggestions Dataiku has to offer!
# This calls the delete columns shaker in the prepare visual recipe.
dataset_obj = project_obj.getDataset("dataset_name")
dataset_obj.prepare.delete_columns(columns=["col2", "col3"])
# Keys in the dictionary are cell values to find; values are their replacements.
find_and_replace_params = {"sale": "sales", "datku": "dataiku"}
dataset_obj.prepare.find_and_replace("col1", find_and_replace_params)
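One way the imagined calls above could map onto prepare-step JSON is sketched below. It is only a sketch under stated assumptions: `PrepareSteps` is a hypothetical helper, not part of the Dataiku API, and the step type names (`ColumnsSelector`, `FindReplace`) and parameter keys are guesses that should be checked against the JSON of a real prepare recipe.

```python
from typing import Any, Dict, List


class PrepareSteps:
    """Hypothetical helper that collects prepare-recipe steps as plain dicts."""

    def __init__(self) -> None:
        self.steps: List[Dict[str, Any]] = []

    def _add(self, step_type: str, params: Dict[str, Any]) -> "PrepareSteps":
        # Assumed step envelope; verify against a real recipe payload.
        self.steps.append({
            "metaType": "PROCESSOR",
            "type": step_type,
            "disabled": False,
            "params": params,
        })
        return self

    def delete_columns(self, columns: List[str]) -> "PrepareSteps":
        # Assumption: "ColumnsSelector" with keep=False drops the listed columns.
        return self._add("ColumnsSelector", {
            "appliesTo": "COLUMNS",
            "columns": list(columns),
            "keep": False,
        })

    def find_and_replace(self, col_name: str, replacements: Dict[str, str]) -> "PrepareSteps":
        # Assumption: "FindReplace" takes a list of {"from", "to"} pairs.
        mapping = [{"from": old, "to": new} for old, new in replacements.items()]
        return self._add("FindReplace", {
            "appliesTo": "SINGLE_COLUMN",
            "columns": [col_name],
            "mapping": mapping,
            "matching": "FULL_STRING",
            "normalization": "EXACT",
        })


# Mirror the two imagined calls from the post.
prepare = PrepareSteps()
prepare.delete_columns(columns=["col2", "col3"])
prepare.find_and_replace("col1", {"sale": "sales", "datku": "dataiku"})
```

The collected `prepare.steps` list is plain JSON-serializable data, so a wrapper like this could in principle hand it to whatever API call writes a recipe's `steps` list.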
My Dream:
- After this library (from goal 2) is created, it would be cool to bring it to Visual Studio Code and other IDEs, perhaps by extending the current VS Code, PyCharm, and Sublime Text plugins. I would want you to be able to see the same table that appears in the DSS prepare steps, and for it to update as you write more lines. A library like this lets developers feel at home and lets clickers feel comfortable when they read coders' DSS flows (clickers will just see the normal flow with only visual recipes). In a sense, it would allow clickers and coders to work together in harmony. Another benefit is that, for big datasets, coders who are comfortable engineering datasets in a programming language can optimize their code (especially those who don't use PySpark, SparkR, or Spark Scala). In a sense, this library acts like a transpiler: it translates a pandas-like wrapper language over the Dataiku DSS Python API into the visual programming language (or visual grammar) of the Dataiku DSS flow.
My research and questions on how to accomplish my goal 1:
- I know the Python API already has good documentation on ways to call the group, stack, and I believe join visual recipes. But I don’t think the Python API covers the prepare visual recipe to its full extent. I know that it is possible to call the shaker steps; I am just stuck because I don’t know exactly how to do it for each step. Is there any full documentation or information on how I can use the Python API to call each prepare step or shaker?
- If I am understanding the Python API and the REST API implementation correctly, I believe everything boils down to JSON with visual recipes. Please let me know if this is correct. I just don’t have a full picture of the JSON structure for all of the shaker steps and the parameters they can take. If there is no documentation, are there any suggestions on how I can accomplish my goals?
Answers
-
Interesting thought! To my knowledge, there is no available documentation on how to achieve this.
-
Hi.
So, as you noted, the REST API simply modifies the JSON files.
We do not have detailed documentation on all the parameters of the steps. I think the best way would be to create a prepare recipe with a bunch of steps and manually look through the JSON.
I believe you have already played with the Dataiku API, but here is a code snippet that retrieves the steps and shows how to modify them manually.
import dataiku

# PROJECT_KEY and RECIPE_NAME are placeholders for your own values.
client = dataiku.api_client()
project = client.get_project(PROJECT_KEY)
recipe = project.get_recipe(recipe_name=RECIPE_NAME)
payload = recipe.get_definition_and_payload()
payload_json = payload.get_json_payload()
# modify payload_json["steps"][n_step]
payload.set_json_payload(payload_json)
# Save the modified definition back to the recipe.
recipe.set_definition_and_payload(payload)
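To make the "look through the JSON" approach a little easier, the steps can be summarized by type and parameter names. This is a minimal sketch that works on a plain dict: `summarize_steps` is a hypothetical helper, and `sample_payload` is a made-up stand-in for what `payload.get_json_payload()` returns, so its step types and keys are assumptions rather than documented structure.

```python
import json
from typing import Any, Dict, List


def summarize_steps(payload_json: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return each step's position, type, and sorted parameter names."""
    summary = []
    for index, step in enumerate(payload_json.get("steps", [])):
        summary.append({
            "index": index,
            "type": step.get("type"),
            "param_keys": sorted(step.get("params", {})),
        })
    return summary


# Hypothetical payload; in DSS this would come from payload.get_json_payload().
sample_payload = {
    "steps": [
        {"type": "ColumnsSelector", "params": {"columns": ["col2"], "keep": False}},
        {"type": "FindReplace", "params": {"columns": ["col1"], "mapping": []}},
    ]
}

# Print one JSON line per step for quick inspection.
for row in summarize_steps(sample_payload):
    print(json.dumps(row))
```

Running this over a recipe that contains one of each step you care about gives a quick inventory of the parameter names each step uses.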
-
Thank you!
-
I think this idea has a lot in common with the Optimus library:
https://github.com/ironmussa/Optimus/
The only difference is that my library would wrap the DSS visual recipes.