How to standardize the production deployment of projects built on code recipes?
I hesitated to post this in the Product Ideas section, because it is an issue that I find very important on the platform: the industrialization of use cases.
I have used the platform across different teams, use cases and IT departments, and I think I am fairly advanced in implementing methodologies for deploying projects to production. There is still one big point I need to work on, a pain point shared by most of the IT teams I have collaborated with:
-> Deploying projects developed by collaborators with a strong Python or SQL developer background. These projects contain a lot of code recipes in different languages, which neither I in my role nor the various IT experts in theirs can review efficiently.
To explain: for project deployments we have set up various control processes and checklists to collect the data we need to evaluate projects, recommend changes and reconfigure them so that they are eligible to migrate to production environments. This control also feeds the supervision and monitoring of our different instances.
But for an exotic project (often associated with data scientists, as opposed to data analysts with ETL/data-management use cases, etc.) we have very substantial code recipes, and we are obliged to go through the recipes step by step to assess whether the code is OK and whether it is optimized.
You can imagine that this is very constraining, and there is always some doubt left about our code evaluations.
We are actively building algorithms to catch recipes, extract the code and read it, but we are really struggling to define standardized rules for raising red flags on code recipe objects in a flow. This puts the deployment of some projects at risk and costs our users energy to reverse engineer their developments.
If you have ever had this kind of problem, or have heard of a plugin in development or of recommendations on this, I am all ears.
Answers
-
Hi!
One possible solution to better maintain and test large code bases would be to rely on Project libraries.
As you rightfully mentioned, managing code written only in recipes can quickly become cumbersome as the code size and number of recipes increase. Project libraries allow you to better structure your code into classes and functions that you can import in the recipes, cleanly separating the "business logic" of your code from the "scripting" part that would remain in the code recipe. For example, a very complex recipe could be reduced to:
import dataiku
from mylib import some_fancy_processing

input_dataset = dataiku.Dataset("my_input_dataset")
input_df = input_dataset.get_dataframe()

output_df = some_fancy_processing(input_df)

output_dataset = dataiku.Dataset("my_output_dataset")
output_dataset.write_with_schema(output_df)
In this case, the source code of the some_fancy_processing() function would live inside a module called mylib.py located in the Project libraries. That way, you would only need to review the content of mylib.py! You can even take Project libraries to the next level by linking them to an external git repository, and apply standard CI/CD practices on it.
Hope this helps!
Best,
Harizo
-
Grixis
Hello HarizoR, thank you very much for your answer. Your suggestion is indeed interesting for getting centralized control over the functions used in Python, for example.
I have taken note of it.
However, what I am looking for would ideally be something like the existing "Validate" button, which checks whether the code is valid: in our case, a second button that could evaluate the quality of the script and whether it is optimized or not. We often have users who use deprecated Python modules or who, in SQL recipes on Oracle databases, use a FULL join where a simple LEFT join would do, which leads to heavy processing.
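To illustrate the Python side of such a check, here is a minimal sketch that parses the recipe code with the standard ast module and reports imports of modules on a deny-list. It assumes the recipe code is already available as a string; the DEPRECATED_MODULES set and the function name find_deprecated_imports are hypothetical and would need to be adapted to your own standards.

```python
import ast

# Illustrative deny-list (an assumption, adapt it to your own standards).
DEPRECATED_MODULES = {"imp", "distutils", "asyncore"}


def find_deprecated_imports(code):
    """Return sorted (line, module) pairs for imports of deny-listed modules."""
    hits = []
    # ast.parse only parses the source, it never executes or imports anything.
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "distutils.version" -> "distutils"
            if root in DEPRECATED_MODULES:
                hits.append((node.lineno, root))
    return sorted(hits)


recipe_code = (
    "import dataiku\n"
    "import imp\n"
    "from distutils.version import LooseVersion\n"
)
print(find_deprecated_imports(recipe_code))
```

Working on the syntax tree rather than on raw text avoids false positives from comments or strings that merely mention a module name.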
More concretely, the approach I am considering for this topic is:
A) First step: get the recipe code of the projects. > done
Establish a rule to estimate the weight of each recipe's code, to keep only the most substantial ones. > done
Extract only the code developed by the user from recipe X into a Python object.
C) Establish pattern lists to flag development errors / best-practice violations / non-recommended functions.
D) Walk through the Python object to evaluate the quality of the code from recipe X.
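The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the recipe code is taken as already extracted into a dict of recipe name to (language, code), the weight is simply the number of non-blank, non-comment lines, and the RULES list, its flag names and the min_weight threshold are hypothetical examples rather than recommended rules.

```python
import re

# Hypothetical rule list: (flag name, recipe language, compiled pattern).
RULES = [
    ("full-join", "sql", re.compile(r"\bFULL\s+(OUTER\s+)?JOIN\b", re.I)),
    ("select-star", "sql", re.compile(r"\bSELECT\s+\*", re.I)),
    ("row-loop", "python", re.compile(r"\.iterrows\(\)")),
]


def weight(code):
    """Count non-blank lines that are not pure comments."""
    return sum(
        1
        for line in code.splitlines()
        if line.strip() and not line.strip().startswith(("#", "--"))
    )


def scan(recipes, min_weight=3):
    """Return {recipe name: [(line number, flag name), ...]}.

    Recipes whose weight is below min_weight are skipped entirely.
    """
    report = {}
    for name, (language, code) in recipes.items():
        if weight(code) < min_weight:
            continue
        report[name] = [
            (lineno, flag)
            for lineno, line in enumerate(code.splitlines(), start=1)
            for flag, rule_language, pattern in RULES
            if rule_language == language and pattern.search(line)
        ]
    return report


recipes = {
    "compute_orders": (
        "sql",
        "SELECT *\nFROM orders o\nFULL OUTER JOIN customers c\n  ON o.cust_id = c.id",
    ),
    "tiny": ("python", "print('hi')"),
}
print(scan(recipes))
```

Matching line by line keeps the report easy to read, but it is a heuristic: a keyword pair split across two lines would be missed, and a pattern inside a string or comment would still be flagged.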