Best Practices to create a package library for DSS
Hi everyone,
I was wondering what's the best way to create a Python project on GitHub that bundles different functionalities for DSS and eventually works as a package.
This package would either be imported through the libraries section or installed with pip. Some of the functionality will be general (preprocessing, validations, etc.), which can be developed locally following the full development process, including testing. Our initial plan is to follow test-driven development.
My main concern is this: some functionality is DSS-specific (for example manipulating recipes, dataset settings, scenarios, etc.) and requires the dataiku library. What's the best approach to start developing it?
Thank you!
Answers
-
Hi,
Do you have examples of functionalities you want to implement for this package?
As you have guessed, project libraries aren't meant to encapsulate Dataiku-specific logic, so it would not be appropriate to import dataiku inside of them. You should instead check if the functionalities you are looking for are covered by the public API client. If not, any code requiring the dataiku package should not live within a project library but be written directly into the recipe/notebook you'll work on.
Hope this helps.
Best,
Harizo
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,914 Neuron
@HarizoR wrote: "project libraries aren't meant to encapsulate Dataiku-specific logic, so it would not be appropriate to import dataiku inside of them."
I respectfully disagree. I have used project libraries and even the global library before and I found them lacking. The main issue is that you can only have a single version active at a time in a given project. The only workaround I am aware of is to clone the project, which of course is not a good idea as there is no way to merge projects back after a clone. The global library is even more restrictive as there is no way around it: if you want a different library version you need a different environment. In complex and large setups (i.e. lots of DSS instances) the last thing you want is a component that cannot handle multiple versions at the same time, which makes your complex environment even more complex.
The original poster, however, mentioned wanting to work with a package published on GitHub. I think this is a great way of handling multiple versions, as the package can then be installed in Python code environments. For the reasons I explained above I wouldn't go with libraries. A package allows multiple live versions to co-exist without any issues, since you can obviously have multiple code environments and Dataiku gives you all the tools to handle these with ease. In fact you could even have different package versions running side by side in the same project; try that with project libraries.
As for "it would not be appropriate to import dataiku inside of them", I see nothing wrong with doing so. Most Python packages depend on other packages, so I don't see what the issue is. Every Dataiku Python code environment comes with both the dataiku and the dataikuapi packages preinstalled, so meeting those dependencies will not be a problem.
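To illustrate how a package that depends on dataiku can still be developed test-first outside DSS, here is a minimal sketch. The module name, the function names and the FakeDataset test double are all hypothetical. The only Dataiku call assumed is dataiku.Dataset.read_schema(), which returns a list of column dictionaries with a "name" key.

```python
"""Hypothetical module in the package, e.g. mypackage/dss_utils.py."""

try:
    import dataiku  # present in DSS Python code environments
except ImportError:  # local development or CI without DSS
    dataiku = None


def column_names(dataset):
    """Return the column names of a dataset-like object.

    `dataset` only needs a read_schema() method returning a list of
    dicts with a "name" key, as dataiku.Dataset does.
    """
    return [col["name"] for col in dataset.read_schema()]


def column_names_by_name(dataset_name):
    """Convenience wrapper that needs a real DSS runtime."""
    if dataiku is None:
        raise RuntimeError("the dataiku package is only available inside DSS")
    return column_names(dataiku.Dataset(dataset_name))


class FakeDataset:
    """Test double standing in for dataiku.Dataset in local unit tests."""

    def __init__(self, schema):
        self._schema = schema

    def read_schema(self):
        return self._schema
```

The core logic takes the dataset object as an argument, so local unit tests can pass a FakeDataset, while recipes running in DSS pass a real dataiku.Dataset or call the wrapper.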
Finally, I would say that depending on the things you want to automate with your package, they might be a better fit for a Dataiku macro or Dataiku plugin. Having said that, while macros and plugins are great ways to encapsulate and automate functionality, they fail in the same way that project and global libraries do: you can only have one version of a plugin or macro at a time. In the case of macros and plugins you can duplicate them (e.g. MyMacro v2.0 and MyPlugin v2.0), but the migration of objects is not as easy as with a package, where you just update a code environment and all your recipes automatically use the new version.
-
Hi,
To avoid any confusion, my previous message was specifically about project libraries. If you want to build an additional layer on top of the "dataiku" and "dataikuapi" packages then plugins can be a good choice depending on the use-case.
You are also correct about the versioning:
- you can only have one version of a plugin at a time within a given Dataiku instance,
- you can only have one version/branch of a project library being active in a given project.
If you have more details of practical use-cases where the previous points are perceived as limitations, do feel free to share them in the Product Ideas section: https://community.dataiku.com/t5/Product-Ideas/idb-p/Product_Ideas
Best,
Harizo
-
Ioannis Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 28 ✭✭✭✭✭
Hi Harizo,
So yes, there are some generic Python functionalities for data manipulation, but we also want to incorporate other things specific to DSS:
- manipulating SQL/Python/R recipe settings, for example going inside the code and changing very specific parts depending on whether it's Spark SQL or Spark Query.
- manipulating dataset settings and reading column names
I haven't used code studios at all; is that something I could do there? Or should I create abstract classes mirroring dataiku.Dataset etc. for this kind of development?
By public API client, do you mean dataikuapi?
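For reference, the public API client is the one provided by the dataikuapi package. The two use-cases listed above could be sketched roughly as follows. The function names, project key, recipe name, host and API key are placeholders. The calls assumed to exist on the client objects are get_recipe, get_settings, get_code, set_code, save, get_dataset and get_schema, and get_code/set_code apply to code recipes only.

```python
def replace_in_recipe_code(project, recipe_name, old, new):
    """Replace `old` with `new` in a code recipe's script.

    `project` is expected to behave like a dataikuapi project handle.
    Returns True if the recipe was modified and saved, False otherwise.
    """
    settings = project.get_recipe(recipe_name).get_settings()
    code = settings.get_code()
    if old not in code:
        return False  # nothing to change, so do not save
    settings.set_code(code.replace(old, new))
    settings.save()
    return True


def dataset_column_names(project, dataset_name):
    """Read column names from a dataset's schema through the public API."""
    schema = project.get_dataset(dataset_name).get_schema()
    return [col["name"] for col in schema["columns"]]


# Usage against a real instance (all values below are placeholders):
# import dataikuapi
# client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
# project = client.get_project("MYPROJECT")
# replace_in_recipe_code(project, "compute_orders", "old_schema.", "new_schema.")
# print(dataset_column_names(project, "orders"))
```

Because both functions take the project handle as a parameter, they can be unit-tested locally with simple stand-in objects before being run against a DSS instance.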