I am investigating Dataiku, but I am struggling with integrating default DevOps workflows with Dataiku.
What are the best practices for integrating the following CI/CD principles into a Dataiku environment?
- code linting (e.g. with pylint, black), ensuring code adheres to code standards.
- unit tests (e.g. with pytest), ensuring recipes and plugins are thoroughly tested before being deployed into production.
- data tests (e.g. with great expectations), ensuring the ingested data in production environments fits expectations.
- integration tests: the same question as for unit tests, but for the test environment
- environment separation into dev/test/qa/prod
- CI/CD (build pipelines) & DSS integration with Jenkins
- git code management & Bitbucket integration
Based on what I've seen so far, I'd assume the best practice is to keep all of these MLOps concerns separate from the DSS environment.
A first common recommendation is to put as much of your code as possible into project libraries in DSS (https://doc.dataiku.com/dss/latest/python/reusing-code.html), and to keep the code of your recipes small, so that it stays manageable.
For code linting, there is no built-in capability in Dataiku, so our recommendation is to use Git integration, either at the project level (https://doc.dataiku.com/dss/latest/collaboration/version-control.html) or for project libraries (https://doc.dataiku.com/dss/latest/python/reusing-code.html), and to have your CI/CD pipeline clone these Git repositories and perform the lint checks.
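As a minimal sketch of such a CI lint step: once the pipeline has cloned the Git repository backing your project libraries, it can run the usual linters over it. The checkout path (`python-lib`) is a placeholder, not a Dataiku convention — point it at wherever your pipeline checks the repository out.

```python
"""CI lint step: run pylint and black --check over a checked-out
project-libraries repository. A sketch, assuming the CI runner has
already cloned the repo into LIB_DIR and has the linters installed."""
import shutil
import subprocess
import sys

LIB_DIR = "python-lib"  # hypothetical checkout path of the project libraries


def lint_commands(path):
    """Build the lint commands: pylint for code standards, black for formatting."""
    return [
        ["pylint", "--recursive=y", path],  # --recursive=y lets pylint walk a plain directory
        ["black", "--check", path],         # --check reports violations without rewriting files
    ]


def main():
    failures = []
    for cmd in lint_commands(LIB_DIR):
        # A non-zero return code means the linter found problems
        if subprocess.run(cmd).returncode != 0:
            failures.append(cmd[0])
    if failures:
        # Failing the process fails the CI stage
        sys.exit("lint failed: " + ", ".join(failures))


# Only run when invoked as a script and the linters are actually installed
if __name__ == "__main__" and shutil.which("pylint") and shutil.which("black"):
    main()
```

Failing the process on any non-zero linter exit code is what makes the CI stage itself go red.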
Unit tests can be performed either:
* Within DSS: call pytest from a Python step in a scenario; it will run against your project libraries, and DSS can schedule scenario runs so your unit tests execute regularly
* Outside of DSS: using your normal CI/CD pipeline and DSS Git integration capabilities
For data-aware tests and integration tests, we recommend using Dataiku's built-in "metrics and checks" capabilities, which were built for this exact purpose (https://doc.dataiku.com/dss/latest/scenarios/metrics.html & https://doc.dataiku.com/dss/latest/scenarios/checks.html).
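These checks can also be driven from outside DSS through the public Python client (`dataikuapi`), which is useful in a CI context. Below is a minimal sketch: the host, API key, project key, dataset name, and the exact shape of the result dictionary are assumptions — verify the field names against the API documentation for your DSS version.

```python
"""A sketch of triggering a dataset's metrics and checks from outside DSS
via the dataikuapi client. Host, key, names, and the result structure
are assumptions; verify against the docs for your DSS version."""


def run_data_checks(host, api_key, project_key, dataset_name):
    # Imported inside the function so the sketch can be read without the package
    import dataikuapi

    client = dataikuapi.DSSClient(host, api_key)
    dataset = client.get_project(project_key).get_dataset(dataset_name)

    # Recompute metrics first so the checks evaluate fresh values
    dataset.compute_metrics()

    # Run the checks configured on the dataset's Status tab
    result = dataset.run_checks()

    # Collect checks whose outcome is not OK (the "results"/"value"/"outcome"
    # structure is an assumption; inspect the returned dict on your instance)
    return [
        r for r in result.get("results", [])
        if r.get("value", {}).get("outcome") != "OK"
    ]


# Hypothetical usage in a CI data-test stage:
# failed = run_data_checks("https://dss.example.com:11200", "MY_API_KEY",
#                          "MYPROJECT", "ingested_orders")
# if failed:
#     raise SystemExit(f"{len(failed)} data check(s) failed")
```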
Environment separation is a native capability of DSS, through the concept of automation nodes and bundles, which act as deployment artifacts between environments (https://doc.dataiku.com/dss/latest/bundles/index.html).
The CI pipeline itself could either be:
* DSS itself, thanks to its extensive automation capabilities: scenarios can be used to automate deployments between environments
* Any existing CI tool like Jenkins, leveraging the extensive API of Dataiku: all aspects of generating artifacts, deploying them to the various environments, activating them, reverting them, running scenarios, running checks and verifying their results, etc. can be done through our REST or Python APIs (https://doc.dataiku.com/dss/latest/python-api/)
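To make the second option concrete, here is a sketch of a Jenkins-style deployment step using `dataikuapi`: export a bundle from the design node, import and activate it on the automation node, then run a validation scenario. The hosts, keys, bundle ID, and scenario ID are placeholders; the method names follow the `dataikuapi` client, but verify them against the documentation for your DSS version.

```python
"""A sketch of a CI deployment step with dataikuapi: design node -> bundle
archive -> automation node -> validation scenario. All identifiers are
placeholders; verify method names against your DSS version's API docs."""


def deploy_bundle(design_host, design_key, automation_host, automation_key,
                  project_key, bundle_id, scenario_id="VALIDATE"):
    # Imported inside the function so the sketch can be read without the package
    import dataikuapi

    # 1. On the design node: create the bundle and download its archive
    design = dataikuapi.DSSClient(design_host, design_key)
    design_project = design.get_project(project_key)
    design_project.export_bundle(bundle_id)
    archive = f"/tmp/{project_key}-{bundle_id}.zip"
    design_project.download_exported_bundle_archive_to_file(bundle_id, archive)

    # 2. On the automation node: import the archive and activate the bundle
    automation = dataikuapi.DSSClient(automation_host, automation_key)
    auto_project = automation.get_project(project_key)
    auto_project.import_bundle_from_archive(archive)
    auto_project.activate_bundle(bundle_id)

    # 3. Run a validation scenario; run_and_wait raises if the run fails,
    # which in turn fails the CI stage
    auto_project.get_scenario(scenario_id).run_and_wait()
```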
And finally, as discussed previously, DSS can integrate with remote Git repositories like GitHub or Bitbucket, both for pushing and pulling code: https://doc.dataiku.com/dss/latest/collaboration/git.html
Hope this helps,
Thanks for this info. The documentation doesn't mention how to work across branches, how the DSS environment deals with PRs, etc.
This might be straightforward, but at the moment I don't see how to e.g. use different branches.
Hi @lhulsta ,
You can create multiple Git branches of a DSS project from the DSS user interface. You can also sync the local Git repository of your project with a remote repository, write commits, revert changes, and push and pull, all from the DSS UI.
Thanks to this "true" Git integration, pull requests can be handled on your Git hosting provider, like Bitbucket, GitHub, or GitLab.
All of this is detailed in our documentation on the Version control of projects page.
Have a great day!