I am investigating Dataiku, but I am struggling with integrating default DevOps workflows with Dataiku.
What are the best practices for integrating the following CI/CD principles into a Dataiku environment?
- code linting (e.g. with pylint, black), ensuring code adheres to code standards.
- unit tests (e.g. with pytest), ensuring recipes and plugins are thoroughly tested before being deployed into production.
- data tests (e.g. with great expectations), ensuring the ingested data in production environments fits expectations.
- integration tests: the same question as for unit tests, but for the test environment
- environment separation into dev/test/qa/prod
- CI/CD (build pipelines) & DSS integration with Jenkins
- git code management & Bitbucket integration
Based on what I've seen so far, I'd assume the best practice is to keep all of these MLOps concerns separate from the DSS environment.
A first common recommendation is to put as much of your code as possible into project libraries in DSS (https://doc.dataiku.com/dss/latest/python/reusing-code.html), and to keep the code of your recipes small, so that it stays manageable.
For code linting, there is no built-in capability in Dataiku, so our recommendation is to use Git integration, either at the project level (https://doc.dataiku.com/dss/latest/collaboration/version-control.html) or for project libraries (https://doc.dataiku.com/dss/latest/python/reusing-code.html), and to have your CI/CD pipeline clone these Git repositories and perform the lint checks.
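As a minimal sketch of such a CI lint step: once the pipeline has cloned the Git repository backing your project libraries, it can run the usual linters over it. The checkout path (`python-lib`) is a placeholder, not a Dataiku convention — point it at wherever your pipeline checks the repository out.

```python
"""CI lint step: run pylint and black --check over a checked-out
project-libraries repository. A sketch, assuming the CI runner has
already cloned the repo into LIB_DIR and has the linters installed."""
import shutil
import subprocess
import sys

LIB_DIR = "python-lib"  # hypothetical checkout path of the project libraries


def lint_commands(path):
    """Build the lint commands: pylint for code standards, black for formatting."""
    return [
        ["pylint", "--recursive=y", path],  # --recursive=y lets pylint walk a plain directory
        ["black", "--check", path],         # --check reports violations without rewriting files
    ]


def main():
    failures = []
    for cmd in lint_commands(LIB_DIR):
        # A non-zero return code means the linter found problems
        if subprocess.run(cmd).returncode != 0:
            failures.append(cmd[0])
    if failures:
        # Failing the process fails the CI stage
        sys.exit("lint failed: " + ", ".join(failures))


# Only run when invoked as a script and the linters are actually installed
if __name__ == "__main__" and shutil.which("pylint") and shutil.which("black"):
    main()
```

Failing the process on any non-zero linter exit code is what makes the CI stage itself go red.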
Unit tests can be performed either:
* Within DSS: call pytest from a Python step in a scenario; it will run against your project libraries, and DSS can schedule scenario runs so your unit tests execute regularly
* Outside of DSS: using your normal CI/CD pipeline and DSS Git integration capabilities
For data-aware tests and integration tests, we recommend using Dataiku's built-in "metrics and checks" capabilities, which were built for this exact purpose (https://doc.dataiku.com/dss/latest/scenarios/metrics.html & https://doc.dataiku.com/dss/latest/scenarios/checks.html).
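These checks can also be driven from outside DSS through the public Python client (`dataikuapi`), which is useful in a CI context. Below is a minimal sketch: the host, API key, project key, dataset name, and the exact shape of the result dictionary are assumptions — verify the field names against the API documentation for your DSS version.

```python
"""A sketch of triggering a dataset's metrics and checks from outside DSS
via the dataikuapi client. Host, key, names, and the result structure
are assumptions; verify against the docs for your DSS version."""


def run_data_checks(host, api_key, project_key, dataset_name):
    # Imported inside the function so the sketch can be read without the package
    import dataikuapi

    client = dataikuapi.DSSClient(host, api_key)
    dataset = client.get_project(project_key).get_dataset(dataset_name)

    # Recompute metrics first so the checks evaluate fresh values
    dataset.compute_metrics()

    # Run the checks configured on the dataset's Status tab
    result = dataset.run_checks()

    # Collect checks whose outcome is not OK (the "results"/"value"/"outcome"
    # structure is an assumption; inspect the returned dict on your instance)
    return [
        r for r in result.get("results", [])
        if r.get("value", {}).get("outcome") != "OK"
    ]


# Hypothetical usage in a CI data-test stage:
# failed = run_data_checks("https://dss.example.com:11200", "MY_API_KEY",
#                          "MYPROJECT", "ingested_orders")
# if failed:
#     raise SystemExit(f"{len(failed)} data check(s) failed")
```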
Environment separation is a native capability of DSS, through the concept of automation nodes and bundles, which act as deployment artifacts between environments (https://doc.dataiku.com/dss/latest/bundles/index.html).
The CI pipeline itself could either be:
* DSS itself, thanks to its extensive automation capabilities: scenarios can be used to automate deployments between environments
* Any existing CI tool like Jenkins, leveraging the extensive API of Dataiku: all aspects of generating artifacts, deploying them to the various environments, activating them, reverting them, running scenarios, running checks and verifying their results, etc. can be done through our REST or Python APIs (https://doc.dataiku.com/dss/latest/python-api/)
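To make the second option concrete, here is a sketch of a Jenkins-style deployment step using `dataikuapi`: export a bundle from the design node, import and activate it on the automation node, then run a validation scenario. The hosts, keys, bundle ID, and scenario ID are placeholders; the method names follow the `dataikuapi` client, but verify them against the documentation for your DSS version.

```python
"""A sketch of a CI deployment step with dataikuapi: design node -> bundle
archive -> automation node -> validation scenario. All identifiers are
placeholders; verify method names against your DSS version's API docs."""


def deploy_bundle(design_host, design_key, automation_host, automation_key,
                  project_key, bundle_id, scenario_id="VALIDATE"):
    # Imported inside the function so the sketch can be read without the package
    import dataikuapi

    # 1. On the design node: create the bundle and download its archive
    design = dataikuapi.DSSClient(design_host, design_key)
    design_project = design.get_project(project_key)
    design_project.export_bundle(bundle_id)
    archive = f"/tmp/{project_key}-{bundle_id}.zip"
    design_project.download_exported_bundle_archive_to_file(bundle_id, archive)

    # 2. On the automation node: import the archive and activate the bundle
    automation = dataikuapi.DSSClient(automation_host, automation_key)
    auto_project = automation.get_project(project_key)
    auto_project.import_bundle_from_archive(archive)
    auto_project.activate_bundle(bundle_id)

    # 3. Run a validation scenario; run_and_wait raises if the run fails,
    # which in turn fails the CI stage
    auto_project.get_scenario(scenario_id).run_and_wait()
```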
And finally, as discussed previously, DSS can integrate with remote Git repositories like GitHub or Bitbucket, both for pushing and pulling code: https://doc.dataiku.com/dss/latest/collaboration/git.html
Hope this helps,
Thanks for this info. The documentation doesn't mention how to work across branches, how the DSS environment deals with PRs, etc.
This might be straightforward, but at the moment I don't see how to e.g. use different branches.
Hi @lhulsta ,
You can create multiple Git branches of a DSS project from the DSS user interface. You can also sync the local Git repository of your project with a remote repository, write commits, revert changes, and push and pull, all from the DSS UI.
Thanks to this "true" Git integration, pull requests can be handled on your Git hosting provider, like Bitbucket, GitHub, or GitLab.
All of this is detailed in our documentation on the Version control of projects page.
Have a great day!