Building a Jenkins pipeline for Dataiku DSS


In this post, we will show how to set up a sample CI/CD (continuous integration / continuous deployment) pipeline built on Jenkins for our Dataiku DSS project. It follows our blog post Continuous integration and continuous deployment (CI/CD) in Dataiku DSS that presents the concepts and some important questions in order to fully optimize CI/CD.

To follow this article comfortably, you should know about DSS flows, scenarios, and automation; I strongly recommend following the Operationalization series from the Dataiku Academy. You also need an understanding of Jenkins, the basics of Artifactory, and basic Python skills, ideally including pytest.

Environment

Our CI/CD environment will be made of:

One Jenkins server (we will be using local executors) with the following Jenkins plugins: Artifactory, GitHub Authentication, Pyenv Pipeline, xUnit
One JFrog Artifactory OSS server
One DSS Design node where Data Scientists will build their ML flows
Two DSS Automation nodes, one for Pre-Production and the other Production

We will use the Dataiku DSS Prediction churn sample project, with a twist at the end with a Python recipe. The code itself is not really important; we need some Python code to showcase code analysis.

Note: You can find all the files used in this project attached to this article as dss_pipeline-master.zip

Pipeline configuration

The first step is to create a project in Jenkins of the “Pipeline” type. Let’s call it dss-pipeline-cicd.

We will use the following parameters for this project:

DSS_PROJECT (String): key of the project we want to deploy (e.g. DKU_CHURN)
DESIGN_URL (String): URL of the design node (e.g. http://dss-dev-design:1000)
DESIGN_API_KEY (Password): API key to connect to this node
AUTO_PREPROD_URL (String): URL of the PREPROD node (e.g. http://dss-preprod-auto:1000)
AUTO_PREPROD_API_KEY (Password): the API key to connect to this node
AUTO_PROD_URL (String): URL of the PROD node (e.g. http://dss-prod-auto:1000)
AUTO_PROD_API_KEY (Password): the API key to connect to this node

The pipeline contains five stages (PREPARE, PROJECT_VALIDATION, PACKAGE_BUNDLE, PREPROD_TEST, and DEPLOY_TO_PROD) and one post action.

The post action removes the bundle zip file from the local Jenkins workspace to save space and retrieves all the xUnit test reports.

One additional global note: we use a global variable, bundle_name, to pass the name of the bundle from one stage to another. This variable is computed by a shell script from the date and time of the run.
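To make the idea concrete, here is a minimal Python sketch of a timestamped bundle name. The pipeline itself computes it in a shell script; the function name make_bundle_name and the exact name format below are assumptions for illustration, not the format used in pipeline.groovy.

```python
from datetime import datetime


def make_bundle_name(project_key, now=None):
    """Build a bundle name that is unique per run (hypothetical format)."""
    # Use the current date and time unless a fixed one is given (handy for tests).
    now = now or datetime.now()
    return "{}_bundle_{}".format(project_key, now.strftime("%Y-%m-%d_%H-%M-%S"))


if __name__ == "__main__":
    print(make_bundle_name("DKU_CHURN"))
```

Because the name embeds the run time, every pipeline run produces a distinct bundle that is easy to trace back to its build.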

You can find the Groovy file of the pipeline in the zip: pipeline.groovy. This file defines the different stages and, for each stage, the details of its steps.

Let’s review those steps one by one.

‘PREPARE’ stage

This stage is used to build a proper workspace. The main tasks are to get all the CI/CD files we need from your GitHub project and build the right Python3 environment using the requirements.txt file.

Since the Dataiku DSS API Python package is retrieved directly from a node, we use the Design node URL provided as a parameter for that.

‘PROJECT_VALIDATION’ stage

This stage contains mostly Python scripts used to validate that the project respects internal rules for being production-ready. Any check can be performed at this stage, be it on the project structure, setup, or the coding parts, such as the code recipes or the project libraries.

Note that we rely on pytest's support for custom command-line arguments, which we enable by adding a conftest.py file.
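As an illustration, a conftest.py that registers such options could look like the sketch below. The option names (--host, --api, --project) and the helper read_pipeline_options are assumptions made for this example; check the conftest.py in the zip for the names the sample really uses.

```python
# conftest.py (sketch): the option names are assumptions for illustration.

def pytest_addoption(parser):
    """pytest hook: register the options the pipeline passes on the command line."""
    parser.addoption("--host", action="store", help="URL of the DSS node")
    parser.addoption("--api", action="store", help="API key for the DSS node")
    parser.addoption("--project", action="store", help="DSS project key")


def read_pipeline_options(config):
    """Hypothetical helper: collect the three options from a pytest config object."""
    return {name: config.getoption(name) for name in ("host", "api", "project")}
```

A test can then request pytest's built-in pytestconfig fixture and call read_pipeline_options(pytestconfig) to get the node URL, API key, and project key that Jenkins passed on the command line.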

This is very specific to each installation, but here are the main takeaways:

In this project, we use the pytest framework to run the tests and report the results to Jenkins. The conftest.py file loads our command-line options. The run_test.py file contains the actual tests, all of them Python functions whose names start with ‘test_’ (the prefix pytest looks for by default).
The checks we have:

There is at least one test DSS scenario (name starts with ‘TEST_’) and one named ‘TEST_SMOKE’
Code complexity of Python recipes is in acceptable ranges. We are using radon for this (using pylint, flake8, or any other code inspection tool is of course also possible)
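The two checks above can be sketched as plain Python functions. This is a simplified illustration, not the code from run_test.py: the function names are hypothetical, and the complexity score is a rough approximation computed with the standard ast module as a stand-in for radon, so its values will not match radon's exactly.

```python
import ast

# Node types counted as decision points in this simplified metric.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def check_test_scenarios(scenario_ids):
    """Return the list of rule violations (an empty list means the project passes)."""
    problems = []
    if not any(s.startswith("TEST_") for s in scenario_ids):
        problems.append("no scenario starts with TEST_")
    if "TEST_SMOKE" not in scenario_ids:
        problems.append("TEST_SMOKE scenario is missing")
    return problems


def rough_complexity(source):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            score += len(node.values) - 1
    return score
```

In a pytest test, the scenario IDs would come from the DSS API and the source from each Python recipe; the test then asserts that check_test_scenarios returns an empty list and that the score stays under a threshold you choose.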

If this stage is OK, we know we have a properly written project, and we can package it.

‘PACKAGE_BUNDLE’ stage

The first part of this stage is using a Python script to create a bundle of the project and download it locally on the Jenkins executor.
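A minimal sketch of that first part is shown below. It assumes project is a project handle from the dataikuapi client, obtained with something like dataikuapi.DSSClient(host, api_key).get_project(project_key); the function name package_bundle and the target file layout are assumptions for illustration.

```python
import os


def package_bundle(project, bundle_id, target_dir="."):
    """Create a bundle on the Design node and download it as a local zip file.

    `project` is assumed to be a dataikuapi project handle.
    """
    # Ask the Design node to build the bundle.
    project.export_bundle(bundle_id)
    # Download the resulting archive next to the other workspace files.
    archive_path = os.path.join(target_dir, bundle_id + ".zip")
    project.download_exported_bundle_archive_to_file(bundle_id, archive_path)
    return archive_path
```

The returned path is what the publishing step then uploads to Artifactory.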

The second part is using a Jenkins stage to publish this bundle on our Artifactory repository generic-local/dss_bundle/. Note that the stage will fail if no file is published (the “failNoOp: true” option) so there is no need for an extra check.

‘PREPROD_TEST’ stage

In this stage, we are deploying the bundle produced at the previous stage on our DSS PREPROD instance and then running tests.

The bundle import is done in import_bundle.py and is straightforward: import, preload, activate. In this example, we assume that connection mappings are handled automatically. If you need specific mappings, this requires some more work using the API (see XXX).
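The import, preload, activate sequence can be sketched as follows. This is an illustration under assumptions, not the content of import_bundle.py: client is assumed to be a dataikuapi.DSSClient connected to the Automation node, and the handling of a project that does not exist yet is our own guess at a reasonable first-deployment path.

```python
def import_bundle(client, project_key, bundle_id, archive_path):
    """Import, preload, and activate a bundle on an Automation node.

    `client` is assumed to be a dataikuapi.DSSClient for that node.
    """
    with open(archive_path, "rb") as archive:
        if project_key in client.list_project_keys():
            # The project already exists: add the bundle to it.
            client.get_project(project_key).import_bundle_from_stream(archive)
        else:
            # First deployment: create the project from the bundle.
            client.create_project_from_bundle_archive(archive)
    project = client.get_project(project_key)
    project.preload_bundle(bundle_id)
    project.activate_bundle(bundle_id)
```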

The run_test.py script then executes all the scenarios named TEST_XXX and fails if any result is not a success.

You can check on the blog article why we are using DSS scenarios.

This pytest configuration has an additional twist. If you have only one test running all the TEST_XXX scenarios, they will be reported to Jenkins as a single test, successful or failed.

Here, we make this nicer by dynamically creating one unit test per scenario. The final report then contains one result per executed scenario, which makes it more meaningful. This requires some understanding of pytest parameterization. Note that you can perfectly well keep a single test that runs all your scenarios if you are not at ease with this.
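The per-scenario parameterization can be sketched with pytest's pytest_generate_tests hook, as below. The helper names (select_test_scenarios, parametrize_scenarios, list_scenario_ids), the argument name scenario_id, and the option names host, api, and project are assumptions for illustration; the sample's own files may be organized differently.

```python
def select_test_scenarios(scenario_ids, prefix="TEST_"):
    """Keep only the scenarios that follow the TEST_ naming convention."""
    return sorted(s for s in scenario_ids if s.startswith(prefix))


def parametrize_scenarios(metafunc, scenario_ids):
    """Generate one test case per TEST_ scenario for tests taking `scenario_id`."""
    if "scenario_id" in metafunc.fixturenames:
        selected = select_test_scenarios(scenario_ids)
        metafunc.parametrize("scenario_id", selected, ids=selected)


def list_scenario_ids(config):
    """Hypothetical helper: ask the DSS node for the project's scenario IDs."""
    import dataikuapi  # imported lazily so this module loads without the client

    client = dataikuapi.DSSClient(config.getoption("host"), config.getoption("api"))
    project = client.get_project(config.getoption("project"))
    return [scenario["id"] for scenario in project.list_scenarios()]


def pytest_generate_tests(metafunc):
    """pytest hook, called once for each collected test function."""
    if "scenario_id" in metafunc.fixturenames:
        parametrize_scenarios(metafunc, list_scenario_ids(metafunc.config))
```

With this in place, a single test function that takes a scenario_id argument, runs that scenario, and asserts a successful outcome appears in the Jenkins report as one result per scenario.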

‘DEPLOY_TO_PROD’ stage

The previous stage verified that we have a valid package. It’s time to move it to production!

Again using Python with script import_bundle.py, we will upload the bundle to the production node using the same logic as in PREPROD.

The second script prod_activation.py will handle the activation and the rollback. For that, here are the main steps:

Get the current active bundle. Note we need to iterate through the projects to find the indication.
The uploaded bundle is preloaded and then activated within a ‘try’ statement. Since the activation is an atomic operation, a failure at this point leaves the previous bundle active, so there is nothing to roll back.
In order to ensure the new bundle is working, we execute the TEST_SMOKE scenario.
If TEST_SMOKE execution fails, we perform the rollback by re-activating the previous bundle.
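The steps above can be sketched as a single function. This is a simplified illustration rather than the content of prod_activation.py: project is assumed to be a dataikuapi project handle on the production node, and smoke_test is a hypothetical callable that runs the TEST_SMOKE scenario and returns True on success.

```python
def activate_with_rollback(project, new_bundle_id, previous_bundle_id, smoke_test):
    """Activate a bundle and roll back if the smoke test fails.

    Returns True if the new bundle stays active, False otherwise.
    """
    try:
        project.preload_bundle(new_bundle_id)
        project.activate_bundle(new_bundle_id)
    except Exception:
        # Activation is atomic: the previous bundle is still active.
        return False
    if smoke_test():
        return True
    # Smoke test failed: re-activate the bundle that was active before.
    project.activate_bundle(previous_bundle_id)
    return False
```

Note that the rollback must re-activate previous_bundle_id, not the bundle that was just deployed; a non-successful return value is what the script would turn into a failed Jenkins stage.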

Post Actions

The Post Actions phase allows us to clean locally downloaded zip files and publish all test xUnit reports in Jenkins to have a nice test report. Those reports were produced all along the pipeline by pytest and are here aggregated into a single view to have something like:

Test trends on the project dashboard

Test result details on a given run

How to use this sample?

Now we have seen a step-by-step demonstration of how to build a solid CI/CD pipeline with Jenkins. If you want to use this and adapt it to your setup, here is a checklist of what you need to do:

Have your Jenkins, Artifactory, and DSS nodes installed and running
Make sure to have Python 3 installed on your Jenkins executor
Get all the Python scripts for the project and put them in your own source code repository
Create a new pipeline project in your Jenkins:

Add the variables as project parameters and assign them a default value according to your setup
Copy/paste the pipeline.groovy as Pipeline script

In the pipeline, set up your source code repository in the PREPARE stage

And then hit ‘Build with parameters’.

How to improve it?

You can of course improve this startup kit, and here are some ideas:

Define a trigger for your pipeline:

A Time trigger, to run the pipeline every day, for example
A Jenkins GUI manual trigger where users connect to Jenkins and trigger the job

An API trigger, by calling Jenkins from a DSS webapp, scenario, or macro (using the Generic Webhook Trigger plugin or the Remote Access API)
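For the API trigger, a call to the Jenkins Remote Access API could be prepared as in the sketch below, for example from a Python step in a DSS scenario. The Jenkins URL, user name, and token are placeholders, and the function name build_trigger_request is hypothetical; the sketch assumes the job is reachable at /job/<name>/buildWithParameters and that the user authenticates with a Jenkins API token.

```python
import base64
import urllib.parse
import urllib.request


def build_trigger_request(jenkins_url, job_name, user, api_token, parameters):
    """Prepare a POST request that starts a parameterized Jenkins job."""
    query = urllib.parse.urlencode(parameters)
    url = "{}/job/{}/buildWithParameters?{}".format(
        jenkins_url.rstrip("/"), urllib.parse.quote(job_name), query
    )
    # HTTP basic authentication with the user's API token.
    credentials = base64.b64encode("{}:{}".format(user, api_token).encode()).decode()
    request = urllib.request.Request(url, data=b"", method="POST")
    request.add_header("Authorization", "Basic " + credentials)
    return request


# Sending it is then a single call:
# urllib.request.urlopen(build_trigger_request(...))
```

Keeping the request construction separate from the network call makes it easy to check the URL and parameters before wiring it into a scenario.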

You can also add a manual sign-off in this process if you are not confident. The easiest way is to use the Jenkins manual input step.

And of course, add as many test scenarios as possible that will ensure a reliable continuous deployment.

arielma2304 · ‎11-16-2020

Refer to this issue:
Pipeline-fails-with-import-error-for-dataiku

arielma2304 · ‎12-08-2020

Can you share an example on how to use pylint here instead of radon?

arielma2304 · ‎12-21-2020

I think there is an error in the file 'prod_activation.py' line 50:

should be 'previous_bundle_id' instead of 'bundle_id'

fsergot · ‎01-04-2021

Thanks @arielma2304 for your feedback on this article. The code has been fixed in the sample.

As for the pylint sample, I would suggest trying the following: grab the Python code from DSS, save it locally as a Python file, and then pass this file to pylint with something like pylint mymodule.py.
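A rough sketch of that suggestion, assuming the recipe code has already been fetched from DSS as a string and that pylint is installed in the Python environment of the Jenkins executor (the function name lint_recipe_code is hypothetical):

```python
import os
import subprocess
import sys
import tempfile


def lint_recipe_code(code, recipe_name, workdir=None, run=subprocess.run):
    """Save recipe code to a local file and run pylint on it.

    Returns pylint's exit code (0 means no message was issued).
    """
    workdir = workdir or tempfile.mkdtemp()
    path = os.path.join(workdir, recipe_name + ".py")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(code)
    # Run pylint as a module of the current interpreter.
    result = run([sys.executable, "-m", "pylint", path])
    return result.returncode
```

A pytest test could then assert that the returned code is 0, or parse pylint's output if you prefer a score threshold.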

arielma2304 · ‎01-06-2021

Hi

About the file 1_project_validation/run_test.py, any idea how I can get results for multiple scenarios whose names start with TEST? Assuming I have 5 scenarios in the project, and 2 of them start with TEST, I want to see 2 test results.

fsergot · ‎01-11-2021

Good morning,

The example for the preproduction unit tests is using a different approach to tests by using pytest parametrization with the pytest_generate_tests() function. This allows you to run and return as many test results as there are scenarios to execute.

I am not sure I see what status check you want to perform in your case. The preparation step in the demo is about checking that the structure of the project is good, so the existence or number of scenarios is the metric to check. Here, do you want to execute the scenarios and check their status (in which case, the 3_preprod_test example is good), or do you just want to have one fake test result for each existing scenario (using the parametrization with a simple 'assert True' would do the job, although it looks a bit weird)?