Split built-in Python code environment in two
Context: When Dataiku is installed on a system, it creates a built-in Python code environment based on the Python version supported for that purpose. This was Python v2.7 in old Dataiku versions, but from v9 through v11 Dataiku uses Python v3.6, or Python v3.7 where available, as the built-in Python code environment. v12 then added support for Python v3.9 as the built-in environment. It's worth noting that the built-in environment is a completely different beast from the regular code environments that you create in the Dataiku code environments Admin screen. For instance, Dataiku v11.4 added support for Python v3.11 for code environments (v11.0 also added Python v3.8 and Python v3.9), but as noted above Dataiku v12 currently only supports Python v3.9 for the built-in one. You can specify the built-in Python version in the installation script as follows:
installer.sh -d /path/to/DATA_DIR -P python3.7 -p PORT
While I have not found any documentation stating it, I have been told a few times by several Dataiku employees that we should not install packages in the built-in Python code environment, as it is critical that this code environment remains pristine and in working condition. This is what the documentation says:
The DSS installation phase creates an initial “builtin” Python environment, which is used to run all Python-based internal DSS operations, and is also used as a default environment to run user-provided Python code. This builtin Python environment comes with a default set of packages, suitable for this version of DSS. These are setup by the DSS installer and updated accordingly on DSS upgrades. This builtin environment is not controllable nor configurable by user. Depending on the OS used, a suitable Python version is automatically used.
It is this dual use that I think should be split:
- all Python-based internal DSS operations, and is also used as a default environment to run user-provided Python code
Dataiku will argue that you can use the built-in code environment and that if you need any changes you should create your own code environment using the code environment management functionality that Dataiku provides. I disagree with the status quo for many reasons:
- The tyranny of the default: The majority of people don't tend to change settings unless absolutely required. As a result, a lot of our Dataiku users and projects will end up pointing to the built-in code environment, which is undesired (more on this later)
- It is not possible to restrict the use of the built-in code environment
- While it is possible to set a Global Default Python code env in Administration Settings (Misc), users can still select "Built-in" as the code environment option in most places where they can select a code environment
- Because both users and DSS internal processes point to the same built-in environment, we can't upgrade it without risking an impact on users. This means that we keep relying on Python v3.6, which is now end of life, for our DSS Python internal processes, and that exposes us to unpatched security issues
- It's actually quite hard to recreate the v3.6 built-in code environment for v10 and v11 as a Dataiku code environment. We attempted this in order to move our users off the v3.6 built-in code environment in v10 and v11 and onto a custom one (so we could upgrade the built-in one without impacting users). It took a while to recreate because it clashes with the core packages and the Jupyter support packages, so it has to be defined as a fully custom code env
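To get a feel for how far the "tyranny of the default" reaches on an instance, one could tally which projects end up on the built-in environment. The sketch below works on a plain dictionary and does not call any Dataiku API. The per-project shape (`mode`, `envName`) and the mode strings `USE_BUILTIN_MODE`, `INHERIT` and `EXPLICIT_ENV` are my assumptions about how project code env settings look, and the function and parameter names are made up for illustration.

```python
# Assumed mode strings; check them against your own instance before relying on them.
BUILTIN = "USE_BUILTIN_MODE"
INHERIT = "INHERIT"
EXPLICIT = "EXPLICIT_ENV"


def projects_on_builtin(code_env_settings, instance_default_env=None):
    """Return the sorted project keys whose Python code runs on the built-in env.

    code_env_settings: {project_key: {"mode": ..., "envName": ...}} (assumed shape).
    instance_default_env: name of the Global Default Python code env, or None
    when no global default is set (inheriting projects then fall back to built-in).
    """
    hits = []
    for key, cfg in sorted(code_env_settings.items()):
        mode = cfg.get("mode", INHERIT)
        if mode == BUILTIN:
            # The project explicitly selected "Built-in".
            hits.append(key)
        elif mode == INHERIT and instance_default_env is None:
            # The project inherits, and there is no global default to inherit.
            hits.append(key)
    return hits


# Made-up sample data, only to show the shape of the input.
sample = {
    "SALES": {"mode": BUILTIN},
    "CHURN": {"mode": INHERIT},
    "FRAUD": {"mode": EXPLICIT, "envName": "py39_fraud"},
}
```

With no global default set, both `SALES` and `CHURN` are reported. Setting a global default only removes `CHURN`, which is the point of the bullet above: users can still pick "Built-in" explicitly.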
So what's this idea about? Simply separate the built-in code environment into two:
- One built-in Internal code environment to run all Python-based internal DSS operations. Its Python version is specified in installer.sh, as it is today, and it is created during the install. This code environment cannot be used to execute user code in Dataiku and it cannot be selected in any of the Dataiku screens.
- One new built-in Default code environment. This matches the Python version of the Internal built-in code env above but gets created as a proper Dataiku Code Environment with the "Usable by all" option enabled. This code environment can be used to execute user code in Dataiku. It can also be managed by Dataiku administrators, who can add or remove packages as needed.
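The rule behind the two environments can be sketched in a few lines of Python. This is only a toy model of the proposal, not how Dataiku works today: the class, the environment names `builtin-internal` and `builtin-default`, the version string and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeEnv:
    name: str
    python_version: str
    user_selectable: bool  # may run user code and appear in code env pickers
    admin_managed: bool    # administrators may add or remove packages


# Hypothetical names; both envs share the Python version chosen at install time.
INTERNAL = CodeEnv("builtin-internal", "3.9", user_selectable=False, admin_managed=False)
DEFAULT = CodeEnv("builtin-default", "3.9", user_selectable=True, admin_managed=True)


def selectable_envs(envs):
    """Environments that a code env picker would be allowed to show."""
    return [env for env in envs if env.user_selectable]


def resolve_user_env(requested, envs, default=DEFAULT):
    """Pick the environment for user code; None means 'nothing selected'."""
    if requested is None:
        return default
    by_name = {env.name: env for env in envs}
    if requested not in by_name:
        raise KeyError(f"unknown code env: {requested}")
    env = by_name[requested]
    if not env.user_selectable:
        # The Internal env is reserved for DSS internal operations.
        raise PermissionError(f"{env.name} cannot run user code")
    return env
```

In this model, user code that selects nothing lands on the Default environment, and any attempt to point user code at the Internal environment is rejected, so the Internal one could be upgraded without touching user projects.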
Hope it makes sense. Thanks for reading!