Versioning a Dataiku project
To help you understand, I will explain our use case in more detail. Our Dataiku projects usually have specific names that clearly state what the model is for. So far we have not been recalibrating our models (maybe that is an idea for the future); instead, new models are usually built from scratch and replace the old ones. Both may be in use at the same time for a while.

The problem is that when we start a new model we want to work in the same Dataiku project, but that project contains the flow that was used to build the old model. We don't want to create another flow side by side, for several reasons: the flow will look messy (imagine what we will have after creating not 2 but 5 models or even more), common dataset or folder names are already in use, which is also inconvenient, and there may be other reasons. At the same time, we don't want to delete the old flow because we might need it in the future. So basically we want to be able to switch between different flows (versions) in the same project. What would be the best way to solve this?
I had a couple of ideas, but I haven't tried to implement them from start to end. One idea was to use bundles. If I understand correctly, we can download bundles, which act as versions of a project and can be stored anywhere we want. Correct me if I am wrong and that is not possible. However, I don't like this option very much because switching between versions might be complicated and inconvenient. The other idea was to treat it like a typical "coding" project and use version control (Git) to make project versions. I believe the best way to do this would be to set up a remote repository for the Dataiku project (a GitHub repo, for example), push changes there, then create a completely new branch and push that branch into Dataiku. In other words, just have different branches for different versions and be able to switch between them using Git. Is that possible? Or can you suggest any other option?
Answers
-
A Customer Success Manager will get back to you by email with the different options and trade-offs.
-
Hi, I'd like to find out more about this too - I'd like to reference a version of a project for audit. Is there anything that can be shared?
-
I can share some methods used at my company. We need versioning for audits as well, which means we don't use previous versions during development.
1. Bundles. You can use bundles as version markers. You can switch back and forth between bundle-versions, or download and import the bundle as a separate copy of the project. You can create bundles programmatically, allowing you to automatically create a bundle as part of a scenario.
2. Project exports. These can be created programmatically as well, and you can store the zip in a managed folder in your flow.
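To make options 1 and 2 a bit more concrete, here is a minimal sketch of what a programmatic snapshot could look like. It assumes `project` is a `dataikuapi` project handle (for example obtained via `dataikuapi.DSSClient(host, api_key).get_project("MY_PROJECT")`) and that its `export_bundle` and `export_to_file` methods behave as in the public Python API; please check them against your DSS version. The `make_bundle_id` naming convention and the `snapshot_project` helper are hypothetical, not something Dataiku prescribes.

```python
from datetime import datetime, timezone


def make_bundle_id(model_name, version, when=None):
    """Build a bundle id such as 'churn-v3-20240131' (hypothetical convention)."""
    when = when or datetime.now(timezone.utc)
    return f"{model_name}-v{version}-{when:%Y%m%d}"


def snapshot_project(project, model_name, version, export_path=None, when=None):
    """Create a bundle as a version marker and optionally export the project.

    `project` is assumed to be a dataikuapi DSSProject handle.
    """
    bundle_id = make_bundle_id(model_name, version, when)
    # Option 1: create a bundle with a predictable id.
    project.export_bundle(bundle_id)
    # Option 2: also write a full project export (zip) to a local path.
    if export_path is not None:
        project.export_to_file(export_path)
    return bundle_id
```

Called from a scenario step, something like `snapshot_project(project, "churn", 3)` would create one bundle per run. Writing the export to a local path is a simplification; storing the zip in a managed folder, as described above, would need the managed folder API instead.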
Another option would be to use Git, as mentioned in the original post. We still need to set up our remote, and once that happens we will look into using the repositories for versioning. It would be great if we could tag commits, but that looks difficult because Dataiku creates commits automatically.
-
Thanks for this. I was indeed looking at options 1 and 2, but not Git. I didn't know you could do 1 or 2 programmatically. I suspect bundles are the approach I'll try first, with a naming convention. I'm still exploring how to spot what has changed between different bundle versions. Thanks again.
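One rough way to spot what has changed between two downloaded bundles is to compare the archives file by file. The sketch below assumes the bundles have been downloaded as zip archives and only uses Python's standard `zipfile` module; it reports which files were added, removed, or changed, not what changed inside them. The function names are made up for illustration.

```python
import zipfile


def zip_fingerprint(path):
    """Map each file in the archive to its (CRC32, uncompressed size)."""
    with zipfile.ZipFile(path) as zf:
        return {
            info.filename: (info.CRC, info.file_size)
            for info in zf.infolist()
            if not info.is_dir()
        }


def diff_bundles(old_path, new_path):
    """List files added, removed, or changed between two zip archives."""
    old = zip_fingerprint(old_path)
    new = zip_fingerprint(new_path)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }
```

For the files listed under "changed", extracting both versions and running an ordinary text diff on them would show the actual differences.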
-
I would suggest using custom fields to manage versions. I've found it very useful.
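As a rough illustration of the custom-fields idea, the sketch below stores a version label in the project metadata. It assumes `project` is a `dataikuapi` project handle whose `get_metadata()` returns a dict with a `customFields` entry and whose `set_metadata()` saves it back; the field name `model_version` and both helper functions are hypothetical, and a custom field may first need to be defined by an administrator before it shows up in the UI.

```python
def set_project_version(project, version, field="model_version"):
    """Write a version label into the project's custom fields."""
    metadata = project.get_metadata()
    # Create the customFields dict if the project has none yet.
    metadata.setdefault("customFields", {})[field] = version
    project.set_metadata(metadata)
    return metadata["customFields"][field]


def get_project_version(project, field="model_version"):
    """Read the version label back, or None if it was never set."""
    return project.get_metadata().get("customFields", {}).get(field)
```

This only records which version a project is at; it does not snapshot the flow, so it works best combined with bundles or exports.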