Is there an easy way to identify which version of a project (maybe by Git hash?) was used to retrain a particular model? The use case I'm considering is as follows. I would like to version my training data so that each time the training data in the flow is updated and the model is retrained (manually or through a scenario), another scenario (probably Python) runs and creates a backup of that training dataset in my RDBMS (or on S3) that I can link back to the project at that point in time. Has anyone done something like this before? I was thinking of possibly using a Git hash and the date.
For the first part of your setup, I think you can:
Can you clarify what specific information you would want to see for a project at a specific point in time?
If I understand correctly, you would like to snapshot a project and tie that snapshot to a specific set of training data. If so, one option would be to create a bundle of the project and use that bundle as your project snapshot. You could then use the bundle name to label the data in S3 or your RDBMS, which ties the data back to that particular snapshot of the project. If you are interested in this approach, it can be done through the Python API and could be incorporated into step #3 above. Here’s some starter code for how this could be done:
    import dataiku
    import uuid

    # get project info
    client = dataiku.api_client()
    project = client.get_project('PROJECT_NAME')

    # generate project hash to tag the project
    project_hash = uuid.uuid4().hex

    # create a project snapshot by creating a bundle, labeled with the project_hash
    project.export_bundle('version_' + project_hash)

    # save your training data backup to s3 using the project_hash
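Since you mentioned wanting the date in the label as well, here is a rough sketch of how the bundle name and the backup location could be built from the same identifier. It uses only the standard library and does not call Dataiku or S3. The helper names (`make_bundle_id`, `backup_key`), the `training-backups/` prefix, the `training_data` dataset name, and the `.csv` extension are all made up for illustration, so adapt them to your own layout.

```python
import uuid
from datetime import datetime, timezone


def make_bundle_id(now=None):
    """Build a label like 'version_20240131T120000Z_<32 hex chars>'.

    Assumes `now` is a timezone-aware UTC datetime; defaults to the current time.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    # The random UUID keeps labels unique even if two runs share a timestamp.
    return f"version_{stamp}_{uuid.uuid4().hex}"


def backup_key(project_key, dataset_name, bundle_id):
    """Hypothetical storage layout: one prefix per project, one folder per bundle."""
    return f"training-backups/{project_key}/{bundle_id}/{dataset_name}.csv"


# Fixed timestamp here only to make the example reproducible.
bundle_id = make_bundle_id(datetime(2024, 1, 31, 12, 0, 0, tzinfo=timezone.utc))
key = backup_key("PROJECT_NAME", "training_data", bundle_id)
print(key)
```

The same `bundle_id` string would then be passed to `project.export_bundle(...)` and used as the S3 key (or as a column value in an RDBMS backup table), so a backup can be matched to its bundle by name alone.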