Generating Project Identifier for Versioning Training Data

tim-wright

Is there an easy way to identify which version of a Project (maybe by git hash?) was used to retrain a particular model? The use case I'm considering is as follows. I would like to version my training data: each time the training data in the flow is updated and the model is retrained (either manually or through a scenario), another scenario (probably Python) would run and create a backup of that training dataset in my RDBMS (or on S3) that I can link back to the project as it was at that point in time. Has anyone done something like this before? I was thinking of possibly using a git hash and the date.

 

Thanks,

Tim  

SarinaS
Dataiker

Hi Tim,

For the first part of your setup, I think you can:

  1. Create a scenario that uses the “Trigger on dataset change” trigger.
  2. Add a “Build / Train” step to your scenario to perform your model training.
  3. Add an “Execute Python code” step to your scenario to back up your data to S3 or your RDBMS.

Can you clarify what specific information you would want to see for a project at a specific point in time?

If I understand correctly, you would like to snapshot a project and tie that snapshot to a specific set of training data. If so, one option would be to create a bundle of the project and use that bundle as your project snapshot. You could then use the bundle name to label the data in S3 or your RDBMS, which ties the data back to that particular snapshot of the project. If you are interested in this approach, it can be done through the Python API and could be incorporated into step #3 above. Here’s some starter code for how this could be done:

import dataiku
import uuid

# get a handle on the project (replace 'PROJECT_NAME' with your project key)
client = dataiku.api_client()
project = client.get_project('PROJECT_NAME')

# generate a unique identifier to tag the project snapshot
# (this is a random UUID, not a hash of the project contents)
project_hash = uuid.uuid4().hex

# create a project snapshot by creating a bundle, labeled with the project_hash
bundle_id = 'version_' + project_hash
project.export_bundle(bundle_id)

# save your training data backup to S3 (or your RDBMS), labeled with the same bundle_id
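
Since you mentioned wanting the date in the identifier as well, here is a rough sketch of how the label and the backup location could be built together. It is plain Python with no Dataiku calls; the function names make_bundle_id and backup_key, the "backups/..." path layout, and the .csv extension are all hypothetical choices for illustration, so adjust them to whatever naming your S3 bucket or RDBMS uses.

```python
import uuid
from datetime import datetime, timezone


def make_bundle_id(now=None, token=None):
    """Build a label like 'version_20240102T030405Z_ab12cd34'.

    The UTC timestamp makes labels sort chronologically; the short random
    token keeps two snapshots taken in the same second distinct.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    token = token or uuid.uuid4().hex[:8]
    return "version_{}_{}".format(now.strftime("%Y%m%dT%H%M%SZ"), token)


def backup_key(project_key, dataset_name, bundle_id):
    """Build a hypothetical S3 key for one dataset backup.

    Because the bundle ID is part of the key, the backup can be traced
    back to the project bundle created with the same ID.
    """
    return "backups/{}/{}/{}.csv".format(project_key, bundle_id, dataset_name)
```

You would pass the value returned by make_bundle_id() to project.export_bundle(...) in place of 'version_' + project_hash, and then reuse that same value when naming the backup.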

 

Thanks,
Sarina
