
How to keep inter-project references working upon export and import to other/same DSS instance?

pvannies
Level 2

Dear community,

Our modeling team has finished their project - consisting of tens of Dataiku projects - and will hand it over to a model validation team. The model validation team will rerun all the Dataiku projects (containing SQL, Python, and visual recipes, plus scenarios) and validate the approach starting from a clean database. We are looking for a general way of working that causes no problems for the copied projects, which have the following inter-project references:

  • shared datasets 
  • calling scenarios from other projects within one master scenario

When the validation can be performed on another DSS instance, we do not have to alter the unique ProjectKey, so the references still work after export and import of the projects (after the normal remapping of connections and code environments).
However, this is not always an option: sometimes the validation has to be performed on the same DSS instance, where we have to change the ProjectKey for each of the tens of imported projects. We then have a problem with the inter-project references above.

  • Do you have experience that you can share or advice for our use case?
  • Is there a way to alter these references programmatically?
  • Are there any other places (besides the shared datasets and the scenarios) where inter-project references can occur?
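
For the master-scenario case, this is the kind of programmatic remapping I have in mind. It is only a sketch over a fake step list: the field names (`'params'`, `'otherProjectKey'`, `'scenarioId'`) are assumptions about the raw scenario JSON, not verified against the DSS API.

```python
# Sketch only: remap the target project key of "run scenario" steps.
# The step structure and field names are assumed, not verified.
def remap_scenario_steps(raw_steps, project_key_mapping):
    """Rewrite cross-project references in a list of raw scenario steps."""
    for step in raw_steps:
        params = step.get('params', {})
        old_key = params.get('otherProjectKey')
        if old_key in project_key_mapping:
            params['otherProjectKey'] = project_key_mapping[old_key]
    return raw_steps

# Example with a fake step list
steps = [{'type': 'run_scenario',
          'params': {'otherProjectKey': 'SHAREDPROJECT', 'scenarioId': 'RUN_ALL'}}]
remapped = remap_scenario_steps(steps, {'SHAREDPROJECT': 'NEWNAME_SHAREDPROJECT'})
```

In DSS itself, I imagine one would fetch the scenario's raw steps via something like `project.get_scenario(id).get_settings()`, apply such a remapping, and save - but I have not verified the exact attribute names.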

As an example, I found that exposing the datasets to the new projects is possible programmatically - see the code below.
After that, one would also have to change the inputs of the recipes that use this dataset in the project it is shared with, as well as the code of any Python recipe that has the dataset name hardcoded. Right?

import dataiku

client = dataiku.api_client()
project = client.get_default_project()
settings = project.get_settings()

# user-defined: map old (exporting) project keys to their new keys
mapping = {'SHAREDPROJECT': 'NEWNAME_SHAREDPROJECT'}
OBJECT_TYPE = 'DATASET'  # only remap exposed objects of this type

# for every exposed dataset, add an extra exposure rule
# targeting the renamed project
for item in settings.get_raw()['exposedObjects']['objects']:
    if item['type'] == OBJECT_TYPE:
        for rule in item['rules']:
            target = rule['targetProject']
            if target in mapping:
                settings.add_exposed_object(OBJECT_TYPE, item['localName'], mapping[target])
settings.save()
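
On the "change the inputs of the recipes" point, the renaming itself can also be scripted. Below is a self-contained sketch of the reference rewriting, using a hypothetical helper (not a DSS API) for qualified dataset references of the form `PROJECTKEY.dataset_name`. In DSS the actual change would presumably go through each recipe's settings object, but the helper shows the mapping logic:

```python
# Sketch: rewrite a qualified dataset reference ("PROJECTKEY.dataset")
# according to a project-key mapping. Hypothetical helper, not a DSS API.
def remap_dataset_ref(ref, project_key_mapping):
    if '.' in ref:
        project_key, dataset_name = ref.split('.', 1)
        new_key = project_key_mapping.get(project_key, project_key)
        return f'{new_key}.{dataset_name}'
    return ref  # local (unqualified) references are left untouched

mapping = {'SHAREDPROJECT': 'NEWNAME_SHAREDPROJECT'}
remap_dataset_ref('SHAREDPROJECT.customers', mapping)  # qualified ref is remapped
remap_dataset_ref('customers', mapping)                # local ref unchanged
```

The dataset names here are made up for illustration. The same string remapping could be applied to hardcoded names in Python recipe code, though a search-and-replace over the recipe payloads would need care with partial matches.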

Thanks for your help!

2 Replies
Marlan
Neuron

Hi @pvannies,

No solution for your exact problem, but I thought sharing our approach to sharing datasets / tables across projects might still be of interest.

We have ended up generally not using shared datasets. Instead, we reference SQL tables from another project directly in our SQL script recipes.

We have DSS instances for development, test, and production tiers. Shared datasets only work within the same instance, and we often want to reference the production version of a table from project A in all tiers (dev, test, and prod) of project B. Sometimes we use a variable to indicate which version of the other project's table we want to use (i.e., which database / schema prefix), but typically we reference the production version so we can test the current project's logic against production data.

This approach also keeps the development and deployment of each project more independent (rather than potentially having to make synchronized updates to both projects across dev, test, and prod instances). The drawback, of course, is that we lose the explicit dependency that shared datasets provide.
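
To make the variable idea concrete, here is a minimal sketch of building the qualified table name from a project variable inside a recipe. The variable name and table name are made up, and the variables dict stands in for whatever DSS variable lookup the recipe would actually use:

```python
# Sketch: build a fully qualified table reference from a project variable.
# 'other_project_schema' and 'CUSTOMER_SCORES' are made-up names; the dict
# stands in for the project's variable lookup inside a recipe.
variables = {'other_project_schema': 'PRODDB.PROJ_A'}

table_ref = f"{variables['other_project_schema']}.CUSTOMER_SCORES"
query = f"SELECT * FROM {table_ref}"
```

In a SQL recipe the equivalent would be the usual `${variable}` substitution directly in the query text, with the variable set per tier (dev/test/prod).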

Marlan

pvannies
Level 2
Author

Hi @Marlan,

Thanks for sharing your way of working! 
We will take it into account and see if this approach can help the teams in their future way of working. 
