Get shared projects using Dataiku API

osk · 04-16-2019

Hi there,

I am looking for a way to get the database keys and names that are shared into my project using the Dataiku API.

I tried the following:


import dataiku
client = dataiku.api_client()  # assumes the code runs inside DSS
project = client.get_project('PROJECT_NAME')
datasets = project.list_datasets()

Using datasets[index_of_database]['params']['table'], I get the name of a database.
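The same lookup can be applied to every entry at once instead of one index at a time. This is a minimal sketch that assumes each item returned by list_datasets() behaves like a dict with a 'name' key and an optional 'params' -> 'table' entry; the helper name table_names is made up for illustration.

```python
def table_names(datasets):
    """Map dataset name -> params['table'] for every dataset that has a table."""
    tables = {}
    for ds in datasets:
        table = ds.get("params", {}).get("table")
        if table is not None:
            tables[ds["name"]] = table
    return tables


# Stand-in data shaped like the list_datasets() result (assumption)
example_datasets = [
    {"name": "customers", "params": {"connection": "pg", "table": "CUSTOMERS"}},
    {"name": "uploaded_file", "params": {}},
]
print(table_names(example_datasets))
```

Datasets without a table (uploaded files, for instance) are skipped rather than raising a KeyError.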

However, the API call does not include databases which are shared into my project.

The background is that I want to find dependencies between projects (e.g. if a database from project A is shared into project B, then project A needs to be built first).

I am looking forward to your help.

Best,

Oliver

UserBird · 04-16-2019

Hi, this code snippet can help you get the list of shared datasets + their connections.


import dataiku

client = dataiku.api_client()
for project_key in client.list_project_keys():
    print("*** EXPOSED FROM PROJECT %s ***" % project_key)
    p = client.get_project(project_key)
    # Each entry describes one object that this project shares with other projects
    for exposed_object in p.get_settings().get_raw()["exposedObjects"]["objects"]:
        connection = None
        if exposed_object["type"] == "DATASET":  # only datasets have a connection
            connection = p.get_dataset(exposed_object["localName"]).get_definition().get('params', {}).get('connection')
        print("    Object id=%s type=%s db=%s is exposed to projects:" % (exposed_object["localName"], exposed_object["type"], connection))
        for rule in exposed_object["rules"]:
            print("      %s" % rule["targetProject"])
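Since the question is about datasets shared *into* one project, the same settings can be read the other way round. This is a minimal sketch, not an official API: the function name datasets_shared_into and the exposed_by_project mapping are made up for illustration, and the stand-in data only mimics the "exposedObjects" structure used in this post.

```python
def datasets_shared_into(target_key, exposed_by_project):
    """Return sorted (source_project, dataset_name) pairs shared into target_key.

    exposed_by_project is a hypothetical mapping of project key to that
    project's raw settings["exposedObjects"]["objects"] list.
    """
    shared = []
    for source_key, objects in exposed_by_project.items():
        for obj in objects:
            if obj.get("type") != "DATASET":
                continue  # ignore shared models, folders, etc.
            targets = [rule["targetProject"] for rule in obj.get("rules", [])]
            if target_key in targets:
                shared.append((source_key, obj["localName"]))
    return sorted(shared)


# Stand-in data shaped like the raw project settings (assumption)
example = {
    "PRJA": [{"type": "DATASET", "localName": "customers",
              "rules": [{"targetProject": "PRJB"}]}],
    "PRJC": [{"type": "DATASET", "localName": "orders",
              "rules": [{"targetProject": "PRJD"}]}],
}
print(datasets_shared_into("PRJB", example))  # [('PRJA', 'customers')]
```

On a real instance the mapping could be built by calling get_settings().get_raw()["exposedObjects"]["objects"] for each key from client.list_project_keys().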

Cheers,

osk · 04-16-2019

Thanks a lot, Du. Very helpful!

tomas · 04-17-2019

If you want to check whether a shared (exported) dataset is actually used downstream (i.e. it is an input of a recipe in the other project), you can use something like this:


import re

def get_shared_datasets(client, project_key=None, direction='from'):
    # Returns the shared datasets that are actually used as recipe inputs:
    #  1. from a given project (direction = 'from')
    #   i.e. it returns all the datasets that are exported (shared) from this project
    #   and are used. For example, if DS1 is exported from PRJA to PRJB,
    #   it is reported only if PRJB has a recipe reading PRJA.DS1.
    #  2. or to a given project (direction = 'to')
    #   i.e. it returns all the datasets that are imported into this project
    #   and are used. For example, if DS1 is imported from PRJB into PRJA,
    #   it is reported only if PRJA has a recipe reading PRJB.DS1.
    # project_key can be a <str> or a <list> of <str>.
    # If project_key is None, the used shared datasets of every project are returned.
    # Result is a dict with structure:
    # {u'PROJECT_KEY_A':
    #       {u'dataset_A': [u'CHILD_PROJECT_A'],
    #        u'dataset_B': [u'CHILD_PROJECT_A',u'CHILD_PROJECT_B'],
    #         ... },
    #  u'PROJECT_KEY_B':
    #       { .. }
    # }
    # client = dataiku.api_client()
    projects = []
    if isinstance(project_key, str):
        projects = [project_key]
    if isinstance(project_key, list):
        projects = project_key
    # References to datasets of another project have the form PROJECTKEY.dataset_name
    patt = re.compile(r'\w+\.\w+')
    shared_datasets = {}
    for project in client.list_projects():
        prj = client.get_project(project['projectKey'])
        for r in prj.list_recipes():
            if 'inputs' in r:
                if 'main' in r['inputs']:
                    if 'items' in r['inputs']['main']:
                        for inp in r['inputs']['main']['items']:
                            if patt.match(inp['ref']):
                                proj_ds = inp['ref'].split('.')
                                if project_key is None or (proj_ds[0] in projects and direction == 'from') or\
                                        (project['projectKey'] in projects and direction == 'to'):
                                    if proj_ds[0] not in shared_datasets:
                                        shared_datasets[proj_ds[0]] = {}
                                    if proj_ds[1] not in shared_datasets[proj_ds[0]]:
                                        shared_datasets[proj_ds[0]][proj_ds[1]] = []
                                    if project['projectKey'] not in shared_datasets[proj_ds[0]][proj_ds[1]]:
                                        shared_datasets[proj_ds[0]][proj_ds[1]].append(project['projectKey'])
    return shared_datasets
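To get from this dict to the build order asked about at the top of the thread, the project dependencies can be sorted topologically. This is a minimal sketch using Kahn's algorithm; the function name project_build_order is made up for illustration, and it only assumes the result structure described in the comments above.

```python
from collections import defaultdict, deque


def project_build_order(shared_datasets):
    """Order projects so that each source project comes before its consumers.

    shared_datasets has the shape {source_project: {dataset: [consumer, ...]}}.
    """
    consumers = defaultdict(set)  # source project -> projects reading from it
    indegree = defaultdict(int)   # project -> number of projects it depends on
    for source, datasets in shared_datasets.items():
        indegree[source] += 0     # register the source even if nothing feeds it
        for children in datasets.values():
            for child in children:
                if child != source and child not in consumers[source]:
                    consumers[source].add(child)
                    indegree[child] += 1

    # Start with the projects that depend on no other project
    ready = deque(sorted(p for p, d in indegree.items() if d == 0))
    order = []
    while ready:
        project = ready.popleft()
        order.append(project)
        for child in sorted(consumers[project]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(indegree):
        raise ValueError("circular sharing between projects")
    return order


example = {
    "PRJA": {"ds1": ["PRJB"], "ds2": ["PRJB", "PRJC"]},
    "PRJB": {"ds3": ["PRJC"]},
}
print(project_build_order(example))  # ['PRJA', 'PRJB', 'PRJC']
```

Projects that neither share nor consume a used dataset do not appear in the result, so they can be built at any point.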
