List of all projects and their respective tables in Dataiku

sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron

Hi Team,

Is there a way to extract the list of projects and their respective tables from Dataiku?

Thanks in Advance

Best Answer

  • VitaliyD
    VitaliyD Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 102 Dataiker
    Answer ✓

    Hi,

    The exception "Dataset cannot be used: Declare it as input or output of your recipe" means that you are trying to write to a dataset that was not declared as an output of the recipe. To fix it, add the dataset as an output of the recipe, as in the example below:

    [Screenshot: recipe input/output settings with the dataset declared as an output]

    Additionally, some datasets may have no "versionTag", so to handle those cases I would recommend adding a try/except to your code, as below:

    import dataiku
    import json
    import datetime
    import re
    from dataiku import pandasutils as pdu
    import pandas as pd

    client = dataiku.api_client()
    projects = client.list_projects()
    df = pd.DataFrame(columns=['project', 'dataset', 'table_name', 'lastModifiedBy', 'lastModifiedOn'])
    for p in projects:
        proj = client.get_project(p["projectKey"])
        all_datasets = proj.list_datasets()
        for dataset in all_datasets:
            versionTag = dataset.get('versionTag')
            try:
                lastModifiedBy = versionTag['lastModifiedBy'].get('login')
                lastModifiedOn = datetime.datetime.fromtimestamp(versionTag['lastModifiedOn'] / 1e3)
            except (TypeError, KeyError):
                lastModifiedBy = ""
                lastModifiedOn = ""
            table_name = dataset['params'].get('table')
            projectKey = dataset.get('projectKey')
            if table_name is not None:
                table_name = (lambda a, b: re.sub(r'\$\{projectKey\}', a, b))(projectKey, table_name)
            df = df.append({'project': p["projectKey"], 'dataset': dataset.name,
                            'table_name': table_name, 'lastModifiedBy': lastModifiedBy,
                            'lastModifiedOn': lastModifiedOn}, ignore_index=True)
    df = df.drop_duplicates()

    Hope this helps.

    -Best

Answers

  • VitaliyD

    Hi, if by tables you mean datasets, you can use the Dataiku API. Something like the following:

    import dataiku
    import json
    from dataiku import pandasutils as pdu
    import pandas as pd

    client = dataiku.api_client()
    projects = client.list_projects()
    df = pd.DataFrame(columns=['project', 'dataset'])
    for p in projects:
        proj = client.get_project(p["projectKey"])
        all_datasets = proj.list_datasets()
        for dataset in all_datasets:
            df = df.append({'project': p["projectKey"], 'dataset': dataset.name}, ignore_index=True)
    df = df.drop_duplicates()
    print(df)

    The above will create a data frame that you can save to a dataset if required.

    Hope the above is what you were looking for.

    -Best

  • sj0071992

    Hi,

    This is really helpful.

    But instead of the dataset name, can we get the table name behind that dataset? I have to use it in a join, and in some cases the project key is appended to the dataset name (table name).

    Is there any way to extract the same?

    Thanks in Advance

  • VitaliyD

    Hi,

    If you need to display table names for SQL-like datasets, you can get them from the dataset params. However, if you are using variables in the names, the name is saved as a string like "${projectKey}_dataset_name", so you will need to replace the variable yourself. I modified the code I shared earlier; please check it below:

    import dataiku
    import json
    import re
    from dataiku import pandasutils as pdu
    import pandas as pd

    client = dataiku.api_client()
    projects = client.list_projects()
    df = pd.DataFrame(columns=['project', 'dataset', 'table_name'])
    for p in projects:
        proj = client.get_project(p["projectKey"])
        all_datasets = proj.list_datasets()
        for dataset in all_datasets:
            table_name = dataset['params'].get('table')
            projectKey = dataset.get('projectKey')
            if table_name is not None:
                table_name = (lambda a, b: re.sub(r'\$\{projectKey\}', a, b))(projectKey, table_name)
            df = df.append({'project': p["projectKey"], 'dataset': dataset.name, 'table_name': table_name}, ignore_index=True)
    df = df.drop_duplicates()
    print(df)

    Hope this helps.

    -Best

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 316 Neuron

    Hi @sj0071992,

    Below is an alternative approach to the one @VitaliyD shared. It has the advantage that you don't have to replace variables yourself: variables other than ${projectKey} can appear in a table name, so it is nice that such situations are handled for you. It does involve using two additional classes available through the Python APIs, which may be a bit confusing: the dataikuapi.dss.dataset.DSSDataset class and the dataiku.Dataset class.

    Documentation for these two classes is here: https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html

    The get_location_info method used here returns a number of other pieces of information as well (illustrated in the code sample). This information is not unique to this method, but the simplicity of the returned data structure makes it perhaps a bit more convenient than other options.

    import dataiku

    client = dataiku.api_client()
    proj = client.get_project('DKUTESTING')

    # Returns instances of the dataikuapi.dss.dataset.DSSDataset class
    dssdataset_objects = proj.list_datasets(as_type='objects')
    for dssdataset_object in dssdataset_objects:
        dataset_name = dssdataset_object.name

        # Get the corresponding instance of the dataiku.Dataset class for the current dataset.
        # This class offers the get_location_info method, which returns the table name with variables replaced.
        dataset_object = dssdataset_object.get_as_core_dataset()
        location_info = dataset_object.get_location_info()

        # Get the table name, etc. for SQL-type datasets
        if location_info.get('locationInfoType') == 'SQL':
            table_name = location_info['info'].get('table')
            connection_name = location_info['info'].get('connectionName')
            database_type = location_info['info'].get('databaseType')
            print('Dataset name = {}'.format(dataset_name))
            print('  Table name = {}'.format(table_name))
            print('  Connection name = {}'.format(connection_name))
            print('  Database Type = {}'.format(database_type))
            print('')

    Marlan

  • sj0071992

    Hi,

    This is really helpful.

    Just an add-on question: can we also extract the last modified by (user name) along with the table name?

    Thanks in Advance

  • Marlan

    Hi @sj0071992,

    The following snippet can be added to the script to get both last modified by and last modified on:

    version_info = dssdataset_object.get_settings().get_raw()['versionTag']
    last_modified_by = version_info['lastModifiedBy']['login']
    last_modified_on = datetime.datetime.fromtimestamp(version_info['lastModifiedOn'] / 1e3)

    You'll need to add an import for the datetime package if you want last modified on. Most of the rest of the available information about datasets is available via dssdataset_object.get_settings().get_raw(). You can check the returned results to see what's there.
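    To make the shape of that raw settings dict concrete, here is a small self-contained sketch (no DSS instance needed). The dict below is only a hypothetical illustration of what get_settings().get_raw() might contain for a SQL dataset; print the real result on your instance to see the actual keys. It also guards against datasets that have no versionTag, as noted earlier in the thread.

```python
import datetime

# Hypothetical illustration of a raw settings dict; the exact keys
# on a real instance may differ -- print get_raw() to check.
raw_settings = {
    'name': 'my_dataset',
    'type': 'PostgreSQL',
    'params': {'table': '${projectKey}_my_dataset'},
    'versionTag': {
        'lastModifiedBy': {'login': 'jsmith'},
        'lastModifiedOn': 1638874800000,  # epoch milliseconds
    },
}

# Use .get with defaults instead of direct indexing so datasets
# without a versionTag don't raise a KeyError.
version_info = raw_settings.get('versionTag') or {}
last_modified_by = (version_info.get('lastModifiedBy') or {}).get('login', '')
ts_ms = version_info.get('lastModifiedOn')
last_modified_on = datetime.datetime.fromtimestamp(ts_ms / 1e3) if ts_ms else None

print(last_modified_by, last_modified_on)
```

    The same .get-with-default pattern applies to any other key you pull out of the raw settings.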

    Marlan

  • sj0071992

    Hi,

    This is perfect, but there is one issue.

    When I run this in a Python notebook it works perfectly, but when I convert it into a Python recipe it throws an exception:

    "Dataset cannot be used: Declare it as input or output of your recipe"

    Could you please help here as well.

    Thanks in Advance

  • sj0071992

    Hi @Marlan

    The code is failing when executing the line below:

    dataset_object = dssdataset_object.get_as_core_dataset() 

    It runs fine in a Jupyter notebook but fails when I try running it through a Python recipe.

    Error - "Dataset cannot be used: Declare it as input or Output of your Recipe"

    Thanks in Advance

  • Marlan

    Hi @sj0071992,

    It makes sense that you can't run the script provided in a recipe, as DSS requires that use of the dataiku.Dataset class in a recipe match the recipe's inputs and outputs.

    See https://doc.dataiku.com/dss/latest/python-api/datasets.html for more information on the two classes you can use to interact with datasets programmatically. The solution I proposed used both classes because the dataiku.Dataset class has this nice method that returns the table name with all variables resolved. As you found, this works fine in a notebook but not in a recipe.

    You can't meet that requirement for this script's use of the dataiku.Dataset class in a recipe, so you'll need to go back to just using the DSSDataset class from the dataikuapi package.

    Thus you could go back to using @VitaliyD's earlier solution without change.

    One small enhancement you could make to that solution is to handle replacement of any variables used in the table name rather than just the projectKey variable.

    Here is one way to do that which works pretty nicely:

    import dataiku
    from string import Template

    table_name_with_variables = '${projectKey}_${tier}_DATASETNAME'
    variables = dataiku.get_custom_variables()  # get all project variables (including projectKey)
    table_name = Template(table_name_with_variables).substitute(variables)  # replace any variables used with their values

    The advantage of this approach is that it replaces any variables used with their values, and the values are also obtained automatically via the API. It uses the Template class from Python's built-in string module, which happens to use the same syntax for variables as DSS. I also use this a lot for replacing variables in SQL queries run through the API.
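    As a self-contained illustration of that last point, here is a minimal sketch of substituting variables into a SQL query with Template. The query and the variables dict are made up for the example; in DSS the dict would come from dataiku.get_custom_variables() as above.

```python
from string import Template

# Hypothetical query using DSS-style ${...} variables
query_with_variables = 'SELECT * FROM ${projectKey}_${tier}_SALES WHERE region = ${region}'

# Hard-coded stand-in for dataiku.get_custom_variables()
variables = {'projectKey': 'DKUTESTING', 'tier': 'PROD', 'region': "'EMEA'"}

query = Template(query_with_variables).substitute(variables)
print(query)
# -> SELECT * FROM DKUTESTING_PROD_SALES WHERE region = 'EMEA'
```

    Note that substitute raises a KeyError if any placeholder has no matching variable; Template.safe_substitute leaves unknown placeholders in place instead.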

    Marlan

  • sj0071992

    Hi @Marlan

    Thanks again for your help.

    In the solution provided by @VitaliyD, is there any way to extract the "Last Modified by" and "Last Modified on"? We need these details.

    Thanks in Advance

  • Marlan

    Hi @sj0071992, yes, the code I suggested earlier can be used with @VitaliyD's solution, as it uses the DSSDataset class.

    Marlan

  • sj0071992

    Hi @Marlan,

    I am getting the same exception when I tried modifying @VitaliyD's code; below is the code for your reference. It would be great if you could have a look:

    client = dataiku.api_client()
    projects = client.list_projects()
    df = pd.DataFrame(columns=['project', 'dataset', 'table_name', 'lastModifiedBy', 'lastModifiedOn'])
    for p in projects:
        proj = client.get_project(p["projectKey"])
        all_datasets = proj.list_datasets()
        for dataset in all_datasets:
            versionTag = dataset.get('versionTag')
            lastModifiedBy = versionTag['lastModifiedBy'].get('login')
            lastModifiedOn = datetime.datetime.fromtimestamp(versionTag['lastModifiedOn'] / 1e3)
            table_name = dataset['params'].get('table')
            projectKey = dataset.get('projectKey')
            if table_name is not None:
                table_name = (lambda a, b: re.sub(r'\$\{projectKey\}', a, b))(projectKey, table_name)
            df = df.append({'project': p["projectKey"], 'dataset': dataset.name,
                            'table_name': table_name, 'lastModifiedBy': lastModifiedBy,
                            'lastModifiedOn': lastModifiedOn}, ignore_index=True)
    df = df.drop_duplicates()

  • sj0071992

    Hi @Marlan and @VitaliyD,

    Thanks for your help, it's working perfectly.

    Cheers

  • sj0071992

    Hi @VitaliyD,

    Could you please also help with extracting the schema name details using the same snippet? I have tried it myself, but the projects use variables for the schema, and when I tried to extract the details I got '${schema}', not the name.

    I am just wondering how to extract these kinds of details.

    Thanks in Advance

  • sj0071992

    Hi @VitaliyD,

    Could you please help here?

  • VitaliyD

    Hi @sj0071992,

    If I understood the requirement correctly, you would need to read the project variable and use a similar lambda function (as I used in my code) to replace "${schema}" in the string with the value from the variable. It can be something like below:

    # get the project variable
    schemaVar = proj.get_variables()['standard']['schema']
    # replace the placeholder with the value
    table_name = (lambda a, b: re.sub(r'\$\{schema\}', a, b))(schemaVar, table_name)

    Hope this helps.

    Best.

  • sj0071992

    Hi @VitaliyD,

    I got your point, but what if we don't know the variable name used in the Dataiku project for the schema?

    How can we extract the details in that case?

    Thanks in Advance

  • VitaliyD

    Hi @sj0071992,

    In this case, you can extract the variable name using regex groups with something like the code below; then, since you will have the variable name, you can get its value and replace the placeholder.

    table_name = '${some_variable}/asdf/asdf'
    result = re.search(r'(\$\{)(.*)(\})', table_name)
    print(result.group(2))

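    Building on that, here is a hedged sketch that generalizes the idea: find every ${...} placeholder in a table name with re.findall and replace each one from a variables dict. The dict is hard-coded here so the example is self-contained; in DSS it could come from proj.get_variables()['standard'] or dataiku.get_custom_variables().

```python
import re

# Hard-coded stand-in for the project variables dict
variables = {'projectKey': 'DKUTESTING', 'schema': 'ANALYTICS'}

table_name = '${schema}.${projectKey}_CUSTOMERS'

# Non-greedy group captures each variable name between '${' and '}'
for var_name in re.findall(r'\$\{(.*?)\}', table_name):
    if var_name in variables:
        table_name = table_name.replace('${' + var_name + '}', variables[var_name])

print(table_name)
# -> ANALYTICS.DKUTESTING_CUSTOMERS
```

    Placeholders with no matching variable are simply left in place, so you can spot unresolved names in the output.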

    Best,

    Vitaliy

  • RAMAN
    RAMAN Dataiku DSS Core Designer, Registered Posts: 3 ✭✭✭

    This works well.

    Actually, my problem is deleting some of the tables zone by zone. Can you help me get all the tables from their respective zones?
