List of all project and respective tables in dataiku

Solved!
sj0071992
List of all project and respective tables in dataiku

Hi Team,

Is there any way to extract the list of project and respective tables from Dataiku?

 

Thanks in Advance

0 Kudos
1 Solution
VitaliyD
Dataiker

HI,

The exception "Dataset cannot be used: Declare it as input or Output of your Recipe" means that you are trying to write to the dataset that was not specified as the output dataset in the recipe. To fix it you will need to add the dataset as an output dataset to the recipe, please refer to the example below:

Screenshot 2021-12-07 at 12.04.29.png

Additionally, some of the datasets may have no "versionTag" so to handle these cases I would recommend introducing try/except into your code like below:

import dataiku, json, datetime
from dataiku import pandasutils as pdu
import pandas as pd

client = dataiku.api_client()
projects = client.list_projects()

df = pd.DataFrame(columns=['project', 'dataset', 'table_name','lastModifiedBy','lastModifiedOn'])

for p in projects :
    proj = client.get_project(p["projectKey"])
    datasets = []
    all_datasets = proj.list_datasets()
    for dataset in all_datasets:
        versionTag = dataset.get('versionTag')
        try:
            lastModifiedBy = versionTag['lastModifiedBy'].get('login')
            lastModifiedOn = datetime.datetime.fromtimestamp(versionTag['lastModifiedOn']/1e3)
        except:
            lastModifiedBy = ""
            lastModifiedOn = ""
        table_name = dataset['params'].get('table')
        projectKey = dataset.get('projectKey')
        if(table_name is not None):
            table_name = (lambda a, b: re.sub(r'\$\{projectKey\}', a, b))(projectKey, table_name)
        df = df.append({'project': p["projectKey"], 'dataset': dataset.name, 'table_name': table_name,'lastModifiedBy':lastModifiedBy,'lastModifiedOn':lastModifiedOn}, ignore_index=True)
       
df = df.drop_duplicates()

 

Hope this helps.

-Best

View solution in original post

0 Kudos
20 Replies
VitaliyD
Dataiker

Hi, If by the tables you mean datasets, you can use Dataiku API. Somethin like follows:

 

 

import dataiku, json
from dataiku import pandasutils as pdu
import pandas as pd

client = dataiku.api_client()
projects = client.list_projects()
df = pd.DataFrame(columns=['project', 'dataset'])
for p in projects :
    proj = client.get_project(p["projectKey"])
    datasets = []
    all_datasets = proj.list_datasets()
    for dataset in all_datasets:
        df = df.append({'project': p["projectKey"], 'dataset': dataset.name}, ignore_index=True)
df = df.drop_duplicates()
print(df)

 

 

 

The above will create a data frame that you can save to a dataset if required.

Hope the above is what you were looking for.

-Best

0 Kudos
sj0071992
Author

Hi,

 

This is really helpful.

 

But instead of dataset name can we get the table name under that dataset because i have to use it for join and in some cases project key is appended with the dataset name (table Name).

Is there any way to extract the same?

 

Thanks in Advance

0 Kudos
VitaliyD
Dataiker

Hi, 

If you need to display table names for SQL like datasets, you can get them from dataset params. However, if you are using variables in the names it is saved as a string like "${projectKey}_dataset_name" so you will need to replace the variable yourself. I modified the code I shared earlier, please check it below:

import dataiku, json
import re
from dataiku import pandasutils as pdu
import pandas as pd

client = dataiku.api_client()
projects = client.list_projects()
df = pd.DataFrame(columns=['project', 'dataset', 'table_name'])
for p in projects :
    proj = client.get_project(p["projectKey"])
    datasets = []
    all_datasets = proj.list_datasets()
    for dataset in all_datasets:
        table_name = dataset['params'].get('table')
        projectKey = dataset.get('projectKey')
        if(table_name is not None):
            table_name = (lambda a, b: re.sub(r'\$\{projectKey\}', a, b))(projectKey, table_name)
        df = df.append({'project': p["projectKey"], 'dataset': dataset.name, 'table_name': table_name}, ignore_index=True)
        
df = df.drop_duplicates()
print(df)

 

Hope this helps.

-Best

Marlan

Hi @sj0071992,

Below is an alternative approach to the one @VitaliyD shared. This has the advantage of not having to replace variables yourself. It is possible for variables other than ${projectKey} to be used in a table name so nice that such situations are taken care of for you. It does involve using two additional classes available through the Python APIs: the dataikuapi.dss.dataset.DSSDataset class and the dataiku.Dataset class which may be a bit confusing. 

Documentation for these two classes is here: https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html

The get_location_info method used returns a number of other pieces of information (illustrated in the code sample).  This information is not unique to this method but the simplicity of the returned data structure is perhaps a bit more convenient than other options.

client = dataiku.api_client()

proj = client.get_project('DKUTESTING')

dssdataset_objects = proj.list_datasets(as_type='objects') # returns instances of dataikuapi.dss.dataset.DSSDataset class

for dssdataset_object in dssdataset_objects:
    
    dataset_name = dssdataset_object.name
    
    # Get corresponding instance of dataiku.Dataset class for the current dataset 
    # This class offers the get_location_info method which will return table name with variables replaced
    dataset_object = dssdataset_object.get_as_core_dataset() 

    location_info = dataset_object.get_location_info()
    
    # Get table name, etc for SQL type datasets
    if location_info.get('locationInfoType') == 'SQL':
        table_name = location_info['info'].get('table')
        connection_name = location_info['info'].get('connectionName')
        database_type = location_info['info'].get('databaseType')

        print('Dataset name = {}'.format(dataset_name))
        print('  Table name = {}'.format(table_name))
        print('  Connection name = {}'.format(connection_name))
        print('  Database Type = {}'.format(database_type))
        print('')

 

Marlan

sj0071992
Author

Hi,

 

This is really helpful.

 

Just an add on question, can we also extract the last modified by (User Name) as well here with table Name?

 

Thanks in Advance

0 Kudos
Marlan

Hi @sj0071992,

The following snippet can be added to the script to get both last modified by and last modified on:

version_info = dssdataset_object.get_settings().get_raw()['versionTag']
last_modified_by = version_info['lastModifiedBy']['login']
last_modified_on = datetime.datetime.fromtimestamp(version_info['lastModifiedOn'] / 1e3)

 

You'll need to add an import for the datetime package if you want last modified on. Most of the rest of the available information about datasets is available via dssdataset_object.get_settings().get_raw(). You can check the returned results to see what's there.

Marlan

0 Kudos
sj0071992
Author

Hi,

 

This is perfect, but there is one issue.

When i am running this in python Notebook it is working perfect but when i convert that into Python Recipe it is giving one exception

 

"Dataset cannot be used: Declare it as input or output of your recipe"

Could you please help here as well.

 

Thanks in Advance

 

 

0 Kudos
sj0071992
Author

Hi @Marlan 

 

The Code is failing when executing below line

 

dataset_object = dssdataset_object.get_as_core_dataset() 

 

Ideally it is running fine in Jupiter notebook but is getting failed when i tried running through python recipe 

 

Error - "Dataset cannot be used: Declare it as input or Output of your Recipe"

 

Thanks in Advance

0 Kudos
Marlan

Hi @sj0071992,

Makes sense that you can't run the script provided in a recipe as DSS requires that use of dataiku.Dataset class in recipe matches recipe inputs and outputs.

See https://doc.dataiku.com/dss/latest/python-api/datasets.html for more information on the two classes you can use to interact with datasets programmatically. The solution I proposed used both classes because the dataiku.Dataset class has this nice method that returns the table name with all variables resolved. As you found, this works fine in a notebook but not in a recipe.

You can't meet the requirement for use of dataiku.Dataset class in a recipe so you'll need to go back to just using the DSSDataset class from the dataikuapi package.

Thus you could go back to using @VitaliyD's earlier solution without change.

One small enhancement you could make to that solution is to handle replacement of any variables used in the table name rather than just the projectKey variable. 

Here is one way to do that that works pretty nicely:

import dataiku
from string import Template
table_name_with_variables = '${projectKey}_${tier}_DATASETNAME'
variables = dataiku.get_custom_variables() # get all project variables (including projectKey)
table_name = Template(table_name_with_variables).substitute(variables) # replace any variables used with their values

 

The advantage of this approach is that it handles replacing any variables used with their values where the values are also automatically obtained via the API. It uses the Template class from the Python built-in string package which happens to use the same syntax for variables as DSS. I also use this a lot for replacing variables in SQL queries ran through the API. 

Marlan

0 Kudos
sj0071992
Author

Hi @Marlan 

 

Thanks again for your help.

In the Solution provided by @VitaliyD , is there any way to extract the "Last Modified by" and "Last Modified on". As we need these detail.

Thanks in Advance

0 Kudos
Marlan

Hi @sj0071992, yes, the code I suggested earlier can be used with @VitaliyD's solution as it uses the DSSDataset class.

Marlan

0 Kudos
sj0071992
Author

Hi @Marlan ,

 

I am getting the same exception when I tried modifying the @VitaliyD code, below is the Code for your reference. It would e great if you have a look at this,

 

client = dataiku.api_client()

projects = client.list_projects()

 

 

df = pd.DataFrame(columns=['project', 'dataset', 'table_name','lastModifiedBy','lastModifiedOn'])

for p in projects :

    proj = client.get_project(p["projectKey"])

    datasets = []

    all_datasets = proj.list_datasets()

    for dataset in all_datasets:

        versionTag = dataset.get('versionTag')

        lastModifiedBy = versionTag['lastModifiedBy'].get('login')

        lastModifiedOn = datetime.datetime.fromtimestamp(versionTag['lastModifiedOn']/1e3)

        table_name = dataset['params'].get('table')

        projectKey = dataset.get('projectKey')

        if(table_name is not None):

            table_name = (lambda a, b: re.sub(r'\$\{projectKey\}', a, b))(projectKey, table_name)

        df = df.append({'project': p["projectKey"], 'dataset': dataset.name, 'table_name': table_name,'lastModifiedBy':lastModifiedBy,'lastModifiedOn':lastModifiedOn}, ignore_index=True)

       

df = df.drop_duplicates()

0 Kudos
VitaliyD
Dataiker

HI,

The exception "Dataset cannot be used: Declare it as input or Output of your Recipe" means that you are trying to write to the dataset that was not specified as the output dataset in the recipe. To fix it you will need to add the dataset as an output dataset to the recipe, please refer to the example below:

Screenshot 2021-12-07 at 12.04.29.png

Additionally, some of the datasets may have no "versionTag" so to handle these cases I would recommend introducing try/except into your code like below:

import dataiku, json, datetime
from dataiku import pandasutils as pdu
import pandas as pd

client = dataiku.api_client()
projects = client.list_projects()

df = pd.DataFrame(columns=['project', 'dataset', 'table_name','lastModifiedBy','lastModifiedOn'])

for p in projects :
    proj = client.get_project(p["projectKey"])
    datasets = []
    all_datasets = proj.list_datasets()
    for dataset in all_datasets:
        versionTag = dataset.get('versionTag')
        try:
            lastModifiedBy = versionTag['lastModifiedBy'].get('login')
            lastModifiedOn = datetime.datetime.fromtimestamp(versionTag['lastModifiedOn']/1e3)
        except:
            lastModifiedBy = ""
            lastModifiedOn = ""
        table_name = dataset['params'].get('table')
        projectKey = dataset.get('projectKey')
        if(table_name is not None):
            table_name = (lambda a, b: re.sub(r'\$\{projectKey\}', a, b))(projectKey, table_name)
        df = df.append({'project': p["projectKey"], 'dataset': dataset.name, 'table_name': table_name,'lastModifiedBy':lastModifiedBy,'lastModifiedOn':lastModifiedOn}, ignore_index=True)
       
df = df.drop_duplicates()

 

Hope this helps.

-Best

0 Kudos
sj0071992
Author

Hi @Marlan and @VitaliyD 

 

Thanks for your help, its working perfect.

 

Cheers

sj0071992
Author

Hi @VitaliyD 

 

Could you please also help in extracting the Schema name details with the same snippet. As i have tried by self but the only reason is, the projects used a variables for Schema and when i tried to extract the details i got '${schema}' not the name.

I am just wondering, how to extract these kind of details.

 

Thanks in Advance 

0 Kudos
sj0071992
Author

Hi @VitaliyD , 

 

Could you please help here

0 Kudos
VitaliyD
Dataiker

Hi @sj0071992,

If I understood the requirement correctly, you would need to read the project variable and use the similar lambda function(i used in my code) to replace the "${schema}" in the string with the value from the variable. It can be something like below:

# get project variable
schemaVar = proj.get_variables()['standard']['schema']
# replace the placeholder with the value
table_name = (lambda a, b: re.sub(r'\$\{schema\}', a, b))(schemaVar, table_name)

 

Hope this helps.

Best.

0 Kudos
sj0071992
Author

Hi @VitaliyD ,

 

I got your point here but what if we don't know the variable name used in the Dataiku projects for Schema.

How we can extract the details in that case?

 

Thanks in Advance

0 Kudos
VitaliyD
Dataiker

Hi @sj0071992,

In this case, you can extract the variable name using Regex groups with something like below and then as you will have the variable name, get the value and replace the placeholder.

table_name = '${some_variable}/asdf/asdf'
result = re.search(r'(\$\{)(.*)(\})', table_name)
print(result.group(2))

Screenshot 2022-03-09 at 12.48.33.png

 

Best,

Vitaliy

0 Kudos
RAMAN
Level 1

this works good. 

actually my problem is to delete some of the tables zone wise .can you help me to get all the tables from respective zones .

 

0 Kudos