Extract metadata from a dataset

florianbriand Registered Posts: 12 ✭✭✭✭

I would like to extract some metadata (for example the column names) from a dataset.

Is there any existing recipe or preparation processor that does this, or do I have to write it myself in a plugin?

Answers

  • ATsao Dataiker Alumni, Registered Posts: 139 ✭✭✭✭✭✭✭✭

    Hi,

    I would suggest leveraging the Python APIs to retrieve the information you need, such as the metadata or corresponding schema, from these datasets:
    https://doc.dataiku.com/dss/latest/python-api/datasets.html
    https://doc.dataiku.com/dss/latest/python-api/rest-api-client/datasets.html

    More information about available functions can be found at the bottom under the reference doc section.

    Best,
    Andrew

  • florianbriand Registered Posts: 12 ✭✭✭✭
    So I understand the answer as "there isn't any existing solution" and I have to code it myself. I think this would be a good thing to add as a standard recipe/processor.
  • ATsao Dataiker Alumni, Registered Posts: 139 ✭✭✭✭✭✭✭✭

    Hi,

    Sure, we appreciate the feedback, and I'll forward it to our Product team on your behalf for further review.

    Best,

    Andrew

  • Marlan Neuron, Registered Posts: 316

    Here's some code that may be helpful. It illustrates reading the schema and some other dataset attributes. This example is geared toward SQL datasets (our primary use) so some of the details may differ for other types of datasets.

    import dataiku

    ds = dataiku.Dataset('FEATURES')

    # location_info
    loc_info = ds.get_location_info()
    ds_info = loc_info['info']
    if loc_info['locationInfoType'] == 'SQL' and 'table' not in ds_info:
        table = '<SQL Query>'
    else:
        table = ds_info['table']

    print('Dataset Type: {}'.format(loc_info['locationInfoType']))
    print('Connection Name: {}'.format(ds_info['connectionName']))
    print('Database Type: {}'.format(ds_info['databaseType']))
    print('Table Name: {}'.format(table))

    # schema
    print('\nColumns in Dataset (Name - dss type / database type):')
    for col in ds.read_schema():
        print('{0} - {1} / {2}'.format(col['name'], col['type'], col['originalType']))

    # Note that ds.get_config() has all of the above plus many other config items
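    For readers without a DSS instance handy, the structures the snippet above walks over look roughly like this. The values below are mocked up for illustration (the connection name, table, and column types are invented); only the keys mirror what the snippet actually accesses:

    ```python
    # Mocked-up versions of the structures the snippet reads; real contents
    # depend on the dataset and connection. The keys match those used above.
    loc_info = {
        'locationInfoType': 'SQL',
        'info': {
            'connectionName': 'my_postgres',   # hypothetical connection name
            'databaseType': 'PostgreSQL',
            'table': 'FEATURES',
        },
    }
    schema = [
        {'name': 'customer_id', 'type': 'bigint', 'originalType': 'int8'},
        {'name': 'score', 'type': 'double', 'originalType': 'float8'},
    ]

    ds_info = loc_info['info']
    # Fall back to a placeholder when a SQL dataset is backed by a query, not a table
    table = ds_info.get('table', '<SQL Query>')
    print('Dataset Type: {}'.format(loc_info['locationInfoType']))
    for col in schema:
        print('{0} - {1} / {2}'.format(col['name'], col['type'], col['originalType']))
    ```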

  • Herc Alpha Tester, Registered Posts: 6 ✭✭✭✭
    Hi @florianbriand
    What did you want to do with the metadata? Would writing it into a file (e.g. via a custom recipe) be convenient?
    Herc
  • florianbriand Registered Posts: 12 ✭✭✭✭

    For information, the plugin I wrote is as simple as:

    import dataiku
    from dataiku.customrecipe import get_input_names_for_role, get_output_names_for_role
    import pandas as pd
    from pprint import pprint

    main_input_names = get_input_names_for_role('main_input')
    main_input_ds = dataiku.Dataset(main_input_names[0])
    print("------------- INPUT -------------")
    pprint(main_input_ds)

    # For outputs, the process is the same:
    main_output_names = get_output_names_for_role('main_output')
    main_output_ds = dataiku.Dataset(main_output_names[0])

    schema = main_input_ds.read_schema()  # {name, type} for each column
    print("------------- SCHEMA -------------")
    pprint(schema)

    main_output_df = pd.DataFrame(schema)
    print("------------- OUTPUT -------------")
    pprint(main_output_df)

    main_output_ds.write_with_schema(main_output_df)

    In my case, I didn't need anything other than the schema.

    But there are probably other metadata which could be helpful.

    From my side, the need is just to get the list of columns, to do things like:

    - check that every column required by the subsequent flow is provided, and tell the user which columns are missing
    - automatically configure some other recipes, via variables, with the input columns
    - ...
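
    The first bullet above — validating that required columns are present — can be sketched in plain Python against the `{name, type}` list that `read_schema()` returns. The sample schema and the required-column list below are made up for illustration:

    ```python
    # Check that a dataset schema contains the columns a downstream flow needs.
    # The schema shape mirrors read_schema(): a list of {name, type} dicts.
    def missing_columns(schema, required):
        """Return the required column names absent from the schema."""
        present = {col['name'] for col in schema}
        return [name for name in required if name not in present]

    # Hypothetical sample schema and requirements, for illustration only.
    schema = [
        {'name': 'customer_id', 'type': 'bigint'},
        {'name': 'signup_date', 'type': 'date'},
    ]
    required = ['customer_id', 'signup_date', 'churn_flag']

    missing = missing_columns(schema, required)
    if missing:
        print('Missing columns: {}'.format(', '.join(missing)))
    ```

    In a custom recipe, the same check could run right after `read_schema()` and fail fast with a clear message instead of letting a later recipe break on the absent column.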
