Dataset Schema Query

rishabh1994
rishabh1994 Partner, Registered Posts: 7 Partner

Hi Folks,

I am working on plugins that generate datasets with a varying number of columns. Because of this, I cannot define the output schema manually, and I am currently using the automatic setting.

This, however, leads to Dataiku setting the default length of many varchar columns to 65535, which is huge when I need to deal with large volumes of data. I am looking for a way to limit the maximum size of columns dynamically.

Any leads would be appreciated.

Thanks!

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker
    edited July 17

    Hi,

    The default length is set according to the maximum for your database, which in your case is 65535.

    To change this, you can use the UI, or get_schema() and set_schema() once the dataset is created and the initial schema is set. To update maxLength in the schema via the API, you can use:

    import dataiku
    
    # Connect through the API client and locate the dataset.
    client = dataiku.api_client()
    project = client.get_project('PREDICTION_CARS')
    input_dataset = project.get_dataset('PREDICTION_CARS_1')
    
    # Fetch the current schema (a dict with a 'columns' list).
    schema = input_dataset.get_schema()
    new_schema = {'columns': [], 'userModified': True}
    
    try:
        for col in schema['columns']:
            # Cap the length of string columns only; other types pass through unchanged.
            if col['type'] == 'string':
                col['maxLength'] = 3000
            new_schema['columns'].append(col)
        print('final new schema')
        print(new_schema)
        # Write the modified schema back to the dataset.
        input_dataset.set_schema(new_schema)
    except Exception as e:
        print(e)

    Let me know if this works for you.
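
    Since the plugin produces a varying number of columns, the same idea can be wrapped in a small helper that works on any schema dict. This is only a sketch, not an official Dataiku utility: the function name cap_string_lengths, the overrides parameter and the example column names are all hypothetical. It assumes the schema has the same shape as above, a dict with a 'columns' list whose entries carry 'name' and 'type'.

```python
import copy


def cap_string_lengths(schema, default_max=3000, overrides=None):
    """Return a copy of a schema dict with maxLength set on every string column.

    `overrides` (hypothetical parameter) optionally maps a column name to a
    specific maxLength; all other string columns get `default_max`.
    """
    overrides = overrides or {}
    new_schema = copy.deepcopy(schema)  # leave the caller's dict untouched
    for col in new_schema.get('columns', []):
        if col.get('type') == 'string':
            col['maxLength'] = overrides.get(col['name'], default_max)
    new_schema['userModified'] = True
    return new_schema


# Example with a hand-written schema dict (column names are made up).
example = {
    'columns': [
        {'name': 'model', 'type': 'string', 'maxLength': 65535},
        {'name': 'price', 'type': 'double'},
        {'name': 'notes', 'type': 'string'},
    ],
    'userModified': False,
}
capped = cap_string_lengths(example, default_max=3000, overrides={'model': 64})
print(capped['columns'])
```

    In a plugin, the input would presumably be the result of get_schema() and the returned dict would be passed to set_schema(), as in the snippet above.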

  • rishabh1994
    rishabh1994 Partner, Registered Posts: 7 Partner

    Thanks @AlexT, will try this out...
