Dataset Schema Query

rishabh1994 Partner, Registered Posts: 7 Partner

Hi Folks,

I am working on plugins that generate datasets with a varying number of columns. Because of this, I am unable to manually define the output schema and am currently using the automatic setting.

This, however, leads to Dataiku setting the default length of many varchar columns to 65535, which is huge when I need to deal with large volumes of data. I am looking for a way to limit the maximum size of columns dynamically.

Any leads would be appreciated.



  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    edited July 17


    The default length is set according to the maximum for your database, which in your case is 65535.

    To change this you can use the UI, or get_schema() and set_schema() once the dataset is created and the initial schema is set. To update maxLength in the schema via the API you can use:

    import dataiku

    client = dataiku.api_client()
    project = client.get_project('PREDICTION_CARS')
    input_dataset = project.get_dataset('PREDICTION_CARS_1')

    schema = input_dataset.get_schema()
    try:
        # Cap the max length of every string column
        for col in schema['columns']:
            if col['type'] == 'string':
                col['maxLength'] = 3000
        # Write the modified schema back to the dataset
        input_dataset.set_schema(schema)
        print('final new schema', schema)
    except Exception as e:
        print(e)
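    Since your plugins produce many datasets with varying columns, you may want to factor the capping logic into a small helper you can reuse. Here is a minimal sketch (the helper name `cap_string_lengths` is mine, not part of the Dataiku API); it only transforms the schema dictionary, so you would still call get_schema()/set_schema() around it as above:

    ```python
    import copy

    def cap_string_lengths(schema, max_length=3000):
        """Return a copy of a Dataiku-style schema dict with the maxLength
        of every string column capped at max_length. Non-string columns
        and shorter string columns are left untouched."""
        capped = copy.deepcopy(schema)
        for col in capped.get('columns', []):
            if col.get('type') == 'string' and col.get('maxLength', 0) > max_length:
                col['maxLength'] = max_length
        return capped
    ```

    You could then do `input_dataset.set_schema(cap_string_lengths(input_dataset.get_schema()))` for each generated dataset.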

    Let me know if this works for you.

  • rishabh1994
    rishabh1994 Partner, Registered Posts: 7 Partner

    Thanks @AlexT, will try this out...
