Dataset Schema Query

rishabh1994
Level 1

Hi Folks,

I am working on plugins that generate datasets with a varying number of columns. Because of this, I am unable to manually define the schema of the output, so I am currently using the automatic setting.

However, this leads to Dataiku setting the default length of many varchar columns to 65535, which is huge when I need to deal with large volumes of data. I am looking for a way to limit the maximum size of columns dynamically.

 

Any leads would be appreciated.

 

Thanks! 

 

 

2 Replies
AlexT
Dataiker

Hi,

The default length is set according to the maximum for your database, which in your case is 65535.

To change this, you can use the UI, or use get_schema() and set_schema() once the dataset is created and the initial schema is set. To update maxLength in the schema via the API, you can use:

import dataiku

# Connect to the Dataiku instance and open the dataset
client = dataiku.api_client()
project = client.get_project('PREDICTION_CARS')
input_dataset = project.get_dataset('PREDICTION_CARS_1')

# Read the current schema and cap maxLength on every string column
schema = input_dataset.get_schema()
new_schema = {'columns': [], 'userModified': True}

for col in schema['columns']:
    if col['type'] == 'string':
        col['maxLength'] = 3000
    new_schema['columns'].append(col)

print('Final new schema:')
print(new_schema)

try:
    input_dataset.set_schema(new_schema)
except Exception as e:
    print(e)

 

Let me know if this works for you. 
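If you would rather have the cap follow the actual data instead of a fixed 3000, one option is to scan the rows first and derive a per-column maxLength, then feed the result into set_schema() as above. A minimal pure-Python sketch; the column names, padding factor, and floor value are illustrative assumptions, not part of Dataiku's API:

```python
# Sketch: derive a maxLength per string column from the data itself.
# The padding factor leaves headroom for slightly longer future values;
# the floor avoids overly tight columns. Both values are assumptions.

def compute_max_lengths(rows, string_columns, padding=1.2, floor=8):
    """Return {column: maxLength} based on the longest observed value."""
    lengths = {col: floor for col in string_columns}
    for row in rows:
        for col in string_columns:
            value = row.get(col)
            if value is not None:
                lengths[col] = max(lengths[col], int(len(str(value)) * padding))
    return lengths

# Hypothetical sample rows standing in for the dataset's contents
rows = [
    {"make": "Toyota", "model": "Corolla"},
    {"make": "Mercedes-Benz", "model": "A-Class"},
]
print(compute_max_lengths(rows, ["make", "model"]))  # → {'make': 15, 'model': 8}
```

You could then assign `col['maxLength'] = lengths[col['name']]` in the loop above instead of the hard-coded 3000.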

rishabh1994
Level 1
Author

Thanks @AlexT, I will try this out.
