Dataset Schema Query
Hi Folks,
I am currently working on plugins that generate datasets with a varying number of columns. Because of this I am unable to define the output schema manually, so I am using the automatic setting.
This, however, leads to Dataiku setting the default length of many varchar columns to 65535, which is far too large when I need to deal with big volumes of data. I am looking for a way to limit the maximum size of columns dynamically.
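For illustration (the column name here is made up), each string column in the auto-generated schema comes out like this:

{'name': 'some_text_column', 'type': 'string', 'maxLength': 65535}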
Any leads would be appreciated.
Thanks!
Answers
-
Alexandru (Dataiker)
Hi,
The default length is set according to the maximum for your database, in your case 65535.
To change this you can use the UI, or get_schema() and set_schema() once the dataset is created and the initial schema is set. To update maxLength in the schema via the API you can use:
import dataiku

client = dataiku.api_client()
project = client.get_project('PREDICTION_CARS')
input_dataset = project.get_dataset('PREDICTION_CARS_1')

# Read the current schema and cap maxLength on every string column
schema = input_dataset.get_schema()
new_schema = {'columns': [], 'userModified': True}
for col in schema['columns']:
    if col['type'] == 'string':
        col['maxLength'] = 3000
    new_schema['columns'].append(col)

print('final new schema')
print(new_schema)

try:
    input_dataset.set_schema(new_schema)
except Exception as e:
    print(e)
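Since you are generating these datasets from a plugin, you could also apply the same cap from code running inside DSS, using the inner dataiku.Dataset API instead of the public client. This is only a minimal sketch: 'OUTPUT_DATASET' is a placeholder for your actual dataset name, and it assumes the dataset is writable from where the code runs and already has its initial schema:

import dataiku

# Placeholder dataset name; this runs inside DSS (recipe, notebook, scenario step)
output = dataiku.Dataset('OUTPUT_DATASET')

# read_schema() returns the columns as a list of dicts like
# {'name': 'foo', 'type': 'string', 'maxLength': 65535}
columns = output.read_schema()
for col in columns:
    if col['type'] == 'string':
        col['maxLength'] = 3000  # cap string columns at 3000 characters

output.write_schema(columns)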
Let me know if this works for you.
-
Thanks @AlexT, will try this out...