Hi Folks,
I am currently working on plugins that generate datasets with a varying number of columns. Because of this I cannot manually define the schema of the output, so I am using the automatic setting.
This, however, leads to Dataiku setting the default length of many varchar columns to 65535, which is huge when I need to deal with large volumes of data. I am looking for a way to limit the maximum size of the columns dynamically.
Any leads would be appreciated.
Thanks!
Hi,
The default length is set according to the maximum for your database, which in your case is 65535.
To change this you can use the UI, or use get_schema() and set_schema() once the dataset is created and the initial schema is set. To update maxLength in the schema via the API you can use:
import dataiku

client = dataiku.api_client()
project = client.get_project('PREDICTION_CARS')
input_dataset = project.get_dataset('PREDICTION_CARS_1')

schema = input_dataset.get_schema()
new_schema = {'columns': [], 'userModified': True}

try:
    # Cap maxLength on every string column; copy the other columns unchanged
    for col in schema['columns']:
        if col['type'] == 'string':
            col['maxLength'] = 3000
        new_schema['columns'].append(col)

    print('final new schema')
    print(new_schema)

    input_dataset.set_schema(new_schema)
except Exception as e:
    print(e)
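Since your plugin produces a different column set each time, you could also derive the cap from the data itself rather than hard-coding 3000. Here is a minimal sketch along the same lines, assuming the code runs inside DSS (so dataiku.Dataset is available), that the dataset fits in memory as a pandas DataFrame, and that the project/dataset names are placeholders to replace with your own:

import dataiku

client = dataiku.api_client()
dataset = client.get_project('PREDICTION_CARS').get_dataset('PREDICTION_CARS_1')

# Read the data to measure the actual string lengths per column
df = dataiku.Dataset('PREDICTION_CARS_1').get_dataframe()

schema = dataset.get_schema()
for col in schema['columns']:
    if col['type'] == 'string' and col['name'] in df.columns:
        # Size each column to the longest observed value (minimum 1)
        lengths = df[col['name']].dropna().astype(str).str.len()
        col['maxLength'] = int(lengths.max()) if not lengths.empty else 1
schema['userModified'] = True
dataset.set_schema(schema)

Note that this sizes columns to the current data only, so you may want to add some headroom if future rows can contain longer values.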
Let me know if this works for you.
Thanks @AlexT , will try this out...