Join us, on May 27th, for an introduction to the new Dataiku Academy Learn more

DSS visual recipes defaulting to max column length with Redshift tables

Level 1
DSS visual recipes defaulting to max column length with Redshift tables

Hi everyone, 

When working with Redshift tables in DSS visual recipes we noticed that the table creation settings sometimes defaults to setting certain column lengths to the redshift max (65,000). In many cases this becomes excessive. For example, in the screenshot below the "brand" column has a length of 65k but most of the column has text that span less than 10 characters. 

We wanted  to better understand the logic of column length setting defaults for Redshift and if there is a safe / proper way to modify this.

veenacalambur_0-1578506355191.png

 

1 Reply
Dataiker
Dataiker

Hi,

When creating datasets from input Redshift (or other SQL databases), DSS will automatically fetch the column lengths from the Redshift table.

However, when creating a new Redshift dataset from columns which do not have a fixed length (as is the case for example when syncing from a CSV or Parquet file), DSS cannot know what the maximum size will be. It therefore takes a safe approach and defaults to setting up the maximum length possible for the database.

You can of course modify the length in the schema screen of the dataset.

DSS does not "simulate" the entire process in order to know what the maximum length may be, as it would be extremely slow and brittle.

We are looking into potential future improvements to make the default max length configurable. This will make it easier to work with the SQL databases (like Redshift) where the max length of varchars has a performance impact (it doesn't for most SQL databases)