DSS visual recipes defaulting to max column length with Redshift tables

Solved!
veenacalambur
Level 2

Hi everyone, 

When working with Redshift tables in DSS visual recipes, we noticed that the table creation settings sometimes default to setting certain column lengths to the Redshift maximum (65,535). In many cases this is excessive. For example, in the screenshot below the "brand" column has a length of 65k, but most of its values span fewer than 10 characters.

We wanted to better understand the logic behind the default column lengths for Redshift, and whether there is a safe / proper way to modify them.

[Screenshot: veenacalambur_0-1578506355191.png]
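For reference, the actual maximum length in a column can be checked with a quick query along these lines (the table name below is a placeholder, not the real dataset):

-- Measure the longest value actually stored in the "brand" column,
-- to compare against the 65,535 default length.
-- "my_redshift_table" is a placeholder table name.
SELECT MAX(CHAR_LENGTH(brand)) AS max_brand_length
FROM my_redshift_table;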

 


4 Replies
Clément_Stenac

Hi,

When creating datasets from an input Redshift table (or other SQL databases), DSS will automatically fetch the column lengths from the existing table.

However, when creating a new Redshift dataset from columns which do not have a fixed length (as is the case, for example, when syncing from a CSV or Parquet file), DSS cannot know what the maximum size will be. It therefore takes a safe approach and defaults to the maximum length the database allows.

You can of course modify the length in the schema screen of the dataset.

DSS does not "simulate" the entire process in order to know what the maximum length may be, as that would be extremely slow and brittle.

We are looking into potential future improvements to make the default max length configurable. This would make it easier to work with SQL databases (like Redshift) where the declared max length of varchars has a performance impact (it doesn't for most SQL databases).
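To illustrate, the generated table ends up being roughly equivalent to the first statement below, and tightening the length in the schema screen gives something like the second (table and column names are made-up examples):

-- Default when DSS cannot know the maximum length of the input column:
CREATE TABLE example_output_default (
    brand VARCHAR(65535)   -- Redshift's maximum VARCHAR length
);

-- After tightening the column length in the dataset's schema screen:
CREATE TABLE example_output_tightened (
    brand VARCHAR(32)      -- whatever length you choose
);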

AntonB
Level 2

Hello @Clément_Stenac,

Was there any progress on the improvement you mention? I regularly run into a similar issue with a MySQL database, where DSS defaults to varchar(500), which is far from sufficient for the text data I work with. I then have to manually edit the create table statement to text or longtext for it to work. Is there any DSS setting that would allow me to use text instead of varchar(500) by default?

Is there a ticket or development request I could raise to have DSS change the schema from varchar(500) to longtext whenever a job fails with a "Data too long" error for a given field in the table?

Apparently, DSS is aware of both the error type and the field name, so from my point of view it should be able to adjust the schema by itself and load the data again without human intervention.
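In SQL terms, the adjustment I would like DSS to make automatically amounts to something like this (table and column names are placeholders):

-- MySQL: widen the column that fails with "Data too long for column ...",
-- from the default varchar(500) to LONGTEXT.
-- "example_output" and "comment_text" are placeholder names.
ALTER TABLE example_output
    MODIFY COLUMN comment_text LONGTEXT;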

 

Before (default behavior today):

[Screenshot: before - default behavior today]

You get an error:

[Screenshot: data too long error.JPG]

The schema should be adjusted:

[Screenshot: after]

Run it again and it works: Job succeeded.

federico_oldani
Level 2

[Screenshot: image.png]

Hello, can anyone confirm that this is the only place where we're able to change the max length of a string column?

AntonB
Level 2

Hi, correct. Either here or in the create table statement in the advanced tab.
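For instance, the edit in the advanced tab boils down to declaring the column type you want directly in the generated create table statement, along these lines (MySQL, placeholder names):

-- Customized CREATE TABLE with the text column declared as LONGTEXT
-- up front, instead of the default varchar(500).
-- Table and column names are placeholders.
CREATE TABLE example_output (
    id           INT,
    comment_text LONGTEXT
);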
