When working with Redshift tables in DSS visual recipes, we noticed that the table creation settings sometimes default certain column lengths to the Redshift maximum (65,535). In many cases this is excessive. For example, in the screenshot below the "brand" column has a length of 65,535, yet most of its values span fewer than 10 characters.
We wanted to better understand the logic behind the default column lengths for Redshift, and whether there is a safe, proper way to modify them.
When creating datasets from an input Redshift table (or other SQL databases), DSS automatically fetches the column lengths from the table.
However, when creating a new Redshift dataset from columns which do not have a fixed length (as is the case, for example, when syncing from a CSV or Parquet file), DSS cannot know what the maximum size will be. It therefore takes a safe approach and defaults to the maximum length the database allows.
You can of course modify the length in the schema screen of the dataset.
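If you want a data-driven value to enter in the schema screen rather than a guess, one option is to measure the longest value per column in the source file yourself. A minimal sketch in plain Python (not a DSS API; the `max_column_lengths` helper and the sample data are hypothetical), which measures lengths in UTF-8 bytes since Redshift varchar lengths are counted in bytes, not characters:

```python
import csv
import io

def max_column_lengths(csv_text):
    """Return the longest value per column, in UTF-8 bytes.

    Useful for picking a sensible varchar length before editing
    a dataset schema by hand. Redshift counts varchar(n) in bytes,
    so multibyte characters are measured after encoding.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lengths = {}
    for row in reader:
        for col, val in row.items():
            byte_len = len((val or "").encode("utf-8"))
            lengths[col] = max(lengths.get(col, 0), byte_len)
    return lengths

# Hypothetical sample mirroring the "brand" example above
sample = "brand,description\nAcme,Short\nGlobex,A slightly longer text\n"
print(max_column_lengths(sample))  # {'brand': 6, 'description': 22}
```

In practice you would add headroom on top of the observed maximum (future rows may be longer), but this gives a far tighter bound than the 65,535 default.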
DSS does not "simulate" the entire process in order to know what the maximum length may be, as it would be extremely slow and brittle.
We are looking into potential future improvements to make the default maximum length configurable. This will make it easier to work with SQL databases like Redshift, where the maximum length of varchars has a performance impact (it does not for most SQL databases).