'Sync'ing Redshift to S3 causes a syntax error
When I attempt to use the Sync visual recipe to put a Redshift table into S3, using the "Redshift to S3" recipe engine, I get this rather mysterious error:
Job failed: [Amazon](500310) Invalid operation: syntax error at or near ":" Position: 100;, caused by: ErrorException: [Amazon](500310) Invalid operation: syntax error at or near ":" Position: 100;
This is despite taking care to set the S3 Preview to use "Escaping only" quoting, either ',' or '\t' as a separator, and '\' as the escape character, as described in the documentation. Using "Excel" style quoting causes a different syntax error, saying it's near one of the column names. This failure only occurs on one table; others sync fine.
I also don't seem to be able to locate the generated SQL query in order to at least see what the actual syntax error is in context.
Using the "DSS" recipe engine works with this table, or would work, except that the table is pretty large and it bumps up against the rather puzzling S3 limitation of only allowing 10 000 part uploads; it would be easy to work around that if there were a way of specifying larger part sizes. Currently the transfer uses the minimum 5 MiB part size, and I don't know how to change that.
Answers
-
Alexandru (Dataiker)
Hi,
Using the DSS engine to export data from Redshift to S3 is not recommended as it can be inefficient.
The error suggests there is an issue with one of the column names, perhaps one that can't be escaped correctly when the SQL is generated. I've tried various combinations but could not get it to fail with commas, quotes, ':', etc., so I'm unsure which column in your Redshift dataset could be causing this.
Can you confirm:
1) Your DSS version
2) The SQL statement printed in the logs? If you look at the job Activity log you will see a line with "Executing SQL : UNLOAD", something like the one below. If it contains any sensitive information you can't share here, you can open a support ticket instead.
[13:52:07] [INFO] [dku.recipes.sync.s3toredshift] - Executing SQL : UNLOAD ('SELECT "userid",REPLACE("username",\'"\',\'""\'),REPLACE("firstname",\'"\',\'""\'),REPLACE("lastname",\'"\',\'""\'),REPLACE("city",\'"\',\'""\'),REPLACE("state",\'"\',\'""\'),REPLACE("email",\'"\',\'""\'),REPLACE("phone",\'"\',\'""\'),CASE WHEN "likesports" IS TRUE THEN \'true\' WHEN "likesports" IS FALSE THEN \'false\' END,CASE WHEN "liketheatre" IS TRUE THEN \'true\' WHEN "liketheatre" IS FALSE THEN \'false\' END,CASE WHEN "likeconcerts" IS TRUE THEN \'true\' WHEN "likeconcerts" IS FALSE THEN \'false\' END,CASE WHEN "likejazz" IS TRUE THEN \'true\' WHEN "likejazz" IS FALSE THEN \'false\' END,CASE WHEN "likeclassical" IS TRUE THEN \'true\' WHEN "likeclassical" IS FALSE THEN \'false\' END,CASE WHEN "likeopera" IS TRUE THEN \'true\' WHEN "likeopera" IS FALSE THEN \'false\' END,CASE WHEN "likerock" IS TRUE THEN \'true\' WHEN "likerock" IS FALSE THEN \'false\' END,CASE WHEN "likevegas" IS TRUE THEN \'true\' WHEN "likevegas" IS FALSE THEN \'false\' END,CASE WHEN "likebroadway" IS TRUE THEN \'true\' WHEN "likebroadway" IS FALSE THEN \'false\' END,CASE WHEN "likemusicals" IS TRUE THEN \'true\' WHEN "likemusicals" IS FALSE THEN \'false\' END from "public"."users" ') to 's3://...../TESTING_S3_NEW/users_copy/out-s' ALLOWOVERWRITE DELIMITER AS '\t' ADDQUOTES GZIP IAM_ROLE
-
Hi,
Regarding the 10,000-part limit on Amazon S3 multipart uploads, you can configure the part size as follows:
1) Stop the DSS instance:
DATADIR/bin/dss stop
2) Edit the DATADIR/config/dip.properties file and add the following key:
dku.s3.upload.multipart.chunk.size=XXXXXXX
Where XXXXXXX is the size of one part in bytes (larger than the 5242880-byte / 5 MiB minimum, of course); see the example after step 3.
3) Start the DSS instance:
DATADIR/bin/dss start
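For example (an illustrative value, not an official recommendation), a 100 MiB part size would be:
dku.s3.upload.multipart.chunk.size=104857600
With the 10,000-part limit, 104857600 bytes (100 MiB) per part raises the per-upload ceiling to a little under 1 TiB.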
- Best
-
We are running Dataiku version 8.05. The column names may be sensitive, but it appears the problem is in the date handling for one of them, as this is the only part of the query that has a ':' in it:
TO_CHAR(cast("redacted" as TIMESTAMPTZ),\'YYYY-MM-DDThh24:MI:SS.MSZ\')
I will check with my contact in IT to see how they want the ticket filed.
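In case it helps anyone reproduce this outside of DSS, the suspect expression can be isolated in a hand-written UNLOAD modelled on the statement from the logs (the bucket, IAM role, table and column names below are placeholders, not our real ones):
UNLOAD ('SELECT TO_CHAR(CAST("redacted" AS TIMESTAMPTZ), \'YYYY-MM-DDThh24:MI:SS.MSZ\') FROM "public"."some_table"') TO 's3://my-bucket/test/out-' IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' DELIMITER AS '\t' ADDQUOTES GZIP ALLOWOVERWRITE
If that fails with the same "syntax error at or near ':'" message, it would confirm the colon in the TO_CHAR format string is what the generated SQL is tripping over.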