Redshift slow to write out results
I have several Redshift datasets already built in a project, and I was trying to deploy some very basic recipes. For instance, on one dataset - let's call it dataset X - I was simply trying to create a new dataset containing only the first two columns of X, using a Redshift connection.
When I go to deploy the script (there are roughly 1.5 million rows and 2 columns that I want to write out), I see some errors. Upon inspecting the log files, I see the following issue:
It appears that writing out 10,000 lines takes on average 5 minutes, which seems like a very long time - at that rate, the full 1.5 million rows would take over 12 hours. I have tried implementing this process using both a visual lab and a Python recipe, but the same issue occurs in both cases.
I ended up moving the datasets off Redshift and saving them as new datasets using a Postgres connection, and then everything was fast again. So I feel as if this must be some issue with SQL writes and Redshift not playing well together, but I'm not sure exactly why. Any insight into this would be very helpful; thanks!
Answers
Hi,
Writing results directly into Redshift from DSS is indeed extremely inefficient. This is in the nature of Redshift, which is a database fully optimized for large-scale loading from S3 only.
The solution is to send your data to S3 first and then use the "Sync" recipe to request a data load from S3 to Redshift. The sync recipe will automatically use the fast path. See:
https://www.dataiku.com/learn/guide/connecting/dss-and-redshift.html and https://doc.dataiku.com/dss/latest/other_recipes/sync.html#specific-case-s3-to-redshift
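For reference, the fast path used by the Sync recipe essentially amounts to a Redshift COPY from S3, which loads files in bulk instead of inserting rows one by one. A minimal sketch of what such a load looks like, with a hypothetical table name, bucket path and IAM role (DSS generates the real statement for you):

-- bulk-load files previously exported to S3 into a Redshift table
COPY my_table
FROM 's3://my-bucket/exports/dataset_x/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV
GZIP;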
SQL recipes execute in-database (provided that both input and output are in the same DB), so you don't need to do anything special; they will be fast.
However, the "Prepare" recipe does not run in-database for the moment: it streams the data out of the DB, prepares it on the fly and streams it back. In that scenario, you'll need to use S3 as the output of the Prepare recipe (and then Sync it to Redshift as above).
If the only preparation you want to do is selecting a few columns, you'd really want to use a SQL recipe instead.
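For example, in your case the whole job could be a single in-database SQL recipe. Assuming hypothetical column names col_a and col_b on your dataset X, the query would be along the lines of:

-- keep only the first two columns; runs entirely inside Redshift
SELECT col_a, col_b
FROM dataset_x;

DSS takes care of writing the result into the output table, and since input and output are on the same Redshift connection, nothing is streamed through DSS.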
Hope this helps.
I understand that Redshift recommends the use of COPY (http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) for its data loading, but it can still ingest data at a reasonable rate when using INSERT combined with the multi-row syntax (http://docs.aws.amazon.com/redshift/latest/dg/r_INSERT_30.html):
INSERT INTO mytable (a, b, c) VALUES (1, 2, 3), (4, 5, 6) ...
I have been able to achieve rates of thousands of rows per second with this approach when using 10k-row batches. This was on the lowest-spec Redshift instance.
For this specific example, would DSS send 10k individual INSERT statements rather than using the multi-row syntax?