Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Added on April 2, 2025 4:19PM
Likes: 0
Replies: 5
My team uses Dataiku to write flows consisting mostly of SQL recipes with data stored on an Amazon Redshift server. I have noticed the odd behavior that ORDER BY clauses in SQL recipes do not seem to have any effect on the resulting Dataiku dataset. They do, however, still work inside SQL notebooks, which suggests that it's not a matter of my writing the ORDER BY clause incorrectly.
As a reproducible example, I uploaded the R demo dataset 'mtcars' to Dataiku and synced it to our Redshift server as a Redshift table. I run the following query:
SELECT gear, am, vs, cyl, mpg, wt
FROM "FOO_BAR_mtcars_rs"
ORDER BY gear, am, vs, cyl, mpg
Within a SQL notebook, this creates the anticipated ordering.
Within the flow, however, the resulting Redshift table does not have the anticipated ordering. In column "gear", values of 3 and 4 are interleaved, rather than all the 3s appearing, then the 4s. Within values of gear, columns am, vs, cyl, and mpg don't appear to have any meaningful ordering either.
Can anybody reproduce and explain this issue?
Operating system used: Windows
This is by design. The Dataset Explore tab allows you to sort the data but this doesn't work on the real data but a sample. The bigger question is why do you need the data sorted? Sorting is only usually needed needed for display purposes. Most recipes don't support sorting and SQL databases can't not guarantee the ordering of the data returned unless the order by clause is used.
I would like the data sorted because I then export it as a CSV or Excel file for an automated email report. It is inelegant for me or the users to have to download the report and sort it within Excel. Is there a different way I should be preparing a table for reporting via email attachment?
My preferred solution for that use case will be a Python recipe which will write the CSV or Excel file to a managed folder. Then I can send that in a Mail Reporter or Send Mail step. Are you looking for a code or no-code solution? Personally I much prefer the code solution here since I can customise the Excel file with a lot more features than the plain export you get from Dataiku.
I will use an R recipe to sort the data and save it to a managed folder, then. Thanks for explaining that the SQL is behaving as intended, however unexpected it was to me.
There is a Sort recipe as well but it's a bit useless unless output dataset technology supports storing / retrieving data ordered.