Dataset SQL Query
Hello, how do I remove the "Top N" clause that Dataiku adds when my new dataset is set up in SQL query mode? Here is the error...
"Failed to read data from DB
Failed to read data from DB, caused by: SQLException: [Teradata Database] [TeraJDBC 16.00.00.24] [Error 6916] [SQLState HY000] TOP N Syntax error: Top N option is not supported with QUALIFY clause.
HTTP code: 500, type: com.dataiku.dip.exceptions.DataStoreIOException"
Thanks!
Answers
-
CoreyS (Dataiker Alumni)
Hi @ls! Can you provide any further details on the thread, such as your DSS version, to assist users in helping you find a solution? Also, can you let us know if you've tried any fixes already? This should lead to a quicker response from the community.
-
I created a dataset by using a SQL query like this...
select c1, c2, c3, date1
from t1
qualify row_number() over(partition by c1 order by date1 desc) = 1
The source, Teradata, returned a Top N error. I didn't put TOP N into my query, so I believe Dataiku added it.
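Presumably (this is just my assumption about how DSS wraps the query when sampling) the submitted statement looked something like the following, which Teradata rejects because, per the error message, the TOP N option is not supported together with a QUALIFY clause in the same query block:

-- Assumed shape of the statement DSS submitted; the TOP 10000 is my guess
select top 10000 c1, c2, c3, date1
from t1
qualify row_number() over(partition by c1 order by date1 desc) = 1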
I solved it by putting the query in a derived table like this...
select *
from
( select c1, c2, c3, date1
from t1
qualify row_number() over(partition by c1 order by date1 desc) = 1
) a
This works because the QUALIFY ends up in the inner query block, so the injected TOP N only applies to the outer select. Any other suggestions?
-
Marlan (Neuron)
Hi @ls,
If you need to use a dataset that represents a SQL query, then your solution is a good one.
I don't often use datasets defined in this way since they can't be used with downstream SQL recipes (and maybe other types of recipes).
Another perhaps obvious option is to define a SQL table dataset (rather than a SQL query dataset) that points to t1 and then use a subsequent SQL recipe containing the qualify logic to create a new dataset / table. The drawback is that you'll end up with a second table.
A variation on this is to include the qualify logic at the beginning of a downstream recipe to create a temp table, avoiding the second permanent table.
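For example, something like this at the top of the downstream recipe (a rough sketch; the table name is made up, and the volatile table syntax is Teradata-specific):

-- Stage the filtered rows in a session-scoped volatile table
-- instead of a permanent second table
create volatile table t1_latest as
(
  select c1, c2, c3, date1
  from t1
  qualify row_number() over(partition by c1 order by date1 desc) = 1
) with data
on commit preserve rows;

-- The rest of the recipe can then select from t1_latest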
Note that if you use a SQL Script type SQL recipe, you don't need to specify an input dataset. You could reference t1 in such a recipe directly without first creating a SQL table dataset.
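A minimal sketch of such a script (the output table name here is just an example):

-- SQL Script recipe body: reference t1 directly, no input dataset required
create table t1_filtered as
(
  select c1, c2, c3, date1
  from t1
  qualify row_number() over(partition by c1 order by date1 desc) = 1
) with data;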
Marlan
-
Thanks very much @Marlan, the input is invaluable. Let me provide some more context.
T1 is a daily table with 100+ million records, optimized by indexes and row partitioning on Teradata. The expected result set is about 100K records. If I set up the dataset in Dataiku by pointing it directly to the table without the qualify filter, I believe Dataiku would just submit a simple select statement and Teradata would do a full table scan, which is not ideal; in fact, the query would eventually get rejected due to high CPU consumption, as I don't have DBA privileges.
The next downstream step is a left join of T1 and T2, with T2 on the right side. T2 is on Oracle, and the expected result set is about half a million records. I didn't expect Dataiku to handle this cross-database join efficiently, and I wanted to avoid data movement.
-
Marlan (Neuron)
Hi @ls,
The additional context is helpful.
Can you create tables on Teradata? If so, what I'd do is create a DSS dataset that simply points to T1 (i.e., a SQL table dataset) and then use a SQL code recipe that applies the qualify filter and saves the result to a new table. Then join T2 to that table. DSS shouldn't do a table scan but rather simply execute the SQL query with the qualify filter, so it's no different from your workaround.
If you can't create tables or don't want to, then your workaround seems like the best option.
Note that it might be more efficient to copy the filtered T1 (with 100K rows) over to Oracle (if you can) so the join can run in-database. I don't know how DSS executes joins across databases, but it may read data in from both sources and do the join internally (rather than in one of the databases). You could go the other way (Oracle to Teradata), but it sounds like the Oracle table is bigger, and it's obviously better to copy the smaller table.
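For instance, once the filtered T1 rows are synced to an Oracle table (the table names, join key, and columns below are hypothetical), the join could run entirely in Oracle:

-- Runs fully in Oracle after the ~100K filtered rows are copied over
select f.c1, f.c2, f.c3, f.date1, t2.c4
from t1_filtered_ora f
left join t2
  on f.c1 = t2.c1;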
Marlan
-
Thanks @Marlan. It's easier and perhaps more intuitive for me to simply join them using Spark.
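For anyone finding this later: in a SparkSQL recipe the join is plain SQL over the two input datasets (the dataset names and join key here are placeholders):

-- SparkSQL recipe: DSS reads both input datasets into Spark and the
-- left join runs there, rather than in either database
select f.c1, f.c2, f.c3, f.date1, t2.c4
from t1_filtered f
left join t2
  on f.c1 = t2.c1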