
Dataset SQL Query

ls
Level 2

Hello, how do I remove the "Top N" clause that Dataiku adds when my new dataset is set up using SQL query mode? Here is the error...

"Failed to read data from DB
Failed to read data from DB, caused by: SQLException: [Teradata Database] [TeraJDBC 16.00.00.24] [Error 6916] [SQLState HY000] TOP N Syntax error: Top N option is not supported with QUALIFY clause.
HTTP code: 500, type: com.dataiku.dip.exceptions.DataStoreIOException"

Thanks!

6 Replies
CoreyS
Community Manager

Hi, @ls! Can you provide any further details on the thread to assist users in helping you find a solution (for example, your DSS version)? Also, can you let us know if you've tried any fixes already? This should lead to a quicker response from the community.

Looking for more resources to help you use DSS effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark as ‘Accepted Solution’ to help others like you!
ls
Level 2
Author

I created a dataset using a SQL query like this...

select c1, c2, c3, date1
from t1
qualify row_number() over(partition by c1 order by date1 desc) = 1

The source, Teradata, returned a Top N error. I didn't put TOP N into my query, so I believe Dataiku added it.

I solved it by putting the query in a derived table like this...

select *
from
( select c1, c2, c3, date1
  from t1
  qualify row_number() over(partition by c1 order by date1 desc) = 1
) a

Any other suggestions?

Marlan
Neuron

Hi @ls,

If you need to use a dataset that represents a SQL query then your solution is a good one.

I don't often use datasets defined in this way since they can't be used with downstream SQL recipes (and maybe other types of recipes).  

Another perhaps obvious option is to define a SQL table dataset (rather than a SQL query dataset) that points to t1 and then use a subsequent SQL recipe containing the qualify logic to create a new dataset / table. The drawback is that you'll end up with a second table.

A variation on this is that you could include the qualify logic to create a temp table at the beginning of a downstream recipe and avoid the second table.
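On Teradata, that temp-table variation could look something like the sketch below (the table name t1_latest is made up here; the columns come from the query earlier in the thread):

```sql
-- Sketch: materialize the QUALIFY filter in a session-scoped volatile table
-- at the start of a downstream recipe (t1_latest is a hypothetical name).
CREATE VOLATILE TABLE t1_latest AS (
  SELECT c1, c2, c3, date1
  FROM t1
  QUALIFY ROW_NUMBER() OVER (PARTITION BY c1 ORDER BY date1 DESC) = 1
) WITH DATA
ON COMMIT PRESERVE ROWS;
```

A volatile table lives only for the session, so no second permanent table is left behind.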

Note that if you use a SQL Script type SQL recipe, you don't need to specify an input dataset. You could reference t1 in such a recipe directly without first creating a SQL table dataset.
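A SQL Script recipe body along those lines might look like this (a sketch only; the output table name t1_latest is hypothetical):

```sql
-- Sketch of a SQL Script recipe referencing t1 directly, with no
-- SQL table dataset as input (t1_latest is a hypothetical name).
CREATE TABLE t1_latest AS (
  SELECT c1, c2, c3, date1
  FROM t1
  QUALIFY ROW_NUMBER() OVER (PARTITION BY c1 ORDER BY date1 DESC) = 1
) WITH DATA;
```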

Marlan
ls
Level 2
Author

Thanks very much @Marlan, the input is invaluable.

I am providing you with more context. 

T1 is a daily table with 100+ million records, optimized by indexes and row partitioning on Teradata. The expected result set is about 100K records. If I set up the dataset in Dataiku by pointing it at the table without the QUALIFY filter, I believe Dataiku would just submit a simple SELECT statement and Teradata would do a full table scan, which is not ideal; in fact, the query would eventually get rejected due to high CPU consumption, as I don't have DBA privileges.

The next downstream step is a left join of T1 and T2, with T2 on the right side. T2 is on Oracle and the expected result set is about half a million records. I didn't expect Dataiku to handle this cross-database join efficiently, and wanted to avoid data movement.

Marlan
Neuron

Hi @ls,

The additional context is helpful.

Can you create tables on Teradata? If so, what I'd do is create a DSS dataset that simply points to T1 (i.e., a SQL table dataset) and then use a SQL code recipe that applies the QUALIFY filter and saves the result to a new table. Then join T2 to that table. DSS shouldn't do a table scan but rather simply execute the SQL query with the QUALIFY - so no different from your workaround.
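If you go that route, the recipe body could be as simple as the filtering SELECT itself, since (as I understand it) DSS materializes a SQL query recipe's result into the output dataset's table for you (a sketch, reusing the query from earlier in the thread):

```sql
-- Body of a SQL query recipe whose input is the T1 table dataset;
-- DSS writes the result to the recipe's output table.
SELECT c1, c2, c3, date1
FROM t1
QUALIFY ROW_NUMBER() OVER (PARTITION BY c1 ORDER BY date1 DESC) = 1
```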

If you can't create tables or don't want to, then your workaround seems like the best option.

Note that it might be more efficient to copy the filtered T1 (with 100K rows) over to Oracle (if you can) so the join can run in-database. I don't know how DSS does joins across databases, but it may be that data is read in from both sources and DSS then does the join internally (rather than in one of the databases). You could go the other way (Oracle to Teradata), but it sounds like the Oracle table is bigger, and it's obviously better to copy the smaller table.

Marlan

ls
Level 2
Author

Thanks @Marlan

It's easier, and perhaps more intuitive for me, to simply join them using Spark.
