Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
I am trying to create a data input via a SQL query as shown below. I am getting errors from Dataiku for the "distinct" command and also for the "max" command. My problem is that the result is a 296 million row table when I really only need about 100k rows.
I know I could take the 296M row table and run a DISTINCT recipe and a TOPN recipe but both of these takes such a long time on 296M rows.
Is there a proper syntax for either of the "distinct" or "max" use cases below?
-----
select distinct
a.ACCOUNT_ID,
b.ACCOUNT_NAME,
c.VERSION_ID
from TABLE1
where VERSION_ID = (select max(VERSION_ID) from TABLE2)
If I am reading correctly you are using a subquery in the where clause which can lead to bad performance. You can try something like this instead and use a join so the subquery is only executed once.
SELECT DISTINCT a.ACCOUNT_ID, b.ACCOUNT_NAME, c.VERSION_ID
FROM TABLE1 a
JOIN (SELECT MAX(VERSION_ID) as VERSION_ID FROM TABLE2) b
ON a.VERSION_ID = b.VERSION_ID
If I am reading correctly you are using a subquery in the where clause which can lead to bad performance. You can try something like this instead and use a join so the subquery is only executed once.
SELECT DISTINCT a.ACCOUNT_ID, b.ACCOUNT_NAME, c.VERSION_ID
FROM TABLE1 a
JOIN (SELECT MAX(VERSION_ID) as VERSION_ID FROM TABLE2) b
ON a.VERSION_ID = b.VERSION_ID
This worked great for the max(VERSION_ID) !!! My dataset is now down to 2.8M rows. Distinct still giving me an error as shown below, but a DISTINCT recipe should be able to take care of this.