SQL Commands MAX and DISTINCT
I am trying to create a data input via a SQL query as shown below. I am getting errors from Dataiku for the "distinct" command and also for the "max" command. My problem is that the result is a 296 million row table when I really only need about 100k rows.
I know I could take the 296M row table and run a DISTINCT recipe and a TOPN recipe but both of these takes such a long time on 296M rows.
Is there a proper syntax for either of the "distinct" or "max" use cases below?
-----
select distinct
a.ACCOUNT_ID,
b.ACCOUNT_NAME,
c.VERSION_ID
from TABLE1
where VERSION_ID = (select max(VERSION_ID) from TABLE2)
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,248 Dataiker
If I am reading correctly you are using a subquery in the where clause which can lead to bad performance. You can try something like this instead and use a join so the subquery is only executed once.
SELECT DISTINCT a.ACCOUNT_ID, b.ACCOUNT_NAME, c.VERSION_ID
FROM TABLE1 a
JOIN (SELECT MAX(VERSION_ID) as VERSION_ID FROM TABLE2) b
ON a.VERSION_ID = b.VERSION_ID
Answers
-
This worked great for the max(VERSION_ID) !!! My dataset is now down to 2.8M rows. Distinct still giving me an error as shown below, but a DISTINCT recipe should be able to take care of this.
Syntax error: expected something between an integer and the 'distinct' keyword. -
dmunozp Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 2 ✭✭✭
Dataiku add top 10000 after select: "SELECT TOP 10000 DISCTINCT" is the real query that you run in the test and it is wrong. Try to put in a subquery:
select * from (
...your query...
) as X