Discover all of the brand-new features and improvements to existing capabilities in the Dataiku 11.3 updateLET'S GO

SQL Commands MAX and DISTINCT

Solved!
a76marine
Level 1
SQL Commands MAX and DISTINCT

I am trying to create a data input via a SQL query as shown below.  I am getting errors from Dataiku for the "distinct" command and also for the "max" command.  My problem is that the result is a 296 million row table when I really only need about 100k rows.

I know I could take the 296M row table and run a DISTINCT recipe and a TOPN recipe but both of these takes such a long time on 296M rows.

Is there a proper syntax for either of the "distinct" or "max" use cases below?  

-----

select distinct 
   a.ACCOUNT_ID,
   b.ACCOUNT_NAME,
   c.VERSION_ID

from TABLE1

where VERSION_ID = (select max(VERSION_ID) from TABLE2)

0 Kudos
1 Solution
AlexT
Dataiker

If I am reading correctly you are using a subquery in the where clause which can lead to bad performance. You can try something like this instead and use a join so the subquery is only executed once.


SELECT DISTINCT a.ACCOUNT_ID, b.ACCOUNT_NAME, c.VERSION_ID
FROM TABLE1 a
JOIN (SELECT MAX(VERSION_ID) as VERSION_ID FROM TABLE2) b
ON a.VERSION_ID = b.VERSION_ID

View solution in original post

0 Kudos
2 Replies
AlexT
Dataiker

If I am reading correctly you are using a subquery in the where clause which can lead to bad performance. You can try something like this instead and use a join so the subquery is only executed once.


SELECT DISTINCT a.ACCOUNT_ID, b.ACCOUNT_NAME, c.VERSION_ID
FROM TABLE1 a
JOIN (SELECT MAX(VERSION_ID) as VERSION_ID FROM TABLE2) b
ON a.VERSION_ID = b.VERSION_ID

0 Kudos
a76marine
Level 1
Author

This worked great for the max(VERSION_ID) !!!  My dataset is now down to 2.8M rows.  Distinct still giving me an error as shown below, but a DISTINCT recipe should be able to take care of this. 

Syntax error: expected something between an integer and the 'distinct' keyword.
0 Kudos