Survey banner
The Dataiku Community is moving to a new home! New posts are now disabled and the community will shortly be in temporary read only mode: LEARN MORE

SQL Commands MAX and DISTINCT

Solved!
a76marine
Level 1
SQL Commands MAX and DISTINCT

I am trying to create a data input via a SQL query as shown below.  I am getting errors from Dataiku for the "distinct" command and also for the "max" command.  My problem is that the result is a 296 million row table when I really only need about 100k rows.

I know I could take the 296M row table and run a DISTINCT recipe and a TOPN recipe but both of these takes such a long time on 296M rows.

Is there a proper syntax for either of the "distinct" or "max" use cases below?  

-----

select distinct 
   a.ACCOUNT_ID,
   b.ACCOUNT_NAME,
   c.VERSION_ID

from TABLE1

where VERSION_ID = (select max(VERSION_ID) from TABLE2)

0 Kudos
1 Solution
AlexT
Dataiker

If I am reading correctly you are using a subquery in the where clause which can lead to bad performance. You can try something like this instead and use a join so the subquery is only executed once.


SELECT DISTINCT a.ACCOUNT_ID, b.ACCOUNT_NAME, c.VERSION_ID
FROM TABLE1 a
JOIN (SELECT MAX(VERSION_ID) as VERSION_ID FROM TABLE2) b
ON a.VERSION_ID = b.VERSION_ID

View solution in original post

0 Kudos
3 Replies
AlexT
Dataiker

If I am reading correctly you are using a subquery in the where clause which can lead to bad performance. You can try something like this instead and use a join so the subquery is only executed once.


SELECT DISTINCT a.ACCOUNT_ID, b.ACCOUNT_NAME, c.VERSION_ID
FROM TABLE1 a
JOIN (SELECT MAX(VERSION_ID) as VERSION_ID FROM TABLE2) b
ON a.VERSION_ID = b.VERSION_ID

0 Kudos
a76marine
Level 1
Author

This worked great for the max(VERSION_ID) !!!  My dataset is now down to 2.8M rows.  Distinct still giving me an error as shown below, but a DISTINCT recipe should be able to take care of this. 

Syntax error: expected something between an integer and the 'distinct' keyword.
0 Kudos
dmunozp
Level 1

Dataiku add top 10000 after select: "SELECT TOP 10000 DISCTINCT" is the real query that you run in the test and it is wrong. Try to put in a subquery:

select * from (

...your query...

) as X

0 Kudos