SQL Commands MAX and DISTINCT

Options
a76marine
a76marine Registered Posts: 2

I am trying to create a data input via a SQL query as shown below. I am getting errors from Dataiku for the "distinct" command and also for the "max" command. My problem is that the result is a 296 million row table when I really only need about 100k rows.

I know I could take the 296M row table and run a DISTINCT recipe and a TOPN recipe but both of these takes such a long time on 296M rows.

Is there a proper syntax for either of the "distinct" or "max" use cases below?

-----

select distinct
a.ACCOUNT_ID,
b.ACCOUNT_NAME,
c.VERSION_ID

from TABLE1

where VERSION_ID = (select max(VERSION_ID) from TABLE2)

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Answer ✓
    Options

    If I am reading correctly you are using a subquery in the where clause which can lead to bad performance. You can try something like this instead and use a join so the subquery is only executed once.


    SELECT DISTINCT a.ACCOUNT_ID, b.ACCOUNT_NAME, c.VERSION_ID
    FROM TABLE1 a
    JOIN (SELECT MAX(VERSION_ID) as VERSION_ID FROM TABLE2) b
    ON a.VERSION_ID = b.VERSION_ID

Answers

  • a76marine
    a76marine Registered Posts: 2
    Options

    This worked great for the max(VERSION_ID) !!! My dataset is now down to 2.8M rows. Distinct still giving me an error as shown below, but a DISTINCT recipe should be able to take care of this.

    Syntax error: expected something between an integer and the 'distinct' keyword.
  • dmunozp
    dmunozp Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 2
    Options

    Dataiku add top 10000 after select: "SELECT TOP 10000 DISCTINCT" is the real query that you run in the test and it is wrong. Try to put in a subquery:

    select * from (

    ...your query...

    ) as X

Setup Info
    Tags
      Help me…