Problem Joining a table with itself (file-based datasets)

PM
PM Registered Posts: 10 ✭✭✭✭

Hi all,

I noticed an unexpected issue while joining a table with itself ( from data as T1 join data as T2)

When filtering one of the instances of the table, the filter applies to both of the instances

When applying different filters to both instances, there is no problem.

I had to use a dummy filter (where key is defined) to instance 2 to avoid it being filtered

Im using Dataiku 5.1.4

Regards!

Best Answers

  • antonstam
    antonstam Partner, Registered Posts: 25 Partner
    Answer ✓

    This issue also occurs for me. I believe it has to do with how the underlying H2 database interprets the query it's given. I've synced the file dataset to a SQL database and passed the same query to that dataset, which is then executed correctly. The log file for the visual join execution is attached, I hope it is helpful.

  • ATsao
    ATsao Dataiker Alumni, Registered Posts: 139 ✭✭✭✭✭✭✭✭
    Answer ✓

    Hi,

    Thank you both for the additional details!

    So I indeed am able to reproduce the issue locally as well. Upon further investigation, this behavior does seem related to the underlying H2 implementation that is used for Join recipes when the DSS engine is selected specifically. In short, DSS will copy/dump the data into H2 first so that the join can be performed, but it appears that this filter might also already be included when the data is initially read into H2, thus affecting the subsequent join query. I will forward this internally to our Engineering team for further investigation.

    Meanwhile, as mentioned already, I would suggest continuing to either include the "dummy" filter as a workaround or to sync the dataset to some kind of database instead where this won't be an issue (as long as you make sure to use the in-database engine that is).

    Thanks,

    Andrew

Answers

  • ATsao
    ATsao Dataiker Alumni, Registered Posts: 139 ✭✭✭✭✭✭✭✭

    Hi,

    Could you clarify how you are applying the filter and what you mean that the filter is applying to "both instances"? Are you using a Filter recipe, is this filter being applied on the Explore tab of the output dataset, or something else?

    Also, how are you performing the join exactly? Are you using a join recipe where both input datasets are pointing to the same table or are you using a SQL recipe to handle the join?

    Thanks,

    Andrew

  • PM
    PM Registered Posts: 10 ✭✭✭✭

    Hi,

    yes, i'm using a Join recipe with pre-filters

    PM_0-1583224255950.png

    My dataset is file-based so i cannot use an SQL recipe

    Regards

Setup Info
    Tags
      Help me…