Problem Joining a table with itself (file-based datasets)

Solved!
PM
Level 2
Problem Joining a table with itself (file-based datasets)

Hi all,

I noticed an unexpected issue while joining a table with itself ( from data as T1 join data as T2)

When filtering one of the instances of the table, the filter applies to both of the instances

When applying different filters to both instances, there is no problem.

 

I had to use a dummy filter (where key is defined) to instance 2 to avoid it being filtered

Im using Dataiku 5.1.4

Regards!

0 Kudos
2 Solutions
antonstam
Level 3

This issue also occurs for me. I believe it has to do with how the underlying H2 database interprets the query it's given. I've synced the file dataset to a SQL database and passed the same query to that dataset, which is then executed correctly. The log file for the visual join execution is attached, I hope it is helpful.

View solution in original post

0 Kudos
ATsao
Dataiker

Hi,

Thank you both for the additional details!

So I indeed am able to reproduce the issue locally as well. Upon further investigation, this behavior does seem related to the underlying H2 implementation that is used for Join recipes when the DSS engine is selected specifically. In short, DSS will copy/dump the data into H2 first so that the join can be performed, but it appears that this filter might also already be included when the data is initially read into H2, thus affecting the subsequent join query. I will forward this internally to our Engineering team for further investigation. 

Meanwhile, as mentioned already, I would suggest continuing to either include the "dummy" filter as a workaround or to sync the dataset to some kind of database instead where this won't be an issue (as long as you make sure to use the in-database engine that is).

Thanks,

Andrew

View solution in original post

4 Replies
ATsao
Dataiker

Hi,

Could you clarify how you are applying the filter and what you mean that the filter is applying to "both instances"? Are you using a Filter recipe, is this filter being applied on the Explore tab of the output dataset, or something else? 

Also, how are you performing the join exactly? Are you using a join recipe where both input datasets are pointing to the same table or are you using a SQL recipe to handle the join? 

Thanks,

Andrew

0 Kudos
PM
Level 2
Author

Hi,

yes, i'm using a Join recipe with pre-filters

PM_0-1583224255950.png

 

My dataset is file-based so i cannot use an SQL recipe

Regards

0 Kudos
antonstam
Level 3

This issue also occurs for me. I believe it has to do with how the underlying H2 database interprets the query it's given. I've synced the file dataset to a SQL database and passed the same query to that dataset, which is then executed correctly. The log file for the visual join execution is attached, I hope it is helpful.

0 Kudos
ATsao
Dataiker

Hi,

Thank you both for the additional details!

So I indeed am able to reproduce the issue locally as well. Upon further investigation, this behavior does seem related to the underlying H2 implementation that is used for Join recipes when the DSS engine is selected specifically. In short, DSS will copy/dump the data into H2 first so that the join can be performed, but it appears that this filter might also already be included when the data is initially read into H2, thus affecting the subsequent join query. I will forward this internally to our Engineering team for further investigation. 

Meanwhile, as mentioned already, I would suggest continuing to either include the "dummy" filter as a workaround or to sync the dataset to some kind of database instead where this won't be an issue (as long as you make sure to use the in-database engine that is).

Thanks,

Andrew