Hive Query Odd Behaviour

MarcioCoelho
MarcioCoelho Dataiku DSS Core Designer, Registered Posts: 12 ✭✭✭✭
edited July 16 in General Discussion

Hey everyone,

I recently came across an odd behavior where two very similar hive queries behave in very differently.

The first query is something like this:

SELECT 
a.col1,
a.col2,
a.col3,
b.col1,
b.col2
from bd.table1 as a
left join bd.table2 as b on a.col1 = b.col5
where 
a.col1 is not null and
a.col2 between 20210101 and 20210131

and the second query is the same but without the join clause and respective table b columns:

SELECT 
a.col1,
a.col2,
a.col3
from bd.table1 as a
where 
a.col1 is not null and
a.col2 between 20210101 and 20210131

And what has been bugging my mind, is that query 1 runs in 2-5mins without any issue, while query takes over 1h to run, if it manages to finish at all - possible error causes are due to compaction of source tables.

Table a has around 100kk rows, while table b has only 6kk rows. I found this issue when I decided to remove the join from the query, as it was no longer needed, and all of sudden the query started taking to long to run and randomly failing.

I've tried join table a with other tables, but the issue persists.

I've also tried to run said queries in our hive platform, and both took around the same time (10min) without any issue.

Any ideas to help me better understand this odd behavior and possibly fix it, are more than welcome!

If you need more information, feel free to ask.

Thanks!


Operating system used: Windows

Tagged:
Setup Info
    Tags
      Help me…