
Partition Data on Snowflake Dataiku

AkshayArora1
Level 2

We have 60+ billion rows on Snowflake. We are removing duplicates using DISTINCT or GROUP BY, but the query times out after 3 hours. Is there a more optimized way to do this? Can we partition the data on Snowflake to remove the duplicates?

1 Reply
AlexT
Dataiker

Hi @AkshayArora1 ,

This may be easier to handle over a support ticket where you can share the job diagnostics. 

A few things to check: make sure both the input and output datasets are in Snowflake, so the recipe can run with the in-database SQL engine and this large dataset never has to move out of Snowflake.

What timeout are you seeing exactly? Is it the OAuth credentials timing out? Depending on which timeout it is, you may be able to increase it.


Partitioning may not help here, because the GROUP BY/DISTINCT would then run per partition. That is only sufficient if duplicates occur within a single "DAY" or other discrete partition. If that holds, partitioning would let you split the job into smaller parts, run multiple partitions concurrently, and avoid re-running the whole long-running job every time.
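For illustration, a common in-database pattern for deduplicating a large Snowflake table is ROW_NUMBER() with QUALIFY, which avoids pulling data out of Snowflake. The table and column names below (events, id, updated_at) are assumptions; substitute your own duplicate key and ordering column:

```sql
-- Sketch: keep one row per business key, preferring the most recent record.
-- Hypothetical names: events, id, updated_at.
CREATE OR REPLACE TABLE events_dedup AS
SELECT *
FROM events
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY id           -- the key that defines a duplicate
    ORDER BY updated_at DESC  -- keep the latest version of each key
) = 1;
```

If duplicates can only occur within a single day, adding the date column to the PARTITION BY and running the recipe per Dataiku partition would let each day's slice be deduplicated independently.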

Thanks,

