Partition Data on Snowflake Dataiku
We have 60+ billion rows on Snowflake. We are removing duplicates using DISTINCT or GROUP BY, but the query times out after 3 hours. Is there a more optimized way to do this? Can we partition the data on Snowflake, or remove the duplicates some other way?
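For context, the job presumably runs something like the sketch below; the table and column names are illustrative placeholders, not taken from the actual setup:

    -- Illustrative baseline: full-table dedup in a single statement.
    -- On tens of billions of rows, one monolithic query like this can
    -- exceed statement timeouts, matching the behavior described.
    CREATE OR REPLACE TABLE events_dedup AS
    SELECT DISTINCT *
    FROM events;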
Answers
Alexandru (Dataiker)
Hi @AkshayArora1,
This may be easier to handle over a support ticket, where you can share the job diagnostics.
A few things to check: are both the input and output datasets in Snowflake? If so, you can leverage the in-database SQL engine and avoid moving this large dataset outside of Snowflake.
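If the recipe does run in-database, a common Snowflake-side alternative to DISTINCT is QUALIFY with ROW_NUMBER(), which keeps one row per key; the table and column names below are assumptions for illustration:

    -- Keep the most recent row per event_id (names are hypothetical).
    -- QUALIFY filters on the window function without needing a subquery.
    CREATE OR REPLACE TABLE events_dedup AS
    SELECT *
    FROM events
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY event_id      -- the duplicate key
        ORDER BY loaded_at DESC    -- which copy to keep
    ) = 1;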
What timeout are you seeing exactly? Is it the OAuth credentials timing out? Depending on which timeout it is, you may be able to increase it.
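If it turns out to be Snowflake's own statement timeout rather than a Dataiku or OAuth one, it can be raised at the session or warehouse level; the values and warehouse name below are just examples:

    -- Example only: allow statements to run up to 6 hours in this session.
    ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 21600;

    -- Or at the warehouse level (affects all queries on that warehouse):
    ALTER WAREHOUSE my_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 21600;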
Partitioning may not help here unless duplicates only occur within a specific discrete partition (for example, within a single "DAY"), because the GROUP BY/DISTINCT would run separately on each partition. If that assumption holds, partitioning would let you split the job into smaller parts and run multiple partitions concurrently, and it would also mean you don't have to re-run the whole long-running job every time; see the sketch below.
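As a sketch of that per-partition approach, assuming duplicates never span days and reusing the hypothetical names from above:

    -- Deduplicate one day per run: each run stays small, partitions can
    -- run in parallel, and a failure only forces re-running that day.
    INSERT INTO events_dedup
    SELECT *
    FROM events
    WHERE event_date = '2024-01-01'   -- one partition value per job run
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY event_id
        ORDER BY loaded_at DESC
    ) = 1;

Thanks,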