Partition Data on Snowflake Dataiku

AkshayArora1 Partner, Registered Posts: 11 Partner

We have more than 60 billion records on Snowflake. We are removing duplicates using DISTINCT or GROUP BY, but the query times out after 3 hours. Is there a more optimized approach we can try? Can we partition the data on Snowflake, or remove the duplicates some other way?


Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi @AkshayArora1,

    This may be easier to handle over a support ticket where you can share the job diagnostics.

    A few things to check: are both the input and output datasets in Snowflake? If so, you can leverage the SQL engine and avoid moving this large dataset out of Snowflake.
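As a rough illustration of keeping the work inside Snowflake, the sketch below only builds the SQL text for an in-database deduplication; it does not connect to anything. The function name `build_dedup_sql` and all table and column names are hypothetical, and the `QUALIFY ROW_NUMBER()` pattern is one common Snowflake approach, not necessarily what a DSS recipe generates.

```python
def build_dedup_sql(source_table, target_table, key_columns, order_column):
    """Build a CREATE TABLE AS SELECT statement that keeps one row per key.

    All names are hypothetical placeholders. Identifiers are not escaped,
    so only pass trusted table and column names.
    """
    if not key_columns:
        raise ValueError("at least one key column is required")
    keys = ", ".join(key_columns)
    return (
        f"CREATE OR REPLACE TABLE {target_table} AS\n"
        f"SELECT *\n"
        f"FROM {source_table}\n"
        # Keep only the first row of each key group, ordered by order_column.
        f"QUALIFY ROW_NUMBER() OVER (PARTITION BY {keys} ORDER BY {order_column}) = 1"
    )


# Hypothetical table and column names, for illustration only.
sql = build_dedup_sql(
    "RAW_EVENTS", "EVENTS_DEDUP", ["EVENT_ID", "EVENT_DAY"], "LOADED_AT"
)
print(sql)
```

Because the statement reads from and writes to Snowflake tables, the data never has to leave the warehouse; whether it finishes within the timeout still depends on warehouse size and how the data is clustered.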

    What timeout are you seeing exactly? Is it the OAuth credentials timing out? Depending on which timeout it is, you can try to increase it.


    Partitioning may not help here, because you would be running the Group By/Distinct on each partition separately. That is only sufficient if you are looking for duplicates within a specific "DAY" or within another specific discrete partition. If it is, partitioning would let you split the job into smaller parts and run multiple partitions concurrently. It would also mean you don't have to re-run the whole long-running job every time.
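To make the caveat about partitions concrete, here is a small in-memory sketch (plain Python, not Snowflake or DSS code). The helper names `dedup` and `dedup_per_partition` and the sample rows are made up for illustration. It shows that deduplicating each day separately matches a global deduplication only when the duplicate key includes the partition column.

```python
from collections import defaultdict


def dedup(rows, key):
    """Keep the first row seen for each key value."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


def dedup_per_partition(rows, partition_of, key):
    """Deduplicate each partition independently, then concatenate the results."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_of(row)].append(row)
    kept = []
    for part in sorted(partitions):
        kept.extend(dedup(partitions[part], key))
    return kept


# Hypothetical rows of (day, customer_id).
rows = [
    ("2024-01-01", "a"),
    ("2024-01-01", "a"),
    ("2024-01-02", "a"),
    ("2024-01-02", "b"),
]


def day(row):
    return row[0]


# Case 1: a duplicate means "same day and same customer".
whole_row = lambda row: row
global_1 = dedup(rows, whole_row)
per_day_1 = dedup_per_partition(rows, day, whole_row)

# Case 2: a duplicate means "same customer on any day".
customer = lambda row: row[1]
global_2 = dedup(rows, customer)
per_day_2 = dedup_per_partition(rows, day, customer)

print(per_day_1 == global_1)  # partitions are enough in case 1
print(per_day_2 == global_2)  # cross-day duplicates survive in case 2
```

In the second case the per-day run leaves customer "a" in both days, so a partitioned job would still need a global pass to remove cross-partition duplicates.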

    Thanks,
