How to sync a partitioned table to a single table efficiently

Zhengxin
Zhengxin Partner, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered Posts: 9 Partner

Hi,

My workflow generates a partitioned dataset with over 10,000 partitions, each containing just one row of data. I am currently syncing this dataset to a single table, but it is very inefficient: syncing the 10,000 partitions takes over an hour. Is there a way to run this more efficiently, or a better way to accomplish it?

[Screenshot: Joshuazzx_0-1588199062316.png]

I am running Dataiku 7.0 on a VM set up in Azure.

Thanks in advance

Answers

  • Jediv
    Jediv Dataiker Posts: 17 Dataiker

    Is there any way to not generate a table with one row per partition? That is indeed extremely inefficient.

  • Zhengxin

    This partitioned dataset was generated by a Python recipe, where each partition ID was fed into the recipe using dataiku.dku_flow_variables["DKU_DST_ID"] from this dataset.

    Any advice on how to avoid this partitioning?

    Even if there were more than one row per partition, syncing 10,000 partitions to a single table would still be inefficient. Is this statement correct?

    Thanks a lot

  • Jediv
    Jediv Dataiker Posts: 17 Dataiker
    edited July 17

    Hi,

    Your statement is definitely correct. I'm not sure I fully understand the issue, but I'll go over some concepts around partitioning, and hopefully that helps. Let's say I have a set of file-based datasets in the following layout: a base directory, then multiple subdirectories, each with a data file inside:

    ~ tree partition_ex/
    partition_ex/
    ├── a
    │   └── part1.csv
    ├── b
    │   └── part1.csv
    └── c
        └── part1.csv

    I can treat this a couple of different ways in Dataiku. If I connect to the directory partition_ex as a server filesystem dataset, I get a view like this:

    [Screenshot: Pasted_Image_5_1_20__6_52_PM.png]

    I can then choose to partition this dataset by clicking the partitioned tab and setting this:

    [Screenshot: Pasted_Image_5_1_20__6_55_PM.png]

    Then my dataset would have 3 partitions, and I would take those into account when working with it in Python. However, I could also just turn off partitioning, in which case my dataset would combine ALL the files in the subdirectories into a single dataset.

    This is the better option if the number of records in each partition is small and the number of partitions is very high.
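
    To make the "combine all the files" idea concrete, here is a minimal plain-Python sketch of reading every CSV under the base directory in one pass. It is only an illustration of the concept, not how Dataiku implements unpartitioned datasets. The function name read_all_partitions and the added "partition" key are hypothetical, and it assumes every file has a header row.

```python
import csv
from pathlib import Path


def read_all_partitions(base_dir):
    """Read every CSV found at base_dir/<subdirectory>/*.csv into one list.

    Each row is returned as a dict, with an extra (hypothetical) 'partition'
    key that records which subdirectory the row came from.
    """
    rows = []
    # Sort the paths so the result order is deterministic.
    for csv_path in sorted(Path(base_dir).glob("*/*.csv")):
        partition_id = csv_path.parent.name
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                row["partition"] = partition_id
                rows.append(row)
    return rows
```

    Calling read_all_partitions("partition_ex") on the layout above would return the rows from a, b and c together in a single pass, instead of processing one partition at a time.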

    Specifically for your case, instead of syncing your partitioned dataset to an unpartitioned dataset, simply create a new unpartitioned dataset pointing to the same location as the partitioned one.
