Missing ID in partitioned group by

suard_raphaelle Registered Posts: 2 ✭✭✭✭

Hi,

I've got a partitioned dataset with IDs in one column.

The dataset registers some transactions: it may well be that some IDs do not appear in all the transactions.

I then want to group this dataset and sum the transaction column.

Could you please confirm that when I do so, I'm not going to "lose" any of the IDs along the way?

For instance, let's take the example below

partition 1

ID| Transaction
1| 100
2| 200

partition 2

ID| Transaction
2| 300

Is the result the following table?

ID| Sum(transaction)
1| 100
2| 500

The reason I'm asking is that when I sample the partitions to take a look at them, I always take the first records. However, some of the IDs that I can see in some partitions are missing from the output table (e.g. the output of the Group recipe). So I'm a bit worried I might have lost some data along the way.

Thank you very much for your help!

Best regards,

Best Answer

    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭

    Hello,

    All visual recipes in DSS operate on the full data, i.e. without sampling. The sampling we apply when you visualize a dataset or work on a visual "Prepare" recipe is only there so that you can prototype and understand your data quickly. When you actually run a recipe, it is applied to the full data.
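To illustrate why a preview of the first records can hide IDs that the recipe still processes, here is a minimal sketch in plain Python. It does not use any DSS API; the rows and the sample size of 2 are made up for illustration.

```python
from itertools import islice

# Hypothetical rows (ID, transaction); not taken from any real dataset.
rows = [(1, 100), (2, 200), (2, 300), (3, 50)]

# A "first records" sample, similar in spirit to previewing the first N rows.
sample_ids = {row_id for row_id, _ in islice(rows, 2)}

# The full data, which is what a recipe run operates on.
all_ids = {row_id for row_id, _ in rows}

# IDs present in the full data but invisible in the sample.
missing_from_sample = all_ids - sample_ids
print(sorted(missing_from_sample))
```

An ID missing from a sample is therefore not evidence that it was dropped; the reverse comparison has to be made against the full input as well.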

    Hence, if you create a "Group" recipe, taking the sum of "transactions" by "ID", it will perform what you want. From your example, I understand that you want the output of this group recipe to be "non-partitioned". Make sure you select this option when creating the output dataset.

    Then, make sure the partition dependency setting is "All available" in the Input/Output tab of your Group recipe.

    Note that if you wanted to compute the sum of transactions by ID separately on each partition, you could do that with a partitioned output and the "Equals" partition dependency setting in the same Input/Output tab. If you plan to work with partitions in DSS, I encourage you to read: https://doc.dataiku.com/dss/latest/concepts/index.html.
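The difference between the two settings can be sketched in plain Python using the example from the question. This only models the grouping logic and is not how DSS runs the recipe; `sum_by_id` is a hypothetical helper introduced for illustration.

```python
from collections import defaultdict

# Partitions from the question's example: name -> list of (ID, transaction).
partitions = {
    "partition 1": [(1, 100), (2, 200)],
    "partition 2": [(2, 300)],
}

def sum_by_id(rows):
    """Sum the transaction amounts for each ID (hypothetical helper)."""
    totals = defaultdict(int)
    for row_id, amount in rows:
        totals[row_id] += amount
    return dict(totals)

# Analogue of "All available" with a non-partitioned output:
# a single group-by over the rows of every partition.
all_rows = [row for rows in partitions.values() for row in rows]
overall = sum_by_id(all_rows)

# Analogue of "Equals" with a partitioned output:
# one independent group-by per partition.
per_partition = {name: sum_by_id(rows) for name, rows in partitions.items()}

print(overall)
print(per_partition)
```

In the first case ID 1 is kept with a total of 100 and ID 2 gets 500, matching the table in the question. In the second case each partition keeps its own totals, so ID 2 appears once per partition.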

    Hope it helps,

    Alex
