Missing ID in partitioned group by


Hi,

I've got a partitioned dataset with IDs in one column.

The dataset registers some transactions: it may well be that some IDs do not appear in all the transactions.

I then want to group this dataset and sum the transaction column.

Could you please confirm that when I do so, I'm not going to "lose" any of the IDs along the way?

For instance, let's take the example below

partition 1

ID | Transaction
1 | 100
2 | 200

partition 2

ID | Transaction
2 | 300

Is the result the following table?

ID | Sum(transaction)
1 | 100
2 | 500

The reason I'm asking is that when I sample the partitions to take a look at them, I always take the first records. However, some of the IDs that I can see in some partitions do not show up in the output table (i.e. the output of the Group recipe). So I'm a bit worried I might have lost some data along the way.

Thank you very much for your help!

Best regards,

Best Answer


    Hello,

    All visual recipes in DSS operate on the full data, i.e. without sampling. The sampling we apply when you explore a dataset or work on a visual "Prepare" recipe is only there so that you can prototype and understand your data quickly. When you actually run a recipe, it is applied to the full data.

    Hence, if you create a "Group" recipe, taking the sum of "transactions" by "ID", it will perform what you want. From your example, I understand that you want the output of this group recipe to be "non-partitioned". Make sure you select this option when creating the output dataset.

    Then, make sure the partition dependency setting is "All available" in the Input/Output tab of your Group recipe.

    Note that if you wanted to compute the sum of transactions by ID separately on each partition, you could do that with a partitioned output and the "Equals" partition dependency setting in the same Input/Output tab. If you plan to work with partitions in DSS, I encourage you to read: https://doc.dataiku.com/dss/latest/concepts/index.html.

    Hope it helps,

    Alex
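
To make the difference between the two settings concrete, here is a minimal sketch in pandas, outside DSS. It only mimics the logic on the example from the question; it is not how DSS executes a Group recipe, and the names `partitions`, `full`, `total_by_id` and `per_partition` are made up for the illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the two partitions in the question.
partitions = {
    "partition 1": pd.DataFrame({"ID": [1, 2], "Transaction": [100, 200]}),
    "partition 2": pd.DataFrame({"ID": [2], "Transaction": [300]}),
}

# Stack all partitions into one frame, keeping the partition name as a column.
full = pd.concat(partitions, names=["partition", "row"]).reset_index(level="partition")

# Analogue of "All available" with a non-partitioned output:
# one sum per ID computed over every partition.
total_by_id = full.groupby("ID")["Transaction"].sum()

# Analogue of "Equals" with a partitioned output:
# one sum per ID computed separately inside each partition.
per_partition = full.groupby(["partition", "ID"])["Transaction"].sum()

print(total_by_id.to_dict())
print(per_partition.to_dict())
```

In the first aggregation every ID present in any partition ends up in the result, which is the table expected in the question. A quick way to check a real output for "lost" IDs is the same idea: compare the set of distinct IDs in the full input with the set of IDs in the output, rather than relying on a sample of the first records.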
