Missing ID in partitioned group by
Hi,
I've got a partitioned dataset with IDs in one column.
The dataset registers some transactions: it may well be that some IDs do not appear in all the partitions.
I then want to group this dataset and sum the transaction column.
Could you please confirm that when I do so, I'm not going to "lose" any of the IDs along the way?
For instance, let's take the example below
partition 1
ID| Transaction
1| 100
2| 200
partition 2
ID| Transaction
2| 300
Is the result the following table?
ID| Sum(transaction)
1| 100
2| 500
The reason I'm asking is that when I sample the partitions to take a look at them, I always take the first records. However, some of the IDs that I can see in some partitions are missing from the output table (i.e. the output of the Group recipe). So I'm a bit worried I might have lost some data along the way.
Thank you very much for your help!
Best regards,
Best Answer
Hello,
All visual recipes in DSS operate on the full data, without sampling. The sampling we apply when you visualize a dataset or work on a visual "Prepare" recipe is only there so that you can prototype and understand your data quickly. When you actually run a recipe, it is applied to the full data.
Hence, if you create a "Group" recipe, taking the sum of "transactions" by "ID", it will perform what you want. From your example, I understand that you want the output of this group recipe to be "non-partitioned". Make sure you select this option when creating the output dataset.
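To make the expected result concrete, here is a minimal plain-Python sketch of that aggregation on the two partitions from the question. It only mimics the logic of grouping all partitions into one non-partitioned output; it does not use the DSS API, and the names `partitions`, `sum_by_id` and `global_totals` are made up for the illustration.

```python
from collections import defaultdict

# Toy stand-ins for the two partitions in the question
# (plain Python data, not DSS objects): lists of (ID, Transaction) rows.
partitions = {
    "partition 1": [(1, 100), (2, 200)],
    "partition 2": [(2, 300)],
}


def sum_by_id(rows):
    """Sum the transaction amounts per ID over the given (ID, amount) rows."""
    totals = defaultdict(int)
    for record_id, amount in rows:
        totals[record_id] += amount
    return dict(totals)


# Every row of every partition goes into a single aggregation,
# so an ID that appears in only one partition is still kept.
all_rows = [row for rows in partitions.values() for row in rows]
global_totals = sum_by_id(all_rows)
print(global_totals)  # {1: 100, 2: 500}
```

ID 1 only exists in partition 1, yet it still shows up in the result, which matches the table in the question.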
Then, make sure the partition dependency setting is "All available" in the Input/Output tab of your Group recipe.
Note that if you wanted to compute the sum of transactions by ID separately on each partition instead, you could do that with a partitioned output and the "Equals" partition dependency setting in the same Input/Output tab. If you plan to work with partitions in DSS, I encourage you to read: https://doc.dataiku.com/dss/latest/concepts/index.html.
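For comparison, the per-partition variant can be sketched the same way in plain Python. Again, this is only an illustration of the logic under the toy data from the question, not DSS code, and `partitions` and `per_partition_totals` are hypothetical names.

```python
from collections import defaultdict

# Same toy partitions as in the question: lists of (ID, Transaction) rows.
partitions = {
    "partition 1": [(1, 100), (2, 200)],
    "partition 2": [(2, 300)],
}

# Each output partition is computed from the matching input partition only,
# so the sums are never combined across partitions.
per_partition_totals = {}
for name, rows in partitions.items():
    totals = defaultdict(int)
    for record_id, amount in rows:
        totals[record_id] += amount
    per_partition_totals[name] = dict(totals)

print(per_partition_totals)
# {'partition 1': {1: 100, 2: 200}, 'partition 2': {2: 300}}
```

Here ID 2 keeps two separate sums (200 and 300) instead of a single total of 500, and ID 1 is simply absent from the second output partition.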
Hope it helps,
Alex
Answers
Great, thanks a lot for this prompt answer