Added on March 13, 2018 1:09AM
Hi,
I've got a partitioned dataset with IDs in one column.
The dataset registers some transactions: it may well be that some IDs do not appear in all the transactions.
I then want to group this dataset and sum the transaction column.
Could you please confirm that when I do so, I'm not going to "lose" any of the IDs along the way?
For instance, let's take the example below
partition 1
ID| Transaction
1| 100
2| 200
partition 2
ID| Transaction
2| 300
Is the result the following table?
ID| Sum(transaction)
1| 100
2| 500
The reason I'm asking is that when I sample the partitions to take a look at them, I always take the first records. However, some of the IDs that I can see in some partitions do not show up in the output table (i.e. the output of the Group recipe). So I'm a bit worried I might have lost some data along the way.
Thank you very much for your help!
Best regards,
Hello,
All visual recipes in DSS operate on the full data, so without sampling. The sampling we apply when you visualize a dataset or work on a visual "Prepare" recipe is only for you to be able to prototype and understand your data quickly. But when you actually run a recipe, it is applied to the full data.
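To illustrate why an ID can be absent from a preview while still being present in the data, here is a minimal sketch in plain Python. It does not use any DSS API; the rows and the `head_sample` helper are made up for the illustration, and the preview is assumed to show only the first records.

```python
# Hypothetical data: ten (ID, transaction) rows, not taken from the original post.
rows = [(i, 100 * i) for i in range(1, 11)]


def head_sample(records, n):
    """Return the first n records, mimicking a 'first records' preview."""
    return records[:n]


# A preview limited to the first 3 records only shows a subset of the IDs.
sample = head_sample(rows, 3)
ids_in_sample = {record[0] for record in sample}
ids_in_full = {record[0] for record in rows}

# IDs that exist in the full data but are not visible in the preview.
missing_from_preview = sorted(ids_in_full - ids_in_sample)
print(missing_from_preview)
```

An ID missing from such a preview has not been dropped; it simply falls outside the sampled records.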
Hence, if you create a "Group" recipe, taking the sum of "transactions" by "ID", it will perform what you want. From your example, I understand that you want the output of this group recipe to be "non-partitioned". Make sure you select this option when creating the output dataset.
Then, make sure the partition dependency setting is "All available" in the Input/Output tab of your Group recipe.
Note that if you wanted to compute the sum of transactions by ID separately on each partition instead, you could do that with a partitioned output and the "Equals" partition dependency setting in the same tab. If you plan to work with partitions in DSS, I encourage you to read: https://doc.dataiku.com/dss/latest/concepts/index.html.
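As a rough illustration of the difference between the two settings, here is a plain Python sketch using the data from your example. It is not DSS code: the function names are hypothetical, and it only mimics what a sum grouped by ID produces when all partitions are read together versus one partition at a time.

```python
from collections import defaultdict

# The two partitions from the question, as (ID, transaction) rows.
partitions = {
    "partition 1": [(1, 100), (2, 200)],
    "partition 2": [(2, 300)],
}


def sum_all_partitions(parts):
    """Sum transactions by ID across every partition (non-partitioned output)."""
    totals = defaultdict(int)
    for rows in parts.values():
        for id_, amount in rows:
            totals[id_] += amount
    return dict(totals)


def sum_per_partition(parts):
    """Sum transactions by ID separately inside each partition (partitioned output)."""
    result = {}
    for name, rows in parts.items():
        totals = defaultdict(int)
        for id_, amount in rows:
            totals[id_] += amount
        result[name] = dict(totals)
    return result


overall = sum_all_partitions(partitions)
per_partition = sum_per_partition(partitions)
print(overall)
print(per_partition)
```

In the first case every ID that appears in at least one partition ends up in the result, which matches the table you expected (ID 1 with 100, ID 2 with 500).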
Hope it helps,
Alex