I have a dataset with some IDs and dates. I would like to keep only one line for an ID and keep the most recent date corresponding to this ID. For exemple (screenshot "dataset_init"), for the ID "1020731", the good line I want to keep is the number 2 with "last_interaction_date_parsed" = "2020-07-01T23:10:21.000Z".
I have tested two techniques :
1. GROUP BY recipe with "id_correspondant" as group key. In "Per field aggregation", I clicked on "Last" corresponding to column "last_interaction_date_parsed" then on the settings of this "Last" metric I clicked on "First/Last ordered by: last_interaction_date_parsed". I suppose that this technique works pretty well and that the results is the same for each engine (DSS, Spark, Hive), that's what I see in the results when consulting 10 exemples of IDs.
BUT, I was told that the recipe Group By performs in a exogenous way when dealing with ranking and thus it highly depends on the engine we use, ongoing parallels computations, datasets types etc. So it is only "by chance" that the good line with most recent dates was retrieve in the result dataset. So I am now sceptical because it seems to me that using the setting "First/Last ordered by: last_interaction_date_parsed" is rather explicit and is the good way of doing what I want to do, otherwise why do we have this option, right ?
Anyway, I have another technique which works everytime :
2. TOP N recipe (screenshot "topn_recipe") :
- retrieve top rows
- order by "last_interaction_date_parsed" descending
- from "each group of rows identified by "id_correspondant"
- in addition, copute for each row "count of rows in its group"
Can you tell me what is the best technique and why ?
Also, why is the GROUP BY recipe not recommended for my current problem ?
Thank you for you help
Hi @alicetheyoung ,
After checking with the R&D team, we would recommend using GROUP BY, either :
So your first suggestion should work with all the engines