Are there any tips or tricks beyond what's covered in the training for using group recipes?
I'm having problems getting the correct min and max values calculated using group recipes. I can confirm that sometimes they calculate the correct values and sometimes they don't.
This is probably user error, but the dataset is too large to work with in full, so I have to use a sample, which makes it difficult or impossible to trace the error back to a previous recipe step.
I'm not sure how to help you without a more detailed example of the cases where you find wrong values. Could you share a more concrete example?
Also, are you using partitioned datasets? Just from experience, I've had some grouping issues when partitioned datasets weren't used correctly.
I am using a dataset imported from Snowflake by someone else who is not reachable at the moment. Is there a way I can tell if the dataset is partitioned?
I have data that looks like this:
When I group to get the min, I get results that look like this...
When I group to get the max, I get results like this...
For B, it works perfectly. But for A, it gives me the same value in both situations and the value it gives me is neither the min nor the max.
The value is calculated in a formula in the prepare recipe that immediately precedes the groupings.
The easiest way to check if a dataset is partitioned is by looking at its icon in the flow:
As you can see it looks like a "set" of superimposed icons.
Could you share a screenshot of your group recipe? Like this example:
I'm mostly interested in the areas within the red boxes.
Thanks for the capture. I can't find a clear indication in your recipe of what might be happening. Following your previous example, maybe some of the IDs look similar but are actually different? For example, instead of only "A" values you might also have "A " (A followed by a space). In that case, though, you would see apparently repeated IDs (or repeated 'CUSTOMER_VEHICLE_VRM' values) in the output.
Just to rule out one last issue, could you share, from the same recipe, what you have in the "Input/output" and "Advanced" tabs, like in these examples:
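In case it helps to test the look-alike ID idea outside the recipe, here is a minimal sketch in plain Python, assuming you can pull the ID column of a sample into a list of strings. The function name and the sample values are made up for illustration and are not part of Dataiku.

```python
from collections import defaultdict

def find_lookalike_keys(keys):
    """Return stripped keys that appear with more than one raw spelling."""
    variants = defaultdict(set)
    for key in keys:
        # Keys that differ only by leading/trailing whitespace share a bucket.
        variants[key.strip()].add(key)
    return {norm: sorted(raw) for norm, raw in variants.items() if len(raw) > 1}

# Hypothetical sample of ID values; "A" appears with stray spaces.
sample_keys = ["A", "A ", "B", "B", " A"]
lookalikes = find_lookalike_keys(sample_keys)
print(lookalikes)
```

An empty result would suggest that stray whitespace is not the cause, at least within the sample you checked.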
Hopefully this extra info will get us somewhere!
Thanks for the screenshots. Sadly, I'm running out of ideas.
If you select both min and max to be calculated at the same time in the grouping recipe, do you still get the same min and max values?
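A quick way to cross-check the recipe output is to recompute both aggregates independently on a small sample. This is only a sketch in plain Python, assuming the sample rows are available as dictionaries; the field names "id" and "value" and the numbers are placeholders, not your real columns or data.

```python
from collections import defaultdict

def min_max_by_group(rows, key_field, value_field):
    """Compute (min, max) of value_field for each distinct key_field value."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key_field]].append(row[value_field])
    return {key: (min(values), max(values)) for key, values in grouped.items()}

# Placeholder rows standing in for a sample of the dataset.
rows = [
    {"id": "A", "value": 3},
    {"id": "A", "value": 7},
    {"id": "A", "value": 5},
    {"id": "B", "value": 2},
    {"id": "B", "value": 9},
]
expected = min_max_by_group(rows, "id", "value")
print(expected)
```

If the recipe's min and max for a group disagree with this kind of independent calculation on the same rows, that narrows the problem to the grouping step rather than the preceding formula.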
Is there any subset of the data that you could share publicly, so that we can try to replicate the problem?
Hope we can find the problem!
Yes, I've replicated the problem doing min and max in the same recipe and in separate recipes. I started with one recipe and split them out when I encountered this problem to see if that would help (and of course it didn't).
Unfortunately, I can't share the dataset at present. I'd have to mask the customer identifiers it contains to do that.
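For what it's worth, the masking could probably be done with a keyed hash, so that equal identifiers still map to the same token and the grouping behaviour is preserved. A rough sketch in plain Python, where the key and the identifier values are invented for illustration:

```python
import hashlib
import hmac

def mask_identifier(value, secret_key):
    """Replace an identifier with a stable, keyed 16-character hex token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

# Hypothetical key and identifiers; the real key would have to stay private.
secret = b"replace-with-a-private-key"
identifiers = ["AB12 CDE", "AB12 CDE", "XY34 ZZZ"]
masked = [mask_identifier(value, secret) for value in identifiers]
print(masked)
```

Whether this is enough to make the data shareable depends on the other columns, so I would still need to check that before posting anything.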
Thank you for your help! I've opened a ticket with Dataiku support.