Group Recipe - Min / Max calculation problem

dfwmike
Level 2

Are there any tips or tricks beyond what's covered in the training for using group recipes?

I'm having trouble getting group recipes to calculate the correct min and max values. Sometimes they produce the correct values and sometimes they don't.

This is probably user error, but the dataset is large enough that I have to work with a sample, which makes it difficult or impossible to trace the error back to a previous recipe step.

Ignacio_Toledo

Hi @dfwmike,

I'm not sure how to help you without a more detailed example of the cases where you find wrong values. Could you share a more concrete example?

Also, are you using partitioned datasets? Just from experience, I've had some group issues when not using the partitioned datasets in the right way.

dfwmike
Level 2
Author

Thanks, @Ignacio_Toledo!

I am using a dataset imported from Snowflake by someone else who is not reachable at the moment. Is there a way I can tell if the dataset is partitioned?

I have data that looks like this:

ID    Value
A     10
A     11
B     9
A     15
B     8

When I group to get the min, I get results that look like this...

ID    min_Value
A     11
B     8

When I group to get the max, I get results like this...

ID    max_Value
A     11
B     9

For B it works perfectly, but for A it gives me the same value in both cases, and that value is neither the min nor the max.

The value is calculated in a formula in the prepare recipe that immediately precedes the groupings.
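For illustration, here's a minimal sketch (plain pandas, outside of Dataiku) of what I expect the grouping to return on the sample above:

    import pandas as pd

    # Sample data from the tables above
    df = pd.DataFrame({
        "ID":    ["A", "A", "B", "A", "B"],
        "Value": [10, 11, 9, 15, 8],
    })

    # Grouped min and max per ID -- this is what I expect the group recipe to produce
    print(df.groupby("ID")["Value"].agg(["min", "max"]))
    #     min  max
    # ID
    # A    10   15
    # B     8    9

In the group recipe, though, A comes back as 11 for both the min and the max.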


Ignacio_Toledo

Hi @dfwmike,

The easiest way to check whether a dataset is partitioned is to look at its icon in the flow:

[Image: partitioned_dataset.png]

As you can see, it looks like a "set" of superimposed icons.

Could you share a screenshot of your group recipe? Like this example:

[Image: group_recipe.png]

I'm mostly interested in the areas within the red boxes.

Cheers!

dfwmike
Level 2
Author

@Ignacio_Toledo 

Thanks again! The dataset is not partitioned.

 

Attached is a screenshot of the group recipe where I'm calculating the ADR max by CUSTOMER_VEHICLE_VRM.

 
 
Ignacio_Toledo

Thanks for the capture. From your recipe I can't see a clear indication of what might be happening. Maybe, following your earlier example, some of the IDs look similar but are actually different? For instance, instead of just "A" values you might also have "A " ("A" plus a trailing space). In that case, though, you would be seeing apparently repeated IDs (or repeated 'CUSTOMER_VEHICLE_VRM' values).
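If you can open a Python notebook on that dataset, here is a small sketch of that check (the dataset name below is a placeholder; the column name is the one from your post):

    import dataiku

    # Placeholder name -- use the group recipe's actual input dataset
    df = dataiku.Dataset("group_recipe_input").get_dataframe()

    # If stripping whitespace reduces the number of distinct keys, then some
    # VRMs differ only by leading/trailing spaces
    raw = df["CUSTOMER_VEHICLE_VRM"].nunique()
    stripped = df["CUSTOMER_VEHICLE_VRM"].str.strip().nunique()
    print(raw, stripped)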

Just to rule out one last possibility, could you share, from the same recipe, what you have in the "Input/output" and "Advanced" tabs, like in these examples:

[Image: group_recipe_inout.png]

[Image: group_recipe_adv.png]

Hopefully with this extra info we can figure something out!

dfwmike
Level 2
Author

I don't think the CUSTOMER_VEHICLE_VRM values are different because at an earlier point in the flow, I do a count by CUSTOMER_VEHICLE_VRM and I get the correct number.

Attached are screenshots from the requested tabs.

Thanks!

Ignacio_Toledo

Hi @dfwmike,

Thanks for the screenshots. Sadly, I'm running out of ideas.

If you select both min and max to be calculated at the same time in the grouping recipe, do you still get the same min and max values?
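Another cross-check you could try from a Python notebook (again a sketch; the dataset name and the VRM value are placeholders, the column names are the ones from your post): compute the min and max for one suspect key directly and compare them with the recipe output.

    import dataiku

    # Placeholders -- substitute the real input dataset and a VRM whose
    # grouped min/max look wrong
    df = dataiku.Dataset("group_recipe_input").get_dataframe()
    suspect = df[df["CUSTOMER_VEHICLE_VRM"] == "SOME_VRM"]

    print(suspect["ADR"].min(), suspect["ADR"].max())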

Is there any subset of the data that you could share publicly, so we can try to replicate the problem?

Hope we can find the problem!

dfwmike
Level 2
Author

Yes, I've replicated the problem with min and max both in the same recipe and in separate recipes. I started with one recipe and split them out when I encountered this problem, to see if that would help (and of course it didn't).

Unfortunately, I can't share the dataset at present. I'd have to mask the customer identifiers it contains to do that.

Thank you for your help! I've opened a ticket with Dataiku support.

Mike

Ignacio_Toledo

My next suggestion was going to be opening a support ticket, so great!

Let us know what they find; it will be interesting to learn the cause.

Good luck!

EliasH
Dataiker

The issue for @dfwmike has been resolved. Thank you, @Ignacio_Toledo, for your assistance!

Jurre
Level 5

Hi @EliasH and @dfwmike, would it be possible to expand a little on the nature of the problem and its final solution? I use this functionality a lot, so I'm quite interested in how reliable it is (I haven't encountered any issues with it to date).