
Non-idempotent problem with variable expansion in the recipe formula language

haoxian
Level 2

Hello, 

I am trying to use variable expansion in formulas to rename the values of my cluster labels column and to do some date matching.

1. The same formula works in the "group" recipe, but not in a formula of the "prepare" recipe.

2. The three ways of accessing variables in a recipe formula are not idempotent:

${variable_name}, variables.variable_name, variable_name

give different results.

 

3. The Spark engine outputs an empty column, while the local stream engine gives the correct output.

Do you have any suggestions or solutions for this? Thank you very much.

PS: I have to use Spark (due to the data volume) and formulas in recipes.

11 Replies
fchataigner2
Dataiker

Hi,

1) Can you share the formula, and where/how it's used in the various recipes?

2) What are the different results? Note that using ${...} means the value is substituted directly into the formula text, so it happens before evaluation (as opposed to the other two).

3) Is this with the formula from 1), or an unrelated recipe?
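
To illustrate the point about ${...}, here is a minimal Python sketch of text substitution happening before evaluation. It is not DSS code: the `expand` helper and the `project_variables` dictionary are hypothetical, and rendering the value as compact JSON is an assumption made for the example.

```python
import json
import re

# Hypothetical stand-in for project variables; not a DSS API.
project_variables = {"trimesters_to_analyse": [2, 3]}


def expand(formula_text, variables):
    """Replace each ${name} with the JSON text of the variable, before any evaluation."""
    def substitute(match):
        # Assumption: the value is pasted as compact JSON text.
        return json.dumps(variables[match.group(1)], separators=(",", ":"))

    return re.sub(r"\$\{(\w+)\}", substitute, formula_text)


expanded = expand("arrayContains(${trimesters_to_analyse}, trimester)", project_variables)
print(expanded)  # arrayContains([2,3], trimester)
```

In this picture the formula evaluator only ever sees the expanded text, whereas `variables.variable_name` is resolved while the formula is being evaluated.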

haoxian
Level 2
Author

Hello, thank you for your quick response.

Example 1: 

Here is the variable:

{
    "cluster_model_1_naming_mapping": [
        {"cluster": "cluster_outliers", "new_name": "HCA-BF-HP-BN"},
        {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
        {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
        {"cluster": "cluster_2", "new_name": "BCA-BF-BP-BN"},
        {"cluster": "cluster_3", "new_name": "BCA-HF-BP-BN"},
        {"cluster": "cluster_4", "new_name": "HCA-HF-BP-HN"},
        {"cluster": "cluster_5", "new_name": "MCA-BF-BP-BN"}
    ]
}

The formula used is:

filter(variables.cluster_model_1_naming_mapping, item, item["cluster"] == cluster_labels)[0]["new_name"]

Here cluster_labels is the output column of a KMeans model, with values such as "cluster_1", "cluster_2", and so on.

With this formula, the local stream engine works but Spark returns nothing.
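
For readers unfamiliar with the formula above, a rough Python equivalent of the lookup might look like this. The `rename_cluster` function name is hypothetical, the mapping is shortened to three of the entries above, and returning `None` when nothing matches is a choice made for this sketch, not necessarily what the formula engine does.

```python
# Shortened copy of the mapping variable shown above.
cluster_model_1_naming_mapping = [
    {"cluster": "cluster_outliers", "new_name": "HCA-BF-HP-BN"},
    {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
    {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
]


def rename_cluster(cluster_label, mapping):
    """Keep the entries whose "cluster" equals the label, then take the first one's "new_name"."""
    matches = [item for item in mapping if item["cluster"] == cluster_label]
    # Sketch-only choice: return None instead of failing when there is no match.
    return matches[0]["new_name"] if matches else None


print(rename_cluster("cluster_1", cluster_model_1_naming_mapping))  # BCA-BF-HP-BN
```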

 

Example 2: 

The formula in the group recipe is:

(arrayContains(${precedent_years}, val('date_creation_order_year'))) 
&& (arrayContains(${trimesters_to_analyse}, trimester))

with variables

{
    "precedent_year": [2018, 2019], 
    "trimesters_to_analyse": [2, 3]
}

This works well as a pre-filter formula in the group recipe, but not as a formula in the prepare recipe.

I simply want to keep the rows whose year and trimester both fall within the given ranges.
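
The intended filter can be sketched in plain Python as below. The sample rows and the `allowed_years`, `allowed_trimesters` and `keep_row` names are made up for illustration; only the two lists of values come from the variables above.

```python
# Values taken from the variables above; the Python names are hypothetical.
allowed_years = [2018, 2019]
allowed_trimesters = [2, 3]


def keep_row(row):
    """Mirror the two arrayContains checks joined by &&."""
    return (row["date_creation_order_year"] in allowed_years
            and row["trimester"] in allowed_trimesters)


# Made-up sample rows, only to exercise the condition.
rows = [
    {"date_creation_order_year": 2018, "trimester": 2},
    {"date_creation_order_year": 2018, "trimester": 4},
    {"date_creation_order_year": 2020, "trimester": 3},
]

kept = [row for row in rows if keep_row(row)]
print(kept)  # [{'date_creation_order_year': 2018, 'trimester': 2}]
```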

I think this may be caused by the non-idempotent behavior of variable expansion when retrieving values.

 

Thank you

fchataigner2
Dataiker

For the Spark issue: indeed, in Spark, variables are not available via the `variables` object. You need to use `parseJson(${cluster_model_1_naming_mapping})` instead.

The second issue is more puzzling. Can you show the step of the Prepare recipe where you use the formula?

haoxian
Level 2
Author

For the Spark issue, the formula editor gives me this error:

Formula is invalid : Incorrect formula: 'filter(parseJson([{"cluster":"cluster_outliers","new_name":"HCA-BF-HP-BN"},{"cluster":"cluster_0","new_name":"BCA-BF-BP-HN"},{"cluster":"cluster_1","new_name":"BCA-BF-HP-BN"},{"cluster":"cluster_2","new_name":"BCA-BF-BP-BN"},{"cluster":"cluster_3","new_name":"BCA-HF-BP-BN"},{"cluster":"cluster_4","new_name":"HCA-HF-BP-HN"},{"cluster":"cluster_5","new_name":"MCA-BF-BP-BN"}]), item, item["cluster"] == cluster_labels)[0]["new_name"]' : Missing number, string, identifier, regex, or parenthesized expression(Parsing error at offset 18)

(Sorry, I don't have time for the second one right now; please allow me to cover it in a later post.)

fchataigner2
Dataiker

Apologies, I lost the quotes when copying: it should be `parseJson('${cluster_model_1_naming_mapping}')`.
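
A small Python sketch of why the quotes matter, assuming the variable is pasted into the formula as compact JSON text (as the error message above suggests). The string handling only imitates the expansion; it is not how DSS is implemented, and reading the reported offset as 0-based is an assumption.

```python
import json

# One entry of the mapping variable, enough to show the effect.
mapping = [{"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"}]
json_text = json.dumps(mapping, separators=(",", ":"))

# Remainder of the formula from the thread, kept as plain text.
suffix = ', item, item["cluster"] == cluster_labels)[0]["new_name"]'

# Without quotes, the raw JSON is pasted straight into the formula text.
unquoted = "filter(parseJson(" + json_text + ")" + suffix
# With quotes, the same JSON arrives as a string literal for parseJson to parse.
quoted = "filter(parseJson('" + json_text + "')" + suffix

# Position of the first "{" in the unquoted formula, counting from 0.
brace_offset = unquoted.index("{")
print(brace_offset)  # 18

# Parsing the text that sits between the quotes gives back the original list.
parsed = json.loads(json_text)
print(parsed == mapping)  # True
```

The offset 18 reported in the error lines up with the first `{` of the pasted value, which would be consistent with the formula parser not accepting a raw object at that position. Inside quotes, the same text is just a string until `parseJson` handles it.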

haoxian
Level 2
Author

Thank you, it worked this way.

In the end, I believe the second example is the same quoting problem.

Have a nice day!

haoxian
Level 2
Author

Hello. 

There is a problem with the formula again. I used the formula you suggested and it worked in the "prepare" recipe. This time, I used the same formula to create computed columns in a "join" recipe, and the parser failed to parse the filter function.

This is the formula:

filter(parseJson('${cluster_model_1_naming_mapping}'), item, item["cluster"] == before_cluster_labels)[0]["new_name"]

The error is shown in the picture bug1.png.

Thank you in advance for your help.

fchataigner2
Dataiker

Hi,

This is indeed a parse-time error, and the recipe will report that it is incorrectly set up, but the expression actually seems correct, so the recipe should work fine if you run it.

haoxian
Level 2
Author

Hi, I ran the formula but the same error appears (see bug2.png).

fchataigner2
Dataiker

Considering the operation you're doing (enriching a dataset with a fixed set of values), you should try putting the mapping in an Editable dataset and using a Join recipe to get the mapped value.
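
As a rough illustration of that suggestion, the sketch below does the same enrichment as a left join against a lookup table, in plain Python. The `mapping_rows` list stands in for the Editable dataset and `left_join_new_name` is a hypothetical helper; in DSS this would be a Join recipe rather than code.

```python
# Stand-in for the Editable dataset: one row per cluster (shortened).
mapping_rows = [
    {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
    {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
]


def left_join_new_name(rows, mapping_rows, key="cluster_labels"):
    """Add a "new_name" column to each row; unmatched rows get None, as in a left join."""
    lookup = {m["cluster"]: m["new_name"] for m in mapping_rows}
    return [{**row, "new_name": lookup.get(row[key])} for row in rows]


# Made-up input rows; "cluster_9" has no entry in the mapping.
data = [{"cluster_labels": "cluster_1"}, {"cluster_labels": "cluster_9"}]
joined = left_join_new_name(data, mapping_rows)
print(joined[0]["new_name"])  # BCA-BF-HP-BN
```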

If you absolutely need to use a Grouping recipe, can you check the version of DSS you are using?

haoxian
Level 2
Author

Thank you. I am using DSS 7.0.

The Editable dataset is a good idea. In fact, since the formula works in the prepare recipe, I could use it there too. The reason I tried this approach is to keep the flow compact. If I use an Editable dataset and need the variables in several places, it will clutter the flow and reduce maintainability.

Thank you very much. I guess I will have to use another solution.
I look forward to future improvements to this feature.
