Non-idempotent problem with variable expansion in the recipe formula language

haoxian
Level 2

Hello, 

I am trying to use variable expansion in formulas, both to rename the values in my cluster_labels column and to do some date matching operations.

1. The same formula works in a "group" recipe, but not in a formula of a "prepare" recipe.

2. The three ways of accessing variables in a recipe formula are not idempotent:

${variable_name}, variables.variable_name, variable_name

They give different results.

 

3. The Spark engine outputs an empty column, while the local stream engine gives the correct output.

Do you have any suggestions or a solution for this? Thank you very much.

PS: I have to use Spark (because of the data volume) and formulas in recipes.

fchataigner2
Dataiker

Hi,

1) Can you share the formula, and where/how it's used in the various recipes?

2) What are the different results? Note that using ${...} means the value is substituted directly into the formula text, so it happens before evaluation (as opposed to the other two).

3) Is it with the formula from 1), or an unrelated recipe?
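To make the distinction in 2) concrete, here is a rough Python analogy (not DSS code, and not a description of how DSS is implemented). `string.Template` happens to use the same `${name}` syntax and, like the expansion described above, works on the text before anything is parsed. The variable `years` and the helpers `expand` and `lookup` are made up for illustration.

```python
import json
from string import Template

# Hypothetical project variables; the values are JSON, as in a variables editor.
project_variables = {"years": [2018, 2019]}


def expand(formula_text, variables):
    """Textual expansion: paste the JSON form of each variable into the formula text."""
    as_text = {name: json.dumps(value) for name, value in variables.items()}
    return Template(formula_text).substitute(as_text)


def lookup(name, variables):
    """Evaluation-time access: the name stays in the formula and is resolved later."""
    return variables[name]


# After expansion the parser only ever sees the pasted value, never the name.
expanded = expand("arrayContains(${years}, year)", project_variables)
print(expanded)

# With evaluation-time access the value arrives as an already-built object.
print(lookup("years", project_variables))
```

In the first case the formula parser has to make sense of whatever text was pasted in; in the second it receives a ready-made value. That is why the same variable can behave differently depending on how it is referenced.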

haoxian
Level 2
Author

Hello, thank you for your quick response.

Example 1: 

Here is the variable:

{
    "cluster_model_1_naming_mapping": [
        {"cluster": "cluster_outliers", "new_name": "HCA-BF-HP-BN"},
        {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
        {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
        {"cluster": "cluster_2", "new_name": "BCA-BF-BP-BN"},
        {"cluster": "cluster_3", "new_name": "BCA-HF-BP-BN"},
        {"cluster": "cluster_4", "new_name": "HCA-HF-BP-HN"},
        {"cluster": "cluster_5", "new_name": "MCA-BF-BP-BN"}
    ]
}

The formula used is 

filter(variables.cluster_model_1_naming_mapping, item, item["cluster"] == cluster_labels)[0]["new_name"]

where cluster_labels is the output column of a KMeans model, whose values are "cluster_1", "cluster_2", and so on.
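For readers less familiar with the formula language, the lookup that this formula is meant to perform can be sketched in Python. This is only an illustration of the intent; the helper name `new_name_for` is hypothetical, and returning `None` for an unknown label is an assumption, not necessarily what the formula engine does.

```python
# Mapping copied from the project variable shown above.
cluster_model_1_naming_mapping = [
    {"cluster": "cluster_outliers", "new_name": "HCA-BF-HP-BN"},
    {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
    {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
    {"cluster": "cluster_2", "new_name": "BCA-BF-BP-BN"},
    {"cluster": "cluster_3", "new_name": "BCA-HF-BP-BN"},
    {"cluster": "cluster_4", "new_name": "HCA-HF-BP-HN"},
    {"cluster": "cluster_5", "new_name": "MCA-BF-BP-BN"},
]


def new_name_for(cluster_label, mapping):
    """Keep the entries whose "cluster" equals the label, then take the first one."""
    matches = [item for item in mapping if item["cluster"] == cluster_label]
    # Assumption: an unknown label yields None instead of raising an error.
    return matches[0]["new_name"] if matches else None


print(new_name_for("cluster_1", cluster_model_1_naming_mapping))
```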

With this formula, the local engine gives the correct result, but Spark outputs nothing.

 

Example 2: 

For the formula in the group recipe:

(arrayContains(${precedent_years}, val('date_creation_order_year'))) 
&& (arrayContains(${trimesters_to_analyse}, trimester))

with variables

{
    "precedent_year": [2018, 2019], 
    "trimesters_to_analyse": [2, 3]
}

This works well as the pre-filter formula of the group recipe, but not as a formula in the prepare recipe.

I simply want to keep the rows whose year is in the given range and whose trimester is in the given range.

I suspect this is caused by the non-idempotent behavior of variable expansion.

 

Thank you

fchataigner2
Dataiker

For the Spark issue: indeed, in Spark, variables are not available via the `variables` object. You need to use `parseJson(${cluster_model_1_naming_mapping})` instead.

The second issue is more puzzling. Can you show the step of the Prepare recipe where you use the formula?

haoxian
Level 2
Author

For the Spark issue, the formula editor gives me this error:

Formula is invalid : Incorrect formula: 'filter(parseJson([{"cluster":"cluster_outliers","new_name":"HCA-BF-HP-BN"},{"cluster":"cluster_0","new_name":"BCA-BF-BP-HN"},{"cluster":"cluster_1","new_name":"BCA-BF-HP-BN"},{"cluster":"cluster_2","new_name":"BCA-BF-BP-BN"},{"cluster":"cluster_3","new_name":"BCA-HF-BP-BN"},{"cluster":"cluster_4","new_name":"HCA-HF-BP-HN"},{"cluster":"cluster_5","new_name":"MCA-BF-BP-BN"}]), item, item["cluster"] == cluster_labels)[0]["new_name"]' : Missing number, string, identifier, regex, or parenthesized expression(Parsing error at offset 18)

(Sorry, I don't have time for the second one at the moment; please allow me to cover it in a later post.)

fchataigner2
Dataiker

Apologies, I lost the quotes when copying: it should be `parseJson('${cluster_model_1_naming_mapping}')`
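Why the quotes matter can be sketched in Python. This is an analogy only, assuming the expansion is purely textual; it uses `string.Template` as a stand-in for `${...}` and a shortened copy of the mapping. Without quotes, the pasted JSON is handed to the formula parser as if it were formula code (the error above is reported at offset 18, right where the pasted JSON begins). With quotes, the same JSON sits inside a string literal that `parseJson` can decode.

```python
import json
from string import Template

# Shortened copy of the mapping, serialized compactly as in the error message.
mapping = [
    {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
    {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
]
variables = {
    "cluster_model_1_naming_mapping": json.dumps(mapping, separators=(",", ":"))
}

# Unquoted: raw JSON syntax ends up in the middle of the formula text.
unquoted = Template("parseJson(${cluster_model_1_naming_mapping})").substitute(variables)

# Quoted: the JSON becomes the content of a single-quoted string literal.
quoted = Template("parseJson('${cluster_model_1_naming_mapping}')").substitute(variables)

# Stand-in for what parseJson receives in the quoted case: the text between the quotes.
string_literal = quoted[len("parseJson('"):-len("')")]
decoded = json.loads(string_literal)

print(unquoted)
print(quoted)
print(decoded[1]["new_name"])
```

If the expansion is indeed purely textual, a variable value containing a single quote would presumably end the string literal early, so that case is worth checking before relying on this pattern.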

haoxian
Level 2
Author

Thank you, it works this way.

In the end, I believe the second example is the same quoting problem.

Have a nice day!

haoxian
Level 2
Author

Hello. 

There is a problem with the formula again. I used the formula you suggested and it worked in the "prepare" recipe. This time, I am using the same formula to create computed columns in a "join" recipe, and the parser fails to parse the filter function.

This is the formula:

filter(parseJson('${cluster_model_1_naming_mapping}'), item, item["cluster"] == before_cluster_labels)[0]["new_name"]

The error is shown in the picture bug1.png.

Thank you in advance for your help.

fchataigner2
Dataiker

Hi,

This is indeed a parse-time error, and the recipe will appear to be incorrectly set up, but the expression actually seems correct, so the recipe should work fine if you run it.

haoxian
Level 2
Author

Hi, I ran the formula but the same error appears (see bug2.png).

fchataigner2
Dataiker

Considering the operation you're doing (enriching a dataset with a fixed set of values), you should try putting the mapping in an Editable dataset and using a Join recipe to get the mapped value.
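As a minimal sketch of what that join computes, here is the same enrichment in plain Python. The rows, the `customer_id` column, and the `left_join` helper are hypothetical stand-ins for the two datasets and the Join recipe; unmatched rows getting `None` mirrors a left join and is an assumption about the desired behavior.

```python
# Hypothetical rows standing in for the Editable dataset holding the mapping.
mapping_rows = [
    {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
    {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
]

# Hypothetical rows standing in for the dataset to enrich.
data_rows = [
    {"customer_id": 1, "cluster_labels": "cluster_1"},
    {"customer_id": 2, "cluster_labels": "cluster_0"},
    {"customer_id": 3, "cluster_labels": "cluster_7"},
]


def left_join(rows, lookup_rows, left_key, right_key, value_column):
    """Add value_column to each row by matching left_key against right_key."""
    index = {row[right_key]: row[value_column] for row in lookup_rows}
    # Rows without a match keep all their columns and get None for the new one.
    return [{**row, value_column: index.get(row[left_key])} for row in rows]


joined = left_join(data_rows, mapping_rows, "cluster_labels", "cluster", "new_name")
for row in joined:
    print(row)
```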

If you absolutely need to use a Grouping recipe, can you check the version of DSS you are using?

haoxian
Level 2
Author

Thank you. I am using DSS 7.0.

The Editable dataset is a good idea. In fact, the formula worked in the prepare recipe, so I could use it there too. The reason I tried this approach was to keep the flow compact. If I use an Editable dataset and then need the variables in several places, it will clutter the flow and reduce maintainability.

Thank you very much. I guess I will have to use another solution.
I am looking forward to future improvements to this feature.
