Non idempotent problem with variable expansions in recipe formula language

haoxian · November 2020

Hello,

I am trying to use variable expansion in formulas to rename the values of my cluster_labels column and to do some date matching operations.

1. The same formula works in a "group" recipe, but not in a "prepare" recipe.

2. The three ways of accessing variables in a recipe formula are not idempotent.

${variable_name}, variables.variable_name, variable_name

have different results.

3. The Spark engine outputs an empty column while the local stream engine gives the correct output.

Do you have any suggestions or solutions for this? Thank you very much.

PS: I have to use Spark (due to the data volume) and formulas in recipes.

fchataigner2 · November 2020

Hi,

1) Can you share the formula, and where/how it's used in the various recipes?

2) What are the different results? Note that using ${...} means you replace the value directly in the formula text, so it happens before evaluation (as opposed to the other 2).

3) Is it with the formula of 1), or an unrelated recipe?
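To make the point in 2) concrete, here is a minimal Python sketch of the difference between textual expansion and a lookup at evaluation time. It is only an illustration, not how DSS is implemented: the `expand` helper is hypothetical, and the compact JSON rendering of the value is an assumption.

```python
import json
from string import Template

# Hypothetical project variables holding a simple list value.
variables = {"trimesters_to_analyse": [2, 3]}

def expand(formula_text: str, variables: dict) -> str:
    """Replace each ${name} in the formula TEXT with the JSON text of its value.

    Assumption: values are rendered as compact JSON. This happens before the
    formula is parsed, so the parser only ever sees the resulting text.
    """
    rendered = {
        name: json.dumps(value, separators=(",", ":"))
        for name, value in variables.items()
    }
    return Template(formula_text).substitute(rendered)

# ${...}: the formula text itself changes before evaluation.
expanded = expand("arrayContains(${trimesters_to_analyse}, trimester)", variables)
print(expanded)  # arrayContains([2,3], trimester)

# variables.name: the formula text is untouched; the value is fetched as an
# object while the formula is evaluated.
runtime_value = variables["trimesters_to_analyse"]
print(runtime_value)  # [2, 3]
```

With expansion, whatever text the variable renders to must be valid formula syntax at the place where it is pasted; with a runtime lookup there is no such constraint.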

haoxian · November 2020

Hello, thank you for your quick response.

Example 1:

Here is the variable:

{
    "cluster_model_1_naming_mapping": [
        {"cluster": "cluster_outliers", "new_name": "HCA-BF-HP-BN"},
        {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
        {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
        {"cluster": "cluster_2", "new_name": "BCA-BF-BP-BN"},
        {"cluster": "cluster_3", "new_name": "BCA-HF-BP-BN"},
        {"cluster": "cluster_4", "new_name": "HCA-HF-BP-HN"},
        {"cluster": "cluster_5", "new_name": "MCA-BF-BP-BN"}
    ]
}

The formula used is

filter(variables.cluster_model_1_naming_mapping, item, item["cluster"] == cluster_labels)[0]["new_name"]

Here cluster_labels is the output column of a KMeans model, whose values are "cluster_1", "cluster_2", and so on.

With this formula, the local engine works but Spark outputs nothing.
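For readers less familiar with the formula language, the lookup in Example 1 can be read as the following Python sketch. It mirrors only the logic of the formula (filter the mapping on `cluster`, take the first match, read `new_name`); the function name `new_name_for` is hypothetical.

```python
# The mapping from the variable shown above.
mapping = [
    {"cluster": "cluster_outliers", "new_name": "HCA-BF-HP-BN"},
    {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
    {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
    {"cluster": "cluster_2", "new_name": "BCA-BF-BP-BN"},
    {"cluster": "cluster_3", "new_name": "BCA-HF-BP-BN"},
    {"cluster": "cluster_4", "new_name": "HCA-HF-BP-HN"},
    {"cluster": "cluster_5", "new_name": "MCA-BF-BP-BN"},
]

def new_name_for(cluster_label: str, mapping: list) -> str:
    """Python equivalent of filter(mapping, item, item["cluster"] == label)[0]["new_name"]."""
    matches = [item for item in mapping if item["cluster"] == cluster_label]
    # In this sketch an unknown label raises IndexError; what the formula
    # language does in that case is not covered in this thread.
    return matches[0]["new_name"]

print(new_name_for("cluster_1", mapping))  # BCA-BF-HP-BN
```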

Example 2:

For the formula in the group recipe:

(arrayContains(${precedent_years}, val('date_creation_order_year'))) 
&& (arrayContains(${trimesters_to_analyse}, trimester))

with variables

{
    "precedent_year": [2018, 2019], 
    "trimesters_to_analyse": [2, 3]
}

This works well as a pre-filter formula in the group recipe, but not as a formula in the prepare recipe.

I simply want to keep the rows whose year and trimester are both in the given lists.

I think this may be caused by the non-idempotent way variable expansion retrieves values.

Thank you

fchataigner2 · November 2020

For the Spark issue, indeed in Spark variables are not available via the `variables` object. You need to use `parseJson(${cluster_model_1_naming_mapping})` instead.

The second issue is more puzzling. Can you show the step of the Prepare recipe where you use the formula?

haoxian · November 2020

For the Spark issue, the formula editor gives me this error:

Formula is invalid : Incorrect formula: 'filter(parseJson([{"cluster":"cluster_outliers","new_name":"HCA-BF-HP-BN"},{"cluster":"cluster_0","new_name":"BCA-BF-BP-HN"},{"cluster":"cluster_1","new_name":"BCA-BF-HP-BN"},{"cluster":"cluster_2","new_name":"BCA-BF-BP-BN"},{"cluster":"cluster_3","new_name":"BCA-HF-BP-BN"},{"cluster":"cluster_4","new_name":"HCA-HF-BP-HN"},{"cluster":"cluster_5","new_name":"MCA-BF-BP-BN"}]), item, item["cluster"] == cluster_labels)[0]["new_name"]' : Missing number, string, identifier, regex, or parenthesized expression(Parsing error at offset 18)

(Sorry, I don't have time for the second one at the moment; please allow me to do this in a later post.)

fchataigner2 · November 2020

Apologies, I lost the quotes when copying: it should be `parseJson('${cluster_model_1_naming_mapping}')`
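A short Python sketch of why the quotes matter. It assumes the variable is pasted into the formula as compact JSON text, as the error message above suggests; it does not reproduce the DSS parser. Without quotes, the character at offset 18 of the expanded formula is the first `{` of the raw JSON, which matches the offset in the reported parsing error. With single quotes, the same text becomes one string literal that a JSON parser can decode.

```python
import json

# One entry is enough to illustrate; the shape follows the variable in the thread.
mapping = [{"cluster": "cluster_outliers", "new_name": "HCA-BF-HP-BN"}]

# Assumption: the variable expands to compact JSON text.
json_text = json.dumps(mapping, separators=(",", ":"))

# What the formula text looks like after ${...} expansion, without and with quotes.
unquoted = "filter(parseJson(" + json_text + "), ...)"
quoted = "filter(parseJson('" + json_text + "'), ...)"

# Without quotes, offset 18 is the first "{" of the pasted JSON.
brace_offset = unquoted.index("{")
print(brace_offset)  # 18

# With quotes, the argument is a single-quoted string literal holding JSON text;
# decoding that text gives back the mapping.
literal = quoted[quoted.index("'") + 1 : quoted.rindex("'")]
decoded = json.loads(literal)
print(decoded == mapping)  # True
```

Note that this only works as long as the JSON text itself contains no single quote, since that would end the string literal early.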

haoxian · November 2020

Thank you, it worked this way.

In the end, I believe the second example is the same quoting problem.

Have a nice day!

haoxian · December 2020

Hello.

There is a problem with the formula again. I used the formula you suggested and it worked in the "prepare" recipe. This time, I used the same formula to create computed columns in a "join" recipe, and the parser failed to parse the filter function.

This is the formula

filter(parseJson('${cluster_model_1_naming_mapping}'), item, item["cluster"] == before_cluster_labels)[0]["new_name"]

The error is shown in the picture.

Thank you in advance for your help.

fchataigner2 · December 2020

Hi,

This is indeed a parse-time error, and the recipe will claim to be incorrectly set up, but the expression actually seems correct, so the recipe should work fine if you run it.

haoxian · December 2020

Hi, I ran the formula, but the same error appears.

fchataigner2 · December 2020

Considering the operation you're doing (enriching a dataset with a fixed set of values), you should try putting the mapping in an Editable dataset and doing a Join recipe to get the mapped value.

If you absolutely need to use a Grouping recipe, can you check the version of DSS you are using?
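As an illustration of the Editable dataset suggestion, the following Python sketch shows the result a left join on the mapping would produce. It uses plain lists of dicts rather than any DSS API, and the `id` column, the sample rows, and the `left_join` helper are hypothetical.

```python
# Rows of the mapping, as they might sit in an Editable dataset.
mapping_rows = [
    {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
    {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
]

# Hypothetical rows of the dataset to enrich.
data_rows = [
    {"id": 1, "cluster_labels": "cluster_1"},
    {"id": 2, "cluster_labels": "cluster_9"},  # no matching mapping row
]

def left_join(rows: list, mapping_rows: list) -> list:
    """Left join rows to the mapping on cluster_labels == cluster."""
    lookup = {m["cluster"]: m["new_name"] for m in mapping_rows}
    # Unmatched rows are kept, with new_name set to None.
    return [{**row, "new_name": lookup.get(row["cluster_labels"])} for row in rows]

joined = left_join(data_rows, mapping_rows)
print(joined[0]["new_name"])  # BCA-BF-HP-BN
print(joined[1]["new_name"])  # None
```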

haoxian · December 2020

Thank you. I am using DSS 7.0.

The Editable dataset is a good idea. In fact, since the formula worked in the prepare recipe, I could use it there too. The reason I tried to use variables was to keep the flow compact. If I use an Editable dataset and then need the mapping in several places, it will clutter the flow and reduce maintainability.

Thank you very much. I guess that I will have to use another solution.
I am looking forward to your future improvements to this feature.
