Non idempotent problem with variable expansions in recipe formula language

haoxian · November 2020

Hello,

I am trying to use variable expansion to rename my clusters_label column and some date matching operations with formula.

1. A same formula code works with the "group" recipe, but not formula in "prepare" recipe.

2. The third way of accessing variables in formula of recipe are not idempotent.

${variable_name}, variables.variable_name, variable_name

have different results.

3. Spark engine outputs empty column while local stream has the correct output.

Do you have any suggestion/solution on this? Thank you very much.

PS: I have to use Spark(due to the data volume) and formula in recipe.

fchataigner2 · November 2020

Hi,

1) Can you share the formula, and where/how it's used in the various recipes

2) what are the different results? Note that using ${...} means you replace the value directly in the formula text, so it happens before evaluation (as opposed to the other 2)

3) is it with the formula of 1) ? or an unrelated recipe?

haoxian · November 2020

Hello, Thank you for you quick response.

Example 1:

Hereby the variable

{
    "cluster_model_1_naming_mapping": [
        {"cluster": "cluster_outliers", "new_name": "HCA-BF-HP-BN"},
        {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
        {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
        {"cluster": "cluster_2", "new_name": "BCA-BF-BP-BN"},
        {"cluster": "cluster_3", "new_name": "BCA-HF-BP-BN"},
        {"cluster": "cluster_4", "new_name": "HCA-HF-BP-HN"},
        {"cluster": "cluster_5", "new_name": "MCA-BF-BP-BN"}
    ]
}

The formula used is

filter(variables.cluster_model_1_naming_mapping, item, item["cluster"] == cluster_labels)[0]["new_name"]

Where cluster_labels is the output column of a KMeans model, whose values are "cluster_1", "cluster_2" .. and so on.

With this formula, the normal engine works but Spark gives nothing.

Example 2:

For formula in group recipe

(arrayContains(${precedent_years}, val('date_creation_order_year'))) 
&& (arrayContains(${trimesters_to_analyse}, trimester))

with variables

{
    "precedent_year": [2018, 2019], 
    "trimesters_to_analyse": [2, 3]
}

This works well in group recipe pre-filter formula but not for the prepare recipe formula.

Simply I want to filter the lines with the correct year in the range and the correct range of trimester.

I am thinking that this may be because of the non-idempotent problem of retrieving values of variables expansion.

Thank you

fchataigner2 · November 2020

for the Spark issue, indeed in Spark variables are not available via the `variables` object. You need to use `parseJson(${cluster_model_1_naming_mapping})` instead

The second issue is more puzzling. Can you show the step of the Prepare recipe where you use the formula?

haoxian · November 2020

For Spark issue, the editor of formula gives me this error:

Formula is invalid : Incorrect formula: 'filter(parseJson([{"cluster":"cluster_outliers","new_name":"HCA-BF-HP-BN"},{"cluster":"cluster_0","new_name":"BCA-BF-BP-HN"},{"cluster":"cluster_1","new_name":"BCA-BF-HP-BN"},{"cluster":"cluster_2","new_name":"BCA-BF-BP-BN"},{"cluster":"cluster_3","new_name":"BCA-HF-BP-BN"},{"cluster":"cluster_4","new_name":"HCA-HF-BP-HN"},{"cluster":"cluster_5","new_name":"MCA-BF-BP-BN"}]), item, item["cluster"] == cluster_labels)[0]["new_name"]' : Missing number, string, identifier, regex, or parenthesized expression(Parsing error at offset 18)

(Sorry I don't have time at this moment for the second one, please allow me to do this in later post.)

fchataigner2 · November 2020

apologies, I lost the quotes when copying: it should be `parseJson('${cluster_model_1_naming_mapping}')`

haoxian · November 2020

Thank you, it worked this way.

Finally I believe that the second example is the same problem of the quote.

Have a nice day!

haoxian · December 2020

Hello.

There is a problem with formula again. I used what you suggested as formula and it worked in the "prepare" dataset recipe. This time, I use the same formula in create computed colunms in a "joined recipe" and the parser failed to parse the filter function.

This is the formula

filter(parseJson('${cluster_model_1_naming_mapping}'), item, item["cluster"] == before_cluster_labels)[0]["new_name"]

The error is showed as in the picture

Thank you in advance for your help.

fchataigner2 · December 2020

Hi,

this is indeed a parse-time error, and the recipe will pretend to be incorrectly setup, but the expression seems actually correct so the recipe should be working fine if you run it

haoxian · December 2020

Hi I ran the formula but the same error appears.

fchataigner2 · December 2020

considering the operation you're doing (enriching a dataset with a fixed set of values), you should try putting the mapping in an Editable dataset and doing a Join recipe to get the mapped value.

If you absolutely need to use a Grouping recipe, can you check the version of DSS you are using?

haoxian · December 2020

Thank you. I am using the DSS 7.0.

It's a good idea with editable dataset. In fact, the formula worked in the prepare recipe, I could use the formula in prepare recipe too. The reason why I try to use this, it's to reduce the the shape of the flow. If I use the Editable dataset, once I need to use the variables in several places, it will ruin the shape of the flow and reduce the maintenanablity.

Thank you very much. I guess that I will have to use another solution.
I am looking forward to your future improvement on this function.

Non idempotent problem with variable expansions in recipe formula language

Answers

Categories

Setup Info

Tags