Non idempotent problem with variable expansions in recipe formula language

Registered Posts: 13 ✭✭✭
edited July 2024 in Using Dataiku

Hello,

I am trying to use variable expansion to rename my clusters_label column and some date matching operations with formula.

1. A same formula code works with the "group" recipe, but not formula in "prepare" recipe.

2. The third way of accessing variables in formula of recipe are not idempotent.

${variable_name}, variables.variable_name, variable_name

have different results.

3. Spark engine outputs empty column while local stream has the correct output.

Do you have any suggestion/solution on this? Thank you very much.

PS: I have to use Spark(due to the data volume) and formula in recipe.

Welcome!

It looks like you're new here. Sign in or register to get started.

Answers

  • Dataiker Posts: 355 Dataiker

    Hi,

    1) Can you share the formula, and where/how it's used in the various recipes

    2) what are the different results? Note that using ${...} means you replace the value directly in the formula text, so it happens before evaluation (as opposed to the other 2)

    3) is it with the formula of 1) ? or an unrelated recipe?

  • Registered Posts: 13 ✭✭✭
    edited July 2024

    Hello, Thank you for you quick response.

    Example 1:

    Hereby the variable

    {
        "cluster_model_1_naming_mapping": [
            {"cluster": "cluster_outliers", "new_name": "HCA-BF-HP-BN"},
            {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
            {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
            {"cluster": "cluster_2", "new_name": "BCA-BF-BP-BN"},
            {"cluster": "cluster_3", "new_name": "BCA-HF-BP-BN"},
            {"cluster": "cluster_4", "new_name": "HCA-HF-BP-HN"},
            {"cluster": "cluster_5", "new_name": "MCA-BF-BP-BN"}
        ]
    }

    The formula used is

    filter(variables.cluster_model_1_naming_mapping, item, item["cluster"] == cluster_labels)[0]["new_name"]

    Where cluster_labels is the output column of a KMeans model, whose values are "cluster_1", "cluster_2" .. and so on.

    With this formula, the normal engine works but Spark gives nothing.

    Example 2:

    For formula in group recipe

    (arrayContains(${precedent_years}, val('date_creation_order_year'))) 
    && (arrayContains(${trimesters_to_analyse}, trimester))

    with variables

    {
        "precedent_year": [2018, 2019], 
        "trimesters_to_analyse": [2, 3]
    }

    This works well in group recipe pre-filter formula but not for the prepare recipe formula.

    Simply I want to filter the lines with the correct year in the range and the correct range of trimester.

    I am thinking that this may be because of the non-idempotent problem of retrieving values of variables expansion.

    Thank you

  • Dataiker Posts: 355 Dataiker

    for the Spark issue, indeed in Spark variables are not available via the `variables` object. You need to use `parseJson(${cluster_model_1_naming_mapping})` instead

    The second issue is more puzzling. Can you show the step of the Prepare recipe where you use the formula?

  • Registered Posts: 13 ✭✭✭
    edited July 2024

    For Spark issue, the editor of formula gives me this error:

    Formula is invalid : Incorrect formula: 'filter(parseJson([{"cluster":"cluster_outliers","new_name":"HCA-BF-HP-BN"},{"cluster":"cluster_0","new_name":"BCA-BF-BP-HN"},{"cluster":"cluster_1","new_name":"BCA-BF-HP-BN"},{"cluster":"cluster_2","new_name":"BCA-BF-BP-BN"},{"cluster":"cluster_3","new_name":"BCA-HF-BP-BN"},{"cluster":"cluster_4","new_name":"HCA-HF-BP-HN"},{"cluster":"cluster_5","new_name":"MCA-BF-BP-BN"}]), item, item["cluster"] == cluster_labels)[0]["new_name"]' : Missing number, string, identifier, regex, or parenthesized expression(Parsing error at offset 18)

    (Sorry I don't have time at this moment for the second one, please allow me to do this in later post.)

  • Dataiker Posts: 355 Dataiker

    apologies, I lost the quotes when copying: it should be `parseJson('${cluster_model_1_naming_mapping}')`

  • Registered Posts: 13 ✭✭✭

    Thank you, it worked this way.

    Finally I believe that the second example is the same problem of the quote.

    Have a nice day!

  • Registered Posts: 13 ✭✭✭

    Hello.

    There is a problem with formula again. I used what you suggested as formula and it worked in the "prepare" dataset recipe. This time, I use the same formula in create computed colunms in a "joined recipe" and the parser failed to parse the filter function.

    This is the formula

    filter(parseJson('${cluster_model_1_naming_mapping}'), item, item["cluster"] == before_cluster_labels)[0]["new_name"]

    The error is showed as in the picture bug1.png

    Thank you in advance for your help.

  • Dataiker Posts: 355 Dataiker

    Hi,

    this is indeed a parse-time error, and the recipe will pretend to be incorrectly setup, but the expression seems actually correct so the recipe should be working fine if you run it

  • Registered Posts: 13 ✭✭✭

    Hi I ran the formula but the same error appears. bug2.png

  • Dataiker Posts: 355 Dataiker

    considering the operation you're doing (enriching a dataset with a fixed set of values), you should try putting the mapping in an Editable dataset and doing a Join recipe to get the mapped value.

    If you absolutely need to use a Grouping recipe, can you check the version of DSS you are using?

  • Registered Posts: 13 ✭✭✭

    Thank you. I am using the DSS 7.0.

    It's a good idea with editable dataset. In fact, the formula worked in the prepare recipe, I could use the formula in prepare recipe too. The reason why I try to use this, it's to reduce the the shape of the flow. If I use the Editable dataset, once I need to use the variables in several places, it will ruin the shape of the flow and reduce the maintenanablity.

    Thank you very much. I guess that I will have to use another solution.
    I am looking forward to your future improvement on this function.

Welcome!

It looks like you're new here. Sign in or register to get started.

Welcome!

It looks like you're new here. Sign in or register to get started.