Non idempotent problem with variable expansions in recipe formula language

Options
haoxian
haoxian Registered Posts: 13 ✭✭✭
edited July 16 in Using Dataiku

Hello,

I am trying to use variable expansion to rename my clusters_label column and some date matching operations with formula.

1. A same formula code works with the "group" recipe, but not formula in "prepare" recipe.

2. The third way of accessing variables in formula of recipe are not idempotent.

${variable_name}, variables.variable_name, variable_name

have different results.

3. Spark engine outputs empty column while local stream has the correct output.

Do you have any suggestion/solution on this? Thank you very much.

PS: I have to use Spark(due to the data volume) and formula in recipe.

Answers

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Options

    Hi,

    1) Can you share the formula, and where/how it's used in the various recipes

    2) what are the different results? Note that using ${...} means you replace the value directly in the formula text, so it happens before evaluation (as opposed to the other 2)

    3) is it with the formula of 1) ? or an unrelated recipe?

  • haoxian
    haoxian Registered Posts: 13 ✭✭✭
    edited July 17
    Options

    Hello, Thank you for you quick response.

    Example 1:

    Hereby the variable

    {
        "cluster_model_1_naming_mapping": [
            {"cluster": "cluster_outliers", "new_name": "HCA-BF-HP-BN"},
            {"cluster": "cluster_0", "new_name": "BCA-BF-BP-HN"},
            {"cluster": "cluster_1", "new_name": "BCA-BF-HP-BN"},
            {"cluster": "cluster_2", "new_name": "BCA-BF-BP-BN"},
            {"cluster": "cluster_3", "new_name": "BCA-HF-BP-BN"},
            {"cluster": "cluster_4", "new_name": "HCA-HF-BP-HN"},
            {"cluster": "cluster_5", "new_name": "MCA-BF-BP-BN"}
        ]
    }

    The formula used is

    filter(variables.cluster_model_1_naming_mapping, item, item["cluster"] == cluster_labels)[0]["new_name"]

    Where cluster_labels is the output column of a KMeans model, whose values are "cluster_1", "cluster_2" .. and so on.

    With this formula, the normal engine works but Spark gives nothing.

    Example 2:

    For formula in group recipe

    (arrayContains(${precedent_years}, val('date_creation_order_year'))) 
    && (arrayContains(${trimesters_to_analyse}, trimester))

    with variables

    {
        "precedent_year": [2018, 2019], 
        "trimesters_to_analyse": [2, 3]
    }

    This works well in group recipe pre-filter formula but not for the prepare recipe formula.

    Simply I want to filter the lines with the correct year in the range and the correct range of trimester.

    I am thinking that this may be because of the non-idempotent problem of retrieving values of variables expansion.

    Thank you

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Options

    for the Spark issue, indeed in Spark variables are not available via the `variables` object. You need to use `parseJson(${cluster_model_1_naming_mapping})` instead

    The second issue is more puzzling. Can you show the step of the Prepare recipe where you use the formula?

  • haoxian
    haoxian Registered Posts: 13 ✭✭✭
    edited July 17
    Options

    For Spark issue, the editor of formula gives me this error:

    Formula is invalid : Incorrect formula: 'filter(parseJson([{"cluster":"cluster_outliers","new_name":"HCA-BF-HP-BN"},{"cluster":"cluster_0","new_name":"BCA-BF-BP-HN"},{"cluster":"cluster_1","new_name":"BCA-BF-HP-BN"},{"cluster":"cluster_2","new_name":"BCA-BF-BP-BN"},{"cluster":"cluster_3","new_name":"BCA-HF-BP-BN"},{"cluster":"cluster_4","new_name":"HCA-HF-BP-HN"},{"cluster":"cluster_5","new_name":"MCA-BF-BP-BN"}]), item, item["cluster"] == cluster_labels)[0]["new_name"]' : Missing number, string, identifier, regex, or parenthesized expression(Parsing error at offset 18)

    (Sorry I don't have time at this moment for the second one, please allow me to do this in later post.)

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Options

    apologies, I lost the quotes when copying: it should be `parseJson('${cluster_model_1_naming_mapping}')`

  • haoxian
    haoxian Registered Posts: 13 ✭✭✭
    Options

    Thank you, it worked this way.

    Finally I believe that the second example is the same problem of the quote.

    Have a nice day!

  • haoxian
    haoxian Registered Posts: 13 ✭✭✭
    Options

    Hello.

    There is a problem with formula again. I used what you suggested as formula and it worked in the "prepare" dataset recipe. This time, I use the same formula in create computed colunms in a "joined recipe" and the parser failed to parse the filter function.

    This is the formula

    filter(parseJson('${cluster_model_1_naming_mapping}'), item, item["cluster"] == before_cluster_labels)[0]["new_name"]

    The error is showed as in the picture bug1.png

    Thank you in advance for your help.

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Options

    Hi,

    this is indeed a parse-time error, and the recipe will pretend to be incorrectly setup, but the expression seems actually correct so the recipe should be working fine if you run it

  • haoxian
    haoxian Registered Posts: 13 ✭✭✭
    Options

    Hi I ran the formula but the same error appears. bug2.png

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Options

    considering the operation you're doing (enriching a dataset with a fixed set of values), you should try putting the mapping in an Editable dataset and doing a Join recipe to get the mapped value.

    If you absolutely need to use a Grouping recipe, can you check the version of DSS you are using?

  • haoxian
    haoxian Registered Posts: 13 ✭✭✭
    Options

    Thank you. I am using the DSS 7.0.

    It's a good idea with editable dataset. In fact, the formula worked in the prepare recipe, I could use the formula in prepare recipe too. The reason why I try to use this, it's to reduce the the shape of the flow. If I use the Editable dataset, once I need to use the variables in several places, it will ruin the shape of the flow and reduce the maintenanablity.

    Thank you very much. I guess that I will have to use another solution.
    I am looking forward to your future improvement on this function.

Setup Info
    Tags
      Help me…