How to diagnose a failing SQL Probe metric (Dataiku 7.01)

Keufran
Keufran Registered Posts: 12 ✭✭✭✭

Hello World,

We fail to compute a metric (as a SQL Probe) on a dataset. The request works great in a query editor but fails when used as a probe. How can we diagnose ?

We of course double-checked syntax, table paths notation, etc...

Answers

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Hi, @Keufran
    ! Can you provide any further details on the thread to assist users in helping you find a solution. Also, can you let us know if you’ve tried any fixes already?This should lead to a quicker response from the community.

  • Keufran
    Keufran Registered Posts: 12 ✭✭✭✭

    Thank you @CoreyS
    for your remarks.

    Infortunatly, I don't see what to add. MA request is failing as a SQL Probe but working in a query editor. DSS gives no explicit error, thus how can I diagnose ?

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Thanks @Keufran
    hopefully this will at least give some more added visibility to help find you a solution.

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 330 Neuron

    Hi @Keufran
    ,

    Have you tried the Click to run this now / Run on the Metrics Edit page below the SQL Probe? What is the result of that?

    And can you say more about what you mean by fail? Do you see an error message somewhere or the result doesn't update or it's incorrect?

    Marlan

  • Keufran
    Keufran Registered Posts: 12 ✭✭✭✭
    edited July 2024

    Hi @Marlan,

    Thanks for your interest.

    When I "run this now" the compute dialog opens, no error is shown and when I click on "Last run results" only the computation time is displayed, not the metric.

    Here is the request:

    SELECT
    A.nb_cmd[0]/(A.nb_cmd[0]+A.nb_cmd[1])*100 AS percent_training
    FROM (
    SELECT
    COLLECT_LIST(B.nb_cmd) AS nb_cmd
    FROM (
    SELECT
    is_placebo,
    COUNT(DISTINCT ref_externe) AS nb_cmd
    FROM
    ${hiveconf:database_socle}.${projectKey}_commandes_ftth_historique_a_jour
    GROUP BY
    is_placebo
    ORDER BY
    is_placebo DESC
    ) B
    ) A;

    The dataset is not purly managed by DSS. It's HDFS.

    With a pySpark script, all is fine

    def process(dataset, partition_id):
    # dataset is a dataiku.Dataset object
    bdd = dataiku.get_custom_variables()["database_socle"]
    project = dataiku.get_custom_variables()["projectKey"]
    query = f"SELECT A.nb_cmd[0]/(A.nb_cmd[0]+A.nb_cmd[1])*100 AS percent_training FROM (SELECT COLLECT_LIST(B.nb_cmd) AS nb_cmd FROM (SELECT is_placebo,COUNT(DISTINCT ref_externe) AS nb_cmd FROM {bdd}.{project}_commandes_ftth_historique_a_jour GROUP BY is_placebo ORDER BY is_placebo DESC) B) A"
    dataset = dataiku.Dataset("commandes_ftth_historique_a_jour")
    executor = dataiku.HiveExecutor(dataset=dataset)
    resultdf = executor.query_to_df(query)
    res = resultdf.iloc[0].values[0]
    return {"placebo": res}

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 330 Neuron

    Hi @Keufran
    ,

    It's odd that you don't get an error message as I'd expect one given that you are using a couple of variables in your SQL script. Unfortunately, variables besides those listed in the sample code (${DKU_DATASET_TABLE_NAME}, ${DKU_PARTITION_FILTER}, etc.) are not available in SQL Probes.

    I have submitted an idea to enable all project variables in SQL Probes. The status is Planned so I'd think it'd be coming within the next few releases.

    After clicking Run this now / Run the bar below the probe definition turns green? I'd expect it to turn red and display an error message.

    In any case, try replacing the variables you use with their values and see if it works then. If so, you'll need to decide if you are OK hard coding those values or not. I was not and my workaround was to create an extra recipe and dataset in my flow to do part of the calculation and then apply a SQL Probe to that. Not a fan of cluttering the flow with unneeded stuff but for me was better than hard coding and having that cause problems later.

    Marlan

  • Keufran
    Keufran Registered Posts: 12 ✭✭✭✭

    Thanks again Marlan for your interest.

    Results are the same if we use the variables or not.

    After clicking "Run" the status bar is green (RGB(215,238,205) to be precise as I'm colour-blind ). But "Last run results" display only the duration, not the metric itself.

    The Dataset contains more than 15M rows, so duplicating it (well, almost) to do calculations is not an option for now. I prefer to use the PySpark to bypass the SQL Probe problem.

    We will soon upgrade to Dataiku 9, I hope the problem will vanish...

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 330 Neuron

    Yes, that is green - very odd that it appears to run successfully with variables (not my experience on v9 ). OK, well I'm out of ideas. Sorry I couldn't help.

    Marlan

Setup Info
    Tags
      Help me…