How to execute a recipe after an empty dataset?

vic Registered Posts: 16 ✭✭✭✭
edited July 16 in Using Dataiku

Is there any way to check the readiness of a dataset? I have a dataset that might be empty after a Hive query. That shouldn't be a problem, but since it is (I cannot use it in a left join...), I decided to build another dataset that would contain either the result, if it exists, or a dummy row if it does not.

All this just to be able to perform the join.
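
A minimal pandas sketch of the idea described above: keep the query result when it has rows, otherwise substitute a single all-null placeholder row so that a left join has something to join against. The helper name `result_or_placeholder` and the sample frames are hypothetical; the column names are the ones used in the recipe in this post. It works on plain DataFrames only and says nothing about how DSS decides whether a dataset is ready.

```python
import numpy as np
import pandas as pd

COLUMNS = ["id", "num_dernier_devis", "date_dernier_devis", "canal_dernier_devis"]


def result_or_placeholder(df, columns=COLUMNS):
    """Return df unchanged if it has rows, else a one-row all-NaN frame (hypothetical helper)."""
    if len(df) > 0:
        return df
    return pd.DataFrame([[np.nan] * len(columns)], columns=columns)


# Illustrative data: an empty query result and a small left-hand table
empty_result = pd.DataFrame(columns=COLUMNS)
left = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

right = result_or_placeholder(empty_result)
# The left join keeps both left-hand rows; the right-hand columns are all null
joined = left.merge(right, on="id", how="left")
```

With the placeholder in place, `joined` keeps every row of `left`, which is the behavior wanted from the left join.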

I tried many different ways but didn't succeed... Any advice on how to overcome the "Input dataset is not ready" message?

Here I only read the dataset if there is at least one file (which is how I understand DSS decides whether a dataset is ready, see community.dataiku.com/.../Left-join-of-empty-dataset), but I still get the error message.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
dernier_DEVIS = dataiku.Dataset("DERNIER_DEVIS")
# Fetch the last computed metrics once and reuse them
metrics = dernier_DEVIS.get_last_metric_values()
rows = int(metrics.get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])
files = int(metrics.get_metric_by_id('basic:COUNT_FILES')['lastValues'][0]['value'])

# If the query returned nothing, insert an empty row to prevent a DSS failure
if files == 0:
    der_DEVIS_df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]], columns=['id', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    der_DEVIS_df = dernier_DEVIS.get_dataframe()
#der_DEVIS_df

Answers

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron

    Hi @vic,

    Before going into the code you posted, where are you storing the output of the Hive query that could be empty? HDFS, local filesystem, a SQL database, etc?

    I just ran a test with a Hive query (storing its result in HDFS) that produced an empty result, and I didn't hit the problem you are describing. I'm using version 8.0.3.

    Could you maybe provide some more information from the logs and/or a screenshot of your flow?

    Cheers.

  • vic
    vic Registered Posts: 16 ✭✭✭✭
    edited July 17

    Hi,

    It is stored in HDFS as well, but the DSS version is 6.0.5.

    If I do

    select * from table

    it works without any problem when there are records, but if the query result is empty, for example with:

    select * from table where 1=2

    Then I will get:

    Input dataset is not ready (no files found)
    Input dataset projectkey_DATASET is not ready, caused by: CodedIOException: in running compute_DATASET: No files found in dataset

    To avoid this problem, I tried the following in Python:

    import dataiku
    import pandas as pd, numpy as np

    dataset = dataiku.Dataset("dataset")
    rows = int(dataset.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])

    # If the query returned nothing, insert an empty row to prevent a DSS failure
    if rows == 0:
        dataset_df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]], columns=['id_assure', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
    else:
        dataset_df = dataset.get_dataframe()

    But it does not work.
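
One fragile spot in the lookup above is `['lastValues'][0]`, which raises if no value has been computed for the metric yet. A defensive sketch, assuming the metric dict has the shape used in the snippet (`{'lastValues': [{'value': ...}]}`); the helper name `metric_value` is hypothetical and the sample dicts are illustrative, not real DSS output.

```python
def metric_value(metric, default=0):
    """Return the latest value of a metric dict as int, or `default` if none exists.

    Assumed shape (taken from the snippet above):
    {'lastValues': [{'value': ...}, ...]}
    """
    if not metric:
        return default
    last_values = metric.get("lastValues") or []
    if not last_values:
        return default
    return int(last_values[0]["value"])


# Illustrative dicts only, not real DSS output
computed = {"lastValues": [{"value": "0"}]}
never_computed = {"lastValues": []}

print(metric_value(computed))            # 0
print(metric_value(never_computed))      # 0 (falls back to the default)
print(metric_value(None, default=-1))    # -1
```

In a recipe it would be called as `metric_value(dataset.get_last_metric_values().get_metric_by_id('basic:SIZE'))`. This only hardens the metric lookup; it does not by itself address the "Input dataset is not ready" error.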

  • vic
    vic Registered Posts: 16 ✭✭✭✭

    The Python code above works perfectly in a notebook but fails with "Input dataset is not ready" when executed as a recipe.

  • Ignacio_Toledo
    edited July 17

    Hi @vic,

    As a recipe I expect it will fail when you execute the statement

    dataset = dataiku.Dataset("dataset")

    If "dataset" was never built at least once, it won't work.

    But I don't yet have a clear picture of how you are handling this in the flow. Where do you insert this code recipe in the flow? Is the dernier_DEVIS dataset shown as "built"?

  • vic
    vic Registered Posts: 16 ✭✭✭✭

    Yes, it is indeed built, but it is empty. As you can see it is dark blue in the flow since it has been successfully built.

    Capture d’écran 2021-02-19 181944.png

    Capture d’écran 2021-02-19 183021bis.png

  • Ignacio_Toledo

    Hmm, I wonder whether this is related to the DSS version after all; I'm out of ideas. I cannot reproduce this behavior on my DSS 8.0.5 instance.

    Any Dataiker around who could give some hints?

  • vic
    vic Registered Posts: 16 ✭✭✭✭
    edited July 17

    Thank you for your help, I guess I will have to wait for the version upgrade.

    In the meantime, my workaround is to create a dummy one-row dataset with 'dummy' as id.

    I then modified the Hive query that generates the desired dataset and added the following:

    UNION ALL SELECT * FROM dummy

    In my final data query I have a where clause to filter this row out:

    WHERE id NOT LIKE 'dummy'
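
The two steps of this workaround can be tried out end to end with a small sketch. It uses Python's built-in `sqlite3` as a stand-in for Hive, so it only illustrates the SQL logic, not DSS or Hive behavior; the table names `devis` and `padded` and the two-column schema are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE devis (id TEXT, num_dernier_devis TEXT);
    CREATE TABLE dummy (id TEXT, num_dernier_devis TEXT);
    INSERT INTO dummy VALUES ('dummy', NULL);
    """
)

# Step 1: the query that may return nothing, padded with the dummy row
padded = conn.execute(
    "SELECT * FROM devis WHERE 1=2 UNION ALL SELECT * FROM dummy"
).fetchall()

# Step 2: store the padded result, then filter the dummy row back out
conn.execute("CREATE TABLE padded (id TEXT, num_dernier_devis TEXT)")
conn.executemany("INSERT INTO padded VALUES (?, ?)", padded)
final = conn.execute("SELECT * FROM padded WHERE id NOT LIKE 'dummy'").fetchall()

print(padded)  # [('dummy', None)]
print(final)   # []
```

One caveat with the filter: rows whose `id` is NULL are also excluded, because `NULL NOT LIKE 'dummy'` evaluates to NULL rather than true.
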
  • vg
    vg Registered Posts: 1 ✭✭✭

    If you replace the visual recipe that creates your dataset with an SQL recipe, then instead of 'dataset is not built' you'll get a dataset with no rows but with columns. Then you can work with it, since it is already built.
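
As a rough analogy in pandas (not a statement about DSS internals), a table with columns but zero rows is perfectly usable on the right-hand side of a left join. The frames and column names below are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Zero rows, but the columns (and their dtypes) are defined
right = pd.DataFrame(
    {
        "id": pd.Series(dtype="int64"),
        "num_dernier_devis": pd.Series(dtype="object"),
    }
)

joined = left.merge(right, on="id", how="left")
print(len(joined))           # 2
print(list(joined.columns))  # ['id', 'name', 'num_dernier_devis']
```

Every row of `left` survives, and the right-hand column is simply null.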

  • vic
    vic Registered Posts: 16 ✭✭✭✭

    We finally have version 8.0.5 and are still experiencing the same problem.

  • vic
    vic Registered Posts: 16 ✭✭✭✭

    It is a Hive recipe since the data is stored in HDFS; SQL is not possible.

  • vic
    vic Registered Posts: 16 ✭✭✭✭

    Which engine did you use? I can't get it to work on DSS 8.0.5 with either Hive or PySpark.

    There is an option in the dataset's advanced properties: "empty as not ready" (see attachments). When it is not selected, it should work, but it doesn't (see the "dataset not ready" error, also in the attachments).

  • vic
    vic Registered Posts: 16 ✭✭✭✭

    I can reproduce it in 8.0.5; please see the last comment on this page.

  • khairulfathi
    khairulfathi Registered Posts: 11 ✭✭✭✭

    I'm using the Spark engine on DSS 8.0.3. Hive definitely wouldn't work.

  • vic
    vic Registered Posts: 16 ✭✭✭✭

    Thanks, it is finally working on Spark... but many other problems arise now with this change.
