How to execute a recipe after an empty dataset?

vic · February 2021

Is there any way to check the readiness of a dataset? I have a dataset that might be empty after a Hive query. That shouldn't be a problem, but since it is (I cannot use it in a left join...), I decided to build another dataset that would contain either the result if it exists or a dummy line if it does not.

All this just to be able to perform the join.

I tried maaaany different ways but didn't succeed... Any advice on how to overcome the "Input dataset is not ready" message?

Here I only read the dataset if it contains at least one file (which is how I understand DSS decides whether a dataset is ready, see community.dataiku.com/.../Left-join-of-empty-dataset), but I still get the error message.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
dernier_DEVIS = dataiku.Dataset("DERNIER_DEVIS")
rows = int(dernier_DEVIS.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])
files = int(dernier_DEVIS.get_last_metric_values().get_metric_by_id('basic:COUNT_FILES')['lastValues'][0]['value'])

# if the query result is empty, insert a placeholder row to prevent a DSS failure
if files==0:
    der_DEVIS_df = pd.DataFrame([[np.nan,np.nan,np.nan,np.nan]], columns = ['id', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    dernier_DEVIS_df = dernier_DEVIS.get_dataframe()
    der_DEVIS_df = dernier_DEVIS_df
#der_DEVIS_df
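
The fallback logic above can be separated from the DSS-specific read so that it can be checked outside a recipe. The following is a minimal sketch using only pandas; the helper name with_placeholder_row is hypothetical and not part of the dataiku package. Judging from the rest of this thread, it does not by itself get past the "Input dataset is not ready" check when run as a recipe.

```python
import numpy as np
import pandas as pd

# Column names taken from the snippet above.
COLUMNS = ["id", "num_dernier_devis", "date_dernier_devis", "canal_dernier_devis"]


def with_placeholder_row(df, columns=COLUMNS):
    """Return df unchanged if it has rows, else a one-row all-NaN frame.

    Hypothetical helper for illustration; not part of the dataiku package.
    """
    if df is None or df.empty:
        # One row of NaN so that a downstream join has something to read.
        return pd.DataFrame([[np.nan] * len(columns)], columns=columns)
    return df


# Usage with an empty frame standing in for an empty query result.
empty = pd.DataFrame(columns=COLUMNS)
placeholder = with_placeholder_row(empty)
```

In a recipe, the frame passed in would be whatever the dataset read returns; here an empty DataFrame stands in for it.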

khairulfathi · July 2021

Had the same issue earlier, resolved it by using a different engine. Tested with DSS and Spark.

Ignacio_Toledo · February 2021

Hi @vic,

Before going into the code you posted, where are you storing the output of the Hive query that could be empty? HDFS, local filesystem, a SQL database, etc?

I just ran a test in which a Hive query (storing its result in HDFS) produced an empty result, and I didn't have the problem you are describing. I'm using version 8.0.3.

Could you maybe provide some more information from the logs and/or a screenshot of your flow?

Cheers.

vic · February 2021

Hi,

It is stored in HDFS as well but DSS version is 6.0.5.

If I do

select * from table

it works without a problem if the query returns records, but if the result is empty, for example with the query:

select * from table where 1=2

Then I will get:

Input dataset is not ready (no files found)
Input dataset projectkey_DATASET is not ready, caused by: CodedIOException: in running compute_DATASET: No files found in dataset

To avoid this problem I tried the following in Python:

import dataiku
import pandas as pd, numpy as np

dataset = dataiku.Dataset("dataset")
rows = int(dataset.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])

# if the query result is empty, insert a placeholder row to prevent a DSS failure
if rows==0:
    dataset_df = pd.DataFrame([[np.nan,np.nan,np.nan,np.nan]], columns = ['id_assure', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    dataset_df = dataset.get_dataframe()

But it does not work.

vic · February 2021

The aforementioned Python code works perfectly in a notebook but fails with the "Input dataset is not ready" error when executed as a recipe.

Ignacio_Toledo · February 2021

Hi @vic,

As a recipe I expect it will fail when you execute the statement

dataset = dataiku.Dataset("dataset")

if "dataset" was never "built" at least once, it won't work.

But I'm missing a clearer picture of how you are handling this in the flow. At what point do you insert this code recipe in the flow? Is the dernier_DEVIS dataset shown as "built"?

vic · February 2021

Yes, it is indeed built, but it is empty. As you can see it is dark blue in the flow since it has been successfully built.

Capture d’écran 2021-02-19 181944.png

Capture d’écran 2021-02-19 183021bis.png

Ignacio_Toledo · February 2021

Hmm, I wonder whether this is ultimately related to the DSS version. I'm out of ideas: I cannot reproduce this behavior on my DSS 8.0.5 instance.

Any Dataiker around who could give some hints?

vic · February 2021

Thank you for your help, I guess I will have to wait for the version upgrade.

The workaround I made in the meantime is to create a dummy one-row dataset with 'dummy' as id.

I then modified the Hive query that generates the desired dataset and added the following:

UNION ALL SELECT * FROM dummy

In my final data query I have a where clause to filter this row out:

WHERE id NOT LIKE 'dummy'
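
The dummy-row workaround described above can be illustrated outside Hive with pandas. This is only a sketch under stated assumptions: the customers table, its columns, and the sample values are hypothetical stand-ins, and the list concatenation and boolean filter play the role of the UNION ALL and WHERE clauses.

```python
import pandas as pd

# Hypothetical main table that the empty dataset is left-joined onto.
customers = pd.DataFrame({"id": ["a1", "a2"], "name": ["Ann", "Bob"]})

# Rows returned by the query (empty here) and the one-row dummy dataset.
quote_rows = []
dummy_rows = [{"id": "dummy", "num_dernier_devis": None}]

# Equivalent of UNION ALL SELECT * FROM dummy: never fewer than one row.
quotes_with_dummy = pd.DataFrame(
    quote_rows + dummy_rows, columns=["id", "num_dernier_devis"]
)

# Left join: the dummy id matches no customer, so nothing is attached to them.
joined = customers.merge(quotes_with_dummy, on="id", how="left")

# Equivalent of WHERE id NOT LIKE 'dummy' (no wildcard, so an exact match).
result = joined[joined["id"] != "dummy"]
```

The sentinel only has to survive long enough for the intermediate dataset to be non-empty; the final filter removes it wherever it might still appear.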

vg · May 2021

If you replace the visual recipe that creates your dataset with an SQL recipe, then instead of the 'dataset is not built' error you'll get a dataset with no rows but with columns. Then you can work with it, since it is already built.

vic · June 2021

We finally have version 8.0.5 and are still experiencing the same problem.

vic · June 2021

It is a Hive recipe since the data sits on HDFS; SQL is not possible.

vic · July 2021

What engine did you use? I can't manage to make it work on DSS 8.0.5 with either Hive or PySpark.

There is an option in the dataset's advanced properties: "empty as not ready" (see attachments). When it is not selected, it should work, but it doesn't (see the "dataset not ready" error, also in the attachments).

vic · July 2021

I can reproduce it in 8.0.5; please see the last comment on this page.

khairulfathi · July 2021

I'm using Spark engine, and on DSS 8.0.3. Hive definitely wouldn't work.

vic · July 2021

Thanks, it's finally working on Spark... but many other problems arise now with this change.
