
How to execute a recipe after an empty dataset?

vic
Level 2

Is there any way to check the readiness of a dataset? I have a dataset that might be empty after a Hive query. That shouldn't be a problem, but since it is (I cannot use it in a left join), I decided to build another dataset that contains either the result, if it exists, or a dummy row if it does not.

All this just to be able to perform the join.

I tried many different approaches but didn't succeed. Any advice on how to overcome the "Input dataset is not ready" message?

Here I only read the dataset if it has at least one file (which is how I understand DSS decides whether a dataset is ready; see community.dataiku.com/.../Left-join-of-empty-dataset), but I still get the error message.

 

 

 

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd
import numpy as np

# Read recipe inputs
dernier_DEVIS = dataiku.Dataset("DERNIER_DEVIS")

# Last computed metrics of the input dataset
metrics = dernier_DEVIS.get_last_metric_values()
rows = int(metrics.get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])
files = int(metrics.get_metric_by_id('basic:COUNT_FILES')['lastValues'][0]['value'])

# If the query result is empty, insert a placeholder row to prevent the DSS failure
if files == 0:
    der_DEVIS_df = pd.DataFrame(
        [[np.nan, np.nan, np.nan, np.nan]],
        columns=['id', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    der_DEVIS_df = dernier_DEVIS.get_dataframe()
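The metric lookups in the snippet above index straight into `['lastValues'][0]['value']`, which would raise if a metric has never been computed. A minimal sketch of a more defensive lookup follows. It assumes only the dictionary shape used in the post; the helper name `last_metric_value` is hypothetical and not part of the `dataiku` package. Note also that `basic:SIZE` appears to report the dataset size rather than a row count, so treat the `rows` name above with caution.

```python
def last_metric_value(metric, default=0):
    """Return the most recent value of a metric dict as an int.

    `metric` is expected to look like {'lastValues': [{'value': '3'}]},
    the shape used in the post above. Anything missing or malformed
    yields `default` instead of raising.
    """
    if not metric:
        return default
    last_values = metric.get("lastValues") or []
    if not last_values:
        return default
    try:
        return int(last_values[0]["value"])
    except (KeyError, TypeError, ValueError):
        return default


# Plain dictionaries stand in for the objects returned by DSS (assumption).
print(last_metric_value({"lastValues": [{"value": "0"}]}))
print(last_metric_value({"lastValues": []}, default=-1))
print(last_metric_value(None, default=-1))
```

In a recipe or notebook this would be called as `last_metric_value(metrics.get_metric_by_id('basic:COUNT_FILES'))`. Keep in mind that these are the last computed metric values, so they may be stale if the metrics were not refreshed after the latest build.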

 

 

 

 

7 Replies
Ignacio_Toledo

Hi @vic,

Before going into the code you posted, where are you storing the output of the Hive query that could be empty? HDFS, local filesystem, a SQL database, etc?

I just ran a test in which a Hive query (storing its result in HDFS) produced an empty result, and I didn't run into the problem you are describing. I'm using version 8.0.3.

Could you maybe provide some more information from the logs and/or a screenshot of your flow?

Cheers.

vic
Level 2
Author

Hi,

It is stored in HDFS as well, but my DSS version is 6.0.5.

If I do

 

select * from table

 

it works without any problem as long as there are records. But if the query result is empty, for example with:

 

select * from table where 1=2

 

Then I will get:

 

Input dataset is not ready (no files found)
Input dataset projectkey_DATASET is not ready, caused by: CodedIOException: in running compute_DATASET: No files found in dataset

 

To avoid this problem, I tried the following in Python:

 

dataset = dataiku.Dataset("dataset")
rows = int(dataset.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])

# If the query result is empty, insert a placeholder row to prevent the DSS failure
if rows == 0:
    dataset_df = pd.DataFrame(
        [[np.nan, np.nan, np.nan, np.nan]],
        columns=['id_assure', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    dataset_df = dataset.get_dataframe()

 

But it does not work.

vic
Level 2
Author

The Python code above works perfectly in a notebook, but fails with the "Input dataset is not ready" error when I run it as a recipe.

Ignacio_Toledo

Hi @vic,

As a recipe I expect it will fail when you execute the statement 

 

dataset = dataiku.Dataset("dataset")

 

if "dataset" was never "built" at least once, it won't work.

But I still don't have a clear picture of how you are handling this in the flow. Where do you insert this code recipe in the flow? Is the dernier_DEVIS dataset shown as "built"?

vic
Level 2
Author

Yes, it is indeed built, but it is empty. As you can see it is dark blue in the flow since it has been successfully built.

 
 

[Screenshot: Capture d’écran 2021-02-19 181944.png]

[Screenshot: Capture d’écran 2021-02-19 183021bis.png]

 

Ignacio_Toledo

Hmm, I wonder whether this is ultimately related to the DSS version. I'm out of ideas: I cannot reproduce this behavior in my DSS 8.0.5 instance.

Is there any Dataiker around who could give some hints?

vic
Level 2
Author

Thank you for your help, I guess I will have to wait for the version upgrade.

 

The workaround I put in place in the meantime is to create a dummy one-row dataset with 'dummy' as its id.

I then modified the Hive query that generates the desired dataset and added the following:

UNION ALL SELECT * FROM dummy

 

In my final data query I have a where clause to filter this row out:

WHERE id NOT LIKE 'dummy'
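
The logic of this workaround can be tried outside DSS. The sketch below uses an in-memory SQLite database purely as a stand-in for Hive. The tables `customers`, `quotes`, and `padded_quotes` and their contents are hypothetical; only the `dummy` row, the `UNION ALL`, and the `NOT LIKE 'dummy'` filter come from the post. It does not reproduce the DSS readiness check itself, only the padding and filtering steps.

```python
import sqlite3


def run_sentinel_demo():
    """Pad an empty query result with a sentinel row, then filter it out in a left join."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical left-hand table of the join
        CREATE TABLE customers (id TEXT, name TEXT);
        INSERT INTO customers VALUES ('c1', 'Alice'), ('c2', 'Bob');

        -- Hypothetical query result that happens to be empty
        CREATE TABLE quotes (id TEXT, num_dernier_devis TEXT);

        -- One-row dummy dataset with 'dummy' as id
        CREATE TABLE dummy (id TEXT, num_dernier_devis TEXT);
        INSERT INTO dummy VALUES ('dummy', NULL);

        -- The stored dataset is never empty thanks to the sentinel row
        CREATE TABLE padded_quotes AS
            SELECT * FROM quotes
            UNION ALL SELECT * FROM dummy;
        """
    )
    padded_count = conn.execute("SELECT COUNT(*) FROM padded_quotes").fetchone()[0]

    # Filter the sentinel out before joining, so unmatched customers are kept
    joined_rows = conn.execute(
        """
        SELECT c.id, c.name, q.num_dernier_devis
        FROM customers AS c
        LEFT JOIN (
            SELECT * FROM padded_quotes WHERE id NOT LIKE 'dummy'
        ) AS q ON c.id = q.id
        ORDER BY c.id
        """
    ).fetchall()
    conn.close()
    return padded_count, joined_rows


padded_count, joined_rows = run_sentinel_demo()
print(padded_count)
print(joined_rows)
```

One design point: the filter is applied inside the subquery rather than in a `WHERE` clause after the join, because a condition such as `q.id NOT LIKE 'dummy'` evaluates to NULL for customers without a match and would silently drop them from the left join.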