How to execute a recipe after an empty dataset?

Solved!
vic
Level 2

Is there any way to check the readiness of a dataset? I have a dataset that might be empty after a Hive query. That shouldn't be a problem, but since it is (I cannot use it in a left join...), I decided to build another dataset that contains either the result, if it exists, or a dummy row if it does not.

All this just to be able to perform the join.
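For reference, this is what a left join against an empty right-hand table normally yields in SQL: every left row is kept, with NULL in the missing columns. The sketch below uses Python's built-in sqlite3 rather than Hive or DSS. The `clients` table and its contents are hypothetical, and the example says nothing about how DSS decides whether a dataset is ready.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables: `clients` has rows, `dernier_devis` is empty.
cur.execute("CREATE TABLE clients (id TEXT)")
cur.execute("CREATE TABLE dernier_devis (id TEXT, num_dernier_devis TEXT)")
cur.executemany("INSERT INTO clients VALUES (?)", [("a",), ("b",)])

# Left join against the empty table: left rows survive, right columns are NULL.
joined = cur.execute("""
    SELECT c.id, d.num_dernier_devis
    FROM clients c
    LEFT JOIN dernier_devis d ON d.id = c.id
    ORDER BY c.id
""").fetchall()
```

Here `joined` holds one tuple per row of `clients`, with `None` in the second position.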

I tried many different ways but didn't succeed... Any advice on how to overcome the "Input dataset is not ready" message?

Here I only read the dataset if there is at least one file (which is how I understand DSS decides whether a dataset is ready, see community.dataiku.com/.../Left-join-of-empty-dataset), but I still get the error message.

 

 

 

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
dernier_DEVIS = dataiku.Dataset("DERNIER_DEVIS")
rows = int(dernier_DEVIS.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])
files = int(dernier_DEVIS.get_last_metric_values().get_metric_by_id('basic:COUNT_FILES')['lastValues'][0]['value'])

# If the query returned nothing, insert a placeholder row to prevent the DSS failure
if files == 0:
    der_DEVIS_df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]], columns=['id', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    # Otherwise load the real query result
    der_DEVIS_df = dernier_DEVIS.get_dataframe()
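The metric lookups above index straight into ['lastValues'][0]['value'], which would raise if the metric were missing or had no values yet. Below is a minimal sketch of a more defensive read. It assumes the metric is a dict shaped like the one indexed above, or None when absent. `metric_as_int` is a hypothetical helper, not part of the dataiku package, and the example uses plain dictionaries so it can run outside DSS.

```python
def metric_as_int(metric):
    """Return the latest metric value as an int, or None if it is unavailable."""
    if not metric:
        return None
    last_values = metric.get("lastValues") or []
    if not last_values:
        return None
    try:
        return int(last_values[0]["value"])
    except (KeyError, TypeError, ValueError):
        return None


# Plain dictionaries standing in for the metric objects used in the post (assumption).
files = metric_as_int({"lastValues": [{"value": "0"}]})
missing = metric_as_int(None)
never_computed = metric_as_int({"lastValues": []})

# Treat "no files" and "metric unavailable" the same way: use the placeholder row.
use_placeholder = files in (None, 0)
```

This only hardens the metric read. It does not by itself avoid the "Input dataset is not ready" error discussed in this thread.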

 

 

 

 

1 Solution
khairulfathi
Level 2

Had the same issue earlier, resolved it by using a different engine. Tested with DSS and Spark.


15 Replies
Ignacio_Toledo

Hi @vic,

Before going into the code you posted, where are you storing the output of the Hive query that could be empty? HDFS, local filesystem, a SQL database, etc?

I just made a test where I performed a hive query (that stores the result in HDFS) producing an empty result, and I didn't have the problem you are describing. I'm using version 8.0.3.

Could you maybe provide some more information from the logs and/or a screenshot of your flow?

Cheers.

vic
Level 2
Author

Hi,

It is stored in HDFS as well, but the DSS version is 6.0.5.

If I do

 

select * from table

 

it works without any problem if the query returns records, but if the result is empty, for example with the query:

 

select * from table where 1=2

 

Then I will get:

 

Input dataset is not ready (no files found)
Input dataset projectkey_DATASET is not ready, caused by: CodedIOException: in running compute_DATASET: No files found in dataset

 

To avoid this problem I tried the following in Python:

 

# Imports needed by the snippet
import dataiku
import pandas as pd
import numpy as np

dataset = dataiku.Dataset("dataset")
rows = int(dataset.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])

# If the query returned nothing, insert a placeholder row to prevent the DSS failure
if rows == 0:
    dataset_df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]], columns=['id_assure', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    dataset_df = dataset.get_dataframe()

 

But it does not work

vic
Level 2
Author

The aforementioned Python code works perfectly in a notebook but fails with the "Input dataset is not ready" error when executed as a recipe.

Ignacio_Toledo

Hi @vic,

As a recipe, I expect it to fail when you execute the statement

dataset = dataiku.Dataset("dataset")

If "dataset" has never been built at least once, it won't work.

But I don't have a clear picture of how you are handling this in the flow. Where do you insert this code recipe in the flow? Is the dernier_DEVIS dataset shown as "built"?

vic
Level 2
Author

Yes, it is indeed built, but it is empty. As you can see it is dark blue in the flow since it has been successfully built.

 
 

Capture d’écran 2021-02-19 181944.png

 

Capture d’écran 2021-02-19 183021bis.png

 

Ignacio_Toledo

Mmm, I wonder whether this is related to the DSS version in the end. I'm out of ideas: I cannot reproduce this behavior in my DSS 8.0.5 instance.

Any dataiker around that could give some hints?

vic
Level 2
Author

Thank you for your help, I guess I will have to wait for the version upgrade.

 

The workaround I used in the meantime is to create a dummy one-row dataset with 'dummy' as the id.

I then modified the Hive query that generates the desired dataset and added the following:

UNION ALL SELECT * FROM dummy

 

In my final data query I have a where clause to filter this row out:

WHERE id NOT LIKE 'dummy'
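
The two steps of this workaround can be tried outside DSS. The sketch below uses Python's built-in sqlite3 in place of Hive, so it only illustrates the SQL pattern, not the DSS readiness check. The table names `devis` and `dernier_devis` and the second column are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: `devis` is the source table, `dummy` holds the one placeholder row.
cur.execute("CREATE TABLE devis (id TEXT, num_dernier_devis TEXT)")
cur.execute("CREATE TABLE dummy (id TEXT, num_dernier_devis TEXT)")
cur.execute("INSERT INTO dummy VALUES ('dummy', NULL)")

# Step 1: a query that returns nothing, padded with the dummy row so the output is never empty.
cur.execute("""
    CREATE TABLE dernier_devis AS
    SELECT * FROM devis WHERE 1 = 2
    UNION ALL
    SELECT * FROM dummy
""")
padded_rows = cur.execute("SELECT COUNT(*) FROM dernier_devis").fetchone()[0]

# Step 2: downstream, filter the placeholder row back out.
real_rows = cur.execute(
    "SELECT * FROM dernier_devis WHERE id NOT LIKE 'dummy'"
).fetchall()
```

The intermediate table always contains at least the placeholder row, and the final filter removes it again so it never reaches the joined result.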
vic
Level 2
Author

We finally have version 8.0.5 and are still experiencing the same problem.

vic
Level 2
Author

I can reproduce it in 8.0.5; please see the last comment on this page.

vg
Level 1

If you replace the visual recipe that creates your dataset with an SQL recipe, then instead of "dataset is not built" you'll get a dataset with no rows but with columns. Then you can work with it, since it is already built.

vic
Level 2
Author

It is a Hive recipe since the data is stored in HDFS, so an SQL recipe is not possible.

khairulfathi
Level 2

Had the same issue earlier, resolved it by using a different engine. Tested with DSS and Spark.

vic
Level 2
Author

What engine did you use? I can't manage to make it work on DSS 8.0.5 with either Hive or PySpark.

There is an option in the dataset advanced properties: "empty as not ready" (see attachments). When it is not selected, it should work, but it doesn't (see the "dataset not ready" error, also in the attachments).

 

khairulfathi
Level 2

I'm using the Spark engine, on DSS 8.0.3. Hive definitely wouldn't work.

vic
Level 2
Author

Thanks, it is finally working on Spark... but many other problems arise now with this change.
