How to execute a recipe after an empty dataset?
Is there any way to check the readiness of a dataset? I have a dataset that might be empty after a Hive query. That shouldn't be a problem, but since it is (I cannot use it in a left join...), I decided to build another dataset that contains either the result, if it exists, or a dummy line if it does not.
All this just to be able to perform the join.
I tried many different ways but didn't succeed... Any advice on how to overcome the "Input dataset is not ready" message?
Here I only read the dataset if it contains at least one file (which is how I understand DSS decides whether a dataset is ready, see community.dataiku.com/.../Left-join-of-empty-dataset), but I still get the error message.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
dernier_DEVIS = dataiku.Dataset("DERNIER_DEVIS")
rows = int(dernier_DEVIS.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])
files = int(dernier_DEVIS.get_last_metric_values().get_metric_by_id('basic:COUNT_FILES')['lastValues'][0]['value'])

# if empty query, we insert an empty row to prevent DSS failure
if files == 0:
    der_DEVIS_df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]],
                                columns=['id', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    dernier_DEVIS_df = dernier_DEVIS.get_dataframe()
    der_DEVIS_df = dernier_DEVIS_df
#der_DEVIS_df
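The metric lookups above index straight into ['lastValues'][0]['value'], which raises an exception if the metric has never been computed. Below is a minimal sketch of a more defensive read. It assumes the metric object is a plain dict shaped like the one indexed in the snippet above; the helper name last_metric_value is hypothetical and not part of the dataiku API.

```python
def last_metric_value(metric, default=0):
    """Return the most recent value of a metric dict as an int.

    `metric` is assumed to look like {'lastValues': [{'value': ...}]},
    as in the snippet above. Falls back to `default` when the metric is
    missing, has no recorded values, or holds a non-integer value.
    """
    if not metric:
        return default
    last_values = metric.get("lastValues") or []
    if not last_values:
        return default
    try:
        return int(last_values[0].get("value"))
    except (TypeError, ValueError):
        return default
```

In the recipe, this could wrap each get_metric_by_id(...) result so that a missing metric is treated as "empty" rather than crashing.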
Best Answer
-
I had the same issue earlier and resolved it by using a different engine. Tested with DSS and Spark.
Answers
-
Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 412 Neuron
Hi @vic,
Before going into the code you posted, where are you storing the output of the Hive query that could be empty? HDFS, local filesystem, a SQL database, etc.?
I just made a test where I performed a Hive query (that stores the result in HDFS) producing an empty result, and I didn't have the problem you are describing. I'm using version 8.0.3.
Could you maybe provide some more information from the logs and/or a screenshot of your flow?
Cheers.
-
Hi,
It is stored in HDFS as well, but the DSS version is 6.0.5.
If I do
select * from table
it works without any problem when there are records, but if the query result is empty, for example with the query:
select * from table where 1=2
Then I will get:
Input dataset is not ready (no files found)
Input dataset projectkey_DATASET is not ready, caused by: CodedIOException: in running compute_DATASET: No files found in dataset
To avoid this problem, I tried the following in Python:
import dataiku
import pandas as pd, numpy as np

dataset = dataiku.Dataset("dataset")
# read the metric from the same dataset object that is loaded below
rows = int(dataset.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])

# if empty query, we insert an empty row to prevent DSS failure
if rows == 0:
    dataset_df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]],
                              columns=['id_assure', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    dataset_df = dataset.get_dataframe()

But it does not work.
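The fallback logic itself can be isolated and checked with pandas alone, outside DSS. The sketch below is only an illustration of the "placeholder row if empty" idea; the helper name with_placeholder_row is hypothetical, and it does not by itself get around the "Input dataset is not ready" error reported in this thread.

```python
import numpy as np
import pandas as pd

COLUMNS = ["id_assure", "num_dernier_devis", "date_dernier_devis", "canal_dernier_devis"]


def with_placeholder_row(df, columns):
    """Return `df` unchanged, or a single all-NaN row if `df` is None or empty."""
    if df is None or df.empty:
        return pd.DataFrame([[np.nan] * len(columns)], columns=columns)
    return df


# An empty frame with the expected schema stands in for the empty query result.
empty = pd.DataFrame(columns=COLUMNS)
result = with_placeholder_row(empty, COLUMNS)
```

Keeping the column list in one place makes sure the placeholder row and the real data share the same schema.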
-
The aforementioned Python code works perfectly in a notebook but fails with the "Input dataset is not ready" error when executed as a recipe.
-
Ignacio_Toledo
Hi @vic,
As a recipe, I expect it will fail when you execute the statement
dataset = dataiku.Dataset("dataset")
If "dataset" was never "built" at least once, it won't work.
But I still don't have a clear picture of how you are handling this in the flow. When do you insert this code recipe in the flow? Is the dernier_DEVIS dataset shown as "built"?
-
Yes, it is indeed built, but it is empty. As you can see it is dark blue in the flow since it has been successfully built.
-
Ignacio_Toledo
Hmm, I wonder whether this is related to the DSS version after all; I'm out of ideas. I cannot reproduce this behavior in my DSS 8.0.5 instance.
Is there any Dataiker around who could give some hints?
-
Thank you for your help, I guess I will have to wait for the version upgrade.
The workaround I made in the meantime is to create a dummy one-row dataset with 'dummy' as its id.
I then modified the Hive query that generates the desired dataset and added the following:
UNION ALL SELECT * FROM dummy
In my final data query I have a where clause to filter this row out:
WHERE id NOT LIKE 'dummy'
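The workaround can be tried end to end without a cluster. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for Hive, so the syntax and behavior are an assumption for illustration only; the table names devis, clients and dernier_devis and their columns are hypothetical. The filter is applied to the left-hand id so that unmatched rows of the left join are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical source table (no rows match the query) and one-row dummy table.
cur.execute("CREATE TABLE devis (id TEXT, num_dernier_devis TEXT)")
cur.execute("CREATE TABLE dummy (id TEXT, num_dernier_devis TEXT)")
cur.execute("INSERT INTO dummy VALUES ('dummy', NULL)")

# Hypothetical left-hand table for the final join.
cur.execute("CREATE TABLE clients (id TEXT)")
cur.executemany("INSERT INTO clients VALUES (?)", [("a",), ("b",)])

# Step 1: the query returns nothing, but UNION ALL guarantees one row is written.
cur.execute("""
    CREATE TABLE dernier_devis AS
    SELECT * FROM devis WHERE 1=2
    UNION ALL
    SELECT * FROM dummy
""")
intermediate_rows = cur.execute("SELECT COUNT(*) FROM dernier_devis").fetchone()[0]

# Step 2: left join on the never-empty table, then filter the dummy id out.
final_rows = cur.execute("""
    SELECT c.id, d.num_dernier_devis
    FROM clients c
    LEFT JOIN dernier_devis d ON c.id = d.id
    WHERE c.id NOT LIKE 'dummy'
    ORDER BY c.id
""").fetchall()
conn.close()
```

The intermediate table always holds at least the dummy row, so it is never written as an empty dataset, while the final result contains only real ids.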
-
If you replace the visual recipe that creates your dataset with an SQL recipe, then instead of "dataset is not built" you'll get a dataset with no rows but with columns. Then you can work with it, since it is already built.
-
We finally have version 8.0.5 and are still experiencing the same problem.
-
It is a Hive recipe since there is HDFS behind it, so SQL is not possible.
-
What engine did you use? I can't manage to make it work on DSS 8.0.5 with either Hive or PySpark.
There is an option in the dataset's advanced properties: "empty as not ready" (see attachments). When it is not selected, it should work, but it doesn't (see the "dataset not ready" error, also in the attachments).
-
I can reproduce it in 8.0.5; please see the last comment on this page.
-
I'm using the Spark engine on DSS 8.0.3. Hive definitely wouldn't work.
-
Thanks, it is finally working on Spark... but many other problems arise now with this change.