Is there any way to check whether a dataset is ready? I have a dataset that might be empty after a Hive query. That shouldn't be a problem, but since it is (I cannot use it in a left join...), I decided to build another dataset that would contain either the result if it exists or a dummy row if it does not.
All this just to be able to perform the join.
I tried many different approaches but didn't succeed... Any advice on how to get past the "Input dataset is not ready" message?
Here I only read the dataset if it has at least one file (which is how I understand DSS decides whether a dataset is ready, see community.dataiku.com/.../Left-join-of-empty-dataset), but I still get the error message.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
dernier_DEVIS = dataiku.Dataset("DERNIER_DEVIS")
rows = int(dernier_DEVIS.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])
files = int(dernier_DEVIS.get_last_metric_values().get_metric_by_id('basic:COUNT_FILES')['lastValues'][0]['value'])
# if the query returned nothing, we insert an empty row to prevent DSS failure
if files == 0:
    der_DEVIS_df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]], columns=['id', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    # otherwise read the dataset as usual
    dernier_DEVIS_df = dernier_DEVIS.get_dataframe()
    der_DEVIS_df = dernier_DEVIS_df
#der_DEVIS_df
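The fallback logic above can be separated from DSS and checked on its own. This is only a minimal sketch: `with_placeholder_row` is a hypothetical helper (not part of the dataiku API), and the column names are the ones from the post. It does not by itself get around the readiness check.

```python
import numpy as np
import pandas as pd

# Column names taken from the post above.
DEVIS_COLUMNS = ["id", "num_dernier_devis", "date_dernier_devis", "canal_dernier_devis"]


def with_placeholder_row(df, columns=DEVIS_COLUMNS):
    """Return df unchanged if it has rows, otherwise a single all-NaN row.

    Hypothetical helper for illustration only.
    """
    if df is None or df.empty:
        # One row of NaN so that a downstream left join has something to read.
        return pd.DataFrame([[np.nan] * len(columns)], columns=columns)
    return df
```

In a recipe, the result of `get_dataframe()` would be passed through this helper, assuming the read itself succeeds.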
Hi @vic,
Before going into the code you posted, where are you storing the output of the Hive query that could be empty? HDFS, local filesystem, a SQL database, etc?
I just ran a test where I performed a Hive query (storing the result in HDFS) that produced an empty result, and I didn't have the problem you are describing. I'm using version 8.0.3.
Could you maybe provide some more information from the logs and/or a screenshot of your flow?
Cheers.
Hi,
It is stored in HDFS as well but DSS version is 6.0.5.
If I do
select * from table
it will work without a problem if there are records, but if the query result is empty, for example with the query:
select * from table where 1=2
Then I will get:
Input dataset is not ready (no files found)
Input dataset projectkey_DATASET is not ready, caused by: CodedIOException: in running compute_DATASET: No files found in dataset
To avoid this problem I tried the following in Python:
dataset = dataiku.Dataset("dataset")
rows = int(dataset.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues'][0]['value'])
# if the query returned nothing, we insert an empty row to prevent DSS failure
if rows == 0:
    dataset_df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]], columns=['id_assure', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    dataset_df = dataset.get_dataframe()
But it does not work.
The aforementioned Python code works perfectly in a notebook but fails with the "Input dataset is not ready" error when executed as a recipe.
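As a side note on the metric lookup used in that code: the chained indexing `['lastValues'][0]['value']` raises if the metric is missing or has no recorded values. A small defensive sketch follows, assuming only the dict shape visible in the posts above; `metric_as_int` is a hypothetical helper, and whether DSS ever returns a missing or empty metric here is an assumption, not something confirmed in this thread.

```python
def metric_as_int(metric, default=0):
    """Return the latest value of a metric dict as an int, or default.

    Expects the shape used in the posts above:
    {'lastValues': [{'value': ...}, ...]}. Hypothetical helper.
    """
    if not metric:
        return default
    last_values = metric.get("lastValues") or []
    if not last_values:
        return default
    try:
        return int(last_values[0]["value"])
    except (KeyError, TypeError, ValueError):
        # Missing or non-numeric value: fall back to the default.
        return default
```

This only makes the row/file count lookup more tolerant; it does not address the "Input dataset is not ready" error itself.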
Hi @vic,
As a recipe, I expect it will fail when it executes the statement
dataset = dataiku.Dataset("dataset")
If "dataset" has never been built at least once, it won't work.
But I still don't have a clear picture of how you are handling this in the flow. Where do you insert this code recipe in the flow? Is the
dernier_DEVIS
dataset shown as "built"?
Yes, it is indeed built, but it is empty. As you can see it is dark blue in the flow since it has been successfully built.
Mmm, I wonder if this is related to the DSS version in the end. I'm out of ideas: I cannot reproduce this behavior on my DSS 8.0.5 instance.
Any dataiker around that could give some hints?
Thank you for your help, I guess I will have to wait for the version upgrade.
The workaround I made in the meantime is to create a dummy one-row dataset with 'dummy' as id.
I then modified the Hive query that generates the desired dataset and added the following:
UNION ALL SELECT * FROM dummy
In my final data query I have a where clause to filter this row out:
WHERE id NOT LIKE 'dummy'
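The same idea can be illustrated in pandas, purely as a sketch of the logic rather than a replacement for the Hive query. The function names and the sample columns are made up for the example; only the 'dummy' id marker comes from the post.

```python
import pandas as pd


def union_with_dummy(result, dummy):
    """Rough equivalent of: <query> UNION ALL SELECT * FROM dummy."""
    # Skip empty frames so the output is never empty as long as dummy has a row.
    frames = [f for f in (result, dummy) if not f.empty]
    return pd.concat(frames, ignore_index=True)


def drop_dummy(df, id_col="id", marker="dummy"):
    """Rough equivalent of: WHERE id NOT LIKE 'dummy'.

    Unlike SQL, rows with a null id are kept here.
    """
    keep = df[id_col].astype(str) != marker
    return df[keep].reset_index(drop=True)
```

With an empty query result, `union_with_dummy` yields the single dummy row, so the intermediate dataset always has content, and `drop_dummy` removes that row again at the end.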
We finally have version 8.0.5 and are still experiencing the same problem.
I can reproduce it in 8.0.5, please see the last comment on this page.
If you replace the visual recipe that creates your dataset with an SQL recipe, then instead of "dataset is not built" you'll get a dataset with no rows but with columns. You can then work with it, since it is already built.
It is a Hive recipe since there is HDFS behind it; SQL is not possible.
Had the same issue earlier, resolved it by using a different engine. Tested with DSS and Spark.
What engine did you use? I can't manage to make it work on DSS 8.0.5 with either Hive or PySpark.
There is an option in the dataset's advanced properties: "empty as not ready" (see attachments). When it is not selected, it should work, but it doesn't (see the "dataset not ready" error, also in the attachments).
I'm using the Spark engine, on DSS 8.0.3. Hive definitely wouldn't work.
Thanks, it is finally working on Spark... but many other problems arise now with this change.