Is there any way to check the readiness of a dataset? I have a dataset that might be empty after a Hive query. That shouldn't be a problem, but since it is (I cannot use it in a left join...), I decided to build another dataset that contains either the result, if it exists, or a dummy line if it does not.
All this just to be able to perform the join.
I tried many different ways but didn't succeed... Any advice on how to get past the "Input dataset is not ready" message?
Here I only read the dataset if there is at least one file (which is how I understand DSS decides whether a dataset is ready, see community.dataiku.com/.../Left-join-of-empty-dataset), but I still get the error message.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
dernier_DEVIS = dataiku.Dataset("DERNIER_DEVIS")
rows = int(dernier_DEVIS.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues']['value'])
files = int(dernier_DEVIS.get_last_metric_values().get_metric_by_id('basic:COUNT_FILES')['lastValues']['value'])

# If the query returned nothing, insert an empty row to prevent DSS failure
if files == 0:
    der_DEVIS_df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]],
                                columns=['id', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    dernier_DEVIS_df = dernier_DEVIS.get_dataframe()
    der_DEVIS_df = dernier_DEVIS_df
#der_DEVIS_df
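Separated from the DSS calls, the fallback idea in the recipe above can be sketched with plain pandas. This is only an illustration of the padding step: the helper name `with_placeholder_row` and the `COLUMNS` constant are made up for this example, and the sketch does not by itself get around the "Input dataset is not ready" check.

```python
import numpy as np
import pandas as pd

# Column names taken from the recipe above; the constant name is hypothetical.
COLUMNS = ['id', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis']


def with_placeholder_row(df, columns=COLUMNS):
    """Return df unchanged if it has rows, else a one-row frame filled with NaN."""
    if df is None or df.empty:
        return pd.DataFrame([[np.nan] * len(columns)], columns=columns)
    return df


# An empty result (columns but no rows) gets padded with a single NaN row.
empty = pd.DataFrame(columns=COLUMNS)
padded = with_placeholder_row(empty)

# A non-empty result is passed through untouched.
filled = pd.DataFrame([[1, 'D1', '2020-01-01', 'web']], columns=COLUMNS)
same = with_placeholder_row(filled)
```

Keeping the padding in a small function makes it easy to check outside DSS before wiring it into a recipe.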
Before going into the code you posted, where are you storing the output of the Hive query that could be empty? HDFS, local filesystem, a SQL database, etc?
I just ran a test in which a Hive query (storing its result in HDFS) produced an empty result, and I didn't hit the problem you are describing. I'm using version 8.0.3.
Could you maybe provide some more information from the logs and/or a screenshot of your flow?
It is stored in HDFS as well, but the DSS version is 6.0.5.
If I do
select * from table
it works without a problem as long as there are records, but if the query returns no rows, for example with:
select * from table where 1=2
Then I will get:
Input dataset is not ready (no files found) Input dataset projectkey_DATASET is not ready, caused by: CodedIOException: in running compute_DATASET: No files found in dataset
To avoid this problem, I tried the following in Python:
dataset = dataiku.Dataset("dataset")
rows = int(dataset.get_last_metric_values().get_metric_by_id('basic:SIZE')['lastValues']['value'])

# If the query returned nothing, insert an empty row to prevent DSS failure
if rows == 0:
    dataset_df = pd.DataFrame([[np.nan, np.nan, np.nan, np.nan]],
                              columns=['id_assure', 'num_dernier_devis', 'date_dernier_devis', 'canal_dernier_devis'])
else:
    dataset_df = dataset.get_dataframe()
But it does not work.
In a recipe, I expect it to fail as soon as you execute the statement
dataset = dataiku.Dataset("dataset")
If "dataset" has never been "built" at least once, it won't work.
But I still don't have a clear picture of how you are handling this in the flow. Where do you insert this code recipe in the flow? Is the dataset shown as "built"?
Hmm, I wonder whether this is ultimately related to the DSS version; I'm out of ideas. I cannot reproduce this behavior in my DSS 8.0.5 instance.
Is there any Dataiker around who could give some hints?
Thank you for your help, I guess I will have to wait for the version upgrade.
The workaround I used in the meantime is to create a dummy one-row dataset with 'dummy' as its id.
I then modified the Hive query that generates the desired dataset and added the following:
UNION ALL SELECT * FROM dummy
In my final data query I have a where clause to filter this row out:
WHERE id NOT LIKE 'dummy'
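The workaround above can be tried end to end outside DSS. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, so dialect details may differ; the table names `assures`, `devis`, `dummy` and `dernier_devis` and the sample ids are hypothetical and only serve the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assures (id TEXT);                       -- left side of the join
    CREATE TABLE devis (id TEXT, num_dernier_devis TEXT); -- source of the query
    CREATE TABLE dummy (id TEXT, num_dernier_devis TEXT); -- one-row dummy dataset
    INSERT INTO assures VALUES ('a1'), ('a2');
    INSERT INTO dummy VALUES ('dummy', NULL);
""")

# Step 1: the query matches nothing, but UNION ALL with the dummy table
# guarantees that the resulting dataset always holds at least one row.
conn.execute("""
    CREATE TABLE dernier_devis AS
    SELECT * FROM devis WHERE 1 = 2
    UNION ALL
    SELECT * FROM dummy
""")
padded = conn.execute("SELECT * FROM dernier_devis").fetchall()

# Step 2: the final query filters the dummy row out before the left join.
joined = conn.execute("""
    SELECT a.id, d.num_dernier_devis
    FROM assures a
    LEFT JOIN (SELECT * FROM dernier_devis WHERE id NOT LIKE 'dummy') d
      ON a.id = d.id
    ORDER BY a.id
""").fetchall()

print(padded)
print(joined)
```

In this sketch the filter sits in a subquery on the padded table. Putting it in the outer WHERE clause on the right-hand table's id would also drop the unmatched rows of the left join, because a NULL id does not satisfy NOT LIKE.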
If you replace the visual recipe that creates your dataset with an SQL recipe, then instead of 'dataset is not built' you'll get a dataset with no rows but with columns. You can then work with it, since it is already built.