Fixing exception "Failed to read dataset stream data"
Hi,
While running a Python code recipe, I'm getting this exception:
<class 'Exception'>: Failed to read dataset stream data: b'
How can I fix this?
Thanks.
(Topic title edited by moderator to be more descriptive. Original title "Using Dataiku")
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi,
Can you share a snippet of your code?
The error "Failed to read dataset stream data" means it could not load "b". How did you define b?
Thanks,
-
Hi Alex,
Sharing the code snippet:
dump_time_sync_export_rop = dataiku.Dataset("dump_sync")
dump_time_sync_export_rop.read_partitions = [run]
dump_df = dump_time_sync_export_rop.get_dataframe()  # error raised on this line
Exception: Failed to read dataset stream data: b"Path does not exist in the dataset:
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi,
It looks like you have a partitioned dataset and are running this in a recipe, correct?
This code will not work in a recipe because read_partitions is filled in automatically by the recipe, based on the partitions you select when you run it.
Try removing the line (read_partitions = [run]) and re-running the recipe.
If you do need to set read_partitions yourself inside the recipe, you would need to add ignore_flow:
dump_time_sync_export_rop = dataiku.Dataset("dump_sync", ignore_flow=True)
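For reference, here is a minimal sketch of how the ignore_flow approach might look when wrapped in a helper. The function name read_single_partition and its arguments are hypothetical, and the partition identifier must match whatever partitioning scheme the dataset actually uses. It assumes the code runs inside DSS, where the dataiku package is available.

```python
def read_single_partition(dataset_name, partition_id):
    """Read one explicitly chosen partition of a DSS dataset into a DataFrame.

    Hypothetical helper for illustration; not part of the dataiku API.
    """
    # Imported here because the dataiku package only exists inside DSS.
    import dataiku

    # With ignore_flow=True the dataset is not treated as an input of the
    # running recipe, so the partition to read can be set explicitly.
    dataset = dataiku.Dataset(dataset_name, ignore_flow=True)
    dataset.add_read_partitions(partition_id)
    return dataset.get_dataframe()
```

Inside the recipe this could then be called as read_single_partition("dump_sync", run), where run is the partition identifier from the original snippet.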
-
Hi Alex,
I will try this. However, I want the dataset to be partitioned by RunID. So, before running this recipe (with the read_partitions = [run] line removed), do I need to create a scenario to partition the dataset?
Thanks,
Parul.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
You can use a scenario, but you can also specify the partition(s) in the recipe's run options.
https://doc.dataiku.com/dss/latest/partitions/identifiers.html
-
Hi Alex,
Now I'm getting this error while running the scenario: The Python process died (killed - maybe out of memory ?)
Thanks,
Parul.
-
Hi Alex,
Can you please take a look at this section of the code? I'm getting a key error:
surface_df5.rename(columns={'ROP':'ROP_5'}, inplace=True)
surface_df5 = pd.merge_asof(surface_df, surface_df5.reset_index(drop=True)[['HDTH', 'ROP_5']], left_on='HDTH_monotonic', right_on='HDTH', direction='forward')#.dropna(subset=['HDTH_y'])
if surface_df['ROP5'].dropna().empty:  # key error raised on this line
    surface_df['ROP5'] = surface_df5['ROP5'].values
    surface_df.loc[surface_df['ROP'].isna(), 'ROP5'] = np.nan
Thanks,
Parul.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
The error indicates that you are using more memory than is available in your cgroup configuration, or that the kernel is killing the process because it is using too much memory. You can try to reduce the memory usage of the script by using chunked reading: https://doc.dataiku.com/dss/latest/python-api/datasets-data.html
Or increase the memory available on the DSS instance or in the cgroups configuration.
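As a rough illustration of chunked reading, the sketch below processes a dataset one chunk at a time instead of loading it all with get_dataframe(). The helper summarize_in_chunks and the demo data are hypothetical; the idea is only that each chunk is reduced to a small result before the next one is read.

```python
import pandas as pd


def summarize_in_chunks(chunks, column):
    """Return (non-null count, mean) of `column` over an iterable of DataFrames."""
    total = 0.0
    count = 0
    for chunk in chunks:
        values = chunk[column].dropna()
        total += float(values.sum())
        count += len(values)
    mean = total / count if count else float("nan")
    return count, mean


# Stand-in for real data: two small chunks, one containing a missing value.
demo_chunks = [
    pd.DataFrame({"ROP": [10.0, 20.0]}),
    pd.DataFrame({"ROP": [30.0, None]}),
]
demo_count, demo_mean = summarize_in_chunks(demo_chunks, "ROP")
```

Inside DSS, the chunks could come from dataiku.Dataset("dump_sync").iter_dataframes(chunksize=100000), with the chunk size chosen to fit the memory available to the recipe.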
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
The key error indicates that the column name does not exist in your dataframe.
It looks like the column names are mismatched: the code renames ROP to ROP_5 but then reads ROP5. I suggest you print your df in a notebook just before the line that fails and check exactly which column names you have.
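To make the check concrete, here is a small sketch with made-up values that mirrors the rename and merge_asof steps from the snippet above and then lists the resulting columns. The data is invented for illustration; only the column names come from the thread.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the real dataframes.
surface_df = pd.DataFrame(
    {"HDTH_monotonic": [1.0, 2.0, 3.0], "ROP": [10.0, np.nan, 30.0]}
)
surface_df5 = pd.DataFrame({"HDTH": [1.5, 2.5, 3.5], "ROP": [11.0, 21.0, 31.0]})

surface_df5 = surface_df5.rename(columns={"ROP": "ROP_5"})
merged = pd.merge_asof(
    surface_df,
    surface_df5[["HDTH", "ROP_5"]],
    left_on="HDTH_monotonic",
    right_on="HDTH",
    direction="forward",
)

# Inspect the column names before indexing: the merged frame has 'ROP_5',
# so merged['ROP5'] (no underscore) would raise a KeyError.
print(list(merged.columns))
print("ROP5" in merged.columns, "ROP_5" in merged.columns)
```

Note also that the original code looks up the column on surface_df, which never receives the merged column at all; the merge result is assigned to surface_df5.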
-
Hi Alex,
I figured out that ROP5 is not in the dataset. Can I set ROP5 = ROP, since the ROP channel is in my dataset?
Or can I add a condition to skip it when it is not there?
Thanks,
Parul.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
If you want to skip the step when the column is not available, you can use a try/except block in your code.
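A minimal sketch of that idea, assuming the goal is to copy one column into another and carry on when the source column is missing. The helper name copy_column_if_present is hypothetical.

```python
import pandas as pd


def copy_column_if_present(df, source, target):
    """Copy df[source] into df[target]; return False and skip if source is missing."""
    try:
        df[target] = df[source]
    except KeyError:
        # Selecting a missing column raises KeyError; report it and move on.
        print(f"Column {source!r} not found; skipping {target!r}")
        return False
    return True


# Made-up example frame with a ROP column but no ROP5 column.
example_df = pd.DataFrame({"ROP": [10.0, 20.0]})
copied = copy_column_if_present(example_df, "ROP", "ROP5")
skipped = copy_column_if_present(example_df, "MISSING", "OTHER")
```

An explicit test such as `if "ROP" in df.columns:` would work just as well and avoids catching a KeyError raised for some other reason.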