Fixing exception "Failed to read dataset stream data"

Parul_ch · May 2022

Hi,

While running a Python code recipe, I'm getting the following exception:

<class 'Exception'>: Failed to read dataset stream data: b'

How can I fix this?

Thanks.

(Topic title edited by moderator to be more descriptive. Original title "Using Dataiku")

Alexandru · May 2022

Hi,

Can you share a snippet of your code?

The error "Failed to read dataset stream data" means the dataset could not be read. The text after b' looks truncated: what is the full error message, and how do you read the dataset in your code?

Thanks,

Parul_ch · May 2022

Hi Alex,

Sharing the code snippet:

dump_time_sync_export_rop = dataiku.Dataset("dump_sync")

dump_time_sync_export_rop.read_partitions = [run]

dump_df = dump_time_sync_export_rop.get_dataframe()  # <-- error raised at this line

Exception: Failed to read dataset stream data: b"Path does not exist in the dataset:

Alexandru · May 2022

Hi,

Looks like you have a partitioned dataset and are running this in a recipe, correct?

This code will not work in a recipe because read_partitions is set automatically by the recipe, based on the partitions you select when running it.

Try removing the read_partitions = [run] line and re-running the recipe.

If you do need to use read_partitions in the actual recipe then please have a look at:

https://community.dataiku.com/t5/Using-Dataiku/Reading-partitions-one-at-a-time-from-Python/td-p/4995

You would need to add ignore_flow:

dump_time_sync_export_rop = dataiku.Dataset("dump_sync", ignore_flow=True)
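
To make the ignore_flow suggestion concrete, here is a minimal sketch that combines it with the read_partitions line from the original snippet. The helper name read_one_partition and its dataset_cls parameter are hypothetical and exist only so the logic can be tried outside DSS with a stub; inside DSS the helper falls back to dataiku.Dataset.

```python
def read_one_partition(dataset_name, partition_id, dataset_cls=None):
    """Read one explicitly chosen partition, ignoring the recipe's flow settings.

    dataset_cls is a hypothetical hook for testing; by default dataiku.Dataset is used.
    """
    if dataset_cls is None:
        # Imported lazily so the helper can be exercised outside DSS with a stub.
        import dataiku
        dataset_cls = dataiku.Dataset

    # ignore_flow=True asks DSS not to apply the partitions selected by the recipe.
    dataset = dataset_cls(dataset_name, ignore_flow=True)
    dataset.read_partitions = [partition_id]
    return dataset.get_dataframe()
```

In the recipe from this thread, the call would look like dump_df = read_one_partition("dump_sync", run), assuming run holds a valid partition identifier for the dataset.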

Parul_ch · May 2022

Hi Alex,

I will try this. However, I want the dataset to be partitioned by RunID. So, before running this recipe (with the read_partitions = [run] line removed), do I need to create a scenario to partition the dataset?

Thanks,

Parul.

Alexandru · May 2022

You can use a scenario, but you can also specify the partition(s) in the recipe's run options.

https://doc.dataiku.com/dss/latest/partitions/identifiers.html

Screenshot 2022-05-26 at 13.14.02.png

Parul_ch · May 2022

Hi Alex,

Now I'm getting this error while running the scenario: "The Python process died (killed - maybe out of memory ?)".

Thanks,

Parul.

Parul_ch · May 2022

Hi Alex,

Can you please look at this section of the code? I'm getting a KeyError:

surface_df5.rename(columns={'ROP':'ROP_5'}, inplace=True)
surface_df5 = pd.merge_asof(surface_df, surface_df5.reset_index(drop=True)[['HDTH', 'ROP_5']], left_on='HDTH_monotonic', right_on='HDTH', direction='forward')#.dropna(subset=['HDTH_y'])

if surface_df['ROP5'].dropna().empty:  # <-- KeyError raised at this line
    surface_df['ROP5'] = surface_df5['ROP5'].values
    surface_df.loc[surface_df['ROP'].isna(), 'ROP5'] = np.nan

Thanks,

Parul.

Alexandru · May 2022

The error indicates that you are using more memory than is available in your cgroup configuration, or that the kernel is killing the process because it uses too much memory. You can try to reduce the memory usage of your script by using chunked reading: https://doc.dataiku.com/dss/latest/python-api/datasets-data.html

Alternatively, increase the memory available to the DSS instance or in the cgroups configuration.
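
As a rough illustration of chunked reading, the sketch below streams a dataset through iter_dataframes instead of loading it all at once with get_dataframe. The helper name process_in_chunks, the handle_chunk callback, and the chunk size are assumptions made for this example; the right chunk size depends on row width and available memory.

```python
def process_in_chunks(dataset, handle_chunk, chunksize=100000):
    """Stream a dataset chunk by chunk so only one chunk is in memory at a time.

    dataset is expected to expose iter_dataframes(chunksize=...), as dataiku.Dataset does.
    handle_chunk is a hypothetical callback that receives each chunk.
    Returns the total number of rows seen.
    """
    total_rows = 0
    for chunk_df in dataset.iter_dataframes(chunksize=chunksize):
        handle_chunk(chunk_df)       # e.g. aggregate, filter, or write out the chunk
        total_rows += len(chunk_df)  # each chunk is released before the next is read
    return total_rows
```

In a recipe this might be called as process_in_chunks(dataiku.Dataset("dump_sync"), my_handler), where my_handler is your own function. This only helps if the handler does not itself accumulate every chunk in memory.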

Alexandru · May 2022

The key error indicates that the column name does not exist in your dataframe.

It looks like you may have mismatched the column names: the rename creates ROP_5, but the failing line looks up ROP5. I suggest you print your df in a notebook before the line that fails and see exactly which column names you have.
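
One way to do that check is sketched below on a small made-up dataframe. The helper missing_columns and the sample values are hypothetical; only the column names come from the snippet in this thread.

```python
import pandas as pd

def missing_columns(df, expected):
    """Return the expected column names that are not present in df."""
    return [name for name in expected if name not in df.columns]

# Made-up sample data that mimics the state after the rename to ROP_5.
surface_df = pd.DataFrame({"HDTH": [1.0, 2.0], "ROP": [10.0, None], "ROP_5": [9.5, 9.8]})

print(list(surface_df.columns))                      # ['HDTH', 'ROP', 'ROP_5']
print(missing_columns(surface_df, ["ROP", "ROP5"]))  # ['ROP5']
```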

Parul_ch · May 2022

Hi Alex,

I figured out that ROP5 is not in the dataset. Can I set ROP5 = ROP instead, since the ROP channel is in my dataset?

Or can I add a condition to skip it when it is not there?

Thanks,
Parul.

Alexandru · May 2022

If you want to skip it when the column is not available, you can use a try/except block in your code.
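
A minimal sketch of that try/except approach follows. The helper name copy_column_if_present is hypothetical, and it assumes both dataframes have the same number of rows, as the original assignment via .values does.

```python
import pandas as pd

def copy_column_if_present(target_df, source_df, column="ROP5"):
    """Copy column from source_df into target_df, skipping it when absent.

    Returns True if the column was copied, False if it was skipped.
    """
    try:
        values = source_df[column].values
    except KeyError:
        # The column is not in the source data: skip and leave target_df unchanged.
        return False
    target_df[column] = values
    return True
```

Catching only KeyError keeps other problems, such as a length mismatch, visible instead of silently skipping them.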
