Fixing exception "Failed to read dataset stream data"

Parul_ch
Level 3
Hi,

While running a Python code recipe, I'm getting the following exception:

<class 'Exception'>: Failed to read dataset stream data: b'

How can I fix this?

Thanks. 

(Topic title edited by moderator to be more descriptive. Original title "Using Dataiku")

AlexT
Dataiker

Hi,

Can you share a snippet of your code?

The error "Failed to read dataset stream data" means it could not load "b". How did you define b?

Thanks,

Parul_ch
Level 3
Author

Hi Alex,

 

Sharing the code snippet:

dump_time_sync_export_rop = dataiku.Dataset("dump_sync")
dump_time_sync_export_rop.read_partitions = [run]
dump_df = dump_time_sync_export_rop.get_dataframe()  # GETTING ERROR AT THIS LINE

 

Exception: Failed to read dataset stream data: b"Path does not exist in the dataset: 

 

AlexT
Dataiker

Hi,

It looks like you have a partitioned dataset and are running this in a recipe, correct?

The code will not work in a recipe because read_partitions is filled in automatically from the partitions you select when running the recipe.

Try removing the line that sets read_partitions = [run] and re-run the recipe.

If you do need to use read_partitions in the actual recipe then please have a look at:

https://community.dataiku.com/t5/Using-Dataiku/Reading-partitions-one-at-a-time-from-Python/td-p/499...

You would need to add ignore_flow:

dump_time_sync_export_rop = dataiku.Dataset("dump_sync", ignore_flow=True)
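A minimal sketch of that pattern wrapped in a helper, assuming it runs somewhere the dataiku package is importable. The names read_single_partition and dataset_cls are illustrative and not part of the Dataiku API; dataset_cls exists only so the logic can be tried outside DSS with a stand-in class.

```python
def read_single_partition(dataset_name, partition_id, dataset_cls=None):
    """Read one explicitly chosen partition of a dataset into a DataFrame."""
    if dataset_cls is None:
        # Imported lazily so the helper can be exercised outside DSS
        # by passing a stand-in class as dataset_cls (an assumption of this sketch).
        import dataiku
        dataset_cls = dataiku.Dataset

    # ignore_flow=True asks DSS to treat the dataset as outside the Flow,
    # so the recipe's own partition selection is not applied to it.
    dataset = dataset_cls(dataset_name, ignore_flow=True)
    # Same attribute assignment as in the snippet posted earlier in this thread.
    dataset.read_partitions = [partition_id]
    return dataset.get_dataframe()
```

Inside a recipe this would be called as read_single_partition("dump_sync", run), where run is the partition identifier from the original snippet.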

 

 

Parul_ch
Level 3
Author

Hi Alex,

I will try this. However, I want the dataset to be partitioned by RunID. So, before running this recipe (with the read_partitions = [run] line removed), do I need to create a scenario to partition the dataset?

Thanks,

Parul.

AlexT
Dataiker

You can use a scenario, but you can also specify the partition(s) in the recipe run options.

https://doc.dataiku.com/dss/latest/partitions/identifiers.html

Screenshot 2022-05-26 at 13.14.02.png

Parul_ch
Level 3
Author

Hi Alex,

Now I'm getting this error while running the scenario: The Python process died (killed - maybe out of memory ?).

 

Thanks,

Parul.

 

Parul_ch
Level 3
Author

Hi Alex,

Can you please take a look at this section of the code? I'm getting a KeyError:

surface_df5.rename(columns={'ROP':'ROP_5'}, inplace=True)
surface_df5 = pd.merge_asof(surface_df, surface_df5.reset_index(drop=True)[['HDTH', 'ROP_5']], left_on='HDTH_monotonic', right_on='HDTH', direction='forward')#.dropna(subset=['HDTH_y'])

if surface_df['ROP5'].dropna().empty:  # KeyError AT THIS LINE
surface_df['ROP5'] = surface_df5['ROP5'].values
surface_df.loc[surface_df['ROP'].isna(), 'ROP5'] = np.nan

Thanks,

Parul.

AlexT
Dataiker

The KeyError indicates that the column name does not exist in your DataFrame.

It looks like the column names may be mismatched: the rename creates ROP_5, but the failing line reads ROP5. I suggest you print your DataFrame in a notebook just before the line that fails and check exactly which column names it has.
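As a rough illustration of that check, the sketch below lists the columns of a toy DataFrame and reports which expected names are missing. The helper missing_columns and the toy data are made up for this example; only the column names come from the snippet above.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the names in `required` that are not columns of `df`."""
    return [name for name in required if name not in df.columns]


# Toy frame standing in for surface_df5 after the rename to 'ROP_5'.
surface_df5 = pd.DataFrame({"HDTH": [1.0, 2.0], "ROP_5": [10.0, 12.0]})

print(list(surface_df5.columns))                        # ['HDTH', 'ROP_5']
print(missing_columns(surface_df5, ["HDTH", "ROP5"]))   # ['ROP5']
```

Running the same two prints on the real DataFrame just before the failing line should show whether the column is spelled ROP5, ROP_5, or is absent altogether.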

Parul_ch
Level 3
Author

Hi Alex,

I figured out that ROP5 is not in the dataset. In that case, can I set ROP5 = ROP, since the ROP channel is in my dataset?

Or can I add a condition so that the step is skipped if the column is not there?

Thanks,
Parul.

AlexT
Dataiker

If you want to skip this step when the column is not available, you can use try/except in your code.
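One possible shape for this, as a sketch only: the function name fill_rop5 is hypothetical, and the DataFrame names simply mirror the snippet posted earlier. It copies ROP5 across when the column exists and skips the step otherwise.

```python
import pandas as pd


def fill_rop5(surface_df, surface_df5):
    """Copy 'ROP5' from surface_df5 into surface_df; skip if the column is absent.

    Returns True when the column was copied, False when the step was skipped.
    Assumes both frames have the same number of rows.
    """
    try:
        values = surface_df5["ROP5"].values
    except KeyError:
        # Column not present in this dataset: leave surface_df untouched.
        return False
    surface_df["ROP5"] = values
    return True
```

An explicit check such as `if "ROP5" in surface_df5.columns:` would work equally well and avoids catching a KeyError raised for some unrelated reason.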

AlexT
Dataiker

The error indicates that you are using more memory than is available in your cgroup configuration, or that the kernel is killing the process because it is using too much memory. You can try to reduce the memory usage of the script by using chunked reading: https://doc.dataiku.com/dss/latest/python-api/datasets-data.html

Alternatively, increase the memory available on the DSS instance or in the cgroups configuration.
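A minimal sketch of chunked reading, assuming `dataset` is a dataiku.Dataset (or any object exposing the same iter_dataframes(chunksize=...) method). The function name and the row-count example are illustrative; the point is that only one chunk is held in memory at a time.

```python
def count_rows_in_chunks(dataset, chunksize=10000):
    """Count the rows of a dataset without loading it all into memory at once."""
    total = 0
    # iter_dataframes yields the data piece by piece, each piece holding
    # at most `chunksize` rows, instead of one large DataFrame.
    for chunk in dataset.iter_dataframes(chunksize=chunksize):
        total += len(chunk)
    return total
```

In a recipe this might be called as count_rows_in_chunks(dataiku.Dataset("dump_sync")). Any per-chunk processing (filtering, aggregating, writing out) would replace the row count inside the loop.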