Fixing exception "Failed to read dataset stream data"

Parul_ch
Parul_ch Partner, Registered Posts: 34 Partner

Hi,

While running a Python code recipe, I'm getting this exception:

<class 'Exception'>: Failed to read dataset stream data: b'

How can I fix this?

Thanks.

(Topic title edited by moderator to be more descriptive. Original title "Using Dataiku")


Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker

    Hi,

    Can you share a snippet of your code?

    The error "Failed to read dataset stream data" means it could not load "b". How did you define b?

    Thanks,

  • Parul_ch
    Parul_ch Partner, Registered Posts: 34 Partner
    edited July 17

    Hi Alex,

    Sharing the code snippet:

    dump_time_sync_export_rop = dataiku.Dataset("dump_sync")

    dump_time_sync_export_rop.read_partitions = [run]

    dump_df = dump_time_sync_export_rop.get_dataframe()  # GETTING ERROR AT THIS LINE

    Exception: Failed to read dataset stream data: b"Path does not exist in the dataset: 

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
    edited July 17

    Hi,

    It looks like you have a partitioned dataset and are running this in a recipe, correct?

    That code would not work in a recipe, because read_partitions is filled in automatically from the partitions you select when running the recipe.

    Try removing the line read_partitions = [run] and re-run the recipe.

    If you do need to use read_partitions in the actual recipe then please have a look at:

    https://community.dataiku.com/t5/Using-Dataiku/Reading-partitions-one-at-a-time-from-Python/td-p/4995

    You would need to add ignore_flow:

    dump_time_sync_export_rop = dataiku.Dataset("dump_sync",ignore_flow=True )  
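    For reference, the pattern from the linked thread can be sketched as a small helper. This is only an illustration: the helper name is made up, and it works with any object that exposes `read_partitions` and `get_dataframe()`, as dataiku.Dataset does when constructed with ignore_flow=True.

    ```python
    # Sketch of reading one partition at a time, per the linked thread.
    # In DSS you would construct the dataset with ignore_flow=True:
    #   import dataiku
    #   ds = dataiku.Dataset("dump_sync", ignore_flow=True)

    def read_one_partition(dataset, partition_id):
        """Read a single partition from a dataset-like object that exposes
        `read_partitions` and `get_dataframe()`, as dataiku.Dataset does."""
        dataset.read_partitions = [partition_id]
        return dataset.get_dataframe()
    ```

    You could then loop this helper over a list of partition identifiers to process partitions one at a time instead of loading everything at once.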

  • Parul_ch
    Parul_ch Partner, Registered Posts: 34 Partner

    Hi Alex,

    Will try this, however, I want the dataset to be partitioned with RunID. So, before running this recipe (with read_partitions = [run] line removed), do I need to create a scenario for partitioning the dataset?

    Thanks,

    Parul.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker

    You can use a scenario, but you can also specify the partition(s) in the recipe's run options.

    https://doc.dataiku.com/dss/latest/partitions/identifiers.html

    [Screenshot: the recipe run options dialog, where the target partition(s) can be specified]

  • Parul_ch
    Parul_ch Partner, Registered Posts: 34 Partner

    Hi Alex,

    Now I'm getting this error: The Python process died (killed - maybe out of memory ?), while running the scenario.

    Thanks,

    Parul.

  • Parul_ch
    Parul_ch Partner, Registered Posts: 34 Partner

    Hi Alex,

    Can you please revisit this section of the code:

    I'm getting a key error:

    surface_df5.rename(columns={'ROP':'ROP_5'}, inplace=True)
    surface_df5 = pd.merge_asof(surface_df, surface_df5.reset_index(drop=True)[['HDTH', 'ROP_5']], left_on='HDTH_monotonic', right_on='HDTH', direction='forward')#.dropna(subset=['HDTH_y'])

    if surface_df['ROP5'].dropna().empty:  # KEY ERROR AT THIS LINE
        surface_df['ROP5'] = surface_df5['ROP5'].values
        surface_df.loc[surface_df['ROP'].isna(), 'ROP5'] = np.nan

    Thanks,

    Parul.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker

    The error indicates you are using more memory than is available in your cgroup configuration, or the kernel is killing the process because it is using too much memory. You can try to reduce the memory usage of the script by using chunked reading: https://doc.dataiku.com/dss/latest/python-api/datasets-data.html

    Or increase the memory available on the DSS instance or in the cgroups.
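    The chunked-reading idea from the linked docs is to process the data a slice at a time instead of materializing the whole dataframe. In DSS the equivalent call is `dataset.iter_dataframes(chunksize=...)`; here is the same pattern demonstrated with pandas' chunked CSV reader, using made-up data, so the memory footprint stays bounded:

    ```python
    import io
    import pandas as pd

    # Illustrative in-memory "dataset"; in DSS you would instead iterate
    # with: for chunk in dataset.iter_dataframes(chunksize=100000): ...
    csv_data = io.StringIO("value\n1\n2\n3\n4\n5\n")

    total = 0
    for chunk in pd.read_csv(csv_data, chunksize=2):  # 2 rows per chunk
        # Only one chunk is in memory at a time; aggregate as you go.
        total += chunk["value"].sum()

    print(total)  # 15
    ```

    The key point is that each chunk is a regular dataframe, so any per-row transformation can run inside the loop without holding the full dataset in memory.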

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker

    The KeyError indicates that the column name does not exist in your dataframe.

    It looks like you may have mismatched column names (ROP_5 vs. ROP5)? I suggest you print your df in a notebook just before the line that fails and check exactly which column names you have.
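    To make the mismatch concrete, here is a toy frame (the data is invented) showing why the lookup fails: the rename produces a column named ROP_5, while the later code asks for ROP5:

    ```python
    import pandas as pd

    # After rename(columns={'ROP': 'ROP_5'}) the frame holds 'ROP_5',
    # but the failing line looks up 'ROP5' (no underscore).
    df = pd.DataFrame({"HDTH": [1.0, 2.0], "ROP_5": [10.0, 20.0]})

    print(list(df.columns))      # ['HDTH', 'ROP_5']
    print("ROP5" in df.columns)  # False -> df['ROP5'] raises KeyError
    ```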

  • Parul_ch
    Parul_ch Partner, Registered Posts: 34 Partner

    Hi Alex,

    I figured out that ROP5 is not there in the dataset. Can I set ROP5 = ROP instead, since the ROP channel is in my dataset?

    Or add a condition so that if it is not there, I skip that step?

    Thanks,
    Parul.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker

    If you want to skip the step when the column is not available, you can wrap it in a try/except block in your code.
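    A minimal sketch of that suggestion, using an invented frame that, like the real dataset, has ROP but no ROP5 column, so the step is simply skipped:

    ```python
    import numpy as np
    import pandas as pd

    surface_df = pd.DataFrame({"ROP": [1.0, np.nan, 3.0]})  # no 'ROP5' column

    try:
        # This lookup raises KeyError when 'ROP5' is absent.
        if surface_df["ROP5"].dropna().empty:
            surface_df["ROP5"] = surface_df["ROP"]
    except KeyError:
        # 'ROP5' is not in this dataset: skip the step entirely.
        pass

    print("ROP5" in surface_df.columns)  # False
    ```

    An alternative with the same effect is to test membership up front with `if "ROP5" in surface_df.columns:` instead of catching the exception.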
