get_dataframe with columns= parameter fails

NN
NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron
edited July 16 in Using Dataiku

Hi Dataiku Team,
When i try to read only certain columns from a dataset using the get_dataframe() function. I get an error when trying to read date columns.
The only way it works is if i set parse_dates to False.

Can you please check if this is how its supposed to work or am i missing something?

ds=dataiku.Dataset("mydataset")
#--- Below step fails -----
df=ds.get_dataframe(columns=['idcolumn','datecolumn'])
#--- Below step works -----
df=ds.get_dataframe(columns=['idcolumn','datecolumn'],parse_dates=False)
df.head()

Tagged:

Best Answer

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 319 Neuron
    Answer ✓

    Hi @NN
    ,

    I refreshed my memory on the issue. All works fine unless you set the columns argument. Then what is returned in the parse_dates argument will most likely be incorrect. This argument returns indices that indicate which columns are dates. However, when columns is set, the indices are those of the original positions of the date columns, not the indices of the smaller list of columns you specified. Thus those indices will likely either refer to incorrect columns or be larger than than the current list of columns.

    Here's an example:

    Original dataset columns: N1, D1, S1, D2, S2, N2, D3 (where N is a numeric column, S is a string column, and D is date column)

    If you don't specify columns, parse_dates would correctly return 1, 3, 6 (where indices start with 0).

    If you specify columns = D1, S2, N2, D3, parse_dates would incorrectly return 1, 6 instead of the correct 0, 3.

    I ended up writing a function that replaces get_dataframe_schema_st as I had an application where I needed it to work correctly for both this issue and handling int and smallint datatypes. Happy to share that if of interest. Obviously best to fix get_dataframe_schema_st but that won't happen right away (I reported this bug a while back).

    Marlan

Answers

  • MikeG
    MikeG Dataiker, Registered Posts: 15 Dataiker
    edited July 17

    Hi @NN
    ,

    This is curious behavior. Can you send me the following information?

    >When I try to read only certain columns from a dataset using the get_dataframe() function. I get an error when trying to read date columns.

    (1) Can you send me a copy of the error stack you received (i.e. when `parse_dates=True`, the default behavior for `get_dataframe()`)

    (2) Can you send me the output of `df.info(verbose=True)` for the successfully created dataframe:

    df = ds.get_dataframe(columns=['idcolumn', 'datecolumn'], parse_dates=False)
    df.info(verbose=True)

    (3) Can you send me a sample date value from the date column in the dataset of interest?

    (4) Where does the underlying DSS dataset come from? Is it from a csv file, a SQL connection, etc?

    Thank you,
    Mike

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 319 Neuron

    Hi @NN
    and @MikeG
    ,

    FYI, I while back I ran into something similar. It sounds similar enough that it could be the same issue. The get_dataframe_schema_st method has a bug that results in returning the wrong columns being designated as date columns. This issue would get propagated if get_dataframe calls get_dataframe_schema_st which it seems reasonable that it does.

    Marlan

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron
    edited July 17

    HI @MikeG

    I have attached a screenshot of the dataset , the code i ran and the message i get as well.

    (4) Where does the underlying DSS dataset come from? Is it from a csv file, a SQL connection, etc?

    The dataset is an S3 CSV and i tried a editable dataset as well and got the same error.
    Please note i am running on a Notebook

    (3) Can you send me a sample date value from the date column in the dataset of interest?

    list.jpg

    (2) Can you send me the output of `df.info(verbose=True)` for the successfully created dataframe:

    (1) Can you send me a copy of the error stack you received (i.e. when `parse_dates=True`, the default behavior for `get_dataframe()`)

    import dataiku
    import pandas as pd, numpy as np
    ds=dataiku.Dataset("append1_prepared")
    df=ds.get_dataframe(columns=['Int_not_null'])
    df.info(verbose=True)
    df=ds.get_dataframe(columns=['Int_not_null','date_parsed','date_not_null_formatted'],parse_dates=False)
    df.info(verbose=True)
    df=ds.get_dataframe(columns=['Int_not_null','date_parsed'])
    df.info(verbose=True)

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 2 entries, 0 to 1
    Data columns (total 1 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   Int_not_null  2 non-null      int64
    dtypes: int64(1)
    memory usage: 144.0 bytes
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 3 columns):
     #   Column                   Non-Null Count  Dtype  
    ---  ------                   --------------  -----  
     0   Int_not_null             2 non-null      float64
     1   date_parsed              3 non-null      object 
     2   date_not_null_formatted  3 non-null      object 
    dtypes: float64(1), object(2)
    memory usage: 200.0+ bytes
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-7-48220baa034b> in <module>
          6 df=ds.get_dataframe(columns=['Int_not_null','date_parsed','date_not_null_formatted'],parse_dates=False)
          7 df.info(verbose=True)
    ----> 8 df=ds.get_dataframe(columns=['Int_not_null','date_parsed'])
          9 df.info(verbose=True)
    
    /data/dataiku-dss-9.0.4/python/dataiku/core/dataset.py in get_dataframe(self, columns, sampling, sampling_column, limit, ratio, infer_with_pandas, parse_dates, bool_as_str, float_precision, na_values, keep_default_na)
        422                                  float_precision=float_precision,
        423                                  na_values=na_values,
    --> 424                                  keep_default_na=keep_default_na)
        425 
        426     def _stream(self,
    
    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
        674         )
        675 
    --> 676         return _read(filepath_or_buffer, kwds)
        677 
        678     parser_f.__name__ = name
    
    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
        452 
        453     try:
    --> 454         data = parser.read(nrows)
        455     finally:
        456         parser.close()
    
    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
       1131     def read(self, nrows=None):
       1132         nrows = _validate_integer("nrows", nrows)
    -> 1133         ret = self._engine.read(nrows)
       1134 
       1135         # May alter columns / col_dict
    
    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
       2110             data = {k: v for k, (i, v) in zip(names, data)}
       2111 
    -> 2112             names, data = self._do_date_conversions(names, data)
       2113             index, names = self._make_index(data, alldata, names)
       2114 
    
    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _do_date_conversions(self, names, data)
       1851                 self.index_names,
       1852                 names,
    -> 1853                 keep_date_col=self.keep_date_col,
       1854             )
       1855 
    
    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
       3286             if is_scalar(colspec):
       3287                 if isinstance(colspec, int) and colspec not in data_dict:
    -> 3288                     colspec = orig_names[colspec]
       3289                 if _isindex(colspec):
       3290                     continue
    
    IndexError: list index out of range

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron

    Hi @Marlan

    Can you guide me a bit / share and example , on what issue you see with the schema_st method.
    I have been using that in a lot of my plugins and codes.

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron

    Thanks @Marlan

    That makes sense and like you said earlier, it seems the two issues might be due to the same reason.

    I just had this adhoc requirement to pull limited columns .
    All my processes which use the schema_st method run for all columns. So hopefully i wont run into this issue anytime soon.

  • MikeG
    MikeG Dataiker, Registered Posts: 15 Dataiker
    edited July 17

    Hi @NN

    Thank you for sending over the requested details.

    I was able to reproduce the same error in a test environment:

    df=ds.get_dataframe(columns=['date_parsed'])
    df.info(verbose=True)
    
    IndexError Traceback (most recent call last)
    <ipython-input-21-406d388a5b0c> in <module>()
    ----> 1 df=ds.get_dataframe(columns=['date_parsed'])
    2 df.info(verbose=True)
    [...]
    3061 if is_scalar(colspec):
    3062 if isinstance(colspec, int) and colspec not in data_dict:
    -> 3063 colspec = orig_names[colspec]
    3064 if _isindex(colspec):
    3065 continue
    
    IndexError: list index out of range


    I am confirming the issue you're encountering is a known bug involving date columns that has been added to our backlog for fixing in a future release.

    From my repro here's an example of expected output:

    df=ds.get_dataframe(columns=['id','date_not_null_formatted','date_parsed'])
    df.info(verbose=True)
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 3 columns):
    id 1 non-null int64
    date_not_null_formatted 1 non-null object
    date_parsed 1 non-null datetime64[ns]
    dtypes: datetime64[ns](1), int64(1), object(1)
    memory usage: 104.0+ bytes


    One curious thing I noticed is if I change the column order (i.e. switch `date_not_null_formatted` and `date_parsed`) the order of column names will change in the output of df.info, but the associated data type order does not change (i.e. `date_parsed` should be datatype `datetime64[ns]` but it shows up as `object` here)

    df=ds.get_dataframe(columns=['id','date_parsed','date_not_null_formatted'])
    df.info(verbose=True)
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 3 columns):
    id 1 non-null int64
    date_parsed 1 non-null object
    date_not_null_formatted 1 non-null datetime64[ns]
    dtypes: datetime64[ns](1), int64(1), object(1)
    memory usage: 104.0+ bytes

    I believe this ties in to what @Marlan
    mentioned with regards to root cause.

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron

    Thanks @MikeG
    , Appreciate your help looking into this.

    I shall work for now without parsing the dates.

Setup Info
    Tags
      Help me…