get_dataframe with columns= parameter fails

NN
NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 143 Neuron

Hi Dataiku Team,
When I try to read only certain columns from a dataset using the get_dataframe() function, I get an error if any of the requested columns are date columns.
The only way it works is if I set parse_dates to False.

Can you please check whether this is how it's supposed to work, or am I missing something?

import dataiku

ds = dataiku.Dataset("mydataset")

# --- Below step fails ---
df = ds.get_dataframe(columns=['idcolumn', 'datecolumn'])

# --- Below step works ---
df = ds.get_dataframe(columns=['idcolumn', 'datecolumn'], parse_dates=False)
df.head()


Best Answer

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 316 Neuron
    Answer ✓

    Hi @NN,

    I refreshed my memory on the issue. Everything works fine unless you set the columns argument. When you do, the date-column information returned for parse_dates will most likely be incorrect. That value is a list of indices indicating which columns are dates. However, when columns is set, the indices are the positions of the date columns in the original dataset schema, not their positions in the smaller list of columns you specified. So those indices will likely either point at the wrong columns or be larger than the length of the requested column list.

    Here's an example:

    Original dataset columns: N1, D1, S1, D2, S2, N2, D3 (where N is a numeric column, S is a string column, and D is date column)

    If you don't specify columns, parse_dates would correctly return 1, 3, 6 (where indices start with 0).

    If you specify columns = D1, S2, N2, D3, parse_dates would incorrectly return 1, 6 instead of the correct 0, 3.
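
    To make this concrete, here is a minimal sketch (hypothetical, using the example column names above) of how the original-schema indices would need to be remapped to positions in the requested column list:

    # Original schema and the subset requested via columns=
    original_columns = ['N1', 'D1', 'S1', 'D2', 'S2', 'N2', 'D3']
    requested_columns = ['D1', 'S2', 'N2', 'D3']

    # Indices of the date columns as positions in the ORIGINAL schema (the buggy value)
    date_indices_in_original = [1, 3, 6]

    # Remap to positions in the requested column list, dropping dates that were not requested
    date_indices_in_requested = [
        requested_columns.index(original_columns[i])
        for i in date_indices_in_original
        if original_columns[i] in requested_columns
    ]
    print(date_indices_in_requested)  # [0, 3]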

    I ended up writing a function that replaces get_dataframe_schema_st, as I had an application where I needed it to work correctly both for this issue and for handling int and smallint datatypes. Happy to share that if it's of interest. Obviously it would be best to fix get_dataframe_schema_st itself, but that won't happen right away (I reported this bug a while back).
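
    In the meantime, a simple workaround (a sketch only, not the replacement function mentioned above; the dataset and column names are the ones from the question) is to read with parse_dates=False and parse the known date columns yourself:

    import dataiku
    import pandas as pd

    ds = dataiku.Dataset("mydataset")

    # Read without date parsing to avoid the index bug, then parse dates manually.
    # 'datecolumn' stands in for whichever columns you know to be dates.
    df = ds.get_dataframe(columns=['idcolumn', 'datecolumn'], parse_dates=False)
    df['datecolumn'] = pd.to_datetime(df['datecolumn'])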

    Marlan

Answers

  • MikeG
    MikeG Dataiker, Registered Posts: 15 Dataiker

    Hi @NN,

    This is curious behavior. Can you send me the following information?

    >When I try to read only certain columns from a dataset using the get_dataframe() function. I get an error when trying to read date columns.

    (1) Can you send me a copy of the error stack you received (i.e. when `parse_dates=True`, the default behavior for `get_dataframe()`)

    (2) Can you send me the output of `df.info(verbose=True)` for the successfully created dataframe:

    df = ds.get_dataframe(columns=['idcolumn', 'datecolumn'], parse_dates=False)
    df.info(verbose=True)

    (3) Can you send me a sample date value from the date column in the dataset of interest?

    (4) Where does the underlying DSS dataset come from? Is it from a csv file, a SQL connection, etc?

    Thank you,
    Mike

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 316 Neuron

    Hi @NN and @MikeG,

    FYI, a while back I ran into something similar. It sounds similar enough that it could be the same issue. The get_dataframe_schema_st method has a bug that results in the wrong columns being designated as date columns. This issue would propagate if get_dataframe calls get_dataframe_schema_st, which it seems reasonable to assume it does.

    Marlan

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 143 Neuron

    Hi @MikeG,

    I have attached a screenshot of the dataset, the code I ran, and the message I get.

    (4) Where does the underlying DSS dataset come from? Is it from a csv file, a SQL connection, etc?

    The dataset is an S3 CSV; I tried an editable dataset as well and got the same error.
    Please note I am running this in a notebook.

    (3) Can you send me a sample date value from the date column in the dataset of interest?

    [Attached screenshot: list.jpg]

    (2) Can you send me the output of `df.info(verbose=True)` for the successfully created dataframe:

    (1) Can you send me a copy of the error stack you received (i.e. when `parse_dates=True`, the default behavior for `get_dataframe()`)

    import dataiku
    import pandas as pd, numpy as np

    ds = dataiku.Dataset("append1_prepared")

    df = ds.get_dataframe(columns=['Int_not_null'])
    df.info(verbose=True)

    df = ds.get_dataframe(columns=['Int_not_null', 'date_parsed', 'date_not_null_formatted'], parse_dates=False)
    df.info(verbose=True)

    df = ds.get_dataframe(columns=['Int_not_null', 'date_parsed'])
    df.info(verbose=True)

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 2 entries, 0 to 1
    Data columns (total 1 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   Int_not_null  2 non-null      int64
    dtypes: int64(1)
    memory usage: 144.0 bytes

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 3 columns):
     #   Column                   Non-Null Count  Dtype
    ---  ------                   --------------  -----
     0   Int_not_null             2 non-null      float64
     1   date_parsed              3 non-null      object
     2   date_not_null_formatted  3 non-null      object
    dtypes: float64(1), object(2)
    memory usage: 200.0+ bytes

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-7-48220baa034b> in <module>
          6 df=ds.get_dataframe(columns=['Int_not_null','date_parsed','date_not_null_formatted'],parse_dates=False)
          7 df.info(verbose=True)
    ----> 8 df=ds.get_dataframe(columns=['Int_not_null','date_parsed'])
          9 df.info(verbose=True)

    /data/dataiku-dss-9.0.4/python/dataiku/core/dataset.py in get_dataframe(self, columns, sampling, sampling_column, limit, ratio, infer_with_pandas, parse_dates, bool_as_str, float_precision, na_values, keep_default_na)
        422                     float_precision=float_precision,
        423                     na_values=na_values,
    --> 424                     keep_default_na=keep_default_na)
        425
        426     def _stream(self,

    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
        674     )
        675
    --> 676     return _read(filepath_or_buffer, kwds)
        677
        678 parser_f.__name__ = name

    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
        452
        453     try:
    --> 454         data = parser.read(nrows)
        455     finally:
        456         parser.close()

    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
       1131     def read(self, nrows=None):
       1132         nrows = _validate_integer("nrows", nrows)
    -> 1133         ret = self._engine.read(nrows)
       1134
       1135         # May alter columns / col_dict

    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
       2110         data = {k: v for k, (i, v) in zip(names, data)}
       2111
    -> 2112         names, data = self._do_date_conversions(names, data)
       2113         index, names = self._make_index(data, alldata, names)
       2114

    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _do_date_conversions(self, names, data)
       1851             self.index_names,
       1852             names,
    -> 1853             keep_date_col=self.keep_date_col,
       1854         )
       1855

    /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
       3286         if is_scalar(colspec):
       3287             if isinstance(colspec, int) and colspec not in data_dict:
    -> 3288                 colspec = orig_names[colspec]
       3289             if _isindex(colspec):
       3290                 continue

    IndexError: list index out of range

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 143 Neuron

    Hi @Marlan

    Can you guide me a bit / share an example of the issue you see with the schema_st method?
    I have been using it in a lot of my plugins and code.

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 143 Neuron

    Thanks @Marlan

    That makes sense, and like you said earlier, it seems the two issues might have the same root cause.

    I just had this ad hoc requirement to pull a limited set of columns.
    All my processes that use the schema_st method run over all columns, so hopefully I won't run into this issue anytime soon.

  • MikeG
    MikeG Dataiker, Registered Posts: 15 Dataiker

    Hi @NN

    Thank you for sending over the requested details.

    I was able to reproduce the same error in a test environment:

    df = ds.get_dataframe(columns=['date_parsed'])
    df.info(verbose=True)

    IndexError                                Traceback (most recent call last)
    <ipython-input-21-406d388a5b0c> in <module>()
    ----> 1 df=ds.get_dataframe(columns=['date_parsed'])
          2 df.info(verbose=True)

    [...]

       3061         if is_scalar(colspec):
       3062             if isinstance(colspec, int) and colspec not in data_dict:
    -> 3063                 colspec = orig_names[colspec]
       3064             if _isindex(colspec):
       3065                 continue

    IndexError: list index out of range


    I can confirm that the issue you're encountering is a known bug involving date columns; it has been added to our backlog to be fixed in a future release.

    From my repro here's an example of expected output:

    df = ds.get_dataframe(columns=['id', 'date_not_null_formatted', 'date_parsed'])
    df.info(verbose=True)

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 3 columns):
    id                         1 non-null int64
    date_not_null_formatted    1 non-null object
    date_parsed                1 non-null datetime64[ns]
    dtypes: datetime64[ns](1), int64(1), object(1)
    memory usage: 104.0+ bytes


    One curious thing I noticed: if I change the column order (i.e., switch `date_not_null_formatted` and `date_parsed`), the order of the column names changes in the output of df.info, but the order of the associated data types does not (i.e., `date_parsed` should be `datetime64[ns]`, but it shows up as `object` here):

    df = ds.get_dataframe(columns=['id', 'date_parsed', 'date_not_null_formatted'])
    df.info(verbose=True)

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 3 columns):
    id                         1 non-null int64
    date_parsed                1 non-null object
    date_not_null_formatted    1 non-null datetime64[ns]
    dtypes: datetime64[ns](1), int64(1), object(1)
    memory usage: 104.0+ bytes

    I believe this ties in to what @Marlan mentioned with regard to the root cause.
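
    As a sanity check after loading (a sketch only, using the column names from this repro), you can verify whether the columns you expect to be dates actually came back as datetime64 and reparse them if not:

    import pandas as pd

    # Columns you expect to be parsed as dates (hypothetical list for this repro)
    expected_date_cols = ['date_parsed']

    for col in expected_date_cols:
        if not pd.api.types.is_datetime64_any_dtype(df[col]):
            # Column came back as object/string, so parse it explicitly
            df[col] = pd.to_datetime(df[col])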

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 143 Neuron

    Thanks @MikeG, appreciate your help looking into this.

    I shall work for now without parsing the dates.
