get_dataframe with columns= parameter fails

NN · April 2022

Hi Dataiku Team,
When i try to read only certain columns from a dataset using the get_dataframe() function. I get an error when trying to read date columns.
The only way it works is if i set parse_dates to False.

Can you please check if this is how its supposed to work or am i missing something?

ds=dataiku.Dataset("mydataset")
#--- Below step fails -----
df=ds.get_dataframe(columns=['idcolumn','datecolumn'])
#--- Below step works -----
df=ds.get_dataframe(columns=['idcolumn','datecolumn'],parse_dates=False)
df.head()

Marlan · April 2022

Hi @NN
,

I refreshed my memory on the issue. All works fine unless you set the columns argument. Then what is returned in the parse_dates argument will most likely be incorrect. This argument returns indices that indicate which columns are dates. However, when columns is set, the indices are those of the original positions of the date columns, not the indices of the smaller list of columns you specified. Thus those indices will likely either refer to incorrect columns or be larger than than the current list of columns.

Here's an example:

Original dataset columns: N1, D1, S1, D2, S2, N2, D3 (where N is a numeric column, S is a string column, and D is date column)

If you don't specify columns, parse_dates would correctly return 1, 3, 6 (where indices start with 0).

If you specify columns = D1, S2, N2, D3, parse_dates would incorrectly return 1, 6 instead of the correct 0, 3.

I ended up writing a function that replaces get_dataframe_schema_st as I had an application where I needed it to work correctly for both this issue and handling int and smallint datatypes. Happy to share that if of interest. Obviously best to fix get_dataframe_schema_st but that won't happen right away (I reported this bug a while back).

Marlan

MikeG · April 2022

Hi @NN
,

This is curious behavior. Can you send me the following information?

>When I try to read only certain columns from a dataset using the get_dataframe() function. I get an error when trying to read date columns.

(1) Can you send me a copy of the error stack you received (i.e. when `parse_dates=True`, the default behavior for `get_dataframe()`)

(2) Can you send me the output of `df.info(verbose=True)` for the successfully created dataframe:

df = ds.get_dataframe(columns=['idcolumn', 'datecolumn'], parse_dates=False)
df.info(verbose=True)

(3) Can you send me a sample date value from the date column in the dataset of interest?

(4) Where does the underlying DSS dataset come from? Is it from a csv file, a SQL connection, etc?

Thank you,
Mike

Marlan · April 2022

Hi @NN
and @MikeG
,

FYI, I while back I ran into something similar. It sounds similar enough that it could be the same issue. The get_dataframe_schema_st method has a bug that results in returning the wrong columns being designated as date columns. This issue would get propagated if get_dataframe calls get_dataframe_schema_st which it seems reasonable that it does.

Marlan

NN · April 2022

HI @MikeG

I have attached a screenshot of the dataset , the code i ran and the message i get as well.

(4) Where does the underlying DSS dataset come from? Is it from a csv file, a SQL connection, etc?

The dataset is an S3 CSV and i tried a editable dataset as well and got the same error.
Please note i am running on a Notebook

(3) Can you send me a sample date value from the date column in the dataset of interest?

(2) Can you send me the output of `df.info(verbose=True)` for the successfully created dataframe:

(1) Can you send me a copy of the error stack you received (i.e. when `parse_dates=True`, the default behavior for `get_dataframe()`)

import dataiku
import pandas as pd, numpy as np
ds=dataiku.Dataset("append1_prepared")
df=ds.get_dataframe(columns=['Int_not_null'])
df.info(verbose=True)
df=ds.get_dataframe(columns=['Int_not_null','date_parsed','date_not_null_formatted'],parse_dates=False)
df.info(verbose=True)
df=ds.get_dataframe(columns=['Int_not_null','date_parsed'])
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Int_not_null  2 non-null      int64
dtypes: int64(1)
memory usage: 144.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Int_not_null             2 non-null      float64
 1   date_parsed              3 non-null      object 
 2   date_not_null_formatted  3 non-null      object 
dtypes: float64(1), object(2)
memory usage: 200.0+ bytes
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-7-48220baa034b> in <module>
      6 df=ds.get_dataframe(columns=['Int_not_null','date_parsed','date_not_null_formatted'],parse_dates=False)
      7 df.info(verbose=True)
----> 8 df=ds.get_dataframe(columns=['Int_not_null','date_parsed'])
      9 df.info(verbose=True)

/data/dataiku-dss-9.0.4/python/dataiku/core/dataset.py in get_dataframe(self, columns, sampling, sampling_column, limit, ratio, infer_with_pandas, parse_dates, bool_as_str, float_precision, na_values, keep_default_na)
    422                                  float_precision=float_precision,
    423                                  na_values=na_values,
--> 424                                  keep_default_na=keep_default_na)
    425 
    426     def _stream(self,

/data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name

/data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    452 
    453     try:
--> 454         data = parser.read(nrows)
    455     finally:
    456         parser.close()

/data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1131     def read(self, nrows=None):
   1132         nrows = _validate_integer("nrows", nrows)
-> 1133         ret = self._engine.read(nrows)
   1134 
   1135         # May alter columns / col_dict

/data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   2110             data = {k: v for k, (i, v) in zip(names, data)}
   2111 
-> 2112             names, data = self._do_date_conversions(names, data)
   2113             index, names = self._make_index(data, alldata, names)
   2114 

/data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _do_date_conversions(self, names, data)
   1851                 self.index_names,
   1852                 names,
-> 1853                 keep_date_col=self.keep_date_col,
   1854             )
   1855 

/data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
   3286             if is_scalar(colspec):
   3287                 if isinstance(colspec, int) and colspec not in data_dict:
-> 3288                     colspec = orig_names[colspec]
   3289                 if _isindex(colspec):
   3290                     continue

IndexError: list index out of range

NN · April 2022

Hi @Marlan

Can you guide me a bit / share and example , on what issue you see with the schema_st method.
I have been using that in a lot of my plugins and codes.

NN · April 2022

Thanks @Marlan

That makes sense and like you said earlier, it seems the two issues might be due to the same reason.

I just had this adhoc requirement to pull limited columns .
All my processes which use the schema_st method run for all columns. So hopefully i wont run into this issue anytime soon.

MikeG · April 2022

Hi @NN

Thank you for sending over the requested details.

I was able to reproduce the same error in a test environment:

df=ds.get_dataframe(columns=['date_parsed'])
df.info(verbose=True)

IndexError Traceback (most recent call last)
<ipython-input-21-406d388a5b0c> in <module>()
----> 1 df=ds.get_dataframe(columns=['date_parsed'])
2 df.info(verbose=True)
[...]
3061 if is_scalar(colspec):
3062 if isinstance(colspec, int) and colspec not in data_dict:
-> 3063 colspec = orig_names[colspec]
3064 if _isindex(colspec):
3065 continue

IndexError: list index out of range

I am confirming the issue you're encountering is a known bug involving date columns that has been added to our backlog for fixing in a future release.

From my repro here's an example of expected output:

df=ds.get_dataframe(columns=['id','date_not_null_formatted','date_parsed'])
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
id 1 non-null int64
date_not_null_formatted 1 non-null object
date_parsed 1 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 104.0+ bytes

One curious thing I noticed is if I change the column order (i.e. switch `date_not_null_formatted` and `date_parsed`) the order of column names will change in the output of df.info, but the associated data type order does not change (i.e. `date_parsed` should be datatype `datetime64[ns]` but it shows up as `object` here)

df=ds.get_dataframe(columns=['id','date_parsed','date_not_null_formatted'])
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
id 1 non-null int64
date_parsed 1 non-null object
date_not_null_formatted 1 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 104.0+ bytes

I believe this ties in to what @Marlan
mentioned with regards to root cause.

NN · April 2022

Thanks @MikeG
, Appreciate your help looking into this.

I shall work for now without parsing the dates.

get_dataframe with columns= parameter fails

Best Answer

Answers

Categories

Setup Info

Tags