get_dataframe with columns= parameter fails
Hi Dataiku Team,
When i try to read only certain columns from a dataset using the get_dataframe() function. I get an error when trying to read date columns.
The only way it works is if i set parse_dates to False.
Can you please check if this is how its supposed to work or am i missing something?
ds=dataiku.Dataset("mydataset") #--- Below step fails ----- df=ds.get_dataframe(columns=['idcolumn','datecolumn']) #--- Below step works ----- df=ds.get_dataframe(columns=['idcolumn','datecolumn'],parse_dates=False) df.head()
Best Answer
-
Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 319 Neuron
Hi @NN
,I refreshed my memory on the issue. All works fine unless you set the columns argument. Then what is returned in the parse_dates argument will most likely be incorrect. This argument returns indices that indicate which columns are dates. However, when columns is set, the indices are those of the original positions of the date columns, not the indices of the smaller list of columns you specified. Thus those indices will likely either refer to incorrect columns or be larger than than the current list of columns.
Here's an example:
Original dataset columns: N1, D1, S1, D2, S2, N2, D3 (where N is a numeric column, S is a string column, and D is date column)
If you don't specify columns, parse_dates would correctly return 1, 3, 6 (where indices start with 0).
If you specify columns = D1, S2, N2, D3, parse_dates would incorrectly return 1, 6 instead of the correct 0, 3.
I ended up writing a function that replaces get_dataframe_schema_st as I had an application where I needed it to work correctly for both this issue and handling int and smallint datatypes. Happy to share that if of interest. Obviously best to fix get_dataframe_schema_st but that won't happen right away (I reported this bug a while back).
Marlan
Answers
-
Hi @NN
,This is curious behavior. Can you send me the following information?
>When I try to read only certain columns from a dataset using the get_dataframe() function. I get an error when trying to read date columns.
(1) Can you send me a copy of the error stack you received (i.e. when `parse_dates=True`, the default behavior for `get_dataframe()`)
(2) Can you send me the output of `df.info(verbose=True)` for the successfully created dataframe:
df = ds.get_dataframe(columns=['idcolumn', 'datecolumn'], parse_dates=False) df.info(verbose=True)
(3) Can you send me a sample date value from the date column in the dataset of interest?
(4) Where does the underlying DSS dataset come from? Is it from a csv file, a SQL connection, etc?
Thank you,
Mike -
Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 319 Neuron
FYI, I while back I ran into something similar. It sounds similar enough that it could be the same issue. The get_dataframe_schema_st method has a bug that results in returning the wrong columns being designated as date columns. This issue would get propagated if get_dataframe calls get_dataframe_schema_st which it seems reasonable that it does.
Marlan
-
HI @MikeG
I have attached a screenshot of the dataset , the code i ran and the message i get as well.
(4) Where does the underlying DSS dataset come from? Is it from a csv file, a SQL connection, etc?
The dataset is an S3 CSV and i tried a editable dataset as well and got the same error.
Please note i am running on a Notebook(3) Can you send me a sample date value from the date column in the dataset of interest?
(2) Can you send me the output of `df.info(verbose=True)` for the successfully created dataframe:
(1) Can you send me a copy of the error stack you received (i.e. when `parse_dates=True`, the default behavior for `get_dataframe()`)
import dataiku import pandas as pd, numpy as np ds=dataiku.Dataset("append1_prepared") df=ds.get_dataframe(columns=['Int_not_null']) df.info(verbose=True) df=ds.get_dataframe(columns=['Int_not_null','date_parsed','date_not_null_formatted'],parse_dates=False) df.info(verbose=True) df=ds.get_dataframe(columns=['Int_not_null','date_parsed']) df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2 entries, 0 to 1 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Int_not_null 2 non-null int64 dtypes: int64(1) memory usage: 144.0 bytes <class 'pandas.core.frame.DataFrame'> RangeIndex: 3 entries, 0 to 2 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Int_not_null 2 non-null float64 1 date_parsed 3 non-null object 2 date_not_null_formatted 3 non-null object dtypes: float64(1), object(2) memory usage: 200.0+ bytes --------------------------------------------------------------------------- IndexError Traceback (most recent call last) <ipython-input-7-48220baa034b> in <module> 6 df=ds.get_dataframe(columns=['Int_not_null','date_parsed','date_not_null_formatted'],parse_dates=False) 7 df.info(verbose=True) ----> 8 df=ds.get_dataframe(columns=['Int_not_null','date_parsed']) 9 df.info(verbose=True) /data/dataiku-dss-9.0.4/python/dataiku/core/dataset.py in get_dataframe(self, columns, sampling, sampling_column, limit, ratio, infer_with_pandas, parse_dates, bool_as_str, float_precision, na_values, keep_default_na) 422 float_precision=float_precision, 423 na_values=na_values, --> 424 keep_default_na=keep_default_na) 425 426 def _stream(self, /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision) 674 ) 675 --> 676 return _read(filepath_or_buffer, kwds) 677 678 parser_f.__name__ = name /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds) 452 453 try: --> 454 data = parser.read(nrows) 455 finally: 456 parser.close() /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows) 1131 def read(self, nrows=None): 1132 nrows = _validate_integer("nrows", nrows) -> 1133 ret = self._engine.read(nrows) 1134 1135 # May alter columns / col_dict /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows) 2110 data = {k: v for k, (i, v) in zip(names, data)} 2111 -> 2112 names, data = self._do_date_conversions(names, data) 2113 index, names = self._make_index(data, alldata, names) 2114 /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _do_date_conversions(self, names, data) 1851 self.index_names, 1852 names, -> 1853 keep_date_col=self.keep_date_col, 1854 ) 1855 /data/dss/code-envs/python/default_python36_env/lib64/python3.6/site-packages/pandas/io/parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col) 3286 if is_scalar(colspec): 3287 if isinstance(colspec, int) and colspec not in data_dict: -> 3288 colspec = orig_names[colspec] 3289 if _isindex(colspec): 3290 continue IndexError: list index out of range
-
Hi @Marlan
Can you guide me a bit / share and example , on what issue you see with the schema_st method.
I have been using that in a lot of my plugins and codes. -
Thanks @Marlan
That makes sense and like you said earlier, it seems the two issues might be due to the same reason.I just had this adhoc requirement to pull limited columns .
All my processes which use the schema_st method run for all columns. So hopefully i wont run into this issue anytime soon. -
Hi @NN
Thank you for sending over the requested details.
I was able to reproduce the same error in a test environment:
df=ds.get_dataframe(columns=['date_parsed']) df.info(verbose=True) IndexError Traceback (most recent call last) <ipython-input-21-406d388a5b0c> in <module>() ----> 1 df=ds.get_dataframe(columns=['date_parsed']) 2 df.info(verbose=True) [...] 3061 if is_scalar(colspec): 3062 if isinstance(colspec, int) and colspec not in data_dict: -> 3063 colspec = orig_names[colspec] 3064 if _isindex(colspec): 3065 continue IndexError: list index out of range
I am confirming the issue you're encountering is a known bug involving date columns that has been added to our backlog for fixing in a future release.From my repro here's an example of expected output:
df=ds.get_dataframe(columns=['id','date_not_null_formatted','date_parsed']) df.info(verbose=True) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1 entries, 0 to 0 Data columns (total 3 columns): id 1 non-null int64 date_not_null_formatted 1 non-null object date_parsed 1 non-null datetime64[ns] dtypes: datetime64[ns](1), int64(1), object(1) memory usage: 104.0+ bytes
One curious thing I noticed is if I change the column order (i.e. switch `date_not_null_formatted` and `date_parsed`) the order of column names will change in the output of df.info, but the associated data type order does not change (i.e. `date_parsed` should be datatype `datetime64[ns]` but it shows up as `object` here)df=ds.get_dataframe(columns=['id','date_parsed','date_not_null_formatted']) df.info(verbose=True) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1 entries, 0 to 0 Data columns (total 3 columns): id 1 non-null int64 date_parsed 1 non-null object date_not_null_formatted 1 non-null datetime64[ns] dtypes: datetime64[ns](1), int64(1), object(1) memory usage: 104.0+ bytes
I believe this ties in to what @Marlan
mentioned with regards to root cause. -
Thanks @MikeG
, Appreciate your help looking into this.I shall work for now without parsing the dates.