Data Type Question when reading in DataFrame from DataSet
Hi
I have a dataset being read in from Excel. The preview has guessed the column to be a date, and the schema is set to date type. Reading it in as a dataframe via Python, when I run "dtypes" I see it has brought it in as an object type which I assume to be a string.
As a general rule of thumb, is the schema "all poweful" vs the types and semantic types I see in the preview window as column headers?
Also, in this instance I have an Excel file, providing dates as:
17-May-2022 12:27 PM |
20-May-2022 10:25 AM |
16-May-2022 02:16 PM |
19-May-2022 05:51 PM |
16-May-2022 10:22 AM |
But the preview sample in dataiku is showing them as:
17-May-2022 12\:27 PM
i.e an extra backslash?
This means when I manually try to coerce the series to be a date type, it fails.
My guess is I shouldnt have to be co-ercing data series objects to be a certain datatype if the schema is set up.
Answers
-
Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 412 Neuron
Hi @indy2005
,For the case of the dataset based in an Excel spreadsheet, can you confirm that this is what you see related to the detected schema and meaning?
If that is the case, the dataset read in a python recipe or notebook, should return a pandas dataframe where the column has the correct datetime type. At least, it worked for me in an example I just tested.
Regarding the parsing of dates with human-readable formats, or dates without a timezone, DSS won't recognize those automatically, and most probably it will have detected the meaning 'Date (unparsed)'. In that case, you'll need to use a Prepare recipe to parse the dates to the correct format.
While this might seem inefficient, I believe this is so later any join, window function or date operation, can be performed without major problems.
Hope this helps.
-
see screenshots below, I dont know why the DataIku sample would introduce a backslash into the time, which is not there in Excel.
-
Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 412 Neuron
I don't is Dataiku introducing a backslash, but some encoding of the source Excel file that when read with different parsers produces this kind of issues. It will also depend on if Excel stored the date time as a string or an actual timestamp. Just as with CSV files, there are multiple flavors and encoding for Excel files, and DSS will try to do it best to get the right types, but it will not always work.
Anyhow, my impression is that this is not a bug, but an example of the usual cleaning and preparation you have to do when ingesting data from files (csv, excel, etc.), and that you can perform using the Preparation recipe from DSS.
-
I can certainly work around this format, IF I know this is always the way the file will be sent to me. Without understanding why DSS does this on the read of an Excel date\time, if I build around this format - and Business Objects outputs something which changes at some point, the hard coded "format" taking care of the backslash will break.
-
Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 412 Neuron
Hi @indy2005
Yes, that something that can certainly happen if the file sent to you is changed, either because new columns are added, or the date format is changed, or if the person/system creating the Excel file changes the Excel version or uses a computer with an OS that has different encoding options (this often happens with csv files, which can have different encoding when they are generated on a Window machine vs a Mac machine).
So, if you are building an automated flow, the best thing to do is to add a check or perhaps a reporter that lets you know when your flow breaks because of this kind of issue.
A much more robust approach would be if the Excel data is moved into a database, and then you can access that database directly from DSS. But, even then, if the people producing the data changes the schema of the database in the future, you'll still need to deal with the problems this will create for your flow.
Cheers!
-
Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 114 ✭✭✭✭✭✭✭
I concur with @Ignacio_Toledo
on the need for checks when processing excel-sourced data containing dates. As there are basically no restrictions on input in Excel the possibility is always present that a different format is entered, which gets interpreted by excel as a different date altogether. That can be hard to detect later on! Also a placeholder like "0" in excel can give rise to dates like 00-01-1900 in the imported dataset. Good thing is that parsing a certain dateformat clears these faulty values, with a warning.If there is one thing i learned with processing excel-sourced data is that there is no limit to human creativity. And that 'helpful' features are not always helpful at all.