Optimize SQL date handling
I recognize that this is a hyper-specific use case, and I'm unable to test broader suitability because I only use SQL, but perhaps this would be useful in other databasing systems as well:
I recently added date partitioning to a datasource that exists outside of Dataiku. Notably, it DOES have indexes specified on the date field used for partitioning. Before partitioning the datasource, it performed well, and I was using a project variable to control how far back a query could go, but I wanted to remove this, and use partitioning instead. My query performance took a serious hit, and here is why:
My table contains datetime data that does not have time zone information with it. It is understood to be in local time. The partitioning functions decorate the query with a where clause that looks (in part) like this:
TODATETIMEOFFSET( "<my_timestamp_field>" , -420) >= '2022-10-11 09:00:00'....
The call to TODATETIMEOFFSET requires a calculation on each row, and therefore causes SQL to perform a tablescan rather than using the appropriate indexes. In testing this query as written, it took about 20 seconds to pull the required data. I then rewrote the query like this:
"<my_timestamp_field>" >= TODATETIMEOFFSET( '2022-10-11 09:00:00' , 420)....
And the same data was returned in under 1 second.
Is it possible to re-write the time offsets such that the fixed value (in this case, the partitioning dimension) is the one that gets modified by the offset? This allows the calculation to be performed just once, and allows the DBMS to utilize the indexes.
Comments
-
Yesterday I was asked by IT if they could kill a query that was written this way as it was interfering with other processes running on the server. To be fair the user in question was loading an enormous amount of data, however, making a change in the query similar to the one listed above took the process time from 10 minutes (or nearly 3 hours when it ran into locking issues) to less than 2 minutes. A faster running query will ensure that locking issues do not occur as readily and if they do the query is still completed faster rather than stalling for hours on a system.
Granted some of this is due to the database architecture and we are addressing that piece; however, this fix would prove very helpful while we wait on IT to make the necessary changes.