Add support for Pandas 2.0

Pandas 2.0 can bring great performance improvements when using the pyarrow backend.
 
3 Comments
crunis
Level 1

It also allows nullable integer data types. It is sometimes really annoying that my integer field becomes float just because there's a single NA.
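To illustrate the behaviour described above, here is a minimal sketch. It assumes the commenter means the `Int64` extension dtype, which is one way to get nullable integers in pandas; that dtype itself predates 2.0, and 2.0 adds `dtype_backend` options that make nullable types easier to opt into. The sample values are made up for illustration.

```python
import pandas as pd

# Default inference: a single missing value forces the whole column to float64.
default = pd.Series([1, 2, None])
print(default.dtype)

# Nullable integer extension dtype: the values stay integers and the gap becomes <NA>.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().sum())  # number of missing entries
```

With the nullable dtype, an ID or count column keeps its integer type even when some rows are missing.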

AshleyW
Dataiker

Thanks for your idea, @Turribeach. Your idea meets the criteria for submission; we'll reach out should we require more information.

If you're reading this post and think [add more optional details] would be a great capability to add to DSS, be sure to kudos the original post! Feel free to leave a comment in the discussion about how this capability would help you or your team.

Take care,
Ashley

Status changed to: In the Backlog

It is also worth pointing out that we are already seeing the impact of not having a recent pandas version supported. In our Dataiku v12 environments it takes more than 6 minutes to build any code environment with pandas. For Python 3.9 envs pip downloads pandas-1.1.5-cp39-cp39-manylinux1_x86_64.whl, but for Python 3.11 pip gets pandas-1.3.5.tar.gz. This is because there are no pre-compiled pandas 1.3.5 packages for Python 3.11! So on Python 3.11 we are basically downloading the pandas source and building it from scratch, including its C extensions. This is risky, since building from source is a much more complex and error-prone process than installing a pre-compiled package. Sooner or later this build will break, due to either OS or package dependencies.

To add more complexity, our Python developers, who work on our internal Python data libraries, also struggle to create code envs using Python 3.11 and pandas 1.3.5: there are no pre-compiled binaries for Windows, and building from source is even harder there because Windows doesn't come with the necessary software to do so.
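The difference between the two downloads mentioned above can be read straight from the file names. Here is a rough sketch, using only the standard library, that classifies an artifact name as a pre-built wheel or a source distribution. The helper `describe_artifact` is hypothetical and written for illustration; it assumes the standard wheel naming scheme (`name-version[-build]-python_tag-abi_tag-platform_tag.whl`) and does not handle every edge case.

```python
def describe_artifact(filename: str) -> dict:
    """Classify a pip download as a pre-built wheel or a source distribution.

    Hypothetical helper for illustration only; not part of pip or Dataiku.
    """
    if filename.endswith(".whl"):
        # Wheel names: name-version[-build]-python_tag-abi_tag-platform_tag.whl
        parts = filename[: -len(".whl")].split("-")
        python_tag, abi_tag, platform_tag = parts[-3:]
        return {
            "kind": "wheel",
            "name": parts[0],
            "version": parts[1],
            "python_tag": python_tag,
            "abi_tag": abi_tag,
            "platform_tag": platform_tag,
            "needs_build": False,
        }
    for suffix in (".tar.gz", ".zip"):
        if filename.endswith(suffix):
            # Source distributions: name-version.tar.gz (or .zip)
            name, _, version = filename[: -len(suffix)].rpartition("-")
            return {"kind": "sdist", "name": name, "version": version, "needs_build": True}
    raise ValueError(f"unrecognised artifact name: {filename}")


for artifact in ("pandas-1.1.5-cp39-cp39-manylinux1_x86_64.whl", "pandas-1.3.5.tar.gz"):
    info = describe_artifact(artifact)
    print(artifact, "->", info["kind"], "| build from source:", info["needs_build"])
```

The `cp39` tag on the wheel is what ties it to Python 3.9. When pip finds no wheel whose tags match the interpreter and platform, it falls back to the `.tar.gz` and compiles locally, which is the slow and fragile path described in this comment.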
