Need help and advice on using Dataiku for faster data preprocessing.
We have a huge amount of data that needs to be preprocessed, and doing this with Python data frames is very slow. We have GPUs in the Dataiku cluster and want to use the RAPIDS (rapidsai) library. However, it needs Python 3.8, whereas our Dataiku instance is on Python 3.7, and an immediate upgrade to 3.8 is not available.
 1. Is there any other library, supported in Dataiku, that works with Python 3.7 and can use GPUs for faster data processing?
 2. Can Spark be integrated with the existing Dataiku cluster so that PySpark can be used for faster processing? What would it take to onboard Spark in our Dataiku instance?
 3. Is an upgrade to Python 3.8 and above available and supported in Dataiku?
Operating system used: Redhat 7.9
Thanks for your reply.
We are using Dataiku version 11.3.1, and the Python version available in the instance build is 3.7. We will look to have it upgraded to 3.8.
Also, could you please advise on points 1 and 2?
If you are on v11.3.2, all you need to do is install the Python 3.8 and Python 3.9 packages as alternative installs on your RHEL box; you will then be able to create Python 3.8 and Python 3.9 code environments in Dataiku. For Spark, see Setting up Spark integration: https://doc.dataiku.com/dss/latest/spark/installation.html
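Once the alternative Python packages are installed, a quick way to confirm they are visible on the box is a small check like the one below. This is only a sketch: it assumes the interpreters are exposed on the PATH under the names python3.8 and python3.9, which may not match your install, and the helper names find_interpreters and report are made up for illustration. It is probably worth running it as the same Linux user that runs Dataiku, since the PATH can differ between users.

```python
import shutil
import subprocess


def find_interpreters(names=("python3.8", "python3.9")):
    """Map each interpreter name to its full path on PATH, or None if absent.

    The default names are an assumption; adjust them to your install.
    """
    return {name: shutil.which(name) for name in names}


def report(found):
    """Build one human-readable line per interpreter."""
    lines = []
    for name, path in found.items():
        if path is None:
            lines.append(f"{name}: not found on PATH")
            continue
        # Ask the interpreter itself for its version string.
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True
        )
        version = (result.stdout or result.stderr).strip()
        lines.append(f"{name}: {path} ({version})")
    return lines


if __name__ == "__main__":
    for line in report(find_interpreters()):
        print(line)
```

If an interpreter shows as not found, the package is either not installed or installed somewhere that is not on the PATH for that user.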