Preformat and preprocessing a registered Dataiku model through Python by using
I would like to use a registered model in a Python recipe and apply its preformatting and preprocessing to a huge dataframe connected to Databricks (which helps with memory issues). But it seems this is not possible without going through a pandas dataframe. Does anybody know how to resolve this?
The error message I got:
I don't want to go through df.to_pandas() because of memory issues and the kernel dying.
The idea is to use the result afterwards to apply SHAP in probability space …
Does anybody know what to do in the recipe?
We work on Dataiku version 13
Answers
Turribeach
Please always post your code in a code block (the </> icon in the toolbar) so that it can be easily copied to replicate your issue. I don't think you are going to get away from your problem. The first thing to understand is that the Databricks DBConnect gives you a Spark data frame, not a Pandas one. You can check this by running type(df) in a Jupyter Notebook. Second, when you want Dataiku to read the data, it needs a Pandas data frame, as that is the only thing it supports. Finally, if you want to use dataiku.Model() you will have to load the data in memory; you can't push down the computation to Databricks or even Kubernetes (see this section of the API). To have the computation pushed down to Databricks, you will need to write PySpark code and find models you can use in Python so that they can be executed in the General Compute Cluster.
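As a rough illustration of the "write PySpark code" route, here is a minimal sketch of a batch-scoring function in the shape that PySpark's DataFrame.mapInPandas expects (an iterator of pandas batches in, an iterator of pandas batches out). Everything model-related here is an assumption for illustration: FEATURES, WEIGHTS, BIAS and predict_proba are a toy logistic model standing in for whatever picklable Python model you would actually ship to the cluster, not the dataiku.Model() object. Whether this runs over your Databricks connection depends on your setup, so treat it as a starting point only.

```python
from typing import Iterator

import numpy as np
import pandas as pd

# Hypothetical stand-in for a model that can run on the cluster workers.
# In practice this would be a picklable Python model that bundles its own
# preprocessing; it is NOT the dataiku.Model() object.
FEATURES = ["x1", "x2"]
WEIGHTS = np.array([0.5, -0.25])
BIAS = 0.1


def predict_proba(features: pd.DataFrame) -> np.ndarray:
    """Toy logistic model: returns P(class = 1) for each row."""
    logits = features[FEATURES].to_numpy(dtype=float) @ WEIGHTS + BIAS
    return 1.0 / (1.0 + np.exp(-logits))


def score_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Score one pandas batch at a time so only one batch is held in memory."""
    for batch in batches:
        out = batch.copy()
        out["proba"] = predict_proba(batch)
        yield out


# Local check with two small batches; no Spark session is needed for this part.
local_batches = [
    pd.DataFrame({"x1": [0.0, 2.0], "x2": [0.0, 4.0]}),
    pd.DataFrame({"x1": [1.0], "x2": [2.0]}),
]
scored = pd.concat(score_batches(iter(local_batches)), ignore_index=True)

# On a Spark DataFrame (here called spark_df, an assumed name), the same
# function could be applied partition by partition with:
# result = spark_df.mapInPandas(
#     score_batches, schema="x1 double, x2 double, proba double"
# )
```

The point of this shape is that the full dataset never gets converted to a single pandas data frame on the driver; each worker only sees small batches, and the probabilities come back as a new Spark column that can be used for later steps.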