
py4j.protocol.Py4JJavaError when calling o100.savePyDataFrame



I recently started using Dataiku, and I get the following error in my PySpark code:

Pyspark code failed
At line 58: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o100.savePyDataFrame. 

The code I am running in the recipe looks like this:


import dataiku
import dataiku.spark as dkuspark
import pyspark.sql.functions as F
# DC is this project's module of column-name constants; sqlContext comes
# from the standard Dataiku PySpark recipe preamble

ds = dataiku.Dataset("xxx")
df = dkuspark.get_dataframe(sqlContext, ds)

df = df.groupBy(F.col(DC.vehicle_id), F.col(DC.datetime)).pivot('source').agg(F.avg('phys').alias('phys'))
df = df.sort(F.col(DC.vehicle_id).asc(), F.col(DC.datetime).asc())

## here some code to write the dataset to the output dataset


When I run this code on a small dataset of about 1,000 rows, there is no error, but once I increase it to more than 100,000 rows the job fails with the error above.
So I do not think the code itself is wrong.

It would be great if someone could help with this.

Thanks a lot!

This is the full error message:

com.dataiku.common.server.APIError$SerializedErrorException: At line 58: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o100.savePyDataFrame.

	at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner$3.throwFromErrorFileOrLogs(
	at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResult(
	at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runUsingSparkSubmit(
	at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.doRunSpark(
	at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runPySpark(
	at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runPySpark(
	at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$


2 Replies

The error "An error occurred while calling o100.savePyDataFrame" is only the last error reported, not the root cause. The actual root cause will be further up in the job logs.

Most likely your Spark executor configuration is too small to handle the larger dataset.

Try increasing the executor memory and the memory overhead, for example:

spark.executor.memory = 4g
spark.kubernetes.memoryOverheadFactor = 0.4

You can change this in the global Spark configuration or override it at the recipe level:
(screenshot: recipe-level Spark configuration override)




Hi AlexT,

thank you for your answer. You are right, the problem was in the config.

For me it helped to set these parameters:

(screenshot: Spark settings that resolved the issue)


However, for another PySpark recipe this did not help. The reason apparently was that I had subsampled the dataset with a visual recipe and then ran the same operation as on the full dataset.
When I changed the subsampling recipe to a PySpark recipe, it worked.

Thanks again,

