py4j.protocol.Py4JJavaError when calling o100.savePyDataFrame

maxbe_ Registered Posts: 2
edited July 16 in Using Dataiku


I recently started using Dataiku and I get the following error in my PySpark code:

Pyspark code failed
At line 58: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o100.savePyDataFrame. 

The code I am running in the recipe looks like this:


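import dataiku
import dataiku.spark as dkuspark
import pyspark.sql.functions as F
# sqlContext and DC (column-name constants) are defined earlier in the recipe (not shown)

ds = dataiku.Dataset("xxx")
df = dkuspark.get_dataframe(sqlContext, ds)

# One row per vehicle and timestamp, one column per 'source', averaging 'phys'
df = df.groupBy(F.col(DC.vehicle_id), F.col(DC.datetime)).pivot('source').agg(F.avg('phys').alias('phys'))
df = df.sort(F.col(DC.vehicle_id).asc(), F.col(DC.datetime).asc())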

## here some code to write the dataset to the output dataset


When I run this code on a small dataset of about 1,000 rows, I get no error. But when I increase the number of rows to more than 100,000, I get the error.
So I do not think the logic of my code is wrong.

Would be cool if someone can help with this.

Thanks a lot!

This is the full error message:

com.dataiku.common.server.APIError$SerializedErrorException: At line 58: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o100.savePyDataFrame.

    at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner$3.throwFromErrorFileOrLogs(
    at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResult(
    at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runUsingSparkSubmit(
    at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.doRunSpark(
    at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runPySpark(
    at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runPySpark(
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$


  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    The error "An error occurred while calling o100.savePyDataFrame" is the last error reported, but not the root cause here. The actual cause would be further up in the job logs.
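To illustrate looking "further up" in the logs: Java prints each chained exception on a line beginning with `Caused by:`, so scanning the job log text for those lines is a quick way to find candidates for the root cause. The sketch below is only a minimal helper under that assumption; the function name `find_root_causes` is made up for illustration and is not part of Dataiku or py4j.

```python
import re

# Java stack traces introduce each chained exception with "Caused by: ".
CAUSED_BY = re.compile(r"^\s*Caused by:\s*(.+)$")


def find_root_causes(log_text):
    """Return the text of every 'Caused by:' line, in the order it appears."""
    causes = []
    for line in log_text.splitlines():
        match = CAUSED_BY.match(line)
        if match:
            causes.append(match.group(1).strip())
    return causes
```

Call it on the full text of the job log. Within a single stack trace the last entry is usually the innermost cause, which is typically more informative than the final `Py4JJavaError` message.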

    Most likely your Spark executor configuration is too small to handle the larger dataset.

    Try increasing executor memory and driver memory, for example:

    spark.executor.memory = 4g
    spark.kubernetes.memoryOverheadFactor = 0.4

    You can change this in the Spark config, or override it at the recipe level:
    Screen Shot 2023-03-31 at 11.11.57 AM.png
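To get a feel for what the two example settings mean together, here is a rough arithmetic sketch. It assumes, based on my reading of the Spark on Kubernetes documentation, that each executor pod requests the executor memory plus an overhead of `factor * executor memory`, with a minimum overhead of 384 MiB. It ignores other settings that can also contribute (for example an explicit `spark.executor.memoryOverhead` or `spark.executor.pyspark.memory`), and the helper names are made up for illustration.

```python
MIN_OVERHEAD_MIB = 384  # assumed minimum overhead, per the Spark documentation


def parse_memory_mib(value):
    """Convert a Spark-style size such as '4g' or '512m' to MiB."""
    units = {"m": 1, "g": 1024}
    number, unit = value[:-1], value[-1].lower()
    if unit not in units:
        raise ValueError("unsupported unit in %r (expected 'm' or 'g')" % value)
    return int(number) * units[unit]


def executor_pod_memory_mib(executor_memory, overhead_factor):
    """Approximate memory request of one executor pod, in MiB."""
    heap = parse_memory_mib(executor_memory)
    overhead = max(int(heap * overhead_factor), MIN_OVERHEAD_MIB)
    return heap + overhead
```

With the values above, `executor_pod_memory_mib("4g", 0.4)` gives 5734 MiB, i.e. about 5.6 GiB per executor, so the cluster needs that much room per executor pod for the setting to take effect.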


  • maxbe_
    maxbe_ Registered Posts: 2

    Hi AlexT,

    Thank you for your answer. You are right, the problem was in the config.

    For me it helped to set these parameters:

    Screenshot from 2023-04-06 08-46-15.png

    However, for another PySpark recipe this did not help. The reason was apparently that I had subsampled the dataset with a visual recipe and then ran the same operation as on the full dataset.
    When I changed the subsampling recipe to a PySpark recipe, it worked.

    Thanks again,
