py4j.protocol.Py4JJavaError when calling o100.savePyDataFrame

maxbe_
Level 1

Hi,

I recently started using Dataiku, and I get the following error in my PySpark code:

Pyspark code failed
At line 58: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o100.savePyDataFrame. 

The code I am running in the recipe looks like this:

-----

import dataiku
from dataiku import spark as dkuspark
import pyspark.sql.functions as F

# sqlContext comes from the recipe's standard Spark setup;
# DC holds the project's column-name constants.
ds = dataiku.Dataset("xxx")
df = dkuspark.get_dataframe(sqlContext, ds)

df = df.groupBy(F.col(DC.vehicle_id), F.col(DC.datetime)).pivot('source').agg(F.avg('phys').alias('phys'))
df = df.sort(F.col(DC.vehicle_id).asc(), F.col(DC.datetime).asc())

## here some code to write the dataset to the output dataset

-----
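For completeness: the pivot above makes Spark run a separate job first just to discover the distinct values of 'source'. A variant with the values listed explicitly would skip that pass; a minimal sketch, where the source names are made up for illustration:

-----
# Hypothetical variant: 'src_a' and 'src_b' are placeholder source
# names. Passing them explicitly to pivot() avoids the extra Spark
# job that computes the distinct values of 'source'.
df = (df.groupBy(F.col(DC.vehicle_id), F.col(DC.datetime))
        .pivot('source', ['src_a', 'src_b'])
        .agg(F.avg('phys').alias('phys')))
-----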


When I run this code on a small dataset of about 1,000 rows, there is no error, but as soon as I increase the row count to more than 100,000 the error appears.
I therefore do not think the code itself is wrong.

It would be great if someone could help with this.

Thanks a lot!

This is the full error message:

com.dataiku.common.server.APIError$SerializedErrorException: At line 58: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o100.savePyDataFrame.

	at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner$3.throwFromErrorFileOrLogs(AbstractSparkBasedRecipeRunner.java:325)
	at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResult(JobExecutionResultHandler.java:26)
	at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runUsingSparkSubmit(AbstractSparkBasedRecipeRunner.java:340)
	at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.doRunSpark(AbstractSparkBasedRecipeRunner.java:145)
	at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runPySpark(AbstractSparkBasedRecipeRunner.java:119)
	at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runPySpark(AbstractSparkBasedRecipeRunner.java:104)
	at com.dataiku.dip.recipes.code.spark.PySparkRecipeRunner.run(PySparkRecipeRunner.java:55)
	at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)

 

AlexT
Dataiker

Hi,
The error "An error occurred while calling o100.savePyDataFrame" is the last error but not the root cause here. The actual would further up the job logs. 

Most likely your Spark executor configuration is too small to handle the larger dataset.

Try increasing the executor and driver memory, for example:

spark.executor.memory = 4g
spark.kubernetes.memoryOverheadFactor = 0.4

You can change this in the global Spark configuration, or override it at the recipe level:
[Screenshot: recipe-level Spark configuration override]
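Outside of Dataiku, where you build the Spark session yourself, the equivalent settings can be passed programmatically; a minimal sketch using the illustrative values from above:

-----
from pyspark.sql import SparkSession

# Illustrative values from the example above; tune to your cluster.
spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")
         .config("spark.kubernetes.memoryOverheadFactor", "0.4")
         .getOrCreate())
-----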

 

Thanks,

maxbe_
Level 1
Author

Hi AlexT,

thank you for your answer. You are right, the problem was in the config.

For me it helped to set these parameters:

[Screenshot: the Spark configuration parameters that resolved the issue]

 

However, for another PySpark recipe this did not help. The reason was apparently that I had subsampled the dataset with a visual recipe and then ran the same operation as on the full dataset.
Once I replaced the visual subsampling recipe with a PySpark recipe, it worked (see the sketch below).
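For anyone reading along, a subsampling step done directly in PySpark can look like this; the 10% fraction and the seed are only illustrative:

-----
# Take a reproducible 10% sample inside the PySpark recipe
# (fraction and seed are illustrative values).
df = df.sample(fraction=0.1, seed=42)
-----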

Thanks again,
Max

 
