I recently started using Dataiku and I get the following error in my PySpark code:
Pyspark code failed At line 58: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o100.savePyDataFrame.
The code I am running in the recipe looks like this:
```python
import dataiku
from dataiku import spark as dkuspark
import pyspark.sql.functions as F

ds = dataiku.Dataset("xxx")
df = dkuspark.get_dataframe(sqlContext, ds)
df = df.groupBy(F.col(DC.vehicle_id), F.col(DC.datetime)) \
       .pivot('source') \
       .agg(F.avg('phys').alias('phys'))
df = df.sort(F.col(DC.vehicle_id).asc(), F.col(DC.datetime).asc())
## here some code to write the dataset to the output dataset
```
When I run this code on a small dataset of about 1000 rows, I get no error. But when I increase the number of rows to >100000 I get the error.
Therefore I do not think my code itself is wrong. It would be great if someone could help with this.
Thanks a lot!
This is the full error message:
```
com.dataiku.common.server.APIError$SerializedErrorException: At line 58: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o100.savePyDataFrame.
    at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner$3.throwFromErrorFileOrLogs(AbstractSparkBasedRecipeRunner.java:325)
    at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResult(JobExecutionResultHandler.java:26)
    at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runUsingSparkSubmit(AbstractSparkBasedRecipeRunner.java:340)
    at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.doRunSpark(AbstractSparkBasedRecipeRunner.java:145)
    at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runPySpark(AbstractSparkBasedRecipeRunner.java:119)
    at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runPySpark(AbstractSparkBasedRecipeRunner.java:104)
    at com.dataiku.dip.recipes.code.spark.PySparkRecipeRunner.run(PySparkRecipeRunner.java:55)
    at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:375)
```
The error "An error occurred while calling o100.savePyDataFrame" is only the last error reported, not the root cause. The actual cause will be further up in the job logs.
Most likely your Spark executor configuration is too small to handle the larger dataset.
Try increasing the executor memory and the memory overhead, for example:

```
spark.executor.memory = 4g
spark.kubernetes.memoryOverheadFactor = 0.4
```

You can change this in the global Spark configuration, or override it at the recipe level.
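Putting this together, a recipe-level override could look like the following fragment. The values are illustrative starting points only, not recommendations; how much memory you actually need depends on your data volume and cluster, and `spark.driver.memory` / `spark.executor.instances` are additional knobs you may or may not need to touch:

```
spark.executor.memory = 4g
spark.driver.memory = 4g
spark.executor.instances = 2
spark.kubernetes.memoryOverheadFactor = 0.4
```

Note that `spark.kubernetes.memoryOverheadFactor` only applies when Spark runs on Kubernetes; on YARN, the equivalent knob is `spark.executor.memoryOverhead`.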
Thank you for your answer. You are right, the problem was in the config. For me it helped to set the parameters you suggested.
However, for another PySpark recipe this did not help. The reason was apparently that I had subsampled the dataset with a visual recipe and then performed the same operation on it as on the full dataset.
When I changed the subsampling recipe to a PySpark recipe, it worked.