pyspark UDF inside a Python package

  • Creating a very minimal Python package/module with a UDF:

import pyspark.sql.functions as func
import pyspark.sql.types as pysparktypes

def test_func(x):
    # Trivial UDF body: ignores the input and always returns 'test'
    return 'test'

def transfo_dataset(df):
    # Wrap the plain Python function into a Spark UDF returning a string
    test_udf = func.udf(test_func, pysparktypes.StringType())
    # Apply the UDF to the 'category' column to create a new column
    return df.withColumn("new_col", test_udf('category'))
  • Processing a dataset that has a 'category' column with this package fails (a sketch of the full recipe context follows this list).

import udf_test
df_transfo = udf_test.transfo_dataset(df)
df_transfo.show()
  • Error message (first line):

Py4JJavaError: An error occurred while calling o64.showString.

The same error occurs if we try to export the dataset.

  • Current solution

Copy-paste the code into the recipe/notebook and call the function directly:


df_transfo = transfo_dataset(df)
df_transfo.show()
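
For reference, here is how the df used in the snippets above would typically be obtained in a DSS PySpark recipe. This is only a sketch: the Dataiku boilerplate and the dataset name "mydataset" are illustrative assumptions, not part of the original example.

import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

import udf_test

# Standard DSS boilerplate to read a dataset as a Spark DataFrame
# ("mydataset" is a placeholder name)
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("mydataset"))

# Fails with Py4JJavaError when the function is imported from the package,
# but works when the same code is pasted directly into the recipe
df_transfo = udf_test.transfo_dataset(df)
df_transfo.show()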

I understand from this answer https://answers.dataiku.com/408/spark-packages-with-dss#a410 that the code for the UDF should be available to the executors.
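
If that is the root cause, one possible Spark-level fix is to explicitly distribute the module to the executors before any action runs. This is a sketch, assuming udf_test lives in a single file udf_test.py; the path below is a placeholder:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# Ship the module to every executor so the UDF can be unpickled there;
# the path is a placeholder for wherever udf_test.py lives on the driver.
sc.addPyFile("/path/to/udf_test.py")

import udf_test
df_transfo = udf_test.transfo_dataset(df)
df_transfo.show()

For a multi-file package, the same call accepts a .zip of the package directory, and spark-submit's --py-files option is the equivalent at submission time.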

But in this case, it feels like the Python package environment is limited: the same code works inside a recipe but fails in a package.

The problem is specific to UDFs: if I remove the UDF, the package works fine.

I tried to create a small reproducible example. I can give more details if needed.

EDIT:

This is apparently due to our version of Spark.
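
For anyone comparing environments, the Spark version a notebook actually runs against can be printed directly; a minimal sketch:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# Version reported by the running SparkContext
print(sc.version)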
