-
PYSPARK_PYTHON environment variable issue in PySpark
Hi, I am facing the issue below with a PySpark recipe: Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. I have set the environment variables using…
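A likely fix, sketched below under the assumption that a Python 3 interpreter lives at /usr/bin/python3 on both the driver and the workers (the path is an assumption; use whatever is valid on your cluster): point both variables at the same interpreter before the SparkSession starts.

    import os

    # Driver and executors must resolve to the same minor Python version.
    # The interpreter path here is an assumption; substitute your own.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("version-check").getOrCreate()
    print(spark.sparkContext.pythonVer)  # the Python version PySpark is running with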
-
import packages warnings
When I import packages, I get the warnings you can see in the pictures. What are they, and how can I get rid of them?
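If the warnings turn out to be deprecation notices, one common way to silence a specific category without hiding everything is Python's warnings filter; a minimal sketch (the package name is hypothetical):

    import warnings

    # Suppress only the offending category so other warnings stay visible.
    warnings.filterwarnings("ignore", category=DeprecationWarning)

    import some_noisy_package  # hypothetical stand-in for the real import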
-
No connection defined to upload files/jars
I am trying to execute a PySpark recipe on a remote AWS EMR Spark cluster and I am getting: Your Spark settings don't define a temporary storage for yarn-cluster mode in act.compute_prepdataset1_NP: No connection defined to upload files/jars. I am using this runtime configuration: I also tried adding: spark.yarn.stagingDir…
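For reference, here is a minimal sketch of how spark.yarn.stagingDir can be supplied programmatically; the S3 path is a placeholder assumption, and on a Dataiku instance the same key would normally go into the Spark configuration in the settings:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            # yarn-cluster mode needs somewhere to stage uploaded files/jars;
            # the bucket path is a placeholder for your own storage.
            .set("spark.yarn.stagingDir", "s3://my-emr-bucket/spark-staging/"))

    spark = SparkSession.builder.config(conf=conf).getOrCreate()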
-
Spark Cluster mode
Hello, as we use Spark heavily, we are facing slow application launches in YARN cluster mode. The slowness comes from the many DSS-related files and jar files that have to be uploaded for every single Spark application. We looked at the cluster mode feature. However, we know that…
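A standard Spark-on-YARN mitigation, sketched under the assumption that the dependency jars can be pre-staged on HDFS once: point spark.yarn.archive (or spark.yarn.jars) at the pre-uploaded location so every application localizes it from HDFS instead of re-uploading it from the client.

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            # Archive of dependency jars uploaded to HDFS ahead of time
            # (the path is a placeholder); YARN caches and localizes it,
            # so the jars are not shipped on every launch.
            .set("spark.yarn.archive", "hdfs:///apps/spark/spark-libs.zip"))

    spark = SparkSession.builder.config(conf=conf).getOrCreate()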
-
Spark Setup
Hi all, I need to use SparkSQL and Spark for Python. I installed Spark and it shows up in the administration settings, but when I run SparkSQL it raises this error: Cannot run program "spark-submit" (in directory "/data/design/jobs/DC Can anyone help, or send an article to follow for the configuration? Thanks in…
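That error usually means the process launching the job cannot find spark-submit on its PATH. A quick diagnostic sketch, run as the same user that runs DSS:

    import os
    import shutil

    # If this prints None, spark-submit is not on this process's PATH;
    # SPARK_HOME/bin must be added to the PATH of the user running DSS.
    print(shutil.which("spark-submit"))
    print(os.environ.get("SPARK_HOME"))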
-
Access the partition value in Pyspark Recipe
I have a table that is partitioned by date. How can I access the partition date in a PySpark recipe? I tried the following code, but it does not recognize actual_date:

    fct_pm_card.select("application_id", "product") \
        .filter(col('actual_date') <= end_date)
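One direction that may help, sketched under the assumption that the partitioning dimension is named actual_date and that fct_pm_card and end_date are the objects from the snippet above: in a Dataiku recipe the partition being built is exposed through flow variables (the DKU_DST_<dimension> pattern), and since the dimension is not a physical column it has to be materialized before filtering on it.

    import dataiku
    from pyspark.sql.functions import col, lit

    # Partition value the recipe is currently building; assumes the
    # partitioning dimension is called "actual_date".
    partition_date = dataiku.dku_flow_variables["DKU_DST_actual_date"]

    # fct_pm_card and end_date are taken from the original snippet.
    result = (fct_pm_card
              .withColumn("actual_date", lit(partition_date))
              .select("application_id", "product", "actual_date")
              .filter(col("actual_date") <= end_date))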
-
Use Custom UDFs on Visual Recipes
Hello Dataikers! Since all visual recipes are based on SparkSQL, some "advanced" aggregations aren't available. In this case, I have 3 values in 3 columns: A, B, C, and I just want to compute the median of them. The problem is that the median function doesn't exist in my current Spark backend version, so I need to use a UDF to do…
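A minimal sketch of such a UDF, assuming plain Python (no pandas) and three numeric columns; statistics.median does the row-wise work, and registering the UDF makes it callable from SparkSQL:

    import statistics

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()

    # Row-wise median of three numeric values.
    @udf(returnType=DoubleType())
    def median3(a, b, c):
        return float(statistics.median([a, b, c]))

    # Make it usable from SQL too, e.g. SELECT median3(A, B, C) FROM ...
    spark.udf.register("median3", median3)

    df = spark.createDataFrame([(1.0, 5.0, 3.0)], ["A", "B", "C"])
    df.withColumn("median_abc", median3("A", "B", "C")).show()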
-
Best Practices For Updating and Renaming Spark and Container Configurations
Hello Dataiku Community, hope all is well! Our team is looking to implement new Spark and container configuration settings on our instances. We are curious to understand the best practices for updating the existing configurations. For context, we have existing Spark configurations already being used by end users,…
-
General / Rule of Thumb Spark Configuration Settings
We are using managed Spark over Kubernetes in EKS. We have about 80 active users on our design node, and about half of them use Spark regularly. We've tried to make things easy by creating simple Spark configurations, but we find that we are continuously changing configurations. With multiple Spark applications, has anyone…
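For what it's worth, one common starting point is a single modest profile with dynamic allocation, so the executor count scales with the job instead of needing many fixed-size configurations; every value below is an assumption to tune against your EKS node shapes, not a recommendation:

    from pyspark import SparkConf

    conf = (SparkConf()
            # Modest per-executor sizing; tune to your node shapes.
            .set("spark.executor.memory", "4g")
            .set("spark.executor.cores", "2")
            # Scale executor count with the job.
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.dynamicAllocation.maxExecutors", "20")
            # Needed for dynamic allocation on Kubernetes, which has no
            # external shuffle service.
            .set("spark.dynamicAllocation.shuffleTracking.enabled", "true"))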
-
How to save Pyspark model from notebook to managed folder
Hi, I'm trying to save a PySpark model from a notebook to a managed folder with model.save("/opt/dataiku/design/managed_folders/PROJECT_TEST/9KeBcUKy/ML_SAVED"), but I'm getting the following error: Py4JJavaError: An error occurred while calling o2981.save.: org.apache.spark.SparkException: Job aborted. at…
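One thing worth checking, sketched under the assumption that the managed folder is stored on the local filesystem of the DSS host: Spark writes through Hadoop's filesystem layer, so resolving the folder path through the Dataiku API and prefixing it with file:// makes the target explicit; on a remote cluster the executors must also be able to reach that path, which is a frequent cause of this failure.

    import dataiku

    # Resolve the managed folder path via the API instead of hard-coding it;
    # get_path() only works for folders on the local filesystem.
    folder = dataiku.Folder("9KeBcUKy")
    path = folder.get_path()

    # "model" is the trained PySpark model from the original post.
    model.save("file://" + path + "/ML_SAVED")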