
Only Python 3 for Pyspark in 10.0.4?

JazzminnNo
Level 2

Hello community,

After upgrading Dataiku to 10.0.4, a PySpark recipe suddenly stopped working on the default Python 2.7 code env due to the syntax error below:

File "/appl/dataiku/dataiku-dss-10.0.4/spark-standalone-home/python/pyspark/find_spark_home.py", line 68
    print("Could not find valid SPARK_HOME while searching {0}".format(paths), file=sys.stderr)
                                                                               ^
SyntaxError: invalid syntax

Our guess is that the underlying PySpark scripts have been updated to Python 3 code and that Python 2 support has been deprecated or removed. Is this correct? If so, can PySpark no longer be used with a Python 2 code env?

Before the upgrade we were on 9.0.3, and the script worked fine with Python 2.

I hope someone can help me solve this problem.

Thank you in advance,

Nofit Kartoredjo


Operating system used: RedHat


4 Replies
AlexT
Dataiker

The base Python was likely upgraded to Python 3 as part of your upgrade.
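A quick way to confirm which interpreter a notebook kernel itself is running (separate from the Python the Spark workers use) is to check from within the notebook; this is a generic stdlib check, not a Dataiku-specific API:

import sys

# Shows the major/minor version of the interpreter running this kernel,
# e.g. (2, 7) or (3, 6)
print(sys.version_info[:2])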

1) If you want to override this setting for a particular notebook, you can set these properties on the SparkConf object that is passed when creating the SparkContext (or via getOrCreate):

from pyspark import SparkConf, SparkContext

# Point the Spark workers at the Python 2.7 interpreter for this notebook only
myconf = SparkConf()
myconf.set("spark.pyspark.python", "python2.7")
sc = SparkContext.getOrCreate(conf=myconf)

2) To use Python 2 globally for PySpark notebooks, you will need to add a new config setting spark.pyspark.python -> python2.7. After saving the new config you should perform a hard refresh (CTRL + SHIFT + R). You do need to restart DSS after making this change.
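For reference, the equivalent key/value pairs in spark-defaults.conf form would look like the lines below. The interpreter name "python2.7" is an assumption; use an absolute path if the binary is not on PATH in your installation:

# Assumption: python2.7 resolves on PATH on the driver and worker hosts
spark.pyspark.python         python2.7
spark.pyspark.driver.python  python2.7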

 

(Screenshots, 2022-04-26: Spark configuration UI showing the spark.pyspark.python setting.)

Let me know if that helps!

JazzminnNo
Level 2
Author

Hi Alex,

Thanks for the suggestion. The first approach is what we are looking for: using 2.7 for one specific notebook. The problem is that the error already occurs at the import stage (see picture) when we use a code env with Python 2 and pyspark installed. So even if I added your code, it wouldn't make a difference, because the error starts at this line:

from dataiku import spark as dkuspark

The scripts in this module probably use Python 3. Is there any way to adjust this?

Thanks again!

 

AlexT
Dataiker

Hi,

Could you confirm whether the Spark integration was re-run after upgrading?

./bin/dssadmin install-spark-integration -standaloneArchive /PATH/TO/dataiku-dss-spark-standalone -forK8S

The Spark standalone archive can be found at:

https://downloads.dataiku.com/public/studio/10.0.4/dataiku-dss-spark-standalone-10.0.4-3.1.2-generic...

https://doc.dataiku.com/dss/latest/containers/setup-k8s.html#optional-setup-spark 

If that still doesn't help, I would suggest raising a support ticket with an instance diagnostic.

JazzminnNo
Level 2
Author

Hi Alex,

Thank you for the help. You can close the ticket; I told the user to just use Python 3.6, since Python 2 gives a deprecation warning.