Error Running HDFS Command in Python Recipe

jkonieczny

I have some code where I need to run an HDFS command in Python to check if a file is present. See below for an example:

import subprocess

# List the target directory; inspect the output to see whether the file is present.
command = 'hdfs dfs -ls /sandbox'
out, err = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE).communicate()


When I run this in a Jupyter notebook in Dataiku, the command completes without any problems. However, when I run the notebook as a Python recipe, I get the following error message multiple times: Failed on local exception: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)];

It looks as if there is a problem with Kerberos when I run the Jupyter notebook as a Recipe. What is the reason for this? Is there a Dataiku setting I can change to make sure the Kerberos ticket is generated properly?


  • Clément_Stenac (Dataiker, Dataiku DSS Core Designer)

    Your company is running Dataiku in Multi-User-Security mode. In this mode, Dataiku performs complex interactions with Kerberos to ensure that each activity runs as the end user, while Dataiku itself holds only a single credential.

    As a side effect of this interaction, a plain Python recipe does not have impersonated credentials, so commands it shells out to cannot obtain a Kerberos ticket. That said, a PySpark recipe (rather than a vanilla Python one) should work: you don't need to actually do anything Spark-related in a PySpark recipe.
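
    To illustrate, here is a minimal sketch of the existence check as it could run inside a PySpark recipe. The helper names (`build_hdfs_test_command`, `hdfs_path_exists`) are my own for illustration; the sketch uses `hdfs dfs -test -e`, which exits 0 when the path exists, so no `-ls` output needs to be parsed. In a PySpark recipe the subprocess inherits the impersonated Kerberos ticket, so the command succeeds.

    ```python
    import subprocess

    def build_hdfs_test_command(path):
        # 'hdfs dfs -test -e <path>' exits 0 when the path exists,
        # non-zero otherwise, so no output parsing is needed.
        return ['hdfs', 'dfs', '-test', '-e', path]

    def hdfs_path_exists(path, runner=subprocess.run):
        """Return True if `path` exists on HDFS."""
        return runner(build_hdfs_test_command(path)).returncode == 0

    # Inside a PySpark recipe (hypothetical path):
    # exists = hdfs_path_exists('/sandbox/my_file.csv')
    ```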
  • tomas
    I can confirm: using PySpark is the solution in these cases. Dataiku handles the impersonation, so you don't have to worry about keytabs, or about running a kinit before the command (or cron-ing a kinit for the specific user).
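    For completeness, if you did have to go the keytab route outside of PySpark, obtaining a ticket non-interactively can be sketched as below. The principal and keytab path are placeholders, and the helper names are my own; `kinit -kt <keytab> <principal>` is the standard MIT Kerberos invocation.

    ```python
    import subprocess

    def build_kinit_command(principal, keytab):
        # 'kinit -kt <keytab> <principal>' obtains a TGT without a password prompt.
        return ['kinit', '-kt', keytab, principal]

    def kinit_from_keytab(principal, keytab, runner=subprocess.run):
        # check=True raises CalledProcessError if kinit fails.
        runner(build_kinit_command(principal, keytab), check=True)

    # Placeholder values -- replace with your cluster's principal and keytab:
    # kinit_from_keytab('myuser@EXAMPLE.COM', '/path/to/myuser.keytab')
    ```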