HDFS/Hadoop configuration for DSS Docker container, Docker running on a Cloudera node host (secure cluster)

UserBird (Dataiker, Alpha Tester, Posts: 535)
Hi there, I was looking for a guide or scripts for this specific scenario.

I'd also like to know whether it is a doable/possible scenario at all.

A Dataiku DSS Docker container is running on a Cloudera CDH host, and the Cloudera cluster requires a Kerberos ticket for authentication.

After reading the available materials, I am thinking the best approach would be to install the Hadoop binaries and map/copy the Hadoop and Kerberos configuration onto the running DSS container. Will this work?

Are there any existing scripts or materials on how to do this properly (for Cloudera CDH)?

thanks!

Best Answer

  • pbertin (Dataiker, Posts: 27)
    Answer ✓
    Hi Rui,

    The approach you propose is indeed the correct one.

    For Kerberos to work, you need to install the Kerberos client package inside the Docker image and mount the krb5.conf configuration file.
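
    A minimal sketch of the Kerberos part, assuming a RHEL/CentOS-based image (the image and container names below are placeholders):

    # install the Kerberos client inside the image, e.g. during the build
    # (the package is named krb5-user on Debian/Ubuntu instead)
    yum install -y krb5-workstation

    # at run time, mount the host's Kerberos configuration read-only
    docker run -d --name dss \
        -v /etc/krb5.conf:/etc/krb5.conf:ro \
        my-dss-image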

    For Hadoop, you will need to install the CDH client packages and mount the various configuration directories (/etc/{hadoop,hive,spark,...}/conf). Beware that depending on the way your CDH cluster is set up, you may have a number of symlink indirections in there.
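
    As a hedged illustration of the mount step (paths follow the usual CDH layout; the image name is a placeholder), resolving the symlink chain on the host side avoids dangling links inside the container:

    # readlink -f follows the /etc/alternatives indirections on the host,
    # so the container sees real directories instead of broken symlinks
    docker run -d --name dss \
        -v "$(readlink -f /etc/hadoop/conf)":/etc/hadoop/conf:ro \
        -v "$(readlink -f /etc/hive/conf)":/etc/hive/conf:ro \
        -v "$(readlink -f /etc/spark/conf)":/etc/spark/conf:ro \
        my-dss-image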

    Another difficulty related to Spark is that the Spark workers (running on the cluster nodes) need to be able to connect back to the Spark driver (running on the DSS host), which imposes extra constraints on the way the container network is configured.
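
    One hedged way to handle the connect-back constraint, assuming Spark 2.1 or later (the property names are standard Spark configuration; the port numbers and image name are placeholders):

    # spark-defaults.conf: pin the callback ports so Docker can publish them,
    # and bind the driver on all interfaces inside the container
    spark.driver.bindAddress    0.0.0.0
    spark.driver.port           40000
    spark.blockManager.port     40001

    # then publish the pinned ports when starting the container
    docker run -d -p 40000:40000 -p 40001:40001 my-dss-image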

    All in all it is a workable setup, though you need some understanding of the inner workings of Hadoop to configure it correctly. We have already done it a few times but do not have readily-exportable materials for it. Do not hesitate to come back to us if you run into difficulties.

    Regards
    Patrice Bertin
    Dataiku

Answers

  • UserBird (Dataiker, Alpha Tester, Posts: 535)
    Hi Patrice, thanks! An additional question: what if I go the other way around and try to add the DSS container instance as a Hadoop node through the Cloudera Manager interface (so that it installs everything automatically)? Do you think this could work? Do you have any experience doing this with a Docker container (given that DSS is not on a "real" host)?
  • pbertin (Dataiker, Posts: 27)
    In practice this should probably work, but I have never attempted this approach. I doubt it is simpler, as Cloudera is quite strict in checking the configuration of managed hosts and is designed more for managing static hosts. It might be worth a try, though.
    You will then definitely need the container to be reachable from outside with a "normal" network stack (no NAT, and a globally-known hostname).
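
    A minimal sketch of the simplest variant, where the container shares the host's network stack outright (the image name is a placeholder):

    # host networking: the container inherits the node's hostname and IP,
    # so there is no NAT and no per-port publishing to manage
    docker run -d --network host my-dss-image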
  • UserBird (Dataiker, Alpha Tester, Posts: 535)
    OK Patrice, agreed. Good info, thanks!
  • UserBird (Dataiker, Alpha Tester, Posts: 535)
    Hi Patrice, I'm getting close but blocked on an issue; maybe you can help.
    So far:
    - Kerberos: hadoop dfs -ls works properly.
    - Spark is installed; I can submit jobs and see them on the cluster.
    - I changed some Spark ports and allowed them through Docker, and checked that the Spark cluster can connect back to the Spark driver in the DSS Docker container.
    For example, this test works properly:
    cd /usr/local/spark/
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

    But when I try to run the DSS Spark setup, I get the error below.

    I also checked that python is not reachable by default from bash (as root or as the dataiku user).

    Any ideas? What am I missing?
    Thanks


    dataiku@6cdd03bb791a:~/dss$ ./bin/dssadmin install-spark-integration
    [+] Saving installation log to /home/dataiku/dss/run/install.log
    *** Error detecting SPARK_HOME using spark-submit
    Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:91)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 11 more
  • pbertin (Dataiker, Posts: 27)
    Hi Rui,

    First of all, I would prefer to continue this thread on a support ticket at support.dataiku.com, as it is getting very specific to your particular setup and might still need a couple more back-and-forths.

    At first glance, you are missing the Python subsystem, which Spark uses to submit Python files (as in: spark-submit file.py). DSS actually does this with a small test Python file as part of the install-spark-integration script; here, spark-submit fails because it cannot find python itself.

    That should be easy to reproduce outside DSS. To fix it, you should probably install Python, or fix the Spark configuration so that it properly locates the Python interpreter you intend it to use.
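
    A hedged sketch of both options, assuming a yum-based image (PYSPARK_PYTHON is a standard Spark environment variable; the paths below are placeholders):

    # option 1: install python inside the container
    yum install -y python

    # reproduce the failure outside DSS with a trivial pyspark job
    echo 'print("ok")' > /tmp/test.py
    /usr/local/spark/bin/spark-submit /tmp/test.py

    # option 2: point Spark at an interpreter that already exists,
    # e.g. in spark-env.sh
    export PYSPARK_PYTHON=/usr/bin/python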

    Regards
    Patrice Bertin
    Dataiku
  • UserBird (Dataiker, Alpha Tester, Posts: 535)
    Hi Patrice, I created the ticket, thanks. After setting up Python I was able to proceed, but now remote Spark execution seems to be looking for the container's "virtual" hostname, which won't work from outside. More info is on the ticket.
    RQ
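
    For readers hitting the same symptom: by default the driver advertises its own hostname (here the container ID) to the executors. A hedged workaround, assuming Spark 2.1 or later (the hostname is a placeholder), is to advertise an externally resolvable name while still binding inside the container:

    # spark-defaults.conf inside the container
    spark.driver.host           cdh-node01.example.com
    spark.driver.bindAddress    0.0.0.0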