Spark can’t read my HDFS datasets

Benoni

Hello,

Spark won't see hdfs:/// and only looks for file:/// when I'm trying to process an HDFS managed dataset. I followed the How-To at:

https://www.dataiku.com/learn/guide/spark/tips-and-troubleshooting.html

However, I couldn't figure out what to edit. Here is my env-spark.sh in DATA_DIR/bin/:

```
export DKU_SPARK_ENABLED=true
export DKU_SPARK_HOME='/usr/local/spark'
export DKU_SPARK_VERSION='2.4.2'
export PYSPARK_DRIVER_PYTHON="$DKUPYTHONBIN"
export DKU_PYSPARK_PYTHONPATH='/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.7-src.zip'
if [ -n "$DKURBIN" ]; then
    export SPARKR_DRIVER_R="$DKURBIN"
fi
```

My Hadoop is installed at /usr/local/hadoop and Spark at /usr/local/spark.

Can you please help me? Thanks in advance.

Best Answer

  • Benoni
    Answer ✓
    Solved it by adding HADOOP_INSTALL and HADOOP_CONF_DIR at the top of env-spark.sh:

    ```
    export HADOOP_INSTALL='/usr/local/hadoop'
    export HADOOP_CONF_DIR='/usr/local/hadoop/etc/hadoop'
    export DKU_SPARK_ENABLED=true
    export DKU_SPARK_HOME='/usr/local/spark'
    export DKU_SPARK_VERSION='2.4.2'
    export PYSPARK_DRIVER_PYTHON="$DKUPYTHONBIN"
    export DKU_PYSPARK_PYTHONPATH='/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.7-src.zip'
    if [ -n "$DKURBIN" ]; then
        export SPARKR_DRIVER_R="$DKURBIN"
    fi
    ```
    Thanks anyway, my dudes.
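
For anyone landing here with the same symptom: as far as I understand it, Spark takes the default filesystem from `fs.defaultFS` in `core-site.xml`, and it finds that file through `HADOOP_CONF_DIR`. If the variable is unset, Hadoop's built-in default of `file:///` applies, which matches the behaviour in the question. Below is a minimal sketch for checking what a given conf directory would give you. The helper names `default_fs` and `check_default_fs` are made up for illustration (they are not part of DSS, Spark, or Hadoop), and the XML matching is naive: it assumes `<name>` and `<value>` sit on the same line or on adjacent lines.

```shell
#!/bin/sh
# Hypothetical helper: print the fs.defaultFS value from core-site.xml
# in the given Hadoop conf directory. Returns 1 if the file is missing.
default_fs() {
    site="$1/core-site.xml"
    if [ ! -f "$site" ]; then
        echo "missing: $site" >&2
        return 1
    fi
    # Naive text match, not a real XML parser: take the matching <name>
    # line plus the line after it, then pull out the <value> contents.
    grep -A 1 '<name>fs.defaultFS</name>' "$site" \
        | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Hypothetical helper: succeed only if the default filesystem is HDFS.
check_default_fs() {
    fs=$(default_fs "$1") || return 1
    case "$fs" in
        hdfs://*)
            echo "OK: default filesystem is $fs"
            ;;
        *)
            echo "Warning: fs.defaultFS is '${fs:-unset}', paths without a scheme will not resolve to HDFS" >&2
            return 1
            ;;
    esac
}

# Example call, using the path from this thread:
#   check_default_fs "${HADOOP_CONF_DIR:-/usr/local/hadoop/etc/hadoop}"
```

If the Hadoop command-line tools are on your PATH, `hdfs getconf -confKey fs.defaultFS` should report the same value without any text matching, so treat the helper above as a fallback.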