Spark can’t read my HDFS datasets
Benoni
Registered Posts: 23 ✭✭✭✭
Hello,
Spark won't see hdfs:/// and just looks for file:/// when I'm trying to process an HDFS-managed dataset. I followed the how-to link at:
https://www.dataiku.com/learn/guide/spark/tips-and-troubleshooting.html
However, I couldn't figure out what to edit. Here is my env-spark.sh in DATA_DIR/bin/:
```
export DKU_SPARK_ENABLED=true
export DKU_SPARK_HOME='/usr/local/spark'
export DKU_SPARK_VERSION='2.4.2'
export PYSPARK_DRIVER_PYTHON="$DKUPYTHONBIN"
export DKU_PYSPARK_PYTHONPATH='/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.7-src.zip'
if [ -n "$DKURBIN" ]; then
export SPARKR_DRIVER_R="$DKURBIN"
fi
```
My Hadoop is located at /usr/local/hadoop and Spark at /usr/local/spark.
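From what I can tell, nothing in that file points Spark at the Hadoop configuration, so it falls back to its built-in default of fs.defaultFS=file:///. A quick way to check this (a sketch, assuming the stock Hadoop layout under /usr/local/hadoop):
```
# Is any Hadoop config visible to the DSS/Spark environment?
# If this prints <unset>, Spark never loads core-site.xml and
# defaults to the local filesystem (file:///)
echo "HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-<unset>}"

# What the cluster's default filesystem actually is, per core-site.xml
grep -A1 'fs.defaultFS' /usr/local/hadoop/etc/hadoop/core-site.xml

# Same question, answered by the Hadoop CLI itself
/usr/local/hadoop/bin/hdfs getconf -confKey fs.defaultFS
```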
Can you please help me? Thanks in advance.
Best Answer
Solved it by adding HADOOP_INSTALL and HADOOP_CONF_DIR to env-spark.sh:
```
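# Point Spark at the Hadoop install and its config directory; this is
# what makes fs.defaultFS resolve to hdfs:// instead of file:///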
export HADOOP_INSTALL='/usr/local/hadoop'
export HADOOP_CONF_DIR='/usr/local/hadoop/etc/hadoop'
export DKU_SPARK_ENABLED=true
export DKU_SPARK_HOME='/usr/local/spark'
export DKU_SPARK_VERSION='2.4.2'
export PYSPARK_DRIVER_PYTHON="$DKUPYTHONBIN"
export DKU_PYSPARK_PYTHONPATH='/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.7-src.zip'
if [ -n "$DKURBIN" ]; then
export SPARKR_DRIVER_R="$DKURBIN"
fi
```
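After restarting DSS so the new environment is picked up, you can verify that Spark now resolves the default filesystem to HDFS. A minimal check (the restart command assumes you run it from your DATA_DIR; the expected hostname in the output is just an example):
```
# Restart DSS so env-spark.sh is re-read (run from your DATA_DIR)
./bin/dss restart

# Ask Spark which filesystem it defaults to; this should now print
# something like hdfs://<namenode>:8020 rather than file:///
/usr/local/spark/bin/spark-shell <<'EOF'
println(spark.sparkContext.hadoopConfiguration.get("fs.defaultFS"))
EOF
```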
Thanks anyway, my dudes