
Spark can’t read my HDFS datasets

Hello, 



Spark won't see hdfs:/// and just looks for file:/// when I'm trying to process an HDFS-managed dataset. I followed the how-to at:



https://www.dataiku.com/learn/guide/spark/tips-and-troubleshooting.html



However, I couldn't figure out what to edit. Here is my env-spark.sh in DATA_DIR/bin/:



```
export DKU_SPARK_ENABLED=true
export DKU_SPARK_HOME='/usr/local/spark'
export DKU_SPARK_VERSION='2.4.2'
export PYSPARK_DRIVER_PYTHON="$DKUPYTHONBIN"
export DKU_PYSPARK_PYTHONPATH='/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.7-src.zip'
if [ -n "$DKURBIN" ]; then
  export SPARKR_DRIVER_R="$DKURBIN"
fi
```

My Hadoop install is at /usr/local/hadoop and Spark is at /usr/local/spark.



Can you please help me? Thanks in advance. 

1 Reply
Author
Solved it by adding HADOOP_INSTALL and HADOOP_CONF_DIR to env-spark.sh:

```
export HADOOP_INSTALL='/usr/local/hadoop'
export HADOOP_CONF_DIR='/usr/local/hadoop/etc/hadoop'
export DKU_SPARK_ENABLED=true
export DKU_SPARK_HOME='/usr/local/spark'
export DKU_SPARK_VERSION='2.4.2'
export PYSPARK_DRIVER_PYTHON="$DKUPYTHONBIN"
export DKU_PYSPARK_PYTHONPATH='/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.7-src.zip'
if [ -n "$DKURBIN" ]; then
  export SPARKR_DRIVER_R="$DKURBIN"
fi
```
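For anyone hitting the same symptom: the key variable is HADOOP_CONF_DIR. Spark resolves paths through the Hadoop Configuration, which reads the fs.defaultFS property from core-site.xml in that directory; when the variable is unset, the property falls back to file:///, so bare paths resolve against the local filesystem instead of HDFS. A rough Python sketch of that lookup (the hdfs://localhost:8020 value is just a made-up example, not from this thread):

```python
# Sketch of how Hadoop/Spark decides the default filesystem: it looks up
# fs.defaultFS in core-site.xml (found via HADOOP_CONF_DIR). If the
# property is missing, Hadoop's built-in default is file:///.
import xml.etree.ElementTree as ET

CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>"""


def default_fs(core_site_xml: str) -> str:
    """Return fs.defaultFS from a core-site.xml string, or Hadoop's default."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            return prop.findtext("value")
    return "file:///"  # what you get when HADOOP_CONF_DIR isn't visible to Spark


print(default_fs(CORE_SITE))  # hdfs://localhost:8020
```

So with HADOOP_CONF_DIR unset, Spark behaves exactly as described in the question: every dataset path is treated as file:///.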
Thanks anyways my dudes