Running Hadoop in your VM is going to be very complex and will not improve your execution speed, as the idea is to leverage the parallelism offered by a cluster of machines.
If you already have a working Hadoop cluster, you can theoretically connect to it from DSS (see Setting up Hadoop integration), but for that the VM needs network access to the Hadoop cluster, which is difficult to arrange with a local VM. You'd need to refer to the documentation of your Hadoop distribution.
Same goes for Spark, see also Setting up Spark integration.
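To make the network access point a bit more concrete, here is a minimal sketch of a reachability check you could run from inside the VM before attempting any integration. It only tests whether a TCP connection can be opened. It says nothing about whether the Hadoop client configuration, Kerberos, or DSS itself is set up correctly. The function names are made up for illustration, and the host names and ports are placeholders to replace with your cluster's values.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_endpoints(endpoints: dict) -> dict:
    """Map each service label to True/False depending on TCP reachability."""
    return {
        label: can_reach(host, port)
        for label, (host, port) in endpoints.items()
    }


# Hypothetical endpoints: the host names are placeholders, and the ports are
# only commonly seen defaults. Your distribution may use different ones.
EXAMPLE_ENDPOINTS = {
    "HDFS NameNode": ("namenode.example.com", 8020),
    "YARN ResourceManager": ("resourcemanager.example.com", 8032),
    "Spark standalone master": ("sparkmaster.example.com", 7077),
}
```

Calling `check_endpoints(EXAMPLE_ENDPOINTS)` from the VM returns a dictionary of booleans. If any entry is `False`, the problem is most likely in the VM's networking (NAT, firewall, DNS) and not in DSS.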
If you need Hadoop and Spark, you may want to install them and DSS on some cloud servers (or your own servers) instead of a VM on your computer.
DSS can be configured to use an already existing Hadoop and Spark cluster.
Hadoop and Spark are effective when they run on external resources, and they are meant for fairly large datasets (hundreds of gigabytes to multiple terabytes).
If your use case has such requirements, you will need an additional VM (minimum 32 GB of RAM) on which you can install a Hadoop distribution. For on-premise installations you can choose among these 3 distributions:
If your DSS is hosted in the cloud, you can also try Azure HDInsight (which provisions a DSS for you along with the cluster), or EMR if you are on Amazon (but you might need a very clear understanding of Hadoop integration to connect to EMR).
You can also choose to run a Spark standalone cluster, but every one of these options (except maybe the HDInsight pre-provisioned DSS instance) requires decent Linux system knowledge and a clear understanding of Hadoop and Spark concepts.