
Hadoop or Spark with VirtualBox


I have DSS running on a VM, and I would like to install Hadoop or Spark because jobs are getting very slow in DSS. Is it possible? How and where do I install it?

2 Replies


Running Hadoop inside your VM would be very complex and would not improve your execution speed: the idea behind Hadoop is to leverage the parallelism offered by a cluster of machines, which a single VM does not provide.

If you already have a working Hadoop cluster, you can in theory connect to it from DSS (see Setting up Hadoop integration), but for that you need to give the VM access to the Hadoop cluster, which is difficult to do with a local VM. You would need to refer to the documentation of your Hadoop distribution.

The same goes for Spark; see also Setting up Spark integration.

If you need Hadoop and Spark, you may want to install them and DSS on some cloud servers (or your own servers) instead of a VM on your computer.



DSS can be configured to use an already existing Hadoop and Spark cluster.

Hadoop and Spark are effective when they run on external resources, and they are meant for fairly big datasets (hundreds of gigabytes to multiple terabytes).
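As a rough illustration of this rule of thumb, the sketch below sums the size of the files under a directory and compares it with a threshold. The function names and the 100 GB cut-off are assumptions chosen for illustration from the dataset sizes quoted above; this is not a DSS feature or an official recommendation.

```python
import os

# Assumed cut-off for illustration only, loosely based on the sizes quoted above.
CLUSTER_THRESHOLD_BYTES = 100 * 1024**3


def directory_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):  # skips broken symlinks
                total += os.path.getsize(full)
    return total


def cluster_probably_worthwhile(size_bytes, threshold=CLUSTER_THRESHOLD_BYTES):
    """Hypothetical heuristic: True when the data is at least `threshold` bytes."""
    return size_bytes >= threshold
```

If the data comes out far below the threshold, slow jobs are more likely a matter of the VM's own CPU and memory than of a missing cluster.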

If your use case has such requirements, you will need an additional VM (minimum 32 GB of RAM) on which you can install a Hadoop distribution. For on-premise installations, you can choose among these 3 distributions:

- Cloudera

- Hortonworks

- MapR 
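Before installing one of these distributions, it may help to check that the candidate VM actually meets the 32 GB minimum mentioned above. The following is a minimal sketch; the function names are hypothetical, and it relies on `os.sysconf`, which is available on Linux and most Unix systems but not on Windows (in which case it reports the amount as unknown).

```python
import os

MIN_RAM_BYTES = 32 * 1024**3  # the 32 GB minimum quoted above


def total_ram_bytes():
    """Return physical RAM in bytes, or None if the platform does not expose it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


def meets_minimum(ram_bytes, minimum=MIN_RAM_BYTES):
    """True when ram_bytes is known and at least `minimum`."""
    return ram_bytes is not None and ram_bytes >= minimum


if __name__ == "__main__":
    ram = total_ram_bytes()
    if ram is None:
        print("Could not determine the amount of RAM on this platform")
    else:
        print(f"RAM: {ram / 1024**3:.1f} GiB, meets minimum: {meets_minimum(ram)}")
```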


If your DSS is hosted in the cloud, you can also try Azure HDInsight (which provisions a DSS instance for you along with the cluster), or EMR if you are on Amazon (but you might need a very clear understanding of Hadoop integration to connect to EMR).

You can also choose to run a Spark standalone cluster, but every one of these options (except maybe the HDInsight preprovisioned DSS instance) requires decent Linux system knowledge and a clear understanding of Hadoop and Spark concepts.
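To give an idea of what talking to a standalone cluster involves, here is a minimal sketch of a plain PySpark client pointing at a standalone master. It is not how DSS itself is configured (that is covered by the DSS Spark integration documentation); the helper names and the application name are hypothetical, and `connect` assumes the `pyspark` package is installed and that the master is reachable from the client machine.

```python
def standalone_master_url(host, port=7077):
    """Build a Spark standalone master URL; 7077 is the default master port."""
    return f"spark://{host}:{port}"


def connect(host, port=7077, app_name="standalone-smoke-test"):
    """Open a session against the standalone master (requires pyspark)."""
    # Imported lazily so that standalone_master_url works without pyspark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(standalone_master_url(host, port))
        .appName(app_name)
        .getOrCreate()
    )
```

If `connect` hangs or fails from inside a local VM, the cause is usually network reachability between the VM and the cluster, which is the difficulty the first reply points out.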

Once that is done, you can check the DSS documentation on Hadoop and Spark integration.

Good luck 


