Hadoop or Spark with VirtualBox

UserBird
UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
Hi,

I have DSS running on a VM and would like to install Hadoop or Spark, because jobs are getting very slow in DSS. Is this possible? How and where do I install it?

thanks

Answers

  • AdrienL
    AdrienL Dataiker, Alpha Tester Posts: 196 Dataiker

    Hi,

    Running Hadoop in your VM is going to be very complex and will not improve your execution speed, as the idea is to leverage the parallelism offered by a cluster of machines.
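To put a rough number on this, here is a minimal sketch using Amdahl's law, an idealized model that is not part of DSS or Hadoop. The function name `ideal_speedup` and the 90% parallel fraction are assumptions chosen for illustration. On a single VM, the worker count cannot usefully exceed the cores the VM already has, so the only way to move further along this curve is to add machines.

```python
def ideal_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup from Amdahl's law.

    parallel_fraction: share of the job that can run in parallel (0.0 to 1.0).
    workers: number of parallel workers (cores or machines).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical job that is 90% parallelizable, run on 1, 4 and 40 workers.
    for workers in (1, 4, 40):
        print(workers, round(ideal_speedup(0.9, workers), 2))
```

The model ignores scheduling and data-shuffling overhead, which a real Hadoop or Spark deployment adds on top, so actual gains on a single VM are likely lower than this bound.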

    If you already have a working Hadoop cluster, you can in principle connect DSS to it (see Setting up Hadoop integration), but the VM must then have network access to the Hadoop cluster, which is difficult to arrange with a local VM. You'd need to refer to the documentation of your Hadoop distribution.
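Before attempting the integration, it may help to confirm that the VM can open a TCP connection to the cluster at all. The following is a minimal sketch using only the Python standard library. The helper name `can_reach`, the host `namenode.example.internal`, and port 8020 are assumptions for illustration; the actual hosts and ports depend on your Hadoop distribution and its configuration.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical NameNode address: replace with a host and port from your cluster.
    print(can_reach("namenode.example.internal", 8020))
```

A successful connection only shows that one port is reachable. Hadoop integration typically involves several services, and the cluster nodes may also need to reach the VM, so treat this as a first check rather than a full validation.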

    Same goes for Spark, see also Setting up Spark integration.

    If you need Hadoop and Spark, you may want to install them and DSS on some cloud servers (or your own servers) instead of a VM on your computer.

  • Jbelafa
    Jbelafa Dataiker Posts: 21 Dataiker

    Hi

    DSS can be configured to use an already existing Hadoop and Spark cluster.

    Hadoop and Spark are effective when they run on external resources, and they are intended for fairly large datasets (several hundred gigabytes to multiple terabytes).

    If your use case has such requirements, you will need an additional VM (minimum 32 GB of RAM) on which you can install a Hadoop distribution. For on-premise installations, you can choose among these three distributions:

    - Cloudera

    - Hortonworks

    - MapR
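As a quick sanity check of the 32 GB figure mentioned above, here is a minimal sketch that reads the physical memory of a Linux machine and compares it with a threshold. The names `meets_ram_minimum` and `physical_ram_bytes` are hypothetical, and the `os.sysconf` keys used are available on Linux but not on every platform.

```python
import os

GIB = 1024 ** 3  # bytes in one gibibyte


def meets_ram_minimum(total_bytes: int, minimum_gib: int = 32) -> bool:
    """Return True if total_bytes is at least minimum_gib gibibytes."""
    return total_bytes >= minimum_gib * GIB


def physical_ram_bytes() -> int:
    """Total physical memory in bytes, as reported by the OS (Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    total = physical_ram_bytes()
    print(f"{total / GIB:.1f} GiB total, meets 32 GiB minimum: {meets_ram_minimum(total)}")
```

A VM provisioned with 32 GB usually reports slightly less than 32 GiB to the operating system, so the threshold is best read as approximate.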

    If your DSS is hosted in the cloud, you can also try Azure HDInsight (which provisions a DSS instance for you along with the cluster) or EMR if you are on Amazon (but you may need a very clear understanding of Hadoop integration to connect to EMR).

    You can also choose to run a Spark standalone cluster, but every one of these options (except perhaps the HDInsight preprovisioned DSS instance) requires decent Linux system knowledge and a clear understanding of Hadoop and Spark concepts.

    Once that is done, you can check the DSS documentation on Hadoop and Spark integration.

    Good luck
