
Re: How to connect to Hadoop?

Level 1

Hi Team,

I have been trying to connect DSS to HDFS.

On machine1 (an EC2 instance), DSS is already configured with Java 11.
On machine2 (an EC2 instance), a Hadoop cluster (resource manager, node manager, name node, data node, secondary name node) is running with Java 8 and Apache Hadoop 3.2.1.

1> As per your post, I have installed "hadoop-hdfs-client-3.2.1.jar" from https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/3.2.1/ on machine1.

2> Which Hadoop configuration files are needed, and with what contents, so that client processes can find and connect to the Hadoop cluster?

Please elaborate on all the steps after point 2 above, as per your post.


As per the Dataiku documentation, both of the above operations (step 1 and step 2) are typically best done through the cluster manager interface,
by adding the DSS machine to the set of hosts managed by the cluster manager, and configuring “client” or “gateway” roles for it.

How can I achieve this using the "cluster manager interface"?
How do I create and use "HDFS gateway" roles?

As per the Dataiku documentation, Hadoop needs to be enabled in the DSS interface.
What steps should be followed to enable Hadoop without doing any Kerberos-related setup?

Please let me know if any step is missing for creating the connection between DSS and HDFS.
Please let me know if you need any other information from me.

Thanks,

Sanjeeb

Dataiker

Hi there,

Even though Hadoop can be deployed manually, it is a very complex job that requires knowledge of the individual components you are deploying and how to configure them.

DSS relies on the node where it is deployed being able to connect to the cluster on its own, i.e. DSS itself does nothing to enable the connection to the cluster, which needs to be already set up and working.
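
As a rough illustration of that prerequisite, here is a minimal Python sketch that could be run on the DSS machine to check whether the Hadoop client works by itself, before DSS is involved at all. It assumes the standard `hdfs` command-line tool is installed and on the PATH; the function names `build_hdfs_ls_command` and `check_hadoop_client` are made up for this example and are not part of DSS or Hadoop.

```python
import shutil
import subprocess


def build_hdfs_ls_command(path="/"):
    """Return the argv list for listing an HDFS path with the hdfs CLI."""
    return ["hdfs", "dfs", "-ls", path]


def check_hadoop_client(path="/", timeout=60):
    """Check whether this machine can list an HDFS path on its own.

    Returns an (ok, message) tuple instead of raising, so it can be used
    as a quick pre-flight check. Hypothetical helper, not a DSS API.
    """
    # The client binaries must be installed locally for anything to work.
    if shutil.which("hdfs") is None:
        return False, "hdfs command not found on PATH"
    try:
        result = subprocess.run(
            build_hdfs_ls_command(path),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "hdfs dfs -ls timed out"
    # A non-zero exit code usually means the client configuration is
    # missing or the cluster is unreachable; stderr carries the details.
    if result.returncode != 0:
        return False, result.stderr.strip()
    return True, result.stdout.strip()


if __name__ == "__main__":
    ok, message = check_hadoop_client()
    print("OK" if ok else "FAILED")
    print(message)
```

If this check fails, the problem is in the Hadoop client installation or configuration on the machine rather than in DSS.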

I recommend using a distribution, like Cloudera, that will ease the deployment burden. Once you've deployed the cluster (it can also be a single node), you can use the regular procedure described in our documentation to connect to the cluster.

I found this link for you that seems to help automate the deployment of a cluster in AWS.

 

Take care,

Omar
Architect @ Dataiku

 

Level 1
Author

Hi Omar,

Thanks for your response.

I have been trying to connect from the DSS interface to HDFS by running a single-node cluster.

If I started using Cloudera, I don't think it would be completely free.

Thanks,

Sanjeeb

 

    

 
