Wednesday, 2 September 2015

INSTALL APACHE HADOOP IN UBUNTU (SINGLE NODE SETUP)


1. Installing Oracle Java 8
Apache Hadoop is a Java framework, so Java must be installed on the machine before Hadoop can run on the operating system. Hadoop supports every Java version greater than 5 (i.e. Java 1.5), so you can also use Java 6 or 7 instead of Java 8.


girdhar@dcsa:~$ sudo add-apt-repository ppa:webupd8team/java
girdhar@dcsa:~$ sudo apt-get update
girdhar@dcsa:~$ sudo apt-get install oracle-java8-installer

It will install Java on your machine at /usr/lib/jvm/java-8-oracle. To verify your Java installation, run the following command:
girdhar@dcsa:~$ java -version
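
The webupd8team PPA also provides an optional helper package that marks Oracle Java 8 as the system default and sets JAVA_HOME; this guide sets JAVA_HOME manually later, so treat the command below as an optional extra rather than a required step:

## Optional: set Oracle Java 8 as the default JVM
girdhar@dcsa:~$ sudo apt-get install oracle-java8-set-default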


2. Creating a Hadoop user for accessing HDFS and MapReduce
To avoid security issues, I recommend setting up a new Hadoop user group and user account to deal with all Hadoop-related activities.
We will create hadoop as a system group and hduser as a system user:

girdhar@dcsa:~$ sudo addgroup hadoop
girdhar@dcsa:~$ sudo adduser --ingroup hadoop hduser
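
Several of the later steps in this guide run sudo commands while logged in as hduser; whether hduser already has sudo rights depends on your setup, so the command below (run from your admin account) is an optional sketch that grants them:

## Optional: allow hduser to use sudo
girdhar@dcsa:~$ sudo adduser hduser sudo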

3. Installing SSH
SSH (“Secure SHell”) is a protocol for securely accessing one machine from another. Hadoop uses SSH to access its slave nodes and to start and manage all HDFS and MapReduce daemons. Install SSH by running the following command:

girdhar@dcsa:~$ sudo apt-get install openssh-server


Now that SSH is installed on the Ubuntu machine, we will be able to connect to this machine remotely as well as connect from it to other machines.

Configuring SSH
Once you have installed SSH on your machine, you can connect to other machines or allow other machines to connect to it. Since we have only this single machine, we will connect to this same machine over SSH. The following commands generate an RSA key pair for hduser and copy the public key from id_rsa.pub to authorized_keys, so that hduser can log in to localhost without a password (make sure the permissions on ~/.ssh and authorized_keys remain restricted to the owner).
# First login with hduser (and from now use only hduser account for further steps)
girdhar@dcsa:~$ sudo su hduser

# Generate ssh key for hduser account
hduser@dcsa:~$ ssh-keygen -t rsa -P ""


## Copy id_rsa.pub to authorized keys from hduser
hduser@dcsa:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
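
To confirm that passwordless SSH works, try logging in to localhost; the first connection will ask you to accept the host key, after which no password prompt should appear:

## Verify passwordless SSH to this machine (accept the host key when prompted)
hduser@dcsa:~$ ssh localhost
hduser@dcsa:~$ exit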

In case you are configuring SSH for another machine (i.e. from the master node to a slave node), you have to update the above command with the hostname of the slave machine.


4. Disabling IPv6
Since Hadoop does not support IPv6, we should disable it; Hadoop has been developed and tested only on IPv4 stacks, and Hadoop nodes will be able to communicate properly only over an IPv4 cluster. (Once you have disabled IPv6 on your machine, you need to reboot it for the change to take effect; to reboot from the command line, use sudo reboot.)
To disable IPv6 on your Linux machine, update /etc/sysctl.conf by adding the following lines at the end of the file:


# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
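
You can also apply the new settings without a full reboot and check the result; a value of 1 from the second command means IPv6 has been disabled:

## Reload sysctl settings and verify (1 = IPv6 disabled)
hduser@dcsa:~$ sudo sysctl -p
hduser@dcsa:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6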


Installing the Hadoop Framework:

1. Download latest Apache Hadoop source from Apache mirrors
First you need to download Apache Hadoop 2.6.0 (http://hadoop.apache.org/releases.html#Download) (i.e. hadoop-2.6.0.tar.gz) or a later release from the Apache download mirrors. Using the latest stable release gets you the newest features as well as the most recent bug fixes. Choose the location where you want to place your Hadoop installation; I have chosen /usr/local/hadoop.
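
If you prefer to fetch the archive from the command line, the commands below are one possible way to do it; the mirror URL is just an example (archive.apache.org keeps old releases) and any Apache mirror will work:

## Download the Hadoop 2.6.0 archive (example mirror) and move it next to the install location
hduser@dcsa:~$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
hduser@dcsa:~$ sudo mv hadoop-2.6.0.tar.gz /usr/local/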


## Change to the parent directory of the Hadoop installation
hduser@dcsa:~$ cd /usr/local/

## Extract Hadoop source
sudo tar -xzvf hadoop-2.6.0.tar.gz

## Move hadoop-2.6.0 to hadoop folder
sudo mv hadoop-2.6.0 /usr/local/hadoop

## Assign ownership of this folder to Hadoop user
sudo chown hduser:hadoop -R /usr/local/hadoop

## Create Hadoop temp directories for Namenode and Datanode
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode

## Again assign ownership of this Hadoop temp folder to Hadoop user
sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/


2. Update Hadoop configuration files
## User profile : Update $HOME/.bashrc
hduser@dcsa:~$ sudo gedit .bashrc


## Update hduser configuration file by appending the
## following environment variables at the end of this file.

# -- HADOOP ENVIRONMENT VARIABLES START -- #
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# -- HADOOP ENVIRONMENT VARIABLES END -- #
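
After saving .bashrc, reload it in the current shell so the new variables take effect; the hadoop version check is optional but confirms that the binaries are now on the PATH:

## Reload the profile and confirm the hadoop command is found
hduser@dcsa:~$ source ~/.bashrc
hduser@dcsa:~$ hadoop version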


Configuration file : hadoop-env.sh
## To edit the file, run the command below
hduser@dcsa:/usr/local/hadoop/etc/hadoop$ sudo gedit hadoop-env.sh

## Update JAVA_HOME variable,
JAVA_HOME=/usr/lib/jvm/java-8-oracle


Configuration file : core-site.xml
## To edit the file, run the command below
hduser@dcsa:/usr/local/hadoop/etc/hadoop$ sudo gedit core-site.xml


## Paste these lines inside the <configuration> tag
<property>
               <name>fs.default.name</name>
               <value>hdfs://localhost:9000</value>
</property>
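
As a side note, fs.default.name is the older property name; Hadoop 2.x deprecates it in favour of fs.defaultFS, although both still work in 2.6.0. If you prefer the newer name, the equivalent entry is:

<property>
               <name>fs.defaultFS</name>
               <value>hdfs://localhost:9000</value>
</property>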




Configuration file : hdfs-site.xml
## To edit the file, run the command below
hduser@dcsa:/usr/local/hadoop/etc/hadoop$ sudo gedit hdfs-site.xml


## Paste these lines inside the <configuration> tag
<property>
               <name>dfs.replication</name>
               <value>1</value>
</property>
<property>
               <name>dfs.namenode.name.dir</name>
               <value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
               <name>dfs.datanode.data.dir</name>
               <value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
</property>


Configuration file : yarn-site.xml
## To edit the file, run the command below
hduser@dcsa:/usr/local/hadoop/etc/hadoop$ sudo gedit yarn-site.xml


## Paste these lines inside the <configuration> tag
<property>
               <name>yarn.nodemanager.aux-services</name>
               <value>mapreduce_shuffle</value>
</property>
<property>
               <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
               <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>


Configuration file : mapred-site.xml
## Copy mapred-site.xml.template to mapred-site.xml
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml


## To edit the file, run the command below
hduser@dcsa:/usr/local/hadoop/etc/hadoop$ sudo gedit mapred-site.xml


## Paste these lines inside the <configuration> tag
<property>
               <name>mapreduce.framework.name</name>
               <value>yarn</value>
</property>


3. Format the Namenode (this is needed only once, before starting HDFS for the first time)
hduser@dcsa:~$ hdfs namenode -format


4. Start all Hadoop daemons
##Start hdfs daemons
hduser@dcsa:/usr/local/hadoop$ start-dfs.sh


##Start MapReduce daemons:
hduser@dcsa:/usr/local/hadoop$ start-yarn.sh


Instead of both of these commands you can also use start-all.sh, but it is now deprecated, so it is not recommended for Hadoop operations.


5. Track/Monitor/Verify
##Verify Hadoop daemons:
hduser@dcsa:~$ jps
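
If all daemons started correctly, jps should typically list the following Java processes (process IDs will differ): NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and Jps itself. If one of them is missing, check the log files under $HADOOP_HOME/logs.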


##Monitor Hadoop ResourceManager and Hadoop NameNode
If you wish to track Hadoop MapReduce as well as HDFS, you can explore the web UIs of the ResourceManager and the NameNode, which are commonly used by Hadoop administrators. Open your default browser and visit the following links.
For ResourceManager – http://localhost:8088
For NameNode – http://localhost:50070