In previous articles, we have seen a quick introduction to Big Data and to Hadoop and its ecosystem. Now it is time to install Hadoop on our local machines so we can start playing with it and with some of the tools in its ecosystem.
In my case, I am going to install it on a macOS system. If you are running a GNU/Linux system, you can skip the initial step (the Homebrew installation) and go straight to the configuration, which should be very similar.
Installing Hadoop
As I said before, we are going to use Homebrew to install Hadoop; specifically, version 3.3.3. Because we are using a package manager the installation is pretty simple but, if you prefer to do it manually, you just need to go to the Hadoop download page and download and extract the desired version. The configuration steps after that are the same.
To install Hadoop using homebrew we just need to execute:
brew install hadoop
A very important consideration is that Hadoop does not work with the latest versions of Java, neither 17 nor 18. It is necessary to have Java 11 installed. We can still keep whatever version we want as the system default, but to execute the Hadoop commands we should select the correct Java version in our terminal. As a side note, as far as I know it should work with Java 8 too, but I have not tested it.
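For example, on macOS we can point JAVA_HOME at the Java 11 runtime just for the current terminal session (a minimal sketch, assuming a Java 11 JDK is already installed on the machine):
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"   # select the Java 11 JDK for this shell only
java -version                                        # verify which version the shell is now using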
Allowing SSH connections to localhost
One previous consideration: for Hadoop to start all nodes properly, it needs to be able to connect to our machine (in this case, localhost) using SSH. By default, in macOS this setting is disabled; to enable it, we just need to go to “System Preferences -> Sharing -> Remote Login“.
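Alternatively, the same setting can be toggled and verified from the terminal (a quick sketch; systemsetup needs administrator privileges):
sudo systemsetup -setremotelogin on   # enable Remote Login (SSH) on macOS
ssh localhost                         # should connect without errors, ideally without a password prompt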
Setting environment variables
Reading about this, I have only found confusion: some tutorials do not even mention adding environment variables, others only declare the HADOOP_HOME variable, and others declare a bunch of stuff. After a few tests, I have collected a list of environment variables we can (and should) define; they allow us to start Hadoop without problems. Is it possible my list has more than required? Yes, absolutely. But, for now, I think the list is comprehensive and short enough; my plan, as I said before, is to play with Hadoop and the tools around it, not to become a Hadoop administrator (yet). Some of them are there just as a reminder that they exist, for example the two related to the native libraries, HADOOP_COMMON_LIB_NATIVE_DIR and HADOOP_OPTS; they can be skipped if we want, and there will be a warning when running Hadoop but nothing important (see the comment below about native libraries). We can add the variables to our “~/.zshrc” file.
The list contains the following variables:
# HADOOP env variables
export HADOOP_HOME="/usr/local/Cellar/hadoop/3.3.3/libexec"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
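Once the variables are in place (remember to open a new terminal or re-source the file), a quick way to check that everything is picked up is:
hadoop version   # should print Hadoop 3.3.3 and related build information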
Configuring Hadoop
File $HADOOP_HOME/etc/hadoop/hadoop-env.sh
In this file, we need to find the JAVA_HOME entry and fill it with the appropriate value. If we read the file carefully, one of the comments says “variable is REQUIRED on ALL platforms except OS X!“. Despite the comment, we are going to set it: if we do not, we will have problems running the NodeManager; it will not start and the logs will show an error related to the Java version in use.
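The line in hadoop-env.sh would look something like this (a sketch, assuming we want it tied to the Java 11 JDK installed on the machine, as discussed earlier):
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"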
File $HADOOP_HOME/etc/hadoop/core-site.xml
Here, we need to configure the HDFS port and a folder for temporary directories. This folder can be wherever you want; in my case, it will reside in a sub-folder of my $HOME directory.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/user/.hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories</description>
  </property>
</configuration>
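Since we are pointing hadoop.tmp.dir at a custom location, it does not hurt to create the folder beforehand (adjust the path to your own user):
mkdir -p /Users/user/.hadoop/hdfs/tmp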
File $HADOOP_HOME/etc/hadoop/mapred-site.xml
We need to add a couple of extra properties:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
File $HADOOP_HOME/etc/hadoop/yarn-site.xml
In a similar way, we need to add a couple of extra properties:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
File $HADOOP_HOME/etc/hadoop/hdfs-site.xml
And, again, one more property:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Formatting the HDFS Filesystem
Before we can start Hadoop and use HDFS, we need to format the file system. We can do this by running the following command:
hdfs namenode -format
Starting Hadoop
Now, finally, we can start Hadoop. If we have added $HADOOP_HOME/bin and $HADOOP_HOME/sbin properly to our $PATH, we should be able to execute the command start-all.sh. If the command is not available, have you restarted the terminal after editing your ~/.zshrc file, or executed source ~/.zshrc?
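In short, starting everything should boil down to something like this (a sketch, assuming the PATH changes above have been applied):
source ~/.zshrc   # reload the shell configuration if the terminal was not restarted
start-all.sh      # starts HDFS (NameNode, DataNode) and YARN (ResourceManager, NodeManager)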
If everything works, we should be able to see all the daemons starting without errors, and we can check which ones are running with the jps command.
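The process ids below are just placeholders, but with everything up the list should look roughly like this:
jps
81384 NameNode
81489 DataNode
81610 SecondaryNameNode
81791 ResourceManager
81888 NodeManager
82014 Jps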
Some useful URLs are:
- NameNode Web UI: http://localhost:9870
- ResourceManager Web UI: http://localhost:8088
- NodeManager Web UI: http://localhost:8042
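At this point, we can also run a couple of quick HDFS commands as a small smoke test (assuming the daemons are up):
hdfs dfs -mkdir -p /user/$(whoami)   # create a home directory for the current user in HDFS
hdfs dfs -ls /                       # list the root of the distributed file system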
What are the native libraries?
If you paid close attention to the output when starting Hadoop, I am sure you noticed a warning line like:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Despite setting the HADOOP_COMMON_LIB_NATIVE_DIR and HADOOP_OPTS variables in the environment, the truth is that the lib folder is not present in our installation.
Hadoop has native implementations of certain components, both for performance reasons and because Java implementations are not available for all of them. These components live in a single, dynamically-linked native library, the native Hadoop library; on *nix platforms it is named libhadoop.so, and on Mac systems libhadoop.a. Hadoop runs fine using the Java classes, but you gain speed with the native ones. More importantly, some compression codecs are only supported natively; if another application depends on Hadoop, or your jars depend on these codecs, jobs will fail unless the native libraries are available.
The installation of the native libraries is beyond the scope of this article.
The last thing is stopping Hadoop; to do this, we just need to run the command stop-all.sh:
stop-all.sh