I’ve used a RaspberryPi 4 board to create a single-node cluster for storage (HDFS) and computing (YARN), and then tested that “cluster” with a Spark job.
⚠️ Disclaimer: This setup is meant as a tech-demo and a minimal usable setup. It does not have the adequate security or reliability levels required for production environments. Use only on trusted, isolated networks!
I booted up the RaspberryPi 4 computer with an image of Ubuntu Server 20.04 for ARM64 architecture.
Then I installed Java 11:
sudo apt-get install openjdk-11-jre
For reference, Hadoop 3.3 requires Java 8 or Java 11; other JVM versions are not supported (see the Hadoop Java Versions page on the Apache wiki for details).
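A quick way to confirm which Java runtime ended up on the board:
java -version
# should report an OpenJDK 11 runtime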
Then I had to download and uncompress the Hadoop 3.3.1 archive (latest release at the time of writing). Pay attention to the architecture included in the download link (aarch64 is equivalent to ARM64).
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1-aarch64.tar.gz
tar xf hadoop-3.3.1-aarch64.tar.gz
cd hadoop-3.3.1
Now the configuration step: I had to edit the core-site.xml, hdfs-site.xml and yarn-site.xml files.
The following sample is for the etc/hadoop/core-site.xml configuration file.
The fs.defaultFS key specifies the IP address of the board (its internal address is 192.168.111.112, replace it to match your network) and the port used to access the data over HDFS.
The hadoop.tmp.dir key specifies where all the files will be stored; in my case it points to the mount-point of an external hard-drive, to avoid stressing the RaspberryPi’s sdcard.
Reference documentation and default values for this file are listed in core-default.xml.
vim etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.111.112:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ubuntu/media/hadoop</value>
  </property>
</configuration>
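One caveat worth checking: hadoop.tmp.dir should point to a directory that exists (or can be created) and is writable by the user running the daemons. Assuming the external drive is already mounted under /home/ubuntu/media, a quick sanity check looks like this:
mkdir -p /home/ubuntu/media/hadoop
df -h /home/ubuntu/media/hadoop
# the "Mounted on" column should show the external drive, not the sdcard's root filesystem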
The following sample is for the etc/hadoop/hdfs-site.xml configuration file.
The dfs.replication key specifies the number of copies each file should have inside the cluster; on a single-node cluster the data cannot be replicated across multiple servers, so the value is 1.
The dfs.namenode.rpc-address key should contain the IP address and port where the HDFS namenode listens.
Reference documentation and default values for this file are listed in hdfs-default.xml.
vim etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address</name>
    <value>192.168.111.112:8020</value>
  </property>
</configuration>
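Once JAVA_HOME is set up (see the .bashrc step further down), the values Hadoop actually resolved can be double-checked from the hadoop-3.3.1 directory with the getconf tool:
./bin/hdfs getconf -confKey fs.defaultFS
# hdfs://192.168.111.112:8020
./bin/hdfs getconf -confKey dfs.replication
# 1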
The following sample is for the etc/hadoop/yarn-site.xml configuration file.
Clients using the “cluster” read the yarn.resourcemanager.hostname value to find the address of the master, where the compute tasks will be submitted.
Reference documentation and default values for this file are listed in yarn-default.xml.
vim etc/hadoop/yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.111.112</value>
  </property>
</configuration>
For the Hadoop cluster to start correctly, we must configure key-based ssh connections to the board.
sudo systemctl start sshd # start the ssh daemon, if it wasn't already running
ssh-keygen # generate ssh key for local user
ssh-copy-id localhost # set up ssh login to localhost without password
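If the key setup worked, connecting to localhost should no longer ask for a password:
ssh localhost hostname
# should print the board's hostname (here "ubuntu") without any password prompt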
Another prerequisite of Hadoop is to have the JAVA_HOME variable pointing to a JRE. For this we’ll edit the .bashrc file:
vim ~/.bashrc
Delete the part that skips the rest of .bashrc when running inside a non-interactive session, and export the JAVA_HOME variable. Follow the next diff snippet for more details:
--- a/.bashrc
+++ b/.bashrc
@@ -2,11 +2,7 @@
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples
-# If not running interactively, don't do anything
-case $- in
- *i*) ;;
- *) return;;
-esac
+export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-arm64
# don't put duplicate lines or lines starting with space in the history.
# See bash(1) for more options
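The interactivity guard is removed because the Hadoop scripts launch the daemons over non-interactive SSH sessions, and those sessions also need to see JAVA_HOME. A quick way to verify that such a session now picks up the variable:
ssh localhost 'echo $JAVA_HOME'
# should print /usr/lib/jvm/java-11-openjdk-arm64; an empty line means .bashrc still bails out early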
Log out of the console and log back in. Go back to the Hadoop directory, start the Hadoop services, and format the namenode:
cd hadoop-3.3.1
./sbin/start-all.sh
#WARNING: Attempting to start all Apache Hadoop daemons as ubuntu in 10 seconds.
#WARNING: This is not a recommended production deployment configuration.
#WARNING: Use CTRL-C to abort.
#Starting namenodes on [ubuntu]
#Starting datanodes
#Starting secondary namenodes [ubuntu]
#Starting resourcemanager
#Starting nodemanagers
./bin/hdfs namenode -format
Now that the namenode is formatted, the services need to be restarted:
./sbin/stop-all.sh
./sbin/start-all.sh
If no errors are displayed, the HDFS cluster is ready. We can test it as follows:
./bin/hdfs dfs -ls /
# HDFS should be empty.
./bin/hdfs dfs -mkdir /stuff
# Make a folder in the HDFS root location
./bin/hdfs dfs -ls /
# Found 1 items
# drwxr-xr-x - ubuntu supergroup 0 2021-09-19 10:00 /stuff
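We can also ask HDFS and YARN directly whether the single node registered correctly:
./bin/hdfs dfsadmin -report
# should report one live datanode, with the capacity of the drive behind hadoop.tmp.dir
./bin/yarn node -list
# should report one node in RUNNING state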
To inspect the status, we can also use the web interfaces:
http://IP_ADDRESS:8088 (YARN resourcemanager)
http://IP_ADDRESS:8042 (YARN nodemanager)
http://IP_ADDRESS:9864 (HDFS datanode)
http://IP_ADDRESS:9870 (HDFS namenode)
Time to leave the RaspberryPi and go to a regular computer.
Once again, we’ll need a Java environment and Hadoop available locally.
sudo apt-get install openjdk-11-jre
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
tar xf hadoop-3.3.1.tar.gz
And we can ask for the list of files present on the HDFS cluster (note that the 192.168.111.112 address must be updated to match your network):
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 ./hadoop-3.3.1/bin/hdfs dfs -ls hdfs://192.168.111.112/
# Found 1 items
# drwxr-xr-x - ubuntu supergroup 0 2021-09-19 10:00 hdfs://192.168.111.112/stuff
We can also try to upload a file to the HDFS cluster.
# Download a book from the Gutenberg Project -
# The Adventures of Sherlock Holmes, by Arthur Conan Doyle
wget https://www.gutenberg.org/files/1661/1661-0.txt -O book.txt
# Upload it
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 HADOOP_USER_NAME=ubuntu ./hadoop-3.3.1/bin/hdfs dfs -put book.txt hdfs://192.168.111.112:8020/book.txt
If the local user doesn’t match the remote user on the server, the upload will fail. As a workaround, set the HADOOP_USER_NAME environment variable to the remote username.
Then, to check that the upload went fine:
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 ./hadoop-3.3.1/bin/hdfs dfs -ls hdfs://192.168.111.112/
# Found 2 items
#-rw-r--r-- 1 ubuntu supergroup 607430 2021-09-19 10:00 hdfs://192.168.111.112/book.txt
#drwxr-xr-x - ubuntu supergroup 0 2021-09-19 10:00 hdfs://192.168.111.112/stuff
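As an extra check, we can read the beginning of the uploaded book straight from HDFS:
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 ./hadoop-3.3.1/bin/hdfs dfs -cat hdfs://192.168.111.112:8020/book.txt | head -n 3
# should print the first lines of the Project Gutenberg header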
If nothing strange happens, it means we’ve correctly configured both the client and the server for HDFS.
The following steps should be done on the “client” node (the one submitting the tasks for execution on the Hadoop cluster).
The first step is to install Scala from the distribution’s package manager. My version of Ubuntu provides Scala v2.11.
sudo apt-get install scala
scala -version
# Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
Now we download the Spark library, to match the Scala version (Scala 2.11 is compatible with Spark 2.x, while Scala 2.12 is compatible with Spark 3.x).
wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
tar xf spark-2.4.8-bin-hadoop2.7.tgz
Because the code will run on the Hadoop cluster, we need to copy the cluster’s configuration files from the Hadoop node to the client. Note that the IP address, username, and paths must match your network and setup.
mkdir configs
cd configs
scp ubuntu@192.168.111.112:hadoop-3.3.1/etc/hadoop/core-site.xml .
scp ubuntu@192.168.111.112:hadoop-3.3.1/etc/hadoop/hdfs-site.xml .
scp ubuntu@192.168.111.112:hadoop-3.3.1/etc/hadoop/yarn-site.xml .
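With the configuration files copied, we can already check that the client reaches the cluster through them. A quick sketch, run from inside the configs directory and assuming hadoop-3.3.1 was extracted one level up (adjust the paths otherwise):
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 HADOOP_CONF_DIR=$PWD ../hadoop-3.3.1/bin/yarn node -list
# should list the RaspberryPi's single nodemanager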
Now, let’s use the following snippet of Scala code to count how many times whispering is mentioned in that book: the code pulls the book.txt file from HDFS, splits the lines into words, and counts how many words start with “whisper”. In the end it creates a file in HDFS named result.txt where it stores the count. Once again, update the HDFS addresses to match your environment.
// bookreader.scala
package bookreader
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter
object BookReader {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("bookreader")
    val sc: SparkContext = new SparkContext(conf)

    // Count the words starting with "whisper" in the book stored on HDFS.
    val prefixOccurences: Int =
      sc.textFile("hdfs://192.168.111.112:8020/book.txt").
        flatMap(_.split(' ').filter(_.startsWith("whisper"))).
        count().
        toInt

    // Write the count into a small text file on HDFS.
    val fs = FileSystem.get(new Configuration())
    val output = fs.create(new Path("hdfs://192.168.111.112:8020/result.txt"))
    val writer = new PrintWriter(output)
    writer.write("Number of whispers found:" + prefixOccurences + ".\n")
    writer.close()

    sc.stop()
  }
}
To compile the application we’ll use the scalac compiler, with hardcoded classpaths. I’ve replaced the path where I’ve uncompressed the Spark archive with ... - update it to match your environment. This command will generate an app.jar file.
scalac -classpath .../spark-2.4.8-bin-hadoop2.7/jars/spark-core_2.11-2.4.8.jar:.../spark-2.4.8-bin-hadoop2.7/jars/hadoop-common-2.7.3.jar:.../spark-2.4.8-bin-hadoop2.7/jars/jackson-core-2.6.7.jar:.../spark-2.4.8-bin-hadoop2.7/jars/jackson-annotations-2.6.7.jar bookreader.scala -d app.jar
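To confirm that the compiled class actually ended up in the archive, its contents can be listed (app.jar is a plain zip file; this assumes unzip is installed):
unzip -l app.jar
# should list bookreader/BookReader.class plus a few companion classes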
Now we deploy the app.jar in the cluster. We must specify the following variables:
HADOOP_CONF_DIR: the directory on the client machine where we copied the core-site.xml, hdfs-site.xml and yarn-site.xml files from the cluster.
HADOOP_USER_NAME: the username under which the Hadoop daemons were started, if it differs from the username used on the client machine.
SPARK_LOCAL_IP: the IP of the client machine.
I’ve replaced the path of the Hadoop config directory and of the Spark libraries with ... - update them to match your environment.
HADOOP_CONF_DIR=.../configs HADOOP_USER_NAME=ubuntu SPARK_LOCAL_IP=192.168.111.62 .../spark-2.4.8-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster --class bookreader.BookReader app.jar
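To follow the job while it runs, or to debug a failure, the YARN CLI can be pointed at the same configuration; the application ID is printed by spark-submit, and <application-id> below is a placeholder. Note that yarn logs only works if log aggregation is enabled on the cluster, otherwise the logs stay in the nodemanager's local directories on the board.
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 HADOOP_CONF_DIR=.../configs HADOOP_USER_NAME=ubuntu ./hadoop-3.3.1/bin/yarn application -list -appStates ALL
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 HADOOP_CONF_DIR=.../configs HADOOP_USER_NAME=ubuntu ./hadoop-3.3.1/bin/yarn logs -applicationId <application-id>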
After we receive confirmation that the job has finished, we can inspect the result:
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 ./hadoop-3.3.1/bin/hadoop fs -get hdfs://192.168.111.112:8020/result.txt
cat result.txt
# Number of whispers found:18