Single-node cluster setup with Hadoop, Yarn and Spark on RaspberryPi4

Introduction

I’ve used a RaspberryPi 4 board to create a single-node cluster for storage (HDFS) and computing (Yarn), and then tested that “cluster” with a spark job.

⚠️ Disclaimer: This setup is meant as a tech-demo and a minimal usable setup. It does not have the adequate security or reliability levels required for production environments. Use only on trusted, isolated networks!

Configuring Hadoop

I booted up the RaspberryPi 4 computer with an image of Ubuntu Server 20.04 for ARM64 architecture.

Then I installed Java 11:

sudo apt-get install openjdk-11-jre

For reference, Hadoop 3.3 requires java 8 or java 11 - the other versions of the JVM are not supported. Details here.

Then I had to download and uncompress the Hadoop 3.3.1 archive (latest release at the time of writing). Pay attention to the architecture included in the download link (aarch64 is equivalent to ARM64).

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1-aarch64.tar.gz
tar xf hadoop-3.3.1-aarch64.tar.gz
cd hadoop-3.3.1

Now the configuration step: I had to edit the core-site.xml, hdfs-site.xml and yarn-site.xml files:

core-site.xml file

The following sample is for the etc/hadoop/core-site.xml configuration file. The fs.defaultFS key specifies the IP address of the board (its internal address is 192.168.111.112, replace it to match your network) and the port used to access the data over HDFS. The hadoop.tmp.dir key specifies where all the files will be stored. In my case, it specifies the mount-point of an external hard-drive - to avoid stressing the RaspberryPi’s sdcard.

Reference documentation and default values for this file available here.

vim etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.111.112:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ubuntu/media/hadoop</value>
  </property>
</configuration>

hdfs-site.xml file

The following sample is for the etc/hadoop/hdfs-site.xml configuration file. The dfs.replication key specifies the number of copies a file should have inside the cluster, and having a single-node cluster we cannot have it replicated across multiple servers. The dfs.namenode.rpc-address key should include the IP address and port where the HDFS is listening.

Reference documentation and default values for this file available here.

vim etc/hadoop/hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address</name>
    <value>192.168.111.112:8020</value>
  </property>
</configuration>

yarn-site.xml file

The following sample is for the etc/hadoop/yarn-site.xml configuration file. The clients using the “cluster” will use the yarn.resourcemanager.hostname value to find the address of the master, where the compute tasks will be submitted.

Reference documentation and default values for this file available here.

vim etc/hadoop/yarn-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.111.112</value>
  </property>
</configuration>

SSH key setup and bash environment

For the Hadoop cluster to start correctly, we must configure key-based ssh connections to the board.

sudo systemctl start sshd     # enable the ssh daemon, if it wasn't already running
ssh-keygen                    # generate ssh key for local user
ssh-copy-id localhost         # set up ssh login to localhost without password

Another prerequisite of Hadoop is to have the JAVA_HOME variable pointing to a JRE. For this we’ll edit the .bashrc file:

vim ~/.basrc

Delete the part that skips over the .bashrc content when running inside a non-interactive sessions, and export the JAVA_HOME variable. Follow the next diff snippet for more details:

--- a/.bashrc
+++ b/.bashrc
@@ -2,11 +2,7 @@
 # see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
 # for examples
 
-# If not running interactively, don't do anything
-case $- in
-    *i*) ;;
-      *) return;;
-esac
+export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-arm64
 
 # don't put duplicate lines or lines starting with space in the history.
 # See bash(1) for more options

Log out of the console, and log back in. Go back to the hadoop directory, start the hadoop services, and do a namenode format:

cd hadoop-3.3.1

./sbin/start-all.sh
#WARNING: Attempting to start all Apache Hadoop daemons as ubuntu in 10 seconds.
#WARNING: This is not a recommended production deployment configuration.
#WARNING: Use CTRL-C to abort.
#Starting namenodes on [ubuntu]
#Starting datanodes
#Starting secondary namenodes [ubuntu]
#Starting resourcemanager
#Starting nodemanagers

./bin/hdfs namenode -format

Now that the namenode is formatted, it needs to be restarted:

./sbin/stop-all.sh
./sbin/start-all.sh

If no errors are displayed, the HDFS cluster is ready. We can test it as follows:

./bin/hdfs dfs -ls /
# HDFS should be empty.

./bin/hdfs dfs -mkdir /stuff
# Make a folder in the HDFS root location

./bin/hdfs dfs -ls /
# Found 1 items
# drwxr-xr-x   - ubuntu supergroup          0 2021-09-19 10:00 /stuff

To inspect the status, we can also use the web interfaces:

http://IP_ADDRESS:8088
http://IP_ADDRESS:8042
http://IP_ADDRESS:9864
http://IP_ADDRESS:9870

Accessing HDFS from another computer

Time to leave the RaspberryPi and go to a regular computer.

Once again we’ll need the Java Environment and Hadoop available locally.

sudo apt-get install openjdk-11-jre
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
tar xf hadoop-3.3.1.tar.gz

And we can ask the list of the files present on the HDFS cluster (note that the 192.168.111.112 address must be updated to match your network):

JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 ./hadoop-3.3.1/bin/hdfs dfs -ls hdfs://192.168.111.112/
# Found 1 items
# drwxr-xr-x   - ubuntu supergroup          0 2021-09-19 10:00 hdfs://192.168.111.112/stuff

We can also attempt to upload a file in the HDFS cluster.

# Download a book from the Gutenberg Project -
# The Adventures of Sherlock Holmes, by Arthur Conan Doyle
wget https://www.gutenberg.org/files/1661/1661-0.txt -O book.txt

# Upload it
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 HADOOP_USER_NAME=ubuntu ./hadoop-3.3.1/bin/hdfs dfs -put book.txt hdfs://192.168.111.112:8020/book.txt

If the local user doesn’t match the remote user of the server, the upload will fail. As a workaround, you have to set the HADOOP_USER_NAME to point to the remote username.

Then, to check that the upload went fine:

JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 ./hadoop-3.3.1/bin/hdfs dfs -ls hdfs://192.168.111.112/
# Found 2 items
#-rw-r--r--   1 ubuntu supergroup     607430 2021-09-19 10:00 hdfs://192.168.111.112/book.txt
#drwxr-xr-x   - ubuntu supergroup          0 2021-09-19 10:00 hdfs://192.168.111.112/stuff

If nothing strange happens, it means that we’ve configured correctly both the client and the server for HDFS.

Scala and Spark

The following steps should be done on the “client” node (the one submitting the tasks for execution on the Hadoop cluster).

The first step is to install Scala from the distribution’s package manager. My version of Ubuntu provides Scala v2.11.

sudo apt-get install scala
scala -version
# Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Now we download the Spark library, to match the Scala version (Scala 2.11 is compatible with Spark 2.x, while Scala 2.12 is compatible with Spark 3.x).

wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
tar xf spark-2.4.8-bin-hadoop2.7.tgz

Because the code will run on the Hadoop cluster, we need to copy locally the files from the Hadoop node. Note that the IP address and the paths must match your local network and users and paths.

mkdir configs
cd configs
scp ubuntu@192.168.111.112:hadoop-3.3.1/etc/hadoop/core-site.xml .
scp ubuntu@192.168.111.112:hadoop-3.3.1/etc/hadoop/hdfs-site.xml .
scp ubuntu@192.168.111.112:hadoop-3.3.1/etc/hadoop/yarn-site.xml .

Now, let’s use the following snippet of Scala code to count how many times whispering is mentioned in that book: the code pulls the book.txt file from HDFS, splits the lines into words, and counts how many words start with “whisper”. In the end it creates a file in HDFS named result.txt where it stores the count.

Once again, update the HDFS addresses to match your environment.

// bookreader.scala

package bookreader
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter

object BookReader {

  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf() setAppName("bookreader")
    val sc: SparkContext = new SparkContext(conf)

    val prefixOccurences: Int =
      sc.
      textFile("hdfs://192.168.111.112:8020/book.txt").
      flatMap(_.split(' ').filter(_.startsWith("whisper"))).
      count().
      toInt

    val fs = FileSystem.get(new Configuration())
    val output = fs.create(new Path("hdfs://192.168.111.112:8020/result.txt"))
    val writer = new PrintWriter(output)
    writer.write("Number of whispers found:" + prefixOccurences + ".\n")
    writer.close()

    sc.stop()
  }
}

To compile the application we’ll use the scalac compiler, with hardcoded classpaths.

I’ve replaced the path where I’ve uncompressed the archive with ... - update it to match your environment. This command will generate an app.jar.

scalac -classpath .../spark-2.4.8-bin-hadoop2.7/jars/spark-core_2.11-2.4.8.jar:.../spark-2.4.8-bin-hadoop2.7/jars/hadoop-common-2.7.3.jar:.../spark-2.4.8-bin-hadoop2.7/jars/jackson-core-2.6.7.jar:.../spark-2.4.8-bin-hadoop2.7/jars/jackson-annotations-2.6.7.jar bookreader.scala -d app.jar

Now we deploy the app.jar in the cluster: we must specify the following variables:

HADOOP_CONF_DIR: the directory where we copied the core-site.xml, hdfs-site.xml and yarn-site.xml from the cluster, on the client machine.
HADOOP_USER_NAME: the username under which the hadoop daemons were started, if differs from the username used on the client machine.
SPARK_LOCAL_IP: the IP of the client machine.

I’ve replaced the path of the hadoop config-directory and of the spark libraries with ... - update it to match your environment.

HADOOP_CONF_DIR=.../configs HADOOP_USER_NAME=ubuntu SPARK_LOCAL_IP=192.168.111.62 .../spark-2.4.8-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster --class bookreader.BookReader app.jar

After we receive confirmation that the job has finished, we can inspect the result:

JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 ./hadoop-3.3.1/bin/hadoop fs -get hdfs://192.168.111.112:8020/result.txt

cat result.txt 
# Number of whispers found:18