Setting Up an HDFS Cluster with Docker Compose: A Step-by-Step Guide

Dirk Steynberg
4 min read · Aug 12, 2024

As a data engineer, I’ve always been fascinated by the power of distributed systems. Recently, I embarked on a journey to set up a Hadoop Distributed File System (HDFS) cluster using Docker Compose. This article shares my experience and provides a step-by-step guide for those looking to replicate this setup.

Why Docker Compose for HDFS?

Docker Compose offers a convenient way to define and run multi-container Docker applications. For an HDFS cluster, which consists of multiple nodes (NameNode and DataNodes), Docker Compose provides an ideal solution for creating a reproducible and easily manageable environment.

The Setup

Our HDFS cluster will consist of one NameNode and two DataNodes. Here’s what you’ll need:

  1. Docker and Docker Compose installed on your system
  2. Basic understanding of HDFS and Docker concepts

Let’s dive into the setup process!

1. Project Structure

First, create a directory structure for your project:

hdfs-docker-cluster/
├── docker-compose.yml
├── hadoop_config/
│ ├── core-site.xml
│ ├── hdfs-site.xml
│ └── ... (other Hadoop configuration files)
├── start-hdfs.sh
├── init-datanode.sh
├── hadoop_namenode/
├── hadoop_datanode1/
└── hadoop_datanode2/
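
If you'd like to scaffold this in one go, something like the following should do it (assuming a bash shell; the file names match the tree above):

mkdir -p hdfs-docker-cluster/{hadoop_config,hadoop_namenode,hadoop_datanode1,hadoop_datanode2}
cd hdfs-docker-cluster
touch docker-compose.yml hadoop_config/core-site.xml hadoop_config/hdfs-site.xml
touch start-hdfs.sh init-datanode.sh
chmod +x start-hdfs.sh init-datanode.sh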

2. Docker Compose Configuration

Create a docker-compose.yml file with the following content:

version: '3'

services:
  namenode:
    image: apache/hadoop:3.3.5
    container_name: namenode
    hostname: namenode
    user: root
    environment:
      - HADOOP_HOME=/opt/hadoop
    volumes:
      - ./hadoop_namenode:/opt/hadoop/data/nameNode
      - ./hadoop_config:/opt/hadoop/etc/hadoop
      - ./start-hdfs.sh:/start-hdfs.sh
    ports:
      - "9870:9870"
    command: [ "/bin/bash", "/start-hdfs.sh" ]
    networks:
      hdfs_network:
        ipv4_address: 172.20.0.2

  datanode1:
    image: apache/hadoop:3.3.5
    container_name: datanode1
    hostname: datanode1
    user: root
    environment:
      - HADOOP_HOME=/opt/hadoop
    volumes:
      - ./hadoop_datanode1:/opt/hadoop/data/dataNode
      - ./hadoop_config:/opt/hadoop/etc/hadoop
      - ./init-datanode.sh:/init-datanode.sh
    depends_on:
      - namenode
    command: [ "/bin/bash", "/init-datanode.sh" ]
    networks:
      hdfs_network:
        ipv4_address: 172.20.0.3

  datanode2:
    # Similar configuration to datanode1, with different container_name and IP

networks:
  hdfs_network:
    ipam:
      driver: default
      config:
        - subnet: 172.20.0.0/16
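
For completeness, the datanode2 service elided above mirrors datanode1; here is a sketch, assuming the next address in the subnet (172.20.0.4) is free:

  datanode2:
    image: apache/hadoop:3.3.5
    container_name: datanode2
    hostname: datanode2
    user: root
    environment:
      - HADOOP_HOME=/opt/hadoop
    volumes:
      - ./hadoop_datanode2:/opt/hadoop/data/dataNode
      - ./hadoop_config:/opt/hadoop/etc/hadoop
      - ./init-datanode.sh:/init-datanode.sh
    depends_on:
      - namenode
    command: [ "/bin/bash", "/init-datanode.sh" ]
    networks:
      hdfs_network:
        ipv4_address: 172.20.0.4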

3. Initialization Scripts

Create two scripts to initialize the NameNode and DataNodes:

start-hdfs.sh for the NameNode:

#!/bin/bash
# Format the NameNode only on the first run (the current/ directory exists once it has been formatted)
if [ ! -d "/opt/hadoop/data/nameNode/current" ]; then
    echo "Formatting NameNode..."
    hdfs namenode -format
fi
hdfs namenode

init-datanode.sh for the DataNodes:

#!/bin/bash
# Clear any stale DataNode data so the node registers cleanly with a freshly formatted NameNode,
# then make sure the hadoop user owns the data directory before starting the daemon
rm -rf /opt/hadoop/data/dataNode/*
chown -R hadoop:hadoop /opt/hadoop/data/dataNode
chmod 755 /opt/hadoop/data/dataNode
hdfs datanode

4. Hadoop Configuration

In the hadoop_config/ directory, place your Hadoop configuration files. The key files are core-site.xml, which points every node (and client) at the NameNode, and hdfs-site.xml, which sets the storage directories and replication factor. Make sure the paths match the volume mounts defined in docker-compose.yml; a minimal example of each is sketched below.
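
Here is a minimal sketch of both files, assuming NameNode RPC port 9000 and the data directories mounted in docker-compose.yml; adjust the values to your environment:

<!-- core-site.xml: point all nodes and clients at the NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: storage paths match the container volume mounts; replication of 2 matches the two DataNodes -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>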

5. Launch the Cluster
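
Bring everything up from the project root (this assumes Docker Compose v2; with the older standalone binary, use docker-compose instead):

docker compose up -d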

Once the containers are up and running, you can verify the cluster’s functionality:

  1. Access the NameNode web interface at http://localhost:9870
  2. Use HDFS commands through the NameNode container:
docker exec -it namenode hdfs dfs -ls /

Add a file to HDFS:

echo "Hello, HDFS" > test.txt
docker cp test.txt namenode:/tmp/
docker exec -it namenode hdfs dfs -put /tmp/test.txt /

View the file in HDFS:

docker exec -it namenode hdfs dfs -cat /test.txt

You can also view the cluster status in the web UI.

Fig1. Screenshot of the Hadoop Webserver Datanode Information

Here we can see our two DataNodes, along with the usage histogram reflecting the operations we just ran.

We can also take a look at the files we created and moved by browsing the directory tree in the web UI:

Fig2. Screenshot of the Hadoop Webserver Browse Directory

Lessons Learned

Setting up this HDFS cluster with Docker Compose taught me several valuable lessons:

  1. Initialization is key: Proper initialization of the NameNode and DataNodes is crucial for a smooth setup.
  2. Configuration management: Keeping Hadoop configuration files separate and mounting them as volumes provides flexibility and ease of management.
  3. Networking considerations: Defining a custom network with fixed IP addresses ensures consistent communication between nodes.
  4. Persistence challenges: Managing data persistence in a containerized environment requires careful consideration, especially for production use cases (one alternative to host bind mounts is sketched below).
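
On that last point, a hedged sketch of one alternative: Docker named volumes instead of host bind mounts, which lets Docker manage the data lifecycle. The volume name here is illustrative, not part of the setup above:

services:
  namenode:
    volumes:
      - namenode_data:/opt/hadoop/data/nameNode   # named volume instead of ./hadoop_namenode

volumes:
  namenode_data: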

Conclusion

This project demonstrates the power of Docker Compose for setting up complex distributed systems like HDFS. While this setup is great for learning and experimentation, remember that it is not suitable for production environments without further modifications. I hope this guide helps you on your journey of exploring HDFS and containerized applications. Happy coding!

P.S. If you’d like to see how I got this set up using an at-home Raspberry Pi cluster, check it out here!

