Deploying an Apache Spark Cluster with Docker

Introduction

Apache Spark is a powerful unified analytics engine for large-scale data processing. This article explains in detail how to use Docker and Docker Compose to quickly deploy a Spark cluster with one Master node and two Worker nodes. This approach not only simplifies cluster setup, but also provides advantages such as resource isolation and easy scaling.

Prerequisites

Before you begin, make sure that the following components are ready in your environment (a quick check is shown after the list):

  • Docker Engine is installed and running.
  • Docker Compose is installed; it is used to define and run multi-container applications.
  • The host can reach Docker Hub to pull the required images.
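
You can confirm that both tools are available with the following commands, which only print version information and change nothing:

docker --version
docker compose version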

Docker Compose file configuration

First, create a file named docker-compose.yml and add the following content:

version: '3'
services:
  master:
    image: bitnami/spark:3.5.4
    container_name: master
    user: root
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_WEBUI_PORT=8080
      - SPARK_MASTER_PORT=7077
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./python:/python
  worker1:
    image: bitnami/spark:3.5.4
    container_name: worker1
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
    depends_on:
      - master
  worker2:
    image: bitnami/spark:3.5.4
    container_name: worker2
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
    depends_on:
      - master
networks:
  default:
    driver: bridge
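
All three services share the default bridge network defined above, which is why the workers can reach the master simply by its service name in spark://master:7077. Also note that the master service bind-mounts the host directory ./python to /python inside the container; this is where the example job will be placed later, so create the directory next to the compose file if it does not exist yet:

mkdir -p python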

Start Spark cluster

Save the file, then start the cluster by executing the following command in the same directory:

docker compose up -d

This pulls the required images (if they are not already present) and launches all containers in the background.
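
To confirm that all three containers are up, you can list them and, if needed, inspect the master's startup logs; both commands are read-only:

docker compose ps
docker compose logs master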

Verify cluster status

After a successful startup, open http://{your virtual machine IP}:8080 in a browser to view the Spark Master's Web UI and confirm that worker1 and worker2 have connected successfully.
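
If you prefer the command line, the worker logs should also show that registration succeeded. The exact wording can differ between Spark versions, but a line similar to "Successfully registered with master spark://master:7077" is expected:

docker compose logs worker1 | grep -i registered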

Run Spark jobs

To test the cluster, you can submit a simple Python script that estimates the value of Pi by random sampling. The script content is as follows:

import random
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Pi Calculator").getOrCreate()
    numSamples = 100000000

    def inside(_):
        x, y = random.random(), random.random()
        return x * x + y * y < 1  # point falls inside the unit quarter-circle

    count = spark.sparkContext.parallelize(range(1, numSamples)).filter(inside).count()
    print(f"Pi is roughly {4.0 * count / numSamples}")
    spark.stop()

Save this script in the ./python directory (in this example as pi.py; any filename works as long as the path passed to spark-submit matches), then submit the job with the following command:

docker compose exec master /opt/bitnami/spark/bin/spark-submit --master spark://master:7077 /python/pi.py
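
If the job completes successfully, the driver output in the terminal should include a line of the form "Pi is roughly …", where the printed value is the random-sampling estimate and varies slightly from run to run.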

Conclusion

Through the above steps, you have successfully deployed a Spark cluster with one Master node and two Worker nodes using Docker. This deployment method is not only fast and convenient, but also makes it easy to adjust the configuration as needed, such as adding more Worker nodes (a sketch is shown below) or changing the resources assigned to each one. Hopefully this tutorial helps you get started quickly with deploying Spark clusters on Docker!
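
As a rough sketch of scaling out, an additional worker can be added by appending another service block under services: in docker-compose.yml (the name worker3 is just an example, mirroring the existing worker definitions) and re-running docker compose up -d:

  worker3:
    image: bitnami/spark:3.5.4
    container_name: worker3
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
    depends_on:
      - master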
