Introduction
Apache Spark is a powerful unified analytics engine for large-scale data processing. This article explains in detail how to use Docker and Docker Compose to quickly deploy a Spark cluster with one Master node and two Worker nodes. This approach not only simplifies cluster setup, but also provides advantages such as resource isolation and easy scaling.
Prerequisites
Before you begin, make sure that the following components are ready in your environment:
- Docker Engine is installed and running.
- Docker Compose is installed (it is used to define and run multi-container applications).
- The host can reach Docker Hub to pull the required images.
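A quick way to confirm these prerequisites, assuming Docker Compose V2 is installed as a Docker CLI plugin:

docker --version          # Docker Engine version
docker compose version    # Docker Compose (V2) version
docker info               # fails with an error if the Docker daemon is not running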
Docker Compose file configuration
First, create a docker-compose.yml file and add the following content:
version: '3'
services:
  master:
    image: bitnami/spark:3.5.4
    container_name: master
    user: root
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_WEBUI_PORT=8080
      - SPARK_MASTER_PORT=7077
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./python:/python
  worker1:
    image: bitnami/spark:3.5.4
    container_name: worker1
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
    depends_on:
      - master
  worker2:
    image: bitnami/spark:3.5.4
    container_name: worker2
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
    depends_on:
      - master
networks:
  default:
    driver: bridge
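Before starting anything, you can optionally ask Compose to validate the file and print the resolved configuration, which catches indentation mistakes early:

docker compose config    # validates docker-compose.yml and prints the merged configuration

Note that the ./python volume on the master is what later lets spark-submit read scripts placed on the host, and depends_on only controls start order: the workers start after the master container, but Compose does not wait for the master process to be fully ready.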
Start Spark cluster
Save the file, then start the cluster by executing the following command:
docker compose up -d
This starts all containers in detached (background) mode.
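You can also confirm from the command line that all three containers are up:

docker compose ps                # should list master, worker1 and worker2 as running
docker compose logs -f master    # follow the master's logs (press Ctrl+C to stop)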
Verify cluster status
After a successful startup, open http://{your virtual machine IP}:8080 in a browser to view the Spark Master's Web UI and confirm that worker1 and worker2 have registered with the master successfully.
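If you prefer the command line, the worker logs offer another check; a Spark worker normally logs a line such as "Successfully registered with master" once it has joined the cluster:

docker compose logs worker1 | grep -i "registered with master"
docker compose logs worker2 | grep -i "registered with master"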
Run Spark jobs
To test the cluster, you can submit a simple Python script that estimates Pi with Monte Carlo sampling: it draws random points in the unit square and counts how many fall inside the unit circle. The script content is as follows:
import random

from pyspark.sql import SparkSession

def inside(_):
    # Draw a random point in the unit square and check whether it falls inside the unit circle
    x, y = random.random(), random.random()
    return x * x + y * y < 1

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Pi Calculator").getOrCreate()
    numSamples = 100000000
    count = spark.sparkContext.parallelize(range(1, numSamples)).filter(inside).count()
    print(f"Pi is roughly {4.0 * count / numSamples}")
    spark.stop()
Place this script (saved, for example, as pi.py) in the ./python directory on the host, then submit the job with the following command:

docker compose exec master /opt/bitnami/spark/bin/spark-submit --master spark://master:7077 /python/pi.py
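spark-submit also accepts the usual resource options; the values below are only an illustration (again assuming the script is named pi.py) and stay within the 1 core / 1 GB configured for each worker in the Compose file:

docker compose exec master /opt/bitnami/spark/bin/spark-submit \
  --master spark://master:7077 \
  --total-executor-cores 2 \
  --executor-memory 512m \
  /python/pi.py

If the job completes successfully, the driver output includes the script's own line, e.g. "Pi is roughly 3.14...".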
Conclusion
Through the above steps, you have deployed a Spark cluster with one Master node and two Worker nodes using Docker. This deployment method is not only fast and convenient, but also makes it easy to adjust the configuration as needed, such as adding Worker nodes or changing their resource limits. Hopefully this tutorial helps you get started quickly with deploying Spark clusters on Docker!
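If you later edit docker-compose.yml, for example to add a worker3 service modeled on the existing workers, the typical workflow is a sketch like this:

docker compose up -d    # re-creates only the containers whose configuration changed
docker compose down     # stops and removes the whole cluster when you are finished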