Running Spark on Docker using bitnami image - python

I'm trying to run Spark in a Docker container from a Python app located in another container:
version: '3'
services:
  spark-master:
    image: docker.io/bitnami/spark:2
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - type: bind
        source: ./conf/log4j.properties
        target: /opt/bitnami/spark/conf/log4j.properties
    ports:
      - '8080:8080'
      - '7077:7077'
    networks:
      - spark
    container_name: spark
  spark-worker-1:
    image: docker.io/bitnami/spark:2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - type: bind
        source: ./conf/log4j.properties
        target: /opt/bitnami/spark/conf/log4j.properties
    ports:
      - '8081:8081'
    container_name: spark-worker
    networks:
      - spark
    depends_on:
      - spark-master
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    ports:
      - 22181:2181
    container_name: zookeeper
    networks:
      - rmoff_kafka
  kafka:
    image: confluentinc/cp-kafka:5.5.0
    depends_on:
      - zookeeper
    ports:
      - 9092:9092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    container_name: kafka
    networks:
      - rmoff_kafka
  app:
    build:
      context: ./
    depends_on:
      - kafka
    ports:
      - 5000:5000
    container_name: app
    networks:
      - rmoff_kafka
networks:
  spark:
    driver: bridge
  rmoff_kafka:
    name: rmoff_kafka
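A hedged aside on networking in this file: as written, the app service only joins rmoff_kafka, while the Spark master sits on the spark network, so the hostname spark will not resolve from app. For the driver in app to reach spark://spark:7077, app would also need to join the spark network, along these lines:

  app:
    networks:
      - rmoff_kafka
      - spark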
When I try to create a SparkSession:
import os

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setAll(
    [
        (
            "spark.master",
            os.environ.get("SPARK_MASTER_URL", "spark://spark:7077"),
        ),
        # Note: "local[*]" is a master URL, not a hostname; the driver host
        # should normally be the hostname of the container running this app.
        ("spark.driver.host", os.environ.get("SPARK_DRIVER_HOST", "local[*]")),
        ("spark.submit.deployMode", "client"),
        ("spark.ui.showConsoleProgress", "true"),
        ("spark.driver.bindAddress", "0.0.0.0"),
        ("spark.app.name", app_name),  # app_name is defined elsewhere
    ]
)
spark_session = SparkSession.builder.config(conf=conf).getOrCreate()
I get an error related to Java:
JAVA_HOME is not set
Exception: Java gateway process exited before sending its port number
I suppose I have to install Java or set the JAVA_HOME environment variable, but I don't know exactly how to tackle the problem. Should I install Java in the Spark container or in the container from which I run the Python script?

Add the Java installation to your app's Dockerfile:
# Install OpenJDK 11 (a headless JRE is enough for the PySpark driver)
RUN apt-get update && \
    apt-get install -y openjdk-11-jre-headless && \
    apt-get clean
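Java belongs in the container that runs the Python script, because that is where the PySpark driver JVM starts; the bitnami Spark images already ship with Java. A minimal sketch of the app image, assuming a Debian-based base (the JAVA_HOME path is the Debian default for OpenJDK 11 and the entrypoint names are illustrative; verify both in your image):

FROM python:3.8-slim

# Headless JRE for the PySpark driver JVM, which runs in this app container
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-11-jre-headless && \
    rm -rf /var/lib/apt/lists/*

# Debian's default install path for OpenJDK 11 (verify with: ls /usr/lib/jvm)
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

WORKDIR /app
COPY . .
RUN pip install pyspark
CMD ["python", "main.py"]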

Related

Airflow getting error No module named 'httplib2'

I am getting an error when loading Airflow over Docker. I already have the package installed in my Anaconda env. I am new to Airflow and following this course; under the video there is also a link to GitHub with the code I am using to recreate the task:
https://www.youtube.com/watch?v=wAyu5BN3VpY&t=717s [29:30min]
version: '3'
services:
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    ports:
      - "5432:5432"
  webserver:
    image: puckel/docker-airflow:1.10.1
    build:
      context: https://github.com/puckel/docker-airflow.git#1.10.1
      dockerfile: Dockerfile
      args:
        AIRFLOW_DEPS: gcp_api,s3
        PYTHON_DEPS: sqlalchemy==1.2.0
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=n
      - EXECUTOR=Local
      - FERNET_KEY=jsDPRErfv8Z_eVTnGfF8ywd19j4pyqE3NpdUBA_oRTo=
    volumes:
      - ./examples/gcloud-example/dags:/usr/local/airflow/dags
      # Uncomment to include custom plugins
      # - ./plugins:/usr/local/airflow/plugins
    ports:
      - "8080:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
#!/bin/bash
docker-compose -f docker-compose-gcloud.yml up --abort-on-container-exit
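Packages installed in the host's Anaconda env are not visible inside the container; the dependency has to be baked into the image. One sketch of a fix reuses the PYTHON_DEPS build arg this file already passes (puckel's Dockerfile pip-installs its contents; treat the exact arg handling as an assumption, and rebuild with docker-compose build afterwards):

  webserver:
    build:
      context: https://github.com/puckel/docker-airflow.git#1.10.1
      dockerfile: Dockerfile
      args:
        AIRFLOW_DEPS: gcp_api,s3
        PYTHON_DEPS: sqlalchemy==1.2.0 httplib2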

Run a command in Docker after certain container is up

I have a Docker Compose setup with 4 containers.
How do I run a Python script (python manage.py setup) in the server_1 container once postgres_1 is up, but only once? The state should be persisted somewhere, maybe via a volume. I already persist the PostgreSQL data to disk via a volume.
I want setup and startup to be very easy: just docker-compose up, regardless of whether it is the first run or a later one. Only the first run needs the python manage.py setup invocation. Is there a nice way of doing this?
My idea was to check for the existence of a flag file in a mounted volume, but I don't know how to make server_1 wait for postgres_1 to be up.
Here is my docker-compose.yml
version: '3'
services:
  server:
    build:
      context: .
      dockerfile: docker/backend/Dockerfile
    restart: always
    working_dir: /srv/scanmycode/
    entrypoint: python
    command: /srv/scanmycode/manage.py runserver
    ports:
      - 5000:5000
    volumes:
      - ./data1:/srv/scanmycode/quantifiedcode/data/
      - ./data2:/srv/scanmycode/quantifiedcode/backend/data/
    links:
      - "postgres"
  postgres:
    image: postgres:13.2
    restart: unless-stopped
    environment:
      POSTGRES_DB: qc
      POSTGRES_USER: qc
      POSTGRES_PASSWORD: qc
      PGDATA: /var/lib/postgresql/data/pgdata
    ports:
      - "5432:5432"
    volumes:
      - type: bind
        source: ./postgres-data
        target: /var/lib/postgresql/data
  worker_1:
    build:
      context: .
      dockerfile: docker/worker/Dockerfile
      args:
        - GIT_TOKEN
    hostname: worker_1
    restart: on-failure
    depends_on:
      - rabbitmq3
    working_dir: /srv/scanmycode/
    entrypoint: python
    command: /srv/scanmycode/manage.py runworker
    volumes:
      - ./data1:/srv/scanmycode/quantifiedcode/data/
      - ./data2:/srv/scanmycode/quantifiedcode/backend/data/
    links:
      - "rabbitmq3"
      - "server"
      - "postgres"
  rabbitmq3:
    container_name: "rabbitmq"
    image: rabbitmq:3.8-management-alpine
    environment:
      - RABBITMQ_DEFAULT_USER=qc
      - RABBITMQ_DEFAULT_PASS=qc
    ports:
      - 5672:5672
      - 15672:15672
    healthcheck:
      test: [ "CMD", "nc", "-z", "localhost", "5672" ]
      interval: 5s
      timeout: 15s
      retries: 1
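To answer the "how to wait" part: one common approach is a small retry loop in the app itself. A sketch assuming psycopg2 is available in the server image (function name and DSN are illustrative; credentials match the compose file above):

import time

import psycopg2  # assumed to be installed in the server image

def wait_for_postgres(dsn, retries=30, delay=2):
    """Block until Postgres accepts connections, or raise after `retries` attempts."""
    for _ in range(retries):
        try:
            psycopg2.connect(dsn).close()
            return
        except psycopg2.OperationalError:
            time.sleep(delay)
    raise RuntimeError("Postgres did not become available")

wait_for_postgres("dbname=qc user=qc password=qc host=postgres port=5432")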
Used this (the flag file in the mounted ./setup_state volume records that setup already ran; because of restart: always, the container restarts after the first-time setup and then takes the runserver branch):
version: '3'
services:
  server:
    build:
      context: .
      dockerfile: docker/backend/Dockerfile
    restart: always
    depends_on:
      - postgres
    working_dir: /srv/scanmycode/
    entrypoint: sh
    command: -c "if [ -f /srv/scanmycode/setup_state/setup_done ]; then python /srv/scanmycode/manage.py runserver; else python /srv/scanmycode/manage.py setup && mkdir -p /srv/scanmycode/setup_state && touch /srv/scanmycode/setup_state/setup_done; fi"
    ports:
      - 5000:5000
    volumes:
      - ./data1:/srv/scanmycode/quantifiedcode/data/
      - ./data2:/srv/scanmycode/quantifiedcode/backend/data/
      - ./setup_state:/srv/scanmycode/setup_state
    links:
      - "postgres"
  postgres:
    image: postgres:13.2
    restart: unless-stopped
    environment:
      POSTGRES_DB: qc
      POSTGRES_USER: qc
      POSTGRES_PASSWORD: qc
      PGDATA: /var/lib/postgresql/data/pgdata
    ports:
      - "5432:5432"
    volumes:
      - db-data:/var/lib/postgresql/data
  worker_1:
    build:
      context: .
      dockerfile: docker/worker/Dockerfile
    hostname: worker_1
    restart: on-failure
    depends_on:
      - rabbitmq3
      - postgres
      - server
    working_dir: /srv/scanmycode/
    entrypoint: python
    command: /srv/scanmycode/manage.py runworker
    volumes:
      - ./data1:/srv/scanmycode/quantifiedcode/data/
      - ./data2:/srv/scanmycode/quantifiedcode/backend/data/
    links:
      - "rabbitmq3"
      - "server"
      - "postgres"
  rabbitmq3:
    container_name: "rabbitmq"
    image: rabbitmq:3.8-management-alpine
    environment:
      - RABBITMQ_DEFAULT_USER=qc
      - RABBITMQ_DEFAULT_PASS=qc
    ports:
      - 5672:5672
      - 15672:15672
    healthcheck:
      test: [ "CMD", "nc", "-z", "localhost", "5672" ]
      interval: 5s
      timeout: 15s
      retries: 1
volumes:
  db-data:
    driver: local
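An alternative sketch: give postgres a healthcheck and gate server on it. This assumes a Compose version that supports it; depends_on with condition: service_healthy needs Compose file format 2.1+ or a recent Compose implementation, and is not honored under a plain version: '3' file with classic docker-compose.

  postgres:
    image: postgres:13.2
    healthcheck:
      # pg_isready ships in the official postgres image
      test: ["CMD-SHELL", "pg_isready -U qc"]
      interval: 5s
      timeout: 5s
      retries: 10
  server:
    depends_on:
      postgres:
        condition: service_healthy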

Create a new kafka topic using python from a container connecting to another container

I would like to create a new Kafka topic using Python, and I get an error when I try to create a KafkaAdminClient using server="kafka:9092":
from kafka.admin import KafkaAdminClient

self._kafka_admin = KafkaAdminClient(
    bootstrap_servers=[server],
    api_version=(0, 10, 2),
    api_version_auto_timeout_ms=120000,
)
The error I get:
Traceback (most recent call last):
  File "main.py", line 47, in <module>
    kafka_manager = KafkaManager("kafka:9092")
  File "/app/src/kafka/kafka_manager.py", line 24, in __init__
    self._kafka_admin = KafkaAdminClient(
  File "/usr/local/lib/python3.8/site-packages/kafka/admin/client.py", line 211, in __init__
    self._client.check_version(timeout=(self.config['api_version_auto_timeout_ms'] / 1000))
  File "/usr/local/lib/python3.8/site-packages/kafka/client_async.py", line 900, in check_version
    raise Errors.NoBrokersAvailable()
kafka.errors.NoBrokersAvailable: NoBrokersAvailable
Moreover, I've built the following docker-compose file:
version: '3'
services:
  spark-master:
    image: docker.io/bitnami/spark:2
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - type: bind
        source: ./conf/log4j.properties
        target: /opt/bitnami/spark/conf/log4j.properties
    ports:
      - '8080:8080'
      - '7077:7077'
    networks:
      - spark
    container_name: spark
  spark-worker-1:
    image: docker.io/bitnami/spark:2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://localhost:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - type: bind
        source: ./conf/log4j.properties
        target: /opt/bitnami/spark/conf/log4j.properties
    ports:
      - '8081:8081'
    container_name: spark-worker
    networks:
      - spark
    depends_on:
      - spark-master
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    ports:
      - 22181:2181
    container_name: zookeeper
    networks:
      - rmoff_kafka
  kafka:
    image: confluentinc/cp-kafka:5.5.0
    depends_on:
      - zookeeper
    ports:
      - 9092:9092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    container_name: kafka
    networks:
      - rmoff_kafka
  app:
    build:
      context: ./
    depends_on:
      - kafka
    ports:
      - 5000:5000
    container_name: app
    networks:
      - rmoff_kafka
networks:
  spark:
    driver: bridge
  rmoff_kafka:
    name: rmoff_kafka
Finally, the structure is 2 containers: 1 for the Python app and 1 for Kafka. (Screenshots of the container layout and the docker ps output are omitted here.)
The solution was changing the host name to the name of the container (kafka) and adding a sleep() call in Python to wait for the Kafka container to start.
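A fixed-length sleep() works but is brittle; a retry loop is more robust. A sketch using the same kafka-python client that appears in the traceback (function name, topic name, and timings are illustrative):

import time

from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import NoBrokersAvailable, TopicAlreadyExistsError

def connect_admin(server="kafka:9092", retries=12, delay=5):
    """Retry until the broker accepts connections instead of sleeping blindly."""
    for _ in range(retries):
        try:
            return KafkaAdminClient(
                bootstrap_servers=[server],
                api_version=(0, 10, 2),
            )
        except NoBrokersAvailable:
            time.sleep(delay)
    raise RuntimeError(f"no broker reachable at {server}")

admin = connect_admin()
try:
    admin.create_topics([NewTopic(name="my-topic", num_partitions=1,
                                  replication_factor=1)])
except TopicAlreadyExistsError:
    pass  # idempotent on re-runs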

How to run Python Django and Celery using docker-compose?

I have a Python application using Django and Celery, and I'm trying to run it with Docker and docker-compose because I'm also using Redis and DynamoDB.
The problem is the following:
I'm not able to run both services, WSGI and Celery, because only the first command takes effect: the celery worker stays in the foreground, so the uwsgi command after && is never reached.
version: '3.3'
services:
  redis:
    image: redis:3.2-alpine
    volumes:
      - redis_data:/data
    ports:
      - "6379:6379"
  dynamodb:
    image: dwmkerr/dynamodb
    ports:
      - "3000:8000"
    volumes:
      - dynamodb_data:/data
  jobs:
    build:
      context: nubo-async-cfe-seces
      dockerfile: Dockerfile
    environment:
      - REDIS_HOST=redisrvi
      - PYTHONUNBUFFERED=0
      - CC_DYNAMODB_NAMESPACE=None
      - CC_DYNAMODB_ACCESS_KEY_ID=anything
      - CC_DYNAMODB_SECRET_ACCESS_KEY=anything
      - CC_DYNAMODB_HOST=dynamodb
      - CC_DYNAMODB_PORT=8000
      - CC_DYNAMODB_IS_SECURE=False
    command: >
      bash -c "celery worker -A tasks.async_service -Q dynamo-queue -E --loglevel=ERROR &&
      uwsgi --socket 0.0.0.0:8080 --protocol=http --wsgi-file nubo_async/wsgi.py"
    depends_on:
      - redis
      - dynamodb
    volumes:
      - .:/jobs
    ports:
      - "9090:8080"
volumes:
  redis_data:
  dynamodb_data:
Has anyone had the same problem?
You may refer to the docker-compose setup of the Saleor project. I would suggest letting Celery run as its own service, depending only on redis as the broker. See the configuration in its docker-compose.yml file:
services:
  web:
    build:
      context: .
      dockerfile: ./Dockerfile
      args:
        STATIC_URL: '/static/'
    restart: unless-stopped
    networks:
      - saleor-backend-tier
    env_file: common.env
    depends_on:
      - db
      - redis
  celery:
    build:
      context: .
      dockerfile: ./Dockerfile
      args:
        STATIC_URL: '/static/'
    command: celery -A saleor worker --app=saleor.celeryconf:app --loglevel=info
    restart: unless-stopped
    networks:
      - saleor-backend-tier
    env_file: common.env
    depends_on:
      - redis
Note also that the connections from both services to redis are configured separately, via the environment variables shown in the common.env file:
CACHE_URL=redis://redis:6379/0
CELERY_BROKER_URL=redis://redis:6379/1
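For reference, a hedged sketch of how such URLs might be consumed on the Django side (the variable names follow common.env above; plain os.environ is used here, whereas Saleor itself uses parsing helpers):

import os

# Broker and cache endpoints come from the environment, so the same image
# works across compose, CI, and production with different redis databases.
CELERY_BROKER_URL = os.environ.get("CELERY_BROKER_URL", "redis://redis:6379/1")
CACHE_URL = os.environ.get("CACHE_URL", "redis://redis:6379/0")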
Here's the docker-compose as suggested by @Satevg: run the Django and Celery applications in separate containers. Works fine!
version: '3.3'
services:
  redis:
    image: redis:3.2-alpine
    volumes:
      - redis_data:/data
    ports:
      - "6379:6379"
  dynamodb:
    image: dwmkerr/dynamodb
    ports:
      - "3000:8000"
    volumes:
      - dynamodb_data:/data
  jobs:
    build:
      context: nubo-async-cfe-services
      dockerfile: Dockerfile
    environment:
      - REDIS_HOST=redis
      - PYTHONUNBUFFERED=0
      - CC_DYNAMODB_NAMESPACE=None
      - CC_DYNAMODB_ACCESS_KEY_ID=anything
      - CC_DYNAMODB_SECRET_ACCESS_KEY=anything
      - CC_DYNAMODB_HOST=dynamodb
      - CC_DYNAMODB_PORT=8000
      - CC_DYNAMODB_IS_SECURE=False
    command: bash -c "uwsgi --socket 0.0.0.0:8080 --protocol=http --wsgi-file nubo_async/wsgi.py"
    depends_on:
      - redis
      - dynamodb
    volumes:
      - .:/jobs
    ports:
      - "9090:8080"
  celery:
    build:
      context: nubo-async-cfe-services
      dockerfile: Dockerfile
    environment:
      - REDIS_HOST=redis
      - PYTHONUNBUFFERED=0
      - CC_DYNAMODB_NAMESPACE=None
      - CC_DYNAMODB_ACCESS_KEY_ID=anything
      - CC_DYNAMODB_SECRET_ACCESS_KEY=anything
      - CC_DYNAMODB_HOST=dynamodb
      - CC_DYNAMODB_PORT=8000
      - CC_DYNAMODB_IS_SECURE=False
    command: celery worker -A tasks.async_service -Q dynamo-queue -E --loglevel=ERROR
    depends_on:
      - redis
      - dynamodb
    volumes:
      - .:/jobs
volumes:
  redis_data:
  dynamodb_data:
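For completeness, a hedged sketch of what the tasks.async_service module referenced by the worker command might look like (the module path and queue name are taken from the command above; the broker URL and task body are assumptions):

# tasks/async_service.py (illustrative sketch)
import os

from celery import Celery

app = Celery(
    "async_service",
    broker=f"redis://{os.environ.get('REDIS_HOST', 'redis')}:6379/0",
)

@app.task(queue="dynamo-queue")
def example_task(item_id):
    # Placeholder: real tasks would read/write DynamoDB here.
    return item_id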

creating copies of docker on multiple ports

What I am trying to do: run Airflow in Docker with Celery.
My issue: my Celery workers are in containers and I don't know how to scale them.
My docker-compose file:
version: '2'
services:
  mysql:
    image: mysql:latest
    restart: always
    ports:
      - "3306:3306"
    environment:
      - MYSQL_RANDOM_ROOT_PASSWORD=true
      - MYSQL_USER=airflow
      - MYSQL_PASSWORD=airflow
      - MYSQL_DATABASE=airflow
    volumes:
      - mysql:/var/lib/mysql
  rabbitmq:
    image: rabbitmq:3-management
    restart: always
    ports:
      - "15672:15672"
      - "5672:5672"
      - "15671:15671"
    environment:
      - RABBITMQ_DEFAULT_USER=airflow
      - RABBITMQ_DEFAULT_PASS=airflow
      - RABBITMQ_DEFAULT_VHOST=airflow
    volumes:
      - rabbitmq:/var/lib/rabbitmq
  webserver:
    image: airflow:ver5
    restart: always
    volumes:
      - ~/airflow/dags:/usr/local/airflow/dags
      - /opt/scripts:/opt/scripts
    environment:
      - AIRFLOW_HOME=/usr/local/airflow
    ports:
      - "8080:8080"
    links:
      - mysql:mysql
      - rabbitmq:rabbitmq
      - worker:worker
      - scheduler:scheduler
    depends_on:
      - mysql
      - rabbitmq
      - worker
      - scheduler
    command: webserver
    env_file: ./airflow.env
  scheduler:
    image: airflow:ver5
    restart: always
    volumes:
      - ~/airflow/dags:/usr/local/airflow/dags
      - /opt/scripts:/opt/scripts
    environment:
      - AIRFLOW_HOME=/usr/local/airflow
    links:
      - mysql:mysql
      - rabbitmq:rabbitmq
    depends_on:
      - mysql
      - rabbitmq
    command: scheduler
    env_file: ./airflow.env
  worker:
    image: airflow:ver5
    restart: always
    volumes:
      - ~/airflow/dags:/usr/local/airflow/dags
      - /opt/scripts:/opt/scripts
    environment:
      - AIRFLOW_HOME=/usr/local/airflow
    ports:
      - "8793:8793"
    links:
      - mysql:mysql
      - rabbitmq:rabbitmq
    depends_on:
      - mysql
      - rabbitmq
    command: worker
    env_file: ./airflow.env
# named volumes referenced by mysql and rabbitmq above
volumes:
  mysql:
  rabbitmq:
So I run the docker-compose command with the above file, and it starts one worker instance on port 8793 on localhost, as I am mapping the Docker port to localhost. Now I want to scale the number of workers, and to do that I use the following command:
docker-compose -f docker-compose.yml scale worker=5
but that errors out because an instance of worker is already bound to 8793. Is there a way to dynamically allocate ports to new worker containers as I scale up?
You could allow your worker nodes to expose the worker port to the host machine on a random port number:
worker:
  image: airflow:ver5
  restart: always
  volumes:
    - ~/airflow/dags:/usr/local/airflow/dags
    - /opt/scripts:/opt/scripts
  environment:
    - AIRFLOW_HOME=/usr/local/airflow
  ports:
    - "8793"
  links:
    - mysql:mysql
    - rabbitmq:rabbitmq
  depends_on:
    - mysql
    - rabbitmq
  command: worker
  env_file: ./airflow.env
Listing a container port alone under ports: (for example - 80) publishes that container port on a random port of the host.
Because Docker Compose puts the services on a shared network, you can actually omit the publish step altogether and it will still work: simply remove ports: from the worker service.
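With the fixed host port removed or randomized, the scale command from the question should work. A usage sketch (classic docker-compose syntax, as used above; the port lookup only applies if the random ports: - "8793" mapping is kept):

# scale to five workers; no host-port conflicts now
docker-compose -f docker-compose.yml scale worker=5

# look up the host port assigned to, say, the third worker replica
docker-compose -f docker-compose.yml port --index=3 worker 8793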
