Running Apache Beam python pipelines in Kubernetes - python

This question might seem like a duplicate of this.
I am trying to run an Apache Beam Python pipeline using Flink on an offline instance of Kubernetes. However, since I have user code with external dependencies, I am using the Python SDK harness as an external service, which is causing errors (described below).
The Kubernetes manifest I use to launch the Beam Python SDK:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: beam-sdk
spec:
  replicas: 1
  selector:
    matchLabels:
      app: beam
      component: python-beam-sdk
  template:
    metadata:
      labels:
        app: beam
        component: python-beam-sdk
    spec:
      hostNetwork: True
      containers:
      - name: python-beam-sdk
        image: apachebeam/python3.7_sdk:latest
        imagePullPolicy: "Never"
        command: ["/opt/apache/beam/boot", "--worker_pool"]
        ports:
        - containerPort: 50000
          name: yay
---
apiVersion: v1
kind: Service
metadata:
  name: beam-python-service
spec:
  type: NodePort
  ports:
  - name: yay
    port: 50000
    targetPort: 50000
  selector:
    app: beam
    component: python-beam-sdk
When I launch my pipeline with the following options:
beam_options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_version=1.9",
    "--flink_master=10.101.28.28:8081",
    "--environment_type=EXTERNAL",
    "--environment_config=10.97.176.105:50000",
    "--setup_file=./setup.py"
])
I get the following error message (within the python sdk service):
NAME                                 READY   STATUS    RESTARTS   AGE
beam-sdk-666779599c-w65g5            1/1     Running   1          4d20h
flink-jobmanager-74d444cccf-m4g8k    1/1     Running   1          4d20h
flink-taskmanager-5487cc9bc9-fsbts   1/1     Running   2          4d20h
flink-taskmanager-5487cc9bc9-zmnv7   1/1     Running   2          4d20h
(base) [~]$ sudo kubectl logs -f beam-sdk-666779599c-w65g5
2020/02/26 07:56:44 Starting worker pool 1: python -m apache_beam.runners.worker.worker_pool_main --service_port=50000 --container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=localhost:39283', '--artifact_endpoint=localhost:41533', '--provision_endpoint=localhost:42233', '--control_endpoint=localhost:44977']
2020/02/26 09:09:07 Initializing python harness: /opt/apache/beam/boot --id=1-1 --logging_endpoint=localhost:39283 --artifact_endpoint=localhost:41533 --provision_endpoint=localhost:42233 --control_endpoint=localhost:44977
2020/02/26 09:11:07 Failed to obtain provisioning information: failed to dial server at localhost:42233
caused by:
context deadline exceeded
I have no idea what the logging or artifact endpoints (etc.) are. From inspecting the source code, it seems that the endpoints have been hard-coded to localhost.

(You said in a comment that the answer to the referenced post is valid, so I'll just address the specific error you ran into in case someone else hits it.)
Your understanding is correct; the logging, artifact, etc. endpoints are essentially hardcoded to use localhost. These endpoints are meant to be only used internally by Beam and are not configurable. So the Beam worker is implicitly assumed to be on the same host as the Flink task manager. Typically, this is accomplished by making the Beam worker pool a sidecar of the Flink task manager pod, rather than a separate service.
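With the worker pool running as a sidecar of the task manager, the pipeline would point at it over localhost instead of a Service IP. A minimal sketch of the option list, assuming the sidecar listens on port 50000 as in the manifest from the question; the helper name sidecar_worker_pool_args is made up for illustration and is not part of Beam:

```python
def sidecar_worker_pool_args(flink_master, worker_pool_port=50000):
    """Build Beam pipeline arguments for an EXTERNAL worker pool sidecar.

    Hypothetical helper: it only assembles the flags already used in the
    question, swapping the Service IP for localhost.
    """
    return [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
        "--environment_type=EXTERNAL",
        # The sidecar shares the pod's network namespace with the Flink
        # task manager, so it is reachable via localhost.
        f"--environment_config=localhost:{worker_pool_port}",
    ]


if __name__ == "__main__":
    # The resulting list can be passed to PipelineOptions(...) as in the question.
    print(sidecar_worker_pool_args("10.101.28.28:8081"))
```

The address in --environment_config is resolved from the task manager's point of view, which is why localhost only works when both containers live in the same pod.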

Related

Failed to pull image "docker4nitin/tg_bot:v2": rpc error: code = Unknown desc = context deadline exceeded

I have created an image "docker4nitin/tg_bot:v2" and pushed it to Docker Hub. I have created a Kubernetes deployment file to deploy it, but it fails to pull the image.
This is the deployment.yml file:
[image: deployment.yml file]
Please help me fix this issue. I am tired of searching for a solution and have not found one.
I want to deploy it in Kubernetes.
Everything is fine with your image. I just tested it by creating the following deployment, and Kubernetes pulls it fine.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
spec:
  selector:
    matchLabels:
      app: test
  replicas: 1 # tells deployment to run 1 pod matching the template
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: tgbot
        image: docker4nitin/tg_bot:v2
        ports:
        - containerPort: 80
Here are the events that occurred when I pulled it on my end.
Events:
  Type     Reason     Age                    From                   Message
  ----     ------     ----                   ----                   -------
  Normal   Scheduled  4m1s                   bin-packing-scheduler  Successfully assigned default/test-deployment-6bf6bbbfb5-mlrgp to ip-10-250-0-97.eu-central-1.compute.internal
  Normal   Pulling    4m1s                   kubelet                Pulling image "docker4nitin/tg_bot:v2"
  Normal   Pulled     3m43s                  kubelet                Successfully pulled image "docker4nitin/tg_bot:v2" in 17.153014569s
  Normal   Created    2m8s (x5 over 3m42s)   kubelet                Created container tgbot
  Normal   Started    2m8s (x5 over 3m41s)   kubelet                Started container tgbot
  Normal   Pulled     2m8s (x4 over 3m40s)   kubelet                Container image "docker4nitin/tg_bot:v2" already present on machine
  Warning  BackOff    2m7s (x9 over 3m39s)   kubelet                Back-off restarting failed container
If you look closely, the image pull works fine, but something is wrong with the container itself, which is causing the back-off. Posting the container's specific error here would be more useful.

Kubernetes : No difference in performances between 1 Pod and 3 Pods?

I have a very simple Kubernetes cluster on GKE (Google Cloud Platform) with 1 node (4 vCPU / 16 GB RAM).
This morning I was load testing an API (written in Python) on this cluster with Locust. I only have one endpoint on my API, so I prepared a Locust file to run different configurations on that endpoint (with random parameters, etc.).
I first ran my Locust test with 10 users, spawning 1 user every second (until the limit was reached), with a request wait time between 1 and 2.5 seconds. I tested it first on just 1 pod and then ran the exact same test on 3 pods. I noticed almost no improvement with 3 pods; the response times were just more spread out.
Here are the results for 1 Pod :
And the charts:
And here are the results for 3 Pods
And the charts:
I am fairly new to K8s and I don't really understand everything, but my initial thought was that increasing the number of running pods should improve performance, right? I should be able to handle 3 times more requests in the same time, but the results are far from that expected conclusion.
And here is the YAML file defining the deployment on the K8s cluster (note: I set the replicas to 1 or 3 myself because for some reason the autoscaler doesn't work):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: app_name
  name: app_name
  namespace: default
spec:
  replicas: 3 # or 1
  selector:
    matchLabels:
      app: app_name
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: app_name
    spec:
      containers:
      - image: >-
          gcr.io/path_to_image
        imagePullPolicy: IfNotPresent
        name: app_name-v2-sha256-1
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app: app_name
  name: app_name-hpa-cbrt
  namespace: default
spec:
  maxReplicas: 3
  metrics:
  - resource:
      name: cpu
      targetAverageUtilization: 80
    type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app_name
Additional info: The API calls a MySQL database and performs a query that fetches approx. 70k rows and does some processing on those results. It only returns a JSON with a few rows at the end.
Thanks in advance for any help you could provide !
Edit : Here is the definition of the load balancer, it's GCP generated and untouched
apiVersion: v1
kind: Service
metadata:
  annotations:
    cloud.google.com/neg: '{"ingress":true}'
  creationTimestamp: "2021-05-11T10:03:16Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app: app_name
    app.kubernetes.io/managed-by: gcp-cloud-build-deploy
    app.kubernetes.io/name: app_name
    app.kubernetes.io/version: b8a19c9rth54erthe4rth8459633451fe8e73e038
  name: app_service
  namespace: default
  resourceVersion: "87954231"
  selfLink: /api/v1/namespaces/default/services/app_service
  uid: service_uid
spec:
  clusterIP: 661.661.661.661
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 30129
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: app_name
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 666.666.666.666

How do I run Beam Python pipelines using Flink deployed on Kubernetes?

Does anybody know how to run Beam Python pipelines with Flink when Flink is running as pods in Kubernetes?
I have successfully managed to run a Beam Python pipeline using the Portable runner and the job service pointing to a local Flink server running in Docker containers.
I was able to achieve that by mounting the Docker socket in my Flink containers and running Flink as a root process, so that the class DockerEnvironmentFactory can create the Python harness container.
Unfortunately, I can't use the same solution when Flink is running in Kubernetes. Moreover, I don't want to create the Python harness container using the Docker command from my pods.
It seems that the Beam runner automatically selects Docker for executing Python pipelines. However, I noticed there is an implementation called ExternalEnvironmentFactory, but I am not sure how to use it.
Is there a way to deploy a side container and use a different factory to run the Python harness process? What is the correct approach?
This is the patch for DockerEnvironmentFactory:
diff -pr beam-release-2.15.0/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerEnvironmentFactory.java beam-release-2.15.0-1/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerEnvironmentFactory.java
*** beam-release-2.15.0/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerEnvironmentFactory.java 2019-08-14 22:33:41.000000000 +0100
--- beam-release-2.15.0-1/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerEnvironmentFactory.java 2019-09-09 16:02:07.000000000 +0100
*************** package org.apache.beam.runners.fnexecut
*** 19,24 ****
--- 19,26 ----
import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.MoreObjects.firstNonNull;
+ import java.net.InetAddress;
+ import java.net.UnknownHostException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
*************** public class DockerEnvironmentFactory im
*** 127,133 ****
ImmutableList.<String>builder()
.addAll(gcsCredentialArgs())
// NOTE: Host networking does not work on Mac, but the command line flag is accepted.
! .add("--network=host")
// We need to pass on the information about Docker-on-Mac environment (due to missing
// host networking on Mac)
.add("--env=DOCKER_MAC_CONTAINER=" + System.getenv("DOCKER_MAC_CONTAINER"));
--- 129,135 ----
ImmutableList.<String>builder()
.addAll(gcsCredentialArgs())
// NOTE: Host networking does not work on Mac, but the command line flag is accepted.
! .add("--network=flink")
// We need to pass on the information about Docker-on-Mac environment (due to missing
// host networking on Mac)
.add("--env=DOCKER_MAC_CONTAINER=" + System.getenv("DOCKER_MAC_CONTAINER"));
*************** public class DockerEnvironmentFactory im
*** 222,228 ****
private static ServerFactory getServerFactory() {
ServerFactory.UrlFactory dockerUrlFactory =
! (host, port) -> HostAndPort.fromParts(DOCKER_FOR_MAC_HOST, port).toString();
if (RUNNING_INSIDE_DOCKER_ON_MAC) {
// If we're already running in a container, we need to use a fixed port range due to
// non-existing host networking in Docker-for-Mac. The port range needs to be published
--- 224,230 ----
private static ServerFactory getServerFactory() {
ServerFactory.UrlFactory dockerUrlFactory =
! (host, port) -> HostAndPort.fromParts(getCanonicalHostName(), port).toString();
if (RUNNING_INSIDE_DOCKER_ON_MAC) {
// If we're already running in a container, we need to use a fixed port range due to
// non-existing host networking in Docker-for-Mac. The port range needs to be published
*************** public class DockerEnvironmentFactory im
*** 237,242 ****
--- 239,252 ----
}
}
+ private static String getCanonicalHostName() throws RuntimeException {
+ try {
+ return InetAddress.getLocalHost().getCanonicalHostName();
+ } catch (UnknownHostException e) {
+ throw new RuntimeException(e);
+ }
+ }
+
/** Provider for DockerEnvironmentFactory. */
public static class Provider implements EnvironmentFactory.Provider {
private final boolean retainDockerContainer;
*************** public class DockerEnvironmentFactory im
*** 269,275 ****
public ServerFactory getServerFactory() {
switch (getPlatform()) {
case LINUX:
! return ServerFactory.createDefault();
case MAC:
return DockerOnMac.getServerFactory();
default:
--- 279,286 ----
public ServerFactory getServerFactory() {
switch (getPlatform()) {
case LINUX:
! return DockerOnMac.getServerFactory();
! // return ServerFactory.createDefault();
case MAC:
return DockerOnMac.getServerFactory();
default:
This is the Docker compose file I use to run Flink:
version: '3.4'
services:
  jobmanager:
    image: tenx/flink:1.8.1
    command: 'jobmanager'
    environment:
      JOB_MANAGER_RPC_ADDRESS: 'jobmanager'
      DOCKER_MAC_CONTAINER: 1
      FLINK_JM_HEAP: 128
    volumes:
      - jobmanager-data:/data
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - target: 8081
        published: 8081
        protocol: tcp
        mode: ingress
    networks:
      - flink
  taskmanager:
    image: tenx/flink:1.8.1
    command: 'taskmanager'
    environment:
      JOB_MANAGER_RPC_ADDRESS: 'jobmanager'
      DOCKER_MAC_CONTAINER: 1
      FLINK_TM_HEAP: 1024
      TASK_MANAGER_NUMBER_OF_TASK_SLOTS: 2
    networks:
      - flink
    volumes:
      - taskmanager-data:/data
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/folders:/var/folders
volumes:
  jobmanager-data:
  taskmanager-data:
networks:
  flink:
    external: true
This is my Python pipeline:
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class LogElements(beam.PTransform):
    class _LoggingFn(beam.DoFn):
        def __init__(self, prefix=''):
            super(LogElements._LoggingFn, self).__init__()
            self.prefix = prefix

        def process(self, element, **kwargs):
            logging.info(self.prefix + str(element))
            yield element

    def __init__(self, label=None, prefix=''):
        super(LogElements, self).__init__(label)
        self.prefix = prefix

    def expand(self, input):
        # Return the resulting PCollection so downstream transforms can consume it.
        return input | beam.ParDo(self._LoggingFn(self.prefix))


options = PipelineOptions(["--runner=PortableRunner", "--job_endpoint=localhost:8099"])
p = beam.Pipeline(options=options)
(p | beam.Create([1, 2, 3, 4, 5]) | LogElements())
p.run()
This is how I run the job service:
gradle :runners:flink:1.8:job-server:runShadow -PflinkMasterUrl=localhost:8081
Docker is automatically selected for executing the Python harness.
I can change the image used to run the Python container:
options = PipelineOptions(["--runner=PortableRunner", "--job_endpoint=localhost:8099", "--environment_type=DOCKER", "--environment_config=beam/python:latest"])
I can disable Docker and enable the ExternalEnvironmentFactory:
options = PipelineOptions(["--runner=PortableRunner", "--job_endpoint=localhost:8099", "--environment_type=EXTERNAL", "--environment_config=server"])
but I have to implement some callback answering on http://server:80.
Is there an implementation available?
To answer the question above: you want to add a beam_worker_pool container alongside the Flink task manager container in the same pod. So in the YAML file that you use to deploy the Flink task managers, add a new container:
- name: beam-worker-pool
  image: apache/beam_python3.7_sdk:2.22.0
  args: ["--worker_pool"]
  ports:
  - containerPort: 50000
    name: pool
  livenessProbe:
    tcpSocket:
      port: 50000
    initialDelaySeconds: 30
    periodSeconds: 60
  volumeMounts:
  - name: flink-config-volume
    mountPath: /opt/flink/conf/
  securityContext:
    runAsUser: 9999
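The tcpSocket liveness probe in the container spec above only checks that port 50000 accepts connections. For manual debugging, a rough Python equivalent might look like the sketch below; the function name port_is_open is invented for this example, and it assumes the check runs somewhere that shares the pod's network namespace (for instance inside the task manager container):

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: run from inside the task manager pod, where the worker
    # pool sidecar should be listening on localhost:50000.
    print(port_is_open("localhost", 50000))
```

A successful connection only shows that the worker pool process is listening; it says nothing about whether a job can actually start workers through it.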
I found the solution. The new version of Apache Beam 2.16.0 provides an implementation to use in combination with environment type EXTERNAL. The implementation is based on worker_pool_main which has been created to support Kubernetes.
I know it is a bit outdated but there is a Flink operator for Kubernetes now.
Here are examples of how to run Apache Beam with Flink using an operator:
https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/tree/master/examples/beam

Change root path for Spark Web UI?

I'm working to set up Jupyter notebook servers on Kubernetes that are able to launch pyspark. Each user is able to have multiple servers running at once, and would access each by navigating to the appropriate host combined with a path to the server's fully-qualified name. For example: http://<hostname>/<username>/<notebook server name>.
I have a top-level function defined that allows a user to create a SparkSession that points to the Kubernetes master URL and sets their pod to be the Spark driver.
This is all well and good, but I would like to enable end users to access the URL for the Spark Web UI so that they can track their jobs. The Spark on Kubernetes documentation has port forwarding as their recommended scheme for achieving this. It seems to me that for any security-minded organization, allowing any random user to set up port forwarding in this way would be unacceptable.
I would like to use a Kubernetes Ingress definition to allow external access to the driver's Spark Web UI. I've set up something like the following:
# Service
apiVersion: v1
kind: Service
metadata:
  namespace: <notebook namespace>
  name: <username>-<notebook server name>-svc
spec:
  type: ClusterIP
  sessionAffinity: None
  selector:
    app: <username>-<notebook server name>-notebook
  ports:
  - name: app-svc-port
    protocol: TCP
    port: 8888
    targetPort: 8888
  - name: spark-ui-port
    protocol: TCP
    port: 4040
    targetPort: 4040
---
# Ingress
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  namespace: workspace
  name: <username>-<notebook server name>-ing
  annotations:
    kubernetes.io/ingress.class: traefik
spec:
  rules:
  - host: <hostname>
    http:
      paths:
      - path: /<username>/<notebook server name>
        backend:
          serviceName: <username>-<notebook server name>-svc
          servicePort: app-svc-port
      - path: /<username>/<notebook server name>/spark-ui
        backend:
          serviceName: <username>-<notebook server name>-svc
          servicePort: spark-ui-port
However, under this setup, when I navigate to http://<hostname>/<username>/<notebook server name>/spark-ui/, I'm redirected to http://<hostname>/jobs. This is because /jobs is the default entry point to Spark's Web UI. However, I don't have an ingress rule for that path, and can't set such a rule since every user's Web UI would collide with each other in the load balancer (unless I have a misunderstanding, which is totally possible).
Under the Spark UI configuration settings, there doesn't seem to be a way to set a root path for the Spark session. You can change the port on which it runs, but what I'd like to do is make the UI serve at something like: http://<hostname>/<username>/<notebook server name>/spark-ui/<jobs, stages, etc>. Is there really no way of changing what comes after the hostname of the URL and before the last part?
1: Set your Spark config:
spark.ui.proxyBase: /foo
2: Set the nginx annotations in the Ingress:
annotations:
  nginx.ingress.kubernetes.io/proxy-redirect-from: http://$host/
  nginx.ingress.kubernetes.io/proxy-redirect-to: http://$host/foo/
3: Add the annotation to rewrite the target:
annotations:
  nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  rules:
  - host: <host>
    http:
      paths:
      - backend:
          serviceName: <service>
          servicePort: <port>
        path: /foo(/|$)(.*)
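To see what the path pattern and the rewrite-target /$2 do together, here is a small Python sketch that applies the same regular expression to a few request paths. It only imitates the capture-group substitution and is not how the ingress controller is implemented; the function name rewrite_target is made up for this example:

```python
import re

# Same pattern as the Ingress path, anchored at the start of the request path.
PATH_PATTERN = re.compile(r"^/foo(/|$)(.*)")


def rewrite_target(path):
    """Imitate rewrite-target: /$2 for paths matching /foo(/|$)(.*)."""
    match = PATH_PATTERN.match(path)
    if match is None:
        return None  # the Ingress rule would not match this path
    # $2 is the second capture group: everything after "/foo/".
    return "/" + match.group(2)


if __name__ == "__main__":
    for path in ["/foo", "/foo/", "/foo/jobs/", "/foobar"]:
        print(path, "->", rewrite_target(path))
```

The /foo prefix is stripped before the request reaches Spark, while spark.ui.proxyBase makes Spark generate links that include the prefix again.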
Yes, you can achieve this by setting the spark.ui.proxyBase property, either in spark-defaults.conf or at runtime.
Example:
echo "spark.ui.proxyBase $SPARK_UI_PROXYBASE" >> /opt/spark/conf/spark-defaults.conf;
Then this should work.
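For the per-user layout described in the question, the value of spark.ui.proxyBase would differ for every notebook server. A minimal sketch of deriving it, assuming the /<username>/<notebook server name>/spark-ui path scheme from the question; both helper names are hypothetical:

```python
def proxy_base_for(username, server_name):
    """Build the URL prefix under which one notebook's Spark UI is served."""
    return f"/{username}/{server_name}/spark-ui"


def spark_defaults_line(proxy_base):
    """Format the property the way spark-defaults.conf expects it."""
    return f"spark.ui.proxyBase {proxy_base}"


if __name__ == "__main__":
    base = proxy_base_for("alice", "analysis-1")
    print(spark_defaults_line(base))
```

The same key and value could also be set when the SparkSession is built instead of being appended to spark-defaults.conf.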

uwsgi master graceful shutdown

I'm running a uwsgi+flask application.
The app is running as a k8s pod.
When I deploy a new pod (a new version), the existing pod gets SIGTERM.
This causes the master to stop accepting new connections at that very moment, which causes issues as the LB still passes requests to the pod (for a few more seconds).
I would like the master to wait 30 sec BEFORE it stops accepting new connections (when getting SIGTERM), but couldn't find a way. Is it possible?
My uwsgi.ini file:
[uwsgi]
;https://uwsgi-docs.readthedocs.io/en/latest/HTTP.html
http = :8080
wsgi-file = main.py
callable = wsgi_application
processes = 2
enable-threads = true
master = true
reload-mercy = 30
worker-reload-mercy = 30
log-5xx = true
log-4xx = true
disable-logging = true
stats = 127.0.0.1:1717
stats-http = true
single-interpreter= true
;https://github.com/containous/traefik/issues/615
http-keepalive=true
add-header = Connection: Keep-Alive
It seems this is not possible to achieve using uwsgi:
https://github.com/unbit/uwsgi/issues/1974
The solution (as mentioned in this Kubernetes issue):
https://github.com/kubernetes/contrib/issues/1140
is to use the preStop hook. It is quite ugly, but it helps achieve zero downtime:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep","5"]
The template is taken from this answer: https://stackoverflow.com/a/39493421/3659858
Another option is to use the CLI option:
--hook-master-start "unix_signal:15 gracefully_kill_them_all"
or in the .ini file (remove the double quotes):
hook-master-start = unix_signal:15 gracefully_kill_them_all
which will gracefully terminate workers after receiving a SIGTERM (signal 15).
See the following for reference.
When I tried the above though, it didn't work as expected from within a docker container. Instead, you can also use uWSGI's Master FIFO file. The Master FIFO file can be specified like:
--master-fifo <filename>
or
master-fifo = /tmp/master-fifo
Then you can simply write a q character to the file and it will gracefully shut down your workers before exiting.
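Writing the q character can also be done from Python, for example from a small shutdown script. A minimal sketch, assuming the FIFO path /tmp/master-fifo from the .ini example above; the function name request_graceful_shutdown is made up for illustration:

```python
import os


def request_graceful_shutdown(fifo_path="/tmp/master-fifo"):
    """Write 'q' to the uWSGI Master FIFO to ask for a graceful shutdown."""
    # O_NONBLOCK makes os.open fail immediately (ENXIO) when nothing has the
    # FIFO open for reading, instead of blocking until a reader appears.
    fd = os.open(fifo_path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        os.write(fd, b"q")
    finally:
        os.close(fd)


if __name__ == "__main__":
    request_graceful_shutdown()
```

Opening the FIFO in non-blocking mode is a design choice for scripts: if the uWSGI master is not running, the call raises OSError right away rather than hanging.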
