I am running my backend using Python and Django with uWSGI. We recently migrated it to Kubernetes (GKE) and our pods are consuming a LOT of memory and the rest of the cluster is starving for resources. We think that this might be related to the uWSGI configuration.
This is our yaml for the pods:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-pod
  namespace: my-namespace
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10
      maxUnavailable: 10
  selector:
    matchLabels:
      app: my-pod
  template:
    metadata:
      labels:
        app: my-pod
    spec:
      containers:
      - name: web
        image: my-img:{{VERSION}}
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8000
          protocol: TCP
        command: ["uwsgi", "--http", ":8000", "--wsgi-file", "onyo/wsgi.py", "--workers", "5", "--max-requests", "10", "--master", "--vacuum", "--enable-threads"]
        resources:
          requests:
            memory: "300Mi"
            cpu: 150m
          limits:
            memory: "2Gi"
            cpu: 1
        livenessProbe:
          httpGet:
            httpHeaders:
            - name: Accept
              value: application/json
            path: "/healthcheck"
            port: 8000
          initialDelaySeconds: 15
          timeoutSeconds: 5
          periodSeconds: 30
        readinessProbe:
          httpGet:
            httpHeaders:
            - name: Accept
              value: application/json
            path: "/healthcheck"
            port: 8000
          initialDelaySeconds: 15
          timeoutSeconds: 5
          periodSeconds: 30
        envFrom:
        - configMapRef:
            name: configmap
        - secretRef:
            name: secrets
        volumeMounts:
        - name: service-account-storage-credentials-volume
          mountPath: /credentials
          readOnly: true
      - name: csql-proxy
        image: gcr.io/cloudsql-docker/gce-proxy:1.11
        command: ["/cloud_sql_proxy",
                  "-instances=my-project:region:backend=tcp:1234",
                  "-credential_file=/secrets/credentials.json"]
        ports:
        - containerPort: 1234
          name: sql
        securityContext:
          runAsUser: 2  # non-root user
          allowPrivilegeEscalation: false
        volumeMounts:
        - name: credentials
          mountPath: /secrets/sql
          readOnly: true
      volumes:
      - name: credentials
        secret:
          secretName: credentials
      - name: volume
        secret:
          secretName: production
          items:
          - key: APPLICATION_CREDENTIALS_CONTENT
            path: key.json
We are using the same uWSGI configuration that we had before the migration (when the backend ran on a VM).
Is there a best practice config for running uWSGI in K8s? Or maybe something that I am doing wrong in this particular config?
You enabled 5 workers in uWSGI, which can mean roughly 5 times the memory footprint if your application uses lazy-loading techniques (my advice: load everything at startup and trust pre-fork; check this). Alternatively, you could try reducing the number of workers and raising the number of threads instead.
Also, you should drop max-requests: with a value of 10 it makes your app reload every 10 requests, which makes no sense in a production environment (doc). If you have trouble with memory leaks, use reload-on-rss instead.
I would do something like this, with fewer or more threads per worker depending on how your app uses them (adjust according to CPU usage/availability per pod in production):
command: ["uwsgi", "--http", ":8000", "--wsgi-file", "onyo/wsgi.py", "--workers", "2", "--threads", "10", "--master", "--vacuum", "--enable-threads"]
PS: as zerg said in a comment, you should of course make sure your app is not running in DEBUG mode, and keep logging output low.
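For the memory-leak case specifically, here is a sketch of the same command with reload-on-rss added; the 250 MB threshold is only an illustrative value, pick something below your pod's memory limit:
command: ["uwsgi", "--http", ":8000", "--wsgi-file", "onyo/wsgi.py", "--workers", "2", "--threads", "10", "--master", "--vacuum", "--enable-threads", "--reload-on-rss", "250"]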
Related
Hi, I'm planning to upgrade my Airflow version from 1.11 to 1.15, which is deployed in OpenShift. As there is a very large number of DAGs, I planned to upgrade to the bridge release rather than going straight to Airflow 2.2.
The error I'm getting is most probably due to the Fernet key:
ERROR: The `secret_key` setting under the webserver config has an insecure value - Airflow has
failed safe and refuses to start. Please change this value to a new, per-environment,
randomly generated string, for example using this command `openssl rand -hex 30`
Earlier I was using a static Fernet key, and the YAML file is as follows:
apiVersion: v1
kind: Secret
metadata:
  name: airflow-secret
  namespace: CUSTOM_NAMESPACE
  labels:
    app: airflow
type: Opaque
stringData:
  fernet-key: my_fernet_key
My Python version: 3.8
My Airflow webserver config:
apiVersion: v1
kind: DeploymentConfig
metadata:
  name: airflow-webserver
  namespace: CUSTOM_NAMESPACE
  labels:
    app: airflow
spec:
  strategy:
    type: Rolling
  triggers:
  - type: ConfigChange
  - type: ImageChange
    imageChangeParams:
      automatic: true
      containerNames:
      - airflow-webserver
      from:
        kind: ImageStreamTag
        namespace: CUSTOM_NAMESPACE
  replicas: 1
  revisionHistoryLimit: 10
  paused: false
  selector:
    app: airflow
    deploymentconfig: airflow-webserver
  template:
    metadata:
      labels:
        name: airflow-webserver
        app: airflow
        deploymentconfig: airflow-webserver
    spec:
      volumes:
      - name: airflow-dags
        persistentVolumeClaim:
          claimName: airflow-dags
      containers:
      - name: airflow-webserver
        image: airflow:latest
        resources:
          limits:
            memory: 4Gi
        env:
        - name: FERNET_KEY
          valueFrom:
            secretKeyRef:
              name: airflow-secrets
              key: fernet-key
        - name: SERVICE_ACCOUNT_NAME
          valueFrom:
            secretKeyRef:
              name: airflow-service-account
              key: service-account-name
        ports:
        - containerPort: 8080
          protocol: TCP
        volumeMounts:
        - name: airflow-dags
          mountPath: /opt/airflow/dags
        - name: airflow-logs
          mountPath: /opt/airflow/logs
My understanding is that we need to somehow provide a dynamic value for the Fernet key, but in my case it's static. Is there any possible way to resolve the error?
Thanks!
The main issue here is the default value stored in airflow.cfg, i.e.
secret_key = temporary_value
We can generate a new secret_key as the error message suggests:
openssl rand -hex 30
Suppose the value is 94b9d6124ff2e9a5783d94dc7aa3641ebb8929bdbbf2f3989402f9e400ac
We then need to put that value into secret_key in airflow.cfg:
secret_key = 94b9d6124ff2e9a5783d94dc7aa3641ebb8929bdbbf2f3989402f9e400ac
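If you would rather not bake the value into airflow.cfg, Airflow can also read any config option from an environment variable named AIRFLOW__<SECTION>__<KEY>, so the secret key can be injected from a Secret the same way FERNET_KEY is injected above. A sketch, where the webserver-secret-key entry is an assumed addition to the existing airflow-secret Secret:
# added to the Secret's stringData (key name is an assumption)
webserver-secret-key: 94b9d6124ff2e9a5783d94dc7aa3641ebb8929bdbbf2f3989402f9e400ac
# added to the webserver container's env
- name: AIRFLOW__WEBSERVER__SECRET_KEY
  valueFrom:
    secretKeyRef:
      name: airflow-secret
      key: webserver-secret-key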
I have a very simple Kubernetes cluster on GKE (Google Cloud Platform) with 1 node (4 vCPU / 16 GB RAM).
This morning I was load testing an API (written in Python) on this cluster with Locust. I only have one endpoint on my API, so I prepared a Locust file to run different configurations against that endpoint (with random parameters, etc.).
I first ran my Locust test with 10 users, 1 user spawned every second (until the limit was reached), with a request wait time between 1 and 2.5 seconds. I tested it first on just 1 pod and then ran the exact same test on 3 pods. I noticed almost no improvement with 3 pods, and the response times were just more spread out.
Here are the results for 1 pod:
And the charts:
And here are the results for 3 pods:
And the charts:
I am fairly new to K8s and I don't really understand everything, but my initial thought was that increasing the number of running pods should improve performance, right? I should be able to handle 3 times as many requests in the same time, but the results are far from that expected conclusion.
And here is the YAML file defining the deployment on the K8s cluster (note: I set the replicas to 1 or 3 myself because for some reason the autoscaler doesn't work):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: app_name
  name: app_name
  namespace: default
spec:
  replicas: 3 # or 1
  selector:
    matchLabels:
      app: app_name
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: app_name
    spec:
      containers:
      - image: >-
          gcr.io/path_to_image
        imagePullPolicy: IfNotPresent
        name: app_name-v2-sha256-1
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app: app_name
  name: app_name-hpa-cbrt
  namespace: default
spec:
  maxReplicas: 3
  metrics:
  - resource:
      name: cpu
      targetAverageUtilization: 80
    type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app_name
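Note: one possible reason the autoscaler stays inactive (an assumption, since no resources block is shown in the Deployment above) is that a CPU-based HorizontalPodAutoscaler can only compute utilization when the container declares CPU requests. A minimal sketch of such a block, with placeholder values, to be added under the container in the Deployment:
# hypothetical addition under spec.template.spec.containers[0]
resources:
  requests:
    cpu: 500m     # placeholder, tune to the app
    memory: 512Mi # placeholder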
Additional info: the API calls a MySQL database, runs a query that returns approx. 70k rows, and does some processing on those results. It only returns a JSON payload with a few rows at the end.
Thanks in advance for any help you can provide!
Edit: here is the definition of the load balancer; it's GCP-generated and untouched:
apiVersion: v1
kind: Service
metadata:
  annotations:
    cloud.google.com/neg: '{"ingress":true}'
  creationTimestamp: "2021-05-11T10:03:16Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app: app_name
    app.kubernetes.io/managed-by: gcp-cloud-build-deploy
    app.kubernetes.io/name: app_name
    app.kubernetes.io/version: b8a19c9rth54erthe4rth8459633451fe8e73e038
  name: app_service
  namespace: default
  resourceVersion: "87954231"
  selfLink: /api/v1/namespaces/default/services/app_service
  uid: service_uid
spec:
  clusterIP: 661.661.661.661
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 30129
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: app_name
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 666.666.666.666
I have a Kubernetes cluster that uses an Ingress to forward traffic to a frontend React app and a backend Flask app. My problem is that the React app only works if the rewrite-target annotation is not set, and the Flask app only works if it is.
How can I make my Flask app accessible without setting this value (commented out in the YAML below)?
Here is the Ingress resource:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thesis-ingress
  namespace: thesis
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/add-base-url: "true"
    # nginx.ingress.kubernetes.io/rewrite-target: /$1
    nginx.ingress.kubernetes.io/service-upstream: "true"
spec:
  tls:
  - hosts:
    - thesis
    secretName: ingress-tls
  rules:
  - host: thesis.info
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend
            port:
              number: 3000
      - path: /backend
        pathType: Prefix
        backend:
          service:
            name: backend
            port:
              number: 5000
Your question didn't specify, but I'm guessing your capture group was meant to rewrite /backend/(.+) to /$1; on that assumption:
Be aware that annotations are per-Ingress, but all Ingress resources are unioned across the cluster to comprise the whole of the configuration. Thus, if you need one rule with a rewrite and one without, just create two Ingress resources:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thesis-frontend
  namespace: thesis
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/add-base-url: "true"
    nginx.ingress.kubernetes.io/service-upstream: "true"
spec:
  tls:
  - hosts:
    - thesis
    secretName: ingress-tls
  rules:
  - host: thesis.info
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend
            port:
              number: 3000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thesis-backend
  namespace: thesis
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/add-base-url: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$1
    nginx.ingress.kubernetes.io/service-upstream: "true"
spec:
  tls:
  - hosts:
    - thesis
    secretName: ingress-tls
  rules:
  - host: thesis.info
    http:
      paths:
      - path: /backend/(.+)
        pathType: ImplementationSpecific
        backend:
          service:
            name: backend
            port:
              number: 5000
I'm trying to patch a deployment and remove its volumes using patch_namespaced_deployment (https://github.com/kubernetes-client/python) with the following arguments, but it's not working.
patch_namespaced_deployment(
    name=deployment_name,
    namespace='default',
    body={
        "spec": {
            "template": {
                "spec": {
                    "volumes": None,
                    "containers": [{"name": container_name, "volumeMounts": None}],
                }
            }
        }
    },
    pretty='true',
)
How to reproduce it:
Create this file app.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    volume: pv0001
  name: pv0001
  resourceVersion: "227035"
  selfLink: /api/v1/persistentvolumes/pv0001
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 5Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: myclaim
    namespace: default
    resourceVersion: "227033"
  hostPath:
    path: /mnt/pv-data/pv0001
    type: ""
  persistentVolumeReclaimPolicy: Recycle
  volumeMode: Filesystem
status:
  phase: Bound
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pv-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mypv
  template:
    metadata:
      labels:
        app: mypv
    spec:
      containers:
      - name: shell
        image: centos:7
        command:
        - "bin/bash"
        - "-c"
        - "sleep 10000"
        volumeMounts:
        - name: mypd
          mountPath: "/tmp/persistent"
      volumes:
      - name: mypd
        persistentVolumeClaim:
          claimName: myclaim
- kubectl apply -f app.yaml
- kubectl describe deployment.apps/pv-deploy (to check the volumeMounts and Volumes)
- kubectl patch deployment.apps/pv-deploy --patch '{"spec": {"template": {"spec": {"volumes": null, "containers": [{"name": "shell", "volumeMounts": null}]}}}}'
- kubectl describe deployment.apps/pv-deploy (to check the volumeMounts and Volumes)
- Delete the application now: kubectl delete -f app.yaml
- kubectl create -f app.yaml
- Patch the deployment using the python library function as stated above. The *VolumeMounts* section is removed but the Volumes still exist.
** EDIT **
Running the kubectl patch command works as expected. But after executing the Python script and running a describe deployment command, the persistentVolumeClaim is replaced with an emptyDir, like this:
Volumes:
  mypd:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
What you're trying to do is called a strategic merge patch (https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/). As the documentation notes, with a strategic merge patch a list is either replaced or merged depending on its patch strategy, so this may be why you're seeing this behavior.
I think you should go with replace (https://jamesdefabia.github.io/docs/user-guide/kubectl/kubectl_replace/) and, instead of trying to manage part of your Deployment object, replace it with a new one.
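A minimal sketch of that read-modify-replace approach with the Python client, assuming the pv-deploy Deployment from the repro above and a kubeconfig available where the script runs:
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
apps_v1 = client.AppsV1Api()

# Read the current Deployment, strip volumes/volumeMounts, then replace it wholesale.
dep = apps_v1.read_namespaced_deployment(name="pv-deploy", namespace="default")
dep.spec.template.spec.volumes = None
for container in dep.spec.template.spec.containers:
    container.volume_mounts = None

apps_v1.replace_namespaced_deployment(name="pv-deploy", namespace="default", body=dep)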
I am rather new to both Kubernetes and TensorFlow and am trying to run the basic Kubeflow distributed-TensorFlow example from this link (https://github.com/learnk8s/distributed-tensorflow-on-k8s). I am currently running a local bare-metal Kubernetes cluster with 2 nodes (1 master and 1 worker). Everything works fine when I run it in minikube (following the documentation): both training and serving run successfully. But running the job on the local cluster gives me the error shown below.
Any help would be appreciated.
For this setup, I created a pod for NFS storage to be used by the jobs. Because the local cluster doesn't have dynamic provisioning enabled, I created the persistent volume manually (the files used are attached).
NFS storage pod/service file:
kind: Service
apiVersion: v1
metadata:
  name: nfs-service
spec:
  selector:
    role: nfs-service
  ports:
  # Open the ports required by the NFS server
  - name: nfs
    port: 2049
  - name: mountd
    port: 20048
  - name: rpcbind
    port: 111
---
kind: Pod
apiVersion: v1
metadata:
  name: nfs-server-pod
  labels:
    role: nfs-service
spec:
  containers:
  - name: nfs-server-container
    image: cpuguy83/nfs-server
    securityContext:
      privileged: true
    args:
    # Pass the paths to share to the Docker image
    - /exports
Persistent Volume & PVC file:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  storageClassName: "standard"
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: 10.96.72.11
    path: "/"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: "standard"
  resources:
    requests:
      storage: 10Gi
TFJob File:
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: tfjob1
spec:
  replicaSpecs:
  - replicas: 1
    tfReplicaType: MASTER
    template:
      spec:
        volumes:
        - name: nfs-volume
          persistentVolumeClaim:
            claimName: nfs
        containers:
        - name: tensorflow
          image: learnk8s/mnist:1.0.0
          imagePullPolicy: IfNotPresent
          args:
          - --model_dir
          - ./out/vars
          - --export_dir
          - ./out/models
          volumeMounts:
          - mountPath: /app/out
            name: nfs-volume
        restartPolicy: OnFailure
  - replicas: 2
    tfReplicaType: WORKER
    template:
      spec:
        containers:
        - name: tensorflow
          image: learnk8s/mnist:1.0.0
          imagePullPolicy: IfNotPresent
        restartPolicy: OnFailure
  - replicas: 1
    tfReplicaType: PS
    template:
      spec:
        volumes:
        - name: nfs-volume
          persistentVolumeClaim:
            claimName: nfs
        containers:
        - name: tensorflow
          image: learnk8s/mnist:1.0.0
          imagePullPolicy: IfNotPresent
          volumeMounts:
          - mountPath: /app/out
            name: nfs-volume
        restartPolicy: OnFailure
When I run the job, it gives me this error:
error: unable to recognize "kube/tfjob.yaml": no matches for kind "TFJob" in version "kubeflow.org/v1alpha1"
After searching a little, someone pointed out that "v1alpha1" could be outdated and that I should use "v1beta1" (strangely, "v1alpha1" was working with my minikube setup, so I am very confused!). With that change the TFJob gets created, but I do not see any new containers starting, as opposed to the minikube run, where new pods start and finish successfully. When I describe the TFJob, I see this error:
Type     Reason            Age  From         Message
----     ------            ---- ----         -------
Warning  InvalidTFJobSpec  22s  tf-operator  Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob
Since the only difference is the NFS storage, I think there might be something wrong with my manual setup. Please let me know if I messed up somewhere, because I do not have enough background!
I found the issue that was causing this specific error. First, the API version changed, so I had to move from v1alpha1 to v1beta2. Second, the tutorial I followed was using Kubeflow v0.1.2 (rather old), and the syntax for defining a TFJob in the YAML file has changed since then (not exactly sure in which version the change happened!). By looking at the latest example in the Git repo I was able to update the job spec. Here are the files for anyone interested!
Tutorial version:
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: tfjob1
spec:
  replicaSpecs:
  - replicas: 1
    tfReplicaType: MASTER
    template:
      spec:
        volumes:
        - name: nfs-volume
          persistentVolumeClaim:
            claimName: nfs
        containers:
        - name: tensorflow
          image: learnk8s/mnist:1.0.0
          imagePullPolicy: IfNotPresent
          args:
          - --model_dir
          - ./out/vars
          - --export_dir
          - ./out/models
          volumeMounts:
          - mountPath: /app/out
            name: nfs-volume
        restartPolicy: OnFailure
  - replicas: 2
    tfReplicaType: WORKER
    template:
      spec:
        containers:
        - name: tensorflow
          image: learnk8s/mnist:1.0.0
          imagePullPolicy: IfNotPresent
        restartPolicy: OnFailure
  - replicas: 1
    tfReplicaType: PS
    template:
      spec:
        volumes:
        - name: nfs-volume
          persistentVolumeClaim:
            claimName: nfs
        containers:
        - name: tensorflow
          image: learnk8s/mnist:1.0.0
          imagePullPolicy: IfNotPresent
          volumeMounts:
          - mountPath: /app/out
            name: nfs-volume
        restartPolicy: OnFailure
Updated version:
apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: tfjob1
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          volumes:
          - name: nfs-volume
            persistentVolumeClaim:
              claimName: nfs
          containers:
          - name: tensorflow
            image: learnk8s/mnist:1.0.0
            imagePullPolicy: IfNotPresent
            args:
            - --model_dir
            - ./out/vars
            - --export_dir
            - ./out/models
            volumeMounts:
            - mountPath: /app/out
              name: nfs-volume
          restartPolicy: OnFailure
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: learnk8s/mnist:1.0.0
            imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    PS:
      replicas: 1
      template:
        spec:
          volumes:
          - name: nfs-volume
            persistentVolumeClaim:
              claimName: nfs
          containers:
          - name: tensorflow
            image: learnk8s/mnist:1.0.0
            imagePullPolicy: IfNotPresent
            volumeMounts:
            - mountPath: /app/out
              name: nfs-volume
          restartPolicy: OnFailure
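If it helps, applying and checking the updated spec would look like this (assuming it is saved as kube/tfjob.yaml, the path from the error message earlier):
kubectl apply -f kube/tfjob.yaml
kubectl describe tfjob tfjob1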