Kubernetes python pod profiling - python

I am running a Python pod on Kubernetes and one main pod keeps restarting.
Its memory usage increases continuously until Kubernetes restarts it, since liveness and readiness probes are configured there.
I am using Flask with Python 3.5 and Socket.IO.
Is there any way I can profile the pod on Kubernetes without making code changes, for example by installing an agent? Please let me know.
The container is terminated with exit code 137 (typically an out-of-memory kill).
Thanks in advance.

You are using GKE, right?
You should use Stackdriver Monitoring to profile, capture and analyze metrics, and Stackdriver Logging to understand what's happening.
Stackdriver Kubernetes Engine Monitoring is the default option starting with GKE version 1.14. It's really intuitive, but some knowledge and understanding of the platform is required. You should be able to create a graph based on memory utilization.
Have a look at the documentation:
Stackdriver support for GKE
Stackdriver monitoring

If you want an open source solution, you can do this with Robusta. Disclaimer: I wrote this.
Essentially it injects tracemalloc into your pod on demand and sends you the results in Slack. No restart needed.
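If a small, temporary code change is acceptable as a fallback, the same tracemalloc module from the standard library can be used directly to compare allocations over time. A minimal sketch (the /debug/memory endpoint and the app object are placeholders, not anything Robusta requires):

# Minimal tracemalloc sketch: expose the top allocation sites via a debug-only endpoint.
import tracemalloc
from flask import Flask

app = Flask(__name__)
tracemalloc.start(25)  # record up to 25 frames of traceback per allocation

@app.route("/debug/memory")
def memory_top():
    # Take a snapshot and report the biggest allocation sites by line number.
    snapshot = tracemalloc.take_snapshot()
    top = snapshot.statistics("lineno")[:10]
    return "\n".join(str(stat) for stat in top), 200, {"Content-Type": "text/plain"}

Hitting the endpoint periodically and diffing the output shows which lines keep accumulating memory.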

Related

Getting data from DataDog API to Snowflake with Airflow

Usually things go the other way around: you use DataDog to monitor Airflow. In my case, however, I need to fetch DataDog metrics through the DataDog API from Airflow so I can load them into a Snowflake table.
The idea is to use this table to build an alerting system in ThoughtSpot Cloud for when Kafka lag happens, since ThoughtSpot Cloud doesn't support calling an API, at least not in the cloud version.
I have been googling these options endlessly but haven't found a simpler or more optimal solution. Any advice is highly appreciated.
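One possible shape for this, sketched under the assumption that a plain requests call to the Datadog v1 metrics query endpoint and the Snowflake Python connector are acceptable; the DAG name, the metric query, the credential placeholders and the DATADOG_METRICS table are illustrative, not from the question:

# Hypothetical Airflow DAG: pull a Datadog metric and load the points into Snowflake.
import time
from datetime import datetime

import requests
import snowflake.connector
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def fetch_and_load():
    now = int(time.time())
    resp = requests.get(
        "https://api.datadoghq.com/api/v1/query",
        headers={"DD-API-KEY": "<dd_api_key>", "DD-APPLICATION-KEY": "<dd_app_key>"},
        params={"from": now - 3600, "to": now,
                "query": "avg:kafka.consumer_lag{*} by {consumer_group}"},  # illustrative query
    )
    resp.raise_for_status()
    series = resp.json().get("series", [])

    conn = snowflake.connector.connect(user="<user>", password="<password>",
                                       account="<account>", database="<db>", schema="<schema>")
    try:
        cur = conn.cursor()
        for s in series:
            for ts_ms, value in s.get("pointlist", []):
                cur.execute(
                    "INSERT INTO DATADOG_METRICS (metric, ts, value) VALUES (%s, %s, %s)",
                    (s["metric"], int(ts_ms / 1000), value),
                )
    finally:
        conn.close()

with DAG("datadog_to_snowflake", start_date=datetime(2021, 1, 1),
         schedule_interval="*/15 * * * *", catchup=False) as dag:
    PythonOperator(task_id="fetch_and_load", python_callable=fetch_and_load)

ThoughtSpot can then read the table directly, keeping the alerting logic on its side.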

http request on azure VM for python

I know this is not a direct code question but more of a best-practices one. I have several Azure HTTP functions running, but they time out due to long calculations. I have added Durable orchestrations, but even those time out.
Since certain processes are long and time-consuming (e.g. training an AI model), I have switched to an Azure VM. What I would like to add is the ability to start a Python task from an HTTP request on my Azure VM,
basically doing exactly what the Azure HTTP functions do. What would be the best way to do this? Any good documentation or recommendations would be much appreciated. In short: running an API on my VM in Python.
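One common pattern, sketched here with Flask and a background thread (the /train and /status routes and the train_model placeholder are illustrative, not from the question), is a small API on the VM that accepts the request, starts the long-running job, and returns immediately:

# Minimal sketch: an HTTP endpoint on the VM that starts a long-running job in the background.
import threading
import uuid
from flask import Flask, jsonify

app = Flask(__name__)
jobs = {}  # job_id -> status (in-memory only; use a real store for anything serious)

def train_model(job_id):
    # Placeholder for the long-running work (e.g. model training).
    jobs[job_id] = "running"
    # ... do the heavy lifting ...
    jobs[job_id] = "done"

@app.route("/train", methods=["POST"])
def start_training():
    job_id = str(uuid.uuid4())
    threading.Thread(target=train_model, args=(job_id,), daemon=True).start()
    return jsonify({"job_id": job_id}), 202  # respond immediately, work continues in background

@app.route("/status/<job_id>")
def status(job_id):
    return jsonify({"status": jobs.get(job_id, "unknown")})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

For production you would typically run this behind gunicorn/nginx and hand the work to a proper queue (e.g. Celery) instead of a bare thread, but the request/worker split stays the same.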

Enable Live Metrics on Application Insights for a Docker based Python Function App

I have a Docker-based Python Function App running, which is connected to an Application Insights resource. I get all the usual metrics, but Live Metrics fails, telling me "Not available: your app is offline or using an older SDK".
I am using the azure-functions/python:4-python3.9-appservice image as a base. If I remember correctly, I was able to view Live Metrics when I simply deployed a Function App via ZIP deploy, but since switching to Docker this option has disappeared. I haven't been able to find the right information online to fix this, or to determine whether it is even possible.
AFAIK, the Live Metrics stream is currently not supported for Python.
The MSDOC says that the currently supported languages are .NET, Java and Node.js.
As an alternative, you can refer to the solution given by #AJG: create a LogHandler and write the messages into a Cosmos DB container. It will then stream to the console.
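A minimal sketch of that idea, assuming the azure-cosmos package and an existing database and container (the handler name, connection placeholders and document shape are illustrative):

# Hypothetical logging handler that writes log records to a Cosmos DB container.
import logging
import uuid
from azure.cosmos import CosmosClient

class CosmosLogHandler(logging.Handler):
    def __init__(self, url, key, database, container):
        super().__init__()
        client = CosmosClient(url, credential=key)
        self._container = client.get_database_client(database).get_container_client(container)

    def emit(self, record):
        try:
            # The document must also include the container's partition key field.
            self._container.create_item({
                "id": str(uuid.uuid4()),  # Cosmos requires an 'id' field
                "level": record.levelname,
                "logger": record.name,
                "message": self.format(record),
            })
        except Exception:
            self.handleError(record)

logger = logging.getLogger(__name__)
logger.addHandler(CosmosLogHandler("<cosmos-url>", "<cosmos-key>", "<db>", "<container>"))
logger.setLevel(logging.INFO)
logger.info("Function executed")  # ends up as a document in the container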

Unable to pinpoint issue with GCP Composer (Airflow) DAG task failure

I am new to Apache Airflow.
Some operators in my DAG have a failed status, and I am trying to understand the origin of the error.
Here are the details of the problem:
My DAG is pretty big, and certain parts of it are composed of SubDAGs.
What I notice in the Composer UI is that the SubDAGs that failed all did so in a task_id named download_file, which uses XCom with a GoogleCloudStorageDownloadOperator.
>> GoogleCloudStorageDownloadOperator(
       task_id='download_file',
       bucket="sftp_sef",
       object="{{task_instance.xcom_pull(task_ids='find_file') | first }}",
       filename="/home/airflow/gcs/data/zips/{{{{ds_nodash}}}}_{0}.zip".format(table)
   )
The logs in the said Subdag do not show anything useful.
LOG:
[2020-04-07 15:19:25,618] {models.py:1359} INFO - Dependencies all met for
[2020-04-07 15:19:25,660] {models.py:1359} INFO - Dependencies all met for
[2020-04-07 15:19:25,660] {models.py:1577} INFO -
--------------------------------------------------------------------------------
Starting attempt 10 of 1
--------------------------------------------------------------------------------
[2020-04-07 15:19:25,685] {models.py:1599} INFO - Executing on 2020-04-06T11:44:31+00:00
[2020-04-07 15:19:25,685] {base_task_runner.py:118} INFO - Running: ['bash', '-c', 'airflow run datamart_integration.consentement_email download_file 2020-04-06T11:44:31+00:00 --job_id 156313 --pool integration --raw -sd DAGS_FOLDER/datamart/datamart_integration.py --cfg_path /tmp/tmpacazgnve']
I am not sure if there is somewhere else I should be checking... Here are my questions:
How do I debug errors in my Composer DAGs in general?
Is it a good idea to create a local Airflow environment to run and debug my DAGs locally?
How do I verify whether there are errors in XCom?
Regarding your three questions:
First, when using Cloud Composer you have several ways of debugging errors in your code. According to the documentation, you should:
Check the Airflow logs.
These logs are related to single DAG tasks. You can view them in the Cloud Storage logs folder and in the Airflow web interface.
When you create a Cloud Composer environment, a Cloud Storage bucket is also created and associated with it. Cloud Composer stores the logs for single DAG tasks in the logs folder inside this bucket; each workflow folder has a folder for its DAGs and sub-DAGs. You can check its structure here.
As for the Airflow web interface, it is refreshed every 60 seconds. You can read more about it here.
Review the Google Cloud's operations suite.
You can use Cloud Monitoring and Cloud Logging with Cloud Composer. Whereas Cloud Monitoring provides visibility into the performance and overall health of cloud-powered applications, Cloud Logging shows the logs produced by the scheduler and worker containers. You can use both, or just the one you find more useful for your needs.
In the Cloud Console, check for errors on the pages for the Google Cloud components running your environment.
In the Airflow web interface, check in the DAG's Graph View for failed task instances.
Thus, these are the steps recommended when troubleshooting your DAG.
Second, regarding testing and debugging, it is recommended that you separate your production and test environments to avoid DAG interference.
Furthermore, it is possible to test your DAG locally; there is a tutorial about this topic in the documentation, here. Testing locally allows you to identify syntax and task errors, as in the quick check sketched below. However, I must point out that it won't be possible to check/evaluate dependencies and communication with the database.
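As one quick local check (a sketch, assuming Airflow is installed locally and your DAG files sit in a dags/ folder), you can load the folder with DagBag and surface import and syntax errors before deploying to Composer:

# Quick local sanity check: load the DAG folder and surface import/syntax errors.
from airflow.models import DagBag

dag_bag = DagBag(dag_folder="dags/", include_examples=False)

if dag_bag.import_errors:
    for dag_file, error in dag_bag.import_errors.items():
        print(f"Import error in {dag_file}: {error}")
else:
    print(f"Loaded {len(dag_bag.dags)} DAGs without import errors")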
Third, in general, in order to verify errors in XCom you should:
Check if there is any error code/number;
Check your syntax against a sample code from the documentation;
Check whether the packages you use are deprecated.
I would like to point out that, according to this documentation, the path to GoogleCloudStorageDownloadOperator was updated and the operator is now GCSToLocalOperator.
In addition, I also encourage you to have a look at this code and documentation to check XCom syntax and errors.
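For reference, a minimal sketch of the push/pull pattern your task relies on, using Airflow 1.10-style imports and assuming an existing dag object; the _find_file callable and the file names are placeholders:

# Minimal sketch of the XCom push/pull pattern used by the failing task (Airflow 1.10-style).
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.gcs_download_operator import GoogleCloudStorageDownloadOperator

def _find_file(**kwargs):
    # Whatever is returned here is pushed to XCom under the key 'return_value'.
    return ["folder/20200406_file.zip"]

find_file = PythonOperator(task_id='find_file', python_callable=_find_file, dag=dag)

download_file = GoogleCloudStorageDownloadOperator(
    task_id='download_file',
    bucket='sftp_sef',
    # Pulls the list pushed by find_file and takes its first element.
    object="{{ task_instance.xcom_pull(task_ids='find_file') | first }}",
    filename="/home/airflow/gcs/data/zips/{{ ds_nodash }}_example.zip",
    dag=dag,
)

find_file >> download_file

Note that if find_file pushes nothing for a given run, xcom_pull returns None and the | first filter fails at template-rendering time, which is worth ruling out.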
Feel free to share the error code with me if you need further help.

Is GPU available in Azure Cloud Services Worker role?

Coming from AWS, I am completely new to Azure in general and to Cloud Services specifically.
I want to write a Python application that leverages GPUs on Azure in a PaaS (Platform as a Service) architecture. The application will hopefully be deployed somewhere central, and then a number of GPU-enabled nodes will spin up and run the application until it is done, before shutting down again.
I want to know: what is the shortest way to accomplish this in Azure?
Is my assumption correct that I will need to use what is called Cloud Services with a worker role, or will I have to build my own infrastructure based on single VMs running as IaaS?
It sounds like you have an application that needs to do general-purpose computing on the GPU via CUDA or OpenCL. If so, you need a GPGPU driver on Azure to support your Python application, so the Azure NC and NV series VMs are suitable for this scenario, much like GPU instances on AWS (see the documentation here).
Hope it helps. If you have any concerns, please feel free to let me know.
