Airflow sigkill tasks on cloud composer - python

I've seen mention of sigkill occurring for others but I think my use case is slightly different.
I'm using the managed airflow service through gcp cloud composer, running airflow 2. I have 3 worker nodes all set to the default instance creation settings.
The environment runs dags fairly smoothly for the most part (api calls, moving files from on prem), however it seems to be having a terribly hard time executing a couple of slightly larger jobs.
One of these jobs uses a samba connector to incrementally backfill missing data and store on gcs. The other is a salesforce api connector.
These jobs run locally with absolutely no issue, so I'm wondering why I'm encountering these issues. There should be plenty of memory to run these tasks as a cluster, although scaling up my cluster for just 2 jobs doesn't seem particularly efficient.
I have tried both dag and task timeouts. I've tried increasing the connection timeout on the samba client.
So could someone please share some insight into how I can get airflow to execute these tasks without killing the session - even if it does take longer?
Happy to add more detail if required but I don't have the available data in front of me currently to share.
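For reference, the dag- and task-level timeouts mentioned above look roughly like this in Airflow 2 (the operator, callable and timings here are placeholders, not the actual DAG):

```python
from datetime import timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def backfill_from_samba():
    """Placeholder for the samba -> GCS incremental backfill logic."""
    ...


with DAG(
    dag_id="samba_backfill",
    start_date=days_ago(1),
    schedule_interval="@daily",
    # Fail the whole run if it exceeds this wall-clock limit.
    dagrun_timeout=timedelta(hours=2),
) as dag:
    backfill = PythonOperator(
        task_id="backfill_missing_files",
        python_callable=backfill_from_samba,
        # Per-task limit; the task fails with a timeout error if exceeded.
        execution_timeout=timedelta(hours=1),
    )
```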

I believe this is about keepalives on the connections. Long-running calls can cause a long period of inactivity (because the other side is busy preparing the data), and on managed instances it is quite often the case that inactive connections get killed by firewalls. This happens for example with long Postgres queries, and the solution there is to configure keepalives.
I think you should do the same for your samba connection (but you need to figure out how to do it in composer): https://www.linuxtopia.org/online_books/network_administration_guides/using_samba_book/ch08_06_04.html
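As an illustration of the Postgres case (just a sketch, not Composer-specific; the host and credentials are placeholders, and the keepalive parameters are standard libpq settings passed through psycopg2):

```python
import psycopg2

# TCP keepalives keep an otherwise idle connection alive while the server
# is still busy with a long query, so intermediate firewalls don't drop it.
conn = psycopg2.connect(
    host="db.example.internal",   # placeholder
    dbname="analytics",           # placeholder
    user="airflow",               # placeholder
    password="change-me",         # placeholder
    keepalives=1,                 # enable TCP keepalives
    keepalives_idle=60,           # seconds of inactivity before the first probe
    keepalives_interval=10,       # seconds between probes
    keepalives_count=5,           # failed probes before the connection is dropped
)
```

The same idea applies to the Samba connection: whatever client library you use, you want some form of socket-level keepalive (the linked Samba docs describe the SO_KEEPALIVE socket option) so the connection isn't silently dropped while the remote side is busy.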

Frustratingly, increasing resources was what allowed the jobs to run. I don't know why the original resources weren't enough, as they really should have been. But optimisation for fully managed solutions isn't overly straightforward beyond adding cost.

Related

MongoDB Python parallel operation for a long time

I am developing an automation tool that is supposed to upgrade IP network devices.
I developed 2 totally separate scripts for the sake of simplicity - I am not an expert developer - one for the core and aggregation nodes, and one for the access nodes.
The tool executes a software upgrade on the routers and verifies the result by executing a set of post-check commands. The device role implies the "size" of the router: bigger routers take much longer to finish the upgrade. Even though the smaller ones come back up much earlier than the bigger ones, the post-checks cannot be started until the bigger ones finish the upgrade, because they are connected to each other.
I want to implement reliable signaling between the 2 scripts. That is, the slower script (core devices) flips a switch when the core devices are up, while the other script keeps checking this value and starts the checks for the access devices.
Both scripts run 200+ concurrent sessions; moreover, each and every access device (session) needs individual signaling, so all the sessions keep checking the same value in the DB.
First I used the keyring library, but noticed that the keys sometimes disappear. Now I am using a txt file to manipulate the signal values. It looks pretty amateur, so I would like to use MongoDB.
Would it cause any performance issues or unexpected exceptions?
The script will be running for 90+ minutes. Is it OK to connect to the DB once at the beginning of the script, set the signal to False, then 20~30 minutes later keep checking it for an additional 20 minutes? Or is it advised to establish a new connection for reading the value in each and every parallel session?
The server runs on the same VM as the script. What exceptions shall I expect?
Thank you!
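For what it's worth, the flag pattern being described could look roughly like this with pymongo (the database, collection and field names below are made up):

```python
import time

from pymongo import MongoClient

# One client per script is enough; pymongo keeps a connection pool internally
# and is safe to share across threads for the lifetime of the process.
client = MongoClient("mongodb://localhost:27017")
signals = client["upgrade_tool"]["signals"]


def set_core_ready(ready: bool) -> None:
    """Called by the core/aggregation script once all core devices are up."""
    signals.update_one(
        {"_id": "core_ready"},
        {"$set": {"value": ready}},
        upsert=True,
    )


def wait_for_core_ready(poll_seconds: int = 30, timeout_seconds: int = 3600) -> bool:
    """Called from the access-node sessions; polls the flag until it flips or times out."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        doc = signals.find_one({"_id": "core_ready"})
        if doc and doc.get("value"):
            return True
        time.sleep(poll_seconds)
    return False
```

A few hundred sessions polling one small document every 20-30 seconds is a very light load for MongoDB, so reusing a single client per script rather than opening a new connection per session should be fine.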

Airflow spinning up multiple subprocesses for a single task and hanging

Airflow version = 1.10.10
Hosted on Kubernetes, Uses Kubernetes executor.
DAG setup
DAG - is generated using dynamic DAG generation
Task - is a PythonOperator that pulls some data, runs an inference, and stores the predictions.
Where does it hang? - when running the inference using TensorFlow
More details
One of our running tasks, as mentioned above, was hanging for 4 hours. No amount of restarting helped it recover from that point. We found that the pod had 30+ subprocesses and 40 GB of memory in use.
We weren't convinced, because when running on a local machine the model doesn't consume more than 400 MB. There is no way it can suddenly jump to 40 GB of memory.
Another suspicion was that maybe it was spinning up so many processes because we are dynamically generating around 19 DAGs. I changed the generator to generate only 1, and the processes didn't vanish; the worker pods still had 35+ subprocesses with the same memory usage.
Here comes the interesting part: I wanted to be really sure it wasn't the dynamic DAG, so I created an independent DAG that prints out 1..100000 while pausing for 5 seconds between prints. The memory usage was still the same, but the number of processes wasn't.
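A test DAG along those lines might look roughly like this (Airflow 1.10-style imports; the names are illustrative, not the actual code):

```python
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10 import path


def count_slowly():
    # Print a counter with a 5 second pause between prints, to keep the task
    # busy for a long time without doing any real work.
    for i in range(1, 100001):
        print(i)
        time.sleep(5)


dag = DAG(
    dag_id="debug_counter",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
)

count_task = PythonOperator(
    task_id="count_slowly",
    python_callable=count_slowly,
    dag=dag,
)
```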
At this point, I am not sure which direction to take to debug the issue further.
Questions
Why is the task hanging?
Why are there so many sub-processes when using dynamic dag?
How can I debug this issue further?
Have you faced this before, and can you help?

Long running cloud task on gae flexible terminates early without error. How to debug? What am I missing?

I am running an application on GAE flexible with Python and Flask. I periodically dispatch Cloud Tasks with a cron job. These basically loop through all users and perform some cluster analysis. The tasks terminate without throwing any kind of error but don't perform all the work (meaning not all users were looped through). It doesn't seem to happen at a consistent time (276.5 s - 323.3 s), nor does it ever stop at the same user. Has anybody experienced anything similar?
My guess is that I am breaching some type of resource limit or timeout somewhere. Things I have thought about or tried:
Cloud tasks should be allowed to run for up to an hour (as per this: https://cloud.google.com/tasks/docs/creating-appengine-handlers)
I increased the timeout of gunicorn workers to 3600 to reflect this (a config sketch follows this list).
I have several workers running.
I tried to see whether there were memory spikes or CPU overload but didn't find anything suspicious.
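For reference, that gunicorn change would typically live in a gunicorn.conf.py along these lines (the worker count here is illustrative):

```python
# gunicorn.conf.py - used when gunicorn serves the Flask app on GAE flexible
workers = 4        # several workers, as mentioned above (count is illustrative)
timeout = 3600     # allow a worker up to an hour on a single request
```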
Sorry if I am too vague or am completely missing the point, I am quite confused with this problem. Thank you for any pointers.
Thank you for all the suggestions. I played around with them and found the root cause, although by accident, while reading the Firestore documentation. I had no indication that this had anything to do with Firestore.
From here: https://googleapis.dev/python/firestore/latest/collection.html
I found out that Query.stream() (or Query.get()) has a timeout on the individual documents like so:
Note: The underlying stream of responses will time out after the max_rpc_timeout_millis value set in the GAPIC client configuration for the RunQuery API. Snapshots not consumed from the iterator before that point will be lost.
So what eventually timed out was the query over all users. I came across this by chance; none of the errors I caught pointed me back towards the query. Hope this helps someone in the future!
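One way around that timeout (not something from the original answer, just a common pattern) is to page through the collection with query cursors so that no single stream stays open for the whole run; the collection name and the process_user helper below are placeholders:

```python
from google.cloud import firestore

client = firestore.Client()
users = client.collection("users")  # collection name is a placeholder

page_size = 500
last_doc = None

while True:
    # Each iteration issues a fresh, short-lived query instead of keeping one
    # long stream open while the slow per-user work happens.
    query = users.order_by("__name__").limit(page_size)
    if last_doc is not None:
        query = query.start_after(last_doc)

    docs = list(query.stream())
    if not docs:
        break

    for doc in docs:
        process_user(doc.to_dict())  # placeholder for the per-user cluster analysis

    last_doc = docs[-1]
```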
Other than using Cloud Scheduler, you can inspect the logs to make sure the tasks ran properly and that there are no deadline issues. Application logs are grouped and sent to Stackdriver after the task itself has executed, so when a task is forcibly terminated, no log may be output. Try catching the deadline exception so that some log is output; you may then see some helpful info to start troubleshooting.

How to architect a heroku app for workers with wildly varying runtimes?

I'm building a web application that has some long-running jobs to do, and some very short. The jobs are invoked by users on the website, and can run anywhere from a few seconds to several hours. The spawned job needs to provide status updates to the user via the website.
I'm new to heroku, and am thinking the proper approach is to spawn a new dyno for each background task, and to communicate status via a database or a memcache record. Maybe?
My question is whether this is a technically feasible and advisable approach.
I ask because the documentation has a different mindset: the worker dyno pulls jobs off a queue, and if you want things to go faster you run more dynos. I'm not sure that will work: could a 10 second job get blocked waiting for a couple of 10 hour jobs to finish? There is a way of determining the size of a job but, again, there is a highly variable amount of work to do before it is known.
I've not found any examples suggesting it is even possible for the web dyno to spin up workers ad hoc. Is it? Is the solution to multi-thread the worker dyno? If so, what about potential memory space issues?
What's my best approach?
thx.
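One common way to keep short jobs from getting stuck behind long ones (not from the original post; just a sketch using RQ, with made-up job names) is to route them to separate named queues, each served by its own worker dynos:

```python
# jobs.py - separate queues so a 10 second job never waits behind a 10 hour one
from redis import Redis
from rq import Queue

redis_conn = Redis()  # on Heroku this would come from the Redis add-on's URL

short_queue = Queue("short", connection=redis_conn)  # seconds-long jobs
long_queue = Queue("long", connection=redis_conn)    # hours-long jobs


def report_status(job_id, message):
    # Placeholder: persist progress so the web dyno can show it to the user.
    pass


def quick_job(job_id):
    report_status(job_id, "done")


def slow_job(job_id):
    report_status(job_id, "started")
    # ... hours of work, periodically calling report_status() ...
    report_status(job_id, "done")


# In the web dyno, route each request to the appropriate queue:
#   short_queue.enqueue(quick_job, some_job_id)
#   long_queue.enqueue(slow_job, some_job_id)
# and run separate worker dynos with `rq worker short` and `rq worker long`.
```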

Monitor python scraper programs on multiple Amazon EC2 servers with a single web interface written in Django

I have a web scraper (command-line scripts) written in Python that runs on 4-5 Amazon EC2 instances.
What I do is place a copy of these Python scripts on each EC2 server and run them there.
So the next time I change the program, I have to do it for all the copies.
So you can see the problem of redundancy, management and monitoring.
So, to reduce the redundancy and for easy management, I want to place the code on a separate server from which it can be executed on the other EC2 servers, and also monitor these Python programs and the logs they create, through a Django web interface hosted on that server.
There are at least two issues you're dealing with:
monitoring of execution of the scraping tasks
deployment of code to multiple servers
and each of them requires a different solution.
In general I would recommend using a task queue for this kind of assignment (I have tried Celery running on Amazon EC2 and was very pleased with it).
One advantage of a task queue is that it abstracts the definition of the task from the worker which actually performs it. So you send the tasks to the queue, and then a variable number of workers (servers with multiple workers) process those tasks by asking for them one at a time. Each worker, if it's idle, will connect to the queue and ask for some work. If it receives a task, it will start processing it. Then it might send the results back, ask for another task, and so on.
This means that the number of workers can change over time, and they will process the tasks from the queue automatically until there are no more tasks to process. A nice use case for this is Amazon's Spot Instances, which will greatly reduce the cost. Just send your tasks to the queue, create X spot requests and watch the servers process your tasks. You don't really need to care about servers going up and down at any moment because the price went above your bid. That's nice, isn't it?
Now, this implicitly takes care of monitoring: Celery has tools for monitoring the queue and the processing, and it can even be integrated with Django using django-celery.
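To make the pattern concrete, a minimal Celery setup might look like this (the module name, broker URL and task body are placeholders):

```python
# tasks.py - a minimal Celery app; broker URL and task body are placeholders
from celery import Celery

app = Celery("scraper", broker="redis://localhost:6379/0")


@app.task
def scrape(url):
    # Placeholder for the actual scraping logic.
    print("scraping", url)
    return url


# From the Django server (or anywhere else), push work onto the queue with:
#   scrape.delay("http://example.com/page")
# and on each EC2 worker, start a worker process with:
#   celery -A tasks worker --loglevel=info
```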
When it comes to deployment of code to multiple servers, Celery doesn't support that. The reasons behind this are of a different nature (see e.g. this discussion); one of them might be that it's just difficult to implement.
I think it's possible to live without it, but if you really care, there's a relatively simple DIY solution. Put your code under VCS (I recommend Git) and check for updates on a regular basis. If there's an update, run a script which kills your workers, applies the updates and starts the workers again so that they can process more tasks. Given Celery's ability to handle failure, this should work just fine.
