Airflow + Kubernetes Executor too old resource version - python

I am seeing strange behaviour of Airflow with the Kubernetes executor. In my setup, tasks run in dynamically created Kubernetes pods, and I have a number of tasks that run once or twice a day. The tasks themselves are Python operators that run some ETL routine; the DAG files are synced via a separate pod with a git repo inside. For some time everything worked fine, but not long ago I began to see this error in the scheduler pod:
kubernetes.client.exceptions.ApiException: (410)
Reason: Gone: too old resource version: 51445975 (51489631)
After that error appears, old task pods are not deleted, and after some time new pods cannot be created, so tasks won't run (or, to be more precise, they freeze in the "scheduled" state). In this situation the only thing that helps is deleting the scheduler pod with
kubectl delete -n SERVICE_NAME pod scheduler
and waiting for Kubernetes to recreate it, but after some time the error appears again and the situation repeats. Another strange thing is that this error only seems to appear after scheduled task runs. If I trigger any task any number of times via the UI, no error appears and pods are created and deleted normally.
Airflow version is 1.10.12. Any help will be appreciated, thanks!

This is caused by version 12.0 of the Kubernetes Python client. Restrict the version to <12:
pip install -U 'kubernetes<12'
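If you are not sure which client version your scheduler image actually ships, a quick check from inside the pod (a minimal sketch) is:

import kubernetes

# Anything >= 12.0 triggers the watch error above; pin to <12 as described.
print(kubernetes.__version__)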

Related

Airflow not creating log files or showing logs in task instance on UI

I'm getting the following Airflow issue:
When I run DAGs that have multiple tasks in them, Airflow randomly sets some of the tasks to the failed state and also doesn't show any logs in the UI. I went into my running worker container and saw that the log files for those failed tasks were never created.
Going to Celery Flower, I found these logs on failed tasks:
airflow.exceptions.AirflowException: Celery command failed on host
How to solve this?
My environment is:
airflow:2.3.1
Docker compose
Celery Executor
Worker, webserver, scheduler and triggerer in different containers
Docker compose hosted on Ubuntu
I also saw this https://stackoverflow.com/a/69201032/11949273 answer that might be related.
Anyone with these same issues?
Edit:
On my EC2 instance I added more vCPUs and fine-tuned the Airflow/Celery worker parameters, which solved this. It was probably an issue with a lack of CPU, or something along those lines.
I ran into the same issue. In my case, Inspect -> Console showed an error about replaceAll in an old browser (Chrome 83.x); Chrome 98.x does not have this issue.

Run Python Code on SSH Target using Airflow

There are 2 systems: A and B. The Airflow scheduler, webserver, redis and flower run on A, while an Airflow worker runs on B. Both systems run Ubuntu 18.04 and use Airflow 1.10.10 in Docker containers.
Is it possible to create a DAG that remotely runs Python code (defined in that DAG) on B?
SSHOperator allows remote execution of a bash command on B over SSH, but we require remote execution of Python code over SSH instead.
Thank you!
I don't know if you've gotten your answer already, but I had a very similar (if not the very same) problem until just a few moments ago and thought I could provide the answer here.
The easiest way is to mount a shared folder onto both nodes so they can both access the actual physical DAG files.
More details about my case can be found here.
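In case it helps others, one pattern that works once both nodes see the same DAG files is to pin the Python task to a Celery queue that only the worker on B consumes. A rough sketch (the dag id, queue name and callable are placeholders, not taken from the original setup):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_on_b():
    # the Python code that should execute on system B
    print("running on B")

dag = DAG(
    "remote_python_example",        # hypothetical dag_id
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
)

remote_task = PythonOperator(
    task_id="run_on_b",
    python_callable=run_on_b,
    queue="host_b",                 # start B's worker with: airflow worker -q host_b
    dag=dag,
)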

Triggering an Airflow DAG from terminal not working

I'm trying to use airflow to define a specific workflow that I want to manually trigger from the command line.
I create the DAG and add a bunch of tasks.
from datetime import datetime
import airflow

args = {"owner": "airflow"}  # stands in for the original default_args

dag = airflow.DAG(
    "DAG_NAME",
    start_date=datetime(2015, 1, 1),
    schedule_interval=None,
    default_args=args)
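(For completeness, a minimal placeholder task attached to that DAG could look like the following; this is an illustrative sketch, not code from the original post.)

from airflow.operators.bash_operator import BashOperator

# hypothetical task, just so the DAG has something to run
hello = BashOperator(
    task_id="hello",
    bash_command="echo hello",
    dag=dag)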
I then run in the terminal
airflow trigger_dag DAG_NAME
and nothing happens. The scheduler is running in another thread. Any direction is much appreciated. Thank You
I just encountered the same issue.
Assuming you are able to see your dag in airflow list_dags or via the web server, then:
Not only did I have to turn on the dag in the web UI, but I also had to ensure that airflow scheduler was running as a separate process.
Once I had the scheduler running, I was able to successfully execute my dag using airflow trigger_dag <dag_id>.
My dag configuration is not significantly different from yours; I also have schedule_interval=None.
You may have disabled the workflow.
To enable the workflow manually, open up the Airflow web server with
$ airflow webserver -p 8080
Go to http://localhost:8080. You should see the list of all available DAGs, each with an on/off toggle. By default everything is set to off. Search for your DAG and toggle it on. Now try triggering the workflow from the terminal; it should work.
First, make sure your database connection string for Airflow is working, whether it points at Postgres, SQLite (the default) or any other database. Then run the command
airflow initdb
This command should not show any connection errors.
Secondly, make sure your webserver is running in a separate process:
airflow webserver
Then run your scheduler in another process:
airflow scheduler
Finally, trigger your DAG once the scheduler is running:
airflow trigger_dag dag_id
Also make sure the DAG name and task are present in the DAG and task lists:
airflow list_dags
airflow list_tasks dag_id
And if the DAG is switched off in the UI, toggle it on.
You should 'unpause' the DAG you want to trigger: use airflow unpause xxx_dag and then airflow trigger_dag xxx_dag, and it should work.
You can also pass an execution date explicitly:
airflow trigger_dag -e <execution_date> <dag_id>

How can I communicate with Celery on Cloud Foundry?

I have a wsgi app with a celery component. Basically, when certain requests come in they can hand off relatively time-consuming tasks to celery. I have a working version of this product on a server I set up myself, but our client recently asked me to deploy it to Cloud Foundry. Since Celery is not available as a service on Cloud Foundry, we (me and the client's deployment team) decided to deploy the app twice – once as a wsgi app and once as a standalone celery app, sharing a rabbitmq service.
The code between the apps is identical. The wsgi app responds correctly, returning the expected web pages. vmc logs celeryapp shows that celery appears to be up and running, but when I send requests to wsgi that should become celery tasks, they disappear as soon as they reach a .delay() statement. They neither appear in the celery logs nor show up as errors.
Attempts to debug:
I can't use celery.contrib.rdb in Cloud Foundry (to supply a telnet interface to pdb), as each app is sandboxed and port-restricted.
I don't know how to find the specific rabbitmq instance these apps are supposed to share, so I can see what messages it's passing.
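For background on that last point: on Cloud Foundry both apps are expected to read the bound broker's credentials from the VCAP_SERVICES environment variable, so a shared rabbitmq service should yield the same AMQP URL in both deployments. A minimal sketch (the service label "rabbitmq" and the "url" credential key are assumptions and vary by service offering):

import json
import os

from celery import Celery

# VCAP_SERVICES is the JSON blob Cloud Foundry injects with bound-service
# credentials; the label and key names below are assumptions.
services = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
broker_url = services["rabbitmq"][0]["credentials"]["url"]

celery_app = Celery("tasks", broker=broker_url)

@celery_app.task
def slow_job(payload):
    # placeholder for the time-consuming work handed off via .delay()
    return payload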
Update: to corroborate the above statement about finding rabbitmq, here's what happens when I try to access the node that should be sharing celery tasks:
root@cf:~# export RABBITMQ_NODENAME=eecef185-e1ae-4e08-91af-47f590304ecc
root@cf:~# export RABBITMQ_NODE_PORT=57390
root@cf:~# ~/cloudfoundry/.deployments/devbox/deploy/rabbitmq/sbin/rabbitmqctl list_queues
Listing queues ...
=ERROR REPORT==== 18-Jun-2012::11:31:35 ===
Error in process <0.36.0> on node 'rabbitmqctl17951@cf' with exit value: {badarg,[{erlang,list_to_existing_atom,["eecef185-e1ae-4e08-91af-47f590304ecc@localhost"]},{dist_util,recv_challenge,1},{dist_util,handshake_we_started,1}]}
Error: unable to connect to node 'eecef185-e1ae-4e08-91af-47f590304ecc@cf': nodedown
diagnostics:
- nodes and their ports on cf: [{'eecef185-e1ae-4e08-91af-47f590304ecc',57390},
{rabbitmqctl17951,36032}]
- current node: rabbitmqctl17951@cf
- current node home dir: /home/cf
- current node cookie hash: 1igde7WRgkhAea8fCwKncQ==
How can I debug this and/or why are my tasks vanishing?
Apparently the problem was caused by a deadlock between the broker and the celery worker, such that the worker would never acknowledge the task as complete, never accept a new task, and yet never crash or fail either. The tasks weren't vanishing; they were simply sitting in the queue forever.
Update: the deadlock was caused by the fact that we were running celeryd inside a wrapper script that installed dependencies (literally pip install -r requirements.txt && ./celeryd -lINFO). Because of how Cloud Foundry manages process trees, it would try to kill the parent process (bash), which would HUP celeryd, but many of the child processes would never die.

Automatic task execution on google app engine development server (python)

The docs for the python dev server say this about running tasks:
When your app is running in the development server, task queues are not processed automatically. Instead, task queues accrue tasks which you can examine and execute from the developer console...
But the release notes for version 1.3.4 of the python sdk (which I am using) say:
Auto task execution is now enabled in the dev_appserver. To turn this off use the flag --disable_task_running.
So maybe the docs are a little behind, right? Except when I go to "http://localhost:8080/_ah/admin/tasks?queue=default", I see this:
Tasks will not run automatically. Push the 'Run' button to execute each task.
Can tasks be run automatically or not? If so, what is the trick?
It seems the problem was that I was running the dev server with python 2.6 instead of 2.5. When using 2.5, everything worked.
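For reference, the tasks in question are ordinary push-queue tasks; enqueueing one looks roughly like this (a minimal sketch with a hypothetical /worker handler):

from google.appengine.api import taskqueue

# Adds a task to the default queue; /worker is a hypothetical handler that
# the dev server should hit automatically once auto task execution works.
taskqueue.add(url='/worker', params={'key': 'some-value'})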
