Airflow Dag Statuses Inconsistent in Webserver - python

I have an Airflow cluster up, configured to use the CeleryExecutor and a Postgres backend.
For some reason, the statuses of the DAGs on the Webserver UI are inconsistent every time I refresh. Upon each refresh, it shows many different things such as the DAG not available in the webserver dagbag object, or black statuses, or hiding the links on the right.
It changes on each refresh.
Here are a few screenshots: [Webserver UI 1] [Webserver UI 2]

Run the Airflow webserver in debug mode, and this should be resolved:
airflow webserver -p <port> -d
The problem seems to be that dynamic code changes made when a new DAG is created are not picked up by the Flask server when it runs in production mode.

Related

Airflow not creating log files or showing logs in task instance on UI

I'm getting the following Airflow issue:
When I run DAGs that have multiple tasks in them, Airflow randomly sets some of the tasks to the failed state and does not show any logs in the UI. I went into my running worker container and saw that the log files for those failed tasks had not been created either.
Going to Celery Flower, I found these logs on failed tasks:
airflow.exceptions.AirflowException: Celery command failed on host
How to solve this?
My environment is:
airflow:2.3.1
Docker compose
Celery Executor
Worker, webserver, scheduler and triggerer in different containers
Docker compose hosted on Ubuntu
I also saw this https://stackoverflow.com/a/69201032/11949273 answer that might be related.
Anyone with these same issues?
Edit:
I moved to an EC2 instance with more vCPUs and fine-tuned the Airflow/Celery worker parameters, and that solved it. The cause was probably a lack of CPU, possibly combined with something else.
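The edit does not say which parameters were changed. As one illustration, assuming the `[celery] worker_concurrency` setting is among them, a value can be derived from the CPU count and passed to the worker through the `AIRFLOW__CELERY__WORKER_CONCURRENCY` environment variable. The helper name and the two-per-core multiplier below are arbitrary choices for this sketch, not something recommended in the post:

```python
import os


def suggested_concurrency(cpu_count, per_core=2):
    """Return a worker concurrency scaled to the number of cores (at least 1)."""
    return max(1, cpu_count * per_core)


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    # Line that could be added to the worker service's environment
    print(f"AIRFLOW__CELERY__WORKER_CONCURRENCY={suggested_concurrency(cores)}")
```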
I faced the same issue. In my case, Inspect -> Console showed an error with replaceAll in an old browser (Chrome 83.X). Chrome 98.X does not have this issue.

Airflow + Kubernetes Executor too old resource version

I am seeing strange behaviour from Airflow with the Kubernetes executor. In my configuration, tasks run in dynamically created Kubernetes pods, and I have a number of tasks that run once or twice a day. The tasks themselves are Python operators that run an ETL routine; DAG files are synced by a separate pod with a git repo inside. For some time everything worked fine, but not long ago I began to see this error in the scheduler pod:
kubernetes.client.exceptions.ApiException: (410)
Reason: Gone: too old resource version: 51445975 (51489631)
After that error appears, old task pods are not deleted, and after some time new pods cannot be created and tasks do not run (or, more precisely, they freeze in the "scheduled" state). In this situation only deleting the scheduler pod with
kubectl delete -n SERVICE_NAME pod scheduler
and waiting for Kubernetes to recreate it helps, but after some time the error appears again and the situation repeats. Another strange thing is that this error seems to appear only after scheduled task runs. If I trigger any task any number of times via the UI, no error appears and pods are created and deleted normally.
The Airflow version is 1.10.12. Any help will be appreciated, thanks!
This is caused by version 12.0 of the Kubernetes Python client.
Restrict the version to <12:
pip install -U 'kubernetes<12'
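To check whether an environment is affected, the installed client version can be compared against the bound suggested above. This is a minimal sketch, assuming the client is installed under the distribution name `kubernetes`; the helper name `client_is_below_12` is made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def client_is_below_12(version_string):
    """Return True if the major version number is lower than 12."""
    major = int(version_string.split(".")[0])
    return major < 12


if __name__ == "__main__":
    try:
        installed = version("kubernetes")
    except PackageNotFoundError:
        print("kubernetes client is not installed")
    else:
        status = "ok" if client_is_below_12(installed) else "affected, pin to <12"
        print(f"kubernetes {installed}: {status}")
```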

unable to use threadpoolexecutor in a flask app run with gunicorn preload flags

I have a Flask app which I'm trying to front with gunicorn. I want to use the preload flag, since my application has some scheduled jobs using apscheduler which I want to run only in the master and not in the workers.
I also want to use Python's ThreadPoolExecutor to delegate jobs to the background, triggered by a route in my app.
When I use the --preload flag with gunicorn, any calls to my ThreadPoolExecutor (using executor.submit) seem to fail. The same seems to happen when I programmatically trigger a job through apscheduler.
When I don't use the --preload flag, everything runs smoothly.
Is there some config I can change to get this working, or would this not work with the --preload flag?
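The question does not include a diagnosis. One thing that may be worth trying, assuming the pool (or its threads) is created in the master before gunicorn forks the workers, is to create the executor lazily in each process, since threads started before a fork do not exist in the child. The sketch below shows that pattern on its own; `get_executor` and `slow_job` are hypothetical names, and `max_workers=4` is an arbitrary value:

```python
import os
from concurrent.futures import ThreadPoolExecutor

_executor = None
_executor_pid = None


def get_executor():
    """Return a pool that belongs to the current process, creating it on first use."""
    global _executor, _executor_pid
    pid = os.getpid()
    if _executor is None or _executor_pid != pid:
        # Either nothing has been created yet, or the pool was inherited from
        # the parent across fork(); build a fresh one for this process.
        _executor = ThreadPoolExecutor(max_workers=4)
        _executor_pid = pid
    return _executor


def slow_job(x):
    # Stand-in for the real background work.
    return x * 2


if __name__ == "__main__":
    future = get_executor().submit(slow_job, 21)
    print(future.result())
```

Inside a route, the handler would call get_executor().submit(...) instead of using an executor created at import time.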

Triggering an Airflow DAG from terminal not working

I'm trying to use airflow to define a specific workflow that I want to manually trigger from the command line.
I create the DAG and add a bunch of tasks.
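from datetime import datetime

import airflow

# `args` is the default_args dict defined elsewhere in the DAG file
dag = airflow.DAG(
    "DAG_NAME",
    start_date=datetime(2015, 1, 1),
    schedule_interval=None,
    default_args=args)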
I then run in the terminal
airflow trigger_dag DAG_NAME
and nothing happens. The scheduler is running in another thread. Any direction is much appreciated. Thank you.
I just encountered the same issue.
Assuming you are able to see your dag in airflow list_dags or via the web server then:
Not only did I have to turn on the dag in the web UI, but I also had to ensure that airflow scheduler was running as a separate process.
Once I had the scheduler running I was able to successfully execute my dag using airflow trigger_dag <dag_id>
My dag configuration is not significantly different from yours. I also have schedule_interval=None
You may have disabled the workflow.
To enable the workflow manually, open the Airflow web server with
$ airflow webserver -p 8080
Go to http://localhost:8080 . You should see the list of all available dags with a toggle button on/off. By default everything is set to off. Search for your dag and toggle your workflow. Now try triggering the workflow from terminal. It should work now.
First, make sure the database connection string for Airflow is working, whether it points to Postgres, SQLite (the default), or any other database. Then run the command
airflow initdb
This command should not show any connection errors.
Secondly, make sure your webserver is running in a separate thread
airflow webserver
Then run your scheduler in a different thread
airflow scheduler
Finally, trigger your DAG in a different thread once the scheduler is running
airflow trigger_dag dag_id
Also make sure the DAG and its tasks appear in the DAG list and the task list
airflow list_dags
airflow list_tasks dag_id
And if the DAG is switched off in your UI, toggle it on.
You should 'unpause' the DAG you want to trigger. Use airflow unpause xxx_dag and then airflow trigger_dag xxx_dag, and it should work.
airflow trigger_dag -e <execution_date> <dag_id>
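The unpause-then-trigger sequence from the answers above can be scripted. This is a rough sketch that assumes the Airflow 1.x CLI subcommands quoted in this thread (`unpause`, `trigger_dag`, and its `-e` option) and that the `airflow` executable is on the PATH; the function names are hypothetical. The scheduler still has to be running for the triggered run to execute.

```python
import subprocess


def unpause_cmd(dag_id):
    """Build the argv for `airflow unpause <dag_id>`."""
    return ["airflow", "unpause", dag_id]


def trigger_cmd(dag_id, execution_date=None):
    """Build the argv for `airflow trigger_dag [-e <execution_date>] <dag_id>`."""
    cmd = ["airflow", "trigger_dag"]
    if execution_date is not None:
        cmd += ["-e", execution_date]
    cmd.append(dag_id)
    return cmd


def unpause_and_trigger(dag_id, execution_date=None):
    # check=True raises CalledProcessError if either command exits non-zero
    subprocess.run(unpause_cmd(dag_id), check=True)
    subprocess.run(trigger_cmd(dag_id, execution_date), check=True)


if __name__ == "__main__":
    # Only prints the command line; nothing is executed here.
    print(" ".join(trigger_cmd("DAG_NAME")))
```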

How can I communicate with Celery on Cloud Foundry?

I have a wsgi app with a celery component. Basically, when certain requests come in they can hand off relatively time-consuming tasks to celery. I have a working version of this product on a server I set up myself, but our client recently asked me to deploy it to Cloud Foundry. Since Celery is not available as a service on Cloud Foundry, we (me and the client's deployment team) decided to deploy the app twice – once as a wsgi app and once as a standalone celery app, sharing a rabbitmq service.
The code between the apps is identical. The wsgi app responds correctly, returning the expected web pages. vmc logs celeryapp shows that celery is up and running, but when I send requests to wsgi that should become celery tasks, they disappear as soon as they get to a .delay() statement. They appear neither in the celery logs nor as an error.
Attempts to debug:
I can't use celery.contrib.rdb in Cloud Foundry (to supply a telnet interface to pdb), as each app is sandboxed and port-restricted.
I don't know how to find the specific rabbitmq instance these apps are supposed to share, so I can see what messages it's passing.
Update: to corroborate the above statement about finding rabbitmq, here's what happens when I try to access the node that should be sharing celery tasks:
root@cf:~# export RABBITMQ_NODENAME=eecef185-e1ae-4e08-91af-47f590304ecc
root@cf:~# export RABBITMQ_NODE_PORT=57390
root@cf:~# ~/cloudfoundry/.deployments/devbox/deploy/rabbitmq/sbin/rabbitmqctl list_queues
Listing queues ...
=ERROR REPORT==== 18-Jun-2012::11:31:35 ===
Error in process <0.36.0> on node 'rabbitmqctl17951@cf' with exit value: {badarg,[{erlang,list_to_existing_atom,["eecef185-e1ae-4e08-91af-47f590304ecc@localhost"]},{dist_util,recv_challenge,1},{dist_util,handshake_we_started,1}]}
Error: unable to connect to node 'eecef185-e1ae-4e08-91af-47f590304ecc@cf': nodedown
diagnostics:
- nodes and their ports on cf: [{'eecef185-e1ae-4e08-91af-47f590304ecc',57390},
{rabbitmqctl17951,36032}]
- current node: rabbitmqctl17951@cf
- current node home dir: /home/cf
- current node cookie hash: 1igde7WRgkhAea8fCwKncQ==
How can I debug this and/or why are my tasks vanishing?
Apparently the problem was caused by a deadlock between the broker and the celery worker, such that the worker would never acknowledge the task as complete, and never accept a new task, but never crashed or failed either. The tasks weren't vanishing; they were simply staying in queue forever.
Update: The deadlock was caused by the fact that we were running celeryd inside a wrapper script that installed dependencies. (Literally pip install -r requirements.txt && ./celeryd -lINFO). Because of how Cloud Foundry manages process trees, Cloud Foundry would try to kill the parent process (bash), which would HUP celeryd, but ultimately lots of child processes would never die.
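The answer does not say how the wrapper was changed. One possible way to avoid leaving an intermediate parent process between the platform and the worker, assuming the dependency install has to stay in the start command, is to replace the wrapper process with celeryd via exec, so that signals sent to the wrapper's PID reach the worker directly. The sketch below uses the two commands quoted above; `start_worker` is a hypothetical name:

```python
import os
import subprocess

INSTALL_CMD = ["pip", "install", "-r", "requirements.txt"]
WORKER_CMD = ["./celeryd", "-lINFO"]


def start_worker(install_cmd=INSTALL_CMD, worker_cmd=WORKER_CMD):
    # Install dependencies first; check=True stops here if the install fails,
    # mirroring the `&&` in the original one-liner.
    subprocess.run(install_cmd, check=True)
    # Replace this process with the worker instead of spawning a child,
    # so the worker keeps this PID and receives signals directly.
    os.execvp(worker_cmd[0], worker_cmd)
```

Calling start_worker() as the last statement of the start script means no code after it runs, because a successful exec never returns.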
