When I put a new DAG Python script in the dags folder, I can see a new entry for the DAG in the DAG UI, but it is not enabled automatically. On top of that, it does not seem to be loaded properly either. I can only get the DAG scheduled by clicking the Refresh button on the right side of the list a few times and toggling the on/off button on the left side of the list. These are manual steps: I have to trigger something even though the DAG script was already placed inside the dags folder.
Can anyone help me with this? Did I miss something? Or is this the correct behavior in Airflow?
By the way, as mentioned in the post title, there is an indicator with the message "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database" tagged next to the DAG title before I go through all these manual steps.
It is not you, nor is it correct or expected behavior.
It is a current 'bug' with Airflow.
The web server is caching the DagBag in a way that you cannot really use it as expected.
"Attempt removing DagBag caching for the web server" remains on the official TODO as part of the roadmap, indicating that this bug may not yet be fully resolved, but here are some suggestions on how to proceed:
Only use builders in Airflow v1.9+
Prior to Airflow v1.9 this occurs when a DAG is instantiated by a function which is imported into the file where the instantiation happens, that is, when a builder or factory pattern is used. Some reports of this issue on GitHub and JIRA led to a fix released in Airflow v1.9.
If you are using an older version of Airflow, don't use builder functions.
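For illustration, this is the shape of the pattern the fix addresses (a hedged sketch; the module and function names are made up). On pre-1.9 versions, prefer instantiating DAG(...) directly in the dags-folder file instead of going through an imported factory:

# dag_factory.py -- hypothetical helper module outside the dags folder
from datetime import datetime
from airflow import DAG

def build_dag(dag_id):
    # Builder/factory pattern: the DAG object is created here rather than in the dags-folder file
    return DAG(dag_id=dag_id, schedule_interval="@daily", start_date=datetime(2017, 1, 1))

# my_dag.py -- file placed in the dags folder; on pre-1.9 versions this indirection
# is what produced the "isn't available in the webserver DagBag" symptom
from dag_factory import build_dag
dag = build_dag("my_dag")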
airflow backfill to reload the cache
As Dmitri suggests, running airflow backfill '<dag_id>' -s '<date>' -e '<date>' with the same start and end date can sometimes help. Thereafter you may end up with the (non-)issue that Priyank points out, but that is expected behavior (state: paused or not) depending on the configuration of your installation.
Restarting the Airflow webserver solved my issue.
This error can be misleading. If hitting the Refresh button or restarting the Airflow webserver doesn't fix the issue, check the DAG (the Python script) for errors.
Running airflow list_dags can display the DAG errors (in addition to listing the DAGs); you can also try running/testing your DAG as a normal Python script.
After fixing the error, this indicator should go away.
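If you prefer checking from Python rather than the CLI, a minimal sketch (the dags folder path below is an assumption) that surfaces the same import errors:

from airflow.models import DagBag

# Parse the dags folder the same way the scheduler/webserver would and print any
# import errors; an empty result means the files parsed cleanly.
dag_bag = DagBag(dag_folder="/path/to/your/dags", include_examples=False)
for filename, stacktrace in dag_bag.import_errors.items():
    print(filename, stacktrace)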
The issue is that, by default, the DAG is put into the DagBag in a paused state so that the scheduler is not overwhelmed with lots of backfill activity on start/restart.
To work around this, change the setting below in your airflow.cfg file:
# Are DAGs paused by default at creation
dags_are_paused_at_creation = False
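If you would rather not change the global setting, recent Airflow versions (1.10.2+) also let you unpause a single DAG at creation time via a DAG argument; a hedged sketch with a made-up dag_id:

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="my_new_dag",            # hypothetical DAG id
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    is_paused_upon_creation=False,  # overrides dags_are_paused_at_creation for this DAG only
)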
Hope this helps. Cheers!
I have a theory about a possible cause of this issue in Google Cloud Composer. There is a section about DAG failures on the webserver in the Composer troubleshooting documentation, which says:
Avoid running heavyweight computation at DAG parse time. Unlike the
worker and scheduler nodes, whose machine types can be customized to
have greater CPU and memory capacity, the webserver uses a fixed
machine type, which can lead to DAG parsing failures if the parse-time
computation is too heavyweight.
I was trying to load configuration from an external source (which actually took a negligible amount of time compared to the other operations needed to create the DAG, but it still broke something, because the Airflow webserver in Composer runs on App Engine, which has some strange behaviours).
I found the workaround in the discussion of this Google issue: create a separate DAG with a task that loads all the needed data and stores it in an Airflow Variable:
Variable.set("pipeline_config", config, serialize_json=True)
Then I could do
Variable.get("pipeline_config", deserialize_json=True)
And successfully generate the pipeline from that. An additional benefit is that I get logs from that task, which I would not get from the webserver because of this issue.
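Roughly, the workaround takes this shape (a hedged sketch; the loader function, Variable name, schedule, and dag_id are assumptions):

from datetime import datetime
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

def _load_config():
    # Heavyweight work happens inside a task (on a worker), not at parse time.
    config = fetch_config_from_external_source()  # placeholder for your own loader
    Variable.set("pipeline_config", config, serialize_json=True)

with DAG(dag_id="load_pipeline_config", schedule_interval="@daily",
         start_date=datetime(2021, 1, 1), catchup=False) as loader_dag:
    PythonOperator(task_id="load_config", python_callable=_load_config)

# The pipeline DAG file then only does a cheap Variable read at parse time:
config = Variable.get("pipeline_config", deserialize_json=True, default_var={})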
I wrote a Python program to create a DAG file. After creating this DAG file, I want to trigger a run of this DAG. I tried to use the following code:
from airflow.api.client.local_client import Client
c = Client(None, None)
c.trigger_dag(dag_id='local_job_md', run_id='local_job_md', conf={})
But this code fails with an error because it cannot find the DAG table in SQLite. After a little research, I realized this might be due to some gaps in the installation. I am new to the API, but I understand there is a way to use the stable REST API to trigger the DAG from my program. I need your help with this; I badly want to trigger the DAG from my code.
Please help me out of this situation. Any help is appreciated!
Thanks,
Jay
Even though you only want to use the API, you still need to initialize the Airflow DB. "Unable to find the dag table in sqlite" means you don't have your airflow.db initialized.
To do this, go to your $AIRFLOW_HOME directory and run:
airflow initdb
If this command doesn't work for you, you may not have set up Airflow correctly, so I'd suggest starting with the install steps from the beginning.
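For the stable REST API route mentioned in the question: once the database is initialized and you are on Airflow 2.x with API authentication configured (e.g. basic auth), a hedged sketch of triggering a run over HTTP could look like this; the host and credentials are placeholders.

import requests

# POST /api/v1/dags/{dag_id}/dagRuns creates (triggers) a new DAG run.
response = requests.post(
    "http://localhost:8080/api/v1/dags/local_job_md/dagRuns",
    auth=("api_user", "api_password"),  # placeholder basic-auth credentials
    json={"conf": {}},                  # optional run configuration
)
response.raise_for_status()
print(response.json())                  # details of the newly created DAG run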
I have three DAGs (say, DAG1, DAG2 and DAG3). I have a monthly scheduler for DAG1. DAG2 and DAG3 must not be run directly (no scheduler for these) and must be run only when DAG1 is completed successfully. That is, once DAG1 is complete, DAG2 and DAG3 will need to start in parallel.
What is the best mechanism to do this? I came across the TriggerDagRunOperator and ExternalTaskSensor options. I want to understand the pros and cons of each and which one is best. I have seen a few questions around these; however, I am trying to find the answer for the latest stable Airflow version.
ExternalTaskSensor is not relevant for your use case as none of the DAGs you mention needs to wait for another DAG.
You need to add a TriggerDagRunOperator to the code of DAG1 that will trigger the DAG runs for DAG2 and DAG3.
A skeleton of the solution would be:
# Imports below assume Airflow 2.x module paths.
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

dag2 = DAG(dag_id="DAG2", schedule_interval=None, start_date=datetime(2021, 1, 1))
dag3 = DAG(dag_id="DAG3", schedule_interval=None, start_date=datetime(2021, 1, 1))

with DAG(dag_id="DAG1", schedule_interval="@monthly", start_date=datetime(2021, 1, 1)) as dag1:
    op_first = DummyOperator(task_id="first")  # Replace with the operators of your DAG
    op_trig2 = TriggerDagRunOperator(task_id="trigger_dag2", trigger_dag_id="DAG2")
    op_trig3 = TriggerDagRunOperator(task_id="trigger_dag3", trigger_dag_id="DAG3")
    op_first >> [op_trig2, op_trig3]
Edit:
After discussing in the comments, and since you mentioned you cannot edit DAG1 because it is someone else's code, your best option is ExternalTaskSensor. You will have to set DAG2 and DAG3 to the same schedule as DAG1, and they will need to constantly poke DAG1 until it finishes. It will work, it's just not very optimal.
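For completeness, a minimal sketch of that sensor approach (assuming Airflow 2.x import paths and that DAG2 runs on the same @monthly schedule as DAG1):

from datetime import datetime
from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(dag_id="DAG2", schedule_interval="@monthly",
         start_date=datetime(2021, 1, 1)) as dag2:
    wait_for_dag1 = ExternalTaskSensor(
        task_id="wait_for_dag1",
        external_dag_id="DAG1",
        external_task_id=None,  # None waits for the whole DAG1 run to succeed
        mode="reschedule",      # free the worker slot between pokes
    )
    # ... the rest of DAG2's tasks go downstream of the sensor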
I could not find in the Airflow docs how to set up the retention policy I need.
At the moment, we keep all Airflow logs forever on our servers, which is not the best way to go.
I want to create a global log configuration for all the different logs I have.
How and where do I configure:
Number of days to keep
Max file size
I ran into the same situation yesterday; the solution for me was to use a DAG that handles all the log cleanup and to schedule it like any other DAG.
Check this repo; you will find a step-by-step guide on how to set it up. Basically, what you will achieve is deleting files located in airflow-home/log/ and airflow-home/log/scheduler based on a retention period defined in a Variable. The DAG dynamically creates one task for each directory targeted for deletion, based on your previous definition.
In my case, the only modification I made to the original DAG was to restrict deletion to the scheduler folder by replacing the initial value of DIRECTORIES_TO_DELETE. All credit to the creators! It works very well out of the box and is easy to customize.
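If you only need the basic idea rather than the full repo, a hedged sketch of the same approach (assuming Airflow 2.x import paths; the retention period, schedule, and log path are assumptions, and it only touches the scheduler folder as in my modification):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="airflow_log_cleanup", schedule_interval="@daily",
         start_date=datetime(2021, 1, 1), catchup=False) as dag:
    BashOperator(
        task_id="delete_old_scheduler_logs",
        # Keep 30 days of scheduler logs, then remove emptied date directories;
        # adjust the path to your base_log_folder.
        bash_command=(
            "find $AIRFLOW_HOME/logs/scheduler -type f -mtime +30 -delete && "
            "find $AIRFLOW_HOME/logs/scheduler -type d -empty -delete"
        ),
    )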
I have an application written in Web2Py that contains some modules. I need to call some functions out of a module on a periodic basis, say once daily. I have been trying to get a scheduler working for that purpose but am not sure how to get it working properly. I have referred to this and this to get started.
I have a scheduler.py file in the models directory, which contains code like this:
from gluon.scheduler import Scheduler
from Module1 import Module1
def daily_task():
    module1 = Module1()
    module1.action1(arg1, arg2, arg3)
daily_task_scheduler = Scheduler(db, tasks=dict(my_daily_task=daily_task))
In default.py I have following code for the scheduler:
def daily_periodic_task():
    daily_task_scheduler.queue_task('daily_running_task', repeats=0, period=60)
(For testing I am running it every 60 seconds; for daily runs I plan to use period=86400.)
In my Module1.py class, I have this kind of code:
def action1(self, arg1, arg2, arg3):
    for row in db().select(db.table1.ALL):
        row.processed = 'processed'
        row.update_record()
One of the issues I am facing is that I don't clearly understand how to make this scheduler automatically handle the execution of action1 on a daily basis.
When I launch my application using syntax similar to: python web2py.py -K my_app it shows this in the console:
web2py Web Framework
Created by Massimo Di Pierro, Copyright 2007-2015
Version 2.11.2-stable+timestamp.2015.05.30.16.33.24
Database drivers available: sqlite3, imaplib, pyodbc, pymysql, pg8000
starting single-scheduler for "my_app"...
However, when I see the browser at:
http://127.0.0.1:8000/my_app/default/daily_periodic_task
I just see "None" as text displayed on the screen and I don't see any changes produced by the scheduled task in my database table.
While when I see the browser at:
http://127.0.0.1:8000/my_app/default/index
I get an error stating This web page is not available, basically indicating my application never got started.
When I start my application normally using python web2py.py my application loads fine but I don't see any changes produced by the scheduled task in my database table.
I am unable to figure out what I am doing wrong here and how to properly use the scheduler with Web2Py. Basically, I need to know how I can start my application normally while the scheduled tasks run properly in the background.
Any help in this regard would be highly appreciated.
Running python web2py.py starts the built-in web server, enabling web2py to respond to HTTP requests (i.e., serving web pages to a browser). This has nothing to do with the scheduler and will not result in any scheduled tasks being run.
To run scheduled tasks, you must start one or more background workers via:
python web2py.py -K myapp
The above does not start the built-in web server and therefore does not enable you to visit web pages. It simply starts a worker process that will be available to execute scheduled tasks.
Also, note that the above does not actually result in any tasks being scheduled. To schedule a task, you must insert a record in the db.scheduler_task table, which you can do via any of the usual methods of inserting records (including using appadmin) or programmatically via the scheduler.queue_task method (which is what you use in your daily_periodic_task action).
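For instance, a hedged sketch of a controller action that queues the daily task (assuming the Scheduler instance and tasks dict from the question; note that the name passed to queue_task should match a key of that tasks dict):

def daily_periodic_task():
    # Queue the task once: repeats=0 means repeat forever, period is in seconds.
    result = daily_task_scheduler.queue_task(
        'my_daily_task',   # key registered in tasks=dict(my_daily_task=daily_task)
        period=86400,      # once a day
        repeats=0,         # repeat indefinitely
        timeout=3600,      # allow up to an hour per run (assumption)
    )
    return dict(result=result)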
Note, you can simultaneously start the built-in web server and a scheduler worker process via:
python web2py.py -a yourpassword -K myapp -X
So, to schedule a daily task and have it actually executed, you need to (a) start a scheduler worker and (b) schedule the task. You can schedule the task by visiting your daily_periodic_task action, but note that you only need to visit that action once, as once the task has been scheduled, it remains in effect indefinitely (given that you have set repeats=0).
If the task does not appear to be working, it is possible there is something wrong with the task itself that is resulting in an error.
I've got a Google App Engine app which runs some code on a dynamic backend defined as follows:
backends:
- name: downloadfilesbackend
  class: B1
  instances: 1
  options: dynamic
I've recently made some changes to my code and added a second backend. I've moved some tasks from the front end to the new backend, and they work fine. However, I want to move the tasks that originally ran on downloadfilesbackend to the new backend (to save on instance hours). I am doing this simply by changing the name of the target to the new backend, i.e.
taskqueue.add(queue_name = "organise-files",
url=queue_organise_files,
target='organise-files-backend')
However, despite giving the new backend name as the target, the tasks are still being run by the old backend. Any idea why this is happening or how I can fix it?
EDIT:
The old backend is running new tasks - I've checked this.
I've also been through all of my code to check to see if anything is calling the old backend and nothing is. There are only two methods which added tasks to the old backend, and both of these methods have been changed as detailed above.
I stopped the old backend for a few hours, to see whether this would change anything, but all that happened was that the tasks got jammed until I restarted the backend. The new backend is running other tasks fine, so it's definitely been updated correctly...
It's taken a while, but I've finally discovered that it is not enough to just change the code and upload it using the SDK. If the code running on the backend, or the code sending tasks to the backend handler, is changed, then you must run
appcfg backends <dir> update [backend]
The documentation for this command is here. I haven't seen any documentation that says this explicitly; it was another, related error I was experiencing that prompted me to try this as an avenue. Just thought I'd let people know in case anyone is having a similar problem.