Configure logging retention policy for Apache Airflow (Python)

I could not find in the Airflow docs how to set up the retention policy I need.
At the moment we keep all Airflow logs on our servers forever, which is not the best way to go.
I wish to create global logs configurations for all the different logs I have.
How and where do I configure:
Number of days to keep
Max file size

I ran into the same situation yesterday; the solution for me was to use a DAG that handles all the log cleanup and to schedule it like any other DAG.
Check this repo, you will find a step-by-step guide on how to set it up. Basically, what you will achieve is deleting files located in airflow-home/log/ and airflow-home/log/scheduler based on a period defined in a Variable. The DAG dynamically creates one task for each directory targeted for deletion based on your definition.
In my case, the only modification I made to the original DAG was to restrict deletion to the scheduler folder by replacing the initial value of DIRECTORIES_TO_DELETE. All credit to the creators! It works very well out of the box and is easy to customize.
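For reference, here is a minimal sketch of the same idea, assuming a BashOperator that runs find against the log directory and an Airflow Variable holding the retention period. The Variable name, paths, and schedule are illustrative, not the linked repo's actual values:

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

# Retention period in days, kept in an Airflow Variable so it can be
# changed from the UI without redeploying the DAG (name is illustrative).
MAX_LOG_AGE_IN_DAYS = Variable.get("max_log_age_in_days", default_var=30)

# Directories to clean; adjust to your AIRFLOW_HOME layout.
DIRECTORIES_TO_DELETE = ["/usr/local/airflow/logs/scheduler"]

with DAG(
    dag_id="airflow_log_cleanup",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for i, directory in enumerate(DIRECTORIES_TO_DELETE):
        BashOperator(
            task_id=f"delete_old_logs_{i}",
            # Delete files older than the retention period, then prune
            # any directories left empty.
            bash_command=(
                f"find {directory} -type f -mtime +{MAX_LOG_AGE_IN_DAYS} -delete && "
                f"find {directory} -type d -empty -delete"
            ),
        )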


Moving files of very different size from one place to another - optimization in Airflow

I'm implementing a DAG in Airflow moving files from an on-prem location to Azure Blob Storage. Reading files from the source and sending them to Azure is realized via a Data Access Layer outside of Airflow.
The thing is that files in the source can be very small (kilobytes) but potentially also very big (gigabytes). The goal is not to delay the movement of small files while the big ones are being processed.
I currently have a DAG which has two tasks:
list_files - list files in the source.
move_file[] - download the file to a temporary location, upload it to Azure and clean up (delete it from the temporary location and from the source).
Task 1) returns a list of locations in the source, whereas task 2) is run in parallel for each path returned by task 1), using dynamic task mapping introduced in Airflow 2.3.0.
The DAG is set with max_active_runs=1 so that another DAG run is not created while the big files are still being processed by the previous one. The problem, however, is that between two scheduled DAG runs some new files can arrive in the source, and they cannot be moved right away because a previous DAG run is still processing the big files. Setting max_active_runs to 2 does not seem like an option, because the second DAG run would attempt to process the files which are already being processed by the previous one (the big ones which did not move between two scheduled runs).
What is the best approach in Airflow to address such an issue? Basically, I want to make the file transfer from one place to another as smooth as possible, taking into account that I might have both very small and very big files.
EDIT: I am now thinking that maybe I could use some .lock files. The move_file task would put a .lock file with the name of the file being moved in a location that Airflow has access to. Then list_files would read this location and only return those files which do not have locks. Of course, the move_file task would clean up and release the lock after successfully moving the file. That would work, but is it good practice? Maybe instead of .lock files I should somehow use Airflow's metadata database?
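For reference, here is a minimal sketch of the lock-file idea on top of dynamic task mapping (Airflow 2.3+ TaskFlow API). The directories, the schedule, and the upload_to_azure helper are placeholders for illustration, not a tested implementation:

from datetime import datetime
from pathlib import Path

from airflow.decorators import dag, task

SOURCE_DIR = Path("/data/source")   # on-prem source location (placeholder)
LOCK_DIR = Path("/data/locks")      # shared location Airflow can reach


def upload_to_azure(path):
    # Stand-in for the real Data Access Layer call that uploads the file
    # to Azure Blob Storage.
    pass


@dag(start_date=datetime(2023, 1, 1), schedule_interval="@hourly",
     catchup=False, max_active_runs=2)
def move_files():

    @task
    def list_files():
        # Only return files that are not currently locked by another run.
        return [
            str(f) for f in SOURCE_DIR.iterdir()
            if f.is_file() and not (LOCK_DIR / f"{f.name}.lock").exists()
        ]

    @task
    def move_file(path):
        lock = LOCK_DIR / f"{Path(path).name}.lock"
        lock.touch()                      # claim the file for this run
        try:
            upload_to_azure(path)         # download/upload via the DAL
            Path(path).unlink()           # remove from the source
        finally:
            lock.unlink(missing_ok=True)  # always release the lock

    # One mapped task instance per file returned by list_files().
    move_file.expand(path=list_files())


dag = move_files()

Note that the lock is only claimed when move_file starts, so two overlapping runs could still list the same file; in practice you may want to claim the lock inside list_files, or re-check it before uploading.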

Trigger a DAG run from Python

I wrote a Python program to create a DAG file. After creating this DAG file, I want to trigger a run of this DAG. I tried to use the following code:
from airflow.api.client.local_client import Client
c = Client(None, None)
c.trigger_dag(dag_id='local_job_md', run_id='local_job_md', conf={})
But this code raises an error because it cannot find the DAG table in SQLite. After a little research, I realized this might be caused by some gaps in the installation. I am new to the API, but I understand there is a way to use the stable REST API to trigger the DAG from my program. I need your help with this; I really want to trigger the DAG from my code.
Please help me out of this situation. Any help is appreciated!
Thanks,
Jay
Even though you only want to use the API, you still need to initialize the Airflow DB. Being unable to find the DAG table in SQLite means you don't have your airflow.db initialized.
To do this, go to your $AIRFLOW_HOME directory and run:
airflow initdb
If this command doesn't work for you, you may not have set up Airflow correctly, so I'd suggest starting with the install steps from the beginning.
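Since the question also mentions the stable REST API, here is a minimal sketch of triggering a DAG run over HTTP. It assumes an Airflow 2.x webserver reachable at http://localhost:8080 with the basic-auth API backend enabled (on Airflow 2.x, airflow db init replaces airflow initdb):

import requests

AIRFLOW_URL = "http://localhost:8080"    # webserver address (assumption)
DAG_ID = "local_job_md"

# POST /api/v1/dags/{dag_id}/dagRuns is the stable REST API endpoint for
# creating a DAG run; basic auth assumes the
# airflow.api.auth.backend.basic_auth backend is enabled in airflow.cfg.
response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),          # replace with real credentials
    json={"conf": {}},
)
response.raise_for_status()
print(response.json())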

Django: schedule/run tasks dynamically at a particular time, periodically, from a user's order in the app (without Celery)

I am creating a Django app where the user can schedule some tasks to happen at a particular time.
For example, in Google Calendar we tell Google what we will be doing tomorrow, and then at the right time it sends us a notification.
Similarly, here I want it to be flexible; for instance, the user can tell Django the time and the function to run.
Django will wait for that time and then run the function.
For example, the user says "turn off the lights at 12 pm", and Django will do it.
Or the user says "remind me to go to the gym in 30 minutes", and after 30 minutes he gets a notification.
The tasks are added dynamically, so we can't hardcode them up front.
Code:
# skills.py - all the functions defined by the user are imported here
# for example, take this task
def turn_off_light(room, id, *args, **kwargs):
    # turn off the light using python (your logic)
    print(f"light no {id} is turned off!")
There's a Django model called Function in which this function is stored, and the user can access it easily.
I want the user to select this function within the application, give parameters and a time, and have Django run it at that time!
In short, what I need is for the user, from within the application, to be able to set a function (or task) to run at a particular time (maybe periodically, once, or in response to a situation), and for Django to run it at that time.
Note: the user will also give args and kwargs (I am taking care of that). All I need is for Django to run the function with those args and kwargs.
(A Django-only method, without Celery or similar, would be appreciated.)
Thanks!
Without Celery, the best way to do this in Django is to create a custom django-admin command and run it with cron.
For example:
Create a custom command called calendar_routine.py
Create a cron schedule to call your command from your server at a given time
Otherwise there is no way to do it in pure Python/Django
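Here is a minimal sketch of that approach, assuming a hypothetical ScheduledTask model that stores the function name, its arguments, and the due time (the model, its fields, and the app name yourapp are made up for illustration):

# yourapp/management/commands/calendar_routine.py
from django.core.management.base import BaseCommand
from django.utils import timezone

from yourapp import skills                 # user-defined functions
from yourapp.models import ScheduledTask   # hypothetical model


class Command(BaseCommand):
    help = "Run every scheduled task whose due time has passed"

    def handle(self, *args, **options):
        now = timezone.now()
        # Fetch tasks that are due and not yet executed.
        for task in ScheduledTask.objects.filter(run_at__lte=now, done=False):
            func = getattr(skills, task.function_name)
            # args/kwargs are assumed to be stored as JSON on the model.
            func(*task.args, **task.kwargs)
            task.done = True
            task.save(update_fields=["done"])

A matching crontab entry runs the command every minute, for example:
* * * * * /path/to/venv/bin/python /path/to/manage.py calendar_routine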

Django real time jobs

How can I create real-time actions in Python/Django?
For more info:
One user adds something to the database and another user adds the same kind of thing (not identical, but with similar properties) at the same time (and at all times). The program should keep checking whether there are objects with similar properties; if they are not the same, it should check both of them again later against all other objects that may be added or edited in the database.
These checks should happen in real time, or at most a few minutes apart.
For example:
for every(2min):
    do_job()
or
while True:
    do_job()
If I use the second one, the program will block.
You need to run an async task to check the objects in the background. You can check this link for reference: Celery docs.
If you have any limitations using Celery or a similar approach, the other way is to create a scripts.py inside your app (at the same level as models.py and views.py), write the logic there, and schedule it with cron or any scheduler available on your host server.
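A minimal sketch of the scripts.py approach; it bootstraps Django itself because cron runs it outside manage.py, and the project name (yourproject), app name (yourapp), model, and fields are made up for illustration:

# yourapp/scripts.py - run by cron, not by the Django dev server
import os

import django

# Point to your settings module before touching any models.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "yourproject.settings")
django.setup()

from yourapp.models import Item   # hypothetical model


def check_similar_objects():
    # Naive pass: flag unchecked items that share the same similarity key.
    seen = {}
    for item in Item.objects.filter(checked=False):
        key = item.similarity_key
        if key in seen:
            print(f"items {seen[key].pk} and {item.pk} have a similar property")
        else:
            seen[key] = item
        item.checked = True
        item.save(update_fields=["checked"])


if __name__ == "__main__":
    check_similar_objects()

A crontab entry runs it every 2 minutes, for example:
*/2 * * * * /path/to/venv/bin/python /path/to/yourapp/scripts.py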

Airflow "This DAG isnt available in the webserver DagBag object "

When I put a new DAG Python script in the dags folder, I can see a new DAG entry in the DAG UI, but it is not enabled automatically. On top of that, it does not seem to be loaded properly either. I have to click the Refresh button a few times on the right side of the list and toggle the on/off button on the left side of the list to be able to schedule the DAG. These are manual steps I need to perform even though the DAG script was put inside the dags folder.
Can anyone help me with this? Did I miss something? Or is this the correct behavior in Airflow?
By the way, as mentioned in the post title, before I go through this manual process the DAG title is tagged with the message "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metdata database".
It is not you, nor is it correct or expected behavior.
It is a current 'bug' with Airflow.
The web server is caching the DagBag in a way that you cannot really use it as expected.
"Attempt removing DagBag caching for the web server" remains on the official TODO as part of the roadmap, indicating that this bug may not yet be fully resolved, but here are some suggestions on how to proceed:
Only use builders in Airflow v1.9+
Prior to Airflow v1.9 this occurs when a DAG is instantiated by a function which is imported into the file where instantiation happens, that is, when a builder or factory pattern is used. Some reports of this issue on GitHub and JIRA led to a fix released with Airflow v1.9.
If you are using an older version of Airflow, don't use builder functions.
airflow backfill to reload the cache
As Dmitri suggests, running airflow backfill '<dag_id>' -s '<date>' -e '<date>' for the same start and end date can sometimes help. Thereafter you may end up with the (non-)issue that Priyank points out, but that is expected behavior (state: paused or not) depending on the configuration of your installation.
Restarting the Airflow webserver solved the issue for me.
This error can be misleading. If hitting the refresh button or restarting the Airflow webserver doesn't fix this issue, check the DAG (Python script) for errors.
Running airflow list_dags can display the DAG errors (in addition to listing out the DAGs); you can also try running/testing your DAG as a normal Python script.
After fixing the error, this indicator should go away.
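A quick way to surface those import errors from Python, assuming it is run on a machine that sees the same dags folder as the scheduler:

from airflow.models import DagBag

# Parse the dags folder the same way the scheduler does and print
# any import errors keyed by file path.
dag_bag = DagBag()
for path, error in dag_bag.import_errors.items():
    print(path)
    print(error)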
The issue is that, by default, the DAG is put in the DagBag in a paused state so that the scheduler is not overwhelmed with lots of backfill activity on start/restart.
To work around this, change the setting below in your airflow.cfg file:
# Are DAGs paused by default at creation
dags_are_paused_at_creation = False
Hope this helps. Cheers!
I have a theory about a possible cause of this issue in Google Composer. There is a section about DAG failures on the webserver in the troubleshooting documentation for Composer, which says:
"Avoid running heavyweight computation at DAG parse time. Unlike the worker and scheduler nodes, whose machine types can be customized to have greater CPU and memory capacity, the webserver uses a fixed machine type, which can lead to DAG parsing failures if the parse-time computation is too heavyweight."
And I was trying to load configuration from an external source (which actually took a negligible amount of time compared to the other operations needed to create the DAG, but it still broke something, because the Airflow webserver in Composer runs on App Engine, which has strange behaviours).
I found the workaround in the discussion of this Google issue: create a separate DAG with a task which loads all the data needed and stores that data in an Airflow Variable:
Variable.set("pipeline_config", config, serialize_json=True)
Then I could do
Variable.get("pipeline_config", deserialize_json=True)
and successfully generate the pipeline from that. An additional benefit is that I get logs from that task, which I don't get from the webserver because of this issue.
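A minimal sketch of that pattern; the loader DAG, the Variable name pipeline_config, and the config contents are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def load_config():
    # Heavyweight work (API calls, DB queries, ...) happens inside a task
    # on a worker, instead of at DAG parse time on the webserver.
    config = {"tables": ["a", "b", "c"]}   # stand-in for the real lookup
    Variable.set("pipeline_config", config, serialize_json=True)


with DAG(
    dag_id="load_pipeline_config",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as loader_dag:
    PythonOperator(task_id="load_config", python_callable=load_config)


# In the file that defines the generated pipeline, parse time only reads
# the Variable, which is cheap enough for the Composer webserver.
config = Variable.get("pipeline_config", deserialize_json=True, default_var={})

with DAG(
    dag_id="generated_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as pipeline_dag:
    for table in config.get("tables", []):
        PythonOperator(
            task_id=f"process_{table}",
            python_callable=lambda t=table: print(f"processing {t}"),
        )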
