Trigger a DAG run from a Python program

I wrote a Python program that creates a DAG file. After creating this DAG file, I want to trigger a run of this DAG. I tried the following code:
from airflow.api.client.local_client import Client
c = Client(None, None)
c.trigger_dag(dag_id='local_job_md', run_id='local_job_md', conf={})
But this code fails with an error because it cannot find the DAG table in SQLite. After a little research, I realized this might be due to some gaps in my installation. I am new to APIs, but I understand there is a way to use the stable REST API to trigger the DAG from my program. I need your help with this; I really want to trigger the DAG from my code.
Please help me out from such a situation. Any help is appreciated!
Thanks,
Jay

Even though you only want to use the API, you still need to initialize the Airflow DB. "Unable to find the dag table in sqlite" means your airflow.db has not been initialized.
To do this, go to your $AIRFLOW_HOME directory and run:
airflow initdb
If this command doesn't work for you, you may not have set up Airflow correctly, so I'd suggest starting with the install steps from the beginning.
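If you do go down the stable REST API route instead (available in Airflow 2.x), a minimal sketch of triggering a DAG run from a Python program might look like the following; the webserver URL, credentials, and dag_id are placeholders for your own setup, and basic auth is assumed to be enabled:

# Hypothetical sketch: trigger a DAG run via the Airflow 2.x stable REST API.
# Assumes the webserver is reachable at localhost:8080 with basic auth enabled.
import requests

response = requests.post(
    "http://localhost:8080/api/v1/dags/local_job_md/dagRuns",
    auth=("airflow", "airflow"),   # replace with your credentials
    json={"conf": {}},             # optional run configuration
)
response.raise_for_status()
print(response.json())             # details of the newly created DAG run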

Related

Is there any difference between a Python script in Airflow and the same script in plain Python

I wrote the code below, but it runs endlessly in Airflow, while on my own machine it takes about 5 minutes to run:
import pygsheets

gc = pygsheets.authorize(service_account_file='file.json')
sh3 = gc.open("city")
wks3 = sh3.worksheet_by_title("test")
df = wks3.get_as_df()            # read the current sheet contents
df2 = demo_r                     # demo_r is a DataFrame defined elsewhere in the script
wks3.clear()
wks3.set_dataframe(df2, (1, 1))  # write the new DataFrame back to the sheet
Answering just the question in the title because we can't do anything about your code without more details (stack trace/full code sample/infra setup/etc).
Airflow is a Python framework and will run any code you give it. So there is no difference between a Python script run via an Airflow task or just on your laptop -- the same lines of code will be executed. However, do note that Airflow runs Python code in a separate process, and possibly on different machines, depending on your chosen executor. Airflow registers metadata in a database and manages logfiles from your tasks, so there's more happening around your task when you execute it in Airflow.
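To make that concrete, here is a minimal sketch (assuming Airflow 2.x; the function and dag_id names are made up) showing that the very same function can run on your laptop or be wrapped in an Airflow task:

# Illustrative sketch: the same function runs locally or as an Airflow task.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def my_script():
    # exactly the lines of code you would run on your laptop
    print("running the same code")

with DAG(
    dag_id="example_same_code",      # made-up dag_id for illustration
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="run_script", python_callable=my_script)

if __name__ == "__main__":
    my_script()  # run directly, outside Airflow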

Safely run an executable in Node

I found myself having to implement the following use case: I need to run a webapp in which users can submit C programs, which need to be run safely on my backend.
I'm trying to get this done using Node. In the past, I had to do something similar, but the user-submitted code was JavaScript, and I got away with using the Node vm2 module. Essentially, I would create a VM and call its run method with the user-submitted code as a string argument, then collect the output and do whatever I had to.
I'm trying to understand whether the same module could help me with C code as well. The idea would be to use exec to first call gcc and compile the user code. Afterwards, I would use a VM to run exec again, this time running the generated executable. Would this be safe?
I don't understand vm2 deeply enough to know whether its safety is limited to executing JS code or if it can also be trusted to run arbitrary shell commands safely.
In case vm2 isn't appropriate, what would be another way to run an executable in a sandboxed fashion in Node? Feel free to also suggest Python-based solutions, if you know any. Please note that the code will still be executed in a separate container from the main app regardless, but I want to make extra sure users cannot easily tear it down at will.
Thank you in advance.
I am currently facing the same challenge, trying to safely execute some untrusted code using spawn, so what I can tell you is that vm2 only works for JS/TS code; it cannot control what happens in a new process created by spawn, fork, or exec.
For now I haven't found any good solution, but I'm thinking of trying to run the process as a user with limited rights.
As you have access to the C source code, I would advise you to research how to run untrusted C programs in general (in plain C), and see whether you can manipulate the C code itself in order to get a safer environment from that angle.
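Since the question mentions Python-based suggestions are welcome: below is a rough sketch (POSIX only, with made-up limits) of wrapping the compiled binary with a timeout and basic resource limits. On its own this is not a complete sandbox; it only complements running in a container as a low-privilege user.

# Rough sketch (POSIX only): run an untrusted binary with a wall-clock timeout
# and basic per-process resource limits. NOT a complete sandbox by itself.
import resource
import subprocess

def limit_resources():
    # Runs in the child process just before exec: cap CPU seconds and memory.
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))                         # 2 CPU seconds
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024**2, 256 * 1024**2))  # 256 MiB

def run_untrusted(path_to_binary):
    return subprocess.run(
        [path_to_binary],
        capture_output=True,
        timeout=5,                   # hard wall-clock limit in seconds
        preexec_fn=limit_resources,  # apply limits in the child before exec
    )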

Configure logging retention policy for Apache airflow

I could not find in the Airflow docs how to set up the retention policy I need.
At the moment, we keep all airflow logs forever on our servers which is not the best way to go.
I wish to create global logs configurations for all the different logs I have.
How and where do I configure:
Number of days to keep
Max file size
I ran into the same situation yesterday; the solution for me was to use a DAG that handles all the log cleanup and to schedule it like any other DAG.
Check this repo; you will find a step-by-step guide on how to set it up. Basically, what it does is delete files located in airflow-home/log/ and airflow-home/log/scheduler based on a retention period defined in a Variable. The DAG dynamically creates one task for each directory targeted for deletion, based on that definition.
In my case, the only modification I made to the original DAG was to restrict deletion to the scheduler folder by replacing the initial value of DIRECTORIES_TO_DELETE. All credit to the creators! It works very well out of the box and is easy to customize.
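If you only want the core of that idea rather than the full repo, a simplified sketch of a cleanup DAG (assuming Airflow 2.x; the log path and retention period below are placeholders) could look like this:

# Simplified sketch of a log-cleanup DAG: delete log files older than N days.
# The path and retention period are placeholders; adjust for your installation.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

LOG_DIR = "/path/to/airflow-home/logs"   # hypothetical location
RETENTION_DAYS = 30

with DAG(
    dag_id="airflow_log_cleanup",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="delete_old_logs",
        bash_command=(
            f"find {LOG_DIR} -type f -name '*.log' "
            f"-mtime +{RETENTION_DAYS} -delete"
        ),
    )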

Airflow "This DAG isnt available in the webserver DagBag object "

When I put a new DAG Python script in the dags folder, I can see a new DAG entry in the DAG UI, but it is not enabled automatically. On top of that, it does not seem to be loaded properly either. I have to click the Refresh button a few times on the right side of the list and toggle the on/off button on the left side of the list before I can schedule the DAG. These are manual steps I have to perform even though the DAG script was put inside the dags folder.
Can anyone help me with this? Did I miss something? Or is this the correct behavior in Airflow?
By the way, as mentioned in the post title, an indicator with the message "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database" is shown next to the DAG title before I go through all these manual steps.
It is not you, nor is it correct or expected behavior.
It is a current 'bug' with Airflow.
The web server is caching the DagBag in a way that you cannot really use it as expected.
"Attempt removing DagBag caching for the web server" remains on the official TODO as part of the roadmap, indicating that this bug may not yet be fully resolved, but here are some suggestions on how to proceed:
only use builders in airflow v1.9+
Prior to airflow v1.9 this occurs when a dag is instantiated by a function which is imported into the file where instantiation happens. That is: when a builder or factory pattern is used. Some reports of this issue on GitHub and JIRA led to a fix included in airflow v1.9.
If you are using an older version of airflow, don't use builder functions.
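For reference, the builder/factory pattern meant here is something like the following (hypothetical file and function names), where the DAG object is created inside a function that is imported into the DAG file:

# Hypothetical illustration of the factory pattern that triggered this issue
# before Airflow 1.9: the DAG is instantiated inside an imported function.
# dag_factory.py
from datetime import datetime

from airflow import DAG

def build_dag(dag_id):
    return DAG(dag_id=dag_id, start_date=datetime(2021, 1, 1), schedule_interval=None)

# my_dag.py (a separate file in the dags folder):
# from dag_factory import build_dag
# dag = build_dag("my_factory_dag")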
airflow backfill to reload the cache
As Dmitri suggests, running airflow backfill '<dag_id>' -s '<date>' -e '<date>' for the same start and end date can sometimes help. Thereafter you may end up with the (non-)issue that Priyank points out, but that is expected behavior (state: paused or not) depending on the configuration you have in your installation.
Restarting the airflow webserver solved the issue for me.
This error can be misleading. If hitting refresh button or restarting airflow webserver doesn't fix this issue, check the DAG (python script) for errors.
Running airflow list_dags will display the DAG errors (in addition to listing the dags); you can also try running/testing your DAG as a normal Python script.
After fixing the error, this indicator should go away.
The issue is that, by default, the DAG is put into the DagBag in a paused state so that the scheduler is not overwhelmed with a lot of backfill activity on start/restart.
To work around this change the below setting in your airflow.cfg file:
# Are DAGs paused by default at creation
dags_are_paused_at_creation = False
Hope this helps. Cheers!
I have a theory about a possible cause of this issue in Google Composer. There is a section about DAG failures on the webserver in the troubleshooting documentation for Composer, which says:
Avoid running heavyweight computation at DAG parse time. Unlike the worker and scheduler nodes, whose machine types can be customized to have greater CPU and memory capacity, the webserver uses a fixed machine type, which can lead to DAG parsing failures if the parse-time computation is too heavyweight.
I was trying to load configuration from an external source (which actually took a negligible amount of time compared to the other operations needed to create the DAG, but it still broke something, because the Airflow webserver in Composer runs on App Engine, which has some strange behaviours).
I found the workaround in the discussion of this Google issue: create a separate DAG with a task which loads all the data needed and stores that data in an Airflow Variable:
Variable.set("pipeline_config", config, serialize_json=True)
Then I could do
Variable.get("pipeline_config", deserialize_json=True)
And successfully generate the pipeline from that. An additional benefit is that I get logs from that task, which I do not get from the webserver because of this issue.
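Putting the pieces together, a sketch of this pattern (the function names and example config below are placeholders) might look like this:

# Sketch of the pattern: a task does the expensive config loading and stores
# the result in an Airflow Variable, so the DAG-generation code only needs a
# cheap Variable.get() at parse time. Names below are illustrative placeholders.
from airflow.models import Variable

def fetch_config_from_external_source():
    # Placeholder for the expensive external lookup done inside the task.
    return {"tables": ["a", "b"], "schedule": "@daily"}

def load_and_store_config():
    config = fetch_config_from_external_source()
    Variable.set("pipeline_config", config, serialize_json=True)

# In the DAG file that generates the pipeline (executed at parse time):
config = Variable.get("pipeline_config", deserialize_json=True, default_var={})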

Python: Increasing the timeout value in EMR using Yelp's MRJob

I am using Yelp's MRJob to write some of my MapReduce programs, and I am running them on EMR. My program has reducer code which takes a long time to execute, and because of the default timeout period in EMR I am getting this error:
Task attempt_201301171501_0001_r_000000_0 failed to report status for 600 seconds. Killing!
I want a way to increase the timeout in EMR. I read the official mrjob documentation on this but was not able to understand the procedure. Can someone suggest a way to solve this issue?
I've dealt with a similar issue with EMR in the past. The property you are looking for is mapred.task.timeout, which corresponds to the number of milliseconds before a task is terminated if it neither reads an input, writes an output, nor updates its status string.
With MRJob, you could add the following option:
--jobconf mapred.task.timeout=1800000
EDIT: It appears that some EMR AMIs do not support setting parameters like the timeout with jobconf at run time. Instead, you must use a bootstrap-time configuration like this:
--bootstrap-action="s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.task.timeout=1800000"
I would still try the first option first and see if you can get it to work; otherwise, try the bootstrap action.
To use any of these parameters, just create your job by extending MRJob; this class has a jobconf method that reads your --jobconf parameters, so you can specify them as regular options on the command line:
python job.py --num-ec2-instances 42 --python-archive t.tar.gz -r emr --jobconf mapred.task.timeout=1800000 /path/to/input.txt
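For completeness, a minimal sketch of what job.py might look like (the word-count logic is only a placeholder; a slow reducer like yours is exactly where mapred.task.timeout matters):

# Minimal sketch of a job.py built on MRJob; the word-count logic is a
# placeholder so the command line above has something concrete to run.
from mrjob.job import MRJob

class MyJob(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, key, values):
        # A long-running reducer is where the task timeout becomes relevant.
        yield key, sum(values)

if __name__ == "__main__":
    MyJob.run()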
