I have an Airflow DAG comprising 2-3 steps:
PythonOperator --> runs a query on AWS Athena and stores the generated file at a specific S3 path
BashOperator --> increments the Airflow variable used for tracking
BashOperator --> takes the output (response) of task 1 and runs some code on top of it.
What happens here is that the DAG completes within seconds even though the Athena query is still running.
I want to make sure that the further steps run only after the file has been generated. Basically, I want this to be synchronous.
You can set the tasks as:
def athena_task():
    # Add your code
    return

t1 = PythonOperator(
    task_id='athena_task',
    python_callable=athena_task,
)

t2 = BashOperator(
    task_id='variable_task',
    bash_command='',  # replace with relevant command
)

t3 = BashOperator(
    task_id='process_task',
    bash_command='',  # replace with relevant command
)

t1 >> t2 >> t3
t2 will run only after t1 is completed successfully and t3 will start only after t2 is completed successfully.
Note that Airflow has AWSAthenaOperator, which might save you the trouble of writing the code yourself. The operator submits a query to Athena and saves the output to an S3 path set via the output_location parameter:
run_query = AWSAthenaOperator(
    task_id='athena_task',
    query='SELECT * FROM my_table',
    output_location='s3://some-bucket/some-path/',
    database='my_database'
)
Athena's query API is asynchronous. You start a query, get an ID back, and then you need to poll until the query has completed using the GetQueryExecution API call.
If you only start the query in the first task, then there is no guarantee that the query has completed when the next task runs. Only when GetQueryExecution has returned a status of SUCCEEDED (or FAILED/CANCELLED) has the query finished, and only on SUCCEEDED can you expect the output file to exist.
As @Elad points out, AWSAthenaOperator does this for you, handles error cases, and more.
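If you do want to implement the start-and-poll logic yourself inside the PythonOperator, a minimal sketch with boto3 could look like this (the query, database, and output location are placeholders mirroring the other answer):

import time

import boto3

def run_athena_query_and_wait():
    """Submit an Athena query, then block until it reaches a terminal state."""
    client = boto3.client("athena")

    # Start the query; Athena returns immediately with a query execution id.
    response = client.start_query_execution(
        QueryString="SELECT * FROM my_table",  # placeholder query
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://some-bucket/some-path/"},
    )
    query_id = response["QueryExecutionId"]

    # Poll GetQueryExecution until the query finishes.
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query finished in state {state}")

    return query_id

Returning only once the state is terminal is what makes the downstream tasks effectively synchronous with the query.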
def my_function(**kwargs):
    global dag_run_id
    dag_run_id = kwargs['dag_run'].run_id

example_task = PythonOperator(
    task_id='example_task',
    python_callable=my_function,
    provide_context=True,
    dag=dag)

bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo {{ dag_run_id }}',
    dag=dag,
)
bash_task is not printing the value of dag_run_id, and I want to use dag_run_id in other tasks.
XCom (Cross-Communication) is the mechanism in Airflow that allows you to share data between tasks. Returning a value from a PythonOperator's callable automatically stores the value as an XCom. So your Python function could do:
def my_function(**kwargs):
    dag_run_id = kwargs["run_id"]
    return dag_run_id
Note that run_id is one of the templated variables given by Airflow, see the full list here: https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#variables.
This stores the returned value as an "XCom" in Airflow. You can observe XComs via the Grid View -> select task -> XCom, or see all XCom values via Admin -> XComs.
You can then fetch (known as "pull" in Airflow) the value in another task:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo {{ ti.xcom_pull(task_ids='example_task') }}",
)
This will fetch the XCom value from the task with id example_task and echo it.
The full DAG code looks like this:
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="so_75213078",
    start_date=datetime.datetime(2023, 1, 1),
    schedule_interval=None,
):

    def my_function(**kwargs):
        dag_run_id = kwargs["run_id"]
        return dag_run_id

    example_task = PythonOperator(task_id="example_task", python_callable=my_function)

    bash_task = BashOperator(
        task_id="bash_task",
        bash_command="echo {{ ti.xcom_pull(task_ids='example_task') }}",
    )

    example_task >> bash_task
Tasks are executed by separate processes in Airflow (and sometimes on separate machines), so you cannot rely on things like a global variable or a local file path being available to all tasks.
To pass a variable value from a task to another one, you can use Airflow Cross-Communication (XCom) as explained in the other answer.
But if you just want to pass the dag run id, you don't need any of this, because it is already available to all tasks as a template variable:
bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo {{ dag_run.run_id }}',
    dag=dag,
)
I want to build Airflow tasks that use multiple gcloud commands.
A simple example :
def worker(**kwargs):
    exe = subprocess.run(["gcloud", "compute", "instances", "list"],
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    print(exe.returncode)
    for line in exe.stdout.splitlines():
        print(line.decode())

    exe = subprocess.run(["gcloud", "compute", "ssh", "user@host", "--command=pwd"],
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    print(exe.returncode)
    for line in exe.stdout.splitlines():
        print(line.decode())

dag = DAG("TEST", default_args=default_args, schedule_interval=None)

worker_task = PythonOperator(task_id='sample-task', python_callable=worker, provide_context=True, dag=dag)

worker_task
I have this error :
ERROR: gcloud crashed (AttributeError): 'NoneType' object has no attribute 'isatty'
Outside of Airflow, these commands work fine.
I've already tried disabling gcloud interactive mode with "--quiet", but that doesn't help.
I don't want to use the GcloudOperator from Airflow, because these commands must be integrated into a custom operator.
Thank you in advance for your help.
As far as I can see, your two commands are independent, so you can run them as two separate tasks using BashOperator. If you want to access the output of the commands, the output of each one will be available as an XCom, which you can read using ti.xcom_pull(task_ids='<the task id>').
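A minimal sketch of that idea (the task ids are assumptions, and the import path may differ between Airflow versions):

from airflow.operators.bash import BashOperator  # adjust the import for your Airflow version

list_instances = BashOperator(
    task_id="list_instances",
    bash_command="gcloud compute instances list",
    dag=dag,
)

# The last line of stdout is pushed as an XCom (on older Airflow versions you
# may need to set xcom_push=True), so it can be pulled here via templating.
use_output = BashOperator(
    task_id="use_output",
    bash_command="echo {{ ti.xcom_pull(task_ids='list_instances') }}",
    dag=dag,
)

list_instances >> use_output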
Maybe use BashOperator?
worker_task = BashOperator(task_id="sample-task", bash_command="gcloud compute instances list", dag=dag)
I'm new to airflow and I'm trying to run a job on an ec2 instance using airflow's ssh_operator like shown below:
t2 = SSHOperator(
    ssh_conn_id='ec2_ssh_connection',
    task_id='execute_script',
    command="nohup python test.py &",
    retries=3,
    dag=dag)
The job takes a few hours, and I want Airflow to execute the Python script and finish. However, when the command is executed and the DAG completes, the script is terminated on the EC2 instance. I also noticed that the above code doesn't create a nohup.out file.
I'm looking at how to run nohup using SSHOperator. It seems like this might be a Python-related issue, because I'm getting the following error in the EC2 script when nohup has been executed:
[Errno 32] Broken pipe
Thanks!
Airflow's SSHHook uses the Paramiko module for SSH connectivity. There is an SO question regarding Paramiko and nohup. One of the answers suggests adding sleep after the nohup command. I cannot explain exactly why, but it actually works. It is also necessary to set get_pty=True in SSHOperator.
Here is a complete example that demonstrates the solution:
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

default_args = {
    'start_date': datetime(2001, 2, 3, 4, 0),
}

with DAG(
    'a_dag', schedule_interval=None, default_args=default_args, catchup=False,
) as dag:
    op = SSHOperator(
        task_id='ssh',
        ssh_conn_id='ssh_default',
        command=(
            'nohup python -c "import time;time.sleep(30);print(1)" & sleep 10'
        ),
        get_pty=True,  # This is needed!
    )
The nohup.out file is written to the user's $HOME.
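If you would rather have an explicit log file than rely on nohup.out, you can redirect inside the command. A sketch that drops into the DAG above (the remote paths are assumptions):

op_with_log = SSHOperator(
    task_id='ssh_with_log',
    ssh_conn_id='ssh_default',
    # Redirect stdout and stderr explicitly so the output lands in a known file.
    command='nohup python /home/ec2-user/test.py > /home/ec2-user/test.log 2>&1 & sleep 10',
    get_pty=True,  # still required, as above
)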
I am a newbie to Airflow and struggling with BashOperator. I want to run a shell script using BashOperator in my dag.py.
I checked:
How to run bash script file in Airflow
and
BashOperator doen't run bash file apache airflow
on how to access shell script through bash operator.
This is what I did:
cmd = "./myfirstdag/dag/lib/script.sh "

t_1 = BashOperator(
    task_id='start',
    bash_command=cmd
)
On running my recipe and checking in Airflow, I got the error below:
[2018-11-01 10:44:05,078] {bash_operator.py:77} INFO - /tmp/airflowtmp7VmPci/startUDmFWW: line 1: ./myfirstdag/dag/lib/script.sh: No such file or directory
[2018-11-01 10:44:05,082] {bash_operator.py:80} INFO - Command exited with return code 127
[2018-11-01 10:44:05,083] {models.py:1361} ERROR - Bash command failed
Not sure why this is happening. Any help would be appreciated.
Thanks !
EDIT NOTE: I assume that it's searching in some Airflow tmp location rather than the path I provided. But how do I make it search in the right path?
Try this:
bash_operator = BashOperator(
    task_id='task',
    bash_command='${AIRFLOW_HOME}/myfirstdag/dag/lib/script.sh ',
    dag=your_dag)
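Alternatively, if the script lives next to your DAG file, you can build an absolute path from the DAG file's own location instead of relying on the working directory. A sketch, assuming a lib/script.sh layout relative to the DAG file, that replaces the snippet in the question:

import os

# Resolve the script relative to this DAG file, not the temporary directory
# the BashOperator runs in. The trailing space stops Jinja from treating the
# .sh path as a template file to render.
cmd = os.path.join(os.path.dirname(__file__), "lib", "script.sh") + " "

t_1 = BashOperator(
    task_id='start',
    bash_command=cmd
)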
For those running a Docker version:
I had this same issue, and it took me a while to realise the problem; the behaviour can be different with Docker. When the DAG is run, the command is copied to a tmp file. If you do not run Airflow on Docker, this is on the same machine; with my Docker setup it is moved to another container to run, which of course does not have the script file on it.
Check the task logs carefully; you should see this happen before the task is run.
This may also depend on your airflow-docker setup.
Try the following. It needs to have a full file path to your bash file.
cmd = "/home/notebook/work/myfirstdag/dag/lib/script.sh "

t_1 = BashOperator(
    task_id='start',
    bash_command=cmd
)
Are you sure of the path you defined?
cmd = "./myfirstdag/dag/lib/script.sh "
With the leading . the path is relative to the directory where the command is executed.
Could you try this?
cmd = "find . -type f"
Try running this:
path = "/home/notebook/work/myfirstdag/dag/lib/script.sh"
copy_script_cmd = 'cp ' + path + ' .;'
execute_cmd = './script.sh'

t_1 = BashOperator(
    task_id='start',
    bash_command=copy_script_cmd + execute_cmd
)
I have a main Python script that generates a GUI, and through that GUI I want the user to be able to create, amend, and delete schedules managed by the Windows Task Scheduler.
This code creates a task which will run in 5 minutes (uses pywin32):
import datetime
import win32com.client
scheduler = win32com.client.Dispatch('Schedule.Service')
scheduler.Connect()
root_folder = scheduler.GetFolder('\\')
task_def = scheduler.NewTask(0)
# Create trigger
start_time = datetime.datetime.now() + datetime.timedelta(minutes=5)
TASK_TRIGGER_TIME = 1
trigger = task_def.Triggers.Create(TASK_TRIGGER_TIME)
trigger.StartBoundary = start_time.isoformat()
# Create action
TASK_ACTION_EXEC = 0
action = task_def.Actions.Create(TASK_ACTION_EXEC)
action.ID = 'DO NOTHING'
action.Path = 'cmd.exe'
action.Arguments = '/c "exit"'
# Set parameters
task_def.RegistrationInfo.Description = 'Test Task'
task_def.Settings.Enabled = True
task_def.Settings.StopIfGoingOnBatteries = False
# Register task
# If task already exists, it will be updated
TASK_CREATE_OR_UPDATE = 6
TASK_LOGON_NONE = 0
root_folder.RegisterTaskDefinition(
    'Test Task',  # Task name
    task_def,
    TASK_CREATE_OR_UPDATE,
    '',  # No user
    '',  # No password
    TASK_LOGON_NONE)
More info on tasks and their properties here: https://learn.microsoft.com/en-us/windows/desktop/taskschd/task-scheduler-objects
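Since the question also covers amending and deleting schedules, the same COM interface can handle those too; a short sketch reusing the task name from the example above:

import win32com.client

scheduler = win32com.client.Dispatch('Schedule.Service')
scheduler.Connect()
root_folder = scheduler.GetFolder('\\')

# Enumerate the registered tasks in the root folder (0 = exclude hidden tasks).
for task in root_folder.GetTasks(0):
    print(task.Name, task.State)

# Delete the task registered above; the second argument is reserved and must be 0.
root_folder.DeleteTask('Test Task', 0)

To amend an existing task, re-register it with the TASK_CREATE_OR_UPDATE flag as in the snippet above.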
PyWin32 provides an interface to the Task Scheduler in win32com.taskscheduler. You can see an example of its use here:
https://github.com/SublimeText/Pywin32/blob/master/lib/x32/win32comext/taskscheduler/test/test_addtask_1.py
Also, @FredP linked to a good example that's much simpler:
http://blog.ziade.org/2007/11/01/scheduling-tasks-in-windows-with-pywin32/
There is also an interesting tidbit in the wmi module's cookbook about scheduling a job, although it doesn't appear to use the Task Scheduler:
http://timgolden.me.uk/python/wmi/cookbook.html#schedule-a-job
Just to round out the option list here... How about just calling the Windows command line?
import os
os.system(r'SchTasks /Create /SC DAILY /TN "My Task" /TR "C:\mytask.bat" /ST 09:00')
You can launch any executable, batch file, or even another python script - assuming the system is set to execute python...
schtasks has a rich list of options and capabilities: https://learn.microsoft.com/en-us/windows/win32/taskschd/schtasks
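Since the question asks about creating, amending, and deleting schedules, here is a sketch of all three with subprocess and schtasks (the task name, program path, and times are placeholders):

import subprocess

TASK_NAME = "My Task"  # placeholder

# Create a daily task.
subprocess.run(
    ['schtasks', '/Create', '/SC', 'DAILY', '/TN', TASK_NAME,
     '/TR', r'C:\mytask.bat', '/ST', '09:00'],
    check=True)

# Amend: change the start time of the existing task.
subprocess.run(['schtasks', '/Change', '/TN', TASK_NAME, '/ST', '10:30'], check=True)

# Delete without a confirmation prompt.
subprocess.run(['schtasks', '/Delete', '/TN', TASK_NAME, '/F'], check=True)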
I also needed a way to use Python to schedule a task in Windows 10. I discovered something simpler, using only subprocess and PowerShell's Scheduled Tasks cmdlets, which is more powerful as it gives you finer control over the task to schedule.
And there's no need for a third-party module for this one.
import subprocess
# Use triple quotes string literal to span PowerShell command multiline
STR_CMD = """
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "C:\\path\\to\\file.ps1"
$description = "Using PowerShell's Scheduled Tasks in Python"
$settings = New-ScheduledTaskSettingsSet -DeleteExpiredTaskAfter (New-TimeSpan -Seconds 2)
$taskName = "Test Script"
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date).AddSeconds(10)
$trigger.EndBoundary = (Get-Date).AddSeconds(30).ToString("s")
Register-ScheduledTask -TaskName $taskName -Description $description -Action $action -Settings $settings -Trigger $trigger | Out-Null
"""
# Use a list to make it easier to pass argument to subprocess
listProcess = [
    "powershell.exe",
    "-NoExit",
    "-NoProfile",
    "-Command",
    STR_CMD,
]
# Enjoy the magic
subprocess.run(listProcess, check=True)