Hi all, I have a function:
def get_campaign_active(ds, **kwargs):
    logging.info('Checking for inactive campaign types..')
    the_db = ds['client']
    db = the_db['misc-server']
    collection = db.campaigntypes
    campaign = list(collection.find({}))
    for item in campaign:
        if item['active'] == False:
            # storing false 'active' campaigns
            result = "'{}' active status set to False".format(item['text'])
            logging.info("'{}' active status set to False".format(item['text']))
which is mapped to an Airflow task:
get_campaign_active = PythonOperator(
    task_id='get_campaign_active',
    provide_context=True,
    python_callable=get_campaign_active,
    xcom_push=True,
    op_kwargs={'client': client_production},
    dag=dag)
As you can see, I pass the client_production variable into op_kwargs for the task. The hope is that this variable will be passed in through the '**kwargs' parameter of the function when the task runs in Airflow.
However, for testing, when I try to call the function like so:
get_campaign_active({"client":client_production})
the client_production variable ends up inside the ds parameter. I don't have a staging Airflow server to test this on, so could someone tell me: if I deploy this function/task to Airflow, will it read the client_production variable from ds or from kwargs?
Right now, if I try to access the 'client' key in kwargs, kwargs is empty.
Thanks
You should do:
def get_campaign_active(ds, **kwargs):
    logging.info('Checking for inactive campaign types..')
    the_db = kwargs['client']
The ds value (and all the other macros) is passed because you set provide_context=True; you can either accept it as a named parameter, as you did, or let it be collected into kwargs as well.
Since your code doesn't actually use ds or any other macro, you can change the function signature to get_campaign_active(**kwargs) and remove provide_context=True. Note that from Airflow >= 2.0, provide_context=True is not needed at all.
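For reference, a minimal sketch of that simplified version (assuming Airflow >= 2.0; client_production, dag and the collection names are taken from the question):
import logging

from airflow.operators.python import PythonOperator

def get_campaign_active(**kwargs):
    # 'client' arrives through op_kwargs; no provide_context needed on Airflow >= 2.0
    logging.info('Checking for inactive campaign types..')
    the_db = kwargs['client']
    db = the_db['misc-server']
    collection = db.campaigntypes
    for item in collection.find({}):
        if not item['active']:
            logging.info("'{}' active status set to False".format(item['text']))

get_campaign_active = PythonOperator(
    task_id='get_campaign_active',
    python_callable=get_campaign_active,
    op_kwargs={'client': client_production},
    dag=dag)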
Related
I can't figure out how to use pytest to test a DAG task waiting for an XComArg.
I created the following DAG using the new Airflow API syntax:
@dag(...)
def transfer_files():
    @task()
    def retrieve_existing_files():
        existing = []
        for elem in os.listdir("./backup"):
            existing.append(elem)
        return existing

    @task()
    def get_new_file_to_sync(existing: list[str]):
        new_files = []
        for elem in os.listdir("./prod"):
            if elem not in existing:
                new_files.append(elem)
        return new_files

    r = retrieve_existing_files()
    get_new_file_to_sync(r)
Now I want to perform unit testing on the get_new_file_to_sync task. I wrote the following test:
def test_get_new_elan_list():
    mocked_existing = ["a.out", "b.out"]
    dag_bag = DagBag(include_examples=False)
    dag = dag_bag.get_dag("transfer_files")
    task = dag.get_task("get_new_file_to_sync")
    result = task.execute({}, mocked_existing)
    print(result)
The test fails because task.execute expects 2 parameters but 3 were given.
My issue is that I have no clue how to test tasks that expect arguments while passing them a mocked custom argument.
Thanks for your insights
I managed to find a way to unit test Airflow tasks declared using the new Airflow API.
Here is a test case for the get_new_file_to_sync task contained in the transfer_files DAG declared in the question:
def test_get_new_file_to_sync():
    mocked_existing = ["a.out", "b.out"]
    # Ask airflow to load the dags in its home folder
    dag_bag = DagBag(include_examples=False)
    # Retrieve the dag to test
    dag = dag_bag.get_dag("transfer_files")
    # Retrieve the task to test
    task = dag.get_task("get_new_file_to_sync")
    # Extract the function to test from the task
    function_to_unit_test = task.python_callable
    # Call the function directly
    results = function_to_unit_test(mocked_existing)
    assert len(results) == 10
This bypasses all the Airflow mechanics that are triggered before your task code is called, so you can focus on testing the code you actually wrote for your task.
For testing such a task, I believe you'll need to use mocking with pytest.
Let's take this user-defined operator as an example:
from collections import Counter, defaultdict

from airflow.models import BaseOperator

# MovielensHook is a custom hook defined elsewhere in the project
class MovielensPopularityOperator(BaseOperator):
    def __init__(self, conn_id, start_date, end_date, min_ratings=4, top_n=5, **kwargs):
        super().__init__(**kwargs)
        self._conn_id = conn_id
        self._start_date = start_date
        self._end_date = end_date
        self._min_ratings = min_ratings
        self._top_n = top_n

    def execute(self, context):
        with MovielensHook(self._conn_id) as hook:
            ratings = hook.get_ratings(start_date=self._start_date, end_date=self._end_date)
            rating_sums = defaultdict(Counter)
            for rating in ratings:
                rating_sums[rating["movieId"]].update(count=1, rating=rating["rating"])
            averages = {
                movie_id: (rating_counter["rating"] / rating_counter["count"], rating_counter["count"])
                for movie_id, rating_counter in rating_sums.items()
                if rating_counter["count"] >= self._min_ratings
            }
            return sorted(averages.items(), key=lambda x: x[1], reverse=True)[: self._top_n]
And a test written just like the one you did:
def test_movielenspopularityoperator():
    task = MovielensPopularityOperator(
        task_id="test_id",
        start_date="2015-01-01",
        end_date="2015-01-03",
        top_n=5,
    )
    result = task.execute(context={})
    assert len(result) == 5
Running this test fails as follows:
=============================== FAILURES ===============================
___________________ test_movielenspopularityoperator ___________________

    def test_movielenspopularityoperator():
>       task = MovielensPopularityOperator(
            task_id="test_id", start_date="2015-01-01", end_date="2015-01-03", top_n=5
        )
E       TypeError: __init__() missing 1 required positional argument: 'conn_id'

tests/dags/chapter9/custom/test_operators.py:30: TypeError
========================== 1 failed in 0.10s ==========================
The test failed because we’re missing the required argument conn_id, which points to the connection ID in the metastore. But how do you provide this in a test? Tests should be isolated from each other; they should not be able to influence the results of other tests, so a database shared between tests is not an ideal situation. In this case, mocking comes to the rescue.
Mocking is “faking” certain operations or objects. For example, the call to a database that is expected to exist in a production setting but not while testing could be faked, or mocked, by telling Python to return a certain value instead of making the actual call to the (nonexistent during testing) database. This allows you to develop and run tests without requiring a connection to external systems. It requires insight into the internals of whatever it is you’re testing, and thus sometimes requires you to dive into third-party code.
After installing pytest-mock in your environment:
pip install pytest-mock
here is the test rewritten to use mocking:
def test_movielenspopularityoperator(mocker):
    mocker.patch.object(
        MovielensHook,
        "get_connection",
        return_value=Connection(conn_id="test", login="airflow", password="airflow"),
    )
    task = MovielensPopularityOperator(
        task_id="test_id",
        conn_id="test",
        start_date="2015-01-01",
        end_date="2015-01-03",
        top_n=5,
    )
    result = task.execute(context=None)
    assert len(result) == 5
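Applying the same idea to the get_new_file_to_sync task from the question, a sketch could look like this (it combines task.python_callable from the earlier answer with a patched os.listdir; the contents of ./prod are made up for the example):
def test_get_new_file_to_sync(mocker):
    # Fake the contents of ./prod so the test does not touch the real filesystem
    mocker.patch("os.listdir", return_value=["a.out", "b.out", "c.out"])

    dag_bag = DagBag(include_examples=False)
    dag = dag_bag.get_dag("transfer_files")
    task = dag.get_task("get_new_file_to_sync")

    # Call the task's callable directly with a mocked 'existing' list
    result = task.python_callable(existing=["a.out", "b.out"])
    assert result == ["c.out"]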
Hopefully this gives you an idea of how to write tests for your Airflow tasks.
For more about mocking and unit tests, you can check here and here.
Previously, I was using the python_callable parameter of the TriggerDagRunOperator to dynamically alter the dag_run_obj payload that is passed to the newly triggered DAG.
Since its removal in Airflow 2.0.0 (Pull Req: https://github.com/apache/airflow/pull/6317), is there a way to do this without creating a custom TriggerDagRunOperator?
For context, here is the flow of my code:
# Poll Amazon S3 bucket for new received files
fileSensor_tsk = S3KeySensor()

# Use chooseDAGBasedOnInput function to create dag_run object (previously python_callable was used
# directly in TriggerDagRunOperator to create the dag_run object for the newly triggered DAG);
# the dag_run object will pass received file name details to the new DAG for reference in order to complete its own work
chooseDAGTrigger_tsk = BranchPythonOperator(
    task_id='chooseDAGTrigger_tsk',
    python_callable=chooseDAGBasedOnInput,
    provide_context=True
)

triggerNewDAG_tsk = TriggerDagRunOperator(
    task_id='triggerNewDAG_tsk',
    trigger_dag_id='1000_NEW_LOAD'
)

triggerNewDAG2_tsk = TriggerDagRunOperator(
    task_id='triggerNew2DAG_tsk',
    trigger_dag_id='1000_NEW2_LOAD'
) ...
Any help or commentary would be greatly appreciated!
EDIT - adding the python_callable function previously used in the TriggerDagRunOperator:
def intakeFile(context, dag_run_obj):
    # read from S3, get filename and pass to triggered DAG
    bucket_name = os.environ.get('bucket_name')
    s3_hook = S3Hook(aws_conn_id='aws_default')
    s3_hook.copy_object()
    s3_hook.delete_objects()
    ...
    dag_run_obj.payload = {
        'filePath': workingPath,
        'source': source,
        'fileName': fileName
    }
    return dag_run_obj
The TriggerDagRunOperator now takes a conf parameter to which a dictionary can be provided as the conf object for the DagRun. Here is more information on triggering DAGs, which you may find helpful as well.
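For example, a minimal sketch of passing a static conf dictionary (the values here are placeholders); the triggered DAG can then read them from dag_run.conf, e.g. {{ dag_run.conf['fileName'] }} in a templated field:
triggerNewDAG_tsk = TriggerDagRunOperator(
    task_id="triggerNewDAG_tsk",
    trigger_dag_id="1000_NEW_LOAD",
    conf={"filePath": "/some/path", "source": "s3", "fileName": "example.csv"},
)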
EDIT
Since you need to execute a function to determine which DAG to trigger and do not want to create a custom TriggerDagRunOperator, you could execute intakeFile() in a PythonOperator (or use the @task decorator with the TaskFlow API) and use the return value as the conf argument in the TriggerDagRunOperator. As of Airflow 2.0, return values are automatically pushed to XCom in many operators, the PythonOperator included.
Here is the general idea:
def intakeFile(*args, **kwargs):
    # read from S3, get filename and pass to triggered DAG
    bucket_name = os.environ.get("bucket_name")
    s3_hook = S3Hook(aws_conn_id="aws_default")
    s3_hook.copy_object()
    s3_hook.delete_objects()
    ...
    # return a plain dict; the PythonOperator pushes it to XCom automatically
    return {
        "filePath": workingPath,
        "source": source,
        "fileName": fileName,
    }
get_dag_to_trigger = PythonOperator(
    task_id="get_dag_to_trigger",
    python_callable=intakeFile
)

triggerNewDAG_tsk = TriggerDagRunOperator(
    task_id="triggerNewDAG_tsk",
    trigger_dag_id="{{ ti.xcom_pull(task_ids='get_dag_to_trigger', key='return_value') }}",
)

get_dag_to_trigger >> triggerNewDAG_tsk
Celery - bottom line: I want to get the task name by using the task id (I don't have a task object)
Suppose I have this code:
res = chain(add.s(4,5), add.s(10)).delay()
cache.save_task_id(res.task_id)
And then in some other place:
task_id = cache.get_task_ids()[0]
task_name = get_task_name_by_id(task_id) #how?
print(f'Some information about the task status of: {task_name}')
I know I can get the task name if I have a task object, like here: celery: get function name by task id?.
But I don't have a task object (perhaps it can be created by the task_id or by some other way? I didn't see anything related to that in the docs).
In addition, I don't want to save in the cache the task name. (Suppose I have a very long chain/other celery primitives, I don't want to save all their names/task_ids. Just the last task_id should be enough to get all the information regarding all the tasks, using .parents, etc)
I looked at all the relevant methods of the AsyncResult and AsyncResult.backend objects. The only thing that seemed relevant is backend.get_task_meta(task_id), but that doesn't contain the task name.
Thanks in advance
PS: AsyncResult.name always returns None:
result = AsyncResult(task_id, app=celery_app)
result.name #Returns None
result.args #Also returns None
Finally found an answer.
For anyone wondering:
You can solve this by enabling result_extended = True in your celery config.
Then:
result = AsyncResult(task_id, app=celery_app)
result.task_name #tasks.add
You have to enable it first in Celery configurations:
celery_app = Celery()
...
celery_app.conf.update(result_extended=True)
Then, you can access it:
task = AsyncResult(task_id, app=celery_app)
task.name
Something like the following (pseudocode) should be enough:
app = Celery("myapp") # add your parameters here
task_id = "6dc5f968-3554-49c9-9e00-df8aaf9e7eb5"
aresult = app.AsyncResult(task_id)
task_name = aresult.name
task_args = aresult.args
print(task_name, task_args)
Unfortunately, it does not work (I would say it is a bug in Celery), so we have to find an alternative. The first thing that came to my mind was that the Celery CLI has an inspect query_task feature, which hinted that it should be possible to find the task name using the inspect API, and I was right. Here is the code:
# Since the expected way does not work we need to use the inspect API:
insp = app.control.inspect()
task_ids = [task_id]
inspect_result = insp.query_task(*task_ids)
# print(inspect_result)
for node_name in inspect_result:
    val = inspect_result[node_name]
    if val:
        # we found the node that executes the task
        arr = val[task_id]
        state = arr[0]
        meta = arr[1]
        task_name = meta["name"]
        task_args = meta["args"]
        print(task_name, task_args)
The problem with this approach is that it only works while the task is running. The moment it is done, you will not be able to use the code above.
This is not very clear from the docs for celery.result.AsyncResult, but not all of the properties are populated unless you enable result_extended = True, as per the configuration docs:
result_extended
Default: False
Enables extended task result attributes (name, args, kwargs, worker, retries, queue, delivery_info) to be written to backend.
Then the following will work:
result = AsyncResult(task_id)
result.name    # 'project.tasks.my_task'
result.args    # [2, 3]
result.kwargs  # {'a': 'b'}
Also be aware that the rpc:// backend does not store this data; you will need Redis or similar. If you are using rpc, you will still get None returned even with result_extended = True.
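A minimal sketch of such a setup (the broker/backend URLs and module name are placeholders, not from the question; get_task_name_by_id mirrors the helper the question asks for):
from celery import Celery
from celery.result import AsyncResult

# Placeholder URLs; any result backend that stores extended metadata (e.g. Redis)
# works, rpc:// does not
celery_app = Celery(
    "myapp",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)
celery_app.conf.update(result_extended=True)

def get_task_name_by_id(task_id):
    result = AsyncResult(task_id, app=celery_app)
    return result.name  # e.g. 'myapp.tasks.add' once the result has been stored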
I found a good answer in this code snippet.
If and when you have an instance of AsyncResult, you do not need the task_id; you can simply do this:
result # instance of AsyncResult
result_meta = result._get_task_meta()
task_name = result_meta.get("task_name")
Of course this relies on a private method, so it's a bit hacky. I hope celery introduces a simpler way to retrieve this - it's especially useful for testing.
I am using DataFlowJavaOperator() in Airflow (Cloud Composer). Is there any way to get the ID of the executed Dataflow job in the next PythonOperator task? I would like to use the job_id to call a gcloud command to get the job result.
def check_dataflow(ds, **kwargs):
    # here I want to execute a gcloud command with the job ID to get the job result:
    # gcloud dataflow jobs describe <JOB_ID>
    pass

t1 = DataFlowJavaOperator(
    task_id='task1',
    jar='gs://path/to/jar/abc.jar',
    options={
        'stagingLocation': "gs://stgLocation/",
        'tempLocation': "gs://tmpLocation/",
    },
    provide_context=True,
    dag=dag,
)

t2 = PythonOperator(
    task_id='task2',
    python_callable=check_dataflow,
    provide_context=True,
    dag=dag,
)

t1 >> t2
As it appears, the job_name option in the DataFlowJavaOperator gets overridden by the task_id. The job name will have the task ID as a prefix with a random ID appended as a suffix. If you still want a Dataflow job name that is actually different from the task ID, you can hard-code it in the Dataflow Java code:
options.setJobName("jobNameInCode")
Then, using the PythonOperator, you can retrieve the job ID from the prefix (either the job name provided in the code or otherwise the Composer task ID), as I explained here. Briefly, list the jobs with:
result = dataflow.projects().locations().jobs().list(
    projectId=project,
    location=location,
).execute()
and then filter by prefix where job_prefix is the job_name defined when launching the job:
for job in result['jobs']:
    if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
        job_id = job['id']
        break
The break statement is there to ensure we only get the latest job with that name, which should be the one just launched.
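Putting it together, a sketch of the check_dataflow callable could look like the following (the project, region, job prefix and the use of googleapiclient/subprocess are assumptions for illustration, not something DataFlowJavaOperator provides for you):
import re
import subprocess

from googleapiclient.discovery import build

def check_dataflow(ds, **kwargs):
    # Placeholders: your GCP project, Dataflow region and the job name used at launch
    project = 'my-project'
    location = 'us-central1'
    job_prefix = 'jobNameInCode'

    dataflow = build('dataflow', 'v1b3')
    result = dataflow.projects().locations().jobs().list(
        projectId=project,
        location=location,
    ).execute()

    # Filter by prefix to find the job that was just launched
    job_id = None
    for job in result['jobs']:
        if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
            job_id = job['id']
            break

    # gcloud dataflow jobs describe <JOB_ID>, as in the question
    subprocess.run(
        ['gcloud', 'dataflow', 'jobs', 'describe', job_id, '--region', location],
        check=True,
    )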
I have written a DAG with multiple PythonOperators
task1 = af_op.PythonOperator(task_id='Data_Extraction_Environment',
                             provide_context=True,
                             python_callable=Task1, dag=dag1)

def Task1(**kwargs):
    return kwargs['dag_run'].conf.get('file')
From the PythonOperator I am calling the "Task1" method. That method returns a value, and I need to pass that value to the next PythonOperator. How can I get the value from the "task1" variable, or how can I get the value that is returned from the Task1 method?
Updated:
def Task1(**kwargs):
    file_name = kwargs['dag_run'].conf.get('file')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key='file', value=file_name)
    return file_name

t1 = PythonOperator(task_id='Task1', provide_context=True, python_callable=Task1, dag=dag)

t2 = BashOperator(
    task_id='Moving_bucket',
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1", key="file") }} ',
    dag=dag,
)

t2.set_upstream(t1)
You might want to check out Airflow's XCom: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html
If you return a value from a function, this value is stored in XCom. In your case, you could access it like so from other Python code:
task_instance = kwargs['task_instance']
task_instance.xcom_pull(task_ids='Task1')
or in a template like so:
{{ task_instance.xcom_pull(task_ids='Task1') }}
If you want to specify a key, you can push into XCom (from inside a task):
task_instance = kwargs['task_instance']
task_instance.xcom_push(key='the_key', value=my_str)
Then later on you can access it like so:
task_instance.xcom_pull(task_ids='my_task', key='the_key')
EDIT 1
Follow-up question: Instead of using the value in another function, how can I pass the value to another operator, like t2 = BashOperator(task_id='Moving_bucket', bash_command='python /home/raw.py "%s" ' % file_name, dag=dag)? I want to access file_name, which is returned by "Task1". How can this be achieved?
First of all, it seems to me that the value is, in fact, not being passed to another PythonOperator but to a BashOperator.
Secondly, this is already covered in my answer above. The field bash_command is templated (see template_fields in the source: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/bash_operator.py). Hence, we can use the templated version:
BashOperator(
    task_id='Moving_bucket',
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1") }} ',
    dag=dag,
)
EDIT 2
Explanation:
Airflow works like this: it will execute Task1, then populate XCom, and then execute the next task. So for your example to work, Task1 needs to run first, with Moving_bucket downstream of it.
Since you are returning the value from the function, you could also omit key='file' from xcom_pull and skip the manual xcom_push in the function, as shown in the sketch below.
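A minimal sketch of that simplified version, reusing the task names and the /home/raw.py path from the question:
def Task1(**kwargs):
    # The return value is pushed to XCom automatically under the key 'return_value'
    return kwargs['dag_run'].conf.get('file')

t1 = PythonOperator(
    task_id='Task1',
    provide_context=True,
    python_callable=Task1,
    dag=dag,
)

t2 = BashOperator(
    task_id='Moving_bucket',
    # xcom_pull without a key returns Task1's return value
    bash_command='python /home/raw.py {{ task_instance.xcom_pull(task_ids="Task1") }} ',
    dag=dag,
)

t1 >> t2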