Airflow API: testing tasks with parameters - python

I can't figure out how to use pytest to test a DAG task waiting for an xcom_arg.
I created the following DAG using the new Airflow API syntax:
@dag(...)
def transfer_files():
    @task()
    def retrieve_existing_files():
        existing = []
        for elem in os.listdir("./backup"):
            existing.append(elem)
        return existing

    @task()
    def get_new_file_to_sync(existing: list[str]):
        new_files = []
        for elem in os.listdir("./prod"):
            if elem not in existing:
                new_files.append(elem)
        return new_files

    r = retrieve_existing_files()
    get_new_file_to_sync(r)
Now I want to perform unit testing on the get_new_file_to_sync task. I wrote the following test:
def test_get_new_elan_list():
    mocked_existing = ["a.out", "b.out"]
    dag_bag = DagBag(include_examples=False)
    dag = dag_bag.get_dag("transfer_files")
    task = dag.get_task("get_new_file_to_sync")
    result = task.execute({}, mocked_existing)
    print(result)
The test fails because task.execute() takes 2 positional arguments (self and the context) but 3 were given, so there is no slot for my mocked list.
My issue is that I have no clue how to proceed in order to test tasks that expect arguments, using a mocked custom argument.
Thanks for your insights

I managed to find a way to unit test Airflow tasks declared using the new Airflow API.
Here is a test case for the task get_new_file_to_sync contained in the DAG transfer_files declared in the question :
def test_get_new_file_to_sync():
    mocked_existing = ["a.out", "b.out"]
    # Ask Airflow to load the DAGs from its home folder
    dag_bag = DagBag(include_examples=False)
    # Retrieve the DAG to test
    dag = dag_bag.get_dag("transfer_files")
    # Retrieve the task to test
    task = dag.get_task("get_new_file_to_sync")
    # Extract the function to test from the task
    function_to_unit_test = task.python_callable
    # Call the function normally
    results = function_to_unit_test(mocked_existing)
    assert len(results) == 10
This bypasses all the Airflow machinery that would normally run before your task code is called, so you can focus on testing the code you actually wrote for the task.
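Note that the callable still reads from ./prod, so the assertion above depends on whatever is in that directory. As a minimal sketch (the patched values and the expected result are illustrative; adjust them to your layout), you can mock os.listdir to make the test deterministic:

from unittest.mock import patch

def test_get_new_file_to_sync_with_mocked_listdir():
    mocked_existing = ["a.out", "b.out"]
    dag_bag = DagBag(include_examples=False)
    task = dag_bag.get_dag("transfer_files").get_task("get_new_file_to_sync")
    # Fake the contents of ./prod so the result no longer depends on the real filesystem
    with patch("os.listdir", return_value=["a.out", "c.out", "d.out"]):
        results = task.python_callable(mocked_existing)
    assert results == ["c.out", "d.out"]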

For testing such a task, I believe you'll need to use mocking from pytest.
Let's take this user-defined operator as an example:
from collections import Counter, defaultdict

from airflow.models import BaseOperator
# MovielensHook is a custom hook defined elsewhere in the project


class MovielensPopularityOperator(BaseOperator):
    def __init__(self, conn_id, start_date, end_date, min_ratings=4, top_n=5, **kwargs):
        super().__init__(**kwargs)
        self._conn_id = conn_id
        self._start_date = start_date
        self._end_date = end_date
        self._min_ratings = min_ratings
        self._top_n = top_n

    def execute(self, context):
        with MovielensHook(self._conn_id) as hook:
            ratings = hook.get_ratings(start_date=self._start_date, end_date=self._end_date)
            rating_sums = defaultdict(Counter)
            for rating in ratings:
                rating_sums[rating["movieId"]].update(count=1, rating=rating["rating"])
            averages = {
                movie_id: (rating_counter["rating"] / rating_counter["count"], rating_counter["count"])
                for movie_id, rating_counter in rating_sums.items()
                if rating_counter["count"] >= self._min_ratings
            }
            return sorted(averages.items(), key=lambda x: x[1], reverse=True)[: self._top_n]
And a test written just like the one you did:
def test_movielenspopularityoperator():
    task = MovielensPopularityOperator(
        task_id="test_id",
        start_date="2015-01-01",
        end_date="2015-01-03",
        top_n=5,
    )
    result = task.execute(context={})
    assert len(result) == 5
Running this test fails as follows:
=============================== FAILURES ===============================
___________________ test_movielenspopularityoperator ___________________

    def test_movielenspopularityoperator():
>       task = MovielensPopularityOperator(
            task_id="test_id",
            start_date="2015-01-01",
            end_date="2015-01-03",
            top_n=5,
        )
E       TypeError: __init__() missing 1 required positional argument: 'conn_id'

tests/dags/chapter9/custom/test_operators.py:30: TypeError
========================== 1 failed in 0.10s ==========================
The test failed because we’re missing the required argument conn_id, which points to the connection ID in the metastore. But how do you provide this in a test? Tests should be isolated from each other; they should not be able to influence the results of other tests, so a database shared between tests is not an ideal situation. In this case, mocking comes to the rescue.
Mocking is “faking” certain operations or objects. For example, the call to a database that is expected to exist in a production setting but not while testing could be faked, or mocked, by telling Python to return a certain value instead of making the actual call to the (nonexistent during testing) database. This allows you to develop and run tests without requiring a connection to external systems. It requires insight into the internals of whatever it is you’re testing, and thus sometimes requires you to dive into third-party code.
After installing pytest-mock in your environment:
pip install pytest-mock
Here is the test rewritten to use mocking:
def test_movielenspopularityoperator(mocker):
    mocker.patch.object(
        MovielensHook,
        "get_connection",
        return_value=Connection(conn_id="test", login="airflow", password="airflow"),
    )
    task = MovielensPopularityOperator(
        task_id="test_id",
        conn_id="test",
        start_date="2015-01-01",
        end_date="2015-01-03",
        top_n=5,
    )
    result = task.execute(context=None)
    assert len(result) == 5
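If you also want to verify how the mocked connection was used, mocker.patch.object returns the mock, so you can hold on to it and assert on its calls afterwards (a sketch along the lines of the test above; the test name is mine):

def test_movielenspopularityoperator_uses_conn_id(mocker):
    mock_get = mocker.patch.object(
        MovielensHook,
        "get_connection",
        return_value=Connection(conn_id="test", login="airflow", password="airflow"),
    )
    task = MovielensPopularityOperator(
        task_id="test_id",
        conn_id="test",
        start_date="2015-01-01",
        end_date="2015-01-03",
        top_n=5,
    )
    task.execute(context=None)
    # The hook should have looked up exactly one connection, by the ID passed to the operator
    mock_get.assert_called_once_with("test")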
Now, hopefully this gives you an idea of how to write tests for your Airflow tasks.

Related

retrieve pytest results programmatically when run via pytest.main()

I'd like to run pytest and then store the results and present them to users on demand (e.g. store pytest results to a DB and expose them through a web service).
I could run pytest from the command line with an option to save a results report to a file, then find and parse the file, but it feels silly to have the results inside a (pytest) Python app, store them to a file, and then instantly look for that file and parse it back into Python code for further processing. I know I can run pytest programmatically via pytest.main(args); however, it only returns an exit code and no details about the test results. How can I retrieve the results when using pytest.main()?
I'm looking for something like:
args = ...  # arguments
ret_code = pytest.main(args=args)  # pytest.main() as-is only returns a trivial exit code
# How to retrieve some kind of pytest results object containing test execution
# data (list of executed tests, pass/fail info, etc., as pytest displays on the
# console or saves into file reports)?
my_own_method_to_process(pytest.results)
There are a couple of similar questions, but always with some deviation that doesn't work for me. I simply want to run pytest from my code and, whatever format the output takes, directly grab it and process it further.
(Note: I'm in a corporate environment where installing new packages (e.g. pytest plugins) is limited, so I'd like to achieve this without installing any other module/pytest plugin into my environment.)
Write a small plugin that collects and stores reports for each test. Example:
import time

import pytest


class ResultsCollector:
    def __init__(self):
        self.reports = []
        self.collected = 0
        self.exitcode = 0
        self.passed = 0
        self.failed = 0
        self.xfailed = 0
        self.skipped = 0
        self.total_duration = 0

    @pytest.hookimpl(hookwrapper=True)
    def pytest_runtest_makereport(self, item, call):
        outcome = yield
        report = outcome.get_result()
        if report.when == 'call':
            self.reports.append(report)

    def pytest_collection_modifyitems(self, items):
        self.collected = len(items)

    def pytest_terminal_summary(self, terminalreporter, exitstatus):
        self.exitcode = exitstatus.value
        self.passed = len(terminalreporter.stats.get('passed', []))
        self.failed = len(terminalreporter.stats.get('failed', []))
        self.xfailed = len(terminalreporter.stats.get('xfailed', []))
        self.skipped = len(terminalreporter.stats.get('skipped', []))
        self.total_duration = time.time() - terminalreporter._sessionstarttime


def run():
    collector = ResultsCollector()
    pytest.main(plugins=[collector])
    for report in collector.reports:
        print('id:', report.nodeid, 'outcome:', report.outcome)  # etc.
    print('exit code:', collector.exitcode)
    print('passed:', collector.passed, 'failed:', collector.failed,
          'xfailed:', collector.xfailed, 'skipped:', collector.skipped)
    print('total duration:', collector.total_duration)


if __name__ == '__main__':
    run()
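To feed these results into a database or web service (the original goal), you can serialize the collected reports yourself. A rough sketch, where the test path and output file are placeholders for your own storage layer:

import json

def run_and_export(path="results.json"):
    collector = ResultsCollector()
    exit_code = pytest.main(args=["tests/"], plugins=[collector])
    rows = [
        {"nodeid": r.nodeid, "outcome": r.outcome, "duration": r.duration}
        for r in collector.reports
    ]
    # Instead of a file, these rows could just as well be inserted into a DB table
    with open(path, "w") as f:
        json.dump({"exit_code": int(exit_code), "tests": rows}, f, indent=2)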

Celery - how to get task name by task id?

Celery - bottom line: I want to get the task name by using the task id (I don't have a task object)
Suppose I have this code:
res = chain(add.s(4,5), add.s(10)).delay()
cache.save_task_id(res.task_id)
And then in some other place:
task_id = cache.get_task_ids()[0]
task_name = get_task_name_by_id(task_id) #how?
print(f'Some information about the task status of: {task_name}')
I know I can get the task name if I have a task object, like here: celery: get function name by task id?.
But I don't have a task object (perhaps it can be created by the task_id or by some other way? I didn't see anything related to that in the docs).
In addition, I don't want to save the task name in the cache. (Suppose I have a very long chain/other celery primitives; I don't want to save all their names/task_ids. Just the last task_id should be enough to get all the information regarding all the tasks, using .parents, etc.)
I looked at all the relevant methods of the AsyncResult and AsyncResult.backend objects. The only thing that seemed relevant is backend.get_task_meta(task_id), but that doesn't contain the task name.
Thanks in advance
PS: AsyncResult.name always returns None:
result = AsyncResult(task_id, app=celery_app)
result.name  # returns None
result.args  # also returns None
Finally found an answer.
For anyone wondering:
You can solve this by enabling result_extended = True in your celery config.
Then:
result = AsyncResult(task_id, app=celery_app)
result.task_name #tasks.add
You have to enable it first in Celery configurations:
celery_app = Celery()
...
celery_app.conf.update(result_extended=True)
Then, you can access it:
task = AsyncResult(task_id, app=celery_app)
task.name
Something like the following (pseudocode) should be enough:
app = Celery("myapp") # add your parameters here
task_id = "6dc5f968-3554-49c9-9e00-df8aaf9e7eb5"
aresult = app.AsyncResult(task_id)
task_name = aresult.name
task_args = aresult.args
print(task_name, task_args)
Unfortunately, it does not work (I would say it is a bug in Celery), so we have to find an alternative. The first thing that came to mind was that the Celery CLI has an inspect query_task feature, which hinted that it should be possible to find the task name through the inspect API, and I was right. Here is the code:
# Since the expected way does not work, we need to use the inspect API:
insp = app.control.inspect()
task_ids = [task_id]
inspect_result = insp.query_task(*task_ids)
# print(inspect_result)
for node_name in inspect_result:
    val = inspect_result[node_name]
    if val:
        # We found the node that is executing the task
        arr = val[task_id]
        state = arr[0]
        meta = arr[1]
        task_name = meta["name"]
        task_args = meta["args"]
        print(task_name, task_args)
The problem with this approach is that it works only while the task is running; the moment it is done, you will no longer be able to use the code above.
This is not very clear from the docs for celery.result.AsyncResult, but not all the properties are populated unless you enable result_extended = True, as per the configuration docs:
result_extended
Default: False
Enables extended task result attributes (name, args, kwargs, worker, retries, queue, delivery_info) to be written to backend.
Then the following will work:
result = AsyncResult(task_id)
result.name    # 'project.tasks.my_task'
result.args    # [2, 3]
result.kwargs  # {'a': 'b'}
Also be aware that the rpc:// backend does not store this data; you will need Redis or similar. If you are using rpc, even with result_extended = True you will still get None returned.
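A minimal configuration sketch that covers both requirements (the Redis URLs and app/task names are placeholders for your setup):

from celery import Celery

app = Celery(
    "myapp",
    broker="redis://localhost:6379/0",
    # rpc:// would not store the extended attributes; use Redis or a similar backend
    backend="redis://localhost:6379/1",
)
app.conf.update(result_extended=True)

@app.task
def add(x, y):
    return x + y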
I found a good answer in this code snippet.
If and when you have an instance of AsyncResult, you do not need the task_id; rather, you can simply do this:
result # instance of AsyncResult
result_meta = result._get_task_meta()
task_name = result_meta.get("task_name")
Of course this relies on a private method, so it's a bit hacky. I hope celery introduces a simpler way to retrieve this - it's especially useful for testing.

How should I test a method of a mocked object

I have a question about how to mock a nested method and test what it was called with. I'm having a hard time getting my head around: https://docs.python.org/3/library/unittest.mock-examples.html#mocking-chained-calls.
I'd like to test that the "put" method from the fabric library is called by the deploy_file method in this class, and maybe what values are given to it. This is the module that gathers some information from AWS and provides a method to take action on the data.
import json
import os

from aws.secrets_manager import get_secret
from fabric import Connection


class Deploy:
    def __init__(self):
        self.secrets = None
        self.set_secrets()

    def set_secrets(self):
        secrets = get_secret()
        self.secrets = json.loads(secrets)

    def deploy_file(self, source_file):
        with Connection(host=os.environ.get('SSH_USERNAME'), user=os.environ.get("SSH_USERNAME")) as conn:
            destination_path = self.secrets["app_path"] + '/' + os.path.basename(source_file)
            conn.put(source_file, destination_path)
"get_secret" is a method in another module that uses the boto3 library to get the info from AWS.
These are the tests I'm working on:
from unittest.mock import patch

from fabric import Connection

from jobs.deploy import Deploy


def test_set_secrets_dict_from_expected_json_string():
    with patch('jobs.deploy.get_secret') as m_get_secret:
        m_get_secret.return_value = '{"app_path": "/var/www/html"}'
        deployment = Deploy()
        assert deployment.secrets['app_path'] == "/var/www/html"


def test_copy_app_file_calls_fabric_put():
    with patch('jobs.deploy.get_secret') as m_get_secret:
        m_get_secret.return_value = '{"app_path": "/var/www/html"}'
        deployment = Deploy()
        with patch('jobs.deploy.Connection', spec=Connection) as m_conn:
            local_file_path = "/tmp/foo"
            deployment.deploy_file(local_file_path)
            m_conn.put.assert_called_once()
The second test fails with "AssertionError: Expected 'put' to have been called once. Called 0 times."
The first test mocks the "get_secret" function just fine, to check that the constructor for "Deploy" sets "Deploy.secrets" from the fake AWS data.
In the second test, get_secret is mocked just as before, and I mock "Connection" from the fabric library. If I don't mock Connection, I get an error related to the "host" parameter when the Connection object is created.
I think that when "conn.put" is called, it's creating a whole new Mock object, and I'm not testing that object when the unittest runs. I'm just not sure how to define the test to actually test the call to put.
I'm also a novice at understanding what to test (and how) and what not to test as well as how to use mock and such. I'm fully bought in on the idea though. It's been very helpful to find bugs and regressions as I work on projects.
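For what it's worth, the intuition above is right: with patch('jobs.deploy.Connection'), the object bound by "with Connection(...) as conn" is not the patched class itself but the return value of __enter__ on the instance it produces. A sketch of the corrected assertion (standard unittest.mock behavior, not specific to fabric):

def test_copy_app_file_calls_fabric_put():
    with patch('jobs.deploy.get_secret') as m_get_secret:
        m_get_secret.return_value = '{"app_path": "/var/www/html"}'
        deployment = Deploy()
        with patch('jobs.deploy.Connection') as m_conn:
            deployment.deploy_file("/tmp/foo")
            # Connection(...) returns m_conn.return_value; the with statement then
            # binds conn to m_conn.return_value.__enter__.return_value
            conn_instance = m_conn.return_value.__enter__.return_value
            conn_instance.put.assert_called_once_with("/tmp/foo", "/var/www/html/foo")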

How to retrieve Test Results in Azure DevOps with Python REST API?

How to retrieve Test Results from VSTS (Azure DevOps) by using Python REST API?
Documentation is (as of today) very light, and even the dedicated repo of API examples is light (https://github.com/Microsoft/azure-devops-python-samples).
For some reason, Test Results are not considered WorkItems, so a regular WIQL query will not work.
Additionally, it would be great to query the results for a given Area Path.
Thanks
First you need to get the proper connection client with the client string that matches the test results.
from vsts.vss_connection import VssConnection
from msrest.authentication import BasicAuthentication
token = "hcykwckuhe6vbnigsjs7r3ai2jefsdlkfjslkfj5mxizbtfu6k53j4ia"
team_instance = "https://tfstest.toto.com:8443/tfs/Development/"
credentials = BasicAuthentication("", token)
connection = VssConnection(base_url=team_instance, creds=credentials)
TEST_CLIENT = "vsts.test.v4_1.test_client.TestClient"
test_client = connection.get_client(TEST_CLIENT)
Then, you can have a look at all the functions available in vsts/test/<api_version>/test_client.py.
The following functions look interesting:
def get_test_results(self, project, run_id, details_to_include=None, skip=None, top=None, outcomes=None) (get test results for a run based on filters)
def get_test_runs(self, project, build_uri=None, owner=None, tmi_run_id=None, plan_id=None, include_run_details=None, automated=None, skip=None, top=None)
def query_test_runs(self, project, min_last_updated_date, max_last_updated_date, state=None, plan_ids=None, is_automated=None, publish_context=None, build_ids=None, build_def_ids=None, branch_name=None, release_ids=None, release_def_ids=None, release_env_ids=None, release_env_def_ids=None, run_title=None, top=None, continuation_token=None) (although this function has a limitation of a 7-day range between min_last_updated_date and max_last_updated_date)
To retrieve all the results from the Test Plans in a given Area Path, I have used the following code:
tp_query = Wiql(query="""
    SELECT
        [System.Id]
    FROM workitems
    WHERE
        [System.WorkItemType] = 'Test Plan'
        AND [Area Path] UNDER 'Development\MySoftware'
    ORDER BY [System.ChangedDate] DESC""")

for plan in wit_client.query_by_wiql(tp_query).work_items:
    print(f"Results for {plan.id}")
    for run in test_client.get_test_runs(my_project, plan_id=plan.id):
        for res in test_client.get_test_results(my_project, run.id):
            tc = res.test_case
            print(f"#{run.id}. {tc.name} ({tc.id}) => {res.outcome} by {res.run_by.display_name} in {res.duration_in_ms}")
Note that a test result includes the following attributes:
duration_in_ms
build
outcome (string)
associated_bugs
run_by (Identity)
test_case (TestCase)
test_case_title (string)
area (AreaPath)
test_run, corresponding to the test run
test_suite
test_plan
completed_date (Python datetime object)
started_date (Python datetime object)
configuration
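As a rough sketch of persisting these fields (attribute availability can vary by API version, so treat the column list as an assumption), you could flatten each run's results into CSV rows:

import csv

def export_run_results(test_client, project, run_id, path="results.csv"):
    # Writes a few of the attributes listed above, one row per test result
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["test_case_title", "outcome", "duration_in_ms", "completed_date"])
        for res in test_client.get_test_results(project, run_id):
            writer.writerow([res.test_case_title, res.outcome,
                             res.duration_in_ms, res.completed_date])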
Hope it can help others save the hours I spent exploring this API.
Cheers

Django data leak between 2 separated tests

In my whole test suite, I experience weird behaviour with two tests. They are completely separate; however, I find data from the first test in the second one. Here are the tests:
file1 (services.tests)
class ServiceTestCase(TestCase):
    @patch('categories.models.ArticlesByCategory.objects.has_dish_type')
    def test_build_dishtype_conflicts(self, mock_has_dish_type):
        # WARN: creates interference in tests
        restaurant = RestaurantFactory()
        dt_1 = DishTypeFactory(restaurant=restaurant)
        cat_1 = CategoryFactory(restaurant=restaurant)
        art_1 = ArticleFactory(name='fooA1', restaurant=restaurant)
        art_2 = ArticleFactory(name='fooA2', restaurant=restaurant)
        abc_1 = ArticlesByCategory.objects.create(category=cat_1, article=art_1, is_permanent=True,
                                                  dish_type=dt_1)
        abc_2 = ArticlesByCategory.objects.create(category=cat_1, article=art_2, is_permanent=True,
                                                  dish_type=dt_1)
        mock_has_dish_type.return_value = [abc_1, abc_2]
        abcs_to_check = ArticlesByCategory.objects.filter(pk__in=[abc_1.pk, abc_2.pk])
        conflicts = ServiceFactory()._build_dishtype_conflicts(abcs_to_check)
        self.assertDictEqual(conflicts, {dt_1.pk: 2})
file2 (products.tests)
class ArticleQuerySetTestCase(TestCase):
    def test_queryset_usable_for_category(self):
        restaurant = RestaurantFactory()
        category_1 = CategoryFactory(name='fooB1', restaurant=restaurant)
        category_2 = CategoryFactory(name='fooB2', restaurant=restaurant)
        article_1 = ArticleFactory(restaurant=restaurant)
        article_2 = ArticleFactory(restaurant=restaurant)
        ArticlesByCategory.objects.create(article=article_1, category=category_1, is_permanent=True)
        queryset_1 = Article.objects.usable_for_category(category_1)
        # This line is used for debugging
        for art in Article.objects.all():
            print(art.name)
When running test_build_dishtype_conflicts and then test_queryset_usable_for_category in the same command, here is the output of the print in the second test:
fooA1
fooA2
fooB1
fooB2
I suspect I did something wrong but can't find what.
OK, found the problem in the Django documentation:
If your tests rely on database access such as creating or querying models, be sure to create your test classes as subclasses of django.test.TestCase rather than unittest.TestCase.
Using unittest.TestCase avoids the cost of running each test in a transaction and flushing the database, but if your tests interact with the database their behavior will vary based on the order that the test runner executes them. This can lead to unit tests that pass when run in isolation but fail when run in a suite.
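In other words, if either test class imports TestCase from unittest instead of django.test, its database writes are never rolled back and leak into later tests. A minimal sketch of the fix:

# Django's TestCase wraps each test in a transaction that is rolled back afterwards
from django.test import TestCase  # not: from unittest import TestCase

class ServiceTestCase(TestCase):
    ...  # tests that touch the database can now run in any order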
