How to unittest pyspark `withColumn` action - Python 3?

I'm learning pyspark and I have the following functions:
    import re
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    def function_1(string):
        new_string = re.sub(r"!", " ", string)
        return new_string
    udf_function_1 = udf(lambda s: function_1(s), StringType())
    def function_2(data):
        new_data = data \
            .withColumn("column_1", udf_function_1("column_1"))
        return new_data
My question is how to write a unit test for function_2() in Python.

What exactly do you want to test in function_2?
Below is a simple test saved in a file called sample_test.py. I used pytest, but you can write very similar code with unittest.
    # sample_test.py
    from pyspark import sql
    spark = sql.SparkSession.builder \
        .appName("local-spark-session") \
        .getOrCreate()
    def test_create_session():
        assert isinstance(spark, sql.SparkSession)
        assert spark.sparkContext.appName == 'local-spark-session'
    def test_spark_version():
        assert spark.version == '3.1.2'
Running the test:
C:\Users\user\Desktop>pytest -v sample_test.py
============================================= test session starts =============================================
platform win32 -- Python 3.6.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- c:\users\user\appdata\local\programs\python\python36\python.exe
cachedir: .pytest_cache
rootdir: C:\Users\user\Desktop
collected 2 items
sample_test.py::test_create_session PASSED [ 50%]
sample_test.py::test_spark_version PASSED [100%]
============================================== 2 passed in 4.81s ==============================================
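The test above only checks the session, not function_2() itself. Below is a minimal sketch of how function_2() could be tested with unittest, assuming a local Spark installation is available. The class names, the app name and the extra "id" column are made up for illustration, and the UDF is created inside function_2() (unlike in the question) so that the file can still be imported when pyspark is missing.

```python
import re
import unittest

try:  # pyspark is treated as optional so the pure-Python test can always run
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
except ImportError:
    SparkSession = None


def function_1(string):
    # Same logic as in the question: replace every "!" with a space.
    return re.sub(r"!", " ", string)


def function_2(data):
    # The UDF is built here (an assumption of this sketch) instead of at
    # module level, so importing this file does not require pyspark.
    udf_function_1 = udf(function_1, StringType())
    return data.withColumn("column_1", udf_function_1("column_1"))


class Function1Test(unittest.TestCase):
    def test_replaces_exclamation_marks(self):
        self.assertEqual(function_1("a!b!"), "a b ")


@unittest.skipIf(SparkSession is None, "pyspark is not installed")
class Function2Test(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # One local session shared by all tests in this class.
        cls.spark = (
            SparkSession.builder.master("local[1]")
            .appName("function-2-test")
            .getOrCreate()
        )

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_column_1_is_cleaned(self):
        source = self.spark.createDataFrame(
            [("hello!world", 1), ("no change", 2)], ["column_1", "id"]
        )
        result = function_2(source)
        # Collect to plain Python values and compare with the expected rows.
        rows = [(r["column_1"], r["id"]) for r in result.orderBy("id").collect()]
        self.assertEqual(rows, [("hello world", 1), ("no change", 2)])
        # withColumn on an existing column should not change the schema.
        self.assertEqual(result.columns, source.columns)
```

Saved as, say, test_function_2.py (a hypothetical name), this can be run with python -m unittest test_function_2. Testing function_1() separately keeps most of the logic checks fast, while the Spark test only confirms that the column is wired up correctly.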

Related

Python click sample code test case does not give 100% coverage [duplicate]

I wrote the following code.
https://gitlab.com/ksaito11/click-test
$ cat commands/cmd.py
    import click
    from commands.hello import hello
    def print_version(ctx, param, value):
        if not value or ctx.resilient_parsing:
            return
        click.echo('Version 1.0')
        ctx.exit()
    @click.group()
    @click.option('--opt1')
    @click.option('--version', is_flag=True, callback=print_version,
                  expose_value=False, is_eager=True)
    @click.pass_context
    def cmd(ctx, **kwargs):
        ctx.obj = kwargs
    def main():
        cmd.add_command(hello)
        cmd(auto_envvar_prefix='HELLOCLI')
    if __name__ == '__main__':
        main()
$ cat commands/hello.py
    import click
    @click.command()
    def hello():
        click.echo('Hello World!')
The code works correctly.
$ export PYTHONPATH=.
$ python commands/cmd.py
Usage: cmd.py [OPTIONS] COMMAND [ARGS]...
Options:
--opt1 TEXT
--version
--help Show this message and exit.
Commands:
hello
$ python commands/cmd.py --version
Version 1.0
$ python commands/cmd.py hello
Hello World!
I wrote the following test case.
$ cat tests/test_cmd.py
    from click.testing import CliRunner
    import click
    import pytest
    from commands.cmd import cmd, main
    from commands.hello import hello
    def test_version():
        runner = CliRunner()
        result = runner.invoke(cmd, ["--version"])
        assert result.exit_code == 0
    def test_help():
        runner = CliRunner()
        result = runner.invoke(cmd)
        assert result.exit_code == 0
    def test_hello():
        runner = CliRunner()
        result = runner.invoke(hello)
        assert result.exit_code == 0
I measured the coverage with the following command.
$ pytest --cov-branch --cov=commands
================================================================ test session starts ================================================================
platform linux -- Python 3.9.9, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/ksaito/ghq/gitlab.com/ksaito11/click-test
plugins: cov-3.0.0
collected 3 items
tests/test_cmd.py ... [100%]
----------- coverage: platform linux, python 3.9.9-final-0 -----------
Name                   Stmts   Miss Branch BrPart  Cover
--------------------------------------------------------
commands/__init__.py       0      0      0      0   100%
commands/cmd.py           18      5      4      2    68%
commands/hello.py          4      0      0      0   100%
--------------------------------------------------------
TOTAL                     22      5      4      2    73%
================================================================= 3 passed in 0.15s =================================================================
I didn't know how to write tests for the part below, so I couldn't reach 100% coverage.
    def cmd(ctx, **kwargs):
        ctx.obj = kwargs
    def main():
        cmd.add_command(hello)
        cmd(auto_envvar_prefix='HELLOCLI')
The code below may not be needed when using @click.group, but I couldn't tell.
    def print_version(ctx, param, value):
        if not value or ctx.resilient_parsing:
            return
Please give me some advice.
By adding the following settings, code that does not need to be included in coverage is excluded.
$ cat .coveragerc
    [run]
    branch = True
    [report]
    exclude_lines =
        # Don't complain if non-runnable code isn't run:
        if 0:
        if __name__ == .__main__.:
        def main
        ctx.obj = kwargs
I deleted the code below because I thought it was unnecessary.
    if not value or ctx.resilient_parsing:
        return
The coverage is now 100%.
$ pytest --cov-branch --cov=commands
================================================================ test session starts ================================================================
platform linux -- Python 3.9.9, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/ksaito/ghq/gitlab.com/ksaito11/click-test
plugins: cov-3.0.0
collected 3 items
tests/test_cmd.py ... [100%]
----------- coverage: platform linux, python 3.9.9-final-0 -----------
Name                   Stmts   Miss Branch BrPart  Cover
--------------------------------------------------------
commands/__init__.py       0      0      0      0   100%
commands/cmd.py           10      0      0      0   100%
commands/hello.py          4      0      0      0   100%
--------------------------------------------------------
TOTAL                     14      0      0      0   100%
================================================================= 3 passed in 0.22s =================================================================
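As far as I understand the coverage.py documentation, each exclude_lines entry is a regular expression, and any source line containing a match is excluded (together with its whole block when the line starts one, such as a def). The sketch below only approximates that matching with re.search to show which lines the patterns from the .coveragerc above would hit. It is not how coverage.py is implemented, and the helper name is_excluded is made up.

```python
import re

# Patterns copied from the exclude_lines setting in the .coveragerc above.
EXCLUDE_PATTERNS = [
    r"if 0:",
    r"if __name__ == .__main__.:",  # "." matches either quote character
    r"def main",
    r"ctx.obj = kwargs",
]

# A few lines taken from commands/cmd.py.
SOURCE_LINES = [
    "def cmd(ctx, **kwargs):",
    "    ctx.obj = kwargs",
    "def main():",
    "    cmd.add_command(hello)",
    "if __name__ == '__main__':",
    "    main()",
]


def is_excluded(line):
    """Return True if any pattern matches somewhere in the line."""
    return any(re.search(pattern, line) for pattern in EXCLUDE_PATTERNS)


excluded = [line for line in SOURCE_LINES if is_excluded(line)]
for line in excluded:
    print("excluded:", line)
```

Because the patterns are searched rather than matched against the whole line, a short pattern like def main would also hit a function called main_menu, so it is worth keeping the patterns as specific as possible.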

Make pytest output like googletest?

I'm using PyTest for python code testing. Since I use googletest for my C++ code testing, I like the output format of googletest.
I'm wondering, is it possible to make pytest output like googletest? The pytest output line is too long, while googletest is short:
// pytest example:
(base) zz@home% pytest test_rle_v2.py
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.8.1, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/zz/work/test/learn-hp/.hypothesis/examples')
rootdir: /home/zz/work/test/learn-hp
plugins: env-0.6.2, hypothesis-4.38.0
collected 1 item
test_rle_v2.py . [100%]
=================================================================================== 1 passed in 0.46s ====================================================================================
// googletest example
(base) zz@home% ./test_version
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from VERSION
[ RUN      ] VERSION.str
[       OK ] VERSION.str (0 ms)
[ RUN      ] VERSION.parts
[       OK ] VERSION.parts (0 ms)
[ RUN      ] VERSION.metadata
[       OK ] VERSION.metadata (1 ms)
[ RUN      ] VERSION.atLeast
[       OK ] VERSION.atLeast (0 ms)
[ RUN      ] VERSION.hasFeature
[       OK ] VERSION.hasFeature (0 ms)
[----------] 5 tests from VERSION (1 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (1 ms total)
[  PASSED  ] 5 tests.
After several hours of searching and experimenting, I found that a conftest.py file is what I need. In conftest.py you can override pytest's default behaviour by providing hooks.
The following is a work-in-progress example:
    # conftest.py
    import os
    import random
    def pytest_runtest_call(item):
        item.add_report_section("call", "custom", " [ Run ] " + str(item))
    def pytest_report_teststatus(report, config):
        #print(">>> outcome:", report.outcome)
        if report.when == 'call':
            # line = f' [ Run ] {report.nodeid}'
            # report.sections.append(('ChrisZZ', line))
            if (report.outcome == 'failed'):
                line = f' [ FAILED ] {report.nodeid}'
                report.sections.append(('failed due to', line))
        if report.when == 'teardown':
            if (report.outcome == 'passed'):
                line = f' [ OK ] {report.nodeid}'
                report.sections.append(('ChrisZZ', line))
    def pytest_terminal_summary(terminalreporter, exitstatus, config):
        reports = terminalreporter.getreports('')
        content = os.linesep.join(text for report in reports for secname, text in report.sections)
        if content:
            terminalreporter.ensure_newline()
            #terminalreporter.section('', sep=' ', green=True, bold=True)
            #terminalreporter.section('My custom section2', sep='------]', green=True, bold=True, fullwidth=None)
            terminalreporter.line(content)
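A possibly simpler variant is sketched below. It prints one googletest-like line per test from the pytest_runtest_logstart and pytest_runtest_logreport hooks instead of collecting report sections. The helper gtest_line and the TAGS table are made up for this sketch, and I am assuming that plain print() output from these hooks reaches the terminal, which may depend on the capture settings in use.

```python
# conftest.py (sketch)

# googletest-style status tags, keyed by pytest's outcome strings.
TAGS = {
    "passed": "[       OK ]",
    "failed": "[  FAILED  ]",
    "skipped": "[  SKIPPED ]",
}


def gtest_line(nodeid, outcome, duration):
    """Format one result line; duration is given in seconds."""
    tag = TAGS.get(outcome, f"[ {outcome.upper():^8} ]")
    return f"{tag} {nodeid} ({duration * 1000:.0f} ms)"


def pytest_runtest_logstart(nodeid, location):
    # Called once before the setup/call/teardown phases of a test.
    print(f"[ RUN      ] {nodeid}")


def pytest_runtest_logreport(report):
    # pytest produces one report per phase. Only the call phase is printed,
    # plus the setup phase when it did not pass (for example a skip or a
    # failing fixture), since no call phase follows in that case.
    setup_problem = report.when == "setup" and report.outcome != "passed"
    if report.when == "call" or setup_problem:
        print(gtest_line(report.nodeid, report.outcome, report.duration))
```

These lines are printed in addition to pytest's own progress output, so the two may interleave; running with -q should reduce the noise, but the header and summary lines are still pytest's.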

Would like to see list of deselected tests and their node ids in pytest output

Is there an option to list the deselected tests in the CLI output, along with the mark that triggered their deselection?
I know that in suites with many tests this would not be good as a default, but it would be a useful option in something like API testing, where the number of tests is likely to be more limited.
The numeric summary
collected 21 items / 16 deselected / 5 selected
is helpful but not enough when trying to organize marks and see what happened in a CI build.
pytest has a hookspec pytest_deselected for accessing the deselected tests. Example: add this code to conftest.py in your test root dir:
    def pytest_deselected(items):
        if not items:
            return
        config = items[0].session.config
        reporter = config.pluginmanager.getplugin("terminalreporter")
        reporter.ensure_newline()
        for item in items:
            reporter.line(f"deselected: {item.nodeid}", yellow=True, bold=True)
Running the tests now will give you an output similar to this:
$ pytest -vv
...
plugins: cov-2.8.1, asyncio-0.10.0
collecting ...
deselected: test_spam.py::test_spam
deselected: test_spam.py::test_bacon
deselected: test_spam.py::test_ham
collected 4 items / 3 deselected / 1 selected
...
If you want a report in another format, simply store the deselected items in the config and use them for the desired output somewhere else, e.g. pytest_terminal_summary:
    # conftest.py
    import os
    def pytest_deselected(items):
        if not items:
            return
        config = items[0].session.config
        config.deselected = items
    def pytest_terminal_summary(terminalreporter, exitstatus, config):
        deselected = getattr(config, "deselected", [])
        if deselected:
            terminalreporter.ensure_newline()
            terminalreporter.section('Deselected tests', sep='-', yellow=True, bold=True)
            content = os.linesep.join(item.nodeid for item in deselected)
            terminalreporter.line(content)
This gives the output:
$ pytest -vv
...
plugins: cov-2.8.1, asyncio-0.10.0
collected 4 items / 3 deselected / 1 selected
...
---------------------------------------- Deselected tests -----------------------------------------
test_spam.py::test_spam
test_spam.py::test_bacon
test_spam.py::test_ham
================================= 1 passed, 3 deselected in 0.01s =================================
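The question also asks for the mark that triggered the deselection. As far as I know, pytest does not pass that information to the hook, so the sketch below lists all marks applied to each deselected item instead, using item.iter_markers(). The helper name describe_deselected is made up for this example.

```python
# conftest.py (sketch)


def describe_deselected(item):
    """Return the node id followed by the names of the marks on the item."""
    marks = sorted({mark.name for mark in item.iter_markers()})
    suffix = f" [marks: {', '.join(marks)}]" if marks else ""
    return f"{item.nodeid}{suffix}"


def pytest_deselected(items):
    if not items:
        return
    config = items[0].session.config
    reporter = config.pluginmanager.getplugin("terminalreporter")
    reporter.ensure_newline()
    for item in items:
        # Show every mark on the item; which one caused the deselection
        # still has to be inferred from the -m expression that was used.
        reporter.line(f"deselected: {describe_deselected(item)}", yellow=True)
```

Tests deselected with -k rather than -m may carry no marks at all, in which case only the node id is printed.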

How to add custom sections to terminal report in pytest

In pytest, when a test case fails, the report contains the following sections:
Failure details
Captured stdout call
Captured stderr call
Captured log call
I would like to add some custom sections (I have a server that runs in parallel and would like to display the information logged by this server in a dedicated section).
How could I do that, if it is possible at all?
Thanks
NOTE:
I have found the following in the source code, but I don't know whether it is the right approach:
nodes.py
    class Item(Node):
        ...
        def add_report_section(self, when, key, content):
            """
            Adds a new report section, similar to what's done internally
            to add stdout and stderr captured output::
            ...
            """
reports.py
    class BaseReport:
        ...
        @property
        def caplog(self):
            """Return captured log lines, if log capturing is enabled
            .. versionadded:: 3.5
            """
            return "\n".join(
                content for (prefix, content) in self.get_sections("Captured log")
            )
To add custom sections to the terminal output, you need to append to the report.sections list. This can be done directly in a pytest_report_teststatus hookimpl, or indirectly in other hooks (via a hookwrapper); the actual implementation depends heavily on your particular use case. Example:
    # conftest.py
    import os
    import random
    import pytest
    def pytest_report_teststatus(report, config):
        messages = (
            'Egg and bacon',
            'Egg, sausage and bacon',
            'Egg and Spam',
            'Egg, bacon and Spam'
        )
        if report.when == 'teardown':
            line = f'{report.nodeid} says:\t"{random.choice(messages)}"'
            report.sections.append(('My custom section', line))
    def pytest_terminal_summary(terminalreporter, exitstatus, config):
        reports = terminalreporter.getreports('')
        content = os.linesep.join(text for report in reports for secname, text in report.sections)
        if content:
            terminalreporter.ensure_newline()
            terminalreporter.section('My custom section', sep='-', blue=True, bold=True)
            terminalreporter.line(content)
Example tests:
    def test_spam():
        assert True
    def test_eggs():
        assert True
    def test_bacon():
        assert False
When running the tests, you should see the My custom section header at the bottom, colored blue and containing a message for every test:
collected 3 items
test_spam.py::test_spam PASSED
test_spam.py::test_eggs PASSED
test_spam.py::test_bacon FAILED
============================================= FAILURES =============================================
____________________________________________ test_bacon ____________________________________________
def test_bacon():
> assert False
E assert False
test_spam.py:9: AssertionError
---------------------------------------- My custom section -----------------------------------------
test_spam.py::test_spam says: "Egg, bacon and Spam"
test_spam.py::test_eggs says: "Egg and Spam"
test_spam.py::test_bacon says: "Egg, sausage and bacon"
================================ 1 failed, 2 passed in 0.07 seconds ================================
The other answer shows how to add a custom section to the terminal report summary, but it's not the best way for adding a custom section per test.
For this goal, you can (and should) use the higher-level API add_report_section of an Item node. A minimal example is shown below; modify it to suit your needs. You can pass state from the test instance through an item node, if necessary.
In test_something.py, here is one passing test and two failing ones:
    def test_good():
        assert 2 + 2 == 4
    def test_bad():
        assert 2 + 2 == 5
    def test_ugly():
        errorerror
In conftest.py, set up a hook wrapper:
    import pytest
    content = iter(["first", "second", "third"])
    @pytest.hookimpl(hookwrapper=True)
    def pytest_runtest_call(item):
        outcome = yield
        item.add_report_section("call", "custom", next(content))
The report will now display custom sections per-test:
$ pytest
============================== test session starts ===============================
platform linux -- Python 3.9.0, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /tmp/example
collected 3 items
test_something.py .FF [100%]
==================================== FAILURES ====================================
____________________________________ test_bad ____________________________________
def test_bad():
> assert 2 + 2 == 5
E assert (2 + 2) == 5
test_something.py:5: AssertionError
------------------------------ Captured custom call ------------------------------
second
___________________________________ test_ugly ____________________________________
def test_ugly():
> errorerror
E NameError: name 'errorerror' is not defined
test_something.py:8: NameError
------------------------------ Captured custom call ------------------------------
third
============================ short test summary info =============================
FAILED test_something.py::test_bad - assert (2 + 2) == 5
FAILED test_something.py::test_ugly - NameError: name 'errorerror' is not defined
========================== 2 failed, 1 passed in 0.02s ===========================
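For the server-log use case from the question, the same add_report_section call could be fed with whatever the server wrote while the test was running. The sketch below assumes the server appends to a file; the path server.log, the helper read_appended and the _offsets dictionary are all hypothetical names chosen for this example.

```python
# conftest.py (sketch)
from pathlib import Path

SERVER_LOG = Path("server.log")  # hypothetical location of the server's log
_offsets = {}  # node id -> size of the log file when the test started


def read_appended(path, offset):
    """Return the text written to `path` after byte position `offset`."""
    if not path.exists():
        return ""
    with path.open("rb") as fh:
        fh.seek(offset)
        return fh.read().decode("utf-8", errors="replace")


def pytest_runtest_setup(item):
    # Remember how large the log was before the test started.
    _offsets[item.nodeid] = SERVER_LOG.stat().st_size if SERVER_LOG.exists() else 0


def pytest_runtest_teardown(item):
    # Attach only what the server logged during this test.
    text = read_appended(SERVER_LOG, _offsets.pop(item.nodeid, 0))
    if text:
        item.add_report_section("teardown", "server log", text)
```

Judging by the "Captured custom call" header in the output above, the section for a failing test should be titled "Captured server log teardown". With tests running in parallel against the same server, the byte offsets would no longer separate the log lines per test, so this only fits sequential runs.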

Parametrize the test based on the list test-data from a json file

Is there a way to parametrize a test when its test data is a list of multiple values?
example_test_data.json
    {
        "test_one" : [1,2,3],   # this is the case where the `test_one` test needs to be parametrized
        "test_two" : "split",
        "test_three" : {"three":3},
        "test_four" : {"four":4},
        "test_set_comparison" : "1234"
    }
Directory structure:
    main/
        conftest.py        # conftest file for my fixtures
        testcases/
            project_1/     (contains these files -- test_suite_1.py, config.json)
            project_2/     (contains these files -- test_suite_2.py, config.json)
        workflows/
        libs/
Using the code below in conftest.py at the top directory level, I am able to load the test data from the JSON file and map it to a particular test case.
    @pytest.yield_fixture(scope="class", autouse=True)
    def test_config(request):
        f = pathlib.Path(request.node.fspath.strpath)
        print("File : %s" % f)
        config = f.with_name("config.json")
        print("Config json file : %s" % config)
        with config.open() as fd:
            testdata = json.loads(fd.read())
        print("test data : %s" % (testdata,))
        yield testdata
    @pytest.yield_fixture(scope="function", autouse=True)
    def config_data(request, test_config):
        testdata = test_config
        test = request.function.__name__
        print("Class Name : %s" % request.cls.__name__)
        print("Testcase Name : %s" % test)
        if test in testdata:
            test_args = testdata[test]
            yield test_args
        else:
            yield {}
In my case:
    @pytest.yield_fixture(scope="function", autouse=True)
    def config_data(request, test_config):
        testdata = test_config
        test = request.function.__name__
        print("Class Name : %s" % request.cls.__name__)
        print("Testcase Name : %s" % test)
        if test in testdata:
            test_args = testdata[test]
            if isinstance(test_args, list):
                # How to parametrize the test
                # yield test_args
                pass  # placeholder: this is the part I don't know how to write
        else:
            yield {}
I would handle the special parametrization case in the pytest_generate_tests hook:
    # conftest.py
    import json
    import pathlib
    import pytest
    @pytest.fixture(scope="class")
    def test_config(request):
        f = pathlib.Path(request.node.fspath.strpath)
        config = f.with_name("config.json")
        with config.open() as fd:
            testdata = json.loads(fd.read())
        yield testdata
    @pytest.fixture(scope="function")
    def config_data(request, test_config):
        testdata = test_config
        test = request.function.__name__
        if test in testdata:
            test_args = testdata[test]
            yield test_args
        else:
            yield {}
    def pytest_generate_tests(metafunc):
        if 'config_data' not in metafunc.fixturenames:
            return
        config = pathlib.Path(metafunc.module.__file__).with_name('config.json')
        testdata = json.loads(config.read_text())
        param = testdata.get(metafunc.function.__name__, None)
        if isinstance(param, list):
            metafunc.parametrize('config_data', param)
Some notes: yield_fixture is deprecated, so I replaced it with a plain fixture. Also, you don't need autouse=True in fixtures that return values, because the tests request them by name anyway.
Example tests and configs I used:
    # testcases/project_1/config.json
    {
        "test_one": [1, 2, 3],
        "test_two": "split"
    }
    # testcases/project_1/test_suite_1.py
    def test_one(config_data):
        assert config_data >= 0
    def test_two(config_data):
        assert config_data == 'split'
    # testcases/project_2/config.json
    {
        "test_three": {"three": 3},
        "test_four": {"four": 4}
    }
    # testcases/project_2/test_suite_2.py
    def test_three(config_data):
        assert config_data['three'] == 3
    def test_four(config_data):
        assert config_data['four'] == 4
Running the tests yields:
$ pytest -vs
============================== test session starts ================================
platform linux -- Python 3.6.5, pytest-3.4.1, py-1.5.3, pluggy-0.6.0 --
/data/gentoo64/usr/bin/python3.6
cachedir: .pytest_cache
rootdir: /data/gentoo64/home/u0_a82/projects/stackoverflow/so-50815777, inifile:
plugins: mock-1.6.3, cov-2.5.1
collected 6 items
testcases/project_1/test_suite_1.py::test_one[1] PASSED
testcases/project_1/test_suite_1.py::test_one[2] PASSED
testcases/project_1/test_suite_1.py::test_one[3] PASSED
testcases/project_1/test_suite_1.py::test_two PASSED
testcases/project_2/test_suite_2.py::test_three PASSED
testcases/project_2/test_suite_2.py::test_four PASSED
============================ 6 passed in 0.12 seconds =============================
