How to run Spark unit testing in parallel via pytest (and fixture)?

I am writing unit tests for a Spark application. I am using pytest, and I have created a fixture that loads the Spark session once.
When I run one test at a time, each test passes, but when I run all the tests together I get unexpected behavior. I then realized that Spark does not seem to be multi-threadable. Is there any way to fix this? Is running pytest in non-parallel mode the only solution?
Sample code structure:
import pytest
from pyspark.sql import SparkSession
@pytest.fixture(scope="session")
def spark() -> SparkSession:
    builder = SparkSession.builder.appName("pandas-on-spark")
    builder = builder.config("spark.sql.execution.arrow.pyspark.enabled", "true")
    return builder.getOrCreate()
def test1(spark):
    df = spark.createDataFrame(dummy_rows)
    # do some transformation
    # assert
def test2(spark):
    df = spark.createDataFrame(dummy_rows)
    # do some transformation
    # assert
def testN(spark):
    df = spark.createDataFrame(dummy_rows)
    # do some transformation
    # assert
pytest -s .

With scope="session", a single Spark session is shared by all the tests, which means they also share all variables, caches, transformations, and so on. If you really need each test's transformations to be completely isolated from the other tests, consider creating a new Spark session for each test by lowering the fixture scope to class or function. The whole test suite will run more slowly, but the tests can no longer interfere with each other.

Related

Find the YARN ApplicationID of the current Spark job from the DRIVER node?

Is there a straightforward way to get the yarn ApplicationId of the current job from the DRIVER node running under Amazon's Elastic Map Reduce (EMR)? This is running Spark in the cluster mode.
Right now I'm using code that runs a map() operation on a worker to read the CONTAINER_ID environment variable. This seems inefficient. Here's the code:
import os
def applicationIdFromEnvironment():
    return "_".join(['application'] + os.environ['CONTAINER_ID'].split("_")[1:3])
def applicationId():
    """Return the Yarn (or local) applicationID.
    The environment variables are only set if we are running in a Yarn container.
    """
    # First check to see if we are running on the worker...
    try:
        return applicationIdFromEnvironment()
    except KeyError:
        pass
    # Perhaps we are running on the driver? If so, run a Spark job that finds it.
    try:
        from pyspark import SparkConf, SparkContext
        sc = SparkContext.getOrCreate()
        if "local" in sc.getConf().get("spark.master"):
            return f"local{os.getpid()}"
        # Note: make sure that the following map does not require access to any existing module.
        appid = sc.parallelize([1]).map(lambda x: "_".join(['application'] + os.environ['CONTAINER_ID'].split("_")[1:3])).collect()
        return appid[0]
    except ImportError:
        pass
    # Application ID cannot be determined.
    return f"unknown{os.getpid()}"

You can get the application ID directly from the SparkContext through its applicationId property. The documentation describes it as:
A unique identifier for the Spark application. Its format depends on the scheduler implementation.
in case of local spark app something like ‘local-1433865536131’
in case of YARN something like ‘application_1433865536131_34483’
appid = sc.applicationId
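As a rough illustration of the two formats quoted above, here is a small sketch that reads `applicationId` and guesses the scheduler from its prefix. The helper name `classify_application_id` and the stand-in `fake_sc` object are hypothetical; in a real job you would pass the actual SparkContext.

```python
from types import SimpleNamespace


def classify_application_id(sc):
    """Return (app_id, kind) for any object exposing `applicationId`."""
    app_id = sc.applicationId
    if app_id.startswith("application_"):
        kind = "yarn"
    elif app_id.startswith("local-"):
        kind = "local"
    else:
        kind = "other"  # the format depends on the scheduler
    return app_id, kind


# Stand-in for a real SparkContext, so the sketch runs without a cluster.
fake_sc = SimpleNamespace(applicationId="application_1433865536131_34483")
print(classify_application_id(fake_sc))
```

Because the property lives on the driver-side context, no job has to be sent to a worker just to read the ID.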

How to debug PySpark in local mode from test class

I'm writing an aggregation in PySpark.
I'm also adding a test to this project, in which I create a session, insert some data, run my aggregation, and check the results.
The code looks like this:
def mapper_convert_row(row):
    #... specific of business logic code, eventually return one string value
    return my_str
def run_spark_query(spark: SparkSession, from_dt, to_dt):
    query = get_hive_query_str(from_dt, to_dt)
    df = spark.sql(query).rdd.map(lambda row: Row(mapper_convert_row(row)))
    out_schema = StructType([StructField("data", StringType())])
    df_conv = spark.createDataFrame(df, out_schema)
    df_conv.write.mode('overwrite').format("csv").save(folder)
And here is my test class
class SparkFetchTest(unittest.TestCase):
    @staticmethod
    def getOrCreateSC():
        conf = SparkConf()
        conf.setMaster("local")
        spark = (SparkSession.builder.config(conf=conf).appName("MyPySparkApp")
                 .enableHiveSupport().getOrCreate())
        return spark
    def test_fetch(self):
        dt_from = datetime.strptime("2019-01-01-10-00", '%Y-%m-%d-%H-%M')
        dt_to = datetime.strptime("2019-01-01-10-05", '%Y-%m-%d-%H-%M')
        spark = self.getOrCreateSC()
        self.init_and_populate_table_with_test_data(spark, input_tbl, dt_from, dt_to)
        run_spark_query(spark, dt_from, dt_to)
        # assert on results
I've added the PySpark dependencies via the Conda environment, and I'm running this code from PyCharm. Just to make it clear: there is no Spark installation on my local machine other than the PySpark Conda package.
When I set a breakpoint, it works in the driver code, but execution does not stop inside the mapper_convert_row function.
How can I debug this business-logic function in a local test environment?
The same approach works perfectly in Scala, but this code has to be in Python.

PySpark is a conduit to the Spark runtime, which runs on the JVM and is written in Scala. The connection goes through py4j, which provides a TCP-based socket from the Python executable to the JVM. Unfortunately, that means:
No local debugging
I'm no happier about it than you are. I might just write and maintain a parallel code branch in Scala to figure out some things that are tiring to do without the debugger.
Update: PyCharm is able to debug Spark programs. I have been using it nearly daily; see "Pycharm Debugging of Pyspark".

Access variables and lists from function

I am new to unit testing with Python. I would like to test some functions in my code. In particular I need to test if the outputs have specific dimensions or the same dimensions.
My Python script for unit testing looks like this:
import unittest
from func import *
class myTests(unittest.TestCase):
    def setUp(self):
        # I am not really sure what the purpose of this function is
        pass
    def test_main(self):
        # check if outputs of the function "main" are not empty:
        self.assertTrue(main, msg = 'The main() function provides no return values!')
        # check if "run['l_modeloutputs']" and "run['l_dataoutputs']", within the main() function have the same size:
        self.assertCountEqual(self, run['l_modeloutputs'], run['l_dataoutputs'], msg=None)
        # --> Doesn't work so far!
        # check if the dimensions of "props['k_iso']", within the main() function are (80,40,100):
    def tearDown(self):
        # I am also not sure of the purpose of this function
        pass
if __name__ == "__main__":
    unittest.main()
Here is the code under test:
def main(param_file):
    # Load parameter file
    run, model, sequences, hydraulics, flowtrans, elements, mg = hu.model_setup(param_file)
    # some other code
    ...
    if 'l_modeloutputs' in run:
        if hydraulics['flag_gen'] is False:
            print('No hydraulic parameters generated. No model outputs saved')
        else:
            save_models(realdir, realname, mg, run['l_modeloutputs'], flowtrans, props['k_iso'], props['ktensors'])
I need to access the parameters run['l_modeloutputs'] and run['l_dataoutputs'] of the main function from func.py. How can I pass the dimensions of these parameters to the unit testing script?

It sounds like one of two things: either your code isn't currently laid out in a way that is easy to test, or you are trying to test or call too much code in one go.
If your code is laid out like the following:
def main(file_name):
    with open(file_name) as file:
        ... do work ...
        results = outcome_of_work
and you are trying to test what you have got from the file_name as well as the size of results, then you may want to think of refactoring this so that you can test a smaller action. Maybe:
def main(file_name):
    # `get_file_contents` appears to be `hu.model_setup`
    # `file_contents` would be `run`
    file_contents = get_file_contents(file_name)
    results = do_work_on_file_contents(file_contents)
Of course, if you already have a similar setup, then the following also applies. This makes testing easier, because you have direct control over what goes into each test (file_name or file_contents) and can then check the outcome (file_contents or results) against the expected results.
With the unittest module you would basically be creating a small function for each test:
class Test(TestCase):
    def test_get_file_contents(self):
        # ... set up example `file-like object` ...
        run = hu.model_setup(file_name)
        self.assertCountEqual(
            run['l_modeloutputs'], run['l_dataoutputs'])
        # ... repeat for other possible files ...
    def test_do_work_on_file_contents(self):
        example_input = ...  # set up input
        example_output = do_work_on_file_contents(example_input)
        assert example_output == as_expected
This can then be repeated for different sets of potential inputs, both good and edge cases.
It's probably worth looking for a more in-depth tutorial, as this is obviously only a very quick overview.
setUp and tearDown are only needed if there is something to be done for each test you have written. For example, if several tests need an object set up in a particular way, you can do that in setUp, which runs before each test function; tearDown runs after each one.
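To make the role of setUp and tearDown concrete, here is a small self-contained sketch. The function `build_outputs` is a hypothetical stand-in for the part of main() that fills `run`, and comparing lengths is only an assumption about what "the same size" means in the question.

```python
import io
import unittest


def build_outputs(n_models, n_data):
    """Hypothetical stand-in for the code that fills the `run` dict."""
    return {
        "l_modeloutputs": [f"model_{i}" for i in range(n_models)],
        "l_dataoutputs": [f"data_{i}" for i in range(n_data)],
    }


class BuildOutputsTest(unittest.TestCase):
    def setUp(self):
        # Runs before every test method: build a fresh input each time.
        self.run_cfg = build_outputs(3, 3)

    def tearDown(self):
        # Runs after every test method, even if the test failed.
        self.run_cfg = None

    def test_outputs_not_empty(self):
        self.assertTrue(self.run_cfg["l_modeloutputs"])

    def test_outputs_same_length(self):
        self.assertEqual(
            len(self.run_cfg["l_modeloutputs"]),
            len(self.run_cfg["l_dataoutputs"]),
        )


def run_suite():
    """Run the test case quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BuildOutputsTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

The key point is that the tests only see values that a function returns; local variables inside main() cannot be reached from the test module.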

PyTest: Django transaction commit failure

I am using pytest to implement unit tests in my Django project, which has MySQL as its backend.
In combination with these, I am using SQLAlchemy for data generation.
I have a Python function call_my_flow() which executes two different flows depending on conditions. The first flow uses a SQLAlchemy connection and the second flow uses the Django connection for database inserts.
I have written two unit tests using pytest to check both flows.
First flow (where the SQLAlchemy connection is used): commits the process-flow transaction to the database, and pytest runs as expected.
Second flow (where the Django database connection is used): the transaction commit fails, which causes the test to fail.
Demo code:
import pytest
from myflow import call_my_flow
@pytest.fixture(scope="class")
@pytest.mark.django_db(transaction=False)
def setup_my_flow():
    call_my_flow()
@pytest.mark.usefixtures("setup_my_flow")
class TestGenerateOrder(object):
    @pytest.fixture(autouse=True)
    def setuporder(self):
        self.first_count = 2
        self.second_count = 5
    @pytest.mark.order1
    @pytest.mark.django_db
    def test_first_flow_count(self):
        db_count = get_first_count()
        assert db_count == self.first_count
    @pytest.mark.order2
    @pytest.mark.django_db
    def test_second_flow_count(self):
        db_count = get_second_count()
        assert db_count == self.second_count
Please suggest a solution.

How to skip the rest of tests in the class if one has failed?

I'm creating the test cases for web-tests using Jenkins, Python, Selenium2(webdriver) and Py.test frameworks.
So far I'm organizing my tests in the following structure:
each Class is the Test Case and each test_ method is a Test Step.
This setup works GREAT when everything is working fine, however when one step crashes the rest of the "Test Steps" go crazy. I'm able to contain the failure inside the Class (Test Case) with the help of teardown_class(), however I'm looking into how to improve this.
What I need is a way to skip (or xfail) the rest of the test_ methods within a class if one of them has failed, so that the remaining steps are not run and marked as FAILED (since that would be a false positive).
Thanks!
UPDATE: I'm not looking for the answer "it's bad practice", since calling it that is very arguable (each Test Class is independent, and that should be enough).
UPDATE 2: Putting an "if" condition in each test method is not an option; it is a LOT of repeated work. What I'm looking for is (maybe) somebody who knows how to use hooks on the class methods.

I like the general "test-step" idea. I'd term it as "incremental" testing and it makes most sense in functional testing scenarios IMHO.
Here is an implementation that doesn't depend on internal details of pytest (except for the official hook extensions). Copy this into your conftest.py:
import pytest
def pytest_runtest_makereport(item, call):
    if "incremental" in item.keywords:
        if call.excinfo is not None:
            parent = item.parent
            parent._previousfailed = item
def pytest_runtest_setup(item):
    previousfailed = getattr(item.parent, "_previousfailed", None)
    if previousfailed is not None:
        pytest.xfail("previous test failed (%s)" % previousfailed.name)
If you now have a "test_step.py" like this:
import pytest
@pytest.mark.incremental
class TestUserHandling:
    def test_login(self):
        pass
    def test_modification(self):
        assert 0
    def test_deletion(self):
        pass
then running it looks like this (using -rx to report on xfail reasons):
(1)hpk@t2:~/p/pytest/doc/en/example/teststep$ py.test -rx
============================= test session starts ==============================
platform linux2 -- Python 2.7.3 -- pytest-2.3.0.dev17
plugins: xdist, bugzilla, cache, oejskit, cli, pep8, cov, timeout
collected 3 items
test_step.py .Fx
=================================== FAILURES ===================================
______________________ TestUserHandling.test_modification ______________________
self = <test_step.TestUserHandling instance at 0x1e0d9e0>
    def test_modification(self):
>       assert 0
E       assert 0
test_step.py:8: AssertionError
=========================== short test summary info ============================
XFAIL test_step.py::TestUserHandling::()::test_deletion
reason: previous test failed (test_modification)
================ 1 failed, 1 passed, 1 xfailed in 0.02 seconds =================
I am using "xfail" here because skips are rather for wrong environments or missing dependencies, wrong interpreter versions.
Edit: Note that neither your example nor my example would directly work with distributed testing. For this, the pytest-xdist plugin needs to grow a way to define groups/classes to be sent whole-sale to one testing slave instead of the current mode which usually sends test functions of a class to different slaves.

If you'd like to stop the test execution after N failures anywhere (not in a particular test class), the command line option pytest --maxfail=N is the way to go:
https://docs.pytest.org/en/latest/usage.html#stopping-after-the-first-or-n-failures
If you instead want to stop a test that is composed of multiple steps as soon as any of them fails (and continue executing the other tests), put all your steps in a class, use the @pytest.mark.incremental decorator on that class, and edit your conftest.py to include the code shown here:
https://docs.pytest.org/en/latest/example/simple.html#incremental-testing-test-steps

The pytest -x option stops the test run after the first failure:
pytest -vs -x test_sample.py

It's generally bad practice to do what you are doing. Each test should be as independent as possible from the others, while yours depend completely on the results of the other tests.
Anyway, reading the docs, it seems that a feature like the one you want is not implemented (probably because it wasn't considered useful).
A workaround could be to "fail" your tests by calling a custom method that sets some condition on the class, and to mark each test with the skipif decorator:
import unittest
import pytest
class MyTestCase(unittest.TestCase):
    skip_all = False
    @pytest.mark.skipif("MyTestCase.skip_all")
    def test_A(self):
        ...
        if failed:
            MyTestCase.skip_all = True
    @pytest.mark.skipif("MyTestCase.skip_all")
    def test_B(self):
        ...
        if failed:
            MyTestCase.skip_all = True
Or you can perform this check before running each test and call pytest.skip() if needed.
Edit:
Marking as xfail can be done in the same way, using the corresponding function calls.
Instead of rewriting the boilerplate code for each test, you could probably write a decorator (this would probably require your methods to return a "flag" stating whether they failed or not).
Anyway, I'd like to point out that, as you state, if one of these tests fails, then the other failing tests in the same test case should be considered false positives...
but you can handle this "by hand": just check the output and spot the false positives.
Even though this might be boring and error-prone.
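One way the decorator idea could look, as a sketch using only the standard unittest module. Instead of having each method return a flag, this variant treats any exception other than a skip as the failure signal, which is a deliberate substitution. The names `stop_after_failure`, `_failed_step`, and `CheckoutSteps` are hypothetical, and the test case is defined inside `run_demo` only to keep the example self-contained.

```python
import functools
import io
import unittest


def stop_after_failure(test_method):
    """Skip the wrapped test if an earlier wrapped test in the class failed."""

    @functools.wraps(test_method)
    def wrapper(self, *args, **kwargs):
        cls = type(self)
        failed_step = getattr(cls, "_failed_step", None)
        if failed_step is not None:
            self.skipTest(f"earlier step failed: {failed_step}")
        try:
            return test_method(self, *args, **kwargs)
        except unittest.SkipTest:
            raise  # an explicit skip is not a failure
        except Exception:
            # Remember the failing step on the class, then re-raise
            # so the failure is still reported normally.
            cls._failed_step = test_method.__name__
            raise

    return wrapper


def run_demo():
    """Run three dependent steps and return (run, failed, skipped) counts."""

    class CheckoutSteps(unittest.TestCase):  # hypothetical test case
        @stop_after_failure
        def test_1_login(self):
            self.assertTrue(True)

        @stop_after_failure
        def test_2_add_item(self):
            self.assertEqual(1, 2)  # deliberate failure

        @stop_after_failure
        def test_3_pay(self):
            self.assertTrue(True)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CheckoutSteps)
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.testsRun, len(result.failures), len(result.skipped)


print(run_demo())
```

This relies on unittest running the methods of a class in alphabetical order, hence the numbered method names; the step after the deliberate failure is reported as skipped rather than run.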

You might want to have a look at pytest-dependency. It is a plugin that allows you to skip some tests if some other test has failed.
In your particular case, it seems that the incremental tests that gbonetti discussed are more relevant.

Based on hpk42's answer, here's my slightly modified incremental mark that makes test cases xfail if the previous test failed (but not if it xfailed or it was skipped). This code has to be added to conftest.py:
import pytest
try:
    pytest.skip()
except BaseException as e:
    Skipped = type(e)
try:
    pytest.xfail()
except BaseException as e:
    XFailed = type(e)
def pytest_runtest_makereport(item, call):
    if "incremental" in item.keywords:
        if call.excinfo is not None:
            if call.excinfo.type in {Skipped, XFailed}:
                return
            parent = item.parent
            parent._previousfailed = item
def pytest_runtest_setup(item):
    previousfailed = getattr(item.parent, "_previousfailed", None)
    if previousfailed is not None:
        pytest.xfail("previous test failed (%s)" % previousfailed.name)
And then a collection of test cases has to be marked with @pytest.mark.incremental:
import pytest
@pytest.mark.incremental
class TestWhatever:
    def test_a(self): # this will pass
        pass
    def test_b(self): # this will be skipped
        pytest.skip()
    def test_c(self): # this will fail
        assert False
    def test_d(self): # this will xfail because test_c failed
        pass
    def test_e(self): # this will xfail because test_c failed
        pass

UPDATE: Please take a look at @hpk42's answer. His answer is less intrusive.
This is what I was actually looking for:
from _pytest.runner import runtestprotocol
import pytest
from _pytest.mark import MarkInfo
def check_call_report(item, nextitem):
    """
    if test method fails then mark the rest of the test methods as 'skip'
    also if any of the methods is marked as 'pytest.mark.blocker' then
    interrupt further testing
    """
    reports = runtestprotocol(item, nextitem=nextitem)
    for report in reports:
        if report.when == "call":
            if report.outcome == "failed":
                for test_method in item.parent._collected[item.parent._collected.index(item):]:
                    test_method._request.applymarker(pytest.mark.skipif("True"))
                    if test_method.keywords.has_key('blocker') and isinstance(test_method.keywords.get('blocker'), MarkInfo):
                        item.session.shouldstop = "blocker issue has failed or was marked for skipping"
            break
def pytest_runtest_protocol(item, nextitem):
    # add to the hook
    item.ihook.pytest_runtest_logstart(
        nodeid=item.nodeid, location=item.location,
    )
    check_call_report(item, nextitem)
    return True
Adding this to conftest.py, or as a plugin, solves my problem.
It has also been improved to STOP testing if a blocker test has failed (meaning that all further tests are pointless).

Or quite simply instead of calling py.test from cmd (or tox or wherever), just call:
py.test --maxfail=1
see here for more switches:
https://pytest.org/latest/usage.html

To complement hpk42's answer, you can also use pytest-steps to perform incremental testing. This can help in particular if you wish to share some kind of incremental state or intermediate results between the steps.
With this package you do not need to put all the steps in a class (you can, but it is not required); simply decorate your "test suite" function with @test_steps:
from pytest_steps import test_steps
def step_a():
    # perform this step ...
    print("step a")
    assert not False # replace with your logic
def step_b():
    # perform this step
    print("step b")
    assert not False # replace with your logic
@test_steps(step_a, step_b)
def test_suite_no_shared_results(test_step):
    # Execute the step
    test_step()
You can add a steps_data parameter to your test function if you wish to share a StepsDataHolder object between your steps.
import pytest
from pytest_steps import test_steps, StepsDataHolder
def step_a(steps_data):
    # perform this step ...
    print("step a")
    assert not False # replace with your logic
    # intermediate results can be stored in steps_data
    steps_data.intermediate_a = 'some intermediate result created in step a'
def step_b(steps_data):
    # perform this step, leveraging the previous step's results
    print("step b")
    # you can leverage the results from previous steps...
    # ... or pytest.skip if not relevant
    if len(steps_data.intermediate_a) < 5:
        pytest.skip("Step b should only be executed if the text is long enough")
    new_text = steps_data.intermediate_a + " ... augmented"
    print(new_text)
    assert len(new_text) == 56
@test_steps(step_a, step_b)
def test_suite_with_shared_results(test_step, steps_data: StepsDataHolder):
    # Execute the step with access to the steps_data holder
    test_step(steps_data)
Finally, you can automatically skip or fail a step if another has failed using @depends_on; check the documentation for details.
(I'm the author of this package by the way ;) )
