Py.test: excessive memory usage with large number of tests

I am using py.test (version 2.4, on Windows 7) with xdist to run a number of numerical regression and interface tests for a C++ library that provides a Python interface through a C module.
The number of tests has grown to ~2,000 over time, but we are running into some memory issues now. Whether using xdist or not, the memory usage of the python process running the tests seems to be ever increasing.
In single-process mode we have even seen a few issues of bad allocation errors, whereas with xdist total memory usage may bring down the OS (8 processes, each using >1GB towards the end).
Is this expected behaviour? Or has anyone else experienced the same issue when using py.test for a large number of tests? Is there something I can do in tearDown() / tearDownClass() to reduce the memory usage over time?
At the moment I cannot exclude the possibility that the problem lies somewhere inside the C/C++ code, but when I run a long-running program that uses that code through the Python interface outside of py.test, I see relatively constant memory usage over time. I also do not see any excessive memory usage when using nose instead of py.test (we are using py.test because we need junit-xml reporting to work with multiple processes).

py.test's memory usage will grow with the number of tests. Each test is collected before any are executed, and for each test run a test report is stored in memory, which will be much larger for failures, so that all the information can be reported at the end. So to some extent this is expected and normal.
However, I have no hard numbers and have never closely investigated this. We did run out of memory on some CI hosts ourselves before, but just gave them more memory to solve it instead of investigating. Currently our CI hosts have 2 GB of memory and run about 3,500 tests in one test run; it would probably work with half of that, but might involve more swapping. PyPy is also a project that manages to run a huge test suite with py.test, so this should certainly be possible.
If you suspect the C code of leaking memory, I recommend building a (small) test script which just exercises the extension module API (with or without py.test) and invoking it in an infinite loop while gathering memory stats after every iteration. After a few loops the memory should no longer increase.
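A minimal sketch of such a leak check, with a stand-in lambda where the real extension-module call would go (Unix-only, since it samples memory via the stdlib `resource` module):

```python
import resource

def peak_rss_kb():
    # Peak resident set size so far, in kB on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def exercise_api(api_call, loops=50):
    """Call the API repeatedly, sampling peak memory after every loop."""
    samples = []
    for _ in range(loops):
        api_call()                  # create/destroy the C++ objects under test
        samples.append(peak_rss_kb())
    return samples

# After a short warm-up the samples should plateau if nothing leaks.
samples = exercise_api(lambda: [0] * 100_000)
print("stable" if samples[-1] < samples[5] * 1.5 else "possible leak")
```

On Windows the same idea works with a third-party helper such as psutil instead of `resource`.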

Try using --tb=no which should prevent pytest from accumulating stacks on every failure.
I have found that it's better to have your test runner run smaller instances of pytest in multiple processes, rather than one big pytest run, because of its accumulation in memory of every error.
pytest should probably accumulate test results on-disk, rather than in ram.

We also experience similar problems. In our case we run about 4,600 test cases.
We use pytest fixtures extensively, and we managed to save a few MB by scoping the fixtures differently (narrowing several from "session" scope to "class" or "function" scope). However, test performance dropped.
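For illustration, narrowing a fixture's scope looks like the sketch below; `make_resource` is a hypothetical stand-in for an expensive object:

```python
import pytest

def make_resource():
    # Stand-in for an expensive object held by a fixture
    return list(range(100_000))

@pytest.fixture(scope="session")    # one instance kept alive for the whole run
def shared_resource():
    return make_resource()

@pytest.fixture(scope="function")   # re-created (and freed) for every test
def per_test_resource():
    return make_resource()

def test_shared(shared_resource):
    assert len(shared_resource) == 100_000

def test_per_test(per_test_resource):
    assert len(per_test_resource) == 100_000
```

The trade-off is exactly the one described above: function scope releases memory sooner but pays the setup cost on every test.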

Related

pytest takes 10 minutes to collect at builtins compile method

I have a pytest test suite with about 1,800 tests which takes more than 10 minutes to collect and execute. I tried running cProfile on the test run and found that the majority of the time, around 300 seconds, went to {built-in method builtins.compile}.
There were some other compile calls from the regular expression package which I tried to remove, and I saw a reduction of about 50 seconds, but it still takes 9.5 minutes, which is huge.
What I understand so far is that the built-in compile method converts a script into a code object, and that pytest internally uses this function for creating and executing code objects. But 9-10 minutes is an insanely huge amount of time for running 1,800 tests. I am new to pytest and Python, so I am trying to figure out the reason for this.
Could pytest be misconfigured such that it uses the compile method to generate code objects? Or could the other imported libraries use compile internally?
Could pytest be misconfigured such that it uses the compile method to generate code objects?
Though I have never looked, I would fully expect pytest to compile files to bytecode by hand, for the simple reason that it performs assertion rewriting by default in order to instrument assert statements: when an assertion fails, rather than just showing the assertion message, pytest shows the various intermediate values. This requires compiling either way: either they compile the code to bytecode and rewrite the bytecode, or they parse the code to the AST, update the AST, and still compile to bytecode.
It's possible to disable this behaviour (--assert=plain), but I would not expect much gain from it (though I could be wrong): pytest simply does the compilation instead of the interpreter performing it on its own. It has to be done one way or another for the test suite to run.
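The parse-to-AST-then-compile path can be sketched with the stdlib alone; pytest's real rewriter additionally transforms the `Assert` nodes before compiling:

```python
import ast

source = "assert add(2, 2) == 5"
tree = ast.parse(source)            # source text -> AST
# pytest's rewriter would modify the Assert node here so that failures
# report intermediate values; this sketch only shows the round trip.
code = compile(tree, filename="<test file>", mode="exec")
print(type(code).__name__)          # a code object, ready for exec()
```

Either way, every test file goes through a compile step once per run (subject to pytest's bytecode cache), which is why `builtins.compile` shows up in the profile.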
Taking 5 minutes does sound like a lot, though. Do you have a large number of very small files or something? Rough benchmarking indicates that compile works at about 5 µs/line on my machine (though it probably depends on code complexity). I've got 6 kLOC worth of tests, and while the test suite takes ages, that's because the tests themselves are expensive; the collection is unnoticeable.
Of course it's possible you could be triggering some sort of edge case or issue in pytest e.g. maybe you have an ungodly number of assert statements which causes pytest to generate an insane amount of rewritten code? The aforementioned --assert=plain could hint at that if it makes running the test suite significantly shorter.
You could also try running e.g. --collect-only to see what that yields, though I don't know whether the assertion rewriting is performed during or after the collection. FWIW on the 6kLOC test suite above I get 216 tests collected in 1.32s.
Either way this seems like something more suitable to the pytest bug tracker.
or could the other imported libraries use compile internally ?
You could use a flamegraph-based profiler to record the entire stack. cProfile is, frankly, kinda shit.

Temporary object-pool for unit tests?

I am running a large unit test repository for a complex project.
This project has some things that don't play well with large numbers of tests:
caches (memoization) that cause objects not to be freed between tests
complex objects at module level that are singletons and might gather data when being used
I am interested in each test (or at least each test suite) having its own "python-object-pool" and being able to free it after.
Sort of a python-garbage-collector-problem workaround.
I imagine a self-contained, temporary, discardable Python interpreter that can run certain code for me, after which I can call "interpreter.free()" and be assured it doesn't leak.
One rough solution I found is to use Nose, or to implement this via subprocess, each time I need an expendable interpreter that will run a test. So each test becomes "fork_and_run(conditions)" and leaks no memory in the original process.
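A minimal sketch of that "fork_and_run" idea with the stdlib `multiprocessing` module; the test body and its leaked cache are hypothetical:

```python
import multiprocessing

def fork_and_run(test_fn):
    """Run one test in a child process; leaked objects die with the child."""
    proc = multiprocessing.Process(target=test_fn)
    proc.start()
    proc.join()
    return proc.exitcode == 0      # unhandled exception -> non-zero exit code

def leaky_test():
    cache = [0] * 10_000_000       # "leaked" memory vanishes with the child
    assert sum(cache) == 0

if __name__ == "__main__":
    print(fork_and_run(leaky_test))
```

The cost is one interpreter start-up (plus any imports) per test, which is why it is usually applied per suite rather than per test.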
I also saw Nose's option to run a single process per test and run the tests sequentially, though people mentioned it sometimes freezes midway. Less fun...
Is there a simpler solution?
P.S.
I am not interested in going through vast amounts of other peoples code and trying to make all their caches/objects/projects be perfectly memory-managed objects that can be cleaned.
P.P.S
Our PROD code also creates a new process for each job, which is very comfortable since we don't have to mess around with "surviving forever" and other scary stories.
TL;DR
The module reload trick I tried worked locally, but broke when used on a machine with a different Python version... (?!)
I ended up taking any and all caches I wrote in code and adding them to a global cache list, then clearing them between tests.
Sadly this will break if anyone uses a cache/manual caching mechanism and misses this; tests will start growing in memory again...
For starters I wrote a loop that goes over the sys.modules dict and reloads all modules of my code (looping twice). This worked amazingly: all references were freed properly. But it seems it cannot be used in production/serious code, for multiple reasons:
Old Python versions break when reloading, and classes that inherit from meta-classes are redefined (I still don't get how this breaks).
Unit tests survive the reload and sometimes hold references to instances of the old classes, especially if one class uses another class's instance. Think super(class_name, self) where self is an instance of the previously defined class, and class_name is now the redefined same-name class.
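For reference, the reload loop itself is roughly the sketch below; "myproject" is a placeholder prefix, and the caveats above still apply:

```python
import importlib
import sys

def reload_project_modules(prefix, passes=2):
    """Reload every loaded module whose name starts with `prefix`.

    Two passes help cross-module references settle, but objects created
    before the reload keep pointing at the old class objects.
    """
    for _ in range(passes):
        for name, module in list(sys.modules.items()):
            if module is not None and name.startswith(prefix):
                importlib.reload(module)
```

A call would look like `reload_project_modules("myproject")` between tests or suites.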

Pytest with xdist takes 16x longer, regardless of number of workers

I've just added pytest to an existing Django project - all unit tests are using Django's unittest subclasses, etc. We use an SQLite in-memory database for tests.
manage.py test takes roughly 80 seconds on our test suite
py.test takes the same
py.test -n1 (or -n4, or anything like that) takes roughly 1280 seconds.
I would expect an overhead to support the distribution, but obviously with -n4, it should be approximately 3-4 times faster on a large-ish test suite.
Findings...
I've so far traced the issue down to database access. The tests run quickly until they first hit the database, but at the first .save() call on a Django model, that test will be incredibly slow.
After some profiling on the workers, it looks like they are spending a lot of time waiting on locks, but I have no idea if that's a reliable finding or not.
I wondered if there was some sort of locking on the database. It was suggested to me that the in-memory SQLite database might be a memory-mapped file, and that locking might be happening on it between the workers, but apparently each call to open an in-memory database with SQLite returns a completely separate instance.
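That last point is easy to verify with the stdlib `sqlite3` module: two `:memory:` connections are fully independent databases, so the workers cannot be contending for the same in-memory file.

```python
import sqlite3

a = sqlite3.connect(":memory:")
b = sqlite3.connect(":memory:")
a.execute("CREATE TABLE t (x INTEGER)")
a.execute("INSERT INTO t VALUES (1)")

# b has its own private database and cannot see a's table at all
try:
    b.execute("SELECT x FROM t")
except sqlite3.OperationalError as exc:
    print(exc)                      # no such table: t
```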
As it stands, I've probably spent 5+ hours on this so far, and spoken at length to colleagues and others about this, and not yet found the issue. I have not been able to reproduce on a separate codebase.
What kind of things might cause this?
What more could I do to further track down the issue?
Thanks in advance for any ideas!

PyTest - Run each Test as a Multiprocessing Process

I'm using pytest to run my tests, and testing my web application. My test file looks like
def test_logins():
# do stuff
def test_signups():
# do stuff
def test_posting():
# do stuff
There are about 20 of them, and many have parts that take a fixed amount of time or rely on external HTTP requests, so it seems it would lead to a large increase in testing speed if I could get pytest to start up 20 different multiprocessing processes (one for each test) to run each testing function. Is this possible / reasonable / recommended?
I looked into xdist but splitting the tests so that they ran based on the amount of cores on my computer isn't what I want.
Also in case it's relevant, the bulk of the tests are done using python's requests library (although they will be moved to selenium eventually)
I would still recommend using pytest-xdist. And, as you mentioned, because your tests mostly do network I/O, it's OK to start pytest with (many) more parallel processes than you have cores (like 20); it will still be beneficial, as the GIL will not prevent the speedup from parallelization.
So you run it like:
py.test tests -n<number>
The additional benefit of xdist is that you can easily scale your test run to multiple machines with no effort.
For easier scaling among multiple machines, pytest-cloud can help a lot.

Proper way to automatically test performance in Python (for all developers)?

Our Python application (a cool web service) has a full suite of tests (unit tests, integration tests etc.) that all developers must run before committing code.
I want to add some performance tests to the suite to make sure no one adds code that makes us run too slow (for some rather arbitrary definition of slow).
Obviously, I can collect some functionality into a test, time it and compare to some predefined threshold.
The tricky requirements:
I want every developer to be able to test the code on their own machine (which varies in CPU power, OS (Linux and some Windows!), and external configuration; the Python version, libraries and modules are the same). A test server, while generally a good idea, does not solve this.
I want the test to be DETERMINISTIC - regardless of what is happening on the machine running the tests, I want multiple runs of the test to return the same results.
My preliminary thoughts:
Use timeit and do a benchmark of the system every time I run the tests. Compare the performance test results to the benchmark.
Use cProfile to instrument the interpreter to ignore "outside noise". I'm not sure I know how to read the pstats structure yet, but I'm sure it is doable.
Other thoughts?
Thanks!
Tal.
Check out funkload - it's a way of running your unit tests as either functional or load tests to gauge how well your site is performing.
Another interesting project which can be used in conjunction with funkload is codespeed. This is an internal dashboard that measures the "speed" of your codebase for every commit you make to your code, presenting graphs with trends over time. This assumes you have a number of automatic benchmarks you can run - but it could be a useful way to have an authoritative account of performance over time. The best use of codespeed I've seen so far is the speed.pypy.org site.
As to your requirement for determinism: perhaps the best approach is to use statistics to your advantage? Automatically run the test N times and produce the min, max, average and standard deviation of all your runs. Check out this article on benchmarking for some pointers.
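That could look something like the sketch below, using `timeit.repeat` plus the `statistics` module; the workload being timed is a placeholder:

```python
import statistics
import timeit

def bench(fn, runs=10, number=1000):
    """Time fn `runs` times and summarize, instead of trusting one sample."""
    times = timeit.repeat(fn, repeat=runs, number=number)
    return {
        "min": min(times),
        "max": max(times),
        "mean": statistics.mean(times),
        "stdev": statistics.stdev(times),
    }

stats = bench(lambda: sum(range(1000)))
# The minimum is usually the most stable figure to compare to a threshold,
# since noise from other processes only ever makes runs slower.
print(sorted(stats))
```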
I want the test to be DETERMINISTIC - regardless of what is happening on the machine running the tests, I want multiple runs of the test to return the same results.
Fail. More or less by definition this is utterly impossible in a multi-processing system with multiple users.
Either rethink this requirement or find a new environment in which to run tests that doesn't involve any of the modern multi-processing operating systems.
Further, your running web application is not deterministic, so imposing some kind of "deterministic" performance testing doesn't help much.
When we did time-critical processing (in radar, where "real time" actually meant real time) we did not attempt deterministic testing. We did code inspections and ran simple performance tests that involved simple averages and maximums.
Use cProfile to instrument the interpreter to ignore "outside noise". I'm not sure I know how to read the pstats structure yet, but I'm sure it is doable.
The Stats object created by the profiler is what you're looking for.
http://docs.python.org/library/profile.html#the-stats-class
Focus on 'pcalls', primitive call count, in the profile statistics and you'll have something that's approximately deterministic.
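A short sketch of pulling the primitive call count out of a `pstats.Stats` object; the profiled function is a placeholder:

```python
import cProfile
import io
import pstats

def work():
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stats = pstats.Stats(profiler, stream=io.StringIO())
# prim_calls counts calls excluding recursive re-entries; for deterministic
# code it is stable across runs, unlike wall-clock timings.
print(stats.prim_calls)
```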