I am running a large unit test repository for a complex project.
This project has some things that don't play well with large numbers of tests:
caches (memoization) that cause objects not to be freed between tests
complex objects at module level that are singletons and might gather data when being used
I am interested in each test (or at least each test suite) having its own "python-object-pool" and being able to free it after.
Sort of a python-garbage-collector-problem workaround.
I imagine a self-contained, temporary and discardable Python interpreter that can run certain code for me, and afterwards I can call "interpreter.free()" and be assured it doesn't leak.
One heavyweight solution I found is to use Nose, or to implement this via subprocess, spawning an expendable interpreter each time I need one to run a test. Each test then becomes "fork_and_run(conditions)" and leaks no memory in the original process.
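A minimal sketch of that idea, assuming a picklable test callable; fork_and_run, _checks and test_heavy_feature are hypothetical names:

import multiprocessing

def fork_and_run(test_func, *args, **kwargs):
    # Run the test body in a throwaway child process so any caches or
    # module-level singletons it touches die with that process.
    proc = multiprocessing.Process(target=test_func, args=args, kwargs=kwargs)
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        raise AssertionError("%r failed in its child process (exit code %s)"
                             % (test_func, proc.exitcode))

def _checks():
    # Placeholder for the real test body; a failing assert in the child
    # gives a non-zero exit code, which the parent turns into a failure.
    assert 1 + 1 == 2

def test_heavy_feature():
    fork_and_run(_checks)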
I also saw that Nose can run a single process per test and execute the tests sequentially - though people mentioned it sometimes freezes midway - less fun...
Is there a simpler solution?
P.S.
I am not interested in going through vast amounts of other people's code and trying to make all their caches/objects/projects into perfectly memory-managed objects that can be cleaned up.
P.P.S.
Our PROD code also creates a new process for each job, which is very convenient since we don't have to mess around with processes "surviving forever" and other scary stories.
TL;DR
The module reload trick I tried worked locally, but broke when used on a machine with a different Python version... (?!)
I ended up taking any and all caches I wrote in code, adding them to a global cache list, and clearing that list between tests (sketched below).
Sadly this will break if anyone adds a cache/manual caching mechanism and misses the list; tests will start growing in memory again...
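A minimal sketch of that registry idea, assuming every hand-written cache gets routed through it (all names here are hypothetical):

import unittest

# Every cache our own code creates is registered here, so the test
# harness can empty them all between tests.
_ALL_CACHES = []

def register_cache(cache):
    _ALL_CACHES.append(cache)
    return cache

def clear_all_caches():
    for cache in _ALL_CACHES:
        cache.clear()

# In the production code: still a module-level memoization dict, but registered.
_expensive_results = register_cache({})

# In a shared test base class.
class CacheClearingTestCase(unittest.TestCase):
    def tearDown(self):
        clear_all_caches()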
For starters I wrote a loop that goes over the sys.modules dict and reloads all modules of my code (in two passes; see the sketch after this list). This worked amazingly - all references were freed properly - but it seems it cannot be used in production/serious code for multiple reasons:
Old Python versions break when reloading, and classes that inherit from metaclasses get redefined (I still don't get how this breaks).
Unit tests survive the reload and sometimes hold bad references to old classes - especially if a class uses another class's instance. Think super(class_name, self), where self is an instance of the previously defined class, while class_name now refers to the redefined class of the same name.
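For reference, the reload loop looked roughly like this (a sketch; "myproj." stands in for the real package prefix, and all the caveats above still apply):

import importlib
import sys

def reload_project_modules(prefix="myproj."):
    # Two passes help modules whose reload depends on a peer that was
    # reloaded after them in the first pass. On Python 2 the builtin
    # reload() would be used instead of importlib.reload().
    for _ in range(2):
        for name, module in list(sys.modules.items()):
            if module is not None and name.startswith(prefix):
                importlib.reload(module)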
Related
I have a pytest test suite with about 1,800 tests which takes more than 10 minutes to collect and execute. I profiled a run with cProfile and found that the majority of the time, around 300 seconds, went into {built-in method builtins.compile}.
There were some other compile calls coming from the regular expression package; removing those cut about 50 seconds, but the run still takes 9.5 minutes, which is huge.
What I understand so far is that the built-in compile method converts a script into a code object, and that pytest internally uses this function for creating and executing code objects. But 9-10 minutes is an insanely long time for running 1,800 tests. I am new to pytest and Python, so I am trying to figure out the reason for this.
Could it be that pytest is not configured properly, so that it uses the compile method to generate code objects? Or could other imported libraries be using compile internally?
Could it be that pytest is not configured properly, so that it uses the compile method to generate code objects?
Though I have never looked, I would fully expect pytest to compile files to bytecode by hand, for the simple reason that it performs assertion rewriting by default in order to instrument assert statements: when an assertion fails, rather than just showing the assertion message, pytest shows the various intermediate values. This requires compiling either way: either they compile the code to bytecode and rewrite the bytecode, or they parse the code to the AST, update the AST, and still compile it down to bytecode.
It's possible to disable this behaviour (--assert=plain), but I would not expect much gain from it (though I could be wrong): pytest simply does the compilation instead of the interpreter performing it on its own. It has to be done one way or another for the test suite to run.
Though 5 minutes of compile time does sound like a lot - do you have a large number of very small files or something? Rough benchmarking indicates that compile works at about 5 µs/line on my machine (though it probably depends on code complexity). I've got 6 kLOC worth of tests, and while the test suite takes ages, that's because the tests themselves are expensive; the collection is unnoticeable.
Of course it's possible you could be triggering some sort of edge case or issue in pytest e.g. maybe you have an ungodly number of assert statements which causes pytest to generate an insane amount of rewritten code? The aforementioned --assert=plain could hint at that if it makes running the test suite significantly shorter.
You could also try running e.g. --collect-only to see what that yields, though I don't know whether the assertion rewriting is performed during or after the collection. FWIW on the 6kLOC test suite above I get 216 tests collected in 1.32s.
Either way this seems like something more suitable for the pytest bug tracker.
Or could other imported libraries be using compile internally?
You could use a flamegraph-based profiler to record the entire stack. cProfile is, frankly, kinda shit.
I am using py.test (version 2.4, on Windows 7) with xdist to run a number of numerical regression and interface tests for a C++ library that provides a Python interface through a C module.
The number of tests has grown to ~2,000 over time, but we are running into some memory issues now. Whether using xdist or not, the memory usage of the python process running the tests seems to be ever increasing.
In single-process mode we have even seen a few issues of bad allocation errors, whereas with xdist total memory usage may bring down the OS (8 processes, each using >1GB towards the end).
Is this expected behaviour? Or did somebody else experience the same issue when using py.test for a large number of tests? Is there something I can do in tearDown(Class) to reduce the memory usage over time?
At the moment I cannot exclude the possibility of the problem lying somewhere inside the C/C++ code, but when running a long-running program that uses that code through the Python interface outside of py.test, I see relatively constant memory usage over time. I also do not see any excessive memory usage when using nose instead of py.test (we are using py.test because we need junit-xml reporting to work with multiple processes).
py.test's memory usage will grow with the number of tests. All tests are collected before they are executed, and for each test run a test report is stored in memory - much larger for failures - so that all the information can be reported at the end. So to some extent this is expected and normal.
However I have no hard numbers and have never closely investigated this. We did run out of memory on some CI hosts ourselves before, but just gave them more memory to solve it instead of investigating. Currently our CI hosts have 2 GB of memory and run about 3,500 tests in one test run; it would probably work on half of that but might involve more swapping. PyPy is also a project that manages to run a huge test suite with py.test, so this should certainly be possible.
If you suspect the C code of leaking memory, I recommend building a (small) test script which just exercises the extension module API (with or without py.test) and invoking it in an infinite loop while gathering memory stats after every iteration. After a few iterations the memory usage should stop increasing.
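A minimal sketch of such a script, assuming a Unix machine (for resource.getrusage) and a hypothetical extension module name:

import resource  # Unix-only; on Windows something like psutil would be needed

import my_extension  # placeholder for the C extension module under suspicion

def exercise_api_once():
    # Call the extension the same way the tests do (placeholder body).
    handle = my_extension.open_thing()
    my_extension.do_work(handle)
    my_extension.close_thing(handle)

if __name__ == "__main__":
    for i in range(10000):
        exercise_api_once()
        if i % 100 == 0:
            # ru_maxrss is a high-water mark (kB on Linux): if it keeps
            # climbing long after warm-up, something is leaking.
            print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)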
Try using --tb=no which should prevent pytest from accumulating stacks on every failure.
I have found that it's better to have your test runner run smaller instances of pytest in multiple processes, rather than one big pytest run, because of the latter's accumulation in memory of every error.
pytest should probably accumulate test results on disk, rather than in RAM.
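A rough sketch of such a runner, splitting the suite into separate pytest processes per directory (the directory names are made up):

import subprocess
import sys

# One pytest process per chunk: the reports each run accumulates in
# memory are freed as soon as that process exits.
CHUNKS = ["tests/unit", "tests/integration", "tests/regression"]

def main():
    failed = False
    for chunk in CHUNKS:
        result = subprocess.run([sys.executable, "-m", "pytest", "--tb=no", chunk])
        failed = failed or result.returncode != 0
    sys.exit(1 if failed else 0)

if __name__ == "__main__":
    main()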
We also experience similar problems. In our case we run about ~4600 test cases.
We use pytest fixtures extensively, and we managed to save a few MB by scoping the fixtures slightly differently (changing several from "session" scope to "class" or "function" scope). However, test performance dropped.
I'm in the process of writing a small/medium sized GUI application with PyGObject (the new introspection based bindings for Gtk). I started out with a reasonable test suite based on nose that was able to test most of the functions of my app by simply importing the modules and calling various functions and checking the results.
However, recently I've begun to take advantage of some Gtk features like GLib.timeout_add_seconds, which is a fairly simple callback mechanism that calls the specified callback after a timer expires. The problem I'm now facing is that my code seems to work when I use the app, but the test suite is poorly encapsulated, so when one test checks that it's starting with a clean state, it finds that its state has been trampled all over by a callback registered by a different test. Specifically, the test successfully checks that no files are loaded, then it loads some files, then checks that the files haven't been modified since loading, and the test fails!
It took me a while to figure out what was going on, but essentially one test would modify some files (which initiates a timer) then close them without saving, then another test would reopen the unmodified files and find that they're modified, because the callback altered the files after the timer was up.
I've read about Python's reload() builtin for reloading modules in the hopes that I could make it unload and reload my app to get a fresh start, but it just doesn't seem to be working.
I'm afraid that I might have to resort to launching the app as a subprocess, tinkering with it, then ending the subprocess and relaunching it when I need to guarantee fresh state. Are there any test frameworks out there that would make this easy, particularly for pygobject code?
Would a mocking framework help you isolate the callbacks? That way, you should be able to get back to the same state as when you started. Note that a setUp() and tearDown() pattern may help you there as well - but I am kind of assuming that you are already using that.
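For example, something along these lines might work, assuming the app registers its timer via GLib.timeout_add_seconds imported in a module I'll call myapp.document (all names here are placeholders, not your actual API):

import unittest
from unittest import mock

import myapp.document as document  # placeholder module that registers the timer

class DocumentTests(unittest.TestCase):
    def setUp(self):
        # Replace the real GLib timer so no callback outlives this test;
        # the mock also records what was registered.
        patcher = mock.patch.object(document.GLib, "timeout_add_seconds")
        self.timeout_add = patcher.start()
        self.addCleanup(patcher.stop)

    def test_modification_registers_autosave_timer(self):
        doc = document.Document()  # placeholder API
        doc.modify()
        self.assertTrue(self.timeout_add.called)
        # If a test needs the timer behaviour, fire the captured callback by hand.
        _interval, callback = self.timeout_add.call_args[0][:2]
        callback()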
The project I'm working on is a business logic software wrapped up as a Python package. The idea is that various script or application will import it, initialize it, then use it.
It currently has a top-level init() method that does the initialization and sets up various things; a good example is that it sets up SQLAlchemy with a db connection and stores the SA session for later access. It is stored in a subpackage of my project (namely myproj.model.Session), so other code can get a working SA session after importing the model.
Long story short, this makes my package a stateful one. I'm writing unit tests for the project and this stateful behaviour poses some problems:
tests should be isolated, but the internal state of my package breaks this isolation
I cannot test the main init() method since its behavior depends on the state
future tests will need to be run against the (not yet written) controller part with a well-known model state (e.g. a pre-populated SQLite in-memory db)
Should I somehow refactor my package because the current structure is not the Best (possible) Practice(tm)? :)
Should I leave it at that and set up/tear down the whole thing every time? If I'm going to achieve complete isolation, that'd mean fully erasing and re-populating the db for every single test - isn't that overkill?
This question is really on the overall code & tests structure, but for what it's worth I'm using nose-1.0 for my tests. I know the Isolate plugin could probably help me but I'd like to get the code right before doing strange things in the test suite.
You have a few options:
Mock the database
There are a few trade offs to be aware of.
Your tests will become more complex, as you will have to do the setup, teardown and mocking of the connection. You may also want to verify the SQL/commands sent. It also tends to create an odd sort of tight coupling, which may cause you to spend additional time maintaining/updating tests when the schema or SQL changes.
This is usually the purest form of test isolation, because it removes a potentially large dependency from testing. It also tends to make tests faster and reduces the overhead of automating the test suite in, say, a continuous integration environment.
Recreate the DB with each Test
Trade offs to be aware of.
This can make your tests very slow, depending on how much time it actually takes to recreate your database. If the dev database server is a shared resource, there will have to be an additional initial investment in making sure each dev has their own db on the server. The server may become impacted depending on how often the tests get run. There is also additional overhead to running your test suite in a continuous integration environment, because it will need at least one db, possibly more (depending on how many branches are being built simultaneously).
The benefit has to do with actually running through the same code paths and similar resources that will be used in production. This usually helps to reveal bugs earlier which is always a very good thing.
ORM DB swap
If you're using an ORM like SQLAlchemy, there is a possibility that you can swap the underlying database for a potentially faster in-memory database. This lets you mitigate some of the negatives of both the previous options.
It's not quite the same database as will be used in production, but the ORM should help mitigate the risk that this obscures a bug. Typically the time to set up an in-memory database is much shorter than for one which is file-backed. It also has the benefit of being isolated to the current test run, so you don't have to worry about shared resource management or final teardown/cleanup.
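A minimal sketch of that swap for the setup described in the question, assuming the package exposes a declarative Base alongside myproj.model.Session (the Base name is an assumption):

import unittest

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from myproj import model  # exposes model.Session per the question; model.Base is assumed

class ModelTestCase(unittest.TestCase):
    def setUp(self):
        # Fresh, private in-memory database per test: isolated and fast.
        self.engine = create_engine("sqlite:///:memory:")
        model.Base.metadata.create_all(self.engine)
        model.Session = sessionmaker(bind=self.engine)()

    def tearDown(self):
        model.Session.close()
        self.engine.dispose()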
Working on a project with a relatively expensive setup (IPython), I've seen an approach used where we call a get_ipython function, which sets up and returns an instance, while replacing itself with a function which returns a reference to the existing instance. Then every test can call the same function, but it only does the setup for the first one.
That saves doing a long setup procedure for every test, but occasionally it creates odd cases where a test fails or passes depending on what tests were run before. We have ways of dealing with that - a lot of the tests should do the same thing regardless of the state, and we can try to reset the object's state before certain tests. You might find a similar trade-off works for you.
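A toy version of that self-replacing accessor pattern; ExpensiveService is a stand-in for whatever is slow to build:

class ExpensiveService(object):
    def __init__(self):
        print("expensive one-time setup runs here")

def get_service():
    # First call pays for the setup, then rebinds this module-level name
    # so every later call is just a cheap lookup of the same instance.
    global get_service
    service = ExpensiveService()
    get_service = lambda: service
    return service

first = get_service()
second = get_service()
assert first is second  # setup ran exactly once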
Mock is a simple and powerful tool to achieve some isolation. There is a nice video from PyCon 2011 which shows how to use it. I recommend using it together with py.test, which reduces the amount of code required to define tests and is still very, very powerful.
Our Python application (a cool web service) has a full suite of tests (unit tests, integration tests etc.) that all developers must run before committing code.
I want to add some performance tests to the suite to make sure no one adds code that makes us run too slow (for some rather arbitrary definition of slow).
Obviously, I can collect some functionality into a test, time it and compare to some predefined threshold.
The tricky requirements:
I want every developer to be able to test the code on their own machine (machines vary in CPU power, OS (Linux and some Windows!) and external configuration - the Python version, libraries and modules are the same). A test server, while generally a good idea, does not solve this.
I want the test to be DETERMINISTIC - regardless of what is happening on the machine running the tests, I want multiple runs of the test to return the same results.
My preliminary thoughts:
Use timeit and do a benchmark of the system every time I run the tests. Compare the performance test results to the benchmark.
Use cProfile to instrument the interpreter to ignore "outside noise". I'm not sure I know how to read the pstats structure yet, but I'm sure it is doable.
Other thoughts?
Thanks!
Tal.
Check out funkload - it's a way of running your unit tests as either functional or load tests to gauge how well your site is performing.
Another interesting project which can be used in conjunction with funkload is codespeed. This is an internal dashboard that measures the "speed" of your codebase for every commit you make to your code, presenting graphs with trends over time. This assumes you have a number of automatic benchmarks you can run - but it could be a useful way to have an authoritative account of performance over time. The best use of codespeed I've seen so far is the speed.pypy.org site.
As to your requirement for determinism - perhaps the best approach to that is to use statistics to your advantage? Automatically run the test N times, produce the min, max, average and standard deviation of all your runs? Check out this article on benchmarking for some pointers on this.
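A small sketch of that approach with timeit plus the statistics module (the threshold and the benchmarked function are made-up examples):

import statistics
import timeit

def bench(func, repeats=10, number=100):
    # Each sample times `number` calls; taking several samples and
    # summarising them beats trusting a single noisy measurement.
    samples = timeit.repeat(func, repeat=repeats, number=number)
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
    }

def workload():
    sorted(range(10000))  # placeholder for the code path under test

stats = bench(workload)
# Gate on the minimum: it is the sample least affected by background noise.
assert stats["min"] < 0.5, stats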
I want the test to be DETERMINISTIC - regardless of what is happening on the machine running the tests, I want multiple runs of the test to return the same results.
Fail. More or less by definition this is utterly impossible in a multi-processing system with multiple users.
Either rethink this requirement or find a new environment in which to run tests that doesn't involve any of the modern multi-processing operating systems.
Further, your running web application is not deterministic, so imposing some kind of "deterministic" performance testing doesn't help much.
When we did time-critical processing (in radar, where "real time" actually meant real time) we did not attempt deterministic testing. We did code inspections and ran simple performance tests that involved simple averages and maximums.
Use cProfile to instrument the interpreter to ignore "outside noise". I'm not sure I know how to read the pstats structure yet, but I'm sure it is doable.
The Stats object created by the profiler is what you're looking for.
http://docs.python.org/library/profile.html#the-stats-class
Focus on 'pcalls', primitive call count, in the profile statistics and you'll have something that's approximately deterministic.
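A small sketch of reading that out of the Stats object (the call-count budget below is an arbitrary example, and the workload is a placeholder):

import cProfile
import pstats

def primitive_calls(func, *args, **kwargs):
    # Profile a single invocation and sum the primitive (non-recursive)
    # call counts, which barely vary between runs or machines.
    prof = cProfile.Profile()
    prof.runcall(func, *args, **kwargs)
    stats = pstats.Stats(prof)
    # stats.stats maps function -> (cc, nc, tt, ct, callers); cc is the primitive count.
    return sum(cc for cc, nc, tt, ct, callers in stats.stats.values())

def workload():
    return [str(i) for i in range(1000)]  # placeholder for the code under test

# A budget expressed in calls instead of seconds is roughly deterministic.
assert primitive_calls(workload) < 10000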