My context for this question is in Python.
Hypothesis Testing Library (i.e. Property Based Testing):
https://hypothesis.readthedocs.io/en/latest/
Mutation Testing Library:
https://github.com/sixty-north/cosmic-ray
These are very different beasts, but both improve the value and quality of your tests, and both make a statement like "my code coverage is N%" more meaningful.
Hypothesis helps you generate all sorts of test inputs within a defined scope for a function under test.
Usually, when you test a function, you provide multiple example values chosen to cover the use cases and edge cases highlighted by code coverage reports - so-called "example-based testing". Hypothesis, on the other hand, implements property-based testing: it generates a whole range of inputs and input combinations, which helps catch common errors like division by zero, None, 0, and off-by-one mistakes, and helps uncover hidden bugs.
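For instance, a minimal Hypothesis property test might look like this sketch, where sorted_merge is a made-up function standing in for your code under test:

from hypothesis import given, strategies as st

def sorted_merge(xs, ys):
    # Stand-in implementation for illustration; yours would be the real code under test.
    return sorted(xs + ys)

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_merge_keeps_all_elements_sorted(xs, ys):
    merged = sorted_merge(xs, ys)
    assert merged == sorted(merged)              # output is ordered
    assert sorted(merged) == sorted(xs + ys)     # no element is lost or invented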
Mutation testing is all about modifying your code under test on the fly and then running your test suite against each modified version.
This really helps you see whether your tests are actually testing what they are supposed to test, and to understand their value. Mutation testing shines when you already have a rich test suite and good code coverage.
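To make that concrete, here is the kind of change a mutation tool applies automatically (an illustrative snippet, not output from cosmic-ray or any specific tool):

def price_with_discount(price, qty):
    if qty > 10:          # a typical mutant flips this to `qty >= 10` or `qty < 10`
        return price * 0.9
    return price

# The tool reruns your whole suite against each mutant. If every test still
# passes for the `qty >= 10` version, nothing pins down the boundary at 10,
# and the surviving mutant tells you which test is missing.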
What helped me to get ahold of these concepts were these Python Podcasts:
Property-based Testing with Hypothesis
Validating Python tests with mutation testing
Hypothesis with David MacIver
I'm the author of mutmut, (imo) the best mutation tester for Python. @alecxe has a very good answer, but I would like to expand on it. Read his answer before mine for basic context.
There are some other big differences. For example, PBT requires mental work to specify the rules for each function under test, while MT just requires you to justify all the behavior in the code, which takes much less cognitive effort.
MT is effectively white box and PBT black box.
Another difference is that MT is the exploration of a (fairly small) finite space, while PBT is an exploration of a practically infinite space. A practical consequence is that you can trivially know when you are done with MT, whereas a PBT run could go on for years and you still couldn't know whether it had searched the relevant parts of the space. Better rules for PBT radically cut the run time for this reason.
Mutation testing also forces minimal code. This is a surprising effect, but it's something I have experienced again and again. This is a nice little bonus for MT.
You can also use MT as a simple checklist for working your way up to 100% mutation coverage; you don't need to start with 100% line coverage, not at all. With PBT you can likewise start way below 100% coverage, in essence at 0%.
I hope this clarifies the situation a bit more.
I've been playing around with GEKKO for solving flow optimizations and I have come across behavior that is confusing me.
Context:
Sources --> [mixing and delivery] --> Sinks
I have multiple sources (where my flow is coming from) and multiple sinks (where my flow goes to). For a given source (e.g., SOURCE_1), the total flow to the resulting sinks must equal the volume from SOURCE_1. This is my idea of conservation of mass, where the 'mixing' plant blends all the source volumes together.
Constraint Example (DOES NOT WORK AS INTENDED):
When I try to create a constraint for the two SINK volumes, and the one SOURCE volume:
m.Equation(volume_sink_1[i] + volume_sink_2[i] == max_volumes_for_source_1)
I end up with weird results. By that I mean it's not actually optimal; it assigns values very poorly. I am off from the optimum by at least 10% (I tried different max volumes).
Constraint Example (WORKS BUT I DON'T GET WHY):
When I try to create a constraint for the two SINK volumes, and the one SOURCE volume like this:
m.Equation(volume_sink_1[i] + volume_sink_2[i] <= max_volumes_for_source_1 * 0.999999)
With this, I get MUCH closer to the actual optimum, to the point where I can just treat it as the optimum. Please note that I had to change it to a less-than-or-equal constraint and also multiply by 0.999999, which I arrived at after messing around with it nonstop.
Also, please note that this uses practically all of the source (up to 99.9999% of it) as I would expect. So both formulations make sense to me but the first approach doesn't work.
The only thing I can think of for this behavior is that it's stricter to solve for == than <=. That doesn't explain to me why I have to multiply by 0.999999 though.
Why is this the case? Also, is there a way for me to debug occurrences like this easier?
This same improvement occurs with complementarity constraints for conditional statements when using s1*s2<=0 (easier to solve) versus s1*s2==0 (harder to solve).
From the research papers I've seen, the justification is that the solver has more room to search for the optimal solution, even if it always ends up at s1*s2==0. It also sounds like your problem may have multiple local minima, if it converges to a solution but that solution isn't the global optimum.
If you can post a complete and minimal problem that demonstrates the issue, we can give more specific suggestions.
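For example, a stripped-down skeleton along these lines (all capacities and prices are made-up numbers, not the asker's data) would be enough to compare the two formulations side by side:

from gekko import GEKKO

m = GEKKO(remote=False)
max_volume_source_1 = 100.0            # assumed capacity of SOURCE_1

volume_sink_1 = m.Var(lb=0)            # flow delivered to SINK_1
volume_sink_2 = m.Var(lb=0)            # flow delivered to SINK_2

# Hard mass balance (the formulation that behaved poorly):
# m.Equation(volume_sink_1 + volume_sink_2 == max_volume_source_1)

# Relaxed formulation that gives the solver room to move through the interior:
m.Equation(volume_sink_1 + volume_sink_2 <= max_volume_source_1)

# Made-up prices, just to give the solver something to maximize.
m.Maximize(3.0 * volume_sink_1 + 2.0 * volume_sink_2)

m.solve(disp=False)
print(volume_sink_1.value[0], volume_sink_2.value[0])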
Is there a Python library out there that solves for the Nash equilibrium of two-person zero-sum games? I know the solution can be written down in terms of linear constraints and, in theory, scipy should be able to optimize it. However, for two-person zero-sum games the solution is exact and unique, but some of the solvers fail to converge for certain problems.
Rather than going through every linear programming library listed on the Python website, I would like to know which library would be most effective in terms of ease of use and speed.
Raymond Hettinger wrote a recipe for solving zero-sum payoff matrices. It should serve your purposes alright.
As for a more general library for solving game theory, there's nothing specifically designed for that. But, like you said, scipy can tackle optimization problems like this. You might be able to do something with GarlicSim, which claims to be for "any kind of simulation: Physics, game theory..." but I've never used it before so I can't recommend it.
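If you do go the scipy route, here is a rough sketch of the linear-programming formulation using scipy.optimize.linprog (the matching-pennies matrix at the end is just an example):

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Row player's optimal mixed strategy and game value for payoff matrix A."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    # Variables: x_1..x_m (strategy probabilities) and v (game value).
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # maximize v == minimize -v
    # For every column j: v - sum_i x_i * A[i, j] <= 0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to 1.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Matching pennies: value 0, optimal strategy (0.5, 0.5)
strategy, value = solve_zero_sum([[1, -1], [-1, 1]])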
There is Gambit, which is a little difficult to set up, but has a python API.
I've just started putting together some game theory python code: http://drvinceknight.github.com/Gamepy/
There's code which:
solves matching games,
calculates shapley values in cooperative games,
runs agent based simulations to identify emergent behaviour in normal form games,
(clumsily - my python foo is still growing) uses the lrs library (written in C: http://cgm.cs.mcgill.ca/~avis/C/lrs.html) to calculate the solutions to normal form games (this is I believe what you want).
The code is all available on github and that site (the first link at the beginning of this answer) explains how the code works and gives user examples.
You might also want to check out 'Gambit' which I've never used.
I have to return a random entry from my database.
I wrote a function, and since I'm using the random module in Python, it's probably fine unless I used it in a stupid way.
Now, how can I write a unit test that checks that this function works? After all, with a good random value you can never know for sure.
I'm not paranoid; my function is not that complex, and the Python standard library is 1000 times good enough for my purpose. I'm not doing cryptography or anything critical. I'm just curious to know if there is a way.
There are several statistical tests listed on RANDOM.ORG for testing randomness. See the last two sections of the linked article.
Also, if you can get a copy of Beautiful Testing there's a whole chapter by John D. Cook called Testing a Random Number Generator. He explains a lot of the statistical methods listed in the article above. If you really want to learn about RNGs, that chapter is a really good starting point. I've written about the subject myself, but John does a much better job of explaining it.
You cannot really tell (see cartoon).
However, you can measure the entropy of your generated sample and test it against the entropy you would expect. As mentioned before, random.org provides some pretty clever tests.
You could have the unit test call the function multiple times and make sure that the number of collisions is reasonably low. E.g. if your random result is in the range 1-1000000, call the function 100 times and record the results; then check if there are duplicates. If there are any (or more than 1 collision, depending on how afraid you are of a false test failure) the test fails.
Obviously not perfect, but it will catch it if your random number generator is from Dilbert:
http://www.random.org/analysis/
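A minimal sketch of that idea (get_random_id is a stand-in for your function, assumed to return integers in 1..1,000,000):

import random
import unittest

def get_random_id():
    """Stand-in for the function under test; assumed to return ints in 1..1_000_000."""
    return random.randint(1, 1_000_000)

class RandomnessSmokeTest(unittest.TestCase):
    def test_few_collisions(self):
        results = [get_random_id() for _ in range(100)]
        # With 100 draws from a million values, duplicates are extremely unlikely,
        # so more than one collision points at a broken generator.
        collisions = len(results) - len(set(results))
        self.assertLessEqual(collisions, 1)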
You've got two entangled issues. The first issue is testing that your random selection works. Seeding your PRNG allows you to write a test that's deterministic and that you can assert about. This should give you confidence about your code, given that the underlying functions live up to their responsibilities (i.e. random returns you a good-enough stream of random values).
The second issue you seem to be concerned about is Python's random functions. You want to separate the concerns of your code from the concern about the random function itself. There are a number of randomness tests that you can read about, but at the end of the day, unless you're doing crypto, I'd trust the Python developers to have gotten it right enough.
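A small sketch of the seeding idea; pick_random_entry is a hypothetical stand-in for your selection function:

import random

def pick_random_entry(entries):
    """Stand-in for the function under test; selects one entry using the random module."""
    return entries[random.randrange(len(entries))]

def test_selection_is_deterministic_under_a_fixed_seed():
    random.seed(42)
    first = pick_random_entry(["a", "b", "c"])
    random.seed(42)
    second = pick_random_entry(["a", "b", "c"])
    assert first == second            # same seed, same draw, same result
    assert first in {"a", "b", "c"}   # and the result is a valid entry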
In addition to the previous answers, you can also mock the random function (for example with the mock or mox library) and return a predefined sequence of values for which you know the results. This wouldn't be a true test for all cases, but it lets you cover some corner cases, and in some situations such tests are perfectly reasonable.
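For example, a sketch with unittest.mock, assuming the function under test calls random.choice via the random module (pick_random_entry is again a stand-in):

import random
from unittest import mock

def pick_random_entry(entries):
    """Stand-in for the function under test; assumed to call random.choice internally."""
    return random.choice(entries)

def test_returns_whatever_the_rng_picks():
    with mock.patch("random.choice", return_value="b"):
        assert pick_random_entry(["a", "b", "c"]) == "b"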
What are some ways of testing complex data types such as video, images, music, etc.? I'm using TDD and wonder whether there are alternatives to "gold file" testing for rendering algorithms. I understand that there are ways to test the parts of the program that don't render, and that from those results you can infer a lot. However, I'm particularly interested in rendering algorithms, specifically image/video testing.
The question came up while I was using OpenCV/Python to do some basic facial recognition and want to verify its correctness.
Even if there's nothing definitive, any suggestion will help.
The idea behind testing rendering is quite simple: to test a function, use the inverse function and check whether the input and output match (where "match" does not mean equality in your case):
f(f^-1(x)) = x
To test a rendering algorithm you would encode the raw input, render the encoded values, and analyze the difference between the rendered output and the raw input. One problem is obtaining the raw input when encoding/decoding random data is not appropriate. Another challenge is evaluating the differences between the raw input and the rendered output. I suppose if you're writing rendering software you should be able to do a frequency analysis on the data. (Some transformation should pop into your head now.)
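As a rough sketch of that round trip (encode and render are trivial stand-ins here; a real codec would be lossy, which is why the comparison uses a tolerance metric rather than equality):

import numpy as np

def encode(frame):
    """Stand-in for your encoder; a real one would be lossy."""
    return frame.copy()

def render(encoded):
    """Stand-in for your renderer/decoder."""
    return encoded

def test_round_trip_is_close_enough():
    rng = np.random.default_rng(0)
    raw = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    rendered = render(encode(raw))
    # "Match" is not equality: compare with a tolerance metric such as PSNR.
    mse = np.mean((raw.astype(float) - rendered.astype(float)) ** 2)
    psnr = np.inf if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)
    assert psnr > 30.0   # the threshold is an assumption; tune it for your codec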
If at all possible, generate your test data. Test fixtures are a real maintenance problem; they only shine in the beginning. If they change in some way, everything breaks down. The main problem is that if you're using a fixture, your tests end up repeating the fixture's content, which makes the intent of your tests harder to interpret. If there is a magic value in your test, what is the significant part of that value?
Fixture:
actual = parse("file.xml")
expected = "magic value"
assert actual == expected
Generated values:
expected = generate()
rendered = render(expected)
actual = parse(rendered)
assert actual == expected
The nice thing about generators is that you can build quite complex object graphs with them, starting from primitive types and fields (see the Python ports of QuickCheck).
Generator-based tests are not deterministic by nature, but given enough trials they follow the law of large numbers.
Their additional value is that they produce good coverage of the test value range (which is hard to achieve with fixtures), and they will find unanticipated bugs in your code.
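As a concrete sketch of that style, here is a generated round-trip test written with Hypothesis, using json.dumps/json.loads as stand-ins for your render/parse pair:

import json
from hypothesis import given, strategies as st

@given(st.dictionaries(st.text(), st.integers()))
def test_parse_inverts_render(expected):
    rendered = json.dumps(expected)   # "render" step
    actual = json.loads(rendered)     # "parse" step
    assert actual == expected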
An alternative approach is to test against an equivalent function:
f(x) = f'(x)
For example, you may have a reference rendering function to compare against. This kind of test is useful if you already have a working function: that function is your benchmark. It cannot be used in production because it is too slow or uses too much memory, but it can easily be debugged or proven correct.
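A tiny sketch of that benchmark idea, with a deliberately simple pair of functions standing in for your slow reference and fast production implementations:

import numpy as np

def reference_render(samples):
    """Slow but trusted version: plain Python loop."""
    return [2 * s + 1 for s in samples]

def fast_render(samples):
    """Optimized version under test: vectorized with NumPy."""
    return (2 * np.asarray(samples) + 1).tolist()

def test_fast_matches_reference():
    samples = list(range(100))
    assert fast_render(samples) == reference_render(samples)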
What's wrong with the "gold file" technique? It's part of your test fixture. Every test has a data fixture that's the equivalent of the "gold file" in a media-intensive application.
When doing ordinary TDD of ordinary business applications, one often has a golden database fixture that must be used.
Even when testing simple functions and core classes of an application, the setUp method creates a kind of "gold file" fixture for that class or function.
What's wrong with that technique? Please update your question with the specific problems you're having.
I know similar questions have been asked before but they don't really have the information I'm looking for - I'm not asking about the mechanics of how to generate unit tests, but whether it's a good idea.
I've written a module in Python which contains objects representing physical constants and units of measurement. A lot of the units are formed by adding on prefixes to base units - e.g. from m I get cm, dm, mm, hm, um, nm, pm, etc. And the same for s, g, C, etc. Of course I've written a function to do this since the end result is over 1000 individual units and it would be a major pain to write them all out by hand ;-) It works something like this (not the actual code):
def add_unit(name, value):
    globals()[name] = value
    for pfx, multiplier in prefixes:
        globals()[pfx + name] = multiplier * value
add_unit('m', <definition of a meter>)
add_unit('g', <definition of a gram>)
add_unit('s', <definition of a second>)
# etc.
The problem comes in when I want to write unit tests for these units (no pun intended), to make sure they all have the right values. If I write code that automatically generates a test case for every unit individually, any problems that are in the unit generation function are likely to also show up in the test generation function. But given the alternative (writing out all 1000+ tests by hand), should I just go ahead and write a test generation function anyway, check it really carefully and hope it works properly? Or should I only test, say, one series of units (m, cm, dm, km, nm, um, and all other multiples of the meter), just enough to make sure the unit generation function seems to be working? Or something else?
You're right to identify the weakness of automatically generating test cases. The usefulness of a test comes from taking two different paths (your code, and your own mental reasoning) to come up with what should be the same answer -- if you use the same path both times, nothing is being tested.
In summary: Never write automatically generated tests, unless the algorithm for generating the test results is dramatically simpler than the algorithm that you are testing. (Testing of a sorting algorithm is an example of when automatically generated tests would be a good idea, since it's easy to verify that a list of numbers is in sorted order. Another good example would be a puzzle-solving program as suggested by ChrisW in a comment. In both cases, auto-generation makes sense because it is much easier to verify that a given solution is correct than to generate a correct solution.)
My suggestion for your case: Manually test a small, representative subset of the possibilities.
[Clarification: Certain types of automated tests are appropriate and highly useful, e.g. fuzzing. I mean that it is unhelpful to auto-generate unit tests for generated code.]
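For the units module in the question, a hand-written subset might look like the sketch below, assuming the generated names (m, cm, mm, km, g, mg, ...) live in a module called units and each unit's value is a float multiple of its base unit:

import units

def test_metre_prefixes_by_hand():
    assert units.cm == 0.01 * units.m
    assert units.mm == 0.001 * units.m
    assert units.km == 1000 * units.m

def test_a_prefix_on_a_different_base_unit():
    # One spot check on another base unit guards against the generator only
    # working for the first unit it was written around.
    assert units.mg == 0.001 * units.g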
If you auto-generate the tests:
You might find it faster to read all the generated tests (to inspect them for correctness) than it would have been to write them all by hand.
They might also be more maintainable (easier to edit, if you want to edit them later).
I would say the best approach is to unit test the generation itself: take a small sample of generated results (just enough that each test covers something significantly different from the other scenarios) and put those under unit tests to make sure the generation is working correctly. Beyond that, there is little unit-test value in defining every scenario in an automated way. There may be functional-test value in writing some functional tests that exercise the generated code for whatever purpose you have in mind, to give wider coverage of the various potential units.
Write just enough tests to make sure that your code generation works right (just enough to drive the design of the imperative code). Declarative code rarely breaks, and you should only test things that can break. Mistakes in declarative code (such as your case, or, for example, user interface layouts) are better found with exploratory testing, so writing extensive automated tests for them is a waste of time.