I know similar questions have been asked before but they don't really have the information I'm looking for - I'm not asking about the mechanics of how to generate unit tests, but whether it's a good idea.
I've written a module in Python which contains objects representing physical constants and units of measurement. A lot of the units are formed by adding on prefixes to base units - e.g. from m I get cm, dm, mm, hm, um, nm, pm, etc. And the same for s, g, C, etc. Of course I've written a function to do this since the end result is over 1000 individual units and it would be a major pain to write them all out by hand ;-) It works something like this (not the actual code):
def add_unit(name, value):
    globals()[name] = value
    for pfx, multiplier in prefixes:
        globals()[pfx + name] = multiplier * value

add_unit('m', <definition of a meter>)
add_unit('g', <definition of a gram>)
add_unit('s', <definition of a second>)
# etc.
The problem comes in when I want to write unit tests for these units (no pun intended), to make sure they all have the right values. If I write code that automatically generates a test case for every unit individually, any problems that are in the unit generation function are likely to also show up in the test generation function. But given the alternative (writing out all 1000+ tests by hand), should I just go ahead and write a test generation function anyway, check it really carefully and hope it works properly? Or should I only test, say, one series of units (m, cm, dm, km, nm, um, and all other multiples of the meter), just enough to make sure the unit generation function seems to be working? Or something else?
You're right to identify the weakness of automatically generating test cases. The usefulness of a test comes from taking two different paths (your code, and your own mental reasoning) to come up with what should be the same answer -- if you use the same path both times, nothing is being tested.
In summary: Never write automatically generated tests, unless the algorithm for generating the test results is dramatically simpler than the algorithm that you are testing. (Testing of a sorting algorithm is an example of when automatically generated tests would be a good idea, since it's easy to verify that a list of numbers is in sorted order. Another good example would be a puzzle-solving program as suggested by ChrisW in a comment. In both cases, auto-generation makes sense because it is much easier to verify that a given solution is correct than to generate a correct solution.)
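Sorting is a good illustration of that exception (a minimal sketch; my_sort stands in for whatever sorting function is under test): the checking code is a one-liner, dramatically simpler than any sorting algorithm, so generating random inputs is safe.

import random
from collections import Counter

my_sort = sorted  # stand-in for the hypothetical sorting function under test

def is_sorted(xs):
    # Verifying order is trivially simpler than producing it.
    return all(a <= b for a, b in zip(xs, xs[1:]))

def test_sort_on_generated_inputs():
    for _ in range(100):
        data = [random.randint(-1000, 1000) for _ in range(50)]
        result = my_sort(data)
        assert is_sorted(result)                 # output is ordered
        assert Counter(result) == Counter(data)  # and is a permutation of the input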
My suggestion for your case: Manually test a small, representative subset of the possibilities.
[Clarification: Certain types of automated tests are appropriate and highly useful, e.g. fuzzing. I mean that it is unhelpful to auto-generate unit tests for generated code.]
If you auto-generate the tests:
You might find it faster to read all the tests (to inspect them for correctness) than it would have been to write them all by hand.
They might also be more maintainable (easier to edit, if you want to edit them later).
I would say the best approach is to unit test the generation. As part of that, you might take a sample of generated results (only enough that each test covers a scenario you consider significantly different from the others) and put that sample under a unit test to make sure that the generation is working correctly. Beyond that, there is little unit-test value in defining every scenario in an automated way. There may be functional-test value in putting together some functional tests which exercise the generated code for whatever purpose you have in mind, in order to give wider coverage to the various potential units.
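For instance, a parametrized test over one representative series plus a few spot checks could look like this (a sketch: it assumes the module is called units, that it exposes the generated names as module attributes, and that unit objects support equality and scalar multiplication, as the add_unit sketch implies):

import pytest
import units  # hypothetical module that calls add_unit() at import time

# One representative series (metres) checked against its base unit.
@pytest.mark.parametrize("name, factor", [
    ("km", 1000),
    ("m", 1),
    ("cm", 0.01),
    ("mm", 0.001),
    ("um", 1e-6),
    ("nm", 1e-9),
])
def test_metre_series(name, factor):
    assert getattr(units, name) == factor * units.m

# A couple of spot checks from other series.
@pytest.mark.parametrize("name, base, factor", [
    ("mg", "g", 0.001),
    ("ms", "s", 0.001),
])
def test_other_series_spot_checks(name, base, factor):
    assert getattr(units, name) == factor * getattr(units, base)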
Write only just enough tests to make sure that your code generation works right (just enough to drive the design of the imperative code). Declarative code rarely breaks. You should only test things that can break. Mistakes in declarative code (such as your case, or user interface layouts, for example) are better found with exploratory testing, so writing extensive automated tests for them is a waste of time.
Related
I've been playing around with GEKKO for solving flow optimizations and I have come across behavior that is confusing me.
Context:
Sources --> [mixing and delivery] --> Sinks
I have multiple sources (where my flow is coming from), and multiple sinks (where my flow goes to). For a given source (e.g., SOURCE_1), the total flow to the resulting sinks must equal the volume from SOURCE_1. This is my idea of conservation of mass, where the 'mixing' plant blends all the source volumes together.
Constraint Example (DOES NOT WORK AS INTENDED):
When I try to create a constraint for the two SINK volumes, and the one SOURCE volume:
m.Equation(volume_sink_1[i] + volume_sink_2[i] == max_volumes_for_source_1)
I end up with weird results. By that I mean it's not actually optimal; it assigns values very poorly, and I am off from the optimum by at least 10% (I tried different max volumes).
Constraint Example (WORKS BUT I DON'T GET WHY):
When I try to create a constraint for the two SINK volumes, and the one SOURCE volume like this:
m.Equation(volume_sink_1[i] + volume_sink_2[i] <= max_volumes_for_source_1 * 0.999999)
With this, I get MUCH closer to the actual optimum, to the point where I can just treat it as the optimum. Please note that I had to change it to a less-than-or-equal constraint and also multiply by 0.999999, which I arrived at by trial and error.
Also, please note that this uses practically all of the source (up to 99.9999% of it) as I would expect. So both formulations make sense to me but the first approach doesn't work.
The only thing I can think of for this behavior is that it's stricter to solve for == than <=. That doesn't explain to me why I have to multiply by 0.999999 though.
Why is this the case? Also, is there a way for me to debug occurrences like this easier?
This same improvement occurs with complementarity constraints for conditional statements, when using s1*s2<=0 (easier to solve) versus s1*s2==0 (harder to solve).
From the research papers I've seen, the justification is that the solver has more room to search for the optimal solution, even if it always ends up at s1*s2==0. It sounds like your problem may have multiple local minima as well, if it converges to a solution but that solution isn't the global optimum.
If you can post a complete and minimal problem that demonstrates the issue, we can give more specific suggestions.
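For what it's worth, a tiny made-up model along these lines (not your model; the variables, bounds, and objective are invented) makes it easy to flip between the two formulations and compare the solver's behavior. On this small linear toy both versions should agree; the difference you describe typically only shows up in larger nonlinear models.

from gekko import GEKKO

def solve_split(relaxed):
    m = GEKKO(remote=False)
    max_volume_source_1 = 100.0
    sink_1 = m.Var(value=0, lb=0)
    sink_2 = m.Var(value=0, lb=0)
    if relaxed:
        # Inequality form: gives the solver interior room to move.
        m.Equation(sink_1 + sink_2 <= max_volume_source_1)
    else:
        # Strict mass balance: equality constraint.
        m.Equation(sink_1 + sink_2 == max_volume_source_1)
    # Invented objective: sink_1 is worth more per unit volume than sink_2.
    m.Obj(-(2.0 * sink_1 + 1.5 * sink_2))
    m.solve(disp=False)
    return sink_1.value[0], sink_2.value[0]

print("equality:  ", solve_split(relaxed=False))
print("inequality:", solve_split(relaxed=True))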
My context for this question is in Python.
Hypothesis, a property-based testing library:
https://hypothesis.readthedocs.io/en/latest/
Mutation Testing Library:
https://github.com/sixty-north/cosmic-ray
These are very different beasts, but both would improve the value and quality of your tests, and both make the "My code coverage is N%" statement more meaningful.
Hypothesis would help you to generate all sorts of test inputs in the defined scope for a function under test.
Usually, when you need to test a function, you provide multiple example values, trying to cover all the use cases and edge cases, guided by code coverage reports - this is so-called "example-based testing". Hypothesis, on the other hand, implements property-based testing: it generates a whole range of inputs and input combinations, which helps to catch common errors like division by zero, None, 0, and off-by-one errors, and to find hidden bugs.
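For example (a sketch; the run-length encoder is purely illustrative and close to the one in the Hypothesis documentation), you state a property that should hold for any input and let Hypothesis hunt for inputs that break it:

from hypothesis import given, strategies as st

def encode(s):
    # Naive run-length encoder: the illustrative function under test.
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1][1] += 1
        else:
            out.append([ch, 1])
    return out

def decode(pairs):
    return "".join(ch * count for ch, count in pairs)

@given(st.text())
def test_round_trip(s):
    # Property: decoding an encoding always gives back the original string.
    assert decode(encode(s)) == s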
Mutation testing is all about changing your code under test on the fly while executing your tests against a modified version of your code.
This really helps you see whether your tests are actually testing what they are supposed to be testing, and to understand the value of your tests. Mutation testing really shines if you already have a rich test code base and good code coverage.
What helped me to get ahold of these concepts were these Python Podcasts:
Property-based Testing with Hypothesis
Validating Python tests with mutation testing
Hypothesis with David MacIver
I'm the author of mutmut, the (imo) best mutation tester for Python. @alecxe has a very good answer, but I would like to expand on it. Read his answer before mine for basic context.
There are some other big differences: PBT requires mental work to specify the rules for each function under test, while MT requires you to justify all behavior in the code, which takes much less cognitive effort.
MT is effectively white box and PBT black box.
Another difference is that MT is the exploration of a (fairly small) finite space, while PBT is an exploration of an infinite space (practically speaking). A practical consequence is that you can trivially know when you are done with MT, while you can have a PBT run going for years and you can't know if it has searched the relevant parts of the space. Better rules for PBT radically cut the run time for this reason.
Mutation testing also forces minimal code. This is a surprising effect, but it's something I have experienced again and again. This is a nice little bonus for MT.
You can also use MT as a simple checklist to get to 100% mutation coverage; you don't need to start with 100% test coverage at all. But with PBT you can start even further below 100% coverage, in essence at 0%.
I hope this clarifies the situation a bit more.
I would like to compare different methods of finding roots of functions in Python (like Newton's method or other simple calculus-based methods). I don't think I will have too much trouble writing the algorithms.
What would be a good way to make the actual comparison? I read up a little bit about Big-O. Would this be the way to go?
The answer from @sarnold is right - it doesn't make sense to do a Big-O analysis.
The principal differences between root finding algorithms are:
rate of convergence (number of iterations)
computational effort per iteration
what is required as input (e.g. do you need to know the first derivative, do you need to set lo/hi limits for bisection, etc.)
what functions it works well on (e.g. works fine on polynomials but fails on functions with poles)
what assumptions it makes about the function (e.g. a continuous first derivative, or being analytic, etc.)
how simple the method is to implement
I think you will find that each of the methods has some good qualities, some bad qualities, and a set of situations where it is the most appropriate choice.
Big O notation is ideal for expressing the asymptotic behavior of algorithms as the inputs to the algorithms "increase". This is probably not a great measure for root finding algorithms.
Instead, I would think the number of iterations required to bring the actual error below some tolerance ε would be a better measure. Another measure would be the number of iterations required to bring the difference between successive iterates below some ε. (The difference between successive iterates is probably the better choice if you don't have exact root values at hand for your inputs. You would use a criterion such as successive differences to know when to terminate your root finders in practice, so you could or should use it here, too.)
While you can characterize the number of iterations required for different algorithms by the ratios between them (one algorithm may take roughly ten times more iterations to reach the same precision as another), there often isn't "growth" in the iterations as inputs change.
Of course, if your algorithms take more iterations with "larger" inputs, then Big O notation makes sense.
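As a rough sketch of that kind of measurement (the function, starting values, and tolerance are arbitrary), you can count the iterations each method needs before successive iterates agree to within a tolerance:

def newton_iterations(f, df, x0, eps=1e-12, max_iter=100):
    x, n = x0, 0
    while n < max_iter:
        x_new = x - f(x) / df(x)
        n += 1
        if abs(x_new - x) < eps:
            return x_new, n
        x = x_new
    return x, n

def bisection_iterations(f, lo, hi, eps=1e-12, max_iter=200):
    n = 0
    while (hi - lo) > eps and n < max_iter:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
        n += 1
    return (lo + hi) / 2, n

f = lambda x: x * x - 2
df = lambda x: 2 * x
print("newton:   ", newton_iterations(f, df, x0=1.0))        # a handful of iterations
print("bisection:", bisection_iterations(f, lo=0.0, hi=2.0))  # ~40 halvings for the same tolerance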
Big-O notation is designed to describe how an algorithm behaves in the limit, as n goes to infinity. This is much easier to work with in a theoretical study than in a practical experiment. I would pick things to study that you can easily measure and that people care about, such as accuracy and computer resources (time/memory) consumed.
When you write and run a computer program to compare two algorithms, you are performing a scientific experiment, just like somebody who measures the speed of light, or somebody who compares the death rates of smokers and non-smokers, and many of the same factors apply.
Try to choose an example problem or problems to solve that are representative, or at least interesting to you, because your results may not generalise to situations you have not actually tested. You may be able to increase the range of situations to which your results apply if you sample at random from a large set of possible problems and find that all your random samples behave in much the same way, or at least follow much the same trend. You can have unexpected results even when theoretical studies show that there should be a nice n log n trend, because theoretical studies rarely account for suddenly running out of cache, or out of memory, or usually even for things like integer overflow.
Be alert for sources of error, and try to minimise them, or have them apply to the same extent to all the things you are comparing. Of course you want to use exactly the same input data for all of the algorithms you are testing. Make multiple runs of each algorithm, and check how variable things are - perhaps a few runs are slower because the computer was doing something else at the time. Be aware that caching may make later runs of an algorithm faster, especially if you run them immediately after each other; which time you want depends on what you decide you are measuring. If you have a lot of I/O to do, remember that modern operating systems and computers cache huge amounts of disk I/O in memory. I once ended up powering the computer off and on again after every run, as the only way I could find to be sure that the device I/O cache was flushed.
You can get wildly different answers for the same problem just by changing starting points. Pick an initial guess that's close to the root and Newton's method will give you a result that converges quadratically. Choose another in a different part of the problem space and the root finder will diverge wildly.
What does this say about the algorithm? Good or bad?
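Newton's method on arctan(x) = 0 is a classic illustration of this (a small sketch): start near the root and it converges rapidly; start beyond roughly |x0| ≈ 1.39 and every step overshoots further.

import math

def newton_arctan(x0, steps=8):
    # Newton iteration for f(x) = arctan(x): x_{n+1} = x_n - arctan(x_n) * (1 + x_n**2)
    x = x0
    trace = [x]
    for _ in range(steps):
        x = x - math.atan(x) * (1 + x * x)
        trace.append(x)
    return trace

print(newton_arctan(0.5))  # converges quickly to the root at 0
print(newton_arctan(1.5))  # alternates in sign with growing magnitude - divergence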
I would suggest you have a look at the following Python root-finding demo.
It is a simple script, with several different methods and comparisons between them (in terms of rate of convergence).
http://www.math-cs.gordon.edu/courses/mat342/python/findroot.py
I just finished a project comparing the bisection, Newton, and secant root-finding methods. Since this is a practical case, I don't think you need to use Big-O notation; Big-O notation is more suited to an asymptotic view. What you can do is compare them in terms of:
Speed - for example, Newton is the fastest here if good conditions are met
Number of iterations - for example, bisection takes the most iterations here
Accuracy - how often it converges to the right root if there is more than one root, or whether it converges at all
Input - what information it needs to get started; for example, Newton needs an x0 near the root in order to converge, and it also needs the first derivative, which is not always easy to find
Other - rounding errors
For the sake of visualization you can store the value of each iteration in arrays and plot them. Use a function whose roots you already know.
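A minimal sketch of that idea for two of the methods (the target function, starting points, and iteration counts are arbitrary; the secant method can be added the same way, and matplotlib is assumed to be available):

import matplotlib.pyplot as plt

f = lambda x: x**3 - 2           # known root: 2 ** (1/3)
df = lambda x: 3 * x**2
true_root = 2 ** (1 / 3)

def newton_errors(x0, steps=20):
    x, errs = x0, []
    for _ in range(steps):
        x = x - f(x) / df(x)
        errs.append(max(abs(x - true_root), 1e-17))  # floor keeps the log plot defined
    return errs

def bisection_errors(lo, hi, steps=20):
    errs = []
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
        errs.append(max(abs(mid - true_root), 1e-17))
    return errs

plt.semilogy(newton_errors(1.5), label="Newton")
plt.semilogy(bisection_errors(1.0, 2.0), label="bisection")
plt.xlabel("iteration")
plt.ylabel("absolute error")
plt.legend()
plt.show()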
Although this is a very old post, my 2 cents :)
Once you've decided which method you'll use to compare them (your "evaluation protocol", so to speak), you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)
I have to return a random entry from my database.
I wrote a function, and since I'm using the random module in Python, it's probably fine unless I used it in a stupid way.
Now, how can I write a unit test that checks that this function works? After all, if it's a good random value, you can never know for sure.
I'm not paranoid; my function is not that complex, and the Python standard library is 1000 times good enough for my purpose. I'm not doing cryptography or anything critical. I'm just curious to know if there is a way.
There are several statistical tests listed on RANDOM.ORG for testing randomness. See the last two sections of the linked article.
Also, if you can get a copy of Beautiful Testing there's a whole chapter by John D. Cook called Testing a Random Number Generator. He explains a lot of the statistical methods listed in the article above. If you really want to learn about RNGs, that chapter is a really good starting point. I've written about the subject myself, but John does a much better job of explaining it.
You cannot really tell (see cartoon).
However, you can measure the entropy of your generated sample and test it against the entropy you would expect. As mentioned before, random.org has some pretty clever tests.
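As a rough sketch of the entropy idea (pick_random_entry and the key range are placeholders): the empirical Shannon entropy of a large sample should be close to log2(n) when the n keys are equally likely.

import math
import random
from collections import Counter

def empirical_entropy(samples):
    # Shannon entropy (in bits) of the observed frequency distribution.
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def pick_random_entry(keys):
    # Stand-in for the function under test.
    return random.choice(keys)

keys = list(range(100))
sample = [pick_random_entry(keys) for _ in range(100_000)]
print(empirical_entropy(sample), "vs expected", math.log2(len(keys)))
# With 100 equally likely keys, both numbers should be close to 6.64 bits.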
You could have the unit test call the function multiple times and make sure that the number of collisions is reasonably low. E.g. if your random result is in the range 1-1000000, call the function 100 times and record the results; then check whether there are duplicates. If there are any (or more than one collision, depending on how afraid you are of false test failures), the test fails (a sketch follows below).
Obviously not perfect, but it will catch it if your random number is from Dilbert:
http://www.random.org/analysis/
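A hedged sketch of that collision check (pick_random_entry and its range are placeholders for whatever your function actually returns):

import random

def pick_random_entry():
    # Stand-in for the function under test: returns an id in 1..1_000_000.
    return random.randint(1, 1_000_000)

def test_few_collisions():
    results = [pick_random_entry() for _ in range(100)]
    collisions = len(results) - len(set(results))
    # With 100 draws from a million values, even one collision is unlikely (~0.5%),
    # so allowing a single collision keeps false failures rare.
    assert collisions <= 1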
You've got two entangled issues. The first issue is testing that your random selection works. Seeding your PRNG allows you to write a test that's deterministic and that you can assert about. This should give you confidence about your code, given that the underlying functions live up to their responsibilities (i.e. random returns you a good-enough stream of random values).
The second issue you seem to be concerned about is Python's random functions. You want to separate the concerns of your code from concerns about the random function. There are a number of randomness tests that you can read about, but at the end of the day, unless you're doing crypto, I'd trust the Python developers to have gotten it right enough.
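A small sketch of the seeding idea (get_random_entry and the entries list are hypothetical): seed the PRNG, record what comes out, and assert that the selection is deterministic and stays within the expected set.

import random

entries = ["row-1", "row-2", "row-3"]  # hypothetical database rows

def get_random_entry():
    # Stand-in for the function under test.
    return random.choice(entries)

def test_selection_is_deterministic_and_valid():
    random.seed(1234)
    first = [get_random_entry() for _ in range(10)]
    random.seed(1234)
    second = [get_random_entry() for _ in range(10)]
    assert first == second                   # same seed, same sequence
    assert all(e in entries for e in first)  # only valid entries come back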
In addition to the previous answers, you can also mock the random function (for example with the mock or mox library) and have it return a predefined sequence of values for which you know the results. This wouldn't be a true test for all cases, but you can test some corner cases. In some situations such tests can be reasonable.
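For example (a sketch using unittest.mock; get_random_entry and its use of random.choice are assumptions): patch the random call and check that your code does the right thing with a known value.

import random
from unittest import mock

entries = ["row-1", "row-2", "row-3"]

def get_random_entry():
    # Stand-in for the function under test.
    return random.choice(entries)

def test_with_mocked_random():
    with mock.patch("random.choice", return_value="row-2"):
        assert get_random_entry() == "row-2"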
What are some ways of testing complex data types such as video, images, music, etc.? I'm using TDD and wonder whether there are alternatives to "gold file" testing for rendering algorithms. I understand that there are ways to test the parts of the program that don't render and to infer from those results. However, I'm particularly interested in rendering algorithms, specifically image/video testing.
The question came up while I was using OpenCV/Python to do some basic facial recognition and want to verify its correctness.
Even if there's nothing definitive any suggestion will help.
The idea of how to test rendering is quite simple: to test a function, use its inverse function and check whether the input and output match (where "match" is not strict equality in your case):
f(f^-1(x)) = x
To test a rendering algorithm you would encode the raw input, render the encoded values, and analyze the difference between the rendered output and the raw input. One problem is getting hold of the raw input when encoding/decoding random input is not appropriate. Another challenge is evaluating the differences between the raw input and the rendered output. I suppose if you're writing rendering software you should be able to do a frequency analysis on the data. (Some transformation should pop into your head now.)
If it is possible, generate your test data. Test fixtures are a real maintenance problem; they only shine in the beginning. If they change in any way, everything breaks down. The main problem is that if you're using a fixture, your tests are going to repeat the fixture's content, which makes the intent of your tests harder to interpret. If there is a magic value in your test, what is the significant part of that value?
Fixture:

    actual = parse("file.xml")
    expected = "magic value"
    assert actual == expected

Generated values:

    expected = generate()
    input = render(expected)
    actual = parse(input)
    assert actual == expected
The nice thing with generators is that you can build quite complex object graphs with them starting from primitive types and fields (python Quickcheck version).
Generator-based tests are not deterministic by nature, but given enough trials they follow the law of large numbers.
Their additional value is that they produce good coverage of the range of test values (which is hard to achieve with test fixtures) and they will find unanticipated bugs in your code.
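A tiny concrete version of that generate → render → parse round trip (the render/parse pair here is a deliberately trivial stand-in that just packs RGB triples into bytes; a real lossy pipeline would compare within a tolerance rather than for exact equality):

import random
import struct

def generate_pixels(n):
    # Generated test data: n random 8-bit RGB triples.
    return [(random.randrange(256), random.randrange(256), random.randrange(256))
            for _ in range(n)]

def render(pixels):
    # Stand-in "renderer": pack the pixels into a byte buffer.
    return b"".join(struct.pack("BBB", r, g, b) for r, g, b in pixels)

def parse(buffer):
    # Inverse of render: unpack the byte buffer back into triples.
    return [struct.unpack_from("BBB", buffer, i) for i in range(0, len(buffer), 3)]

def test_round_trip():
    expected = generate_pixels(1000)
    actual = parse(render(expected))
    assert actual == expected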
An alternative test approach is to test against an equivalent function:
f(x) = f'(x)
For example, you may have another rendering function to compare against. This kind of test approach is useful if you have a working function: that function is your benchmark. It cannot be used in production because it is too slow or uses too much memory, but it can be easily debugged or proven to be correct.
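A small sketch of that kind of comparison (fast_sum stands in for the production implementation and reference_sum for the slow-but-obviously-correct benchmark; both are placeholders):

import random

def reference_sum(values):
    # Slow but obviously correct benchmark implementation.
    total = 0
    for v in values:
        total += v
    return total

def fast_sum(values):
    # Stand-in for the optimised production implementation under test.
    return sum(values)

def test_fast_matches_reference():
    for _ in range(100):
        data = [random.randint(-1000, 1000) for _ in range(200)]
        assert fast_sum(data) == reference_sum(data)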
What's wrong with the "gold file" technique? It's part of your test fixture. Every test has a data fixture that's the equivalent of the "gold file" in a media-intensive application.
When doing ordinary TDD of ordinary business applications, one often has a golden database fixture that must be used.
Even when testing simple functions and core classes of an application, the setUp method creates a kind of "gold file" fixture for that class or function.
What's wrong with that technique? Please update your question with the specific problems you're having.