Testing complex datatypes? - python

What are some ways of testing complex data types such as video, images, music, etc.? I'm using TDD and wonder whether there are alternatives to "gold file" testing for rendering algorithms. I understand that there are ways to test the parts of the program that don't render and to infer from those results, but I'm particularly interested in rendering algorithms, specifically image/video testing.
The question came up while I was using OpenCV/Python to do some basic facial recognition and wanted to verify its correctness.
Even if there's nothing definitive any suggestion will help.

The idea behind testing rendering is quite simple: to test a function, use the inverse function and check whether the input and output match (where "match" is not strict equality in your case):
f(f^-1(x)) = x
To test a rendering algorithm you would encode the raw input, render the encoded values and analyze the difference between the rendered output and the raw input. One problem is obtaining the raw input when encoding/decoding random input is not appropriate. Another challenge is evaluating the differences between the raw input and the rendered output. If you're writing rendering software, you should be able to do a frequency analysis on the data (some transformation should pop into your head now).
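A minimal sketch of that round-trip check, assuming NumPy arrays; encode and render are placeholders for your own codec/renderer, and mean absolute pixel error stands in for whatever difference metric suits your data:

import numpy as np

def roundtrip_close(raw, encode, render, tolerance=2.0):
    # Round trip: render(encode(raw)) should stay close to raw.
    # "Close" is measured as mean absolute pixel error, not strict equality.
    rendered = render(encode(raw))
    error = np.mean(np.abs(rendered.astype(float) - raw.astype(float)))
    return error <= tolerance

# Example with identity stand-ins and a synthetic 8-bit grayscale frame:
frame = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
assert roundtrip_close(frame, encode=lambda x: x, render=lambda x: x)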
If possible, generate your test data. Test fixtures are a real maintenance problem: they only shine in the beginning, and if they change in any way everything breaks down. The main problem is that if you're using a fixture, your tests end up repeating the fixture's content, which makes the intent of your tests harder to interpret. If there is a magic value in your test, what is the significant part of that value?
Fixture:
actual = parse("file.xml")
expected = "magic value"
assert actual == expected
Generated values:
expected = generate()
rendered = render(expected)
actual = parse(rendered)
assert actual == expected
The nice thing about generators is that you can build quite complex object graphs with them, starting from primitive types and fields (there are Python ports of QuickCheck).
Generator-based tests are not deterministic by nature, but given enough trials they benefit from the law of large numbers. Their additional value is that they produce good coverage of the test value range (which is hard to achieve with test fixtures) and they will find unanticipated bugs in your code.
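A minimal sketch of the generate/render/parse loop above, with a small XML stand-in for the real parser under test (generate, render and parse here are illustrative, not your actual functions):

import random
import xml.etree.ElementTree as ET

def generate():
    # Build a small random object graph instead of keeping a fixture file.
    return {"id": random.randint(1, 10**6),
            "name": "".join(random.choices("abcdefgh", k=8))}

def render(obj):
    # Produce the input that the parser will see.
    root = ET.Element("item", id=str(obj["id"]))
    ET.SubElement(root, "name").text = obj["name"]
    return ET.tostring(root)

def parse(xml_bytes):
    # Stand-in for the real function under test.
    root = ET.fromstring(xml_bytes)
    return {"id": int(root.get("id")), "name": root.find("name").text}

def test_parse_roundtrips_generated_data():
    for _ in range(100):  # many trials, leaning on the law of large numbers
        expected = generate()
        assert parse(render(expected)) == expected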
An alternative approach is to test against an equivalent function:
f(x) = f'(x)
For example, you may have another rendering function to compare against. This kind of test is useful if you already have a working function: that function is your benchmark. It cannot be used in production because it is too slow or uses too much memory, but it can easily be debugged or proven correct.
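A minimal sketch of this benchmark comparison, with sorting standing in for rendering (both implementations are illustrative stand-ins):

import random

def fast_render(xs):
    # Optimized implementation you actually ship (stand-in).
    return sorted(xs)

def reference_render(xs):
    # Slow but obviously-correct benchmark (stand-in): a selection sort
    # that is easy to verify by eye.
    result = list(xs)
    for i in range(len(result)):
        j = min(range(i, len(result)), key=result.__getitem__)
        result[i], result[j] = result[j], result[i]
    return result

def test_fast_matches_reference():
    for _ in range(200):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 50))]
        assert fast_render(xs) == reference_render(xs)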

What's wrong with the "gold file" technique? It's part of your test fixture. Every test has a data fixture that is the equivalent of the "gold file" in a media-intensive application.
When doing ordinary TDD of ordinary business applications, one often has a golden database fixture that must be used.
Even when testing simple functions and core classes of an application, the setUp method creates a kind of "gold file" fixture for that class or function.
What's wrong with that technique? Please update your question with the specific problems you're having.


How to measure similarity between two python code blocks?

Many people want to measure code similarity to catch plagiarism; my intention, however, is to cluster a set of Python code blocks (say, answers to the same programming question) into different categories and distinguish the different approaches taken by students.
If you have any idea how this could be achieved, I would appreciate it if you share it here.
You can choose any scheme you like that essentially hashes the contents of the code blocks, and place code blocks with identical hashes into the same category.
Of course, what will turn out to be similar will then depend highly on how you defined the hashing function. For instance, a truly stupid hashing function H(code)==0 will put everything in the same bin.
The hard problem is finding a hashing function that classifies code blocks in a way that matches our natural sense of similarity. Despite lots of research, nobody has yet found anything better to judge this than "I'll know they are similar when I see them."
You surely do not want it to be dependent on layout/indentation/whitespace/comments, or slight changes to these will classify blocks differently even if their semantic content is identical.
There are three major schemes people have commonly used to find duplicated (or similar) code:
Metrics-based schemes, which compute the hash from counts of various types of operators and operands, i.e. by computing a metric over the lexical tokens. These often operate only at the function level. I know of no practical tools based on this.
Lexically based schemes, which break the input stream into lexemes, convert identifiers and literals into fixed special constants (e.g., treat them as undifferentiated), and then essentially hash N-grams (sequences of N tokens) over these sequences. There are many clone detectors based on essentially this idea; they work tolerably well, but also find stupid matches because nothing forces alignment with program structure boundaries. For example, the sequence
return ID; } void ID ( int ID ) {
is an 11-gram that occurs frequently in C-like languages but clearly isn't a useful clone. The result is that false positives tend to occur, i.e. claimed matches where there isn't a real one. (A rough sketch of this lexical approach appears after the list of schemes below.)
Abstract-syntax-tree-based matching (hashing over subtrees), which automatically aligns clones to language boundaries by virtue of using ASTs, which represent the language structures directly. (I'm the author of the original paper on this, and built a commercial product, CloneDR, based on the idea; see my bio.) These tools have the advantage that they can match code containing sequences of tokens of different lengths in the middle of a match, e.g. where one statement (of arbitrary size) is replaced by another.
This paper provides a survey of the various techniques: http://www.cs.usask.ca/~croy/papers/2009/RCK_SCP_Clones.pdf. It shows that AST-based clone detection tools appear to be the most effective at producing clones that people agree are similar blocks of code, which seems key to OP's particular interest; see Table 14.
[There are also graph-based schemes that match control and data flow graphs. They should arguably produce even better matches but apparently do not do much better in practice.]
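As promised above, here is a rough sketch of the lexical N-gram idea using Python's standard tokenize module; the 5-gram size and Jaccard similarity are arbitrary illustrative choices, not the workings of any particular tool:

import io
import keyword
import tokenize

def normalized_ngrams(code, n=5):
    # Replace identifiers/literals with placeholders, drop comments and
    # layout tokens, then collect every n-gram of the remaining lexemes.
    lexemes = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            lexemes.append("ID")         # undifferentiated identifier
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            lexemes.append("LIT")        # undifferentiated literal
        else:
            lexemes.append(tok.string)   # keywords, operators, punctuation
    return {tuple(lexemes[i:i + n]) for i in range(len(lexemes) - n + 1)}

def similarity(code_a, code_b, n=5):
    # Jaccard similarity of the two fingerprints, between 0.0 and 1.0.
    a, b = normalized_ngrams(code_a, n), normalized_ngrams(code_b, n)
    return len(a & b) / len(a | b) if a | b else 1.0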
One approach would be to count the number of functions, objects and keywords, possibly grouped into categories such as branching, creating, manipulating, etc., and the number of variables of each type, without relying on the methods and variables having the same names.
For a given problem, similar approaches will tend to come out with similar scores: for example, a student who used a decision tree would have a high number of branch statements, while one who used a decision table would have far fewer.
This approach would be much quicker to implement than parsing the code structure and comparing the results.
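A minimal sketch of such a feature count using Python's ast module (the categories chosen here are arbitrary examples):

import ast
from collections import Counter

def feature_vector(code):
    # Count coarse node categories; similar solutions tend to produce
    # similar counts regardless of identifier names.
    counts = Counter()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.Try)):
            counts["branching"] += 1
        elif isinstance(node, (ast.FunctionDef, ast.Lambda)):
            counts["functions"] += 1
        elif isinstance(node, ast.Call):
            counts["calls"] += 1
        elif isinstance(node, (ast.Assign, ast.AugAssign)):
            counts["assignments"] += 1
    return counts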

What is the difference between Property Based Testing and Mutation testing?

My context for this question is in Python.
Hypothesis library (i.e. property-based testing):
https://hypothesis.readthedocs.io/en/latest/
Mutation Testing Library:
https://github.com/sixty-north/cosmic-ray
These are very different beasts, but both improve the value and quality of your tests, and both make the statement "my code coverage is N%" more meaningful.
Hypothesis would help you to generate all sorts of test inputs in the defined scope for a function under test.
Usually, when you need to test a function, you provide multiple example values, trying to cover all the use cases and edge cases, guided by code coverage reports; this is so-called "example-based testing". Hypothesis, on the other hand, implements property-based testing, generating a whole bunch of different inputs and input combinations, which helps to catch common errors like division by zero, None, 0 and off-by-one errors, and to find hidden bugs.
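A minimal property-based test might look like the sketch below; my_sort is a hypothetical function standing in for your own code under test:

from collections import Counter
from hypothesis import given, strategies as st

def my_sort(xs):
    # Hypothetical function under test.
    return sorted(xs)

@given(st.lists(st.integers()))
def test_sort_is_ordered_and_keeps_elements(xs):
    result = my_sort(xs)
    assert all(a <= b for a, b in zip(result, result[1:]))  # output is ordered
    assert Counter(result) == Counter(xs)                   # no elements gained or lost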
Mutation testing is all about changing your code under test on the fly while executing your tests against a modified version of your code.
This really helps to see whether your tests are actually testing what they are supposed to test, and to understand the value of your tests. Mutation testing really shines if you already have a rich test code base and good code coverage.
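Conceptually, a mutation tester edits your code and reruns your tests; if the tests still pass, the mutant "survives" and you have found a gap. A hand-written illustration of the idea (not actual output from cosmic-ray or any other tool):

def is_adult(age):
    return not (age < 18)       # original code

def is_adult_mutant(age):
    return not (age <= 18)      # mutant: '<' replaced with '<='

def test_is_adult():
    # This weak test passes against both versions, so the mutant survives,
    # revealing the missing boundary check at age == 18.
    assert is_adult(30) is True
    assert is_adult(10) is False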
What helped me to get ahold of these concepts were these Python Podcasts:
Property-based Testing with Hypothesis
Validating Python tests with mutation testing
Hypothesis with David MacIver
I'm the author of mutmut, the (IMO) best mutation tester for Python. alecxe has a very good answer, but I would like to expand on it. Read his answer before mine for basic context.
There are some other big differences. For example, PBT requires mental work to specify the rules for each function under test, while MT requires you to justify all behavior in the code, which takes much less cognitive effort.
MT is effectively white-box testing and PBT black-box.
Another difference is that MT is the exploration of a (fairly small) finite space, while PBT is an exploration of a practically infinite space. A practical consequence is that you can trivially know when you are done with MT, whereas a PBT run could go on for years and you still couldn't know whether it has searched the relevant parts of the space. Better rules for PBT radically cut the run time for this reason.
Mutation testing also forces minimal code. This is a surprising effect, but it's something I have experienced again and again. This is a nice little bonus for MT.
You can also use MT as a simple checklist for working toward 100% mutation coverage; you don't need to start with 100% coverage at all. With PBT, too, you can start way below 100% coverage, in essence at 0%.
I hope this clarifies the situation a bit more.

Python: Create Nomograms from Data (using PyNomo)

I am working on Python 2.7. I want to create nomograms based on the data of various variables in order to predict one variable. I am looking into the PyNomo package and have installed it.
However, from the documentation here and here and from the examples, it seems that nomograms can only be made when you have equation(s) relating the variables, not from the data itself. For example, the examples here show how to use equations to create nomograms. What I want is to create a nomogram from the data and use that to predict things. How do I do that? In other words, how do I make the nomograph take data as input rather than a function? Is it even possible?
Any input would be helpful. If PyNomo cannot do it, please suggest some other package (in any language). For example, I am trying function nomogram from package rms in R, but not having luck with figuring out how to properly use it. I have asked a separate question for that here.
The term "nomogram" has become somewhat confused of late as it now refers to two entirely different things.
A classic nomogram performs a full calculation: you mark two scales, draw a straight line across the marks and read your answer from a third scale. This is the type of nomogram that PyNomo produces, and, as you correctly say, you need a formula. Producing such a nomogram from data is therefore a two-step process: first fit an equation to the data, then draft the nomogram from that equation.
The other use of the term (very popular recently) is to refer to regression nomograms. These are graphical depictions of regression models (usually logistic regression models). For these, a group of parallel predictor variables is depicted with a common scale on the bottom; for each predictor you read the 'score' from the scale and add the scores up. These types of nomograms have become very popular in the last few years, and that's what the rms package will draft. I haven't used it, but my understanding is that it works directly from the data.
Hope this is of some use! :-)

How do you test that something is random? Or "random enough"?

I have to return a random entry from my database.
I wrote a function, and since I'm using the random module in Python, it's probably fine unless I used it in a stupid way.
Now, how can I write a unit test that checks that this function works? After all, if the value is genuinely random, you can never know for sure.
I'm not paranoid; my function is not that complex, and the Python standard library is a thousand times good enough for my purposes. I'm not doing cryptography or anything critical. I'm just curious to know whether there is a way.
There are several statistical tests listed on RANDOM.ORG for testing randomness. See the last two sections of the linked article.
Also, if you can get a copy of Beautiful Testing there's a whole chapter by John D. Cook called Testing a Random Number Generator. He explains a lot of the statistical methods listed in the article above. If you really want to learn about RNGs, that chapter is a really good starting point. I've written about the subject myself, but John does a much better job of explaining it.
You cannot really tell (see cartoon).
However, you can measure the entropy of your generated sample and test it against the entropy you would expect. As mentioned before, random.org provides some pretty clever tests.
You could have the unit test call the function multiple times and make sure that the number of collisions is reasonably low. E.g. if your random result is in the range 1-1000000, call the function 100 times and record the results; then check for duplicates. If there are any (or more than one collision, depending on how afraid you are of false test failures), the test fails.
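A minimal sketch of that collision check, with a hypothetical stand-in for the real "pick a random database row" function:

import random
import unittest

def get_random_entry_id():
    # Hypothetical stand-in for the real database-selection function.
    return random.randint(1, 1000000)

class TestRandomEntry(unittest.TestCase):
    def test_collisions_are_rare(self):
        # With ~1,000,000 possible ids, 100 draws should almost never repeat.
        results = [get_random_entry_id() for _ in range(100)]
        duplicates = len(results) - len(set(results))
        self.assertLessEqual(duplicates, 1)  # tolerate one collision to avoid flaky failures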
Obviously not perfect, but it will catch it if your random numbers come from the Dilbert random number generator:
http://www.random.org/analysis/
You've got two entangled issues. The first issue is testing that your random selection works. Seeding your PRNG allows you to write a test that's deterministic and that you can assert about. This should give you confidence about your code, given that the underlying functions live up to their responsibilities (i.e. random returns you a good-enough stream of random values).
The second issue you seem to be concerned about is Python's random functions. You want to separate the concerns of your own code from the concern about the random function. There are a number of randomness tests you can read about, but at the end of the day, unless you're doing crypto, I'd trust the Python developers to have gotten it right enough.
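A minimal sketch of the seeding idea from the first point; pick_random_entry is a hypothetical stand-in that takes the RNG as a parameter so the test can control it:

import random

def pick_random_entry(entries, rng=random):
    # Hypothetical selection function; the RNG is a parameter so that
    # tests can pass in a seeded instance.
    return rng.choice(entries)

def test_selection_is_reproducible_with_a_seeded_rng():
    entries = ["a", "b", "c", "d"]
    first = pick_random_entry(entries, random.Random(42))   # fixed seed
    again = pick_random_entry(entries, random.Random(42))   # same seed, same draw
    assert first == again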
In addition to the previous answers, you can also mock the random function (for example with the mock or mox library) and return a predefined sequence of values for which you know the results. This wouldn't be a true test for all cases, but it lets you cover some corner cases, and in some situations such tests can be reasonable.
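A minimal sketch of that mocking approach using unittest.mock; pick_random_entry is a hypothetical function under test:

import random
from unittest import mock

def pick_random_entry(entries):
    # Hypothetical function under test.
    return entries[random.randrange(len(entries))]

def test_corner_cases_with_patched_random():
    entries = ["first", "middle", "last"]
    # Force the two boundary values randrange could legally return.
    with mock.patch("random.randrange", return_value=0):
        assert pick_random_entry(entries) == "first"
    with mock.patch("random.randrange", return_value=len(entries) - 1):
        assert pick_random_entry(entries) == "last"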

Unit tests for automatically generated code: automatic or manual?

I know similar questions have been asked before but they don't really have the information I'm looking for - I'm not asking about the mechanics of how to generate unit tests, but whether it's a good idea.
I've written a module in Python which contains objects representing physical constants and units of measurement. A lot of the units are formed by adding on prefixes to base units - e.g. from m I get cm, dm, mm, hm, um, nm, pm, etc. And the same for s, g, C, etc. Of course I've written a function to do this since the end result is over 1000 individual units and it would be a major pain to write them all out by hand ;-) It works something like this (not the actual code):
def add_unit(name, value):
    globals()[name] = value
    for pfx, multiplier in prefixes:
        globals()[pfx + name] = multiplier * value

add_unit('m', <definition of a meter>)
add_unit('g', <definition of a gram>)
add_unit('s', <definition of a second>)
# etc.
The problem comes in when I want to write unit tests for these units (no pun intended), to make sure they all have the right values. If I write code that automatically generates a test case for every unit individually, any problems that are in the unit generation function are likely to also show up in the test generation function. But given the alternative (writing out all 1000+ tests by hand), should I just go ahead and write a test generation function anyway, check it really carefully and hope it works properly? Or should I only test, say, one series of units (m, cm, dm, km, nm, um, and all other multiples of the meter), just enough to make sure the unit generation function seems to be working? Or something else?
You're right to identify the weakness of automatically generating test cases. The usefulness of a test comes from taking two different paths (your code, and your own mental reasoning) to come up with what should be the same answer -- if you use the same path both times, nothing is being tested.
In summary: Never write automatically generated tests, unless the algorithm for generating the test results is dramatically simpler than the algorithm that you are testing. (Testing of a sorting algorithm is an example of when automatically generated tests would be a good idea, since it's easy to verify that a list of numbers is in sorted order. Another good example would be a puzzle-solving program as suggested by ChrisW in a comment. In both cases, auto-generation makes sense because it is much easier to verify that a given solution is correct than to generate a correct solution.)
My suggestion for your case: Manually test a small, representative subset of the possibilities.
[Clarification: certain types of automated tests are appropriate and highly useful, e.g. fuzzing. I mean that it is unhelpful to auto-generate unit tests for generated code.]
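A minimal sketch of what hand-testing a representative subset might look like; the units module name and the assumption that units compare as numbers are both hypothetical, based on the code in the question:

import unittest

import units  # the module that calls add_unit(); the name is an assumption

class TestRepresentativeUnits(unittest.TestCase):
    def test_metre_prefixes(self):
        # Spot-check one full series of prefixes by hand.
        self.assertAlmostEqual(units.cm, units.m / 100)
        self.assertAlmostEqual(units.mm, units.m / 1000)
        self.assertAlmostEqual(units.km, units.m * 1000)

    def test_prefix_scales_other_base_units_consistently(self):
        # One cross-check that the same prefix scales every base unit alike.
        self.assertAlmostEqual(units.kg / units.g, units.km / units.m)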
If you auto-generate the tests:
You might find it faster to read through all the tests (to inspect them for correctness) than it would have been to write them all by hand.
They might also be more maintainable (easier to edit, if you want to edit them later).
I would say the best approach is to unit test the generation, and as part of that, take a sample generated result (only enough that the test covers something you would consider significantly different from the other scenarios) and put it under a unit test to make sure the generation is working correctly. Beyond that, there is little unit-test value in defining every scenario in an automated way. There may be functional-test value in putting together some functional tests that exercise the generated code for whatever purpose you have in mind, in order to give wider coverage to the various potential units.
Write only just enough tests to make sure that your code generation works correctly (just enough to drive the design of the imperative code). Declarative code rarely breaks. You should only test things that can break. Mistakes in declarative code (such as your case and, for example, user interface layouts) are better found with exploratory testing, so writing extensive automated tests for them is a waste of time.
