A long definition of an object inside a Python unit test

A long definition of an object inside a Python unit test - python

I'm unit testing my application. What most of the tests do is calling a function with specific arguments and asserting the equality of the return value with an expected value.
In some tests the expected return value is a relatively big object. One of them, for example, is a dictionary which maps 5 strings to lists of tuples. It takes 40-50 repetitive lines of code to define that object, but that object is an expected value of one of the functions I'm testing. I don't want to have a 40-50 lines of code defining an expected return value inside a test function because most of my test functions consist of 3-6 lines of code. I'm looking for a best practice for such situations. What is the right way of putting lengthy definitions inside a test?
Here are the ideas I was thinking of to address the issue, ranked from the best to the worst as I see it:
Testing samples of the object: Making a few equality assertions based on a subset of the keys. This will sacrifice the thoroughness of the test for the sake of code elegance.
Defining the object in a separate module: Writing the lengthy 40-50 lines of code in a separate .py file, importing the module in the test and then make the equality assertion. This will make the test short and concise but I don't like having a separate file as a supplement to a test; the object definition is part of the test after all.
Defining the object inside the test function: This is the trivial solution which I wish to avoid. My tests are pretty simple and straightforward and the lengthy definition of that object doesn't fit.
Maybe I'm too obsessed with clean code, but I like none of the above solutions. Is there another common practice I haven't thought of?

I'd suggest using a separation of testing code and testing data. For this reason I usually create an abstract base class which contains the methods I'd like to test and create several specific test case classes to tie the methods to the data. (I use the Django framework, so all abstract test classes I put into testbase.py):
testbase.py:
class TestSomeFeature(unittest.TestCase):
test_data_A = ...
def test_A(self):
... #perform test
and now the implementations in test.py
class TestSomeFeatureWithDataXY(testbase.TestSomeFeature):
test_data_A = XY
The test data can also be externalized, e.g a JSON file:
class TestSomeFeatureWithDataXYZ(testbase.TestSomeFeature):
#property
def test_data_A(self):
return json.load("data/XYZ.json")
I hope I made my points clear enough. In your case I'd strongly opt for using data files. Django supports this out-of-the-box by using test fixtures to be loaded into the database prior executing any tests.

It really depends on what you want to test.
If you want to test that a dictionary contains certain keys with certain values, then I would suggest separate assertions to check each key. This way your test will still be valid if the dictionary is extended, and test failures should clearly identify the problem (an error message telling you that one 50-line long dictionary is not equal to a second 50 line long dictionary is not exactly clear).
If you really do want to verify that the dictionary contains only the given keys, then a single assertion might be appropriate. Define the object you are comparing against where it is most clear. If defining it in a separate file (as Constantinius's answer suggests) makes things more readable then consider doing that.
In both cases, the guiding principle is to only test the behaviour you care about. If you test behaviour you don't care about, you may find your test suite more obstructive than helpful when refactoring.

Related

Can you efficiently analyse the type of method parameters in Python

I am making a PyCharm plugin that will generate code examples based on a Python class. The user will be able to select a class and the plugin will generate code examples that utilise the attributes and methods of the class.
Basic Class
class Rectangle:
# Hardcoded values
width = 5
length = 10
def calculate_area(self):
return self.width*self.length
Basic Example
r = Rectangle()
r.calculate_area()
The problem arises when the class contains methods that take parameters. For example, if the class contains a method that takes a string parameter, then I want the generated examples to pass either an empty or a random string to the method. Is there a good way to analyse the potential type of the parameters (when not strongly defined) to generate code examples?
Potential Solutions
So far I have come up with 2 applicable solutions (passing None parameters when calling the method is not applicable):
A potential solution would be to keep track of the parameters and analyse the operations that they use/are used in to determine their type. In the code below num will most likely be an integer, as determined by the method body:
def calculate_even(num):
return num % 2 == 0
I know that you can execute a string containing Python code and check if it compiles correctly (source). A possible solution (one that I would love to avoid) is to generate code examples, passing different types as method parameters and determining which ones work after executing said code examples.
I can see huge downsides with both solutions, as the first one will require the handling of lots of operations for the different types and will still most likely be inaccurate, and the second one becomes incredibly slow for methods that take multiple parameters. I am really hoping that there's a more elegant solution that I have missed.
Edit
I am looking for a possible solution that won't require the use of type annotations/hints

Python: Separate class methods from the class in an own package

I wrote a class with many different parameters, depending on the parameter the class uses different actions.
I can easily differentiate my methods into different cases, each set of methods belonging to certain parameters.
This resulted in a huge .py file, implementing all methods in the one class. For better readability, is it possible to write multiple methods in an own file and load it (similar as a package) into the class to treat them as class methods?
To give more details, my class is a decision tree. A parameter for example is the pruning method, used to shrink the tree. As I use different pruning methods, this takes a lot of lines in my class. I need to have a set of methods for each pruning parameter. It would be nice to simply load the methods for pruning from another file into the class and therefore shrinking the size of my decision tree .py file.

For better readability, I got a few suggestions.
Collapse all functions definitions, which can easily be achieved by
one click in most of the popular text editors.
Place related methods next to each other.
Give proper names to categorise and differentiate methods, for
example.
def search_board_first(): pass
def search_deep_first(): pass
Regarding splitting a huge class into a Object oriented behaviour, my rule of thumb is to consider the re-usability. If functions can be reused by other classes, they should be put in separate files and make them independent(static) of other classes.
If the methods are tied to the class and no where else, it is better to just to enclose that method within the class itself. Think it this way, to review the code, you need to refer the class properties anyway. Logically it doesn't quite make sense to split files just for splitting.

Python Style: nested vs extra function

I'm quite new to python (2.7) and have a question about what's the most Pythonic way to do something; my code (part of a class) Looks like this (a somewhat naive Version):
def calc_pump_height(self):
for i in range(len(self.primary_)):
for j in range(len(self.primary_)):
if self.connections_[i][j].sub_kind_ in [1,4]:
self.calc_spec_pump_height(i,j)
def calc_spec_pump_height(self,i,j):
pass
(obviously pass will be replaced by something else, manipulating attributes of the object of this class, without generating a return value)
I'd like to ask how I should do this: I could avoid the second function and write the extra code directly into the first function, getting rid of one function (Simple is better than complex), but creating a heavily nested function at the same time (Flat is better than nested).
I could also create some sort of list comprehension to avoid using a double Loop, eg:
def calc_pump_height(self):
ra = range(len(self.primary_))
[self.calc_spec_pump_height(i,j) for i,j in zip(ra, ra)]
(I'd have to move the if condition into the 2nd function; this would also create a null-list but I don't care about this, since calc_spec_pump_height is supposed to manipulate the object, not return something useful)
In essence: I'm iterating over a 2D list, testing each object for a certain characteristic and then do something with that object.
Which of the above methods is 'the best'? Or is there another way that I'm missing?

The key thing about functions/methods is that they should do one thing.
calc_pump_height implements two things: It finds elements in a 2D list that match some criteria, and then it calculates a value for each of those elements. It's ok for its purpose to be combining the other two operations, if that makes sense for the object's public API, but its not ok for it to implement either or both.
Finding the elements that match the criteria is a discrete step; that should be a function.
Calculating your value is clearly a discrete step; that should be a function.
I would implement the element matcher as a (private) generator, that takes the test condition as an argument, and yields all matching elements. It's just an iterator over your data structure, masked by the logical test. You can wrap that in a named public method called get_1_4_subkinds() or something that makes more sense in your domain. That generalises the code and gives you the flexibility to implement other conditions in the future. Also, your i and j are tightly coupled, so it makes sense to pass them around as a single concept. Then your code becomes:
def calc_pump_height(self):
for subkind_indices in self.get_1_4_subkinds():
self.calc_pump_spec_height(subkind_indices)

You have misunderstood “simplicity”:
write the extra code directly into the first function, getting rid of one function (Simple is better than complex)
That's not simple. Breaking complex sequences into discrete, focussed functions increases simplicity.
In that light, I would say that yes, you should definitely prefer calc_spec_pump_height as a separate function.

You can eliminate one level of nesting in your first function by using itertools.product to generate your i and j values at the same time (itertools.product(range(len(self.primary_)), repeat=2). The zip you use in the your second version won't work correctly, it will only yield identical pairs, 0,0, 1,1, 2,2, etc.
As for the overall design, you should not use a list comprehension if you don't care about the return value from the function you're calling. Use an explicit loop when it's the looping you want (rather than a list of computed values).
If there's a non-trivial amount of code that will go in calc_spec_pump_height, it makes perfect sense to make it as a separate method. If it's a one or two liner, then it might be OK to inline within calc_pump_height, but that method's loops and condition testing may be complicated enough already to justify factoring out the inner part of the algorithm.
You should usually think about splitting a big function up when it is too long to fit onto a single screen in your editor. That is about the limit of how many details (variable names, etc.) we can keep in our mind simultaneously. On the other hand, you shouldn't waste time (either your own programming time or function call overhead at run time) by factoring out every little piece of every problem. Factor part of a function out if you're using it from more than one place, or if you can't keep the details of the whole function in your head at once otherwise.
So, other than the (marginal) improvement of itertools.product and given the limited information you've provided about what calc_spec_pump_height will do, I think your code is already about as good as it can get!

How do I design a class in Python?

I've had some really awesome help on my previous questions for detecting paws and toes within a paw, but all these solutions only work for one measurement at a time.
Now I have data that consists off:
about 30 dogs;
each has 24 measurements (divided into several subgroups);
each measurement has at least 4 contacts (one for each paw) and
each contact is divided into 5 parts and
has several parameters, like contact time, location, total force etc.
Obviously sticking everything into one big object isn't going to cut it, so I figured I needed to use classes instead of the current slew of functions. But even though I've read Learning Python's chapter about classes, I fail to apply it to my own code (GitHub link)
I also feel like it's rather strange to process all the data every time I want to get out some information. Once I know the locations of each paw, there's no reason for me to calculate this again. Furthermore, I want to compare all the paws of the same dog to determine which contact belongs to which paw (front/hind, left/right). This would become a mess if I continue using only functions.
So now I'm looking for advice on how to create classes that will let me process my data (link to the zipped data of one dog) in a sensible fashion.

How to design a class.
Write down the words. You started to do this. Some people don't and wonder why they have problems.
Expand your set of words into simple statements about what these objects will be doing. That is to say, write down the various calculations you'll be doing on these things. Your short list of 30 dogs, 24 measurements, 4 contacts, and several "parameters" per contact is interesting, but only part of the story. Your "locations of each paw" and "compare all the paws of the same dog to determine which contact belongs to which paw" are the next step in object design.
Underline the nouns. Seriously. Some folks debate the value of this, but I find that for first-time OO developers it helps. Underline the nouns.
Review the nouns. Generic nouns like "parameter" and "measurement" need to be replaced with specific, concrete nouns that apply to your problem in your problem domain. Specifics help clarify the problem. Generics simply elide details.
For each noun ("contact", "paw", "dog", etc.) write down the attributes of that noun and the actions in which that object engages. Don't short-cut this. Every attribute. "Data Set contains 30 Dogs" for example is important.
For each attribute, identify if this is a relationship to a defined noun, or some other kind of "primitive" or "atomic" data like a string or a float or something irreducible.
For each action or operation, you have to identify which noun has the responsibility, and which nouns merely participate. It's a question of "mutability". Some objects get updated, others don't. Mutable objects must own total responsibility for their mutations.
At this point, you can start to transform nouns into class definitions. Some collective nouns are lists, dictionaries, tuples, sets or namedtuples, and you don't need to do very much work. Other classes are more complex, either because of complex derived data or because of some update/mutation which is performed.
Don't forget to test each class in isolation using unittest.
Also, there's no law that says classes must be mutable. In your case, for example, you have almost no mutable data. What you have is derived data, created by transformation functions from the source dataset.

The following advices (similar to #S.Lott's advice) are from the book, Beginning Python: From Novice to Professional
Write down a description of your problem (what should the problem do?). Underline all the nouns, verbs, and adjectives.
Go through the nouns, looking for potential classes.
Go through the verbs, looking for potential methods.
Go through the adjectives, looking for potential attributes
Allocate methods and attributes to your classes
To refine the class, the book also advises we can do the following:
Write down (or dream up) a set of use cases—scenarios of how your program may be used. Try to cover all the functionally.
Think through every use case step by step, making sure that everything we need is covered.

I like the TDD approach...
So start by writing tests for what you want the behaviour to be. And write code that passes. At this point, don't worry too much about design, just get a test suite and software that passes. Don't worry if you end up with a single big ugly class, with complex methods.
Sometimes, during this initial process, you'll find a behaviour that is hard to test and needs to be decomposed, just for testability. This may be a hint that a separate class is warranted.
Then the fun part... refactoring. After you have working software you can see the complex pieces. Often little pockets of behaviour will become apparent, suggesting a new class, but if not, just look for ways to simplify the code. Extract service objects and value objects. Simplify your methods.
If you're using git properly (you are using git, aren't you?), you can very quickly experiment with some particular decomposition during refactoring, and then abandon it and revert back if it doesn't simplify things.
By writing tested working code first you should gain an intimate insight into the problem domain that you couldn't easily get with the design-first approach. Writing tests and code push you past that "where do I begin" paralysis.

The whole idea of OO design is to make your code map to your problem, so when, for example, you want the first footstep of a dog, you do something like:
dog.footstep(0)
Now, it may be that for your case you need to read in your raw data file and compute the footstep locations. All this could be hidden in the footstep() function so that it only happens once. Something like:
class Dog:
def __init__(self):
self._footsteps=None
def footstep(self,n):
if not self._footsteps:
self.readInFootsteps(...)
return self._footsteps[n]
[This is now a sort of caching pattern. The first time it goes and reads the footstep data, subsequent times it just gets it from self._footsteps.]
But yes, getting OO design right can be tricky. Think more about the things you want to do to your data, and that will inform what methods you'll need to apply to what classes.

After skimming your linked code, it seems to me that you are better off not designing a Dog class at this point. Rather, you should use Pandas and dataframes. A dataframe is a table with columns. You dataframe would have columns such as: dog_id, contact_part, contact_time, contact_location, etc.
Pandas uses Numpy arrays behind the scenes, and it has many convenience methods for you:
Select a dog by e.g. : my_measurements['dog_id']=='Charly'
save the data: my_measurements.save('filename.pickle')
Consider using pandas.read_csv() instead of manually reading the text files.

Writing out your nouns, verbs, adjectives is a great approach, but I prefer to think of class design as asking the question what data should be hidden?
Imagine you had a Query object and a Database object:
The Query object will help you create and store a query -- store, is the key here, as a function could help you create one just as easily. Maybe you could stay: Query().select('Country').from_table('User').where('Country == "Brazil"'). It doesn't matter exactly the syntax -- that is your job! -- the key is the object is helping you hide something, in this case the data necessary to store and output a query. The power of the object comes from the syntax of using it (in this case some clever chaining) and not needing to know what it stores to make it work. If done right the Query object could output queries for more then one database. It internally would store a specific format but could easily convert to other formats when outputting (Postgres, MySQL, MongoDB).
Now let's think through the Database object. What does this hide and store? Well clearly it can't store the full contents of the database, since that is why we have a database! So what is the point? The goal is to hide how the database works from people who use the Database object. Good classes will simplify reasoning when manipulating internal state. For this Database object you could hide how the networking calls work, or batch queries or updates, or provide a caching layer.
The problem is this Database object is HUGE. It represents how to access a database, so under the covers it could do anything and everything. Clearly networking, caching, and batching are quite hard to deal with depending on your system, so hiding them away would be very helpful. But, as many people will note, a database is insanely complex, and the further from the raw DB calls you get, the harder it is to tune for performance and understand how things work.
This is the fundamental tradeoff of OOP. If you pick the right abstraction it makes coding simpler (String, Array, Dictionary), if you pick an abstraction that is too big (Database, EmailManager, NetworkingManager), it may become too complex to really understand how it works, or what to expect. The goal is to hide complexity, but some complexity is necessary. A good rule of thumb is to start out avoiding Manager objects, and instead create classes that are like structs -- all they do is hold data, with some helper methods to create/manipulate the data to make your life easier. For example, in the case of EmailManager start with a function called sendEmail that takes an Email object. This is a simple starting point and the code is very easy to understand.
As for your example, think about what data needs to be together to calculate what you are looking for. If you wanted to know how far an animal was walking, for example, you could have AnimalStep and AnimalTrip (collection of AnimalSteps) classes. Now that each Trip has all the Step data, then it should be able to figure stuff out about it, perhaps AnimalTrip.calculateDistance() makes sense.

How many private variables are too many? Capsulizing classes? Class Practices?

Okay so i am currently working on an inhouse statistics package for python, its mainly geared towards a combination of working with arcgis geoprocessor, for modeling comparasion and tools.
Anyways, so i have a single class, that calculates statistics. Lets just call it Stats. Now my Stats class, is getting to the point of being very large. It uses statistics calculated by other statistics, to calculate other statistics sets, etc etc. This leads to alot of private variables, that are kept simply to prevent recalculation. however there is certain ones, while used quite frequintly they are often only used by one or two key subsections of functionality. (e.g. summation of matrix diagonals, and probabilities). However its starting to become a major eyeesore, and i feel as if i am doing this terribly wrong.
So is this bad?
I was recommended by a coworker, to simply start putting core and common functionality togther, in the main class, then simply having capsules, that take a reference to the main class, and simply do what ever functionality they need to within themselves. E.g. for calculating accuracy of model predictions, i would create a capsule, who simply takes a reference to the parent, and it will offload all of the calculations needed, for model predictions.
Is something like this really a good idea? Is there a better way? Right now i have over a dozen different sub statistics that are dumped to a text file to make a smallish report. The code base is growing, and i would just love it if i could start splitting up more and more of my python classes. I am just not sure really what the best way about doing stuff like this is.

Why not create a class for each statistic you need to compute and when of the statistics requires other, just pass an instance of the latter to the computing method? However, there is little known about your code and required functionalities. Maybe you could describe in a broader fashion, what kind of statistics you need calculate and how they depend on each other?
Anyway, if I had to count certain statistics, I would instantly turn to creating separate class for each of them. I did once, when I was writing code statistics library for python. Every statistic, like how many times class is inherited or how often function was called, was a separate class. This way each of them was simple, however I didn't need to use any of them in the other.

I can think of a couple of solutions. One would be to simply store values in an array with an enum like so:
StatisticType = enum('AveragePerDay','MedianPerDay'...)
Another would be to use a inheritance like so:
class StatisticBase
....
class AveragePerDay ( StatisticBase )
...
class MedianPerDay ( StatisticBase )
...
There is no hard and fast rule on "too many", however a guideline is that if the list of fields, properties, and methods when collapsed, is longer than a single screen full, it's probably too big.

It's a common anti-pattern for a class to become "too fat" (have too much functionality and related state), and while this is commonly observed about "base classes" (whence the "fat base class" monicker for the anti-pattern), it can really happen without any inheritance involved.
Many design patterns (DPs for short_ can help you re-factor your code to whittle down the large, untestable, unmaintainable "fat class" to a nice package of cooperating classes (which can be used through "Facade" DPs for simplicity): consider, for example, State, Strategy, Memento, Proxy.
You could attack this problem directly, but I think, especially since you mention in a comment that you're looking at it as a general class design topic, it may offer you a good opportunity to dig into the very useful field of design patterns, and especially "refactoring to patterns" (Fowler's book by that title is excellent, though it doesn't touch on Python-specific issues).
Specifically, I believe you'll be focusing mostly on a few Structural and Behavioral patterns (since I don't think you have much need for Creational ones for this use case, except maybe "lazy initialization" of some of your expensive-to-compute state that's only needed in certain cases -- see this wikipedia entry for a pretty exhaustive listing of DPs, with classification and links for further explanations of each).

Since you are asking about best practices you might want to check out pylint (http://www.logilab.org/857). It has many good suggestions about code style including ones relating to how many private variables in a class.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.