TDD with large data in Python - python

I wonder if TDD could help my programming. However, I can't apply it straightforwardly, since most of my functions take large network objects (many nodes and links) and do operations on them, or even read SQL tables.
Most of the time it's not really the logic that breaks (i.e. not semantic bugs), but rather some function calls after refactoring :)
Do you think I can use TDD with this kind of data? What do you suggest for it (mock frameworks, etc.)?
Would I somehow take real data, process it with a function, validate the output, save the input/output states to some kind of mock object, and then write a test against it? I mean, just in case I cannot provide hand-made input data.
I haven't started TDD yet, so references are welcome :)

You've pretty much got it. Database testing is done by starting with a clean, up-to-date schema and adding a small amount of known, fixed data into the database. You can then do operations on this controlled environment, knowing what results you expect to see.
Working with network objects is a bit more complex, but it normally involves stubbing them (i.e. removing the inner functionality entirely) or mocking them so that a fixed set of known data is returned.
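As a minimal sketch of that kind of stubbing with unittest.mock (the count_nodes function and the loader it takes are hypothetical, not from your code):

from unittest import mock

# Hypothetical code under test: it asks a loader for a network object
# and works on it.
def count_nodes(loader):
    graph = loader()
    return len(graph.nodes)

def test_count_nodes_with_stubbed_graph():
    # Stub out the expensive part: return a tiny, fixed graph instead of
    # building a real network or hitting the database.
    fake_graph = mock.Mock()
    fake_graph.nodes = ["a", "b", "c"]
    fake_loader = mock.Mock(return_value=fake_graph)
    assert count_nodes(fake_loader) == 3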
There is always a way to test your code. If it's proving difficult, it's normally the code design that needs some rethinking.
I don't know any Python-specific TDD resources, but a great resource on TDD in general is "Test Driven Development: A Practical Guide" by David Astels (part of the Coad series). It uses Java as the language, but the principles are the same.

most of my functions take large network objects
Without knowing anything about your code, it is hard to assess this claim, but you might want to redesign your code so it is easier to unit test, by decomposing it into smaller methods. Although some high-level methods might deal with those troublesome large objects, perhaps low-level methods do not. You can then unit test those low-level methods, relying on integration tests to test the high-level methods.
Edited:
Before getting to grips with TDD you might want to try just adding some unit tests.
Unit testing is about testing at the fine-grained level: see my answer to the question How do you unit test the real world.
You might have to introduce some indirection into your program, to isolate parts that are impossible to unit test.
You might find it useful to decompose your data into an assembly of smaller classes, which can be tested individually.
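As a sketch of that decomposition (the function names and the shape of the network object are made up for illustration):

# High-level method: awkward to unit test because it needs a real,
# fully populated network object.
def summarize_network(network):
    degrees = [len(node.links) for node in network.nodes]
    return average_degree(degrees)

# Low-level helper: a pure function that is trivial to unit test with
# hand-made input.
def average_degree(degrees):
    return sum(degrees) / len(degrees) if degrees else 0.0

def test_average_degree():
    assert average_degree([2, 4]) == 3.0
    assert average_degree([]) == 0.0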


Automatically detect non-deterministic behaviour in Python

This may be impossible, but I am just wondering if there are any tools to help detect non-deterministic behaviour when I run a Python script. Some fancy options in a debugger perhaps? I guess I am imagining that theoretically it might be possible to compare the stack instruction-by-instruction or something between two subsequent runs of the same code, and thus pick out where any divergence begins.
I realise that a lot is going on under the hood though, so that this might be far too difficult to ask of a debugger or any tool...
Essentially my problem is that I have a test failing occasionally, almost certainly because somewhere the code relies accidentally on the ordering of output from iterating over a dictionary, or some such thing where the ordering isn't actually guaranteed. I'd just like a tool to help me locate the source of this sort of problem :).
The following question is similar, but there was not much suggestion of how to really deal with this in an automated or general way: Testing for non-deterministic behavior of python function
I'm not aware of a way to do this automatically, but what I would recommend is running the tests repeatedly and automatically (overnight?) until you get a failure, and starting a debugger at that point. You can then examine the variables and see if anything stands out. If you're using pytest, running with the --pdb flag will start a debugger on failure.
You might also consider using Hypothesis to run generative test cases.
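A minimal Hypothesis sketch (the summarize function and the property being checked are hypothetical):

from hypothesis import given, strategies as st

# Hypothetical function under test: its result should not depend on the
# insertion order of the dictionary.
def summarize(counts):
    return sorted(counts.items())

@given(st.dictionaries(st.text(), st.integers()))
def test_summary_is_order_independent(counts):
    # Rebuilding the dict with the items in reverse insertion order
    # should not change the result.
    reordered = dict(reversed(list(counts.items())))
    assert summarize(counts) == summarize(reordered)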
You might also consider running the tests over and over, collecting the output of each run (success or failure). When you have a representative sample, compare a failing run with a passing one, paying particular attention to the order in which the tests were run!
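A rough sketch of that approach, assuming the suite runs under pytest:

import subprocess

# Run the suite repeatedly, saving each run's output so that a failing
# run can later be diffed against a passing one.
for i in range(50):
    result = subprocess.run(["pytest", "-v"], capture_output=True, text=True)
    status = "ok" if result.returncode == 0 else "FAILED"
    with open(f"run_{i:03d}_{status}.log", "w") as f:
        f.write(result.stdout)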
I'm fairly certain this is fundamentally impossible with our current understanding of computation and automata theory. I could be wrong though.
Anyone more directly knowledgeable (rigorous background) feel free to pipe in, most of what follows is self-taught and heavily based on professional observations over the last decade doing Systems Engineering/SRE/automation.
Modern day computers are an implementation of Automata Theory and Computation Theory. At the signal level, they require certain properties to do work.
Signal Step, Determinism and Time Invariance are a few of those required properties.
Deterministic behavior, and deterministic properties rely on there being a unique solution, or a 1:1 mapping of directed nodes (instructions & data context) on the state graph from the current state to the next state. There is only one unique path. Most of this is implemented and hidden away at low level abstractions (i.e. the Signal, Firmware, and kernel/shell level).
Non-deterministic behavior is the absence of the determinism signal property. This can be introduced into any interconnected system due to a large range of issues (i.e. hardware failures, strong EM fields, Cosmic Rays, and even poor programming between interfaces).
Anytime determinism breaks, computers are unable to do useful work; the scope may be limited depending on where it happens. Usually it will either be caught as an error and the program or shell will halt, or it will continue running indefinitely or produce bogus data, both because of the class of problem it turns into and because of the fundamental limitations on the types of problems Turing machines (i.e. computers) can solve.
Please bear in mind that I am not a Computer Science major, nor do I hold a degree in Computer Engineering or a related IT field; I'm self-taught, no degree.
Most of this explanation has been driven by doing years of automation, segmenting problem domains, design and seeking a more generalized solution to many of the issues I ran into, mostly to come to a better usage of my time (hence this non-rigorous explanation).
The class of non-deterministic behavior is the most costly type of error I've run into, because this behavior is the absence of the expected. There isn't a test for non-determinism as a set or group; you can only infer it from the absence of properties which you can test (at least interactively).
Normal computer behavior is emergent from the previously mentioned required signals-and-systems properties, and we see problems when they don't hold true and we can't quickly validate the system for non-determinism, because of its nature.
Interestingly, testing for the presence of those properties interactively is a useful shortcut: if the properties are not present, the problem falls into a class of troubles which we as human beings can solve but computers cannot. It can only effectively be done by humans, because automated approaches run into the halting problem and other more theoretical aspects which I didn't bother understanding during my independent studies.
Unfortunately, knowing how to test for these properties often does require a knowledgeable view of the systems and architecture being tested, spanning most abstraction layers (depending on where the problem originates).
More formal or rigorous material may frame this as NFAs vs. DFAs, with more complex vocabulary: nondeterministic versus deterministic finite automata.
The difference is basically the presence or absence of that 1:1 state map/path, which is what defines determinism.
Where most people trip up on this property, with regard to programming, is between interfaces, where the interface fails to preserve data and, by extension, this property; for example, accidentally using the empty or NULL state of an output field to mean more than one thing before it gets passed to another program.
A theoretical view of a shell program running a series of piped commands might look like this:
DFA -> OutInterface -> DFA -> OutInterface -> NFA -> crash / meaningless or unexpected data / infinite loop, etc. Depending on the code that comes after the NFA, the behavior varies unpredictably in indeterminable ways. (OutInterface being the shell pipe, '|'.)
For an actual example in the wild, ldd on recent versions of Linux had two such errors that injected non-determinism into the pipe. Trying to identify linked dependencies for an arbitrary binary, for use with a build system, was not possible with ldd because of this issue.
More specifically, the problems were in the in-memory structures, and also in the flattening of the output fields, which happened in a non-deterministic way that varied across different binaries.
Most of the material mentioned above is normally covered in an undergraduate compiler-design course; one can also find it in the dragon compiler book, which is what I did instead. It does require a decent background in math fundamentals (i.e. abstract algebra/linear algebra) to grok the basis and examples, and the signal properties are best described in Oppenheim's Signals and Systems.
Without knowing how to test that certain system properties hold true, you can easily waste months of labor trying to document and/or narrow the issue down. All you really have in those non-deterministic cases is a guess-and-check model/strategy, which becomes very expensive, especially if you don't realize it's an underlying systems-property issue.

Using annotate instead of model property

The annotate function is very useful for defining computed fields for each row of the table. Model properties can also be used for defining computed fields, but they are limited (they cannot be used for sorting, for instance).
Can a model property be completely replaced by an annotated field? When is it adequate to use each?
Differences in annotations, properties, and other methods
There are some cases where annotations are definitely better and easier than properties. These are usually calculations that are easy to make in the database and where the logic is easy to read.
Django's @property decorator, on the other hand, is a very easy and Pythonic way to write calculation logic into your models. Some think properties are neat; others think they should be burned and hidden away because they mix program logic into data objects, which increases complexity. I think that especially the @cached_property decorator is rather neat.
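As a sketch of the two approaches side by side (the Order model and its fields are made up):

from django.db import models
from django.db.models import ExpressionWrapper, F

class Order(models.Model):
    price = models.DecimalField(max_digits=8, decimal_places=2)
    quantity = models.IntegerField()

    @property
    def total(self):
        # Computed in Python, one instance at a time; cannot be used in
        # filter() or order_by().
        return self.price * self.quantity

# The same calculation as an annotation: computed in the database, so the
# queryset can be filtered and sorted on it.
expensive_first = Order.objects.annotate(
    total_db=ExpressionWrapper(
        F("price") * F("quantity"), output_field=models.DecimalField()
    )
).order_by("-total_db")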
Properties and annotations are not, however, the only ways to query and calculate things in the Django ORM.
In many complex cases, properties or Model, Manager, or QuerySet methods (especially with custom QuerySets and Managers) have the most flexibility when it comes to customizing your queries.
Most of the time it does not matter speed-wise which method you use, and you should pick the cleanest or most compact option that is the simplest to write and easiest to read. Keep it simple and stupid and you will have the least amount of complex code to maintain.
Exploring, benchmarking, and optimizing
In some cases, when you have performance problems and end up analyzing your SQL queries, you might be forced to use annotations and custom queries to optimize the queries you are making. This is especially true when you are making complex lookups in the database and have to resort to calculating things in properties or to writing custom SQL queries.
Calculating stuff in properties can be horrible for complexity if you have large querysets, because you have to run those calculations in Python where objects are large and iteration is slow. On the other hand, calculating stuff via custom SQL queries can be a nightmare to maintain, especially if the SQL you are maintaining is not, well, written by you.
In the end it comes down to the speed requirements and the cost of the calculation. If calculating in plain Python doesn't slow your service down or cost you money, you probably shouldn't optimize. If you are paying for a fleet of servers, then of course reasonable optimization might bring you savings that you can use elsewhere. Spending 10 hours optimizing some snippet might not pay for itself, so be very careful here.
In optimization cases you have to weigh the different upsides and downsides and try different solutions, unless you are a prophet and instinctively know what the problem is. If the problem were obvious, it would probably have been optimized away earlier, right?
When experimenting with different options, the Django Debug Toolbar, SQL EXPLAIN and ANALYZE, and Python profiling are your friends.
Remember that many query problems are also database related problems and you might be hurting your performance with poor database design or maintenance. Remember to run VACUUM periodically and try to normalize your database design.
Tool-wise, the Django Debug Toolbar is especially useful because it can help with both profiling and SQL analysis. Many IDEs, such as PyCharm, also offer profiling, even against a running server. This is pretty useful if you want to build out your development setup and integrate different tools into it.

Performance between Django and raw Python

I was wondering what the performance difference is between using plain python files to make web pages and using Django. I was just wondering if there was a significant difference between the two. Thanks
Django IS plain Python, so the execution time of each individual statement or expression will be the same. What needs to be understood is that many, many components are put together to offer several advantages when developing for the web:
Removal of common tasks into libraries (auth, data access, templating, routing)
Correctness of algorithms (cookies/sessions, crypto)
Decreased custom code (due to libraries) which directly influences bug count, dev time etc
Following conventions leads to improved team work, and the ability to understand code
Plug-ability; Create or find new functionality blocks that can be used with minimal integration cost
Documentation and help; many people understand the tech and are able to help (StackOverflow?)
Now, if you were to write your own site from scratch, you'd need to implement at least several components yourself. You also lose most of the above benefits unless you spend an extraordinary amount of time developing your site. Django, and other web frameworks for every other language, are designed to provide the common stuff, and let you get straight to work on business requirements.
If you ever banged out custom session code and data access code in PHP before the rise of web frameworks, you won't even think about the performance cost associated with a framework that makes your job interesting and easier.
Now, that said, Django ships with a LOT of components. It is designed in such a way that, most of the time, they won't affect you. Still, a surprising amount of code is executed for each request. If you build out a site with Django and the performance just doesn't cut it, you can feel free to remove all the bits you don't need. Or you can use a 'slim' Python framework.
Really, just use Django. It is quite awesome. It powers many sites millions of times larger than anything you (or I) will build. There are ways to improve performance significantly, like utilizing caching, rather than optimizing loops or custom middleware.
It depends on how your "plain Python" makes web pages. If it uses a templating engine, for instance, the performance of that engine is going to make a huge difference. If it uses a database, the kind of data access layer you use (in the context of the requirements for that layer) is going to make a difference.
The question, thus, becomes a question of whether your arbitrary (and presently unstated) toolchain choices have better runtime performance than the ones selected by Django. If performance is your primary, overriding goal, you certainly should be able to make more optimal selections. However, in terms of overall cost -- ie. buying more web servers for the slower-runtime option, vs buying more programmer-hours for the more-work-to-develop option -- the question simply has too many open elements to be answerable.
Premature optimisation is the root of all evil.
Django makes things extremely convenient if you're doing web development. That plus a great community with hundreds of plugins for common tasks is a real boon if you're doing serious work.
Even if your "raw" implementation is faster, I don't think it will be fast enough to seriously affect your web application. Build it using tools that work at the right level of abstraction and if performance is a problem, measure it and find out where the bottlenecks are and apply optimisations. If after all this you find out that the abstractions that Django creates are slowing your app down (which I don't expect that they will), you can consider moving to another framework or writing something by hand. You will probably find that you can get performance boosts by caching, load balancing between multiple servers and doing the "usual tricks" rather than by reimplementing the web framework itself.
Django is also plain Python.
You see, performance mostly relies on how efficient your code is.
Most performance issues in software arise from inefficient code rather than from the choice of tools or language, so the implementation matters. AFAIK Django does this excellently, and its performance is above the mark.

Where are the layman non-prerequisite introductory materials for everything-testing in Python?

This question has been modified
If I had to pick the kind of testing I would like to learn (I have no idea how this translates to Python), it is described here: http://butunclebob.com/ArticleS.UncleBob.TheThreeRulesOfTdd. I don't know if this is agile, extreme programming, or just called TDD. Is this unit testing, doc testing, a combination of both, or am I missing something? Is there something even better or similar, or is this style simply not applicable? I am looking for extreme-beginner material on how to do testing (specifically for Python) as described in my link. I am open to ideas and all Python resources. Thanks!
I pasted the original message here for reference
http://dpaste.com/274603/plain/
The simplest thing to start with is Python's unittest module. There are frameworks to do things more automatically later on, but unittest works fine for simple cases.
The basic idea is that you have a class for a particular test suite. You define a set of test methods, and optional setUp and tearDown methods to be executed before and after each test in that suite. Within each test, you can use various assert* methods to check that things work.
Finally, you call unittest.main() to run all the tests you've defined.
Have a look at this example: http://docs.python.org/library/unittest#basic-example
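For reference, a minimal sketch along the lines of the basic example there:

import unittest

class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual("foo".upper(), "FOO")

    def test_split(self):
        s = "hello world"
        self.assertEqual(s.split(), ["hello", "world"])
        # split should refuse a non-string separator
        with self.assertRaises(TypeError):
            s.split(2)

if __name__ == "__main__":
    unittest.main()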
What resources do you fellas have for starting out in testing in particular to Python? ... What are the most excellent places to start out at, anybody?
Step 1. Write less.
Step 2. Focus.
Step 3. Identify in clear, simple sentences what you are testing. Is it software? What does it do? What architecture does it run on? Specifically list the specific things you're actually going to test. Specific. Focused.
Step 4. For one thing you're going to test. One thing. Pick a requirement it must meet. One requirement. I.e., "Given x and y as input, computes z."
Looking at your question, I feel you might find this to be very, very hard. But it's central to testing. Indeed, this is all that testing is.
You must have something to test (a "fixture").
You must have requirements against which to test it (a "TestCase").
You must have measurable pass/fail criteria (an "assertion").
If you don't have that, you can't test. It helps to write it down in words. Short, focused lists of fixtures, cases and assertions. Short. Focused.
Once you've got one requirement, testing is just coding the requirements into a language that asserts the results of each test case. Nothing more.
Your unittest.TestCase uses setUp to create a fixture. A TestCase can have one or more test methods to exercise the fixture in different ways. Each test method has one or more assertions about the fixture.
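In unittest terms, that structure might look like the following sketch (the Stack class is a made-up fixture):

import unittest

class Stack:
    """Hypothetical object under test."""
    def __init__(self):
        self._items = []
    def push(self, item):
        self._items.append(item)
    def pop(self):
        return self._items.pop()

class TestStack(unittest.TestCase):
    def setUp(self):
        # The fixture: a known object rebuilt before every test method.
        self.stack = Stack()
        self.stack.push(1)

    def test_pop_returns_last_pushed(self):
        self.assertEqual(self.stack.pop(), 1)

    def test_push_then_pop_round_trips(self):
        self.stack.push(2)
        self.assertEqual(self.stack.pop(), 2)

if __name__ == "__main__":
    unittest.main()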
Once you have a test case which passes, you can move back to Step 4 and do one more requirement.
Once you have all the requirements for a fixture, you go back to Step 3 and do one more fixture.
Build up your tests slowly. In pieces. Write less. Focus.
I like this one and this one too
Testing is more of an art, and you get better through practice.
For myself, I have read the book Pragmatic Unit Testing in C# with NUnit, and I really recommend it to you.
Although it's about .NET and C# rather than Python, you can skip the exact code and try to understand the principles being taught.
It contains good examples of what one should look out for when designing tests, and also of what kind of code should be written to support the test-first paradigm.
Here's a link to C. Titus Brown's Software Carpentry notes on testing:
http://ivory.idyll.org/articles/advanced-swc/#testing-your-software
It talks basics about testing, including ideas of how to test your code (and retrofitting tests onto existing code). Seems up your alley.

Does OOP make sense for small scripts?

I mostly write small scripts in python, about 50 - 250 lines of code. I usually don't use any objects, just straightforward procedural programming.
I know OOP basics and I have used objects in other programming languages before, but for small scripts I don't see how objects would improve them. Maybe that is just my limited experience with OOP.
Am I missing something by not trying harder to use objects, or does OOP just not make a lot of sense for small scripts?
I use whatever paradigm best suits the issue at hand -- be it procedural, OOP, functional, and so on. Program size is not a criterion, though (by a small margin) a larger program may be more likely to take advantage of OOP's strengths -- multiple instances of a class, subclassing and overriding, special method overloads, OOP design patterns, etc. Any of these opportunities can perfectly well occur in a small script; there's just a somewhat higher probability that they will occur in a larger one.
In addition, I detest the global statement, so if the natural procedural approach would require it, I will almost invariably switch to OOP instead -- even if the only advantage is the ability to use a qualified name instead of the bare name which would require global.
There's definitely no need to "try harder" in general -- just ask yourself "is there an opportunity here to use (a) multiple instances (etc etc)" and it will soon become second nature, i.e., you'll spot the opportunities without needing to consciously remind yourself every time to look for them, and your programming will improve as a result.
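As a tiny sketch of that trade-off (the counter is just an illustration):

# Procedural version: mutating module-level state requires `global`.
_count = 0

def record_hit():
    global _count
    _count += 1

# OOP version: the same state lives on an instance and is reached through a
# qualified name (self.count), with no global statement needed.
class HitCounter:
    def __init__(self):
        self.count = 0

    def record_hit(self):
        self.count += 1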
Object-Oriented Programming, while useful for representing systems as real-world objects (and hopefully making large software systems easier to understand), is not a silver bullet for every problem (despite what some people teach).
If your system does not benefit from what OOP provides (things such as data abstraction, encapsulation, modularity, polymorphism, and inheritance), then it would not make sense to incur all the overhead of doing OOP. However, if you find that as your system grows these things become a bigger concern to you, then you may want to consider moving to an OOP solution.
Edit: As an update, you may want to head over to Wikipedia to read the articles on various criticisms of OOP. Remember that OOP is a tool, and just like you wouldn't use a hammer for everything, OOP should not be used for everything. Consider the best tool for the job.
One of the unfortunate habits developed with OOP is objectophrenia: the delusion of seeing objects in every piece of code we write.
The reason that happens is our delusion of believing in the existence of a unified objects theorem.
Every piece of code you write, you begin to see as a template for objects and how they fit into your personal scheme of things. Even though it might be a small task at hand, we get tempted by the question: is this something I could place into my class repository which I could also use in the future? Do I see a pattern here with code I have previously written, and with code which my object clairvoyance tells me I will one day write? Can I structure my present task into one of these patterns?
It is an annoying habit. Frequently, it is better not to have it. But when you find that every bit of code you write somehow falls into patterns and you refactor/realign those patterns until it covers most of your needs, you tend to get a feeling of satisfaction and accomplishment.
Problems begin to appear when a programmer gets delusional (obsessive-compulsive object-oriented disorder) and does not realise that there are exceptions to patterns, and that over-manipulating patterns to cover more cases is wrong. It's like my childhood obsession with trying to cover a piece of bread completely with butter or jam every morning at breakfast. Sometimes it is just better to leave the object-oriented perception behind and perform the task at hand quick and dirty.
The accepted industry adage of 80-20 might be a good measure. Using this adage differently from how it is normally perceived, we could say: 80% of the time, keep an object-oriented perception; 20% of the time, code it quick and dirty.
Be immersed in objects, but eventually you have to resist their consuming you.
You probably have not done enough programming yet, because if you had, you would see all the patterns in what you had done, and you would also begin to believe in patterns that you have yet to apply. When you begin to see such objectophrenia visions, it's time to be careful not to be consumed by them.
If you plan to use the script independently, then no. However if you plan to import it and reuse some of it, then yes. In the second case, it's best to write some classes providing the functionality that's required and then have a conditional run (if __name__=='__main__':) with code to execute the "script" version of the script.
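A sketch of that layout (the class and function names are made up):

# mytool.py -- importable as a module, runnable as a script.

class ReportBuilder:
    def __init__(self, path):
        self.path = path

    def build(self):
        # Real work would go here.
        return f"report for {self.path}"

def main():
    # The "script" version: only runs when executed directly, not when the
    # module is imported for reuse.
    print(ReportBuilder("access.log").build())

if __name__ == "__main__":
    main()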
My experience is that any purely procedural script longer than a few dozen lines becomes difficult to maintain. For one thing, if I'm setting or modifying a variable in one place and using it in another place, and those two places can't fit on a single screen, trouble will follow.
The answer, of course, is to tighten the scope and make the different parts of your application more encapsulated. OOP is one way to do that, and can be a useful way to model your environment. I like OOP, as I find I can mentally jump from thinking about how the inside of a particular object will work, to thinking about how the objects will work together, and I stay saner.
However, OOP is certainly not the only way to make your code better encapsulated; another approach would be a set of small, well-named functions with carefully defined inputs and outputs, and a master script that calls those functions as appropriate.
OOP is a tool to manage complexity in code; 50-250 lines of code are rarely that complicated. Most scripts I have written are primarily procedural. So yes, for small scripts just go with procedural programming.
Note that for significantly complicated scripts OOP may be more relevant, but there is still no hard and fast rule that says to use OOP for them. At that point it is a matter of personal preference.
Use the right tool for the right job. For small scripts that don't require complex data structures and algorithms, there is likely no use for object oriented concepts.
Am I missing something by not trying harder to use objects, or does OOP just not make a lot of sense for small scripts?
Objects buy you encapsulation and reuse (through inheritance). Neither is likely to be terribly useful when writing small scripts. When you have written a collection of similar scripts or you find yourself repeatedly changing your scripts, then maybe you should consider where objects might help.
In your case I'd say that OOP would be helpful only if it makes the scripts more readable and understandable. If not, you probably don't need to bother.
OOP is just another paradigm. A lot of problems can be solved using both procedural or OOP.
I use OOP when I see a clear need for inheritance in the code I am writing; it's easier to manage common behaviour and common attributes.
It sometimes makes the code easier to understand and manage, even if the code is small.
Another benefit of OOP is communicating intent (whether to other developers, managers, or yourself at some point in the future). If the script is small enough that it can be fully communicated in a couple of sentences, then OOP is probably not necessary, in my opinion.
Using OOP for a few hundred lines of code rarely makes sense. But if your script is useful, it will probably grow rather quickly as new features are added. If that is the case, it is better to start coding in an OOP way; it will pay off in the long run.
First of all - what do you mean by objects? In Python functions are objects and you're most likely using them. :)
If by objects you mean classes and instances thereof, then I would say something obvious: no, there is no reason to say that using them makes your code better by itself. In small scripts there is not going to be any leverage coming from sophisticated OO design.
OOP is about what you get if you add polymorphism on top of modular programming.
The latter promotes low coupling, encapsulation, separation of responsibility, and some other concepts that usually produce code that is short, expressive, maintainable, flexible, extensible, reusable and robust.
This is not so much a question about size as about the length of the software's life cycle. If you write any code, as short as it may be, that is complex enough that you don't want to rewrite it when your requirements change, it is important that it meets the aforementioned criteria.
OOP makes modular programming easier in that it has established solutions for implementing the concepts promoted by modular programming, and in that polymorphism allows really low coupling through dependency injection.
I personally find it simpler to use OOP to achieve modularity (and reusability in particular), but I guess, that is a matter of habit.
To put it in one sentence: OOP will not help you solve a given problem better than procedural programming would, but it yields a solution that is easier to apply to other problems.
It really depends on what the script is, what it's doing, and how you think about the world.
Personally after a script has made it past 20-30 lines of code I can usually find a way that OOP makes more sense to me (especially in Python).
For instance, say I'm writing a script that parses a log file. Well, conceptually I can imagine this "log parser" machine... I can throw all these sheets of paper into it and it will sort them, chop parts out of some pages and paste them onto another and eventually hand me a nice report.
So then I start thinking, well, what does this parser do? Well, first off he's (yes, the parser is a he. I don't know how many of my programs are women, but this one is definitely a guy) going to read the pages, so I'll need a method called page reader. Then he's going to find all of the data referring to the new Frobnitz process we're using. Then he's going to move all the references about the Frobnitz process to appear next to the Easter Bunny graph. Ooh, so now I need a findeasterbunny method. After he's done that, then he's going to take the rest of the logs, remove every 3rd word, and reverse the order of the text. So I'll need a thirdwordremover and a textreversal method, too. So an empty shell class would look like so:
class LogParser(object):
    def __init__(self):
        # do self stuff here
        pass

    def pageReader(self):
        # do the reading stuff here, probably call some of the other functions
        pass

    def findFrobnitz(self):
        pass

    def findEasterBunny(self):
        pass

    def thirdWordRemover(self):
        pass

    def textReversal(self):
        pass
That's a really contrived example, and honestly probably not a situation I'd use OOP for... but it really just depends on what's easiest for me to comprehend at that particular moment in time.
"Script" means "sequential" and "procedural". It's a definition.
All of the objects your script deals with are -- well -- objects. All programming involves objects. The objects already exist in the context in which you're writing your script.
Some languages allow you to clearly identify the objects. Some languages don't clearly identify the objects. The objects are always there. It's a question of whether the language makes it clear or obscure.
Since the objects are always there, I find it helps to use a language that allows clear identification of the objects, their attributes, methods and relationships. Even for short "scripts", I find that explicit objects and an OO language helps.
The point is this.
There's no useful distinction between "procedural", "script" and "OO".
It's merely a shift in emphasis. The objects are always there. The world is inherently object-oriented. The real question is "Do you use a language that makes the objects explicit?"
As someone who writes a lot of scripts: if you get the idea that your code may at some point go beyond 250 lines, start to go OOP. I use a lot of VBA and VBScript, and I would agree that anything under 100 lines is usually pretty straightforward, and taking the time to plan a good OOP design is just a waste.
That being said, I have one script that came to about 500+ lines, and looking back on it, because I didn't do it OOP it quickly turned into an unholy mess of spaghetti. So now, for anything over 200 lines, I make sure I have a good OOP plan ahead of time.
