For those who know perl, I'm looking for something similar to Test::Deep::is_deeply() in Python.
In Python's unittest I can conveniently compare nested data structures already, if I expect them to be equal:
self.assertEqual(os.walk('some_path'),
                 my.walk('some_path'),
                 "compare os.walk with my own implementation")
However, in the test I actually want, the order of files in the respective sublists of the os.walk() tuples should be of no concern.
If it were just this one test, it would be OK to code an easy solution. But I envision several tests on differently structured nested data, and I am hoping for a general solution.
I checked Python's own unittest documentation, looked at pyUnit, and at nose and its plugins. Active maintenance is also an important criterion for me.
The ultimate goal for me would be to have a set of descriptive types like UnorderedIterable, SubsetOf, SupersetOf, etc., which can be used to describe a nested data structure, and then use that description to compare two actual sets of data.
In the os.walk example I'd like something like:
comparison = OrderedIterable(
    OrderedIterable(
        str,
        UnorderedIterable(),
        UnorderedIterable()
    )
)
The above describes the kind of data structure that list(os.walk()) would return. For comparison of data A and data B in a unit test, the current path names would be cast into a str(), and the dir and file lists would be compared ignoring the order with:
self.assertDeep(A, B, comparison, msg)
Is there anything out there? Or is it such a trivial task that people write their own? I feel comfortable doing it, but I don't want to reinvent the wheel, and I especially would not want to code the full orthogonal set of types, tests for those, etc. In short, I wouldn't publish it, and thus the next person would have to rewrite it again...
Python Deep seems to be a project to reimplement Perl's Test::Deep. It is written by the author of Test::Deep himself. Last development happened in early 2016.
Update (2018/Aug): the latest release (2016/Feb) is located on PyPI/Deep.
I have done some Python 3 porting work on GitHub.
Not a solution, but the currently implemented workaround to solve the particular example listed in the question:
os_walk = list(os.walk('some_path'))
dt_walk = list(my.walk('some_path'))
self.assertEqual(len(dt_walk), len(os_walk), "walk() same length")
for ((osw, osw_dirs, osw_files), (dt, dt_dirs, dt_files)) in zip(os_walk, dt_walk):
    self.assertEqual(dt, osw, "walk() currentdir")
    self.assertSameElements(dt_dirs, osw_dirs, "walk() dirlist")
    self.assertSameElements(dt_files, osw_files, "walk() fileList")
As this example implementation shows, that's quite a bit of code. It also shows that Python's unittest has most of the required ingredients.
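If you want to push this a little further towards the assertDeep idea from the question, the checks can be wrapped in small comparison classes. The following is only a rough, untested sketch (the class names follow the question; assertCountEqual is the modern unittest name for the order-insensitive comparison used above as assertSameElements):

class UnorderedIterable:
    """Match two iterables regardless of element order."""
    def check(self, tc, a, b, msg=None):
        tc.assertCountEqual(a, b, msg)

class OrderedIterable:
    """Match element-wise; a single spec is applied to every element,
    several specs are applied positionally (e.g. to a tuple)."""
    def __init__(self, *specs):
        self.specs = specs
    def check(self, tc, a, b, msg=None):
        a, b = list(a), list(b)
        tc.assertEqual(len(a), len(b), msg)
        specs = self.specs if len(self.specs) > 1 else self.specs * len(a)
        for spec, x, y in zip(specs, a, b):
            if isinstance(spec, type):   # a plain type like str: compare the values directly
                tc.assertEqual(x, y, msg)
            else:
                spec.check(tc, x, y, msg)

class DeepAssertMixin:
    """Mix into a TestCase to get an assertDeep() in the style of the question."""
    def assertDeep(self, a, b, comparison, msg=None):
        comparison.check(self, a, b, msg)

With a mixin like this, the walk() comparison would reduce to self.assertDeep(list(os.walk(p)), list(my.walk(p)), comparison, msg), with comparison built as shown in the question.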
In my current project, I have been receiving data from a set of long-range sensors, which send their data as a series of bytes. Because there are multiple types of sensors, the byte structures and the data they contain differ, hence the need to make the functionality more dynamic so as to avoid having to hard-code every single setup in the future (which is not practical).
The server will be using Django, which I believe is irrelevant to the issue at hand but I have mentioned just in case it might have something that can be used.
The bytes data I am receiving looks like this:
b'B\x10Vu\x87%\x00x\r\x0f\x04\x01\x00\x00\x00\x00\x1e\x00\x00\x00\x00ad;l'
And my current process looks like this:
Take the first bytes to get the deviceID (deviceID = val[0:6].hex())
Look up the format to be used in struct.unpack() (here: >BBHBBhBHhHL, after removing the first six bytes for the ID).
Now, the issue is the next step. Many of the values need different forms of pre-processing. For example, some values need a join (e.g. ".".join(str(values[2]))), others need simple mathematical changes (-113 + 2 * values[4]), and others need a simple logic check (values[7] == 0x80) to return a boolean value.
My question is: what's the best way to code those methods? I would really like to avoid hardcoding them, but it almost seems like the best idea. Another idea I saw was to store the functionality as a string and execute it, such as seen here, but I've been reading that it's a very bad idea and that it also slows down execution. The last idea I had was to hardcode some general functions only and use something similar to here, but this doesn't solve the issue of having to hard-code every new sensor type, which is not realistic in a live installation. Are there any better methods to achieve the same thing?
I have also looked here, with the idea that some of the functionality could be expressed as an equation, but I don't see that as a possibility for every case, especially when any string manipulation is needed.
Additionally, is there a possibility of using some maths to apply some basic string manipulation? I can hard-code one string manipulation maybe, but to be honest this whole thing has been bugging me...
Finally, if I do go with storing the functions as strings and then executing them, is there a way to add some "security" to avoid malicious exploitation? Such a method is... awfully insecure, to say the least.
However, after almost a week of searching I am so far unable to find a better solution than storing functions as strings and running eval on them, despite not liking that option. If anyone finds a better option, I would be extremely grateful for any tips or ideas.
Addendum: minimal code that can be used to showcase and test the different methods:
import struct

def decode(input):
    val = bytearray(input)
    deviceID = val[0:6].hex()
    del(val[0:6])
    print(deviceID)
    values = list(struct.unpack('>BBHBBhBHhHL', val))
    print(values)
    # Now what?

decode(b'B\x10Vu\x87%\x00x\r\x0f\x04\x01\x00\x00\x00\x00\x1e\x00\x00\x00\x00ad;l')
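To illustrate the "hardcode some general functions only" idea mentioned above, here is a rough, untested sketch of what I have in mind: a small library of named transforms plus a per-sensor description that refers to them by name. All field names and the example configuration are made up, and the configuration could just as well live in the Django database or a JSON file instead of a dict:

import struct

# Small library of named, hard-coded transforms (the examples from above).
TRANSFORMS = {
    "identity": lambda v: v,
    "rssi":     lambda v: -113 + 2 * v,        # simple mathematical change
    "flag":     lambda v: v == 0x80,           # logic check returning a boolean
    "dotted":   lambda v: ".".join(str(v)),    # the join example
}

# Hypothetical per-sensor description, keyed by deviceID.
SENSOR_TYPES = {
    "421056758725": {                          # deviceID of the example packet
        "format": ">BBHBBhBHhHL",
        "fields": [("version", 2, "dotted"),
                   ("signal",  4, "rssi"),
                   ("alarm",   7, "flag")],
    },
}

def decode(raw):
    val = bytearray(raw)
    device_id = val[0:6].hex()
    cfg = SENSOR_TYPES[device_id]              # real code would need a fallback here
    values = struct.unpack(cfg["format"], val[6:])
    return device_id, {name: TRANSFORMS[tname](values[idx])
                       for name, idx, tname in cfg["fields"]}

print(decode(b'B\x10Vu\x87%\x00x\r\x0f\x04\x01\x00\x00\x00\x00\x1e\x00\x00\x00\x00ad;l'))

This at least keeps eval out of the picture: new sensor types only need new configuration entries, and genuinely new kinds of pre-processing need a new named transform.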
I'm a total amateur/hobbyist developer trying to learn more about testing the software I write. While I understand the core concept of testing, as the functions get more complicated, I feel as though it's a rabbit hole of variations, outcomes, conditions, etc. For example...
The function below reads files from a directory into a Pandas DataFrame. A few column adjustments are made before the data is passed to a different function that ultimately imports the data into our database.
I've already coded a test for the convert_date_string function. But what about this entire function as a whole - how do I write a test for it? In my mind, much of the Pandas library is already tested - so making sure core functionality there works with my setup seems like a waste. But maybe it isn't. Or maybe this is a refactoring question, and the function should be broken down into smaller parts?
Anyway, here is the code... any insight would be appreciated!
def process_file(import_id=None):
    all_files = glob.glob(config.IMPORT_DIRECTORY + "*.txt")
    if len(all_files) == 0:
        return []
    import_data = (pd.read_csv(f, sep='~', encoding='latin-1',
                               warn_bad_lines=True, error_bad_lines=False,
                               low_memory=False) for f in all_files)
    data = pd.concat(import_data, ignore_index=True, sort=False)
    data.columns = [col.lower() for col in data.columns]
    data = data.where((pd.notnull(data)), None)
    data['import_id'] = import_id
    data['date'] = data['date'].apply(lambda x: convert_date_string(x))
    insert_data_into_database(data=data, table='sales')
    return all_files
There are mainly two kinds of tests: proper unit tests, and integration tests.
Unit tests, as the name implies, test "units" of your program (functions, classes...) in isolation (without considering how they interact with other units). This of course requires that those units can be tested in isolation. For example, a pure function (a function that computes a result from its inputs, where the result depends only on the inputs and is always the same for the same inputs, and which doesn't have any side effects) is very easy to test, while a function that reads data from a hardcoded path on your filesystem, makes HTTP requests to a hardcoded URL and updates a database (whose connection data are also hardcoded) is almost impossible to test in isolation (and actually almost impossible to test at all).
So the first point is to write your code with testability in mind: favour small, focused units with a single clear responsibility and as few dependencies as possible (preferably taking their dependencies as arguments so you can pass a mock instead). This is of course a bit of a platonic ideal, but it's a worthy goal nonetheless. As a last resort, when you cannot get rid of dependencies or parameterize them, you can use a package like mock that will replace your dependencies with bogus objects exposing a similar interface.
Integration testing is about testing whole subsystems from a much higher level - for example for a website project, you may want to test that if you submit the "contact" form an email is sent to a given address and that the data are also stored in the database. You obviously want to do so with a disposable test database and a disposable test mailbox.
The function you posted is possibly doing a bit too much - it reads files, builds a Pandas DataFrame, applies some processing, and stores things in a database. You may want to try and factor it into smaller functions - one to get the file list, one to collect data from the files, one to process the data, etc.; you already have the one that stores the data in the database - and rewrite your process_file (which is actually doing more than processing) to call those functions. This will make it easier to test each part in isolation. Once done with this, you can use mock to test the process_file function and check that it calls the other functions with the expected arguments, or run it against a test directory and a test database and check the results in the database.
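As an illustration only, here is a rough sketch of that split and of a test for the orchestration; the helper names are invented, the module is assumed to be called sales_import, and the same imports and helpers as in the question (glob, pandas as pd, config, convert_date_string, insert_data_into_database) are assumed to be available:

# Sketch of the refactoring: each piece can now be tested on its own.
def find_import_files():
    return glob.glob(config.IMPORT_DIRECTORY + "*.txt")

def load_files(files):
    frames = (pd.read_csv(f, sep='~', encoding='latin-1', low_memory=False)
              for f in files)
    return pd.concat(frames, ignore_index=True, sort=False)

def clean_data(data, import_id):
    data.columns = [col.lower() for col in data.columns]
    data = data.where(pd.notnull(data), None)
    data['import_id'] = import_id
    data['date'] = data['date'].apply(convert_date_string)
    return data

def process_file(import_id=None):
    all_files = find_import_files()
    if not all_files:
        return []
    data = clean_data(load_files(all_files), import_id)
    insert_data_into_database(data=data, table='sales')
    return all_files

# A test for the orchestration alone, with every collaborator mocked out
# (in the test module: from sales_import import process_file).
from unittest import mock

@mock.patch('sales_import.insert_data_into_database')
@mock.patch('sales_import.clean_data')
@mock.patch('sales_import.load_files')
@mock.patch('sales_import.find_import_files', return_value=['a.txt'])
def test_process_file_inserts_cleaned_data(find_files, load, clean, insert):
    assert process_file(import_id=42) == ['a.txt']
    insert.assert_called_once_with(data=clean.return_value, table='sales')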
In general, I wouldn't go down the road of testing pandas or any other dependency. The way I see it, as long as I make sure that a package I use is well developed and well supported, writing tests for it would be redundant. Pandas is a very well supported package.
As to your question about the specific function, and your interest in testing in general, I would highly recommend checking out the Hypothesis Python package (you're in luck - it's currently available only for Python). It provides mock data and generates edge cases for testing purposes.
An example from their docs:
from hypothesis import given
from hypothesis.strategies import text

@given(text())
def test_decode_inverts_encode(s):
    assert decode(encode(s)) == s
Here you tell it that the function needs to receive text as input, and the package will run it multiple times with different values that meet that criterion. It will also try all kinds of edge cases.
It can do much more once implemented.
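For example, a property-based test of the convert_date_string helper from the question might look roughly like this; the '%m/%d/%Y' input format is only a guess and should be replaced by whatever the import files actually contain:

from hypothesis import given
from hypothesis import strategies as st

@given(st.dates())
def test_convert_date_string_handles_any_date(d):
    # Assumed input format; adjust to the real one used in the import files.
    assert convert_date_string(d.strftime('%m/%d/%Y')) is not None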
I often need to create two versions of an ipython notebook: One contains tasks to be carried out (usually including some python code and output), the other contains the same text plus solutions. Let's call them the assignment and the solution.
It is easy to generate the solution document first, then strip the answers to generate the assignment (or vice versa). But if I subsequently need to make changes (and I always do), I need to repeat the stripping process. Is there a reasonable workflow that will allow changes in the assignment to be propagated to the solutions document?
Partial self-answer: I have experimented with leveraging mercurial's hg copy, which will let two files with different names share history. But I can only get this to work if assignment and solution are in different directories, in two linked hg repositories. I would much prefer a simpler set-up. I've also noticed that diff gets very confused when one JSON file has more sections than another, making a VCS-based solution even less attractive. (To be clear: Ordinary use of a VCS with notebooks is fine; it's the parallel versions that stumble).
This question covers similar ground, but does not solve my problem. In fact an answer to my question would solve the OP's second remaining problem, "pulling changes" (see the Update section).
It sounds like you are maintaining an assignment and an answer key of some kind and want to be able to distribute the assignments (without solutions) to students, and still have the answers for yourself or a TA.
For something like this, I would create two branches, "unsolved" and "solved". First write the questions on the "unsolved" branch. Then create the "solved" branch from there and add the solutions. If you ever need to update a question, switch back to the "unsolved" branch, make the update, then merge the change into "solved" and fix the solution.
You could try going the other way, but my hunch is that going "backwards" from solved to unsolved might be strange to maintain.
After some experimentation I concluded that it is best to tackle this by processing the notebook's JSON code. Version control systems are not the right approach, for the following reasons:
JSON doesn't diff very well when adding or deleting cells. A minimal change leads to mis-matched braces and a very messy diff.
In my use case, the superset version of the file (containing both the assignments and their solutions) must be the source document. This is because the assignment includes example code and output that depends on earlier parts, to be written by the students. This model does not play well with version control, as pointed out by @ChrisPhillips in his answer.
I ended up filtering the JSON structure for the notebook and stripping out the solution cells; they may be recognized via special metadata (which can be set interactively using the metadata button in the interface), or by pattern-matching on the cell contents. The following snippet shows how to filter out cells whose first line starts with # SOLUTION:
import json
import re

def stripcell(cell, pattern):
    """Check if the first line of the cell's content matches `pattern`"""
    if cell["cell_type"] == "code":
        content = cell["input"]
    else:
        content = cell["source"]
    return ( len(content) > 0 and re.search(pattern, content[0]) )

pattern = r"^# SOLUTION:"
struct = json.load(open("input.ipynb"))
cells = struct["worksheets"][0]["cells"]
struct["worksheets"][0]["cells"] = [ c for c in cells if not stripcell(c, pattern) ]
json.dump(struct, open("output.ipynb", "w"), indent=1)
I used the generic json library rather than the notebook API. If there's a better way to go about it, please let me know.
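For completeness, the metadata-based recognition mentioned above could look roughly like this, assuming a custom boolean "solution" entry was added to the cell metadata via the metadata editor:

def is_solution_cell(cell):
    """Return True if the cell was tagged as a solution via its metadata."""
    return bool(cell.get("metadata", {}).get("solution", False))

struct["worksheets"][0]["cells"] = [c for c in cells if not is_solution_cell(c)]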
Recently, I have been working on a Python project with usual directory structure, and have received help from someone else who has given me a code snippet (a single function definition, about 30 lines long) which I would like to import into my code. What is the most proper directory/location in a Python project to store borrowed code of this size? Is it best to store the snippet into an entirely different module and import it from there?
I generally find it easiest to put such code in a separate file, because for clarity you don't want more than one set of copyright/licensing terms to apply within a single file. So in Python this does indeed mean a separate module. Then the file can contain whatever attribution and other legal boilerplate you need.
As long as your file headers don't accidentally claim copyright on something to which you do not own the copyright, I don't think it's actually a legal problem to mix externally-licensed or public domain code into files you mostly own. I may be wrong, though, which is why I normally avoid giving myself reason to think about it. A comment saying "this is external code from the following source with the following license:" may well be clearer than dividing code into different files that naturally wouldn't be. So I do occasionally do that.
I don't see any definite need for a separate directory (or package) per separate external source. If that's already part of your project structure (that is, it already uses external libraries by incorporating their source) then I suppose you might as well continue the trend.
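As an illustration (the path, names and wording are all made up), the borrowed function could live in its own module with the attribution at the top, and be imported like any other module:

# myproject/vendored/contributed.py
#
# Borrowed code: written by <original author>, taken from <source>.
# License: <whatever terms apply>. Local modifications: none.

def borrowed_helper(data):
    """The ~30-line function received from the contributor goes here."""
    ...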
I usually place scripts I copy off the internet in a folder/package called borrowed, so I know all of the code in there is stuff that I didn't write myself.
That is, if it's something more substantial than a one or two-liner demonstrating how something works.
(Note: I would appreciate help generalizing the title. I am sure this is a class of problems in OO land that probably has a well-known pattern; I just don't know a better way to describe it.)
I'm considering the following: our server script will be called by an outside program and will have a bunch of text dumped at it (usually XML).
There are multiple possible types of data we could be getting, and multiple versions of the data representation we could be getting, e.g. "Report Type A, version 1.2" vs. "Report Type A, version 2.0"
We will generally want to do the same action with all the data -- namely, determine what sort and version it is, then parse it with a custom parser, then call a synchronize-to-database function on it.
We will definitely be adding types and versions as time goes on.
So, what's a good design pattern here? I can come up with two; both seem like they may have some problems.
Option 1
Write a monolithic ID script which determines the type, and then imports and calls the properly named class functions.
Benefits
Probably pretty easy to debug,
Only one file that does the parsing.
Downsides
Seems hack-ish.
It would be nice not to have to encode knowledge of the data formats in two places: once for identification, once for the actual merging.
Option 2
Write an "ID" function for each class; returns Yes / No / Maybe when given identifying text.
the ID script now imports a bunch of classes, instantiates them on the text and asks if the text and class type match.
Upsides:
Cleaner in that everything lives in one module?
Downsides:
Slower? Depends on logic of running through the classes.
Put abstractly: should Python instantiate a bunch of classes and consume an ID function, or should Python instantiate one (or many) ID classes which have a paired item class, or something else entirely?
You could use the Strategy pattern, which would allow you to separate the logic for the different formats that need to be parsed into concrete strategies. Your code would typically inspect a portion of the file and then decide on a concrete strategy.
As far as defining the grammar for your files goes, I would find a fast way to identify the file without implementing the full definition - perhaps a header or other unique feature at the beginning of the document. Then, once you know how to handle the file, you can pick the best concrete strategy for it, which handles the parsing and the writes to the database.
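A minimal sketch of how that could look in Python; all class and method names, and the header check, are invented for illustration:

class ReportStrategy:
    """One concrete strategy per report type and version."""
    def matches(self, text):            # the per-format "ID" check
        raise NotImplementedError
    def parse(self, text):
        raise NotImplementedError
    def sync_to_database(self, parsed):
        raise NotImplementedError

class ReportTypeA_v1_2(ReportStrategy):
    def matches(self, text):
        return '<report type="A" version="1.2"' in text   # made-up header check
    def parse(self, text):
        ...                                                # custom parser for this format
    def sync_to_database(self, parsed):
        ...

STRATEGIES = [ReportTypeA_v1_2()]       # new types/versions are simply appended here

def handle(text):
    for strategy in STRATEGIES:
        if strategy.matches(text):
            return strategy.sync_to_database(strategy.parse(text))
    raise ValueError("unrecognized report type/version")

This keeps knowledge of each format in a single place (option 2 from the question) while the dispatch loop stays trivial.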