Unit testing functions that access files - python

I have two functions—one that builds the path to a set of files and another that reads the files. Below are the two functions:
def pass_file_name(self):
    self.log_files = []
    file_name = self.path + "\\access_" + self.appliacation + ".log"
    if os.path.isfile(file_name):
        self.log_files.append(file_name)
    for i in xrange(7):
        file_name = self.path + "\\access_" + self.appliacation + ".log" + "." + str(i + 1)
        if os.path.isfile(file_name):
            self.log_files.append(file_name)
    return self.log_files

def read_log_files(self, log_file_names):
    self.log_entrys = []
    self.log_line = []
    for i in log_file_names:
        self.f = open(i)
        for line in self.f:
            self.log_line = line.split(" ")
            #print self.log_line
            self.log_entrys.append(self.log_line)
    return self.log_entrys
What would be the best way to unit test these two functions?

You have two units here:
one that generates file paths,
and one that reads them.
Thus there should be two unit test cases (i.e. classes with tests). The first would test only file path generation. The second would test reading from a predefined set of files you prepared in a special subdirectory of your tests directory; it should be tested in isolation from the first test case.
In your case you could probably use very short log files for the tests. In that case, for better readability and maintenance, it is a good idea to embed them right in the test code. But then you'll have to improve your reading function a bit so it can take either a file name or a file-like object:
from cStringIO import StringIO

# ...

def test_some_log_reading_scenario(self):
    log1 = '\n'.join([
        'log line',
        'another log line'
    ])
    log2 = '\n'.join([
        'another log another line',
        'lala blah blah'
    ])
    # ...
    result = myobj.read_log_files([StringIO(log1), StringIO(log2)])
    # assert result
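For reference, a minimal sketch of such a dual-purpose reading function (my own illustration in Python 2 to match the question, not the asker's code): it opens plain strings itself and treats anything else as an already-open file-like object.

def read_log_files(self, log_file_names):
    """Accept either file names or file-like objects (e.g. StringIO)."""
    self.log_entrys = []
    for item in log_file_names:
        if isinstance(item, basestring):   # a path: open it ourselves
            f = open(item)
            close_after = True
        else:                              # already a file-like object
            f = item
            close_after = False
        try:
            for line in f:
                self.log_entrys.append(line.split(" "))
        finally:
            if close_after:
                f.close()
    return self.log_entrys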

Personally, I'd build a test harness that sets up the required files before testing those two functions.
For each test case (where you expect the file to be present - and remember to test failure cases too!), write some known log lines into the appropriately named files; then call the functions under test and check the results.
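A sketch of that kind of harness, using a temporary directory so each test cleans up after itself; LogReader and mymodule are placeholders for whatever class and module actually hold these methods:

import os
import shutil
import tempfile
import unittest

from mymodule import LogReader   # placeholder for the class under test


class ReadLogFilesTest(unittest.TestCase):
    def setUp(self):
        # create a throwaway directory with one known log file in it
        self.tmpdir = tempfile.mkdtemp()
        self.log_path = os.path.join(self.tmpdir, "access_myapp.log")
        with open(self.log_path, "w") as f:
            f.write("127.0.0.1 GET /index.html\n")

    def tearDown(self):
        shutil.rmtree(self.tmpdir)

    def test_read_log_files_splits_on_spaces(self):
        reader = LogReader()
        entries = reader.read_log_files([self.log_path])
        # line.split(" ") keeps the trailing newline on the last field
        self.assertEqual(entries, [["127.0.0.1", "GET", "/index.html\n"]])


if __name__ == "__main__":
    unittest.main()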

I'm no expert, but I'll give it a go. First a bit of refactoring: make the functions free-standing (remove all the class stuff) and remove anything unneeded. This should make them much easier to test. You can always have the class call these functions if you really want them in a class.
def pass_file_name(base_filename, exists):
    """return a list of filenames that exist
    based upon `base_filename`.
    use `os.path.isfile` for `exists`"""
    log_files = []
    if exists(base_filename):
        log_files.append(base_filename)
    for i in range(1, 8):
        filename = base_filename + "." + str(i)
        if exists(filename):
            log_files.append(filename)
    return log_files
def read_log_files(log_files):
    """read and parse each line from log_files
    use `pass_file_name` for `log_files`"""
    log_entrys = []
    for filename in log_files:
        with open(filename) as myfile:
            for line in myfile:
                log_entrys.append(line.split())
    return log_entrys
Now we can easily test pass_file_name by passing in a custom function to exists.
class Test_pass_file_name(unittest.TestCase):
    def test_1(self):
        """assume every file exists;
        make sure all log files are there"""
        exists = lambda _: True
        log_files = pass_file_name("a", exists)
        self.assertEqual(log_files,
                         ["a", "a.1", "a.2", "a.3",
                          "a.4", "a.5", "a.6", "a.7"])

    def test_2(self):
        """assume no files exist;
        make sure nothing is returned"""
        exists = lambda _: False
        log_files = pass_file_name("a", exists)
        self.assertEqual(log_files, [])

    # ...more tests here ...
As we assume os.path.isfile works, we should have pretty good coverage of the first function. Though you could always have the test actually create some files and then call pass_file_name with exists = os.path.isfile.
The second one is harder to test; I have been told that the best (unit) tests don't touch the network, databases, GUI or the hard drive. So maybe some more refactoring would make it easier. Mocking open could work, or we could actually write some log files in the test function and read them in.
How do I mock an open used in a with statement (using the Mock framework in Python)?

Bind the open name in the module to a function that mocks the file opening.
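A small sketch of that idea using unittest.mock.mock_open (on Python 2 the target would be __builtin__.open via the mock package; the function under test and the file name here are made up for illustration):

import unittest
from unittest import mock


def count_lines(path):
    """Example function under test: opens a file inside a with statement."""
    with open(path) as f:
        return len(f.read().splitlines())


class MockOpenTest(unittest.TestCase):
    def test_count_lines(self):
        fake = mock.mock_open(read_data="one\ntwo\nthree\n")
        # Patch the built-in open; with the code under test in its own module
        # you would patch "yourmodule.open" instead.
        with mock.patch("builtins.open", fake):
            self.assertEqual(count_lines("ignored.log"), 3)
        fake.assert_called_once_with("ignored.log")


if __name__ == "__main__":
    unittest.main()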

Related

How to test a function with input() call (unittest)

I want to test the following module. There are hundreds of similar modules, and the test data for each module is stored in input_suffix.txt and output_suffix.txt (suffix is the name of the module).
def func(data):
    # do something
    return something


def reader():
    for case_no in range(int(input())):
        n_rows, n_cols = map(int, input().strip('\n').split())
        data = [[int(n) for n in input().strip('\n').split()] for _ in range(n_rows)]
        print(f'Case #{case_no + 1}: {func(data)}')


if __name__ == '__main__':
    reader()
I could simply read both the input_suffix.txt and output_suffix.txt files for the module under test, process the input data to feed it right into func(), and get the job done by comparing the returned value to the expected output, which could likewise be parsed from the output file.
However, there are hundreds of similar modules, and they don't always share the same structure. The first line of input_suffix.txt always indicates the number of included tests, but after that it varies all over the place. Sometimes each test starts with a line that contains the number of rows (num_rows) and columns (num_cols), and the following num_rows lines are a representation of an array (just like above). But input_suffix.txt may not have num_rows or an array representation at all, or may have multiple arrays, or other types of data. Likewise, the arguments for func() vary from module to module.
Because of these variations, I would need to modify hundreds of test drivers to feed correctly pre-processed data to the func() equivalent of each module. To avoid that tiresome task, this is what I need:
I need a way to pass the data (taken from input_suffix.txt) to the input() calls of each reader() on a line-by-line basis (each reader() of each module already has an adequate data processing scheme, so if I just pass the content of input_suffix.txt line by line, it will work).
I need a way to capture the printed output of reader() so I can compare it to the expected output taken from output_suffix.txt.
I have input_suffix.txt and output_suffix.txt loaded into i and o, but I didn't know how to capture the printed values to compare them to o. That was the point where I was stuck; after some trouble, I managed to get working code, but I assume this is not the standard way to do the job.
from contextlib import redirect_stdout
from io import StringIO
from unittest import TestCase
from unittest.mock import patch

from suffix import *


class Tester(TestCase):
    def test(self):
        with open('input_suffix.txt', 'r') as f:
            i = [line.strip('\n') for line in f.readlines()]
        with open('output_suffix.txt', 'r') as f:
            o = [line.strip('\n') for line in f.readlines()]
        f = StringIO()
        with patch('builtins.input', side_effect=i), redirect_stdout(f):
            reader()
        result = f.getvalue().split('\n')[:-1]
        for case_no, (actual, expected) in enumerate(zip(result, o)):
            self.assertEqual(actual, expected)
            print(f'Test {case_no + 1} Passed')
Could you suggest the proper way of doing this? Any help would be appreciated. Thanks.
You could mock the input using patch, which is available from the unittest.mock library. Simple code to mock input from the user is shown below.
Sample code under test, placed at utils/sample.py:
def take_input():
    response = input("please type a:")
    if response == 'a':
        return "Correct"
    return "Not Correct"
Actual test code:
from utils.sample import take_input
import unittest
from unittest.mock import patch


class TestCache(unittest.TestCase):
    @patch("utils.sample.input", return_value='a')
    def test_take_input(self, mock_input):
        ans = take_input()
        self.assertEqual(ans, 'Correct')
Note how the patch target is written: the input function is patched where it is looked up (utils.sample.input), with the desired return_value specified.
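When the code under test calls input() several times, as reader() in the question does, side_effect can be used instead of return_value so each call consumes the next prepared line. A small sketch (the function under test is made up for illustration):

import unittest
from unittest.mock import patch


def collect_two():
    """Example function calling input() twice."""
    first = input("first: ")
    second = input("second: ")
    return [first, second]


class SideEffectTest(unittest.TestCase):
    def test_collect_two(self):
        # Each call to input() consumes the next item of side_effect.
        with patch('builtins.input', side_effect=['foo', 'bar']):
            self.assertEqual(collect_two(), ['foo', 'bar'])


if __name__ == '__main__':
    unittest.main()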

Instantiating a class gives dubious results the 2nd time around while looping

Edit:
Firstly, thank you @martineau and @jonrsharpe for your prompt replies.
I was initially hesitant to write a verbose description, but I now realize that I was sacrificing clarity for brevity (thanks @jonrsharpe for the link).
So here's my attempt to describe what I am up to as succinctly as possible:
I have implemented the Lempel-Ziv-Welch text file compression algorithm as a Python package. Here's the link to the repository.
Basically, I have a compress class in the lzw.Compress module, which takes as input the file name (and a bunch of other jargon parameters) and generates the compressed file, which is then decompressed by the decompress class within the lzw.Decompress module, regenerating the original file.
Now what I want to do is compress and decompress a bunch of files of various sizes stored in a directory, and save and graphically visualize the time taken for compression/decompression along with the compression ratio and other metrics. For this, I am iterating over the list of file names and passing them as parameters to instantiate the compress class, then beginning compression by calling its encode() method, as follows:
import os

os.chdir('/path/to/files/to/be/compressed/')

results = dict()
results['compress_time'] = []
results['other_metrics'] = []

file_path = '/path/to/files/to/be/compressed/'
comp_path = '/path/to/store/compressed/files/'
decomp_path = '/path/to/store/decompressed/file'

files = [_ for _ in os.listdir()]

for f in files:
    from lzw.Compress import compress as comp
    from lzw.Decompress import decompress as decomp

    c = comp(file_path + f, comp_path)  # passing the input file and the output path for storing the compressed file
    c.encode()
    # Then measure the time required for compression using time.monotonic()
    del c
    del comp

    d = decomp('/path/to/compressed/file', decomp_path)  # decompressing
    d.decode()
    # Then measure the time required for decompression using time.monotonic()

    # Append metrics to the lists in the results dict for this particular file

    if decompressed_file_size != original_file_size:
        print("error")
        break

    del d
    del decomp
I have run this code independently for each file, without the for loop, and have achieved compression and decompression successfully. So there are no problems in the files I wish to compress.
What happens is that whenever I run this loop, the first file (the first iteration) runs successfully, and on the next iteration, after the entire process has happened for the 2nd file, "error" is printed and the loop exits. I have tried reordering the list and even reversing it (in case a particular file was the problem), but to no avail.
For the second file/iteration, the decompressed file contents are dubious (not matching the original file). Typically, the decompressed file size is nearly double that of the original.
I strongly suspect that the variables of the class/package are somehow retaining their state between iterations of the loop. (To counter this I am deleting both the instance and the class at the end of the loop, as shown in the above snippet, but with no success.)
I have also tried importing the classes outside the loop, but with no success.
P.S.: I am a Python newbie and don't have much expertise, so forgive me for not being "pythonic" in my exposition and for raising a rather naive issue.
Update:
Thanks to @martineau, one of the problems was the importing of global variables from another submodule.
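A tiny illustration of why that bites, with a hypothetical module name: module-level state is created once, and deleting the name and re-importing does not reset it, because Python caches the module object in sys.modules.

import importlib
import pathlib

# Create a throwaway module with a module-level global (hypothetical name).
pathlib.Path("state_demo.py").write_text("cache = []\n")
importlib.invalidate_caches()

import state_demo
state_demo.cache.append("left over from iteration 1")

del state_demo           # only removes the local name binding
import state_demo        # fetched from sys.modules; the module body is NOT re-executed
print(state_demo.cache)  # ['left over from iteration 1'] -- the state survives re-import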
But there was another issue which crept in owing to my superficial knowledge of the del operator in Python 3.
I have a trie data structure in my program, which is basically just similar to a binary tree.
I had a self_destruct method to delete the tree, as follows:
class trie():
    def __init__(self):
        self.next = {}
        self.value = None
        self.addr = None

    def insert(self, word=str(), addr=int()):
        node = self
        for index, letter in enumerate(word):
            if letter in node.next.keys():
                node = node.next[letter]
            else:
                node.next[letter] = trie()
                node = node.next[letter]
            if index == len(word) - 1:
                node.value = word
                node.addr = addr

    def self_destruct(self):
        node = self
        if node.next == {}:
            return
        for i in node.next.keys():
            node.next[i].self_destruct()
        del node
It turns out that this C-like recursive deletion of objects makes no sense in Python: del only removes the name's association in the current namespace, while the real work is done by the garbage collector.
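A quick illustration of that point (just the semantics of del, nothing from the question's code):

a = [1, 2, 3]
b = a            # two names bound to the same list object
del a            # removes the name 'a' only; the object itself is untouched
print(b)         # [1, 2, 3] -- still alive, because 'b' still refers to it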
Still, it's kind of weird why Python retained the state/association of variables even when creating a new object (as shown in my loop snippet in the edit).
So two things solved the problem. Firstly, I removed the global variables and made them local to the module where I needed them (so there is no need to import them). Also, I deleted the self_destruct method of the trie and simply did del root, where root = trie(), after use.
Thanks @martineau & @jonrsharpe.

Access variables and lists from function

I am new to unit testing with Python. I would like to test some functions in my code. In particular I need to test if the outputs have specific dimensions or the same dimensions.
My Python script for unit testing looks like this:
import unittest
from func import *


class myTests(unittest.TestCase):
    def setUp(self):
        # I am not really sure what the purpose of this function is
        pass

    def test_main(self):
        # check if outputs of the function "main" are not empty:
        self.assertTrue(main, msg='The main() function provides no return values!')
        # check if "run['l_modeloutputs']" and "run['l_dataoutputs']", within the main() function, have the same size:
        self.assertCountEqual(self, run['l_modeloutputs'], run['l_dataoutputs'], msg=None)
        # --> Doesn't work so far!
        # check if the dimensions of "props['k_iso']", within the main() function, are (80, 40, 100):

    def tearDown(self):
        # I am also not sure of the purpose of this function
        pass


if __name__ == "__main__":
    unittest.main()
Here is the code under test:
def main(param_file):
    # Load parameter file
    run, model, sequences, hydraulics, flowtrans, elements, mg = hu.model_setup(param_file)

    # some other code
    ...

    if 'l_modeloutputs' in run:
        if hydraulics['flag_gen'] is False:
            print('No hydraulic parameters generated. No model outputs saved')
        else:
            save_models(realdir, realname, mg, run['l_modeloutputs'], flowtrans, props['k_iso'], props['ktensors'])
I need to access the parameters run['l_modeloutputs'] and run['l_dataoutputs'] of the main function from func.py. How can I pass the dimensions of these parameters to the unit testing script?
It sounds a bit like one of two things is going on here: either your code isn't laid out in a way that is easy to test, or you are trying to test or call too much code in one go.
If your code is laid out like the following:
def main(file_name):
    with open(file_name) as file:
        ... do work ...
        results = outcome_of_work
and you are trying to test both what you have got from the file_name and the size of results, then you may want to think about refactoring this so that you can test a smaller action. Maybe:
def main(file_name):
    # `get_file_contents` appears to be `hu.model_setup`
    # `file_contents` would be `run`
    file_contents = get_file_contents(file_name)
    results = do_work_on_file_contents(file_contents)
Of course, if you already have a similar setup, then the following is also applicable. This way you can write easier tests, as you control both what goes into the code under test (file_name or file_contents) and can then check the outcome (file_contents or results) for expected results.
With the unittest module you would basically be creating a small function for each test:
class Test(TestCase):
    def test_get_file_contents(self):
        # ... set up example `file-like object` ...
        run = hu.model_setup(file_name)
        self.assertCountEqual(
            run['l_modeloutputs'], run['l_dataoutputs'])
        # ... repeat for other possible files ...

    def test_do_work_on_file_contents(self):
        example_input = ... setup input ...
        example_output = do_work_on_file_contents(example_input)
        assert example_output == as_expected
This can then be repeated for different sets of potential inputs, both good and edge cases.
It's probably worth looking around for a more in-depth tutorial, as this is obviously only a very quick overview.
setUp and tearDown are only needed if there is something to be done for each test you have written (e.g. you need an object set up in a particular way for several tests; that can go in setUp, and it runs before each test function).
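For the dimension checks in particular, a test might look like the sketch below. Treat it as an outline only: the hu import, the test_params.ini file name, the compute_props helper and the NumPy assumption are all placeholders for whatever your project actually provides.

import unittest

import numpy as np              # assumption: the model outputs are NumPy arrays

import hu                       # whatever module provides model_setup in your project
from func import compute_props  # hypothetical helper factored out of main()


class DimensionTests(unittest.TestCase):
    def setUp(self):
        # Runs before every test: build the inputs from a small test parameter file.
        (self.run, self.model, self.sequences, self.hydraulics,
         self.flowtrans, self.elements, self.mg) = hu.model_setup('test_params.ini')

    def test_output_lists_have_same_length(self):
        self.assertEqual(len(self.run['l_modeloutputs']),
                         len(self.run['l_dataoutputs']))

    def test_k_iso_dimensions(self):
        props = compute_props(self.run)   # hypothetical helper
        self.assertEqual(np.shape(props['k_iso']), (80, 40, 100))


if __name__ == '__main__':
    unittest.main()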

Pyparsing operating on context

How can I build a pyparsing program that allows operations to be executed on a context/state object?
An example of my program looks like this:
load 'data.txt'
remove line 1
remove line 4
The first line should load a file and line 2 and 3 are commands that operate on the content of the file. As a result, I expect the content of the file after all commands have been executed.
load_cmd = Literal('load') + filename
remove_cmd = Literal('remove line') + line_no
more_cmd = ...

def load_action(s, loc, toks):
    # load file, where should I store it?
    ...

load_cmd.setParseAction(load_action)

def remove_line_action(s, loc, toks):
    # remove line, how to obtain data to operate on? where to write result?
    ...

remove_cmd.setParseAction(remove_line_action)

# Is this the right way to define a whole program, i.e. not only one line?
program = load_cmd + remove_cmd | more_cmd | ...

# How do I obtain the result?
program.scanString("""
load 'data.txt'
remove line 1
remove line 4
""")
I have written a few pyparsing examples of this command-parsing style, you can find them online at:
http://pyparsing.wikispaces.com/file/view/simpleBool.py/451074414/simpleBool.py
http://pyparsing.wikispaces.com/file/view/eval_arith.py/68273277/eval_arith.py
I have also written a simple Adventure-style game processor, which accepts parsed command structures and executes them against a game "world", which functions as the command executor. I presented this at PyCon 2006, but the link from the conference page has gone stale - you can find it now at http://www.ptmcg.com/geo/python/confs/pyCon2006_pres2.html (the presentation is written using S5 - mouse over the lower right corner to see navigation buttons). The code is at http://www.ptmcg.com/geo/python/confs/adventureEngine.py.txt, and UML diagram for the code is at http://www.ptmcg.com/geo/python/confs/pyparsing_adventure.pdf.
The general pattern I have found to work best is similar to the old Model-View-Controller pattern.
The Model is your virtual machine, which maintains the context from command to command. In simple_bool the context is just the inferred local variable scope, since each parsed statement is just evaled. In eval_arith, this context is kept in the EvalConstant._vars dict, containing the names and values of pre-defined and parsed variables. In the Adventure engine, the context is kept in the Player object (containing attributes that point to the current Room and the collection of Items), which is passed to the parsed command object to execute the command.
The View is the parser itself. It extracts the pieces of each command and composes an instance of a command class. The interface to the command class's exec method depends on how you have set up the Model. But in general you can envision that the exec method you define will take the Model as one of, if not its only, parameter.
Then the Controller is a simple loop that implements the following pseudo-code:
while not finished
read a command, assign to commandstring
parse commandstring, use parsed results to create commandobj (null if bad parse)
if commandobj is not null:
commandobj.exec(context)
finished = context.is_finished()
If you implement your parser using pyparsing, then you can define your Command classes as subclasses of this abstract class:
class Command(object):
def __init__(self, s, l, t):
self.parameters = t
def exec(self, context):
self._do_exec(context)
When you define each command, the corresponding subclass can be passed directly as the command expression's parse action. For instance, a simplified GO command for moving through a maze would look like:
goExpression = Literal("GO") + oneOf("NORTH SOUTH EAST WEST")("direction")
goExpression.setParseAction(GoCommand)
For the abstract Command class above, a GoCommand class might look like:
class GoCommand(Command):
def _do_exec(self, context):
if context.is_valid_move(self.parameters.direction):
context.move(self.parameters.direction)
else:
context.report("Sorry, you can't go " +
self.parameters.direction +
" from here.")
By parsing a statement like "GO NORTH", you would get back not a ParseResults containing the tokens "GO" and "NORTH", but a GoCommand instance, whose parameters include a named token "direction", giving the direction parameter for the GO command.
So the design steps to do this are:
design your virtual machine, and its command interface
create a class to capture the state/context in the virtual machine
design your commands, and their corresponding Command subclasses
create the pyparsing parser expressions for each command
attach the Command subclass as a parse action to each command's pyparsing expression
create an overall parser by combining all the command expressions using '|'
implement the command processor loop
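Applied to the load/remove-line language from the question, a minimal self-contained sketch of this pattern could look like the following; the Context class, the execute() method name and the overall wiring are my own illustration, not code from the linked examples.

from pyparsing import Literal, QuotedString, Word, nums


class Context:
    """The 'Model': state shared between commands (here, the loaded lines)."""
    def __init__(self):
        self.lines = []


class LoadCommand:
    def __init__(self, s, loc, toks):
        self.filename = toks[1]

    def execute(self, context):
        with open(self.filename) as f:
            context.lines = f.read().splitlines()


class RemoveLineCommand:
    def __init__(self, s, loc, toks):
        self.line_no = int(toks[1])

    def execute(self, context):
        # line numbers in the command language are 1-based
        del context.lines[self.line_no - 1]


# The 'View': the parser builds command objects via parse actions.
filename = QuotedString("'")
line_no = Word(nums)
load_cmd = (Literal('load') + filename).setParseAction(LoadCommand)
remove_cmd = (Literal('remove line') + line_no).setParseAction(RemoveLineCommand)
command = load_cmd | remove_cmd

program = """
load 'data.txt'
remove line 1
remove line 4
"""

# The 'Controller': execute each parsed command against the shared context.
context = Context()
for tokens, _, _ in command.scanString(program):
    tokens[0].execute(context)        # each parse action produced a command object

result = '\n'.join(context.lines)     # the file content after all commands ran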
I would do something like this:
cmdStrs = '''
load
remove line
add line
some other command
'''

def loadParse(val): print 'Load --> ' + val
def removeParse(val): print 'remove --> ' + val
def addLineParse(val): print 'addLine --> ' + val
def someOtherCommandParse(val): print 'someOther --> ' + val

commands = [l.strip() for l in cmdStrs.split('\n') if l.strip() != '']
functions = [loadParse,
             removeParse,
             addLineParse,
             someOtherCommandParse]
funcDict = dict(zip(commands, functions))

program = '''
# This is a comment
load 'data.txt' # This is another comment
remove line 1
remove line 4
'''

for l in program.split('\n'):
    l = l.strip().split('#')[0].strip()  # remove comments
    if l == '': continue
    commandFound = False
    for c in commands:
        if c in l:
            funcDict[c](l.split(c)[-1])
            commandFound = True
    if not commandFound:
        print 'Error: Unknown command : ', l
Of course, you can put the entire thing within a class and make it an object, but you can see the general structure. If you have an object, then you can go ahead and create a version which can handle contextual/state information. The functions above then simply become member functions.
Why do I get the sense that you are starting on Python after learning Haskell? Generally people go the other way. In Python you get state for free; you don't need classes. You can use classes to handle more than one state within the same program :).

How to get AppEngine map reduce to scale out?

I have written a simple MapReduce flow to read lines from a CSV file on Google Cloud Storage and subsequently make an Entity for each. However, I can't seem to get it to run on more than one shard.
The code makes use of mapreduce.control.start_map and looks something like this:
class LoadEntitiesPipeline(webapp2.RequestHandler):
    id = control.start_map(map_name,
                           handler_spec="backend.line_processor",
                           reader_spec="mapreduce.input_readers.FileInputReader",
                           queue_name=get_queue_name("q-1"),
                           shard_count=shard_count,
                           mapper_parameters={
                               'shard_count': shard_count,
                               'batch_size': 50,
                               'processing_rate': 1000000,
                               'files': [gsfile],
                               'format': 'lines'})
I have shard_count in both places because I'm not sure which methods actually need it. Setting shard_count anywhere from 8 to 32 doesn't change anything: the status page always says 1/1 shards running. To separate things, I've made everything run on a backend queue with a large number of instances. I've tried adjusting the queue parameters per this wiki. In the end, it seems to just run serially.
Any ideas? Thanks!
Update (Still no success):
In trying to isolate things, I tried making a direct call to the pipeline, like so:
class ImportHandler(webapp2.RequestHandler):
    def get(self, gsfile):
        pipeline = LoadEntitiesPipeline2(gsfile)
        pipeline.start(queue_name=get_queue_name("q-1"))
        self.redirect(pipeline.base_path + "/status?root=" + pipeline.pipeline_id)


class LoadEntitiesPipeline2(base_handler.PipelineBase):
    def run(self, gsfile):
        yield mapreduce_pipeline.MapperPipeline(
            'loadentities2_' + gsfile,
            'backend.line_processor',
            'mapreduce.input_readers.FileInputReader',
            params={'files': [gsfile], 'format': 'lines'},
            shards=32
        )
With this new code, it still only runs on one shard. I'm starting to wonder if mapreduce.input_readers.FileInputReader is capable of parallelizing input by line.
It looks like FileInputReader can only shard by file. The format parameter only changes the way the mapper function gets called. If you pass more than one file to the mapper, it will start running on more than one shard; otherwise it will only use one shard to process the data.
EDIT #1:
After digging deeper into the mapreduce library: MapReduce decides whether or not to split a file into pieces based on what the can_split method returns for each file type it defines. Currently, the only format that implements the split method is ZipFormat. So if your file format is not zip, the file won't be split to run on more than one shard.
@classmethod
def can_split(cls):
    """Indicates whether this format support splitting within a file boundary.
    Returns:
      True if a FileFormat allows its inputs to be splitted into
      different shards.
    """
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/file_formats.py
But it looks like it is possible to write your own file format split method. You can try hacking in a split method on _TextFormat first and see if more than one shard runs.
@classmethod
def split(cls, desired_size, start_index, opened_file, cache):
    pass
EDIT #2:
An easy workaround would be to leave the FileInputReader running serially but move the time-consuming task to the parallel reduce stage.
def line_processor(line):
    # serial
    yield (random.randrange(1000), line)


def reducer(key, values):
    # parallel
    entities = []
    for v in values:
        entities.append(CREATE_ENTITY_FROM_VALUE(v))
    db.put(entities)
EDIT #3:
If you try to modify the FileFormat, here is an example (it hasn't been tested yet):
from file_formats import _TextFormat, FORMATS


class _LinesSplitFormat(_TextFormat):
    """Read file line by line."""

    NAME = 'split_lines'

    def get_next(self):
        """Inherited."""
        index = self.get_index()
        cache = self.get_cache()
        offset = sum(cache['infolist'][:index])

        self.get_current_file().seek(offset)
        result = self.get_current_file().readline()
        if not result:
            raise EOFError()
        if 'encoding' in self._kwargs:
            result = result.encode(self._kwargs['encoding'])
        return result

    @classmethod
    def can_split(cls):
        """Inherited."""
        return True

    @classmethod
    def split(cls, desired_size, start_index, opened_file, cache):
        """Inherited."""
        if 'infolist' in cache:
            infolist = cache['infolist']
        else:
            infolist = []
            for i in opened_file:
                infolist.append(len(i))
            cache['infolist'] = infolist

        index = start_index
        while desired_size > 0 and index < len(infolist):
            desired_size -= infolist[index]
            index += 1
        return desired_size, index


FORMATS['split_lines'] = _LinesSplitFormat
Then the new file format can be used by changing the format in mapper_parameters from lines to split_lines.
class LoadEntitiesPipeline(webapp2.RequestHandler):
    id = control.start_map(map_name,
                           handler_spec="backend.line_processor",
                           reader_spec="mapreduce.input_readers.FileInputReader",
                           queue_name=get_queue_name("q-1"),
                           shard_count=shard_count,
                           mapper_parameters={
                               'shard_count': shard_count,
                               'batch_size': 50,
                               'processing_rate': 1000000,
                               'files': [gsfile],
                               'format': 'split_lines'})
It looks to me like FileInputReader should be capable of sharding, based on a quick reading of:
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/input_readers.py
It looks like 'format': 'lines' should split using: self.get_current_file().readline()
Does it seem to be interpreting the lines correctly when it is working serially? Maybe the line breaks are in the wrong encoding or something.
From experience, FileInputReader will do a maximum of one shard per file.
Solution: split your big files. I use split_file in https://github.com/johnwlockwood/karl_data to shard files before uploading them to Cloud Storage.
If the big files are already up there, you can use a Compute Engine instance to pull them down and do the sharding, because the transfer speed will be fastest.
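If you would rather not add a dependency, here is a plain-Python sketch of the same idea (split a big file into fixed-size line chunks before uploading; this is not karld's API, just an assumption about what is needed):

def split_by_lines(path, lines_per_chunk=100000):
    """Write path's contents into numbered chunk files and return their names."""
    chunk_paths = []
    with open(path) as src:
        chunk, count = None, 0
        for line in src:
            if chunk is None or count >= lines_per_chunk:
                if chunk is not None:
                    chunk.close()
                chunk_path = "%s.part%04d" % (path, len(chunk_paths))
                chunk_paths.append(chunk_path)
                chunk = open(chunk_path, "w")
                count = 0
            chunk.write(line)
            count += 1
        if chunk is not None:
            chunk.close()
    return chunk_paths

# Each resulting chunk can then be uploaded and listed in 'files', so
# FileInputReader gets one shard per chunk.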
FYI: karld is in the cheeseshop so you can pip install karld
