
How to implement a ByteCountingStreamReader?
The ByteCountingStreamReader should wrap a file (descriptor) stream and count the bytes that pass through it.
A bit like codecs.StreamReader, but the content should not be changed, just counted.
Use case: Solve http://bugs.python.org/issue24259
The tarfile library does not compare the file size recorded in the TarInfo with the actual number of bytes read from the tar.
Like this Java class, but for Python: http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/CountingInputStream.html

Here is a small wrapper function that replaces the read method of the (file) stream. It should also work for other types of streams, and a similar wrapper for the write function could be added.
Beware: readline() seems not to use read() internally, so it has to be wrapped, too, if you use it instead of plain vanilla read().
def ByteCountingStreamReader(stream):
    fr = stream.read          # keep a reference to the original read method
    stream.count = 0

    def inner(size=-1):
        s = fr(size)
        stream.count += len(s)
        return s

    stream.read = inner       # replace read with the counting version
    return stream

# testing it
# (note: the file is opened in text mode, so len(s) counts characters;
#  open with 'rb' to count raw bytes)
myStream = open('/etc/hosts', 'r')
with ByteCountingStreamReader(myStream) as f:
    while True:
        s = f.read(20)
        if s == '':
            break
        print(s, end='')
print(f.count)
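As noted above, readline() bypasses the patched read(), so line-oriented reads would go uncounted. A minimal sketch of a variant that wraps both methods (not part of the original answer, just the same idea applied twice):
def ByteCountingStreamReader2(stream):
    original_read = stream.read
    original_readline = stream.readline
    stream.count = 0

    def read(size=-1):
        data = original_read(size)
        stream.count += len(data)
        return data

    def readline(size=-1):
        data = original_readline(size)
        stream.count += len(data)
        return data

    stream.read = read
    stream.readline = readline
    return stream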

Related

Mocking "with open()"

I am trying to unit test a method that reads the lines from a file and processes them.
with open([file_name], 'r') as file_list:
    for line in file_list:
        # Do stuff
I tried several ways described in other questions, but none of them seems to work for this case. I don't quite understand how Python uses the file object as an iterable over the lines; does it internally use file_list.readlines()?
This way didn't work:
with mock.patch('[module_name].open') as mocked_open:  # also tried with __builtin__ instead of module_name
    mocked_open.return_value = 'line1\nline2'
I got an
AttributeError: __exit__
Maybe because the with statement requires this special method to close the file?
This code makes file_list a MagicMock. How do I store data on this MagicMock to iterate over it?
with mock.patch("__builtin__.open", mock.mock_open(read_data="data")) as mock_file:
The return value of mock_open (until Python 3.7.1) doesn't provide a working __iter__ method, which may make it unsuitable for testing code that iterates over an open file object.
Instead, I recommend refactoring your code to take an already opened file-like object. That is, instead of
def some_method(file_name):
    with open(file_name, 'r') as file_list:
        for line in file_list:
            # Do stuff
...
some_method(file_name)
write it as
def some_method(file_obj):
    for line in file_obj:
        # Do stuff
...
with open(file_name, 'r') as file_obj:
    some_method(file_obj)
This turns a function that has to perform IO into a pure(r) function that simply iterates over any file-like object. To test it, you don't need to mock open or hit the file system in any way; just create a StringIO object to use as the argument:
def test_it(self):
    f = StringIO.StringIO("line1\nline2\n")
    some_method(f)
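On Python 3 the same test would use io.StringIO (a small adaptation, not part of the original answer):
import io

def test_it(self):
    f = io.StringIO("line1\nline2\n")
    some_method(f)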
(If you still feel the need to write and test a wrapper like
def some_wrapper(file_name):
    with open(file_name, 'r') as file_obj:
        some_method(file_obj)
note that you don't need the mocked open to do anything in particular. You test some_method separately, so the only thing you need to do to test some_wrapper is verify that the return value of open is passed to some_method. open, in this case, can be a plain old mock with no special behavior.)
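A rough sketch of such a test (module and file names here are hypothetical; it only checks that the opened handle is forwarded to some_method):
from unittest import mock

import mymodule  # hypothetical module containing some_wrapper and some_method

def test_some_wrapper():
    with mock.patch("mymodule.open", create=True) as mocked_open, \
         mock.patch("mymodule.some_method") as mocked_some_method:
        mymodule.some_wrapper("some_file.txt")
    # the file object yielded by the with-statement is the __enter__ result
    handle = mocked_open.return_value.__enter__.return_value
    mocked_some_method.assert_called_once_with(handle)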

Wrapping urllib3.HTTPResponse in io.TextIOWrapper

I use the AWS boto3 library, which returns an instance of urllib3.response.HTTPResponse. That response is a subclass of io.IOBase and hence behaves as a binary file. Its read() method returns bytes instances.
Now, I need to decode csv data from a file received in such a way. I want my code to work on both py2 and py3 with minimal code overhead, so I use backports.csv which relies on io.IOBase objects as input rather than on py2's file() objects.
The first problem is that HTTPResponse yields bytes data for the CSV file, while csv.reader expects str data.
>>> import io
>>> from backports import csv # actually try..catch statement here
>>> from mymodule import get_file
>>> f = get_file() # returns instance of urllib3.HTTPResponse
>>> r = csv.reader(f)
>>> list(r)
Error: iterator should return strings, not bytes (did you open the file in text mode?)
I tried to wrap HTTPResponse with io.TextIOWrapper and got the error 'HTTPResponse' object has no attribute 'read1'. This is expected because TextIOWrapper is intended to be used with BufferedIOBase objects, not IOBase objects. And it only happens on Python 2's implementation of TextIOWrapper, because it always expects the underlying object to have read1 (source), while Python 3's implementation checks for read1 existence and falls back to read gracefully (source).
>>> f = get_file()
>>> tw = io.TextIOWrapper(f)
>>> list(csv.reader(tw))
AttributeError: 'HTTPResponse' object has no attribute 'read1'
Then I tried to wrap HTTPResponse with io.BufferedReader and then with io.TextIOWrapper. And I got the following error:
>>> f = get_file()
>>> br = io.BufferedReader(f)
>>> tw = io.TextIOWrapper(br)
>>> list(csv.reader(tw))
ValueError: I/O operation on closed file.
After some investigation it turns out that the error only happens when the file doesn't end with \n. If it does end with \n then the problem does not happen and everything works fine.
There is some additional logic for closing the underlying object in HTTPResponse (source), which seemingly causes the problem.
The question is: how can I write my code to
work on both python2 and python3, preferably with no try..catch or version-dependent branching;
properly handle CSV files represented as HTTPResponse regardless of whether they end with \n or not?
One possible solution would be to make a custom wrapper around TextIOWrapper which would make read() return b'' when the object is closed instead of raising ValueError. But is there any better solution, without such hacks?
Looks like this is an interface mismatch between urllib3.HTTPResponse and file objects. It is described in this urllib3 issue #1305.
For now there is no fix, hence I used the following wrapper code which seemingly works fine:
import io

class ResponseWrapper(io.IOBase):
    """
    This is the wrapper around urllib3.HTTPResponse
    to work around the issue shazow/urllib3#1305.
    Here we decouple HTTPResponse's "closed" status from ours.
    """
    # FIXME drop this wrapper after shazow/urllib3#1305 is fixed

    def __init__(self, resp):
        self._resp = resp

    def close(self):
        self._resp.close()
        super(ResponseWrapper, self).close()

    def readable(self):
        return True

    def read(self, amt=None):
        if self._resp.closed:
            return b''
        return self._resp.read(amt)

    def readinto(self, b):
        val = self.read(len(b))
        if not val:
            return 0
        b[:len(val)] = val
        return len(val)
And use it as follows:
>>> f = get_file()
>>> r = csv.reader(io.TextIOWrapper(io.BufferedReader(ResponseWrapper(f))))
>>> list(r)
A similar fix was proposed by the urllib3 maintainers in the bug report comments, but it would be a breaking change, so for now things will probably not change and I have to use the wrapper (or do some monkey patching, which is probably worse).
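For a quick local sanity check of the wrapper chain without a real HTTP response, an in-memory stream can stand in for the response body (an illustrative sketch, not part of the original answer; io.BytesIO only mimics the read()/close()/.closed part of the interface that ResponseWrapper relies on):
import csv
import io

body = io.BytesIO(b"a,b\n1,2")  # note: no trailing newline
rows = list(csv.reader(io.TextIOWrapper(io.BufferedReader(ResponseWrapper(body)))))
print(rows)  # [['a', 'b'], ['1', '2']]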

How to determine if file is opened in binary or text mode?

Given a file object, how do I determine whether it is opened in bytes mode (read returns bytes) or in text mode (read returns str)? It should work with reading and writing.
In other words:
>>> with open('filename', 'rb') as f:
...     is_binary(f)
...
True
>>> with open('filename', 'r') as f:
...     is_binary(f)
...
False
(Another question which sounds related is not. That question is about guessing whether a file is binary or not from its contents.)
File objects have a .mode attribute:
def is_binary(f):
    return 'b' in f.mode
This limits the test to actual files; in-memory file objects like StringIO and BytesIO do not have that attribute. You could also test for the appropriate abstract base classes:
import io

def is_binary(f):
    return isinstance(f, (io.RawIOBase, io.BufferedIOBase))
or the inverse
def is_binary(f):
    return not isinstance(f, io.TextIOBase)
For streams opened for reading, perhaps the most reliable way to determine their mode is to actually read from them:
def is_binary(f):
    return isinstance(f.read(0), bytes)
Though it has the caveat that it won't work if the stream is already closed (reading may raise an IOError), it reliably determines the binary-ness of custom file-like objects that neither extend the appropriate io ABCs nor provide a mode attribute.
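As a quick illustration (not from the original answer; the path used is arbitrary), the read(0) probe works for real files as well as for in-memory streams that have no mode attribute:
import io

def is_binary(f):                       # repeated from above for a self-contained example
    return isinstance(f.read(0), bytes)

with open('/etc/hosts', 'rb') as f:
    print(is_binary(f))                 # True

print(is_binary(io.StringIO("text")))   # False
print(is_binary(io.BytesIO(b"bytes")))  # True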
If only Python 3 support is required, it is also possible to determine text/binary mode of writable streams given the clear distinction between bytes and text:
def is_binary(f):
    read = getattr(f, 'read', None)
    if read is not None:
        try:
            data = read(0)
        except (TypeError, ValueError):
            pass  # ValueError is also a superclass of io.UnsupportedOperation
        else:
            return isinstance(data, bytes)
    try:
        # alternatively, replace with empty text literal
        # and swap the following True and False.
        f.write(b'')
    except TypeError:
        return False
    return True
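A hypothetical write-only object (purely illustrative, not from the answer) shows the write(b'') fallback branch in action:
class WriteSink:
    """Illustrative write-only binary sink: accepts bytes, rejects text."""
    def write(self, data):
        if not isinstance(data, (bytes, bytearray)):
            raise TypeError("a bytes-like object is required")
        return len(data)

print(is_binary(WriteSink()))  # True -- no read(), so the write(b'') probe decides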
Unless you need to test very frequently whether a stream is in binary mode (which should be unnecessary, since the binary-ness of a stream should not change over the lifetime of the object), I doubt that the performance cost of the extensive exception catching would be an issue (you could certainly optimize for the likelier path, though).
There is also the mimetypes library, whose guess_type function returns a tuple (type, encoding), where type is None if the type can't be guessed (missing or unknown suffix) or a string of the form 'type/subtype'. Note that this guesses from the file name rather than from the mode the file was opened with.
import mimetypes

file_type, encoding = mimetypes.guess_type(file_name)

Treat separate files as one file object in python

I have a split file (let's say name.a0, name.a1, name.a2, ...).
Is there a way to get one readable file-like object that is a concatenation of those, without using a temporary file and without loading them all into memory?
The fileinput module in the Python standard library is designed for exactly this purpose.
import fileinput

with fileinput.input(files=('name.a0', 'name.a1', 'name.a2')) as f:
    for line in f:
        process(line)
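If the parts need to be decoded with a specific encoding (an assumption about your data, not part of the original answer), fileinput accepts an openhook for that:
import fileinput

# hook_encoded opens each file with the given encoding instead of the platform default
with fileinput.input(files=('name.a0', 'name.a1', 'name.a2'),
                     openhook=fileinput.hook_encoded("utf-8")) as f:
    for line in f:
        process(line)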
You can always create a proxy object that treats a series of files as one. You need to implement just enough of the file object interface to satisfy your program's needs.
For example, if all you do is iterate over the lines in all these files, the following object would suffice:
class MultiFile(object):
    def __init__(self, *filenames, mode='r'):
        self._filenames = iter(filenames)   # iterate over the filenames in order
        self._mode = mode
        self._openfile = open(next(self._filenames), self._mode)

    def __enter__(self):
        return self

    def __exit__(self, *exception_info):
        self._openfile.close()

    __del__ = __exit__

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._openfile)
        except StopIteration:
            # find the next file to yield from; raises StopIteration
            # when self._filenames has run out
            while True:
                self._openfile.close()
                self._openfile = open(next(self._filenames), self._mode)
                try:
                    return next(self._openfile)
                except StopIteration:
                    continue
This lets you read through a series of files as if they were one, reading lines as you go (so the whole contents are never loaded into memory at once):
import glob

for line in MultiFile(*sorted(glob.glob('name.a?'))):  # sorted for a deterministic order
    # ...
Note that in Python 3 (or when using the io library in Python 2) you'll need to implement one of the appropriate base classes for the file mode (raw, buffered or text).
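As an illustration of that last point, here is a rough sketch (my own, not from the answer; it assumes binary, read-only access is sufficient) of a raw concatenating reader built on io.RawIOBase, which can then be wrapped in io.BufferedReader or io.TextIOWrapper as needed:
import io

class MultiFileRaw(io.RawIOBase):
    """Concatenate several files into one raw (binary) readable stream."""

    def __init__(self, *filenames):
        self._filenames = iter(filenames)
        self._current = open(next(self._filenames), 'rb')

    def readable(self):
        return True

    def readinto(self, b):
        if len(b) == 0:
            return 0
        while self._current is not None:
            data = self._current.read(len(b))
            if data:
                b[:len(data)] = data
                return len(data)
            # current file exhausted: move on to the next one, if any
            self._current.close()
            try:
                self._current = open(next(self._filenames), 'rb')
            except StopIteration:
                self._current = None
        return 0  # EOF across all files

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None
        super().close()

# e.g. for line-oriented text access:
# for line in io.TextIOWrapper(io.BufferedReader(MultiFileRaw('name.a0', 'name.a1'))):
#     process(line)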

Will passing open() as json.load() parameter leave the file handle open?

I have written a small web application, and with each request I need to open and read a JSON file. I am using pickledb for this purpose.
What concerns me is that the library passes open() directly as a parameter to the json.load() function. So it got me thinking...
When I write code like this:
with open("filename.json", "rb") as json_data:
my_data = json.load(json_data)
or
json_data = open("filename.json", "rb")
my_data = json.load(json_data)
json_data.close()
I am pretty sure that the file handle is being closed.
But when I open it this way:
my_data = json.load(open("filename.json", "rb"))
The docs say that json.load() is expecting a .read()-supporting file-like object containing a JSON document.
So the question is, will this handle stay open and eat more memory with each request? Who is responsible for closing the handle in that case?
The file's close() method will be called when the object is destroyed; json.load() expects only a read() method on the input object.
Exactly when that happens depends on the garbage collection implementation. You can read more in Is explicitly closing files important?
Generally speaking, it's good practice to take care of closing the file.
I tried to fake a file-like object with read() and close() methods and stick it into json.load(). I then observed that close() is never called on it. Hence, I would recommend closing the file object explicitly. The docs say that the loading method expects a read() method, but they do not say it expects a close() method on the object.
In test.json:
{ "test":0 }
In test.py:
import json

class myf:
    def __init__(self):
        self.f = None

    @staticmethod
    def open(path, mode):
        obj = myf()
        obj.f = open(path, mode)   # here open refers to the builtin
        return obj

    def read(self):
        print("READING")
        return self.f.read()

    def close(self):
        print("CLOSING")
        return self.f.close()

def mytest():
    s = json.load(myf.open("test.json", "r"))
    print(s)

mytest()
print("DONE")
Output:
$> python test.py
READING
{u'test': 0}
DONE
$>
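As a simple way to keep the handle's lifetime explicit per request (an illustrative sketch, not taken from the answers above), open the file in a with block and pass the handle to json.load():
import json

def load_json(path):
    # the handle is closed deterministically when the with block exits,
    # instead of whenever the garbage collector gets to the orphaned file object
    with open(path, "rb") as fp:
        return json.load(fp)

my_data = load_json("filename.json")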
