I have written a small web application, and with each request I need to open and read a JSON file. I am using pickledb for this purpose.
What concerns me is that the library passes open() as a parameter to the json.load() function. So it got me thinking:
When I write code like this:
with open("filename.json", "rb") as json_data:
    my_data = json.load(json_data)
or
json_data = open("filename.json", "rb")
my_data = json.load(json_data)
json_data.close()
I am pretty sure that the file handle is being closed.
But when I open it this way:
my_data = json.load(open("filename.json", "rb"))
The docs say that json.load() is expecting a .read()-supporting file-like object containing a JSON document.
So the question is, will this handle stay open and eat more memory with each request? Who is responsible for closing the handle in that case?
The close method of the file will be called when the object is destroyed, since json.load expects only a read method on the input object.
What happens then depends on the garbage collection implementation. You can read more in Is explicitly closing files important?
Generally speaking, it's good practice to take care of closing the file.
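For instance, a minimal sketch that keeps the convenience of a single call but still closes the handle deterministically (the helper name load_json is purely illustrative):
import json

def load_json(path):
    # the with block closes the handle as soon as load() returns,
    # without relying on garbage collection
    with open(path, "rb") as f:
        return json.load(f)

my_data = load_json("filename.json")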
I tried to fake a file-like object with read() and close() methods and pass it into json.load(). I then observed that close() is never called on it. Hence, I would recommend closing the file object explicitly. In any case, the docs say that the loading method expects a read() method, but they do not say it expects a close() method on the object.
In test.json:
{ "test":0 }
In test.py:
import json
class myf:
    def __init__(self):
        self.f = None

    @staticmethod
    def open(path, mode):
        obj = myf()
        obj.f = open(path, mode)
        return obj

    def read(self):
        print ("READING")
        return self.f.read()

    def close(self):
        print ("CLOSING")
        return self.f.close()

def mytest():
    s = json.load(myf.open("test.json", "r"))
    print (s)

mytest()
print("DONE")
print("DONE")
Output:
$> python test.py
READING
{u'test': 0}
DONE
$>
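As a follow-up sketch using the same myf class: wrapping the object in contextlib.closing makes the CLOSING line appear, because closing() guarantees close() is called when the with block exits.
import contextlib

def mytest_closing():
    # closing() calls close() on the wrapped object on exit,
    # so both READING and CLOSING should be printed this time
    with contextlib.closing(myf.open("test.json", "r")) as jf:
        print(json.load(jf))

mytest_closing()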
Related
I'm currently using a library that loads a file using open(filename).
I don't want to mess with the file system, so I tried to download this file in memory using BytesIO:
obj = BytesIO(requests.get(url).content)
But, if I pass the obj to the library, I'll get an error.
How can I transform my object so it could be "opened" by open(object)?
You can override the built-in open function to return the first argument directly if the argument is a file-like object (which can be identified if it has a read attribute):
import builtins
original_open = open
builtins.open = lambda f, *args, **kwargs: f if hasattr(f, 'read') else original_open(f, *args, **kwargs)
so that:
from io import BytesIO
print(open(BytesIO(b'hello world'), 'rb').read())
outputs:
b'hello world'
You can't, unless you want to save it as a file, because the open() function can only be used for files contained in the file system. Instead, you can check out the Python docs on io streams (https://docs.python.org/3/library/io.html) and learn how to access your data through io methods.
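If you do go the "save it as a file" route, a minimal sketch could look like this (the data here is purely illustrative; in practice it would be something like requests.get(url).content):
import tempfile

data = b'hello world'  # stand-in for the downloaded bytes

with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(data)
    tmp.flush()
    # the library can now use its normal open(filename) code path
    # (note: re-opening a NamedTemporaryFile by name may not work on Windows)
    print(open(tmp.name, 'rb').read())  # b'hello world'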
I am trying to unit test a method that reads the lines from a file and process it.
with open([file_name], 'r') as file_list:
    for line in file_list:
        # Do stuff
I tried several ways described in other questions, but none of them seems to work for this case. I don't quite understand how Python uses the file object as an iterable over the lines; does it internally use file_list.readlines()?
This way didn't work:
with mock.patch('[module_name].open') as mocked_open:  # also tried with __builtin__ instead of module_name
    mocked_open.return_value = 'line1\nline2'
I got an
AttributeError: __exit__
Maybe because the with statement needs this special method to close the file?
This code makes file_list a MagicMock. How do I store data on this MagicMock so I can iterate over it?
with mock.patch("__builtin__.open", mock.mock_open(read_data="data")) as mock_file:
Best regards
The return value of mock_open (until Python 3.7.1) doesn't provide a working __iter__ method, which may make it unsuitable for testing code that iterates over an open file object.
Instead, I recommend refactoring your code to take an already opened file-like object. That is, instead of
def some_method(file_name):
    with open([file_name], 'r') as file_list:
        for line in file_list:
            # Do stuff

...
some_method(file_name)
write it as
def some_method(file_obj):
    for line in file_obj:
        # Do stuff

...
with open(file_name, 'r') as file_obj:
    some_method(file_obj)
This turns a function that has to perform IO into a pure(r) function that simply iterates over any file-like object. To test it, you don't need to mock open or hit the file system in any way; just create a StringIO object to use as the argument:
def test_it(self):
    f = StringIO.StringIO("line1\nline2\n")
    some_method(f)
(If you still feel the need to write and test a wrapper like
def some_wrapper(file_name):
    with open(file_name, 'r') as file_obj:
        some_method(file_obj)
note that you don't need the mocked open to do anything in particular. You test some_method separately, so the only thing you need to do to test some_wrapper is verify that the return value of open is passed to some_method. open, in this case, can be a plain old mock with no special behavior.)
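A minimal sketch of such a test, assuming some_wrapper and some_method live in a module called mymodule (the names are illustrative) and using unittest.mock (the standalone mock package works the same way):
from unittest import mock
from mymodule import some_wrapper

def test_wrapper(self):
    with mock.patch('mymodule.open', mock.mock_open()) as mocked_open, \
         mock.patch('mymodule.some_method') as mocked_method:
        some_wrapper('whatever.txt')
        # the file object returned by open() must be handed to some_method
        mocked_method.assert_called_once_with(mocked_open.return_value)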
I'm trying to understand if there is there a difference between these, and what that difference might be.
Option One:
file_obj = open('test.txt', 'r')
with file_obj as in_file:
    print in_file.readlines()
Option Two:
with open('test.txt', 'r') as in_file:
    print in_file.readlines()
I understand that with Option One, the file_obj is in a closed state after the with block.
I don't know why no one has mentioned this yet, because it's fundamental to the way with works. As with many language features in Python, with calls special methods behind the scenes, which are already defined for built-in Python objects and can be overridden by user-defined classes. In with's particular case (and for context managers more generally), the methods are __enter__ and __exit__.
Remember that in Python everything is an object -- even literals. This is why you can do things like 'hello'[0]. Thus, it does not matter whether you use the file object directly as returned by open:
with open('filename.txt') as infile:
    for line in infile:
        print(line)
or store it first with a different name (for example to break up a long line):
the_file = open('filename' + some_var + '.txt')
with the_file as infile:
    for line in infile:
        print(line)
Because the end result is that the_file, infile, and the return value of open all point to the same object, and that's what with is calling the __enter__ and __exit__ methods on. The built-in file object's __exit__ method is what closes the file.
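You can see this in a quick REPL session (assuming filename.txt exists); the name you bind makes no difference to the object that with manages:
>>> the_file = open('filename.txt')
>>> with the_file as infile:
...     print(the_file is infile)
...
True
>>> the_file.closed
True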
These behave identically. As a general rule, the meaning of Python code is not changed by assigning an expression to a variable in the same scope.
This is the same reason that these are identical:
f = open("myfile.txt")
vs
filename = "myfile.txt"
f = open(filename)
Regardless of whether you add an alias, the meaning of the code stays the same. The context manager has a deeper meaning than passing an argument to a function, but the principle is the same: the context manager magic is applied to the same object, and the file gets closed in both cases.
The only reason to choose one over the other is if you feel it helps code clarity or style.
There is no difference between the two - either way the file is closed when you exit the with block.
The second example you give is the typical way files are used in Python 2.6 and newer (when the with syntax was added).
You can verify that the first example also works in a REPL session like this:
>>> file_obj = open('test.txt', 'r')
>>> file_obj.closed
False
>>> with file_obj as in_file:
... print in_file.readlines()
<Output>
>>> file_obj.closed
True
So after the with blocks exits, the file is closed.
Normally the second example is how you would do this sort of thing, though.
There's no reason to create that extra variable file_obj; anything you might want to do with it after the end of the with block you can just do with in_file, because it's still in scope.
>>> in_file
<closed file 'test.txt', mode 'r' at 0x03DC5020>
If you just fire up Python and use either of those options, the net effect is the same as long as the base instance of Python's file object is not changed. (In both options the file is closed at the end of the with block, as you have already observed.)
There can be differences in how you use a context manager, however. Since file is an object, you can modify it or subclass it.
You can also open a file by just calling file(file_name), showing that file acts like other objects (but nobody opens files that way in Python unless it is inside a with):
>>> f=open('a.txt')
>>> f
<open file 'a.txt', mode 'r' at 0x1064b5ae0>
>>> f.close()
>>> f=file('a.txt')
>>> f
<open file 'a.txt', mode 'r' at 0x1064b5b70>
>>> f.close()
More generally, to open and close some resource called the_thing (commonly a file, but it can be anything), you follow these steps:
set up the_thing                 # resource specific: open, or call the obj
try:                             # generically __enter__
    yield pieces from the_thing
except:
    react if the_thing is broken
finally:                         # generically __exit__
    put the_thing away
A context manager lets you change the flow of those sub-steps more easily than procedural code woven between open and the other elements of the code.
Since Python 2.5, file objects have __enter__ and __exit__ methods:
>>> f=open('a.txt')
>>> f.__enter__
<built-in method __enter__ of file object at 0x10f836780>
>>> f.__exit__
<built-in method __exit__ of file object at 0x10f836780>
The default Python file object uses those methods in this fashion:
__init__(...)                # performs the initialization desired
__enter__() -> self          # in the case of `file()`, returns an open file handle
__exit__(*excinfo) -> None   # in the case of `file()`, closes the file
These methods can be changed for your own use to modify how a resource is treated when it is opened or closed. A context manager makes it really easy to modify what happens when you open or close a file.
Trivial example:
class Myopen(object):
    def __init__(self, fn, opening='', closing='', mode='r', buffering=-1):
        # set up the_thing
        if opening:
            print(opening)
        self.closing = closing
        self.f = open(fn, mode, buffering)

    def __enter__(self):
        # set up the_thing
        # could lock the resource here
        return self.f

    def __exit__(self, exc_type, exc_value, traceback):
        # put the_thing away
        # unlock, or whatever context-applicable putting away the_thing requires
        self.f.close()
        if self.closing:
            print(self.closing)
Now try that:
>>> with Myopen('a.txt', opening='Hello', closing='Good Night') as f:
... print f.read()
...
Hello
[contents of the file 'a.txt']
Good Night
Once you have control of entry and exit to a resource, there are many use cases:
Lock a resource to access it and use it; unlock when you are done
Make a quirky resource (like a memory file, database or web page) act more like a straight file resource
Open a database and roll back if there is an exception, but commit all writes if there are no errors (see the sketch after this list)
Temporarily change the context of a floating point calculation
Time a piece of code
Suppress or propagate exceptions by returning True or False from the __exit__ method.
You can read more examples in PEP 343.
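For instance, here is a minimal sketch of the database use case above, written with contextlib.contextmanager; conn is assumed to be any DB-API connection (e.g. from sqlite3), and the table name is illustrative:
from contextlib import contextmanager

@contextmanager
def transaction(conn):
    cur = conn.cursor()        # set up the_thing
    try:
        yield cur
    except Exception:
        conn.rollback()        # put the_thing away after a failure
        raise
    else:
        conn.commit()          # put the_thing away after success

# usage:
# with transaction(conn) as cur:
#     cur.execute("INSERT INTO t VALUES (1)")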
It is remarkable that with works even if return or sys.exit() is called inside the block (which means __exit__ is called anyway):
#!/usr/bin/env python
import sys

class MyClass:
    def __enter__(self):
        print("Enter")
        return self

    def __exit__(self, type, value, trace):
        print("type: {} | value: {} | trace: {}".format(type, value, trace))

# main code:
def myfunc(msg):
    with MyClass() as sample:
        print(msg)
        # also works if uncomment this:
        # sys.exit(0)
        return

myfunc("Hello")
The return version will show:
Enter
Hello
type: None | value: None | trace: None
The sys.exit(0) version will show:
Enter
Hello
type: <class 'SystemExit'> | value: 0 | trace: <traceback object at 0x7faca83a7e00>
I use AWS boto3 library which returns me an instance of urllib3.response.HTTPResponse. That response is a subclass of io.IOBase and hence behaves as a binary file. Its read() method returns bytes instances.
Now, I need to decode csv data from a file received in such a way. I want my code to work on both py2 and py3 with minimal code overhead, so I use backports.csv which relies on io.IOBase objects as input rather than on py2's file() objects.
The first problem is that HTTPResponse yields bytes data for the CSV file, while csv.reader expects str data.
>>> import io
>>> from backports import csv # actually try..catch statement here
>>> from mymodule import get_file
>>> f = get_file() # returns instance of urllib3.HTTPResponse
>>> r = csv.reader(f)
>>> list(r)
Error: iterator should return strings, not bytes (did you open the file in text mode?)
I tried to wrap HTTPResponse with io.TextIOWrapper and got the error 'HTTPResponse' object has no attribute 'read1'. This is expected because TextIOWrapper is intended to be used with BufferedIOBase objects, not IOBase objects. It only happens with Python 2's implementation of TextIOWrapper, because it always expects the underlying object to have read1 (source), while Python 3's implementation checks for read1 existence and falls back to read gracefully (source).
>>> f = get_file()
>>> tw = io.TextIOWrapper(f)
>>> list(csv.reader(tw))
AttributeError: 'HTTPResponse' object has no attribute 'read1'
Then I tried to wrap HTTPResponse with io.BufferedReader and then with io.TextIOWrapper. And I got the following error:
>>> f = get_file()
>>> br = io.BufferedReader(f)
>>> tw = io.TextIOWrapper(br)
>>> list(csv.reader(tw))
ValueError: I/O operation on closed file.
After some investigation it turns out that the error only happens when the file doesn't end with \n. If it does end with \n then the problem does not happen and everything works fine.
There is some additional logic for closing underlying object in HTTPResponse (source) which is seemingly causing the problem.
The question is: how can I write my code to
work on both python2 and python3, preferably with no try..catch or version-dependent branching;
properly handle CSV files represented as HTTPResponse regardless of whether they end with \n or not?
One possible solution would be to make a custom wrapper around TextIOWrapper which would make read() return b'' when the object is closed instead of raising ValueError. But is there any better solution, without such hacks?
Looks like this is an interface mismatch between urllib3.HTTPResponse and file objects. It is described in this urllib3 issue #1305.
For now there is no fix, hence I used the following wrapper code which seemingly works fine:
class ResponseWrapper(io.IOBase):
    """
    This is a wrapper around urllib3.HTTPResponse
    to work around issue shazow/urllib3#1305.

    Here we decouple HTTPResponse's "closed" status from ours.
    """
    # FIXME drop this wrapper after shazow/urllib3#1305 is fixed

    def __init__(self, resp):
        self._resp = resp

    def close(self):
        self._resp.close()
        super(ResponseWrapper, self).close()

    def readable(self):
        return True

    def read(self, amt=None):
        if self._resp.closed:
            return b''
        return self._resp.read(amt)

    def readinto(self, b):
        val = self.read(len(b))
        if not val:
            return 0
        b[:len(val)] = val
        return len(val)
And use it as follows:
>>> f = get_file()
>>> r = csv.reader(io.TextIOWrapper(io.BufferedReader(ResponseWrapper(f))))
>>> list(r)
A similar fix was proposed by the urllib3 maintainers in the bug report comments, but it would be a breaking change, so for now things will probably not change and I have to use the wrapper (or do some monkey patching, which is probably worse).
How to implement a ByteCountingStreamReader?
The ByteCountingStreamReader should wrap a file descriptor stream and count the bytes it passed.
A bit like codecs.StreamReader, but the content should not be changed, just counted.
Use case: Solve http://bugs.python.org/issue24259
The tarfile library does not compare the file size of the TarInfo with the actual bytes read from the tar.
Like this Java class, but for Python: http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/CountingInputStream.html
Here is a small wrapper function that replaces the read method of the (file) stream. It should also work for other types of streams, and a similar wrapper for the write function could be added.
Beware: readline() seems not to use read() internally, so it has to be wrapped, too, if you use it instead of plain vanilla read().
def ByteCountingStreamReader(stream):
    fr = stream.read
    stream.count = 0

    def inner(size=-1):
        s = fr(size)
        stream.count += len(s)
        return s

    stream.read = inner
    return stream

# testing it
myStream = open('/etc/hosts', 'r')
with ByteCountingStreamReader(myStream) as f:
    while True:
        s = f.read(20)
        if s == '':
            break
        print (s, end='')

print (f.count)
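If you also need readline() counted, a sketch along the same lines (the name ByteCountingStreamReader2 is just illustrative) wraps readline() exactly like read():
def ByteCountingStreamReader2(stream):
    # same idea as above, but readline() is wrapped as well,
    # since it does not go through read() internally
    fr, frl = stream.read, stream.readline
    stream.count = 0

    def read(size=-1):
        s = fr(size)
        stream.count += len(s)
        return s

    def readline(size=-1):
        s = frl(size)
        stream.count += len(s)
        return s

    stream.read = read
    stream.readline = readline
    return stream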