Why doesn't tempfile.SpooledTemporaryFile implement readable, writable, seekable?

In Python 3.6.1, I've tried wrapping a tempfile.SpooledTemporaryFile in an io.TextIOWrapper:
with tempfile.SpooledTemporaryFile() as tfh:
    do_some_download(tfh)
    tfh.seek(0)
    wrapper = io.TextIOWrapper(tfh, encoding='utf-8')
    yield from do_some_text_formatting(wrapper)
The line wrapper = io.TextIOWrapper(tfh, encoding='utf-8') gives me an error:
AttributeError: 'SpooledTemporaryFile' object has no attribute 'readable'
If I create a simple class like this, I can bypass the error (I get similar errors for writable and seekable):
class MySpooledTempfile(tempfile.SpooledTemporaryFile):
    @property
    def readable(self):
        return self._file.readable

    @property
    def writable(self):
        return self._file.writable

    @property
    def seekable(self):
        return self._file.seekable
Is there a good reason why tempfile.SpooledTemporaryFile doesn't already have these attributes?

SpooledTemporaryFile actually uses two different _file implementations under the hood: initially an in-memory io buffer (StringIO or BytesIO), until it rolls over and creates a "file-like object" via tempfile.TemporaryFile() (for example, when max_size is exceeded).
io.TextIOWrapper requires the BufferedIOBase interface, which io.BytesIO provides, but which is not necessarily provided by the object returned by TemporaryFile() (though in my testing on OSX, TemporaryFile() returned an _io.BufferedRandom object, which had the desired interface; my theory is this may depend on the platform).
So, I would expect your MySpooledTempfile wrapper to possibly fail on some platforms after rollover.
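As a quick way to see both phases, here is a minimal sketch. It peeks at the private _file attribute purely for illustration; the exact type after rollover is platform-dependent, and the behaviour of the final wrapping line depends on the Python version (see the 3.11 note below):

import io
import tempfile

with tempfile.SpooledTemporaryFile(max_size=10) as tfh:
    print(type(tfh._file))   # in-memory buffer (io.BytesIO) before rollover
    tfh.write(b'x' * 100)    # exceeds max_size, so the data rolls over to disk
    print(type(tfh._file))   # platform-dependent file object after rollover
    tfh.seek(0)
    # On Python 3.11+ this is expected to work out of the box; on older
    # versions it raises the AttributeError described in the question.
    wrapper = io.TextIOWrapper(tfh, encoding='utf-8')
    print(wrapper.read()[:5])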

This is fixed in Python 3.11; see the changelog for reference.


Why use marshal to dump and pickle to load?

I'm trying to understand the Python 2 library remotely, which helps run code remotely through XML-RPC.
On the client, the author dumps the function's code object with marshal and loads the result sent back from the server with pickle:
def run(self, func, *args, **kwds):
    code_str = base64.b64encode(marshal.dumps(func.func_code))
    output = self.proxy.run(self.api_key, self.a_sync, code_str, *args, **kwds)
    return pickle.loads(base64.b64decode(output))
And on the server side, he does it the other way around:
def run(self, api_key, a_sync, func_str, *args, **kwds):
    # ... truncated code
    code = marshal.loads(base64.b64decode(func_str))
    func = types.FunctionType(code, globals(), "remote_func")
    # ... truncated code
    output = func(*args, **kwds)
    output = base64.b64encode(pickle.dumps(output))
    return output
What's the purpose of dumping with marshal and loading the result with pickle? (and vice-versa)
The object that is getting sent first using marshal is of a very specific type. It's a code object, and only that type needs to be supported. That is a type that the marshal module is designed to handle. The return value, on the other hand, can be of any type; it's determined by what the func function returns. The pickle module has a much more general protocol that can serialize many different types of objects, so there's a decent chance it will support the return value.
While you could use pickle for both data items being passed between your programs, the marshal module's output is a little more compact and efficient for passing code objects, as pickle just wraps around it. If you try dumping the same code object with both marshal and pickle (using protocol zero, the default in Python 2), you'll see the output from marshal inside the output from pickle!
So to summarize, the marshal module is used to send the code because only code objects need to be sent and it's a lower-level serialization, which can be a little more efficient some of the time. The return value is sent back using pickle because the program can't predict what type of object it will be, and pickle can serialize more complicated values than marshal can, at the cost of some additional complexity and (sometimes) a larger serialization.
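For illustration, here is a minimal Python 3 sketch of the same round trip (the question's code is Python 2, where the code object is func.func_code rather than func.__code__; also note that marshal output is only guaranteed to be readable by the same Python version that produced it):

import marshal
import pickle
import types

def add(a, b):
    return a + b

# marshal handles the very specific type: the code object.
code_bytes = marshal.dumps(add.__code__)

# The receiving side rebuilds a callable from that code object.
code = marshal.loads(code_bytes)
remote_func = types.FunctionType(code, globals(), "remote_func")

# pickle handles the arbitrary return value.
result_bytes = pickle.dumps(remote_func(2, 3))
print(pickle.loads(result_bytes))  # 5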

Can a Python class be written such that it may be passed to write()?

I'd like to pass an instance of my class to write() and have it written to a file. The underlying data is simply a bytearray.
mine = MyClass()
with open('Test.txt', 'wb') as f:
    f.write(mine)
I tried implementing __bytes__ to no avail. I'm aware of the buffer protocol but I believe it can only be implemented via the C API (though I did see talk of delegation to an underlying object that implemented the protocol).
No, you can't; there are no special methods you can implement that will make your Python class support the buffer protocol.
Yes, the io.RawIOBase.write() and io.BufferedIOBase.write() methods document that they accept a bytes-like object, but the buffer protocol needed to make something bytes-like is a C-level protocol only. There is an open Python issue to add Python hooks but no progress has been made on this.
The __bytes__ special method is only called if you pass an object to the bytes() callable; .write() does not do this.
So, when writing to a file, only actual bytes-like objects are accepted, everything else must be converted to such an object first. I'd stick with:
with open('Test.txt', 'wb') as f:
    f.write(bytes(mine))
which will call the MyClass.__bytes__() method, provided it is defined, or provide a method on your class that causes it to write itself to a file object:
with open('Test.txt', 'wb') as f:
    mine.dump(f)
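The question doesn't show MyClass itself, so here is just a minimal sketch of what it could look like under this approach, assuming it wraps a bytearray as the question states (the dump() name is the helper suggested above):

class MyClass:
    def __init__(self, data=b''):
        self._data = bytearray(data)   # the underlying data, as in the question

    def __bytes__(self):
        # Called by bytes(mine); returns an immutable copy of the buffer.
        return bytes(self._data)

    def dump(self, fileobj):
        # Write the buffer straight to an already-open binary file object.
        fileobj.write(self._data)

mine = MyClass(b'some payload')
with open('Test.txt', 'wb') as f:
    f.write(bytes(mine))   # via __bytes__
    mine.dump(f)           # via the helper method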

Implement Custom Str or Buffer in Python

I'm working with python-gnupg to decrypt a file, and the decrypted file content is very large, so loading the entire contents into memory is not feasible.
I would like to short-circuit the write method in order to manipulate the decrypted contents as they are written.
Here are some failed attempts:
import gpg
from StringIO import StringIO

# works but not feasible due to memory limitations
decrypted_data = gpg_client.decrypt_file(decrypted_data)

# works but no access to the buffer write method
gpg_client.decrypt_file(decrypted_data, output=buffer())

# fails with TypeError: coercing to Unicode: need string or buffer, instance found
class TestBuffer:
    def __init__(self):
        self.buffer = StringIO()

    def write(self, data):
        print('writing')
        self.buffer.write(data)

gpg_client.decrypt_file(decrypted_data, output=TestBuffer())
Can anyone think of any other ideas that would allow me to create a file-like str or buffer object to output the data to?
You can implement a subclass of one of the classes in the io module described in I/O Base Classes, presumably io.BufferedIOBase. The standard library contains an example of something quite similar in the form of the zipfile.ZipExtFile class. At least this way, you won't have to implement complex functions like readline yourself.
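To make the idea concrete, here is a minimal sketch. It subclasses io.RawIOBase rather than io.BufferedIOBase purely to keep the example short, the ChunkProcessor and handle_chunk names are made up, and whether python-gnupg will actually accept an arbitrary file-like object for output is something you would need to verify:

import io

class ChunkProcessor(io.RawIOBase):
    """Write-side sink: handle each decrypted chunk as it arrives,
    instead of accumulating the whole plaintext in memory."""

    def __init__(self, handle_chunk):
        self._handle_chunk = handle_chunk

    def writable(self):
        return True

    def write(self, data):
        # Called repeatedly by the producer with successive chunks.
        self._handle_chunk(bytes(data))
        return len(data)

# Hypothetical usage, if the decrypt call can write to a file-like object:
# gpg_client.decrypt_file(encrypted_fh, output=ChunkProcessor(process_chunk))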

Wrapping urllib3.HTTPResponse in io.TextIOWrapper

I use the AWS boto3 library, which returns me an instance of urllib3.response.HTTPResponse. That response is a subclass of io.IOBase and hence behaves as a binary file. Its read() method returns bytes instances.
Now, I need to decode CSV data from a file received in such a way. I want my code to work on both py2 and py3 with minimal code overhead, so I use backports.csv, which relies on io.IOBase objects as input rather than on py2's file() objects.
The first problem is that HTTPResponse yields bytes data for the CSV file, and I have csv.reader, which expects str data.
>>> import io
>>> from backports import csv # actually try..catch statement here
>>> from mymodule import get_file
>>> f = get_file() # returns instance of urllib3.HTTPResponse
>>> r = csv.reader(f)
>>> list(r)
Error: iterator should return strings, not bytes (did you open the file in text mode?)
I tried to wrap HTTPResponse with io.TextIOWrapper and got the error 'HTTPResponse' object has no attribute 'read1'. This is expected because TextIOWrapper is intended to be used with BufferedIOBase objects, not IOBase objects. And it only happens with Python 2's implementation of TextIOWrapper, because it always expects the underlying object to have read1 (source), while Python 3's implementation checks for read1's existence and falls back to read gracefully (source).
>>> f = get_file()
>>> tw = io.TextIOWrapper(f)
>>> list(csv.reader(tw))
AttributeError: 'HTTPResponse' object has no attribute 'read1'
Then I tried to wrap HTTPResponse with io.BufferedReader and then with io.TextIOWrapper. And I got the following error:
>>> f = get_file()
>>> br = io.BufferedReader(f)
>>> tw = io.TextIOWrapper(br)
>>> list(csv.reader(tw))
ValueError: I/O operation on closed file.
After some investigation it turns out that the error only happens when the file doesn't end with \n. If it does end with \n then the problem does not happen and everything works fine.
There is some additional logic for closing the underlying object in HTTPResponse (source), which is seemingly causing the problem.
The question is: how can I write my code to
1. work on both Python 2 and Python 3, preferably with no try/except or version-dependent branching;
2. properly handle CSV files represented as HTTPResponse regardless of whether they end with \n or not?
One possible solution would be to make a custom wrapper around TextIOWrapper which would make read() return b'' when the object is closed instead of raising ValueError. But is there any better solution, without such hacks?
Looks like this is an interface mismatch between urllib3.HTTPResponse and file objects. It is described in urllib3 issue #1305.
For now there is no fix, hence I used the following wrapper code, which seemingly works fine:
class ResponseWrapper(io.IOBase):
    """
    This is the wrapper around urllib3.HTTPResponse
    to work around issue shazow/urllib3#1305.
    Here we decouple HTTPResponse's "closed" status from ours.
    """
    # FIXME drop this wrapper after shazow/urllib3#1305 is fixed

    def __init__(self, resp):
        self._resp = resp

    def close(self):
        self._resp.close()
        super(ResponseWrapper, self).close()

    def readable(self):
        return True

    def read(self, amt=None):
        if self._resp.closed:
            return b''
        return self._resp.read(amt)

    def readinto(self, b):
        val = self.read(len(b))
        if not val:
            return 0
        b[:len(val)] = val
        return len(val)
And use it as follows:
>>> f = get_file()
>>> r = csv.reader(io.TextIOWrapper(io.BufferedReader(ResponseWrapper(f))))
>>> list(r)
A similar fix was proposed by the urllib3 maintainers in the bug report comments, but it would be a breaking change, so for now things will probably not change and I have to use the wrapper (or do some monkey patching, which is probably worse).

How do you check if an object is an instance of 'file'?

It used to be in Python (2.6) that one could ask:
isinstance(f, file)
but in Python 3.0 file was removed.
What is the proper method for checking to see if a variable is a file now? The What's New docs don't mention this...
def read_a_file(f):
    try:
        contents = f.read()
    except AttributeError:
        # f is not a file
        pass
Substitute whatever methods you plan to use for read. This is optimal if you expect that you will be passed a file-like object more than 98% of the time. If you expect that you will be passed a non-file-like object more often than 2% of the time, then the correct thing to do is:
def read_a_file(f):
    if hasattr(f, 'read'):
        contents = f.read()
    else:
        # f is not a file
        pass
This is exactly what you would do if you did have access to a file class to test against. (And FWIW, I too have file on 2.6.) Note that this code works in 3.x as well.
In Python 3 you can refer to io instead of file and write:
import io
isinstance(f, io.IOBase)
Typically, you don't need to check an object's type; you could use duck typing instead, i.e., just call f.read() directly and allow the possible exceptions to propagate -- it is either a bug in your code or a bug in the caller's code, e.g., json.load() raises AttributeError if you give it an object that has no read attribute.
If you need to distinguish between several acceptable input types, you can use hasattr/getattr:
def read(file_or_filename):
    readfile = getattr(file_or_filename, 'read', None)
    if readfile is not None:  # got file
        return readfile()
    with open(file_or_filename) as file:  # got filename
        return file.read()
If you want to support the case where file_or_filename may have a read attribute that is set to None, then you could use try/except over file_or_filename.read -- note: no parens, the call is not made -- e.g., ElementTree._get_writer().
If you want to check certain guarantees, e.g., that at most one system call is made (io.RawIOBase.read(n) for n > 0), or that there are no short writes (io.BufferedIOBase.write()), or whether the read/write methods accept text data (io.TextIOBase), then you can use the isinstance() function with the ABCs defined in the io module; e.g., look at how saxutils._gettextwriter() is implemented.
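As a small illustration of those isinstance() checks against the io ABCs (example.txt is just a throwaway file created for the demonstration):

import io

with open('example.txt', 'w+') as f:
    print(isinstance(f, io.IOBase))      # True: it is some kind of stream
    print(isinstance(f, io.TextIOBase))  # True: read()/write() deal in str

with open('example.txt', 'rb') as f:
    print(isinstance(f, io.TextIOBase))      # False: this stream deals in bytes
    print(isinstance(f, io.BufferedIOBase))  # True: default buffered binary reader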
Works for me on Python 2.6... Are you in a strange environment where builtins aren't imported by default, or where somebody has done del file, or something?
