I'm working with python-gnupg to decrypt a file and the decrypted file content is very large so loading the entire contents into memory is not feasible.
I would like to short-circuit the write method in order to to manipulate the decrypted contents as it is written.
Here are some failed attempts:
import gpg
from StringIO import StringIO
# works but not feasible due to memory limitations
decrypted_data = gpg_client.decrypt_file(decrypted_data)
# works but no access to the buffer write method
gpg_client.decrypt_file(decrypted_data, output=buffer())
# fails with TypeError: coercing to Unicode: need string or buffer, instance found
class TestBuffer:
def __init__(self):
self.buffer = StringIO()
def write(self, data):
print('writing')
self.buffer.write(data)
gpg_client.decrypt_file(decrypted_data, output=TestBuffer())
Can anyone think of any other ideas that would allow me to create a file-like str or buffer object to output the data to?
You can implement a subclass of one of the classes in the io module described in I/O Base Classes, presumably io.BufferedIOBase. The standard library contains an example of something quite similar in the form of the zipfile.ZipExtFile class. At least this way, you won't have to implement complex functions like readline yourself.
Related
In python2 I have this in my test method:
mock_file = MagicMock(spec=file)
I'm moving to python3, and I can't figure out how to do a similar mock. I've tried:
from io import IOBase
mock_file = MagicMock(spec=IOBase)
mock_file = create_autospec(IOBase)
What am I missing?
IOBase does not implement crucial file methods such as read and write and is therefore usually unsuitable as a spec to create a mocked file object with. Depending on whether you want to mock a raw stream, a binary file or a text file, you can use RawIOBase, BufferedIOBase or TextIOBase as a spec instead:
from io import BufferedIOBase
mock_file = MagicMock(spec=BufferedIOBase)
I'd like to pass an instance of my class to write() and have it written to a file. The underlying data is simply a bytearray.
mine = MyClass()
with open('Test.txt', 'wb') as f:
f.write(mine)
I tried implementing __bytes__ to no avail. I'm aware of the buffer protocol but I believe it can only be implemented via the C API (though I did see talk of delegation to an underlying object that implemented the protocol).
No, you can't, there are no special methods you can implement that'll make your Python class support the buffer protocol.
Yes, the io.RawIOBase.write() and io.BufferedIOBase.write() methods document that they accept a bytes-like object, but the buffer protocol needed to make something bytes-like is a C-level protocol only. There is an open Python issue to add Python hooks but no progress has been made on this.
The __bytes__ special method is only called if you passed an object to the bytes() callable; .write() does not do this.
So, when writing to a file, only actual bytes-like objects are accepted, everything else must be converted to such an object first. I'd stick with:
with open('Test.txt', 'wb') as f:
f.write(bytes(mine))
which will call the MyClass.__bytes__() method, provided it is defined, or provide a method on your class that causes it to write itself to a file object:
with open('Test.txt', 'wb') as f:
mine.dump(f)
I use AWS boto3 library which returns me an instance of urllib3.response.HTTPResponse. That response is a subclass of io.IOBase and hence behaves as a binary file. Its read() method returns bytes instances.
Now, I need to decode csv data from a file received in such a way. I want my code to work on both py2 and py3 with minimal code overhead, so I use backports.csv which relies on io.IOBase objects as input rather than on py2's file() objects.
The first problem is that HTTPResponse yields bytes data for CSV file, and I have csv.reader which expects str data.
>>> import io
>>> from backports import csv # actually try..catch statement here
>>> from mymodule import get_file
>>> f = get_file() # returns instance of urllib3.HTTPResponse
>>> r = csv.reader(f)
>>> list(r)
Error: iterator should return strings, not bytes (did you open the file in text mode?)
I tried to wrap HTTPResponse with io.TextIOWrapper and got error 'HTTPResponse' object has no attribute 'read1'. This is expected becuase TextIOWrapper is intended to be used with BufferedIOBase objects, not IOBase objects. And it only happens on python2's implementation of TextIOWrapper because it always expects underlying object to have read1 (source), while python3's implementation checks for read1 existence and falls back to read gracefully (source).
>>> f = get_file()
>>> tw = io.TextIOWrapper(f)
>>> list(csv.reader(tw))
AttributeError: 'HTTPResponse' object has no attribute 'read1'
Then I tried to wrap HTTPResponse with io.BufferedReader and then with io.TextIOWrapper. And I got the following error:
>>> f = get_file()
>>> br = io.BufferedReader(f)
>>> tw = io.TextIOWrapper(br)
>>> list(csv.reader(f))
ValueError: I/O operation on closed file.
After some investigation it turns out that the error only happens when the file doesn't end with \n. If it does end with \n then the problem does not happen and everything works fine.
There is some additional logic for closing underlying object in HTTPResponse (source) which is seemingly causing the problem.
The question is: how can I write my code to
work on both python2 and python3, preferably with no try..catch or version-dependent branching;
properly handle CSV files represented as HTTPResponse regardless of whether they end with \n or not?
One possible solution would be to make a custom wrapper around TextIOWrapper which would make read() return b'' when the object is closed instead of raising ValueError. But is there any better solution, without such hacks?
Looks like this is an interface mismatch between urllib3.HTTPResponse and file objects. It is described in this urllib3 issue #1305.
For now there is no fix, hence I used the following wrapper code which seemingly works fine:
class ResponseWrapper(io.IOBase):
"""
This is the wrapper around urllib3.HTTPResponse
to work-around an issue shazow/urllib3#1305.
Here we decouple HTTPResponse's "closed" status from ours.
"""
# FIXME drop this wrapper after shazow/urllib3#1305 is fixed
def __init__(self, resp):
self._resp = resp
def close(self):
self._resp.close()
super(ResponseWrapper, self).close()
def readable(self):
return True
def read(self, amt=None):
if self._resp.closed:
return b''
return self._resp.read(amt)
def readinto(self, b):
val = self.read(len(b))
if not val:
return 0
b[:len(val)] = val
return len(val)
And use it as follows:
>>> f = get_file()
>>> r = csv.reader(ResponseWrapper(io.TextIOWrapper(io.BufferedReader(f))))
>>> list(r)
The similar fix was proposed by urllib3 maintainers in the bug report comments but it would be a breaking change hence for now things will probably not change, so I have to use wrapper (or do some monkey patching which is probably worse).
The python codecs module provides StreamWriter classes for transparently encoding output streams. For instance:
outstream = codecs.getwriter('utf8')(sys.__stdout__)
outstream.write(u'\u2713')
outstream.write(' A-OK!\n') # I want this to fail!
outstream.close()
However the problem I have with the default StreamWriter is that it will except str objects as well as unicode objects. If my program is writing a str to this stream, it is a bug and I want it to fail! Is there a way to make this happen without writing my own StreamWriter that enforces the type of objects written?
Also, I don't want my solution to be sensitive to sys.stdout.encoding, sys.stdout.isatty(), locale.getpreferredencoding(), sys.getfilesystemencoding(), os.environ["PYTHONIOENCODING"] or whatever other ways python has of trying to be clever.
If possible, do what you're trying to do in Python 3, which has a much stronger distinction between unicode and bytes. Failing that, you'll need to subclass StreamWriter, for example:
import codecs
class StrictUTF8Writer(codecs.StreamWriter):
'''A StreamWriter for utf8 that requires written objects be unicode'''
encode = codecs.utf_8_encode
def write(self, object):
if not isinstance(object, unicode):
raise ValueError('write() requires unicode object')
return codecs.StreamWriter.write(self, object)
I would like to create a class that describes a file resource and then pickle it. This part is straightforward. To be concrete, let's say that I have a class "A" that has methods to operate on a file. I can pickle this object if it does not contain a file handle. I want to be able to create a file handle in order to access the resource described by "A". If I have an "open()" method in class "A" that opens and stores the file handle for later use, then "A" is no longer pickleable. (I add here that opening the file includes some non-trivial indexing which cannot be cached--third party code--so closing and reopening when needed is not without expense). I could code class "A" as a factory that can generate file handles to the described file, but that could result in multiple file handles accessing the file contents simultaneously. I could use another class "B" to handle the opening of the file in class "A", including locking, etc. I am probably overthinking this, but any hints would be appreciated.
The question isn't too clear; what it looks like is that:
you have a third-party module which has picklable classes
those classes may contain references to files, which makes the classes themselves not picklable because open files aren't picklable.
Essentially, you want to make open files picklable. You can do this fairly easily, with certain caveats. Here's an incomplete but functional sample:
import pickle
class PicklableFile(object):
def __init__(self, fileobj):
self.fileobj = fileobj
def __getattr__(self, key):
return getattr(self.fileobj, key)
def __getstate__(self):
ret = self.__dict__.copy()
ret['_file_name'] = self.fileobj.name
ret['_file_mode'] = self.fileobj.mode
ret['_file_pos'] = self.fileobj.tell()
del ret['fileobj']
return ret
def __setstate__(self, dict):
self.fileobj = open(dict['_file_name'], dict['_file_mode'])
self.fileobj.seek(dict['_file_pos'])
del dict['_file_name']
del dict['_file_mode']
del dict['_file_pos']
self.__dict__.update(dict)
f = PicklableFile(open("/tmp/blah"))
print f.readline()
data = pickle.dumps(f)
f2 = pickle.loads(data)
print f2.read()
Caveats and notes, some obvious, some less so:
This class should operate directly on the file object you got from open. If you're using wrapper classes on files, like gzip.GzipFile, those should go above this, not below it. Logically, treat this as a decorator class on top of file.
If the file doesn't exist when you unpickle, it can't be unpickled and will throw an exception.
If it's a different file, the behavior may or may not make sense.
If the file mode includes file creation ('w+'), and the file doesn't exist, it'll be created; we don't know what file permissions to use, since that's not stored with the file. If this is important--it probably shouldn't be--then store the correct permissions in the class when you first create it.
If the file isn't seekable, trying to seek to the old position may raise IOError; if you're using a file like that you'll need to decide how to handle that.
The file classes in Python 2 and Python 3 are different; there's no file class in Python 3. Even if you're only using Python 2 right now, don't subclass file.
I'd steer away from doing this; having pickled data dependent on external files not changing and staying in the same place is brittle. This makes it difficult to even relocate files, since your pickled data won't make sense.
If you open a pointer to a file, pickle it, then attempt to reconstitute is later, there is no guarantee that file will still be available for opening.
To elaborate, the file pointer really represents a connection to the file. Just like a database connection, you can't "pickle" the other end of the connection, so this won't work.
Is it possible to keep the file pointer around in memory in its own process instead?
It sounds like you know you can't pickle the handle, and you're ok with that, you just want to pickle the part that can be pickled. As your object stands now, it can't be pickled because it has the handle. Do I have that right? If so, read on.
The pickle module will let your class describe its own state to pickle, for exactly these cases. You want to define your own __getstate__ method. The pickler will invoke it to get the state to be pickled, only if the method is missing does it go ahead and do the default thing of trying to pickle all the attributes.