How can I make a python StreamWriter REQUIRE unicode input?

The python codecs module provides StreamWriter classes for transparently encoding output streams. For instance:
outstream = codecs.getwriter('utf8')(sys.__stdout__)
outstream.write(u'\u2713')
outstream.write(' A-OK!\n') # I want this to fail!
outstream.close()
However, the problem I have with the default StreamWriter is that it will accept str objects as well as unicode objects. If my program writes a str to this stream, it is a bug and I want it to fail! Is there a way to make this happen without writing my own StreamWriter that enforces the type of objects written?
Also, I don't want my solution to be sensitive to sys.stdout.encoding, sys.stdout.isatty(), locale.getpreferredencoding(), sys.getfilesystemencoding(), os.environ["PYTHONIOENCODING"] or whatever other ways python has of trying to be clever.

If possible, do what you're trying to do in Python 3, which has a much stronger distinction between unicode and bytes. Failing that, you'll need to subclass StreamWriter, for example:
import codecs

class StrictUTF8Writer(codecs.StreamWriter):
    '''A StreamWriter for utf8 that requires written objects be unicode'''
    encode = codecs.utf_8_encode

    def write(self, object):
        if not isinstance(object, unicode):
            raise ValueError('write() requires unicode object')
        return codecs.StreamWriter.write(self, object)
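Used with the original example, the subclass behaves like this (a quick usage sketch based on the code above):

import sys

outstream = StrictUTF8Writer(sys.__stdout__)
outstream.write(u'\u2713')    # fine: unicode object
outstream.write(' A-OK!\n')   # raises ValueError: str, not unicode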

Can a Python class be written such that it may be passed to write()?

I'd like to pass an instance of my class to write() and have it written to a file. The underlying data is simply a bytearray.
mine = MyClass()
with open('Test.txt', 'wb') as f:
    f.write(mine)
I tried implementing __bytes__ to no avail. I'm aware of the buffer protocol but I believe it can only be implemented via the C API (though I did see talk of delegation to an underlying object that implemented the protocol).
No, you can't. There are no special methods you can implement that will make your Python class support the buffer protocol.
Yes, the io.RawIOBase.write() and io.BufferedIOBase.write() methods document that they accept a bytes-like object, but the buffer protocol needed to make something bytes-like is a C-level protocol only. There is an open Python issue to add Python hooks but no progress has been made on this.
The __bytes__ special method is only called if you pass an object to the bytes() callable; .write() does not do this.
So, when writing to a file, only actual bytes-like objects are accepted; everything else must be converted to such an object first. I'd stick with:
with open('Test.txt', 'wb') as f:
    f.write(bytes(mine))
which will call the MyClass.__bytes__() method, provided it is defined, or provide a method on your class that causes it to write itself to a file object:
with open('Test.txt', 'wb') as f:
    mine.dump(f)
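For illustration, a minimal sketch of a class supporting both approaches (MyClass and its bytearray payload are hypothetical stand-ins for the class in the question):

class MyClass(object):
    def __init__(self, data=b''):
        self._data = bytearray(data)   # the underlying data

    def __bytes__(self):
        # used by bytes(mine)
        return bytes(self._data)

    def dump(self, fileobj):
        # write ourselves to an already-open binary file object
        fileobj.write(bytes(self._data))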

Implement Custom Str or Buffer in Python

I'm working with python-gnupg to decrypt a file, and the decrypted file content is very large, so loading the entire contents into memory is not feasible.
I would like to short-circuit the write method in order to manipulate the decrypted contents as they are written.
Here are some failed attempts:
import gpg
from StringIO import StringIO

# works but not feasible due to memory limitations
decrypted_data = gpg_client.decrypt_file(decrypted_data)

# works but no access to the buffer write method
gpg_client.decrypt_file(decrypted_data, output=buffer())

# fails with TypeError: coercing to Unicode: need string or buffer, instance found
class TestBuffer:
    def __init__(self):
        self.buffer = StringIO()

    def write(self, data):
        print('writing')
        self.buffer.write(data)

gpg_client.decrypt_file(decrypted_data, output=TestBuffer())
Can anyone think of any other ideas that would allow me to create a file-like str or buffer object to output the data to?
You can implement a subclass of one of the classes in the io module described in I/O Base Classes, presumably io.BufferedIOBase. The standard library contains an example of something quite similar in the form of the zipfile.ZipExtFile class. At least this way, you won't have to implement complex functions like readline yourself.
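As a rough sketch of that idea, assuming the decryption call will invoke write() with bytes chunks on whatever object you pass as output (handle_chunk is a hypothetical callback of your own):

import io

class ProcessingWriter(io.BufferedIOBase):
    """File-like sink that processes each decrypted chunk as it arrives."""

    def __init__(self, handle_chunk):
        self._handle_chunk = handle_chunk

    def writable(self):
        return True

    def write(self, data):
        self._handle_chunk(bytes(data))   # manipulate the chunk instead of storing it all
        return len(data)

# gpg_client.decrypt_file(encrypted_file, output=ProcessingWriter(my_callback))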

Wrapping urllib3.HTTPResponse in io.TextIOWrapper

I use the AWS boto3 library, which returns an instance of urllib3.response.HTTPResponse. That response is a subclass of io.IOBase and hence behaves as a binary file. Its read() method returns bytes instances.
Now, I need to decode csv data from a file received in such a way. I want my code to work on both py2 and py3 with minimal code overhead, so I use backports.csv which relies on io.IOBase objects as input rather than on py2's file() objects.
The first problem is that HTTPResponse yields bytes data for the CSV file, while csv.reader expects str data.
>>> import io
>>> from backports import csv # actually try..catch statement here
>>> from mymodule import get_file
>>> f = get_file() # returns instance of urllib3.HTTPResponse
>>> r = csv.reader(f)
>>> list(r)
Error: iterator should return strings, not bytes (did you open the file in text mode?)
I tried to wrap HTTPResponse with io.TextIOWrapper and got the error 'HTTPResponse' object has no attribute 'read1'. This is expected because TextIOWrapper is intended to be used with BufferedIOBase objects, not IOBase objects. It only happens with python2's implementation of TextIOWrapper, because it always expects the underlying object to have read1 (source), while python3's implementation checks for read1 existence and falls back to read gracefully (source).
>>> f = get_file()
>>> tw = io.TextIOWrapper(f)
>>> list(csv.reader(tw))
AttributeError: 'HTTPResponse' object has no attribute 'read1'
Then I tried to wrap HTTPResponse with io.BufferedReader and then with io.TextIOWrapper. And I got the following error:
>>> f = get_file()
>>> br = io.BufferedReader(f)
>>> tw = io.TextIOWrapper(br)
>>> list(csv.reader(tw))
ValueError: I/O operation on closed file.
After some investigation it turns out that the error only happens when the file doesn't end with \n. If it does end with \n then the problem does not happen and everything works fine.
There is some additional logic for closing underlying object in HTTPResponse (source) which is seemingly causing the problem.
The question is: how can I write my code so that it
1. works on both python2 and python3, preferably with no try..except or version-dependent branching;
2. properly handles CSV files represented as HTTPResponse, regardless of whether they end with \n?
One possible solution would be to make a custom wrapper around TextIOWrapper which would make read() return b'' when the object is closed instead of raising ValueError. But is there any better solution, without such hacks?
Looks like this is an interface mismatch between urllib3.HTTPResponse and file objects. It is described in urllib3 issue #1305.
There is no fix yet, so I used the following wrapper code, which seems to work fine:
class ResponseWrapper(io.IOBase):
    """
    This is the wrapper around urllib3.HTTPResponse
    to work around an issue shazow/urllib3#1305.
    Here we decouple HTTPResponse's "closed" status from ours.
    """
    # FIXME drop this wrapper after shazow/urllib3#1305 is fixed

    def __init__(self, resp):
        self._resp = resp

    def close(self):
        self._resp.close()
        super(ResponseWrapper, self).close()

    def readable(self):
        return True

    def read(self, amt=None):
        if self._resp.closed:
            return b''
        return self._resp.read(amt)

    def readinto(self, b):
        val = self.read(len(b))
        if not val:
            return 0
        b[:len(val)] = val
        return len(val)
And use it as follows:
>>> f = get_file()
>>> r = csv.reader(io.TextIOWrapper(io.BufferedReader(ResponseWrapper(f))))
>>> list(r)
A similar fix was proposed by the urllib3 maintainers in the bug report comments, but it would be a breaking change, so for now things will probably not change and I have to use the wrapper (or do some monkey patching, which is probably worse).

Dealing with ctypes and ASCII strings when porting Python 2 code to Python 3

I got fed up last night and started porting PyVISA to Python 3 (progress here: https://github.com/thevorpalblade/pyvisa).
I've gotten it to the point where everything works, as long as I pass device addresses (well, any string, really) as an ASCII byte string rather than the default unicode string. For example,
HP = vida.instrument(b"GPIB::16") works, whereas
HP = vida.instrument("GPIB::16") does not, raising a ValueError.
Ideally, the end user should not have to care about string encoding.
Any suggestions as to how I should approach this? Something in the ctypes type definitions perhaps?
As it stands, the relevant ctypes type definition is:
ViString = _ctypes.c_char_p
ctypes, like most things in Python 3, intentionally doesn't automatically convert between unicode and bytes. That's because in most use cases, that would just be asking for the same kind of mojibake or UnicodeEncodeError disasters that people switched to Python 3 to avoid.
However, when you know you're only dealing with pure ASCII, that's another story. You have to be explicit—but you can factor out that explicitness into a wrapper.
As explained in Specifying the required argument types (function prototypes), in addition to a standard ctypes type, you can pass any class that has a from_param classmethod—which normally returns an instance of some type (usually the same type) with an _as_parameter_ attribute, but can also just return a native ctypes-type value instead.
class Asciifier(object):
    @classmethod
    def from_param(cls, value):
        if isinstance(value, bytes):
            return value
        else:
            return value.encode('ascii')
This may not be the exact rule you want. For example, it will fail on bytearray (just as c_char_p will), even though that could quietly be converted to bytes, while you presumably wouldn't want an int implicitly converted to bytes. Anyway, whatever rule you decide on should be easy to code.
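For instance, a variant that also accepts bytearray might look like this (just a sketch of one possible rule, not part of the answer above):

class LenientAsciifier(object):
    @classmethod
    def from_param(cls, value):
        # accept anything already byte-like, but still reject ints and friends
        if isinstance(value, (bytes, bytearray)):
            return bytes(value)
        return value.encode('ascii')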
Here's an example (on OS X; you'll obviously have to change how libc is loaded for linux, Windows, etc., but you presumably know how to do that):
>>> from ctypes import CDLL, c_int
>>> libc = CDLL('libSystem.dylib')
>>> libc.atoi.argtypes = [Asciifier]
>>> libc.atoi.restype = c_int
>>> libc.atoi(b'123')
123
>>> libc.atoi('123')
123
>>> libc.atoi('０１２３') # Unicode fullwidth digits
ArgumentError: argument 1: <class 'UnicodeEncodeError'>: 'ascii' codec can't encode character '\uff10' in position 0: ordinal not in range(128)
>>> libc.atoi(123)
ArgumentError: argument 1: <class 'AttributeError'>: 'int' object has no attribute 'encode'
Obviously you can catch the exception and raise a different one if those aren't clear enough for your use case.
You can similarly write a Utf8ifier, or an Encodifier(encoding, errors=None) class factory, or whatever else you need for some particular library and stick it in the argtypes the same way.
If you also want to auto-decode return types, see Return types and errcheck.
One last thing: When you're sure the data are supposed to be UTF-8, but you want to deal with the case where they aren't in the same way Python 2.x would (by preserving them as-is), you can even do that in 3.x. Use the aforementioned Utf8ifier as your argtype, and a decoder errcheck, and use errors=surrogateescape. See here for a complete example.
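To make that concrete, here is a rough sketch of an Encodifier factory plus a decoding errcheck; the names, the getenv example, and the choice of UTF-8 are illustrative assumptions, not part of PyVISA:

from ctypes import CDLL, c_char_p

def Encodifier(encoding, errors=None):
    # hypothetical factory: build a from_param class for a given encoding
    class _Encoder(object):
        @classmethod
        def from_param(cls, value):
            if isinstance(value, bytes):
                return value
            return value.encode(encoding, errors or 'strict')
    return _Encoder

def decode_result(result, func, args):
    # errcheck hook: decode a c_char_p result back to str,
    # preserving undecodable bytes via surrogateescape
    if result is None:
        return None
    return result.decode('utf-8', errors='surrogateescape')

libc = CDLL('libSystem.dylib')               # OS X, as in the example above
libc.getenv.argtypes = [Encodifier('utf-8')]
libc.getenv.restype = c_char_p
libc.getenv.errcheck = decode_result
print(libc.getenv('HOME'))                   # accepts str, returns str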

Can I make decode(errors="ignore") the default for all strings in a Python 2.7 program?

I have a Python 2.7 program that writes out data from various external applications. I continually get bitten by exceptions when I write to a file until I add .decode(errors="ignore") to the string being written out. (FWIW, opening the file as mode="wb" doesn't fix this.)
Is there a way to say "ignore encoding errors on all strings in this scope"?
You cannot redefine methods on built-in types, and you cannot change the default value of the errors parameter to str.decode(). There are other ways to achieve the desired behaviour, though.
The slightly nicer way: Define your own decode() function:
def decode(s, encoding="ascii", errors="ignore"):
    # str.decode() in Python 2 takes no keyword arguments, so pass them positionally
    return s.decode(encoding, errors)
Now, you will need to call decode(s) instead of s.decode(), but that's not too bad, is it?
The hack: You can't change the default value of the errors parameter, but you can overwrite what the handler for the default errors="strict" does:
import codecs

def strict_handler(exception):
    return u"", exception.end

codecs.register_error("strict", strict_handler)
This will essentially change the behaviour of errors="strict" to the standard "ignore" behaviour. Note that this will be a global change, affecting all modules you import.
I recommend neither of these two ways. The real solution is to get your encodings right. (I'm well aware that this isn't always possible.)
As mentioned in my thread on the issue, the hack from Sven Marnach is possible even without a new function:
import codecs
codecs.register_error("strict", codecs.ignore_errors)
I'm not sure what your setup is exactly, but you can derive a class from str and override its decode method:
class easystr(str):
    def decode(self, encoding="ascii", errors="ignore"):
        return str.decode(self, encoding, errors)
If you then convert all incoming strings to easystr, errors will be silently ignored:
line = easystr(input.readline())
That said, decoding a string converts it to unicode, which should never be lossy. Could you figure out which encoding your strings are using and give that as the encoding argument to decode? That would be a better solution (and you can still make it the default in the above way).
Yet another thing you should try is to read your data differently. Do it like this and the decoding errors may well disappear:
import codecs
input = codecs.open(filename, "r", encoding="latin-1") # or whatever
