Supporting Python 2 and 3: str, bytes, or an alternative

I have a Python2 codebase that makes extensive use of str to store raw binary data. I want to support both Python2 and Python3.
The bytes type (an alias of str) in Python 2 and bytes in Python 3 are completely different: they take different constructor arguments, index to different types, and have different str and repr output.
What's the best way of unifying the code for both Python versions, using a single type to store raw data?

The python-future package has a backport of the Python3 bytes type.
>>> from builtins import bytes # in py2, this picks up the backport
>>> b = bytes(b'ABCD')
This provides the Python 3 interface in both Python 2 and Python 3. In Python 3, it is the builtin bytes type. In Python 2, it is a compatibility layer on top of the str type.
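A quick sanity check of the unified interface (a sketch; behavior as documented by python-future):
from builtins import bytes  # backport on Python 2, the builtin on Python 3

b = bytes(b'ABCD')
print(b[0])               # 65 -- indexing yields an int on both versions
print(b.decode('ascii'))  # 'ABCD' -- decoding yields a text string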

I don't know which parts of your code need to work with bytes; I almost always work with bytearrays, and this is how I do it when reading from a file:
with open(file, 'rb') as imageFile:
    f = imageFile.read()
    b = bytearray(f)
I took that right out of a project I am working on, and it works in both 2 and 3. Maybe something for you to look at?
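Part of why this works: bytearray already behaves consistently across versions, for example:
b = bytearray(b'\x00\x01\xff')
print(b[0] == 0)  # True on both Python 2 and 3 -- indexing yields ints
b[0] = 0x7f       # and it is mutable in place on both versions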

If your project is small and simple, use six.
Otherwise I suggest to have two independent codebases: one for Python 2 and one for Python 3. Initially it may sound like a lot of unnecessary work, but eventually it's actually a lot easier to maintain.
As an example of what your project may become if you decide to support both pythons in a single codebase, take a look at google's protobuf. Lots of often counterintuitive branching all round the code, abstractions that were modified just to allow hacks. And as your project will evolve it won't get better: deadlines play against quality of the code.
With two separate codebases you will simply apply almost identical patches, which isn't a lot of work compared to what is ahead of you if you want a single codebase. And it will be easier to migrate to Python 3 completely once the number of Python 2 users of your package drops.

Assuming you only need to support Python 2.6 and newer, you can simply use bytes for, well, bytes. Use b literals to create bytes objects, such as b'\x0a\x0b\x00'. When working with files, make sure the mode includes a b (as in open('file.bin', 'rb')).
Beware that iteration and element access differ, though. In those cases, you can write your code to use slices: instead of b[0] == 0 (Python 3) or b[0] == b'\x00' (Python 2), write b[0:1] == b'\x00'. Other options are using bytearray (when the bytes are mutable) or helper functions.
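A sketch of the helper-function option (byte_at is a made-up name):
def byte_at(data, i):
    # A one-byte slice is bytes on both versions, and ord() of a
    # length-1 bytes/str is an int on both.
    return ord(data[i:i+1])

b = b'\x0a\x0b\x00'
assert byte_at(b, 2) == 0
assert b[0:1] == b'\x0a'  # slice comparison works on 2.6+ and 3.x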
Strings of characters should be unicode in Python 2, independent from Python 3 porting; otherwise the code would likely be wrong when encountering non-ASCII characters anyways. The equivalent is str in Python 3.
Either use u literals to create character strings (such as u'Düsseldorf') and/or make sure to start every file with from __future__ import unicode_literals. Declare file encodings when necessary by starting files with # encoding: utf-8.
Use io.open to read character strings from files. For network code, fetch bytes and call decode on them to get a character string.
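For instance (a minimal sketch; the file name is made up):
import io

with io.open('notes.txt', mode='r', encoding='utf-8') as f:
    text = f.read()  # unicode on Python 2, str on Python 3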
If you need to support Python 2.5 or 3.2, have a look at six to convert literals.
Add plenty of assertions to make sure that functions which operate on character strings don't get bytes, and vice versa. As usual, a good test suite with 100% coverage helps a lot.
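A typical guard might look like this (a sketch; shout is a hypothetical function):
def shout(text):
    # Reject bytes early; on Python 2 this also rejects native str,
    # which is intended -- character strings should be unicode there.
    assert not isinstance(text, bytes), 'expected a character string, got bytes'
    return text.upper()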

Related

how to convert Python 2 unicode() function into correct Python 3.x syntax

I enabled the compatibility check in my Python IDE and now I realize that the inherited Python 2.7 code has a lot of calls to unicode() which are not allowed in Python 3.x.
I looked at the docs of Python 2 and found no hint on how to upgrade.
I don't want to switch to Python3 now, but maybe in the future.
The code contains about 500 calls to unicode()
How to proceed?
Update
User vaultah's comment to read the pyporting guide has received several upvotes.
My current solution is this (thanks to Peter Brittain):
from builtins import str
...I could not find this hint in the pyporting docs.
As has already been pointed out in the comments, there is already advice on porting from 2 to 3.
Having recently had to port some of my own code from 2 to 3 and maintain compatibility for each for now, I wholeheartedly recommend using python-future, which provides a great tool to help update your code (futurize) as well as clear guidance for how to write cross-compatible code.
In your specific case, I would simply convert all calls to unicode to use str and then import str from builtins. Any IDE worth its salt these days will do that global search and replace in one operation.
Of course, that's the sort of thing futurize should catch too, if you just want to use automatic conversion (and to look for other potential issues in your code).
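After the search-and-replace, a cross-compatible module would start roughly like this (a sketch using python-future's backport):
from builtins import str  # python-future backport on Python 2, the builtin on Python 3

label = str(42)  # behaves like the Python 3 str on both versions
assert isinstance(label, str)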
You can test whether there is such a function as unicode() in the version of Python that you're running. If not, you can create a unicode() alias for the str() function, which does in Python 3 what unicode() did in Python 2, as all strings are unicode in Python 3.
# Python 3 compatibility hack
try:
    unicode('')
except NameError:
    unicode = str
Note that a more complete port is probably a better idea; see the porting guide for details.
Short answer: Replace all unicode calls with str calls.
Long answer: In Python 3, the unicode type was renamed to str, since all text strings in Python 3 are Unicode. The following works if you are only using Python 3:
unicode = str
# the rest of your code goes here
If you are using it with both Python 2 or Python 3, use this instead:
import sys
if sys.version_info.major == 3:
    unicode = str
# the rest of your code goes here
The other way: run this in the command line
$ 2to3 package -w
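2to3's unicode fixer rewrites such calls mechanically; the effect is roughly this (an illustrative before/after, not actual tool output):
# before (Python 2)
name = unicode(raw_name)
# after running 2to3
name = str(raw_name)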
First, as a strategy, I would take a small part of your program and try to port it. The number of unicode calls you are describing suggests to me that your application cares about string representations more than most, and each use-case is often different.
The important consideration is that all strings are unicode in Python 3. If you are using the str type to store "bytes" (for example, if they are read from a file), then you should be aware that those will not be bytes in Python3 but will be unicode characters to begin with.
Let's look at a few cases.
First, if you do not have any non-ASCII characters at all and really are not using the Unicode character set, it is easy. Chances are you can simply change the unicode() function to str(). That will assure that any object passed as an argument is properly converted. However, it is wishful thinking to assume it's that easy.
Most likely, you'll need to look at the argument to unicode() to see what it is, and determine how to treat it.
For example, if you are reading UTF-8 characters from a file in Python 2 and converting them to Unicode your code would look like this:
data = open('somefile', 'r').read()
udata = unicode(data, 'utf-8')
However, in Python3, read() returns Unicode data to begin with, and the unicode decoding must be specified when opening the file:
udata = open('somefile', 'r', encoding='UTF-8').read()
As you can see, transforming unicode() simply when porting may depend heavily on how and why the application is doing Unicode conversions, where the data has come from, and where it is going to.
Python3 brings greater clarity to string representations, which is welcome, but can make porting daunting. For example, Python3 has a proper bytes type, and you convert byte-data to unicode like this:
udata = bytedata.decode('UTF-8')
or convert Unicode data back to bytes using the opposite transform:
bytedata = udata.encode('UTF-8')
I hope this at least helps determine a strategy.
You can use the six library, which has text_type (unicode in py2, str in py3):
from six import text_type
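Usage is straightforward (a small sketch):
label = text_type(42)  # u'42' on Python 2, '42' on Python 3
assert isinstance(u'abc', text_type)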

unicode identifiers without sacrificing python 2 compatibility

I have a python units package. I want to have both Å and angstrom be two aliases for angstroms, so that people can use whichever one they prefer (Å is easier to read but angstrom is easier to type). Since unicode identifiers are forbidden in Python 2, the Å option will obviously only be available in Python 3. My question is: Is there any way to have a single source file that works in both Python 2 and Python 3, and has this variable defined in Python 3 only?
The naive solution if sys.version_info >= (3,): Å = angstrom does not work, because Python 2 raises a syntax error.
Normally, I don't encourage anything other than ASCII characters for variable names... However, this is an interesting idea, since the angstrom symbol actually has its standard meaning in this context, so I guess I'm cool with it this time :-).
I think you should be able to accomplish this with:
globals()['Å'] = angstrom
and this will "work" on both python2.x and python3.x. Of course, your python2.x users won't be able to reference it in their code without reverting to weird hacks like getattr(units, 'Å'), but it won't throw an error either which is the point of the question (I think).

Python doesn't interpret UTF8 correctly

I know similar questions have been asked a million times, but despite reading through many of them I can't find a solution that applies to my situation.
I have a django application, in which I've created a management script. This script reads some text files, and outputs them to the terminal (it will do more useful stuff with the contents later, but I'm still testing it out) and the characters come out with escape sequences like \xc3\xa5 instead of the intended å. Since that escape sequence means Ã¥, which is a common misinterpretation of å because of encoding problems, I suspect there are at least two places where this is going wrong. However, I can't figure out where - I've checked all the possible culprits I can think of:
The terminal encoding is UTF-8; echo $LANG gives en_US.UTF-8
The text files are encoded in UTF-8; file * in the directory where they reside results in all entries being listed as "UTF-8 Unicode text" except one, which does not contain any non-ASCII characters and is listed as "ASCII text". Running iconv -f ascii -t utf8 thefile.txt > utf8.txt on that file yields another file with ASCII text encoding.
The Python scripts are all UTF-8 (or, in several cases, ASCII with no non-ASCII characters). I tried inserting a comment in my management script with some special characters to force it to save as UTF-8, but it did not change the behavior. The above observations on the text files apply on all Python script files as well.
The Python script that handles the text files has # -*- encoding: utf-8 -*- at the top; the only line preceding that is #!/usr/bin/python3, but I've tried both changing to .../python for Python 2.7 or removing it entirely to leave it up to Django, without results.
According to the documentation, "Django natively supports Unicode data", so I "can safely pass around Unicode strings" anywhere in the application.
I really can't think of anywhere else to look for a non-UTF-8 link in the chain. Where could I possibly have missed a setting to change to UTF-8?
For completeness: I'm reading from the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.
UPDATE:
In response to questions in the comments:
print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None) for all files.
I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the value in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not. Adding .decode('utf-8') yields a tuple where both strings are marked with a prepending u and \xe5 (the correct escape sequence for å) instead of the odd characters before - but I can't figure out how to print them as regular strings, with no escape characters. I've tested another call to .decode('utf-8') as well as wrapping in str() but both fail with UnicodeEncodeError complaining that \xe5 can't be encoded in ascii. Since a single string works correctly, I don't know what else to test.
SSCCE:
# -*- coding: utf-8 -*-
import os, sys
for root, dirs, files in os.walk('txt-songs'):
    for filename in files:
        with open(os.path.join(root, filename)) as f:
            print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding)
            lines = f.readlines()
            print(lines[0].strip())                    # works
            print(lines[0].strip(), lines[1].strip())  # does not work
The big problem here is that you're mixing up Python 2 and Python 3. In particular, you've written Python 3 code, and you're trying to run it in Python 2.7. But there are a few other problems along the way. So, let me try to explain everything that's going wrong.
I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the value in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not.
The first problem here is that the str of a tuple (or any other collection) includes the repr, not the str, of its elements. The simple way to solve this problem is to not print collections. In this case, there is really no reason to print a tuple at all; the only reason you have one is that you've built it for printing. Just do something like this:
print '({}, {})'.format(lines[0].strip(), lines[1].strip())
In cases where you already have a collection in a variable, and you want to print out the str of each element, you have to do that explicitly. You can print the repr of the str of each with this:
print tuple(map(str, my_tuple))
… or print the str of each directly with this:
print '({})'.format(', '.join(map(str, my_tuple)))
Notice that I'm using Python 2 syntax above. That's because if you actually used Python 3, there would be no tuple in the first place, and there would also be no need to call str.
You've got a Unicode string. In Python 3, unicode and str are the same type. But in Python 2, it's bytes and str that are the same type, and unicode is a different one. So, in 2.x, you don't have a str yet, which is why you need to call str.
And Python 2 is also why print(lines[0].strip(), lines[1].strip()) prints a tuple. In Python 3, that's a call to the print function with two strings as arguments, so it will print out two strings separated by a space. In Python 2, it's a print statement with one argument, which is a tuple.
If you want to write code that works the same in both 2.x and 3.x, you either need to avoid ever printing more than one argument, or use a wrapper like six.print_, or do a from __future__ import print_function, or be very careful to do ugly things like adding in extra parentheses to make sure your tuples are tuples in both versions.
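The __future__ route looks like this (a minimal sketch):
from __future__ import print_function  # must come before any other statement

a, b = 'first', 'second'
print(a, b)  # a real function call on Python 2 too: prints "first second", not a tuple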
So, in 3.x, you've got str objects and you just print them out. In 2.x, you've got unicode objects, and you're printing out their repr. You can change that to print out their str, or to avoid printing a tuple in the first place… but that still won't help anything.
Why? Well, printing anything, in either version, just calls str on it and then passes it to sys.stdout.write. But in 3.x, str means unicode, and sys.stdout is a TextIOWrapper; in 2.x, str means bytes, and sys.stdout is a binary file.
So, the pseudocode for what ultimately happens is:
# Python 3.x
sys.stdout.wrapped_binary_file.write(s.encode(sys.stdout.encoding, sys.stdout.errors))
# Python 2.x
sys.stdout.write(s.encode(sys.getdefaultencoding()))
And, as you saw, those will do different things, because:
print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None)
You can simulate Python 3 here by using an io.TextIOWrapper or codecs.StreamWriter and then using print >>f, … or f.write(…) instead of print, or you can explicitly encode all your unicode objects like this:
print '({})'.format(', '.join(element.encode('utf-8') for element in my_tuple))
But really, the best way to deal with all of these problems is to run your existing Python 3 code in a Python 3 interpreter instead of a Python 2 interpreter.
If you want or need to use Python 2.7, that's fine, but you have to write Python 2 code. If you want to write Python 3 code, that's great, but you have to run Python 3.3. If you really want to write code that works properly in both, you can, but it's extra work, and takes a lot more knowledge.
For further details, see What's New In Python 3.0 (the "Print Is A Function" and "Text Vs. Data Instead Of Unicode Vs. 8-bit" sections), although that's written from the point of view of explaining 3.x to 2.x users, which is backward from what you need. The 3.x and 2.x versions of the Unicode HOWTO may also help.
For completeness: I'm reading from the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.
In Python 3.x, the standard print function just writes Unicode to sys.stdout. Since that's an io.TextIOWrapper, its write method is equivalent to this:
self.wrapped_binary_file.write(s.encode(self.encoding, self.errors))
So one likely problem is that sys.stdout.encoding does not match your terminal's actual encoding.
And of course another is that your shell's encoding does not match your terminal window's encoding.
For example, on OS X, I create a myscript.py like this:
print('\u00e5')
Then I fire up Terminal.app, create a session profile with encoding "Western (ISO Latin 1)", create a tab with that session profile, and do this:
$ export LANG=en_US.UTF-8
$ python3 myscript.py
… and I get exactly the behavior you're seeing.
It seems from your comment that you are using Python 2, not Python 3.
If you are using python-3, it's worth reading the unicode howto guide on reading/writing to understand what python is doing.
The basic flow of encoding is:
decode from the source encoding to unicode -> process -> encode from unicode to the target encoding
In Python 3, bytes are decoded to strings and strings are encoded to bytes.
The bytes-to-string decoding is handled for you by open().
[..] the built-in open() function can return a file-like object that
assumes the file’s contents are in a specified encoding and accepts
Unicode parameters for methods such as read() and write(). This works
through open()'s encoding and errors parameters [..]
So to read in unicode from a utf-8 encoded file you should be doing this:
# python-3
with open('utf8.txt', mode='r', encoding='utf-8') as f:
    lines = f.readlines()  # returns unicode
If you want similar functionality using python-2, you can use codecs.open():
# python-2
import codecs
with codecs.open('utf8.txt', mode='r', encoding='utf-8') as f:
    lines = f.readlines()  # returns unicode

How to define a binary string in Python in a way that works with both py2 and py3?

I am writing a module that is supposed to work in both Python 2 and 3 and I need to define a binary string.
Usually this would be something like data = b'abc', but this code fails on Python 2.5 with a syntax error.
How can I write the above code in a way that will work in all versions of Python 2.5+?
Note: this has to be binary (it can contain any kind of character, e.g. 0xFF); this is very important.
I would recommend the following:
from six import b
That requires the six module, of course.
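For example (a small sketch):
from six import b

data = b('\x80\xff')  # str on Python 2, bytes on Python 3; all 256 byte values work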
If you don't want that, here's another version:
import sys
if sys.version < '3':
    def b(x):
        return x
else:
    import codecs
    def b(x):
        return codecs.latin_1_encode(x)[0]
More info.
These solutions (essentially the same) work, are clean, as fast as you are going to get, and can support all 256 byte values (which none of the other solutions here can).
If the string only has ASCII characters, call encode. This will give you a str in Python 2 (just like b'abc'), and a bytes in Python 3:
'abc'.encode('ascii')
If not, rather than putting binary data in the source, create a data file, open it with 'rb' and read from it.
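In code, the data-file approach is identical on both versions (payload.bin is a hypothetical name):
with open('payload.bin', 'rb') as f:
    data = f.read()  # str on Python 2, bytes on Python 3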
You could store the data base64-encoded.
First step would be to transform into base64:
>>> import base64
>>> base64.b64encode(b"\x80\xFF")
b'gP8='
This is done once, and whether you use the b prefix depends on the version of Python you do it with.
In the second step, you put this string into the program without the b.
Then it is guaranteed to work in both py2 and py3.
import base64
x = 'gP8='
base64.b64decode(x.encode("latin1"))
gives you a str '\x80\xff' in 2.6 (it should work in 2.5 as well) and b'\x80\xff' in 3.x.
Alternatively, you can do the same with hex data:
import binascii
x = '80FF'
binascii.unhexlify(x) # `bytes()` in 3.x, `str()` in 2.x

Deterministic key serialization

I'm writing a mapping class which persists to disk. I am currently allowing only str keys, but it would be nice if I could use a couple more types: hopefully up to anything that is hashable (i.e. the same requirements as the builtin dict), but more reasonably I would accept string, unicode, int, and tuples of these types.
To that end I would like to derive a deterministic serialization scheme.
Option 1 - Pickling the key
The first thought I had was to use the pickle (or cPickle) module to serialize the key, but I noticed that the output from pickle and cPickle do not match each other:
>>> import pickle
>>> import cPickle
>>> def dumps(x):
...     print repr(pickle.dumps(x))
...     print repr(cPickle.dumps(x))
...
>>> dumps(1)
'I1\n.'
'I1\n.'
>>> dumps('hello')
"S'hello'\np0\n."
"S'hello'\np1\n."
>>> dumps((1, 2, 'hello'))
"(I1\nI2\nS'hello'\np0\ntp1\n."
"(I1\nI2\nS'hello'\np1\ntp2\n."
Is there any implementation/protocol combination of pickle which is deterministic for some set of types (e.g. can only use cPickle with protocol 0)?
Option 2 - Repr and ast.literal_eval
Another option is to use repr to dump and ast.literal_eval to load. I have written a function to determine if a given key would survive this process (it is rather conservative on the types it allows):
def is_reprable_key(key):
    return type(key) in (int, str, unicode) or (type(key) == tuple and all(
        is_reprable_key(x) for x in key))
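A quick round-trip check of the scheme (Python 2 here, since the function references unicode):
import ast

key = (1, 'abc', (2, u'd\xe9f'))
assert is_reprable_key(key)
assert ast.literal_eval(repr(key)) == key  # repr/literal_eval round-trips these types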
The question for this method is whether repr itself is deterministic for the types that I have allowed here. I believe this would not survive the 2/3 version barrier due to the change in str/unicode literals. This also would not work for integers where 2**32 - 1 < x < 2**64 when jumping between 32- and 64-bit platforms. Are there any other conditions (i.e. do strings serialize differently under different conditions in the same interpreter)? Edit: I'm just trying to understand the conditions under which this breaks down, not necessarily overcome them.
Option 3: Custom repr
Another option which is likely overkill is to write my own repr which flattens out the things of repr which I know (or suspect may be) a problem. I just wrote an example here: http://gist.github.com/423945
(If this all fails miserably then I can store the hash of the key along with the pickle of both the key and value, then iterate across rows that have a matching hash looking for one that unpickles to the expected key, but that really does complicate a few other things and I would rather not do it. Edit: it turns out that the builtin hash is not deterministic across platforms either. Scratch that.)
Any insights?
Important note: repr() is not deterministic if a dictionary or set type is embedded in the object you are trying to serialize. The keys could be printed in any order.
For example print repr({'a':1, 'b':2}) might print out as {'a':1, 'b':2} or {'b':2, 'a':1}, depending on how Python decides to manage the keys in the dictionary.
After reading through much of the source (of CPython 2.6.5) for the implementation of repr for the basic types I have concluded (with reasonable confidence) that repr of these types is, in fact, deterministic. But, frankly, this was expected.
I believe that the repr method is susceptible to nearly all of the same cases under which the marshal method would break down (longs > 2**32 can never be an int on a 32bit machine, not guaranteed to not change between versions or interpreters, etc.).
My solution for the time being has been to use the repr method and write a comprehensive test suite to make sure that repr returns the same values on the various platforms I am using.
In the long run the custom repr function would flatten out all platform/implementation differences, but is certainly overkill for the project at hand. I may do this in the future, however.
"Any value which is an acceptable key for a builtin dict" is not feasible: such values include arbitrary instances of classes that don't define __hash__ or comparisons, implicitly using their id for hashing and comparison purposes, and the ids won't be the same even across runs of the very same program (unless those runs are strictly identical in all respects, which is very tricky to arrange -- identical inputs, identical starting times, absolutely identical environment, etc, etc).
For strings, unicodes, ints, and tuples whose items are all of these kinds (including nested tuples), the marshal module could help (within a single version of Python: marshaling code can and does change across versions). E.g.:
>>> marshal.dumps(23)
'i\x17\x00\x00\x00'
>>> marshal.dumps('23')
't\x02\x00\x00\x0023'
>>> marshal.dumps(u'23')
'u\x02\x00\x00\x0023'
>>> marshal.dumps((23,))
'(\x01\x00\x00\x00i\x17\x00\x00\x00'
This is Python 2; Python 3 would be similar (except that all the representation of these byte strings would have a leading b, but that's a cosmetic issue, and of course u'23' becomes invalid syntax and '23' becomes a Unicode string). You can see the general idea: a leading byte represents the type, such as u for Unicode strings, i for integers, ( for tuples; then for containers comes (as a little-endian integer) the number of items followed by the items themselves, and integers are serialized into a little-endian form. marshal is designed to be portable across platforms (for a given version; not across versions) because it's used as the underlying serializations in compiled bytecode files (.pyc or .pyo).
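A round-trip sketch (stable within a single Python version, per the caveat above):
import marshal

key = (23, 'x', (1, 2))
blob = marshal.dumps(key)
assert marshal.loads(blob) == key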
You mention a few requirements in the paragraph, and I think you might want to be a little more clear on these. So far I gather:
You're building an SQLite backend to basically a dictionary.
You want to allow the keys to be more than the basestring type (which types?).
You want it to survive the Python 2 -> Python 3 barrier.
You want to support large integers above 2**32 as the key.
Ability to store infinite values (because you don't want hash collisions).
So, are you trying to build a general 'this can do it all' solution, or just trying to solve an immediate problem to continue on within a current project? You should spend a little more time to come up with a clear set of requirements.
Using a hash seemed like the best solution to me, but then you complain that you're going to have multiple rows with the same hash, implying you're going to be storing enough values to even worry about collisions.
