Does Python support Unicode beyond the Basic Multilingual Plane?

Below is a simple test. repr seems to work fine, yet len and [x for x in ...] don't seem to divide the unicode text correctly in Python 2.6 and 2.7:
In [1]: u"爨爵"
Out[1]: u'\U0002f920\U0002f921'
In [2]: [x for x in u"爨爵"]
Out[2]: [u'\ud87e', u'\udd20', u'\ud87e', u'\udd21']
Good news is Python 3.3 does the right thing ™.
Is there any hope for Python 2.x series?

Yes, provided you compiled your Python with wide-unicode support.
By default, Python is built with narrow unicode support only. Enable wide support with:
./configure --enable-unicode=ucs4
You can verify what configuration was used by testing sys.maxunicode:
import sys
if sys.maxunicode == 0x10FFFF:
    print 'Python built with UCS4 (wide unicode) support'
else:
    print 'Python built with UCS2 (narrow unicode) support'
A wide build will use UCS4 characters for all unicode values, doubling memory usage for these. Python 3.3 switched to variable width values; only enough bytes are used to represent all characters in the current value.
Quick demo showing that a wide build handles your sample Unicode string correctly:
$ python2.6
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> [x for x in u'\U0002f920\U0002f921']
[u'\U0002f920', u'\U0002f921']

Python 3.6 variable annotations and numeric literals

In the Python documentation, the section "What's New in Python 3.6" presents, among other things, variable annotations and underscores in numeric literals.
However, I tried the examples shown and not all of them worked.
Are these examples incomplete, and do they require some additional code that is assumed under the hood?
For example this statement
primes: List[int] = []
issues
NameError: name 'List' is not defined
This statement
print( 1_000_000_000_000_000 )
is also rejected with a syntax error.
The first case works if you first import List from typing. Most types used with type-hints aren't built-in, they need to be imported first.
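A minimal sketch of the working form (assuming Python 3.6+ and the standard typing module):

```python
from typing import List

# The annotation is recorded in the module's __annotations__ mapping;
# at runtime, primes is just an ordinary empty list.
primes: List[int] = []

assert primes == []
```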
The second case also works if you are running under 3.6. On my machine it correctly prints:
Python 3.6.2 | packaged by conda-forge | (default, Jul 23 2017, 22:59:30)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print( 1_000_000_000_000_000 )
1000000000000000
If the error message you receive is SyntaxError: invalid syntax, you're on 3.5 or earlier. If it's SyntaxError: invalid token, you're not using the underscores correctly. I'm guessing you're receiving the first.
So, you might want to double check you're running with 3.6 (python -V).
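For reference, a quick sketch of the PEP 515 placement rules: underscores are allowed between digits and directly after a base prefix, but not leading, trailing, or doubled:

```python
# Underscores are purely visual and don't change the value.
assert 1_000_000_000_000_000 == 10 ** 15
assert 0x_FF == 255              # allowed directly after the base prefix
assert 1_0.0_1e1_0 == 10.01e10   # floats and exponents work too

# These are rejected ("invalid token" on 3.6): 1__000, 1_000_,
# and _1000 is parsed as an identifier, not a literal.
```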

Python version Check (major attribute)

What does sys.version_info.major do?
I understand that sys.version_info returns the version of the interpreter we are using. What role does major play in this?
How does it work in the following code?
import sys
if sys.version_info.major < 3:
    # code here
else:
    # code here
Python 3.5.1 (v3.5.1:37a07cee5969, Dec 5 2015, 21:12:44)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=5, micro=1, releaselevel='final', serial=0)
major tells you whether it's Python 2 or Python 3. Code like this typically imports different libraries depending on the specific Python version.
I am using Python 3.5.1; you can see its version_info above.
Here's a trite example:
from __future__ import print_function
import sys
if sys.version_info.major < 3:
    input = raw_input  # Python 2's input() is roughly eval(raw_input())
else:
    pass  # code here
This makes Python 2.x behave a bit more like Python 3. The future import cannot be in the if suite because it must appear before other imports, but the change in semantics for the input() builtin between Python 2.x and Python 3 can be handled in a conditional.
For the most part it's a bad idea to go much beyond this in your efforts to support Python2 and Python3 in the same file.
The more challenging changes between Python2 and Python3 relate to strings as bytes vs. the newer strings as Unicode; and thus the need to explicitly specify encodings and use bytes objects for many networking protocols and APIs.
For example, I have Python 3.5.3 installed on my computer. sys_version stores the following information:
sys.version_info(major=3, minor=5, micro=3, releaselevel='final', serial=0)
The reason version is split into major, minor and micro is that programmers writing code for multiple versions of Python may need to differentiate between them somehow. Most differences are between major versions 2 and 3 of Python and that explains the context of major usage in your code.
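Since sys.version_info is a named tuple, it also compares element-wise against plain tuples, which is the usual idiom when the minor version matters too (a minimal sketch):

```python
import sys

# Attribute access and positional access are equivalent.
assert sys.version_info.major == sys.version_info[0]

# Element-wise comparison: True on 3.6+, False on 3.5 and below.
supports_f_strings = sys.version_info >= (3, 6)
print(supports_f_strings)
```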

List indexing efficiency (python 2 vs python 3)

In answering another question, I suggested to use timeit to test the difference between indexing a list with positive integers vs. negative integers. Here's the code:
import timeit
t=timeit.timeit('mylist[99]',setup='mylist=list(range(100))',number=10000000)
print (t)
t=timeit.timeit('mylist[-1]',setup='mylist=list(range(100))',number=10000000)
print (t)
I ran this code with python 2.6:
$ python2.6 test.py
0.587687015533
0.586369991302
Then I ran it with python 3.2:
$ python3.2 test.py
0.9212150573730469
1.0225799083709717
Then I scratched my head, did a little google searching and decided to post these observations here.
Operating system: OS-X (10.5.8) -- Intel Core2Duo
That seems like a pretty significant difference to me (a factor of over 1.5 difference). Does anyone have an idea why python3 is so much slower -- especially for such a common operation?
EDIT
I've run the same code on my Ubuntu Linux desktop (Intel i7) and achieved comparable results with python2.6 and python 3.2. It seems that this is an issue which is operating system (or processor) dependent (Other users are seeing the same behavior on Linux machines -- See comments).
EDIT 2
The startup banner was requested in one of the answers, so here goes:
Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
and:
Python 3.2 (r32:88452, Feb 20 2011, 10:19:59)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
UPDATE
I've just installed fresh versions of python2.7.3 and python3.2.3 from http://www.python.org/download/
In both cases, I took the
"Python x.x.3 Mac OS X 32-bit i386/PPC Installer (for Mac OS X 10.3 through 10.6 [2])"
since I am on OS X 10.5. Here are the new timings (which are reasonably consistent through multiple trials):
python 2.7
$python2.7 test.py
0.577006101608
0.590042829514
python 3.2.3
$python3.2 test.py
0.8882801532745361
1.034242868423462
This appears to be an artifact of some builds of Python 3.2. The best hypothesis at this point is that all 32-bit Intel builds have the slowdown, but no 64-bit ones do. Read on for further details.
You didn't run nearly enough tests to determine anything. Repeating your test a bunch of times, I got values ranging from 0.31 to 0.54 for the same test, which is a huge variation.
So, I ran your test with 10x the number, and repeat=10, using a bunch of different Python2 and Python3 installs. Throwing away the top and bottom results, averaging the other 8, and dividing by 10 (to get a number equivalent to your tests), here's what I saw:
1. 0.52/0.53 Lion 2.6
2. 0.49/0.50 Lion 2.7
3. 0.48/0.48 MacPorts 2.7
4. 0.39/0.49 MacPorts 3.2
5. 0.39/0.48 HomeBrew 3.2
So, it looks like 3.2 is actually slightly faster with [99], and about the same speed with [-1].
However, on a 10.5 machine, I got these results:
1. 0.98/1.02 MacPorts 2.6
2. 1.47/1.59 MacPorts 3.2
Back on the original (Lion) machine, I ran in 32-bit mode, and got this:
1. 0.50/0.48 Homebrew 2.7
2. 0.75/0.82 Homebrew 3.2
So, it seems like 32-bitness is what matters, and not Leopard vs. Lion, gcc 4.0 vs. gcc 4.2 or clang, hardware differences, etc. It would help to test 64-bit builds under Leopard, with different compilers, etc., but unfortunately my Leopard box is a first-gen Intel Mini (with a 32-bit Core Solo CPU), so I can't do that test.
As further circumstantial evidence, I ran a whole slew of other quick tests on the Lion box, and it looks like 32-bit 3.2 is ~50% slower than 2.x, while 64-bit 3.2 is maybe a little faster than 2.x. But if we really want to back that up, someone needs to pick and run a real benchmark suite.
Anyway, my best guess at this point is that when optimizing the 3.x branch, nobody put much effort into 32-bit i386 Mac builds. Which is actually a reasonable choice for them to have made.
Or, alternatively, they didn't even put much effort into 32-bit i386 period. That possibility might explain why the OP saw 2.x and 3.2 giving similar results on a linux box, while Otto Allmendinger saw 3.2 being similarly slower to 2.6 on a linux box. But since neither of them mentioned whether they were running 32-bit or 64-bit linux, it's hard to know whether that's relevant.
There are still lots of other different possibilities that we haven't ruled out, but this seems like the best one.
Here is code that illustrates at least part of the answer:
$ python
Python 2.7.3 (default, Apr 20 2012, 22:44:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> t=timeit.timeit('mylist[99]',setup='mylist=list(range(100))',number=50000000)
>>> print (t)
2.55517697334
>>> t=timeit.timeit('mylist[99L]',setup='mylist=list(range(100))',number=50000000)
>>> print (t)
3.89904499054
$ python3
Python 3.2.3 (default, May 3 2012, 15:54:42)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> t=timeit.timeit('mylist[99]',setup='mylist=list(range(100))',number=50000000)
>>> print (t)
3.9906489849090576
Python 3 does not have the old int type.
Python 3's range() is Python 2's xrange(). If you want to simulate Python 2's range() in Python 3 code, you have to use list(range(num)). The bigger num is, the bigger the difference you will observe with your original code.
Indexing should be independent of what is stored inside the list, as the list stores only references to the target objects. The references are untyped and all of the same kind, so the list type is technically a homogeneous data structure. Indexing means turning the index value into a start address plus an offset, and calculating the offset requires at most one subtraction. This is a very cheap extra operation compared with the other operations involved.
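The subtraction mentioned above is the normalization of a negative index; a small sketch of the equivalence:

```python
mylist = list(range(100))

# A negative index -k is resolved to len(mylist) - k before the lookup:
assert mylist[-1] == mylist[len(mylist) - 1] == 99
assert mylist[-100] == mylist[0] == 0
```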

Python sys.maxint, sys.maxunicode on Linux and windows

On 64-bit Debian Linux 6:
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxint
9223372036854775807
>>> sys.maxunicode
1114111
On 64-bit Windows 7:
Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxint
2147483647
>>> sys.maxunicode
65535
Both operating systems are 64-bit, yet they report different values for sys.maxunicode. According to Wikipedia there are 1,114,112 code points in Unicode. Is sys.maxunicode on Windows wrong?
And why do they have different sys.maxint?
I don't know what your question is, but sys.maxunicode is not wrong on Windows.
See the docs:
sys.maxunicode
An integer giving the largest supported code point for a Unicode character. The value of this depends on the configuration option that
specifies whether Unicode characters are stored as UCS-2 or UCS-4.
Python on Windows uses UCS-2, so the largest code point is 65,535 (and the supplementary-plane characters are encoded by 2*16 bit "surrogate pairs").
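To illustrate, here is a sketch of the UTF-16 surrogate-pair arithmetic a narrow build applies to a supplementary-plane code point (the function name is just for illustration):

```python
def to_surrogate_pair(cp):
    # Split a supplementary-plane code point (cp > 0xFFFF) into a
    # UTF-16 high/low surrogate pair.
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return high, low

# U+2F920 from the first question becomes the u'\ud87e', u'\udd20'
# pair that a narrow build iterates over.
assert to_surrogate_pair(0x2F920) == (0xD87E, 0xDD20)
```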
About sys.maxint: it marks the point at which Python 2 switches from plain integers (123) to long integers (12345678987654321L). Evidently Python for Windows uses 32 bits and Python for Linux uses 64 bits. Since Python 3 this has become irrelevant, because the plain and long integer types have been merged into one; sys.maxint is therefore gone from Python 3.
Regarding the difference in sys.maxint, see What is the bit size of long on 64-bit Windows?. Python 2.x uses the C long type internally to store plain integers.

Python __future__ outside of a specific module

In python 2.7, by using
from __future__ import division, print_function
I can now have print(1/2) showing 0.5.
However is it possible to have this automatically imported at python startup ?
I tried to use the sitecustomize.py special module, but the import is only valid inside that module and not in the shell.
As I'm sure people will ask why I need this: while teaching Python to teenagers I noticed that integer division was not easy for them, so we decided to switch to Python 3. However, one of the requirements of the course was to be able to plot functions, and Matplotlib is pretty good but only works on Python 2.7.
So my idea was to use a custom 2.7 installation...not perfect but I don't have a better idea to have both Matplotlib and the new "natural" division "1/2=0.5".
Any advice or maybe a Matplotlib alternative that is working on python 3.2 ?
matplotlib on python 3 is closer than you may think: https://github.com/matplotlib/matplotlib-py3; http://www.lfd.uci.edu/~gohlke/pythonlibs/#matplotlib.
Why not use PYTHONSTARTUP instead of sitecustomize.py?
localhost-2:~ $ cat startup.py
from __future__ import print_function
from __future__ import division
localhost-2:~ $ export PYTHONSTARTUP=""
localhost-2:~ $ python
Python 2.7.2 (v2.7.2:8527427914a2, Jun 11 2011, 15:22:34)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 1/2
0
>>> print("fred",end=",")
File "<stdin>", line 1
print("fred",end=",")
^
SyntaxError: invalid syntax
>>> ^D
localhost-2:~ $ export PYTHONSTARTUP=startup.py
localhost-2:~ $ python
Python 2.7.2 (v2.7.2:8527427914a2, Jun 11 2011, 15:22:34)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 1/2
0.5
>>> print("fred",end=",")
fred,>>>
No need to compile a new version of Python 2.x. You can do this at start up.
As you found, sitecustomize.py does not work. This is because from __future__ import IDENTIFIER isn't a real import; it flags the module to be compiled under special rules. Any module that uses those features must have the __future__ import itself, and so must the interactive console.
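A small sketch of what the mechanism amounts to: each __future__ feature carries a compiler flag, and passing that flag to compile() enables the feature for just that chunk of code (on Python 3 true division is already the default, so the result is 0.5 either way; on Python 2 this flag is exactly what turns 1/2 from 0 into 0.5):

```python
import __future__

# __future__ features are compile-time flags, not runtime imports.
code = compile('1/2', '<demo>', 'eval', __future__.division.compiler_flag)
result = eval(code)
assert result == 0.5  # true division even on a Python 2 interpreter
```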
The following shell command will start the interactive console with division and print_function active:
python -ic "from __future__ import division, print_function"
You could alias to python (on linux) or set up a launcher to hide the extra stuff.
If you are using IDLE, the PYTHONSTARTUP script @DSM suggests should work there as well.
Note that these are not global throughout the interpreter, it only affects the interactive console. Modules on the file-system must import from __future__ explicitly to use the feature. If this is an issue, I suggest making a template to base work off of with all the needed imports:
# True division
from __future__ import division

# Modules
import matplotlib

# ... code ...

def main():
    pass

if __name__ == "__main__":
    main()
This may not be practical, but you may be able to compile a custom Python with the Python 3 division behavior backported. The problem with this is matplotlib might require the Python 2 behavior (although I'm not sure).