Why do Python 2 and Python 3 treat the same Windows directory differently?

My Windows display language is Chinese.
To illustrate my point, I use the package pathlib:
from pathlib import *
rootdir = Path(r'D:\新建文件夹')
print(rootdir.exists())
On Python 2.7 I get False; on Python 3 I get True.
Any ideas? Thanks for any advice.
(For Python 2.7, you can install pathlib with "pip install pathlib".)

In Python 3, strings are Unicode by default. In Python 2, they are byte strings encoded in the source file encoding. Use a Unicode string in Python 2.
Also declare the source file encoding, and make sure the file is actually saved in that encoding:
# coding: utf-8
from pathlib import *
rootdir = Path(ur'D:\新建文件夹')
print(rootdir.exists())

The main difference between Python 2 and Python 3 lies in the basic types that exist to deal with text and bytes. In Python 3 we have one text type, str, which holds Unicode data, and two byte types, bytes and bytearray.
In Python 2, on the other hand, we have two text types: str, which for all intents and purposes is limited to ASCII plus some undefined data above the 7-bit range, and unicode, which is equivalent to the Python 3 str type; and we have one byte type, bytearray, which was backported from Python 3.
Python 3 removed all codecs that don't go from bytes to Unicode or vice versa, and removed the now-useless .encode() method on bytes and .decode() method on strings.
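A minimal sketch of the type split, runnable on either version (on Python 2 the print shows a tuple):
s = 'abc'   # Python 2: str (bytes); Python 3: str (Unicode text)
u = u'abc'  # Python 2: unicode;     Python 3: str (the u prefix is a no-op)
b = b'abc'  # Python 2: str (the b prefix is a no-op); Python 3: bytes
print(type(s), type(u), type(b))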

Use Unicode literals for Windows paths: add from __future__ import unicode_literals at the top.
Explanation
r'D:\新建文件夹' is a bytestring on Python 2. Its specific value depends on the encoding declaration at the top of the file (such as # -*- coding: utf-8 -*-); without that declaration, a non-ASCII literal is an error in Python 2. On Python 3, r'D:\新建文件夹' is a Unicode string, and the default source encoding is utf-8, so no declaration is required.
Python uses the Unicode Windows API when working with files if the input is Unicode, and the "ANSI" API if the input is bytes.
This matters in two cases. If the source encoding differs from the "ANSI" encoding (such as cp1252), the result may differ, because the bytes are passed as-is and the same byte sequence can represent different characters in different encodings. And if the filename can't be represented in the "ANSI" encoding at all (a single-byte encoding such as cp1252 can represent only 256 values, while there are around a million Unicode code points), the call can't produce the right name. Using Unicode strings for filenames on Windows fixes both issues.
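Putting that advice together, a sketch of the fixed script from the question (pathlib on Python 2 comes from pip install pathlib):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # all literals below are now unicode
from pathlib import Path

rootdir = Path(r'D:\新建文件夹')  # a Unicode string on Python 2 and 3 alike
print(rootdir.exists())          # True on both, given the folder exists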

Related

Is there a way to specify which Unicode format is used when encoding in Python 2.7?

I'd like to encode some values as Unicode in my Python 2.7 script, and I'd like to know if I can specify which encoding to use, i.e. UTF-8 vs UTF-32. Apart from that, are there any limitations as to which encodings are supported in Python 2.7, and how is the default encoding determined?
So, first things first: you should be using Python 3, not Python 2.
The handling of text and Unicode is the major difference between the two versions of the language, and the real reason they had to make incompatible changes; it is much, much more straightforward in Python 3.
This means that to talk about Unicode in Python 2 you have to understand certain things: unicode is used to represent text, i.e. characters, regardless of the underlying representation those characters have.
In Python 2 programs, all text typed in the program itself has to be typed as "u"-prefixed strings, like u"..." or u'...'; otherwise the strings are considered "byte strings", just like the ones in C code. (Alternatively, one can place from __future__ import unicode_literals in the first or second line of the file, so this is done automatically.)
Otherwise, all data read into the program, whether from text files, database connections, or inbound HTTP requests, will usually arrive as byte strings in Python 2 and has to be explicitly converted to text strings (that is, "unicode objects" in Python 2 speak) before being processed. This is done by calling the byte string's .decode method, passing as the first parameter the name of the encoding used for those bytes. So if you have data read from a utf-8 encoded file, it can be decoded to text by doing:
data = data.decode("utf-8")  # and so on for other encodings
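For instance, a hypothetical sketch of the whole read-then-decode step (the filename is a placeholder):
with open("input.txt", "rb") as f:  # binary mode: read raw bytes
    raw = f.read()                  # a byte string (str in Python 2)
text = raw.decode("utf-8")          # now a unicode object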
Also, if you type any non-ASCII character in the source code of a Python 2 file, whether inside a string or not (inside a comment, for example), you have to declare the file encoding in the first line of the file.
That is done with a Python comment that is treated in a special way by the language parser; the first line of code should contain:
# encoding: utf-8
(Of course, you should write the encoding actually used by your editor to store the file. Also, some variants of this marker are allowed, such as writing "coding" instead of "encoding"; the ":" is optional, and so on.)
Everything I've described in the previous paragraphs takes place automatically in Python 3. But if you have followed along so far, you now have a program running with text to be handled. (You did not mention in your question how you are inputting the text you want to encode in different ways.)
So, just as you explicitly converted the input bytes to in-memory Unicode strings, you can now use the .encode method to convert the text back to whatever text encoding you want.
If you have some text that you want to write to a text file encoded in utf-32 little-endian, you do:
with open("myfile.txt", "wb") as file_:  # binary mode, since we write encoded bytes
    file_.write(data.encode("utf-32-le"))
The valid text codecs are listed, as per Eran's answer at:
https://docs.python.org/2/library/codecs.html#standard-encodings
Now, if you do some tests with this and succeed, you'd better do two things before proceeding any further:
Switch to Python 3. Python 2 is truly obsolete at this point; check whether it is already installed on your system by typing "python3" instead of just "python". If it is not, just install it; it can live side by side with Python 2.
Read this article to get a grasp of what really goes on when we talk about Unicode and encodings. (The author, Joel, is the founder of Stack Overflow itself, and the article is from 2003.)
In Python 2, strings are by default ASCII. You can decode them and re-encode them.
Supported encodings can be found here: https://docs.python.org/2/library/codecs.html#standard-encodings
Here's an example:
a = "my string"         # a is ASCII-encoded bytes
b = u"my string"        # b is unicode, not encoded
c = a.decode()          # c is unicode; decoding defaults to ASCII, specify otherwise as an argument
d = c.encode('utf-32')  # d is utf-32-encoded bytes
print type(a)  # output: <type 'str'>
print type(b)  # output: <type 'unicode'>
print type(c)  # output: <type 'unicode'>
print type(d)  # output: <type 'str'>
Note 1: in Python 3 things are somewhat different.
Note 2: in order to write non-ASCII literals in your script (that is, if you want to write a = "☂" as part of your code, as opposed to a variable that merely contains data you got from somewhere), you have to declare the encoding at the top of the file. Note also that in Python 2, identifiers must be ASCII; only string literals and comments may contain arbitrary characters once the encoding is declared (in memory, of course, you are not limited).
Note 3: of course, while the unicode type appears "not encoded" to you, internally Python keeps it in a concrete representation (UCS-2 or UCS-4 in Python 2, depending on build options). But that's an internal detail that, generally speaking, shouldn't affect your code.

Python default string encoding

When, where and how does Python implicitly apply encodings to strings, or perform implicit transcoding (conversion)?
And what are those "default" (i.e., implied) encodings?
For example, what are the encodings:
of string literals?
s = "Byte string with national characters"
us = u"Unicode string with national characters"
of byte strings when type-converted to and from Unicode?
data = unicode(random_byte_string)
when byte- and Unicode strings are written to/from a file or a terminal?
print(open("The full text of War and Peace.txt").read())
There are multiple parts of Python's functionality involved here: reading the source code and parsing the string literals, transcoding, and printing. Each has its own conventions.
Short answer:
For the purpose of code parsing:
str (Py2) -- not applicable, raw bytes from the file are taken
unicode (Py2)/str (Py3) -- "source encoding", defaults are ascii (Py2) and utf-8 (Py3)
bytes (Py3) -- none, non-ASCII characters are prohibited in the literal
For the purpose of transcoding:
both (Py2) -- sys.getdefaultencoding() (ascii almost always)
there are implicit conversions which often result in a UnicodeDecodeError/UnicodeEncodeError
both (Py3) -- none; conversions must be requested explicitly via encode()/decode()
For the purpose of I/O:
unicode (Py2) -- <file>.encoding if set, otherwise sys.getdefaultencoding()
str (Py2) -- not applicable, raw bytes are written
str (Py3) -- <file>.encoding, always set and defaults to locale.getpreferredencoding()
bytes (Py3) -- none, printing produces its repr() instead
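A small sketch for inspecting these defaults on Python 3 (the values shown are typical, not guaranteed):
import sys, locale
print(sys.getdefaultencoding())            # 'utf-8'
print(locale.getpreferredencoding(False))  # default for text-mode files
print(sys.stdout.encoding)                 # the stdout stream's encoding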
First of all, some terminology clarification so that you understand the rest correctly. Decoding is translation from bytes to characters (Unicode or otherwise), and encoding (as a process) is the reverse. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software to get the distinction.
Now...
Reading the source and parsing string literals
At the start of a source file, you can specify the file's "source encoding" (its exact effect is described later). If not specified, the default is ascii for Python 2 and utf-8 for Python 3. A UTF-8 BOM has the same effect as a utf-8 encoding declaration.
Python 2
Python 2 reads the source as raw bytes. It only uses the "source encoding" to parse a Unicode literal when it sees one. (It's more complicated than that under the hood, but this is the net effect.)
> type t.py
# Encoding: cp1251
s = "абвгд"
us = u"абвгд"
print repr(s), repr(us)
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0430\u0431\u0432\u0433\u0434'
<change encoding declaration in the file to cp866, do not change the contents>
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0440\u0441\u0442\u0443\u0444'
<transcode the file to utf-8, update declaration or replace with BOM>
> py -2 t.py
'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4' u'\u0430\u0431\u0432\u0433\u0434'
So, regular strings will contain the exact bytes that are in the file. And Unicode strings will contain the result of decoding the file's bytes with the "source encoding".
If the decoding fails, you will get a SyntaxError. The same happens if there is a non-ASCII character in the file when no encoding is specified. Finally, if the unicode_literals future import is used, any regular string literals (in that file only) are treated as Unicode literals when parsing, with everything that implies.
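For example, a file like this fails to parse on Python 2 when no encoding is declared (assuming the file is saved as UTF-8; error message abridged):
s = "абвгд"
# SyntaxError: Non-ASCII character '\xd0' in file t.py on line 1,
# but no encoding declared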
Python 3
Python 3 decodes the entire source file with the "source encoding" into a sequence of Unicode characters. Any parsing is done after that. (In particular, this makes it possible to have Unicode in identifiers.) Since all string literals are now Unicode, no additional transcoding is needed. In byte literals, non-ASCII characters are prohibited (such bytes must be specified with escape sequences), evading the issue altogether.
Transcoding
As per the clarification at the start:
str (Py2)/bytes (Py3) -- bytes => can only be decoded (directly, that is; details follow)
unicode (Py2)/str (Py3) -- characters => can only be encoded
Python 2
In both cases, if the encoding is not specified, sys.getdefaultencoding() is used. It is ascii (unless you uncomment a code chunk in site.py, or do some other hacks which are a recipe for disaster). So, for the purpose of transcoding, sys.getdefaultencoding() is the "string's default encoding".
Now, here's a caveat:
a decode() and encode() -- with the default encoding -- is done implicitly when converting str<->unicode:
in string formatting (a third of UnicodeDecodeError/UnicodeEncodeError questions on Stack Overflow are about this)
when trying to encode() a str or decode() a unicode (the second third of the Stack Overflow questions)
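Both flavors sketched; each line raises UnicodeDecodeError on a stock Python 2, because the default codec is ascii:
u"label: %s" % "\xe4"   # formatting mixes unicode and str, triggering an implicit decode
"\xe4".encode("utf-8")  # encode() on a str first decodes it with the default codec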
Python 3
There's no "default encoding" at all: implicit conversion between str and bytes is now prohibited.
bytes can only be decoded and str -- encoded, and the encoding argument is mandatory.
converting bytes->str (incl. implicitly) produces its repr() instead (which is only useful for debug printing), evading the encoding issue entirely
converting str->bytes is prohibited
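Sketched on Python 3:
b = "Führerschein".encode("utf-8")  # str -> bytes, explicit
s = b.decode("utf-8")               # bytes -> str, explicit
# "abc" + b"abc"  raises TypeError: can only concatenate str (not "bytes") to str
print(str(b"abc"))                  # prints b'abc' -- the repr, not decoded text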
Printing
This matter is unrelated to a variable's value but related to what you would see on the screen when it's printed -- and whether you will get a UnicodeEncodeError when printing.
Python 2
A unicode is encoded with <file>.encoding if set; otherwise, it is implicitly converted to str as described above. (The final third of the UnicodeEncodeError questions on Stack Overflow fall here.)
For standard streams, the stream's encoding is guessed at startup from various environment-specific sources, and can be overridden with the PYTHONIOENCODING environment variable.
str's bytes are sent to the OS stream as-is. What specific glyphs you will see on the screen depends on your terminal's encoding settings (if it's something like UTF-8, you may see nothing at all if you print a byte sequence that is invalid UTF-8).
Python 3
The changes are:
Files opened in text vs. binary mode now natively accept str or bytes respectively, and outright refuse to process the wrong type. Text-mode files always have an encoding set, with locale.getpreferredencoding(False) being the default.
print for text streams still implicitly converts everything to str, which in the case of bytes prints its repr() as per the above, evading the encoding issue altogether.
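A sketch of both behaviors on Python 3 (the filenames are placeholders):
with open("out.txt", "w", encoding="utf-8") as f:  # text mode accepts only str
    f.write("Führerschein nötig")
with open("out.bin", "wb") as f:                   # binary mode accepts only bytes
    f.write("Führerschein nötig".encode("utf-8"))
print(b"caf\xc3\xa9")  # prints b'caf\xc3\xa9' -- bytes show their repr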
Regarding the implicit encoding used as the internal format to store strings: you should not care about it. Python decodes characters into an internal representation of its own, and this is mostly transparent; just think of it as Unicode text, or as a sequence of bytes, in an abstract way.
The internal coding in Python 3.3+ varies according to the "largest" character in the string: it is one of Latin-1 (one byte per character), UCS-2 or UCS-4. When you are using strings, it is as if you had abstract Unicode text, not any concrete encoding. Unless you program in C or use some special functions (memory views), you will never see the internal encoding.
Bytes are just a view of actual memory; Python interprets them as unsigned chars. But again, you should usually think about what the sequence means, not about the internal encoding.
Python 2 stores bytes/str as unsigned chars, and unicode as UCS-2 on the common narrow builds (UCS-4 on wide builds), so a code point above 65535 is stored as two code units (a surrogate pair) in Python 2, but as just one character in Python 3.
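That last difference is easy to observe with any code point above U+FFFF:
print(len(u'\U0001F600'))  # 2 on a narrow Python 2 build, 1 on Python 3.3+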

Python encoding in vim

Trying to understand the encoding/decoding/Unicode business in Python 2.7 with vim.
I have a unicode string us to which I assign the Unicode string u'é'.
Question 1
How is us represented in memory? Is it a sequence of 32-bit integers holding the Unicode code points that \u escapes denote? Or is it kept in memory as a sequence of 8-bit values (\x escapes) in some default encoding?
Question 2
I see four different places where an encoding can be set for the unicode string us: #1 at the beginning of the test.py file; #2 as an argument of the encode function; #3 as an argument for vim; #4 as the locale encoding of the file system. So, what do these four encodings (#1, #2, #3, #4) do?
$ vim test.py
_____________
#encoding: #1
us=u'é'
print us.encode(encoding='#2')
_____________
:set encoding=#3
$ locale | grep LANG
LANG=en_US.#4
LANGUAGE=
In Python 2.x, unicode strings are stored as either UCS-2 or UCS-4, depending on the options used when building the interpreter.
#1 is the source encoding as far as Python is concerned.
#2 is the encoding used to encode us as bytes when the code is executed.
#3 is the source encoding as far as vim is concerned. If this doesn't match #1, expect trouble.
#4 is the system encoding. It mostly affects filesystem and terminal output operations.
Question 1 - Storage
us = u'é'
This creates a Unicode string with the value é. In Python 2.2+, Unicode strings are stored as UCS-2 or UCS-4, which use 2- or 4-byte unsigned integers, depending on a build-time option.
Python 3.3+ uses a flexible internal representation (PEP 393). The storage of a Unicode string now depends on the highest code point in the string:
pure ASCII and Latin-1 strings (U+0000-U+00FF) use 1 byte per code point;
BMP strings (up to U+FFFF) use 2 bytes per code point;
strings containing characters from the other planes (U+10000-U+10FFFF) use 4 bytes per code point.
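A sketch that makes the flexible storage visible on CPython 3.3+ (sys.getsizeof includes a fixed object header, so compare the growth, not the absolute values):
import sys
print(sys.getsizeof('a' * 100))           # grows ~1 byte per character
print(sys.getsizeof('\u0430' * 100))      # grows ~2 bytes per character
print(sys.getsizeof('\U0001F600' * 100))  # grows ~4 bytes per character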
Question 2 - Encoding
us=u'é'
Declares us to be a Unicode string, stored as above. Note that in Python 3 all strings are Unicode by default, so the u can be omitted.
print(us.encode('ascii', 'strict'))  # encoding='#2'
Tells print how to attempt to translate the Unicode string for output. Note that if you are using Python 3.3+ and a Unicode-capable terminal/console, you probably never need to use this.
:set encoding=#3
Tells vim, emacs and a number of other editors the encoding to use when displaying and/or editing the file. This applies to all text files, not just Python.
$ locale | grep LANG
LANG=en_US.#4
Is an operating system setting for the locale language that tells it how to display various things, specifically which code page to use when displaying extended characters.
This doesn't directly answer the question, but I hope it gives some more insight into the problem.
Answer to Question 1: it shouldn't matter to the programmer how Unicode strings are represented internally in Python.
To Question 2:
All the programmer should care about is that the encoding requirements of the data source and the data sink are known and correctly specified. I would assume that Python can correctly interpret UTF-encoded files by reading the BOM, and maybe even by making educated guesses, but without the BOM it can be ambiguous how to handle bytes with the high bit set, so it's advisable either to make sure the BOM is there or to tell Python that the file is UTF-8 encoded if you're not sure.
There's a difference between "Unicode" and "UTF" that seems to be glossed over above: "UTF" specifies the representation in storage (disk, memory, network packet), whereas "Unicode" is simply the fact that each character has a single value (code point) ranging from 0 to 0x10FFFF. The various flavors of UTF encode that value into the appropriate storage. Working with encoded strings can be annoying (as the character width is variable), so when strings are actually represented in memory it is often easier to expand them into some format that allows for easy manipulation.
If you want a Unicode string in Python pre-3, just type u'<whatever>'; in 3+, type '<whatever>'. You'll get Unicode, and you can use \uXXXX and \UXXXXXXXX escapes if it's infeasible to type the characters in directly. When you want to write the data out, specify the encoding. UTF-8 is often the easiest to deal with and seems to be the most commonly used, but you may have reason to use a UTF-16 flavor.
The takeaway here is that an encoding is just a way to transform Unicode data so that it can be persisted. The various flavors of UTF are just encodings; they are not themselves Unicode.

Python 2: Comparing a unicode and a str

This topic is already on Stack Overflow, but I didn't find any satisfying solution:
I have some strings in Unicode coming from a server, and I have some hardcoded strings in the code which I'd like to match against. I do understand why I can't just use ==, but I do not succeed in converting them properly (I don't care whether I have to do str -> unicode or unicode -> str).
I tried encode and decode, but they didn't give the expected result.
Here is what I receive...
fromServer = {unicode} u'Führerschein nötig'
fromCode = {str} 'Führerschein nötig'
(as you can see, it is German!)
How can I make them compare equal in Python 2?
First make sure you declare the encoding of your Python source file at the top of the file, e.g. if your file is encoded as latin-1:
# -*- coding: latin-1 -*-
And second, always store text as Unicode strings:
fromCode = u'Führerschein nötig'
If you get bytes from somewhere, convert them to Unicode with str.decode before working with the text. For text files, specify the encoding when opening the file, e.g.:
# use codecs.open to open a text file
f = codecs.open('unicode.rst', encoding='utf-8')
Code that compares byte strings with Unicode strings will often fail seemingly at random, depending on system settings or whatever encoding happens to be used for a text file. Don't rely on it; always make sure you compare either two Unicode strings or two byte strings.
Python 3 changed this behaviour: it will not try to convert between the types. 'a' and b'a' are considered objects of different types, and comparing them always returns False.
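A minimal sketch of the comparison done right on Python 2 (assuming the file is saved and declared as utf-8):
# -*- coding: utf-8 -*-
fromServer = u'F\xfchrerschein n\xf6tig'  # what the server hands over, already unicode
fromCode = u'Führerschein nötig'          # hardcoded as a Unicode literal
print(fromServer == fromCode)             # True: unicode compared with unicode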
Tested on 2.7.
For German umlauts, latin-1 can be used, assuming the byte literal really is latin-1 encoded (i.e. the source file is saved in latin-1):
if 'Führerschein nötig'.decode('latin-1') == u'Führerschein nötig':
    print('yes....')
yes....

Why doesn't the Python interpreter use the file's coding declaration for decoding?

The code below will cause a UnicodeDecodeError:
#-*- coding:utf-8 -*-
s="中文"
u=u"123"
u=s+u
I know it's because the Python interpreter is using ascii to decode s.
Why doesn't the Python interpreter use the file's encoding (utf-8) for decoding?
Implicit decoding cannot know what source encoding was used. That information is not stored with strings.
All that Python has after parsing is a byte string with characters representing bytes in the range 0-255. You could have imported that string from another module, or read it from a file object, etc. The fact that the parser knew what encoding was used for those bytes doesn't even matter for plain byte strings.
As such, it is always better to decode bytes explicitly rather than rely on implicit decoding. Either use a Unicode literal for s as well, or decode explicitly using str.decode():
u = s.decode('utf8') + u
The types of the two strings are different: the first is a normal (byte) string, the second is a Unicode string, hence the error.
So, instead of doing s="中文", do the following to get Unicode strings for both:
s=u"中文"
u=u"123"
u=s+u
The code works perfectly fine on Python 3.
However, in Python 2, if you do not add a u before a string literal, you are constructing a string of bytes. When one wants to combine a string of bytes and a string of characters, one either has to decode the string of bytes or encode the string of characters. Python 2.x opted for the former. In order to prevent accidents (for example, someone appending binary data to user input and thus generating garbage), the Python developers chose ascii as the encoding for that conversion.
You can add the line
from __future__ import unicode_literals
after the # coding declaration so that literals without u or b prefixes are always character literals, not byte literals.
