I'm having trouble getting the same result from the md5 digest() method in Python 3.6 as I got in Python 2.7.
Python 2.7:
import md5
encryption_base = 'cS35jJYp15kjQf01FVqA7ubRaNOXKPmYGRbLUiimX0g3frQhzOZBmTSni4IEjHLWYMMioGaliIz5z8u2:abcdefghkmnopqrstuvwxyz:4'
digest = md5.new(encryption_base).digest()
print(digest)
#T┼ǃ×ÞRK(M<¶┤# ²
Python 3.6:
from hashlib import md5
encryption_base = 'cS35jJYp15kjQf01FVqA7ubRaNOXKPmYGRbLUiimX0g3frQhzOZBmTSni4IEjHLWYMMioGaliIz5z8u2:abcdefghkmnopqrstuvwxyz:4'
digest = md5(encryption_base.encode()).digest()
print(digest)
#b'T\xc5\x80\x9f\x9e\xe8RK(M<\xf4\xb4#\t\xfd'
How can I get the same string as in the Python 2.7 result? .hexdigest() is not what I'm looking for here either.
You have the exact same result, a bytestring. The only difference is that in Python 3 printing a bytestring gives you a debugging-friendly representation, not the raw bytes. That's because the raw bytes are not necessarily printable and print() needs Unicode strings.
If you must have the same output, write the bytes directly to the stdout buffer, bypassing the Unicode TextIOWrapper() that takes care of encoding text to the underlying locale codec:
import sys
digest = md5(encryption_base.encode('ASCII')).digest()
sys.stdout.buffer.write(digest + b'\n')
Note that you must ensure that you define your encryption_base value as a bytes value too, or at least encode it to the same codec, ASCII, like I did above.
Defining it as a bytestring gives you the same value as in Python 2 without encoding:
encryption_base = b'cS35jJYp15kjQf01FVqA7ubRaNOXKPmYGRbLUiimX0g3frQhzOZBmTSni4IEjHLWYMMioGaliIz5z8u2:abcdefghkmnopqrstuvwxyz:4'
When you use str.encode() without explicitly setting an argument, you are encoding to UTF-8. If your encryption_base string only consists of ASCII codepoints, the result is the same, but not if it contains any Latin-1 or higher codepoints. Don't conflate bytes with Unicode codepoints! See https://nedbatchelder.com/text/unipain.html to fully understand the difference and how that difference applies to Python 2 and 3.
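As a quick sanity check, here is a minimal Python 3 sketch; the expected byte values are copied from the output shown above, and since the string is pure ASCII, encoding it as ASCII, Latin-1 or UTF-8 all give identical bytes:
from hashlib import md5
encryption_base = 'cS35jJYp15kjQf01FVqA7ubRaNOXKPmYGRbLUiimX0g3frQhzOZBmTSni4IEjHLWYMMioGaliIz5z8u2:abcdefghkmnopqrstuvwxyz:4'
digest = md5(encryption_base.encode('ascii')).digest()
# the bytes match the Python 2 result; only print() displays them differently
assert digest == b'T\xc5\x80\x9f\x9e\xe8RK(M<\xf4\xb4#\t\xfd'
print(digest.hex())
# 54c5809f9ee8524b284d3cf4b42309fd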
When, where and how does Python implicitly apply encodings to strings, or perform implicit transcoding (conversion)?
And what are those "default" (i.e., implied) encodings?
For example, what are the encodings:
of string literals?
s = "Byte string with national characters"
us = u"Unicode string with national characters"
of byte strings when type-converted to and from Unicode?
data = unicode(random_byte_string)
when byte and Unicode strings are written to or read from a file or a terminal?
print(open("The full text of War and Peace.txt").read())
There are multiple parts of Python's functionality involved here: reading the source code and parsing the string literals, transcoding, and printing. Each has its own conventions.
Short answer:
For the purpose of code parsing:
str (Py2) -- not applicable, raw bytes from the file are taken
unicode (Py2)/str (Py3) -- "source encoding", defaults are ascii (Py2) and utf-8 (Py3)
bytes (Py3) -- none, non-ASCII characters are prohibited in the literal
For the purpose of transcoding:
both (Py2) -- sys.getdefaultencoding() (ascii almost always)
there are implicit conversions which often result in a UnicodeDecodeError/UnicodeEncodeError
both (Py3) -- none; there is no implicit conversion, you must convert explicitly (and should name the encoding when you do)
For the purpose of I/O:
unicode (Py2) -- <file>.encoding if set, otherwise sys.getdefaultencoding()
str (Py2) -- not applicable, raw bytes are written
str (Py3) -- <file>.encoding, always set and defaults to locale.getpreferredencoding()
bytes (Py3) -- none, printing produces its repr() instead
First of all, some terminology clarification so that you understand the rest correctly. Decoding is translation from bytes to characters (Unicode or otherwise), and encoding (as a process) is the reverse. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software to get the distinction.
Now...
Reading the source and parsing string literals
At the start of a source file, you can specify the file's "source encoding" (its exact effect is described later). If not specified, the default is ascii for Python 2 and utf-8 for Python 3. A UTF-8 BOM has the same effect as a utf-8 encoding declaration.
Python 2
Python 2 reads the source as raw bytes. It only uses the "source encoding" to parse a Unicode literal when it sees one. (It's more complicated than that under the hood, but this is the net effect.)
> type t.py
# Encoding: cp1251
s = "абвгд"
us = u"абвгд"
print repr(s), repr(us)
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0430\u0431\u0432\u0433\u0434'
<change encoding declaration in the file to cp866, do not change the contents>
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0440\u0441\u0442\u0443\u0444'
<transcode the file to utf-8, update declaration or replace with BOM>
> py -2 t.py
'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4' u'\u0430\u0431\u0432\u0433\u0434'
So, regular strings will contain the exact bytes that are in the file. And Unicode strings will contain the result of decoding the file's bytes with the "source encoding".
If the decoding fails, you will get a SyntaxError. The same happens if there is a non-ASCII character in the file when no encoding is specified. Finally, if the unicode_literals future import is used, any regular string literals (in that file only) are parsed as Unicode literals, with everything that entails.
Python 3
Python 3 decodes the entire source file with the "source encoding" into a sequence of Unicode characters. Any parsing is done after that. (In particular, this makes it possible to have Unicode in identifiers.) Since all string literals are now Unicode, no additional transcoding is needed. In byte literals, non-ASCII characters are prohibited (such bytes must be specified with escape sequences), evading the issue altogether.
Transcoding
As per the clarification at the start:
str (Py2)/bytes (Py3) -- bytes => can only be decoded (directly, that is; details follow)
unicode (Py2)/str (Py3) -- characters => can only be encoded
Python 2
In both cases, if the encoding is not specified, sys.getdefaultencoding() is used. It is ascii (unless you uncomment a code chunk in site.py, or do some other hacks which are a recipe for disaster). So, for the purpose of transcoding, sys.getdefaultencoding() is the "string's default encoding".
Now, here's a caveat:
a decode() or encode() -- with the default encoding -- is done implicitly when converting str<->unicode (see the sketch after this list):
in string formatting (a third of UnicodeDecodeError/UnicodeEncodeError questions on Stack Overflow are about this)
when trying to encode() a str or decode() a unicode (the second third of the Stack Overflow questions)
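A minimal Python 2 sketch of these traps; the byte values are the UTF-8 encoding of 'é', used only for illustration, and each of the last two lines fails on its own:
s = 'caf\xc3\xa9'    # str (bytes), already UTF-8 encoded
u'menu: %s' % s      # string formatting: implicit s.decode('ascii') -> UnicodeDecodeError
s.encode('utf-8')    # encode() on a str: implicit ascii decode first -> UnicodeDecodeError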
Python 3
There's no "default encoding" at all: implicit conversion between str and bytes is now prohibited.
bytes can only be decoded and str -- only encoded; pass the encoding explicitly (the .decode()/.encode() methods default to UTF-8, the str()/bytes() constructors require it) -- see the sketch after this list
converting bytes->str (incl. implicitly) produces its repr() instead (which is only useful for debug printing), evading the encoding issue entirely
converting str->bytes is prohibited
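A minimal Python 3 sketch of these rules:
data = 'café'.encode('utf-8')   # str -> bytes: explicit encode
text = data.decode('utf-8')     # bytes -> str: explicit decode
print(str(data))                # prints b'caf\xc3\xa9' -- just the repr, no decoding
# bytes(text) would raise TypeError: string argument without an encoding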
Printing
This matter is unrelated to a variable's value but related to what you would see on the screen when it's printed -- and whether you will get a UnicodeEncodeError when printing.
Python 2
A unicode is encoded with <file>.encoding if set; otherwise, it's implicitly converted to str as per the above. (The final third of the UnicodeEncodeError SO questions fall into here.)
For standard streams, the stream's encoding is guessed at startup from various environment-specific sources, and can be overridden with the PYTHONIOENCODING environment variable.
str's bytes are sent to the OS stream as-is. What specific glyphs you will see on the screen depends on your terminal's encoding settings (if it's something like UTF-8, you may see nothing at all if you print a byte sequence that is invalid UTF-8).
Python 3
The changes are:
Files opened in text vs. binary mode now accept only str or bytes, respectively, and outright refuse to process the wrong type. Text-mode files always have an encoding set, with locale.getpreferredencoding(False) as the default.
print for text streams still implicitly converts everything to str, which for bytes prints its repr() as per the above, evading the encoding issue altogether (a small sketch of both points follows)
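A small Python 3 sketch of both points; the file names are just placeholders:
import sys
with open('out.txt', 'w', encoding='utf-8') as f:   # text mode: accepts str only
    f.write('абвгд')
with open('out.bin', 'wb') as f:                    # binary mode: accepts bytes only
    f.write('абвгд'.encode('utf-8'))
print(b'abc')                      # prints b'abc' -- the repr
sys.stdout.buffer.write(b'abc\n')  # writes the raw bytes instead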
Internal encoding, i.e. the format used to store strings in memory: you should not need to care about it. Python decodes your data into an internal representation, and that representation is mostly transparent. Just imagine that it is Unicode text, or a sequence of bytes, in an abstract way.
The internal representation in Python 3.x varies with the "widest" character in the string (PEP 393): it can be 1 byte per character (for strings in the Latin-1 range), 2 bytes (UCS-2) or 4 bytes (UCS-4). When you use strings, it is as if you had an abstract Unicode string, not any concrete encoding. Unless you program in C or use some special facilities (memoryview), you will never see the internal encoding.
Bytes are just a view of actual memory; Python interprets them as unsigned char. But again, you should usually think about what the sequence represents, not about its internal storage.
Python 2 stores str as unsigned char too, and unicode as UCS-2 on the common narrow builds (so a code point above 65535 is stored as two code units in Python 2, but as just one character in Python 3).
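A rough Python 3 illustration of the PEP 393 storage choice; the exact sizes vary by version and platform, so the relative growth is the point, not the absolute numbers:
import sys
ascii_s  = 'a' * 100              # fits in 1 byte per character
bmp_s    = '\u0430' * 100         # Cyrillic: needs 2 bytes per character
astral_s = '\U00010000' * 100     # outside the BMP: needs 4 bytes per character
print(sys.getsizeof(ascii_s), sys.getsizeof(bmp_s), sys.getsizeof(astral_s))
# the object size grows roughly as 1x, 2x, 4x the character count, plus overhead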
I implemented socket communication in Python 2, it worked well, and now I have to make it work again in Python 3. I have tried str.encode() with many different encodings, but the other side of the network can't recognize what I send. The only thing I know is that the Python 3 str type is Unicode and encoded as UTF-8 by default, and I'm pretty sure the critical question here is: what is the format of the Python 2 str type? I have to send exactly the same thing that was stored in the Python 2 str. The tricky part is that the Python 3 socket only sends encoded bytes (or another buffer interface), rather than a str holding raw data as in Python 2. The example is as follows:
In Python 2:
fulldata = 'AA060100B155'
datasplit = [fulldata[i: i+2] for i in range(0, len(fulldata), 2)]
senddata = ''
for item in datasplit:
    itemdec = chr(int(item, 16))
    senddata += itemdec
print(senddata)
# '\xaa\x06\x01\x00\xb1U', which is the data I need
In Python 3, it seems the socket can only send bytes produced with senddata.encode(), but that is not the format I want. You can try:
print(senddata.encode('latin-1'))
#b'\xaa\x06\x01\x01\xb2U'
to see the difference between the two senddata values; interestingly, the result comes out wrong when encoding with utf-8.
The data stored in the Python 3 str is exactly what I need, but my question is: how do I send that string's data without encoding it? Or how do I reproduce the Python 2 str type in Python 3?
Can anyone help me with this?
I implemented socket communication in Python 2, it worked well and I have to make it work in Python 3 again. I have tried str.encode() with many formats, but the other side of the network can't recognize what I send.
You have to make sure that whatever you send is decodable by the other side. The first step is to find out what encoding that network/file/socket uses. If, for instance, you send UTF-8-encoded data and the client expects ASCII, this will work only as long as the data is pure ASCII (UTF-8 is a superset of ASCII). But if, say, cp500 is your client's encoding scheme and you send the string encoded as UTF-8, it won't work. It's better to pass the name of your desired encoding explicitly to functions, because the default encoding of your platform may not necessarily be UTF-8. You can always check the default encoding with sys.getdefaultencoding().
The only thing I know is that the Python 3 str type is encoded as Unicode UTF-8 by default, and I'm pretty sure the critical question here is what the format of the Python 2 str type is. I have to send exactly the same thing that was stored in the Python 2 str. But the tricky thing is that the Python 3 socket only sends encoded Unicode bytes or another buffer interface, rather than a str holding raw data as in Python 2.
Yes, Python 3.X uses UTF-8 as the default encoding, but this is not guaranteed in every environment, so it's better to pass the desired encoding explicitly to avoid surprises. Note that str in Python 3.X is the equivalent of 2.X's unicode, while str in 2.X supports only 8-bit (1-byte, 0-255) characters.
On one hand, your problem is with 3.X and its type distinction between str and bytes. APIs that expect bytes won't accept str in 3.X. This is unlike 2.X, where you can mix unicode and str freely. The distinction in 3.X makes sense: str represents decoded text, whereas bytes represents encoded strings as raw bytes with absolute byte values.
On the other hand, you have the problem of choosing the right encoding for the text you pass to the client in 3.X. First, check what encoding your client uses. Second, send the string encoded with that same scheme so the client can decode it properly: str.encode('same-encoding-as-client').
Because you pass your data as str in 2.X and it works, your client most likely uses an 8-bit encoding for characters; something like Latin-1 might be what it expects.
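Assuming that, here is a minimal Python 3 sketch that reproduces the exact bytes the Python 2 code sent; the expected value is taken from the question:
fulldata = 'AA060100B155'
# rebuild the Python 2 str: one character per byte value
senddata = ''.join(chr(int(fulldata[i:i+2], 16)) for i in range(0, len(fulldata), 2))
# latin-1 maps code points 0-255 one-to-one onto bytes 0-255
assert senddata.encode('latin-1') == b'\xaa\x06\x01\x00\xb1U'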
You can convert the whole string to an integer, then use the integer method to_bytes to convert it into a bytes object:
fulldata = 'AA060100B155'
senddata = int(fulldata, 16).to_bytes(len(fulldata)//2, byteorder='big')
print(senddata)
# b'\xaa\x06\x01\x00\xb1U'
The first parameter of to_bytes is the number of bytes, the second is the byteorder (required before Python 3.11, where it started defaulting to 'big').
See int.to_bytes in the official documentation for reference.
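As a quick round-trip sanity check (a sketch only; bytes.hex() needs Python 3.5+):
fulldata = 'AA060100B155'
senddata = int(fulldata, 16).to_bytes(len(fulldata)//2, byteorder='big')
assert senddata.hex().upper() == fulldata
assert int.from_bytes(senddata, byteorder='big') == int(fulldata, 16)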
There are various ways to do this. Here's one that works in both Python 2 and Python 3.
from binascii import unhexlify
fulldata = 'AA060100B155'
senddata = unhexlify(fulldata)
print(repr(senddata))
Python 2 output
'\xaa\x06\x01\x00\xb1U'
Python 3 output
b'\xaa\x06\x01\x00\xb1U'
The following is Python 2/3 compatible. The unhexlify function converts hexadecimal notation to bytes. Use a byte string and you don't have to deal with Unicode strings at all: Python 2 literals are byte strings by default, and Python 2 also recognizes the b'' prefix that Python 3 requires for byte strings.
from binascii import unhexlify
fulldata = b'AA060100B155'
print(repr(unhexlify(fulldata)))
Python 2 output:
'\xaa\x06\x01\x00\xb1U'
Python 3 output:
b'\xaa\x06\x01\x00\xb1U'
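For completeness, the reverse direction works the same way on both versions:
from binascii import hexlify
print(hexlify(b'\xaa\x06\x01\x00\xb1U'))
# b'aa060100b155' on Python 3 ('aa060100b155' on Python 2)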
My Windows display language is Chinese.
To illustrate my point, I use package pathlib.
from pathlib import *
rootdir=Path(r'D:\新建文件夹')
print(rootdir.exists())
With Python 2.7 I get False.
With Python 3 I get True.
Any ideas? Thanks for any advice.
For Python 2.7, you can install pathlib with "pip install pathlib".
In Python 3 strings are Unicode by default. In Python 2, they are byte strings encoded in the source file encoding. Use a Unicode string in Python 2.
Also make sure to declare the source file encoding and make sure the source is saved in that encoding.
#coding:utf8
from pathlib import *
rootdir=Path(ur'D:\新建文件夹')
print(rootdir.exists())
The main difference between Python 2 and Python 3 is the basic types that exist to deal with texts and bytes. On Python 3 we have one text type: str which holds Unicode data and two byte types bytes and bytearray.
On the other hand, Python 2 has two text types: str, which for all intents and purposes is limited to ASCII plus some undefined data above the 7-bit range; unicode, which is equivalent to the Python 3 str type; and one byte type, bytearray, which it borrowed from Python 3.
Python 3 removed all codecs that don't go from bytes to Unicode or vice versa and removed the now useless .encode() method on bytes and .decode() method on strings.
More about this e.g. here.
Use Unicode literals for Windows paths: add from __future__ import unicode_literals at the top.
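A sketch of the same script using that approach; on Python 2 the coding declaration is needed for the non-ASCII literal, and pathlib comes from pip install pathlib:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from pathlib import *
rootdir = Path(r'D:\新建文件夹')   # now a Unicode literal, so the Windows Unicode API is used
print(rootdir.exists())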
Explanation
r'D:\新建文件夹' is a bytestring on Python 2. Its specific value depends on the encoding declaration at the top of the file (such as # -*- coding: utf-8 -*-); without that declaration, a non-ASCII literal is an error in Python 2. On Python 3, r'D:\新建文件夹' is a Unicode string, and the default source encoding is utf-8, so no declaration is required.
Python uses Unicode API when working with files on Windows if the input is Unicode and "ANSI" API if the input is bytes.
If the source encoding differs from the "ANSI" encoding (such as cp1252), the result may differ, because the bytes are passed as-is and the same byte sequence can represent different characters in different encodings. And if the filename can't be represented in the "ANSI" encoding at all (a single-byte encoding such as cp1252 has only 256 values, while there are over a million possible Unicode characters), the result may differ as well. Using Unicode strings for filenames on Windows avoids both issues.
The code below causes a UnicodeDecodeError:
#-*- coding:utf-8 -*-
s="中文"
u=u"123"
u=s+u
I know it's because the Python interpreter is using ascii to decode s.
Why doesn't the Python interpreter use the file's encoding (utf-8) for decoding?
Implicit decoding cannot know what source encoding was used. That information is not stored with strings.
All that Python has after importing is a byte string with characters representing bytes in the range 0-255. You could have imported that string from another module, or read it from a file object, etc. The fact that the parser knew what encoding was used for those bytes doesn't even matter for plain byte strings.
As such, it is always better to decode bytes explicitly rather than rely on implicit decoding. Either use a Unicode literal for s as well, or decode explicitly using str.decode():
u = s.decode('utf8') + u
The types of the two strings are different -- the first is a normal (byte) string, the second is a unicode string -- hence the error.
So, instead of doing s="中文", do the following to get unicode strings for both:
s=u"中文"
u=u"123"
u=s+u
The code works perfectly fine on Python 3.
However, in Python 2, if you do not add a u before a string literal, you are constructing a string of bytes. When one wants to combine a string of bytes and a string of characters, one either has to decode the string of bytes, or encode the string of characters. Python 2.x opted for the former. In order to prevent accidents (for example, someone appending binary data to a user input and thus generating garbage), the Python developers chose ascii as the encoding for that conversion.
You can add a line
from __future__ import unicode_literals
after the #coding declaration so that literals without u or b prefixes are always character and not byte literals.
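A sketch of the original snippet with that change:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

s = "中文"    # now a unicode literal
u = u"123"
u = s + u    # no implicit ascii decode, so no UnicodeDecodeError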
unichr(0x10000) fails with a ValueError when CPython is compiled without --enable-unicode=ucs4.
Is there a language builtin or core library function that converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on?
Yes, here you go:
>>> unichr(0xd800)+unichr(0xdc00)
u'\U00010000'
The crucial point to understand is that unichr() converts an integer to a single code unit in the Python interpreter's string encoding. The Python Standard Library documentation for 2.7.3, 2. Built-in Functions, on unichr() reads,
Return the Unicode string of one character whose Unicode code is the integer i.... The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. ValueError is raised otherwise.
I added emphasis to "one character", by which they mean "one code unit" in Unicode terms.
I'm assuming that you are using Python 2.x. The Python 3.x interpreter has no built-in unichr() function. Instead, the Python Standard Library documentation for 3.3.0, 2. Built-in Functions, on chr() reads,
Return the string representing a character whose Unicode codepoint is the integer i.... The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16).
Note that the return value is now a string of unspecified length, not a string with a single code unit. So in Python 3.x, chr(0x10000) would behave as you expected. It "converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on".
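For instance, on Python 3.3+ (where narrow builds no longer exist):
>>> len(chr(0x10000)), hex(ord(chr(0x10000)))
(1, '0x10000')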
But back to Python 2.x. If you use unichr() to create Python 2.x unicode objects, and you are using Unicode scalar values above 0xFFFF, then you are committing your code to being aware of the Python interpreter's implementation of unicode objects.
You can isolate this awareness with a function which tries unichr() on a scalar value, catches ValueError, and tries again with the corresponding UTF-16 surrogate pair:
def unichr_supplemental(scalar):
    try:
        return unichr(scalar)
    except ValueError:
        return unichr(0xd800 + ((scalar - 0x10000) // 0x400)) \
             + unichr(0xdc00 + ((scalar - 0x10000) % 0x400))
>>> unichr_supplemental(0x41),len(unichr_supplemental(0x41))
(u'A', 1)
>>> unichr_supplemental(0x10000), len(unichr_supplemental(0x10000))
(u'\U00010000', 2)
But you might find it easier to just convert your scalars to 4-byte UTF-32 values in a UTF-32 byte string, and decode this byte string into a unicode string:
>>> '\x00\x00\x00\x41'.decode('utf-32be'), \
... len('\x00\x00\x00\x41'.decode('utf-32be'))
(u'A', 1)
>>> '\x00\x01\x00\x00'.decode('utf-32be'), \
... len('\x00\x01\x00\x00'.decode('utf-32be'))
(u'\U00010000', 2)
The code above was tested on Python 2.6.7 with UTF-16 encoding for Unicode strings. I didn't test it on a Python 2.x interpreter with UTF-32 encoding for Unicode strings. However, it should work unchanged on any Python 2.x interpreter with any Unicode string implementation.