Python 3 - Encode/Decode vs Bytes/Str [duplicate] - python

This question already has answers here:
Best way to convert string to bytes in Python 3?
(5 answers)
Closed 4 years ago.
I am new to Python 3, coming from Python 2, and I am a bit confused about Unicode fundamentals.
I've read some good posts that made it all much clearer, but I see there are 2 methods in Python 3 that handle encoding and decoding, and I'm not sure which one to use.
So the idea in Python 3 is that every string is Unicode and can be encoded and stored as bytes, or decoded back into a Unicode string again.
But there are 2 ways to do it:
u'something'.encode('utf-8') will generate b'something', but so does bytes(u'something', 'utf-8').
And b'bytes'.decode('utf-8') seems to do the same thing as str(b'bytes', 'utf-8').
Now my question is: why are there 2 methods that seem to do the same thing, and is either better than the other (and why)? I've been trying to find an answer to this on Google, but no luck.
>>> original = '27岁少妇生孩子后变老'
>>> type(original)
<class 'str'>
>>> encoded = original.encode('utf-8')
>>> print(encoded)
b'27\xe5\xb2\x81\xe5\xb0\x91\xe5\xa6\x87\xe7\x94\x9f\xe5\xad\xa9\xe5\xad\x90\xe5\x90\x8e\xe5\x8f\x98\xe8\x80\x81'
>>> type(encoded)
<class 'bytes'>
>>> encoded2 = bytes(original, 'utf-8')
>>> print(encoded2)
b'27\xe5\xb2\x81\xe5\xb0\x91\xe5\xa6\x87\xe7\x94\x9f\xe5\xad\xa9\xe5\xad\x90\xe5\x90\x8e\xe5\x8f\x98\xe8\x80\x81'
>>> type(encoded2)
<class 'bytes'>
>>> print(encoded+encoded2)
b'27\xe5\xb2\x81\xe5\xb0\x91\xe5\xa6\x87\xe7\x94\x9f\xe5\xad\xa9\xe5\xad\x90\xe5\x90\x8e\xe5\x8f\x98\xe8\x80\x8127\xe5\xb2\x81\xe5\xb0\x91\xe5\xa6\x87\xe7\x94\x9f\xe5\xad\xa9\xe5\xad\x90\xe5\x90\x8e\xe5\x8f\x98\xe8\x80\x81'
>>> decoded = encoded.decode('utf-8')
>>> print(decoded)
27岁少妇生孩子后变老
>>> decoded2 = str(encoded2, 'utf-8')
>>> print(decoded2)
27岁少妇生孩子后变老
>>> type(decoded)
<class 'str'>
>>> type(decoded2)
<class 'str'>
>>> print(str(b'27\xe5\xb2\x81\xe5\xb0\x91\xe5\xa6\x87\xe7\x94\x9f\xe5\xad\xa9\xe5\xad\x90\xe5\x90\x8e\xe5\x8f\x98\xe8\x80\x81', 'utf-8'))
27岁少妇生孩子后变老
>>> print(b'27\xe5\xb2\x81\xe5\xb0\x91\xe5\xa6\x87\xe7\x94\x9f\xe5\xad\xa9\xe5\xad\x90\xe5\x90\x8e\xe5\x8f\x98\xe8\x80\x81'.decode('utf-8'))
27岁少妇生孩子后变老

Neither is better than the other; they do exactly the same thing. However, using .encode() and .decode() is the more common way to do it. It is also compatible with Python 2.
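One small practical gotcha worth adding (a minimal illustration, not part of the original answer): str() only decodes when you pass an encoding; called on a bytes object without one, it returns the repr of the bytes, not the decoded text.
data = 'café'.encode('utf-8')

print(data.decode('utf-8'))   # café
print(str(data, 'utf-8'))     # café

# Forgetting the encoding argument does NOT decode; it returns the repr:
print(str(data))              # b'caf\xc3\xa9'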

To add to Lennart Regebro's answer: there is even a third way that can be used:
encoded3 = str.encode(original, 'utf-8')
print(encoded3)
Anyway, it is actually exactly the same as the first approach. It may also look as if the second way is syntactic sugar for the third approach.
A programming language is a means to express abstract ideas formally, to be executed by a machine. A programming language is considered good if it contains the constructs that one needs. Python is a hybrid language -- i.e. more natural and more versatile than pure OO or pure procedural languages. Sometimes functions are more appropriate than object methods, sometimes the reverse is true. It depends on your mental picture of the problem being solved.
Anyway, the feature mentioned in the question is probably a by-product of the language implementation/design. In my opinion, it is a nice example that shows two alternative ways of thinking about technically the same thing.
In other words, calling an object method means thinking in terms of "let the object give me the wanted result". Calling a function instead means "let the outer code process the passed argument and extract the wanted value".
The first approach emphasizes the ability of the object to do the task on its own; the second emphasizes the ability of a separate algorithm to extract the data. Sometimes the separate code may be so specialized that it is not wise to add it as a general method to the class of the object.

To add to the previous answer, there is even a fourth way that can be used:
import codecs
encoded4 = codecs.encode(original, 'utf-8')
print(encoded4)

Related

how to convert Python 2 unicode() function into correct Python 3.x syntax

I enabled the compatibility check in my Python IDE and now I realize that the inherited Python 2.7 code has a lot of calls to unicode() which are not allowed in Python 3.x.
I looked at the Python 2 docs and found no hint on how to upgrade.
I don't want to switch to Python3 now, but maybe in the future.
The code contains about 500 calls to unicode()
How to proceed?
Update
The comment by user vaultah suggesting to read the pyporting guide has received several upvotes.
My current solution is this (thanks to Peter Brittain):
from builtins import str
... I could not find this hint in the pyporting docs.....
As has already been pointed out in the comments, there is already advice on porting from 2 to 3.
Having recently had to port some of my own code from 2 to 3 and maintain compatibility for each for now, I wholeheartedly recommend using python-future, which provides a great tool to help update your code (futurize) as well as clear guidance for how to write cross-compatible code.
In your specific case, I would simply convert all calls to unicode to use str and then import str from builtins. Any IDE worth its salt these days will do that global search and replace in one operation.
Of course, that's the sort of thing futurize should catch too, if you just want to use automatic conversion (and to look for other potential issues in your code).
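A minimal sketch of that search-and-replace, assuming python-future is installed when running under Python 2 (raw_value is just an illustrative variable):
from builtins import str  # on Python 2 this picks up python-future's backport

raw_value = 42
# The Python 2 original would have been:  label = unicode(raw_value)
label = str(raw_value)
print(label, type(label))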
You can test whether there is such a function as unicode() in the version of Python that you're running. If not, you can create a unicode() alias for the str() function, which does in Python 3 what unicode() did in Python 2, as all strings are unicode in Python 3.
# Python 3 compatibility hack
try:
    unicode('')
except NameError:
    unicode = str
Note that a more complete port is probably a better idea; see the porting guide for details.
Short answer: Replace all unicode calls with str calls.
Long answer: In Python 3, the separate unicode type was dropped because every string is Unicode; str took its place. The following solution should work if you are only using Python 3:
unicode = str
# the rest of your code goes here
If you are using it with both Python 2 or Python 3, use this instead:
import sys
if sys.version_info.major == 3:
    unicode = str
# the rest of your code goes here
Another way: run this on the command line
$ 2to3 package -w
First, as a strategy, I would take a small part of your program and try to port it. The number of unicode calls you are describing suggests to me that your application cares about string representations more than most, and each use case is often different.
The important consideration is that all strings are unicode in Python 3. If you are using the str type to store "bytes" (for example, if they are read from a file), then you should be aware that those will not be bytes in Python3 but will be unicode characters to begin with.
Let's look at a few cases.
First, if you do not have any non-ASCII characters at all and really are not using the Unicode character set, it is easy. Chances are you can simply change the unicode() function to str(). That will assure that any object passed as an argument is properly converted. However, it is wishful thinking to assume it's that easy.
Most likely, you'll need to look at the argument to unicode() to see what it is, and determine how to treat it.
For example, if you are reading UTF-8 characters from a file in Python 2 and converting them to Unicode your code would look like this:
data = open('somefile', 'r').read()
udata = unicode(data)
However, in Python 3, read() returns Unicode data to begin with, and the encoding to decode with is specified when opening the file:
udata = open('somefile', 'r', encoding='UTF-8').read()
As you can see, transforming unicode() simply when porting may depend heavily on how and why the application is doing Unicode conversions, where the data has come from, and where it is going to.
Python3 brings greater clarity to string representations, which is welcome, but can make porting daunting. For example, Python3 has a proper bytes type, and you convert byte-data to unicode like this:
udata = bytedata.decode('UTF-8')
or convert Unicode data back to bytes using the opposite transform:
bytedata = udata.encode('UTF-8')
I hope this at least helps determine a strategy.
You can use the six library, which has a text_type alias (unicode in Python 2, str in Python 3):
from six import text_type
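For example, a quick sketch of how text_type is typically used, assuming six is installed:
from six import text_type

value = b'caf\xc3\xa9'.decode('utf-8')
# text_type is unicode on Python 2 and str on Python 3,
# so this check works unchanged on both:
print(isinstance(value, text_type))   # True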

Supporting python 2 and 3: str, bytes or alternative

I have a Python2 codebase that makes extensive use of str to store raw binary data. I want to support both Python2 and Python3.
The bytes type in Python 2 (an alias of str) and bytes in Python 3 are completely different. They take different arguments to construct, index to different types and have different str and repr.
What's the best way of unifying the code for both Python versions, using a single type to store raw data?
The python-future package has a backport of the Python3 bytes type.
>>> from builtins import bytes # in py2, this picks up the backport
>>> b = bytes(b'ABCD')
This provides the Python 3 interface in both Python 2 and Python 3. In Python 3, it is the builtin bytes type. In Python 2, it is a compatibility layer on top of the str type.
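A small usage sketch, assuming the backport behaves with Python 3 semantics as described above (indexing yields ints, slicing yields bytes):
from builtins import bytes  # in py2, this picks up the backport

b = bytes(b'ABCD')
print(b[0])       # 65 -- an int, as in Python 3
print(b[0:1])     # b'A'
print(list(b))    # [65, 66, 67, 68]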
I don't know which parts you want to work with as bytes; I almost always work with bytearrays, and this is how I do it when reading from a file:
with open(file, 'rb') as imageFile:
    f = imageFile.read()
    b = bytearray(f)
I took that right out of a project I am working on, and it works in both 2 and 3. Maybe something for you to look at?
If your project is small and simple, use six.
Otherwise I suggest having two independent codebases: one for Python 2 and one for Python 3. Initially it may sound like a lot of unnecessary work, but eventually it's actually a lot easier to maintain.
As an example of what your project may become if you decide to support both Pythons in a single codebase, take a look at Google's protobuf. Lots of often counterintuitive branching all around the code, and abstractions that were modified just to allow hacks. And as your project evolves it won't get better: deadlines play against code quality.
With two separate codebases you will simply apply almost identical patches, which isn't a lot of work compared to what is ahead of you if you want a single codebase. And it will be easier to migrate to Python 3 completely once the number of Python 2 users of your package drops.
Assuming you only need to support Python 2.6 and newer, you can simply use bytes for, well, bytes. Use b literals to create bytes objects, such as b'\x0a\x0b\x00'. When working with files, make sure the mode includes a b (as in open('file.bin', 'rb')).
Beware that iteration and element access are different, though. In these cases, you can write your code to use one-byte chunks: instead of b[0] == 0 (Python 3) or b[0] == b'\x00' (Python 2), write b[0:1] == b'\x00'. Other options are using bytearray (when the bytes are mutable) or helper functions.
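A minimal sketch of that cross-compatible element access:
data = b'\x00\x01\x02'

# data[0] is the int 0 on Python 3 but the one-character str '\x00' on Python 2,
# so compare one-byte slices instead -- slicing returns bytes on both:
print(data[0:1] == b'\x00')   # True on Python 2 and Python 3

# bytearray indexes to ints on both versions, if mutability is acceptable:
buf = bytearray(data)
print(buf[0] == 0)            # True on both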
Strings of characters should be unicode in Python 2 regardless of Python 3 porting; otherwise the code would likely be wrong when it encounters non-ASCII characters anyway. The equivalent is str in Python 3.
Either use u literals to create character strings (such as u'Düsseldorf') and/or make sure to start every file with from __future__ import unicode_literals. Declare file encodings when necessary by starting files with # encoding: utf-8.
Use io.open to read character strings from files. For network code, fetch bytes and call decode on them to get a character string.
If you need to support Python 2.5 or 3.2, have a look at six to convert literals.
Add plenty of assertions to make sure that functions which operate on character strings don't get bytes, and vice versa. As usual, a good test suite with 100% coverage helps a lot.
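A minimal sketch of the kind of assertion meant here, using Python 3 semantics (where character strings are str and raw data is bytes); the function name is just illustrative:
def shout(text):
    # Fail fast if a bytes object sneaks in where text is expected.
    assert isinstance(text, str), "expected a character string, got %r" % type(text)
    return text.upper()

print(shout('hello'))     # HELLO
# shout(b'hello')         # would trip the assertion instead of failing later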

Why can't Python's string.format pad with "\x00"?

I wanted to pad a string with null characters ("\x00"). I know lots of ways to do this, so please do not answer with alternatives. What I want to know is: Why does Python's string.format() function not allow padding with nulls?
Test cases:
>>> "{0:\x01<10}".format("bbb")
'bbb\x01\x01\x01\x01\x01\x01\x01'
This shows that hex-escaped characters work in general.
>>> "{0:\x00<10}".format("bbb")
'bbb       '
But "\x00" gets turned into a space ("\x20").
>>> "{0:{1}<10}".format("bbb","\x00")
'bbb       '
>>> "{0:{1}<10}".format("bbb",chr(0))
'bbb       '
Even trying a couple other ways of doing it.
>>> "bbb" + "\x00" * 7
'bbb\x00\x00\x00\x00\x00\x00\x00'
This works, but doesn't use string.format
>>> spaces = "{0: <10}".format("bbb")
>>> nulls = "{0:\x00<10}".format("bbb")
>>> spaces == nulls
True
Python is clearly substituting spaces (chr(0x20)) instead of nulls (chr(0x00)).
Digging into the source code for Python 2.7, I found that the issue is in this section from ./Objects/stringlib/formatter.h, lines 718-722 (in version 2.7.3):
/* Write into that space. First the padding. */
p = fill_padding(STRINGLIB_STR(result), len,
                 format->fill_char=='\0'?' ':format->fill_char,
                 lpad, rpad);
The trouble is that a zero/null character ('\0') is being used as a default when no padding character is specified. This is to enable this behavior:
>>> "{0:<10}".format("foo")
'foo       '
It may be possible to set format->fill_char = ' '; as the default in parse_internal_render_format_spec() at ./Objects/stringlib/formatter.h:186, but there's some bit about backwards compatibility that checks for '\0' later on. In any case, my curiosity is satisfied. I will accept someone else's answer if it has more history or a better explanation for why than this.
The answer to the original question is that it was a bug in python.
It was documented as being permitted, but wasn't. It was fixed in 2014. For Python 2, the fix first appeared in either 2.7.7 or 2.7.8 (I'm not sure how to tell which).
Original tracked issue.
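On an interpreter that includes the fix (Python 3, or a 2.7 release at or after the one mentioned above), the original test case behaves as documented; a quick check:
padded = "{0:\x00<10}".format("bbb")
print(repr(padded))                    # 'bbb\x00\x00\x00\x00\x00\x00\x00'
print(padded == "bbb" + "\x00" * 7)    # True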
Because the string.format method in Python 2.7 is a backport of Python 3's str.format. Python 2.7's unicode is the Python 3 str, while the Python 2.7 str is the Python 3 bytes. A string is the wrong type to express binary data in Python 3; you would use bytes, which has no format method. So really you should be asking why format is on str at all in 2.7, when it should only have been on the unicode type, since that is what became the string type in Python 3.
I guess the answer is that it is simply too convenient to have it there.
As a related matter, see the discussion of why there is no format on bytes yet.

Integer literal is an object in Python? [duplicate]

This question already exists:
Closed 10 years ago.
Possible Duplicate:
accessing a python int literals methods
Everything in Python is an object. Even a number is an object:
>>> a=1
>>> type(a)
<class 'int'>
>>> a.real
1
I tried the following, because we should be able to access class members of an object:
>>> type(1)
<class 'int'>
>>> 1.real
File "<stdin>", line 1
1.real
^
SyntaxError: invalid syntax
Why does this not work?
Yes, an integer literal is an object in Python. To summarize: the parser needs to be able to tell that it is dealing with an object of type int, but the statement 1.real confuses the parser into thinking it has the float 1. followed by the word real, and it therefore raises a syntax error.
To test this you can also try
>>> (1).real
1
as well as,
>>> 1.0.real
1.0
So in the case of 1.real, Python is interpreting the . as a decimal point.
Edit
BasicWolf puts it nicely too - 1. is being interpreted as the floating-point representation of 1, so 1.real is equivalent to writing (1.)real - i.e. with no attribute access operator (period/full stop) in between. Hence the syntax error.
Further edit
As mgilson alludes to in his/her comment: the parser can handle access to an int's attributes and methods, but only as long as the statement makes it clear that it is being given an int and not a float.
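A tiny sketch of that last point: anything that stops the reader from taking "1." as a float literal makes the attribute access work.
print((1).real)    # 1 -- the parentheses make the int explicit
print(1 .real)     # 1 -- the space ends the number token before the dot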
A language is usually built in three layers.
When you provide a program to a language it first has to "read" the program. Then it builds what it has read into something it can work with. And finally it runs that thing as "a program" and (hopefully) prints a result.
The problem here is that the first part of Python - the part that reads programs - is confused. It's confused because it's not clever enough to know the difference between
1.234
and
1.letters
What seems to be happening is that it thinks you were trying to type a number like 1.234 but made a mistake and typed letters instead(!).
So this has nothing to do with what 1 "really is" and whether or not it is an object. All that kind of logic happens in the second and third stages I described earlier, when Python tries to build and then run the program.
What you've uncovered is just a strange (but interesting!) wrinkle in how Python reads programs.
[I'd call it a bug, but it's probably like this for a reason. It turns out that some things are hard for computers to read. Python is probably designed so that it's easy (fast) for the computer to read programs. Fixing this "bug" would probably make the part of Python that reads programs slower or more complicated, so it's probably a trade-off.]
Although the behaviour of 1.real seems illogical, it is expected due to the language specification: Python interprets 1. as a float (see floating point literals). But as @mutzmatron pointed out, (1).real works because the expression in brackets is a valid Python object.
Update: Note the following pitfalls:
>>> 1 + 2j.real
1.0   # due to the fact that 2j.real == 0
>>> 1 + 2j.imag
3.0   # due to the fact that 2j.imag == 2
You can still access the real attribute of 1:
>>> hasattr(1, 'real')
True
>>> getattr(1, 'real')
1

which are python best practices to concatenate different types of data?

I want to generate bytes sequence containing string length and string content.
For example, for string 'hello' I want to get b'\x05hello'
After some docs reading I've wrote a function:
def LenAndStrBytes(strdata):
    return bytearray([len(strdata)&0xFF])+strdata if strdata!=[] else 0
Questions:
I'm new to Python programming and I wonder: what are the best Python practices for concatenating different types of data, like an int and something iterable like a bytearray?
Did I write my function optimally?
Well, just as larsmans points out, it depends on your usage. If you can get the result with clear code that satisfies the constraints of your context, it is a suitable practice.
No need for &0xFF; bytearray already checks that every value is between 0 and 255.
>>> strdata = 'hello'
>>> bytearray([len(strdata)]) + strdata if strdata else bytearray()
bytearray(b'\x05hello')
And you could also
import struct
bytearray(struct.pack('B%ds' % len(strdata), len(strdata), strdata))
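For completeness, a Python 3 oriented sketch of the same idea (the function name is just illustrative): encode the text first so the prefix counts bytes rather than characters, then prepend the single length byte with struct.pack.
import struct

def len_prefixed(text):
    data = text.encode('utf-8')
    if len(data) > 255:
        raise ValueError('too long for a one-byte length prefix')
    return struct.pack('B', len(data)) + data

print(len_prefixed('hello'))   # b'\x05hello'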
Are you trying to serialise binary data before you write it to a file (or send it over a network)?
Did you perhaps mean to use the pickle module for data serialisation instead?
