How to change '\x??' to unicode in python? - python

I have a string:
\xe2\x80\x8e\xd7\x93\xd7\x9c\xd7\x99\xd7\xaa\xe2\x80\x8e
I want to change it to Unicode using Python.
How do I do that?

That is UTF-8 data already; python is showing you the string literal form.
>>> print '\xe2\x80\x8e\xd7\x93\xd7\x9c\xd7\x99\xd7\xaa\xe2\x80\x8e'.decode('utf8')
‎דלית‎
The above line decodes the UTF-8 data to a unicode object with .decode('utf8') and prints that; the print statement detects the encoding used by my terminal and re-encodes the unicode object so that my terminal can display it properly.
You may want to read up on Python and Unicode:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
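For reference, here is the same decode in Python 3, where the data would be a bytes literal and str is already Unicode (a sketch using the exact byte sequence from the question):

```python
# The question's data as a Python 3 bytes literal.
data = b'\xe2\x80\x8e\xd7\x93\xd7\x9c\xd7\x99\xd7\xaa\xe2\x80\x8e'

# Decoding turns the UTF-8 bytes into a str of Unicode characters:
# a left-to-right mark, the Hebrew word, and another left-to-right mark.
text = data.decode('utf-8')
print(text)
```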

Related

Python default string encoding

When, where and how does Python implicitly apply encodings to strings, or perform implicit transcodings (conversions)?
And what are those "default" (i.e., implied) encodings?
For example, what are the encodings:
of string literals?
s = "Byte string with national characters"
us = u"Unicode string with national characters"
of byte strings when type-converted to and from Unicode?
data = unicode(random_byte_string)
when byte- and Unicode strings are written to/from a file or a terminal?
print(open("The full text of War and Peace.txt").read())
There are multiple parts of Python's functionality involved here: reading the source code and parsing the string literals, transcoding, and printing. Each has its own conventions.
Short answer:
For the purpose of code parsing:
str (Py2) -- not applicable, raw bytes from the file are taken
unicode (Py2)/str (Py3) -- "source encoding", defaults are ascii (Py2) and utf-8 (Py3)
bytes (Py3) -- none, non-ASCII characters are prohibited in the literal
For the purpose of transcoding:
both (Py2) -- sys.getdefaultencoding() (ascii almost always)
there are implicit conversions which often result in a UnicodeDecodeError/UnicodeEncodeError
both (Py3) -- none, conversion must always be an explicit encode()/decode() call
For the purpose of I/O:
unicode (Py2) -- <file>.encoding if set, otherwise sys.getdefaultencoding()
str (Py2) -- not applicable, raw bytes are written
str (Py3) -- <file>.encoding, always set and defaults to locale.getpreferredencoding()
bytes (Py3) -- none, printing produces its repr() instead
First of all, some terminology clarification so that you understand the rest correctly. Decoding is translation from bytes to characters (Unicode or otherwise), and encoding (as a process) is the reverse. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software to get the distinction.
Now...
Reading the source and parsing string literals
At the start of a source file, you can specify the file's "source encoding" (its exact effect is described later). If not specified, the default is ascii for Python 2 and utf-8 for Python 3. A UTF-8 BOM has the same effect as a utf-8 encoding declaration.
Python 2
Python 2 reads the source as raw bytes. It only uses the "source encoding" to parse a Unicode literal when it sees one. (It's more complicated than that under the hood, but this is the net effect.)
> type t.py
# Encoding: cp1251
s = "абвгд"
us = u"абвгд"
print repr(s), repr(us)
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0430\u0431\u0432\u0433\u0434'
<change encoding declaration in the file to cp866, do not change the contents>
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0440\u0441\u0442\u0443\u0444'
<transcode the file to utf-8, update declaration or replace with BOM>
> py -2 t.py
'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4' u'\u0430\u0431\u0432\u0433\u0434'
So, regular strings will contain the exact bytes that are in the file. And Unicode strings will contain the result of decoding the file's bytes with the "source encoding".
If the decoding fails, you will get a SyntaxError. The same happens if there is a non-ASCII character in the file when no encoding is specified. Finally, if the unicode_literals future import is used, any regular string literals (in that file only) are treated as Unicode literals when parsing, with everything that implies.
Python 3
Python 3 decodes the entire source file with the "source encoding" into a sequence of Unicode characters. Any parsing is done after that. (In particular, this makes it possible to have Unicode in identifiers.) Since all string literals are now Unicode, no additional transcoding is needed. In byte literals, non-ASCII characters are prohibited (such bytes must be specified with escape sequences), evading the issue altogether.
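A small Python 3 illustration of the byte-literal rule above (the literals are illustrative examples, not from the original question):

```python
# In Python 3, str literals are Unicode; non-ASCII characters are fine
# in them (given the default utf-8 source encoding).
s = 'абвгд'

# In bytes literals, non-ASCII characters are a SyntaxError; such bytes
# must be spelled out as escape sequences instead:
b = b'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4'  # the UTF-8 bytes of 'абвгд'

assert b.decode('utf-8') == s
```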
Transcoding
As per the clarification at the start:
str (Py2)/bytes (Py3) -- bytes => can only be decoded (directly, that is; details follow)
unicode (Py2)/str (Py3) -- characters => can only be encoded
Python 2
In both cases, if the encoding is not specified, sys.getdefaultencoding() is used. It is ascii (unless you uncomment a code chunk in site.py, or do some other hacks which are a recipe for disaster). So, for the purpose of transcoding, sys.getdefaultencoding() is the "string's default encoding".
Now, here's a caveat:
a decode() and encode() -- with the default encoding -- is done implicitly when converting str<->unicode:
in string formatting (a third of UnicodeDecodeError/UnicodeEncodeError questions on Stack Overflow are about this)
when trying to encode() a str or decode() a unicode (the second third of the Stack Overflow questions)
Python 3
There's no "default encoding" at all: implicit conversion between str and bytes is now prohibited.
bytes can only be decoded and str only encoded; every conversion is an explicit encode()/decode() call (the encoding argument defaults to utf-8 if omitted)
converting bytes->str with str() (incl. implicitly, e.g. when printing) produces its repr() instead (which is only useful for debug printing), evading the encoding issue entirely
converting str->bytes is prohibited
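The strict str/bytes separation in Python 3 can be demonstrated in a few lines (a sketch; the sample strings are arbitrary):

```python
# str() on a bytes object gives its repr, not a decode -- a common gotcha:
assert str(b'abc') == "b'abc'"

# Mixing str and bytes raises instead of silently converting:
try:
    'abc' + b'abc'
except TypeError:
    print('no implicit str<->bytes conversion')

# Conversion is always an explicit call; the encoding defaults to UTF-8:
assert b'abc'.decode() == 'abc'
assert 'abc'.encode() == b'abc'
```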
Printing
This matter is unrelated to a variable's value but related to what you would see on the screen when it's printed -- and whether you will get a UnicodeEncodeError when printing.
Python 2
A unicode is encoded with <file>.encoding if set; otherwise, it's implicitly converted to str as per the above. (The final third of the UnicodeEncodeError SO questions fall into here.)
For standard streams, the stream's encoding is guessed at startup from various environment-specific sources, and can be overridden with the PYTHONIOENCODING environment variable.
str's bytes are sent to the OS stream as-is. What specific glyphs you will see on the screen depends on your terminal's encoding settings (if it's something like UTF-8, you may see nothing at all if you print a byte sequence that is invalid UTF-8).
Python 3
The changes are:
Files opened in text vs. binary mode now natively accept str or bytes, respectively, and outright refuse to process the wrong type. Text-mode files always have an encoding set, with locale.getpreferredencoding(False) as the default.
print for text streams still implicitly converts everything to str, which in the case of bytes prints its repr() as per the above, evading the encoding issue altogether
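A sketch of those Python 3 I/O rules, using a throwaway temp file (the file name and sample text are made up for illustration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Text mode accepts str only; an explicit encoding overrides the
# locale.getpreferredencoding(False) default.
with open(path, 'w', encoding='utf-8') as f:
    f.write('абвгд')

# Binary mode accepts bytes only and does no transcoding.
with open(path, 'rb') as f:
    raw = f.read()
assert raw == 'абвгд'.encode('utf-8')

# Writing bytes to a text-mode file is refused outright:
try:
    with open(path, 'w', encoding='utf-8') as f:
        f.write(b'raw bytes')
except TypeError:
    print('text-mode files refuse bytes')
```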
Internal storage: you should not need to care about the encoding Python uses to store strings in memory. It is mostly transparent; just think of a str as an abstract sequence of Unicode characters, and of bytes as an abstract sequence of bytes.
The internal representation in Python 3.3+ varies according to the "largest" character in the string (PEP 393): strings in the Latin-1 range use 1 byte per character, strings that fit in the BMP use 2 bytes (UCS-2), and the rest use 4 bytes (UCS-4). When you work with strings, it is as if you had an abstract Unicode string, not any concrete encoding. Unless you program in C or use special facilities (such as memory views), you will never see the internal encoding.
Bytes are just a view of actual memory; Python interprets them as unsigned chars. But again, you should usually think about what the sequence represents, not about the internal storage.
Python 2 stores str as unsigned chars, and unicode as UCS-2 or UCS-4 depending on the build; on narrow (UCS-2) builds, code points above U+FFFF are stored as two code units, while Python 3 always exposes them as a single character.
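The per-string width choice can be observed indirectly via object sizes (a CPython-specific sketch; the exact byte counts are implementation details, only the ordering is asserted):

```python
# CPython 3.3+ (PEP 393): per-character width grows with the highest
# code point in the string.
import sys

ascii_s = 'a' * 1000            # Latin-1 range: 1 byte per character
bmp_s   = '\u0394' * 1000       # Greek Delta, BMP: 2 bytes per character
astral  = '\U0001F600' * 1000   # emoji, outside BMP: 4 bytes per character

assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral)
```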

Python string encode and decode

Encoding in JS means converting a string with special characters to escaped usable string. like : encodeURIComponent would convert spaces to %20 etc to be usable in URIs.
So encoding here means converting to a particular format.
In Python 2.7, I have a string : 奥多比. To convert it into UTF-8 format, however, I need to use decode() function.
Like: "奥多比".decode("utf-8") == u'\u5965\u591a\u6bd4'
I want to understand how the meaning of encode and decode changes between languages. To me, essentially, I should be doing "奥多比".encode("utf-8").
What am I missing here?
You appear to be confusing Unicode text (represented in Python 2 as the unicode type, indicated by the u prefix on the literal syntax), with one of the standard Unicode encodings, UTF-8.
You are not creating UTF-8, you created a Unicode text object, by decoding from a UTF-8 byte stream.
The byte string literal "奥多比" is a sequence of binary data, bytes. You either entered these in a text editor and saved the file as UTF-8 (and told Python to treat your source code as UTF-8 by starting the file with a PEP 263 codec header), or you typed it into the Python interactive prompt in a terminal that was configured to send UTF-8 data.
I strongly urge you to read more about the difference between bytes, codecs and Unicode text. The following links are highly recommended:
Ned Batchelder's Pragmatic Unicode
The Python Unicode HOWTO
Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
In Python 2, it's type str, i.e. a sequence of bytes. To convert it to a Unicode string, you need to decode this sequence of bytes using a codec. Simply said, the codec specifies how bytes should be converted to a sequence of Unicode code points. Look into the Unicode HOWTO for a more in-depth article on this.
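In Python 3 the confusion largely disappears, because the direction is explicit in the types (a sketch using the sample text from the question):

```python
# str holds Unicode characters; bytes holds encoded data.
text = '奥多比'
data = text.encode('utf-8')          # encode: characters -> bytes
assert data == b'\xe5\xa5\xa5\xe5\xa4\x9a\xe6\xaf\x94'
assert data.decode('utf-8') == text  # decode: bytes -> characters
```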

Python encoding in vim

Trying to understand encoding/decoding/unicode business in Python2.7 with vim.
I have a unicode string us to which I assign some unicode string u'é'.
Question 1
How is us represented in memory? Is it a sequence of 32-bit integers holding the Unicode code points (the \u escapes)? Or is it kept in memory as a sequence of 8-bit bytes (the \x escapes) in some default encoding?
Question 2
I see four different ways to set encoding for the unicode string us: #1 in the beginning of the test.py file; #2 as an argument of encode function; #3 as an argument for vim; #4 as a local encoding of the file system. So, what do all these four encodings (#1,#2,#3,#4) do?
$ vim test.py
_____________
#encoding: #1
us=u'é'
print us.encode(encoding='#2')
_____________
:set encoding=#3
$ locale | grep LANG
LANG=en_US.#4
LANGUAGE=
In Python 2.x, unicode strings are stored as either UCS-2 or UCS-4, depending on the options used when building the interpreter.
Source encoding as far as Python is concerned.
Encoding used to encode us as bytes when the code is executed.
Source encoding as far as vim is concerned. If this doesn't match #1 then expect trouble.
System encoding. Mostly affects filesystem and terminal output operations.
Question 1 - Storage
us = u'é'
This creates a Unicode string containing the character é. In Python 2.2+, Unicode strings are stored as UCS-2 or UCS-4, which use 2- or 4-byte unsigned integers, depending on a build-time option.
Python 3.3+ uses a flexible internal representation (PEP 393), which uses 1, 2 or 4 bytes per character depending on the characters in the string.
The storage of a Unicode string now depends on the highest code point in
the string:
pure ASCII and Latin-1 strings (U+0000-U+00FF) use 1 byte per code point;
BMP strings (U+0100-U+FFFF) use 2 bytes per code point;
strings containing code points from other planes (U+10000-U+10FFFF) use 4 bytes per code point.
Question 2 - Encoding
us=u'é'
Declares us to be a Unicode string stored as above, note that in python 3 all strings are by default Unicode so the u can be omitted.
print(us.encode('ascii', 'strict'))  # encoding='#2'
Tells print how to attempt to translate the Unicode string for output, note that if you are using Python 3.3+ and a Unicode capable terminal/console you probably don't need to ever use this.
#set encoding=#3
Tells vim, emacs and a number of other editors which encoding to use when displaying and/or editing the file. It applies to all text files, not just Python source.
$ locale | grep LANG
LANG=en_US.#4
Is an operating system locale setting that tells it how to display various things, specifically which character encoding (code page) to use when displaying non-ASCII text.
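Encoding #2 above also takes an error handler as its second argument; a Python 3 sketch of the options ('strict' is the default and raises, the others degrade gracefully):

```python
us = 'é'  # the string from the question

# 'strict' (the default) raises when a character has no ASCII form:
try:
    us.encode('ascii', 'strict')
except UnicodeEncodeError:
    print('é is not representable in ASCII')

# 'replace' and 'ignore' substitute or drop the offending character:
assert us.encode('ascii', 'replace') == b'?'
assert us.encode('ascii', 'ignore') == b''

# With a codec that covers the character, no handler is needed:
assert us.encode('utf-8') == b'\xc3\xa9'
```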
This doesn't actually answer the question but I'm hoping it gives some more insight into this problem.
Answer to question 1: it shouldn't matter to the programmer how Unicode strings are represented internally in Python.
To question 2:
All the programmer should care about is that the encoding requirements of the data sink and source are known and correctly specified. I would assume that Python can correctly interpret UTF-encoded files by reading the BOM, and maybe even by making educated guesses, but without the BOM it can be ambiguous how to handle bytes with the high bit set. So it's advisable to either make sure the BOM is there, or to tell Python that the file is UTF-8 encoded if you're not sure.
There's a difference between "Unicode" and "UTF" that seems to be glossed over above: "UTF" specifies the representation in storage (disk, memory, network packet), but "Unicode" is simply the fact that each character has a single value (code point) that ranges from 0 to 0x10FFFF. The various flavors of UTF encode that value into the appropriate storage. Working with encoded strings can be annoying though (as the character width is variable), so when strings are actually represented in memory it's often easier to expand them into some format that allows for easy manipulation. (This is touched on in a comment on another answer.)
If you want a Unicode string in Python pre-3, just type u'<whatever>' and in 3+ type '<whatever>'. You'll get Unicode and you can use \uXXXX and \UXXXXXXXX escapes if it's infeasible to just type the characters in directly. When you want to write the data, specify the encoding. UTF-8 is often the easiest to deal with and seems to be the most commonly used but you may have reason to use a UTF-16 flavor.
The takeaway here is that the encoding is just a way to transform Unicode data so that it can be persisted. The various flavors of UTF are just the encodings, they are not actually Unicode.
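That distinction can be made concrete in a few lines (a sketch: one code point, several encoded forms):

```python
# One Unicode string; the code point (U+00E9) is the same throughout,
# only the stored bytes differ per UTF flavor.
text = 'é'
assert ord(text) == 0xE9

assert text.encode('utf-8') == b'\xc3\xa9'
assert text.encode('utf-16-le') == b'\xe9\x00'
assert text.encode('utf-32-le') == b'\xe9\x00\x00\x00'

# Every flavor round-trips back to the same text:
for codec in ('utf-8', 'utf-16', 'utf-32'):
    assert text.encode(codec).decode(codec) == text
```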

Convert UTF-8 to string literals in Python

I have a string in UTF-8 format but not so sure how to convert this string to it's corresponding character literal. For example I have the string:
My string is: 'Entre\xc3\xa9'
Example one:
This code:
u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')
returns the result: u'Entre\xe9'
If I then continue by printing this:
print u'Entre\xe9'
I get the result: Entreé
This is great and close to what I need. The problem is, I can't make 'Entre\xc3\xa9' a variable and pass it through the steps, as this now breaks. Any tips for getting this working?
Example:
a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b
I would like result of "c" to be:
Entreé
The u'' syntax only works for string literals, e.g. defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.
You cannot make a unicode value from a byte string by adding u in front of it. But if you called str.decode() with the right encoding, you get a unicode value. Vice-versa, you can encode unicode objects to byte strings with unicode.encode().
Note that when displaying a unicode object, Python represents it by using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back in to a Python interpreter and get an object with the same value.
Your a value is defined using a byte string literal, so you only need to decode:
a = 'Entre\xc3\xa9'
b = a.decode('utf8')
Your first example created a Mojibake, a Unicode string containing Latin-1 codepoints that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.
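The same Mojibake repair also works in Python 3, where 'Entre\xc3\xa9' is a str whose characters Ã and © are really misread UTF-8 bytes (a sketch of the round-trip):

```python
# A str containing Latin-1 code points that are really UTF-8 bytes:
a = 'Entre\xc3\xa9'

# Encode to Latin-1 to recover the original bytes, then decode as UTF-8:
fixed = a.encode('latin-1').decode('utf-8')
assert fixed == 'Entreé'
```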
You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder

Is u'string' the same as 'string'.decode('XXX')

Although the title is a question, the short answer is apparently no. I've tried in the shell. The real question is why?
ps: string is some non-ascii characters like Chinese and XXX is the current encoding of string
>>> u'中文' == '中文'.decode('gbk')
False
# The first one is u'\xd6\xd0\xce\xc4' while the second one is u'\u4e2d\u6587'
The example is above. I am using Windows Chinese Simplified. The default encoding is gbk, and so is the Python shell's. And I got the two unicode objects unequal.
UPDATES
a = '中文'.decode('gbk')
>>> a
u'\u4e2d\u6587'
>>> print a
中文
>>> b = u'中文'
>>> print b
ÖÐÎÄ
Yes, str.decode() usually returns a unicode string, if the codec can successfully decode the bytes. But the values only represent the same text if the correct codec is used.
Your sample text is not using the right codec; you have text that is GBK encoded, decoded as Latin1:
>>> print u'\u4e2d\u6587'
中文
>>> u'\u4e2d\u6587'.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
u'\xd6\xd0\xce\xc4'
The values are indeed not equal, because they are not the same text.
Again, it is important that you use the right codec; a different codec will result in very different results:
>>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
ÖÐÎÄ
I encoded the sample text to Latin-1, not GBK or UTF-8. Decoding may have succeeded, but the resulting text is not readable.
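The mismatch can be reproduced in Python 3, where the types make the steps explicit (a sketch using the answer's sample text):

```python
# GBK-encode the text, then decode with the wrong codec:
text = '中文'
data = text.encode('gbk')
assert data == b'\xd6\xd0\xce\xc4'

# Decoding with Latin-1 "succeeds" (it never fails) but yields Mojibake:
assert data.decode('latin1') == 'ÖÐÎÄ'

# Only the matching codec recovers the original text:
assert data.decode('gbk') == text
```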
Note also that pasting non-ASCII characters only works because the Python interpreter has detected my terminal codec correctly. I can paste text from my browser into my terminal, which then passes the text to Python as UTF-8-encoded data. Because Python asked the terminal what codec it uses, it was able to decode the u'....' Unicode literal value back again. When printing the encoded.decode('utf8') unicode result, Python once more auto-encodes the data to fit my terminal encoding.
To see what codec Python detected, print sys.stdin.encoding:
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
Similar decisions have to be made when dealing with different sources of text. Reading string literals from the source file, for example, requires that you either use ASCII only (and use escape codes for everything else), or provide Python with an explicit codec notation at the top of the file.
I urge you to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
to gain a more complete understanding on how Unicode works, and how Python handles Unicode.
Assuming Python2.7 by the title.
The answer is no, because when you issue string.decode(XXX), you'll get a unicode object whose contents depend on the codec you pass as an argument.
When you use u'string', the codec is inferred from the shell's current encoding or, in a source file, defaults to ascii (in Python 2) unless overridden by a # coding: utf-8 special comment at the beginning of the script.
Just to clarify: if codec XXX is guaranteed to always be the same codec used for the script's input (either the shell or the file), then both approaches behave pretty much the same.
Hope this helps!