Decoding Python Unicode strings that contain double backslashes - python

My strings look like this: \\xec\\x88\\x98. If I print them they look like this: \xec\x88\x98, and when I decode them they still look like this: \xec\x88\x98.
If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want 수.
If I call x.decode('unicode-escape'), it removes the double backslashes, but decoding the value returned by x.decode('unicode-escape') gives me ì.
How would I go about decoding the original \\xec\\x88\\x98 so that I get the correct output?

In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.
Here's a short demo. Unfortunately, my terminal's font doesn't have that character, so I can't print it. Instead I'll print its name and its representation, and I'll also convert it to a Unicode-escape sequence.
import unicodedata as ud
src = '\\xec\\x88\\x98'
print repr(src)
s = src.decode('string-escape')   # collapse the '\\xNN' escapes into real bytes
print repr(s)
u = s.decode('utf8')              # interpret those bytes as UTF-8
print ud.name(u)
print repr(u), u.encode('unicode-escape')
Output:
'\\xec\\x88\\x98'
'\xec\x88\x98'
HANGUL SYLLABLE SU
u'\uc218' \uc218
However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of that string-escaped UTF-8 that you're currently getting.
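For illustration only, here is one plausible way such escaped text can arise upstream; this is an assumption about the spider, not something visible from the question:
# Hypothetical: the spider escapes the bytes before storing them.
data = u'\uc218'.encode('utf-8')       # '\xec\x88\x98' -- real UTF-8 bytes
stored = data.encode('string-escape')  # '\\xec\\x88\\x98' -- escaped text
# Storing `data` itself, rather than `stored`, avoids the band-aid decode.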

Related

Turning UTF sequences into their real symbols

I just made my first web scraper by myself; it goes to Wikipedia and downloads the HTML of the whole page. I managed to get just the content of a list. The values in the list are numbers, either positive or negative.
But instead of printing a '-2' it gives me a '\xe2\x88\x922'. I tried string.replace("\xe2\x88\x92", "-"), but this doesn't seem to work, due to the backslashes.
Do you know how I can convert these UTF sequences into their real symbols?
I used urllib to get the HTML content, if that is important.
You can use bytes.decode to convert it:
>>> b'\xe2\x88\x922'.decode("utf8")
'-2'
And if your data doesn't start with b (i.e. if it is not a bytes object), you can first convert it to bytes then decode:
>>> s = '\xe2\x88\x922'
>>> byte_object = bytes(ord(c) for c in s)
>>> byte_object.decode("utf8")
'-2'
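Once decoded, the replace you originally attempted also works, since the minus sign is now a single Unicode character rather than three escaped bytes:
>>> b'\xe2\x88\x922'.decode("utf8").replace("\u2212", "-")
'-2'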
That is unfortunately common when reading data from web pages: they contain characters that look like standard ASCII characters but are not.
Here you have a MINUS SIGN (Unicode U+2212) −, which looks like the normal HYPHEN-MINUS (Unicode U+002D, ASCII 0x2D) -.
In UTF-8 the minus sign alone is encoded as b'\xe2\x88\x92', so '−2' becomes b'\xe2\x88\x922'. It probably means that you read the page as if it were Latin-1 encoded while it is actually UTF-8 encoded.
A trick to correctly recode it is to encode the string as Latin-1 (which exactly undoes the accidental Latin-1 decode, since Latin-1 maps code points 0-255 one-to-one to byte values) and decode it back as UTF-8:
t = '\xe2\x88\x922'
print(t.encode('latin1').decode())
−2

Encoding characters with ISO 8859-1 in Python

With ord(ch) you can get a numerical code for character ch up to 127. Is there any function that returns a number from 0-255, so as to also cover ISO 8859-1 characters?
Edit: here is my latest version of the code and the error I get:
#!/usr/bin/python
# coding: iso-8859-1
import sys
reload(sys)
sys.setdefaultencoding('iso-8859-1')
print sys.getdefaultencoding() # prints "iso-8859-1"
def char_code(c):
    return ord(c.encode('iso-8859-1'))
print char_code(u'à')
I get an error:
TypeError: ord() expected a character, but string of length 2 found
When you're starting with a Unicode string, you need to encode rather than decode.
>>> def char_code(c):
...     return ord(c.encode('iso-8859-1'))
>>> print char_code(u'à')
224
For ISO-8859-1 in particular, you don't even need to encode it at all, since Unicode uses the ISO-8859-1 characters for its first 256 code points.
>>> print ord(u'à')
224
Edit: I see the problem now. You've given a source code encoding comment that indicates the source is in ISO-8859-1. However, I'll bet that your editor is actually working in UTF-8. The source code will be mis-interpreted, and the single-character string you think you created will actually be two characters. Try the following to see:
print len(u'à')
If your encoding is correct, it will return 1, but in your case it's probably 2.
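A quick demonstration of the mismatch, assuming the file really is saved as UTF-8 while declared as ISO-8859-1 (a hypothetical reconstruction of what the interpreter sees):
# The two UTF-8 bytes of 'à' become two separate characters when
# misread as ISO-8859-1.
data = u'\xe0'.encode('utf-8')        # '\xc3\xa0' -- two bytes
misread = data.decode('iso-8859-1')   # u'\xc3\xa0' -- two characters: Ã and a non-breaking space
print len(misread)                    # 2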
You can use ord() on anything. As you might expect, ord(u'💩') works fine, provided you can represent the character properly in your source, and/or read it in a known encoding.
Your error message vaguely suggests that coding: iso-8859-1 is not actually true, and the file's encoding is actually something else (UTF-8 or UTF-16 would be my guess).
The canonical must-read on character encoding in Python is http://nedbatchelder.com/text/unipain.html
You can still use ord(), but you have to decode it.
Like this:
def char_code(c):
    return ord(c.decode('iso-8859-1'))
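This variant assumes c is a byte string (for example, bytes read from an ISO-8859-1 encoded file) rather than a unicode string:
>>> char_code('\xe0')  # '\xe0' is 'à' in ISO-8859-1
224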

Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters

I use python 2.7 and I'm receiving a string from a server (not in unicode!).
Inside that string I find text with unicode escape sequences. For example like this:
<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>
How do I convert those \uxxxx back to utf-8? The answers I found were either dealing with &# or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.
Edit:
<\a> is a typo, but I want tolerance for such typos as well. It should only react to \u escapes.
In proper Python syntax, the example text is:
"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
The desired output, in proper Python syntax, is:
"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"
Try
>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'
And then you can encode to utf8 as usual.
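For example, continuing with s from above, the full round trip to the desired UTF-8 output from the question is:
>>> s.decode("raw_unicode_escape").encode("utf8")
'<a href = "http://www.mypage.com/\xd1\x81andmoretext">\xc2\xb2<\\a>'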
Python does contain some special string codecs for cases like this.
In this case, if there are no other characters outside the 32-127 range, you can safely decode your byte string using the "unicode_escape" codec to get a proper Unicode text object in Python (on which your program should be performing all textual operations).
Whenever you output that text again, you convert it to UTF-8 as usual:
rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here
...
output_text = text.encode("utf-8")
If there are other bytes outside the 32-127 range, the unicode_escape codec assumes them to be in the Latin-1 encoding. So if your response mixes UTF-8 and these \uXXXX sequences, you have to (see the sketch after this list):
decode the original string using utf-8
encode back to latin1
decode using "unicode_escape"
work on the text
encode back to utf-8
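A minimal sketch of that round trip in Python 2 (the sample string here is hypothetical, mixing real UTF-8 bytes with a literal \u escape):
# '\xc3\xa9' is the UTF-8 encoding of 'é'; '\\u0441' arrives as a
# literal backslash-u sequence from the server.
raw = 'caf\xc3\xa9 \\u0441'
text = raw.decode('utf-8')            # 1. decode the real UTF-8 bytes
text = text.encode('latin1')          # 2. back to bytes, one byte per code point
text = text.decode('unicode_escape')  # 3. resolve the \uXXXX escapes
# 4. ... work on the text here ...
output_text = text.encode('utf-8')    # 5. encode back to UTF-8
print repr(text)  # u'caf\xe9 \u0441'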

Character Encoding, XML, Excel, python

I am reading a list of strings that were imported into an Excel XML file from another software program. I am not sure what the encoding of the Excel file is, but I am pretty sure it's not windows-1252, because when I try to use that encoding, I wind up with a lot of errors.
The specific word that is causing me trouble right now is: "Zmysłowska, Magdalena" (notice the "l" is not a standard "l", but rather has a slash through it).
I have tried a few things; I'll mention three of them here:
(1)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
page = page.encode("utf-8", "ignore")
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena
(2)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
Output: Zmys\u0142owska, Magdalena
Output after print statement: Zmysłowska, Magdalena
Note: this is great, but I need to encode it back to utf-8 before putting the string into my db. When I do that, by running page.encode("utf-8", "ignore"), I end up with ZmysÅ‚owska, Magdalena again.
(3)
Do nothing (no normalization, no decode, no encode). It seems like the string is already utf-8 when it comes in. However, when I do nothing, the string ends up with the following output again:
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena
Is there a way for me to convert this string to utf-8?
Your problem isn't your encoding and decoding. Your code correctly takes a UTF-8 string, and converts it to an NFKD-normalized UTF-8 string. (You might want to use page.decode("utf-8") instead of unicode(page, "utf-8") just for future-proofing in case you ever go to Python 3, and to make the code a bit easier to read because the encode and decode are more obviously parallel, but you don't have to; the two are equivalent.)
Your actual problem is that you're printing UTF-8 strings to some context that isn't UTF-8. Most likely you're printing to the cmd window, which defaults to Windows-1252. So cmd tries to interpret the UTF-8 bytes as Windows-1252 and gets garbage.
There's a pretty easy way to test this. Make Python decode the UTF-8 string as if it were Windows-1252 and see if the resulting Unicode string looks like what you're seeing.
>>> print page.decode('windows-1252')
ZmysÅ‚owska, Magdalena
>>> print repr(page.decode('windows-1252'))
u'Zmys\xc5\u201aowska, Magdalena'
There are two ways around this:
Print Unicode strings and let Python take care of it.
Print strings converted to the appropriate encoding.
For option 1:
print page.decode("utf-8") # or unicode(page, "utf-8")
For option 2, it's going to be one of the following:
print page.decode("utf-8").encode("windows-1252")
print page.decode("utf-8").encode(sys.getdefaultencoding())
Of course if you keep the intermediate Unicode string around, you don't need all those decode calls:
upage = page.decode("utf-8")
upage = unicodedata.normalize("NFKD", upage)
page = upage.encode("utf-8", "ignore")
print upage

Python UnicodeEncodeError / Wikipedia-API

I am trying to parse this document with Python and BeautifulSoup:
http://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search=rage_against_the_machine
The seventh item down has this Text tag:
Rage Against the Machine's 1994–1995 Tour
When I try to print out the text "Rage Against the Machine's 1994–1995 Tour", python is giving me this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 31: ordinal not in range(128)
I can resolve it by simply replacing u'\u2013' with '-' like so:
itemText = itemText.replace(u'\u2013', '-')
However what about every character that I have not coded for? I do not want to ignore them nor do I want to list out every possible find and replace.
Surely a library must exist to try its very best to detect the encoding from a list of common known encodings (however likely it is to get it wrong).
someText = getTextWithUnknownEncoding(someLocation);
bestAsciiAttemptText = someLibrary.tryYourBestToConvertToAscii(someText)
Thank you
Decoding it as UTF-8 should work:
itemText = itemText.decode('utf-8')
Normally, you should try to preserve characters as Unicode or UTF-8. Avoid converting characters to your local codepage, as this results in loss of information.
However, if you must, here are a few things to do. Let's use your example character:
>>> s = u'\u2013'
If you want to print the string e.g. for debugging, you can use repr:
>>> print(repr(s))
u'\u2013'
In an interactive session, you can just type the variable name to achieve the same result:
>>> s
u'\u2013'
If you really want to convert the text to your local codepage, and it is OK that characters outside this codepage are converted to '?', you can use this:
>>> s.encode('latin-1', 'replace')
'?'
If '?' is not good enough, you can use translate to convert selected characters into an equivalent character as in this answer.
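For illustration, a minimal sketch of that approach; the mapping below is hypothetical, just a few hand-picked characters rather than a complete table:
# unicode.translate takes a dict keyed by code point (Python 2).
table = {
    0x2013: u'-',    # EN DASH -> hyphen-minus
    0x2014: u'--',   # EM DASH -> two hyphens
    0x2212: u'-',    # MINUS SIGN -> hyphen-minus
}
print u'1994\u20131995'.translate(table)  # prints: 1994-1995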
You may need to explicitly declare your encoding.
On the first line of your file (or after the hashbang, if there is one), add the following line:
# -*- coding: utf-8 -*-
This 'magic comment' tells Python to decode the source file as UTF-8, so such characters in your source should decode successfully.
More details: http://www.python.org/dev/peps/pep-0263/
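For example, a minimal (hypothetical) script header:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# With the declaration above, non-ASCII literals in this file are
# decoded as UTF-8:
tour = u'1994–1995 Tour'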
