Python UnicodeEncodeError / Wikipedia-API - python

I am trying to parse this document with Python and BeautifulSoup:
http://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search=rage_against_the_machine
The seventh Item down has this Text tag:
Rage Against the Machine's 1994–1995
Tour
When I try to print out the text "Rage Against the Machine's 1994–1995 Tour", python is giving me this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 31: ordinal not in range(128)
I can resolve it by simply replacing u'\u2013' with '-' like so:
itemText = itemText.replace(u'\u2013', '-')
However what about every character that I have not coded for? I do not want to ignore them nor do I want to list out every possible find and replace.
Surely a library must exist that tries its very best to detect the encoding from a list of common known encodings (however likely it is to get it wrong).
someText = getTextWithUnknownEncoding(someLocation);
bestAsciiAttemptText = someLibrary.tryYourBestToConvertToAscii(someText)
Thank you

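Encoding it as UTF-8 before printing should work (the text is already a unicode object, so calling decode on it would trigger the same implicit ASCII encode):
itemText = itemText.encode('utf-8')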

Normally, you should try to preserve characters as unicode or utf-8. Avoid converting characters to your local codepage, as this results in loss of information.
However, if you must, here are a few things to do. Let's use your example character:
>>> s = u'\u2013'
If you want to print the string e.g. for debugging, you can use repr:
>>> print(repr(s))
u'\u2013'
In an interactive session, you can just type the variable name to achieve the same result:
>>> s
u'\u2013'
If you really want to convert the text to your local codepage, and it is OK that characters outside this codepage are converted to '?', you can use this:
>>> s.encode('latin-1', 'replace')
'?'
If '?' is not good enough, you can use translate to convert selected characters into an equivalent character as in this answer.
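As a rough illustration of the translate approach, the sketch below maps a few punctuation characters to ASCII stand-ins, strips accents, and only then falls back to '?'. The table ASCII_PUNCTUATION and the function best_effort_ascii are hypothetical names chosen for this example, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical replacement table: code point -> ASCII stand-in.
ASCII_PUNCTUATION = {
    0x2013: u"-",   # en dash
    0x2014: u"-",   # em dash
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}


def best_effort_ascii(text):
    """Return an ASCII approximation of a unicode string."""
    # 1. Swap known punctuation for close ASCII equivalents.
    text = text.translate(ASCII_PUNCTUATION)
    # 2. Decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = u"".join(c for c in decomposed if not unicodedata.combining(c))
    # 3. Anything still outside ASCII becomes '?'.
    return stripped.encode("ascii", "replace").decode("ascii")


title = u"Rage Against the Machine's 1994\u20131995 Tour"
print(best_effort_ascii(title))
```

This is still lossy: characters with no entry in the table and no decomposition (Chinese text, for example) end up as '?', so it is only suitable where an ASCII approximation is genuinely required.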

You may need to explicitly declare your encoding.
On the first line of your file (or after the hashbang, if there is one), add the following line:
# -*- coding: utf-8 -*-
This 'magic comment' forces Python to expect UTF-8 characters and should decode them successfully.
More details: http://www.python.org/dev/peps/pep-0263/

Related

How to solve garbled characters starting with "\u0e" in the Robot Framework log [duplicate]

I'm sanitizing a pandas dataframe and have encountered a unicode string containing a backslash followed by a u that I need to replace, e.g.
u'\u2014'.replace('\u','')
Result: u'\u2014'
I've tried encoding it as utf-8 then decoding it but that didn't work and I feel there must be an easier way around this.
pandas code
merged['Rank World Bank'] = merged['Rank World Bank'].astype(str)
Error
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 0: ordinal not in range(128)
u'\u2014' is actually — (an em dash). It's not a number; it's a Unicode character. Try printing it and you will see.
This is the output in ipython:
In [4]: print("val = ", u'\u2014')
val = —
Based on your comment, here is what you are doing wrong
"-" is not same as "EM Dash" Unicode character(u'\u2014')
So, you should do the following
print(u'\u2014'.replace(u"\u2014", u""))
and that will work
EDIT:
since you are using python 2.x, you have to encode it with utf-8 as follows
u'\u2014'.encode('utf-8').decode('utf-8').replace("-","")
Yeah, because Python reads '\u' followed by '2014' as a single Unicode character, not as literal text.
Things that can help:
Converting to ascii using .encode('ascii', 'ignore')
As you are using pandas, you can use 'encoding' parameter and pass 'ascii' there.
Do this instead : u'\u2014'.replace(u'\u2014', u'2014').encode('ascii', 'ignore')
Hope this helps.
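To make the difference concrete, here is a minimal sketch (not taken from the answers above) comparing the em dash character with the six-character escaped text. The helper name strip_em_dashes is hypothetical, and the snippet is written so that it should run on both Python 2 and Python 3.

```python
em_dash = u"\u2014"        # one character: EM DASH
escaped_text = u"\\u2014"  # six characters: a backslash, then 'u2014'

print(len(em_dash))        # 1
print(len(escaped_text))   # 6


def strip_em_dashes(value):
    """Hypothetical helper: remove em dashes from a unicode string."""
    # The search pattern must itself be a unicode literal.
    return value.replace(u"\u2014", u"")


cleaned = strip_em_dashes(u"\u2014 12")
print(repr(cleaned))
```

Searching for the text '\u' cannot match anything here, because the stored value never contained a backslash in the first place.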

"ascii" codec can't encode characters in position 0-2: ordinal not in range(128)

I am using python 2.7 and used Chinese characters in my code, so...
# coding = utf-8
and the problem is part of my code, as follows:
def fileoutput():
    global percent_shown
    date = str(datetime.datetime.now()).decode('utf-8')
    with open("result.txt","a") as datafile:
        datafile.write(date+" "+str(percent_shown.get()))
percent_shown is a string that includes Chinese characters
When I run it, I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
How to fix it? Thanks
As per PEP 263, the coding declaration must match the regular expression r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)" so you need to get rid of the space between "coding" and the equal sign:
# coding=utf-8
This declaration tells python that the .py file itself is utf-8 encoded, but doesn't change the rest of the program. This is useful if you are writing unicode literals, but you still need to convert your strings to unicode properly to make sure things work.
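The effect of the space can be checked directly against the regular expression quoted above. This is only a sketch: the helper name declared_encoding is made up for illustration, and the pattern is copied from this answer rather than from the interpreter's own source.

```python
import re

# Pattern as quoted in the answer above (attributed there to PEP 263).
CODING_RE = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(line):
    """Return the encoding named by a coding comment, or None."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None


print(declared_encoding("# coding = utf-8"))         # None: space before '='
print(declared_encoding("# coding=utf-8"))           # utf-8
print(declared_encoding("# -*- coding: utf-8 -*-"))  # utf-8
```

Whitespace is allowed after the ':' or '=', but not between the word coding and that character.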
Since you haven't shown us what you are trying to print, I found some Chinese characters to demonstrate. I have no idea what they mean... so apologies to anyone I insult!
foo = u"学而设" # Good! you've got a unicode string
bar = "学而设" # Bad! you've got a utf-8 encoded string that python
# thinks is ascii
I think you can fix your program with a few tweaks. First, don't try to decode datetime.now(). It's just ASCII. It didn't change its return type just because you declared the source file encoding. Second, use the codecs module to open the file with the encoding you want (I'm assuming it's utf-8). Now, since you are working with unicode strings, you can write them directly to the file.
import codecs
import datetime

def fileoutput():
    # str(datetime.now()) is plain ASCII, so unicode() converts it safely
    date = unicode(datetime.datetime.now())
    with codecs.open("result.txt","a", encoding="utf-8") as datafile:
        datafile.write(date+" "+percent_shown.get())
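For readers who want something that can be run outside the original program, here is a hedged variant of the same idea. It uses io.open instead of codecs.open so that it should behave the same on Python 2 and Python 3; the function name append_result, its parameters, and the temporary file location are assumptions made for this example only.

```python
# coding=utf-8
import datetime
import io
import os
import tempfile


def append_result(path, label):
    """Append a timestamped line of unicode text to a UTF-8 file."""
    stamp = datetime.datetime.now().isoformat()
    if isinstance(stamp, bytes):        # Python 2: isoformat() returns a byte str
        stamp = stamp.decode("ascii")
    with io.open(path, "a", encoding="utf-8") as datafile:
        datafile.write(stamp + u" " + label + u"\n")


# Write into a throwaway directory, then read the text back as unicode.
path = os.path.join(tempfile.mkdtemp(), "result.txt")
append_result(path, u"学而设")
with io.open(path, encoding="utf-8") as datafile:
    saved = datafile.read()
print(saved.endswith(u"学而设\n"))
```

The important point is unchanged: keep the text as unicode in the program and let the file object do the encoding at the boundary.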
You can't have whitespace before the = in your coding comment. Try:
# coding=utf-8
See the regular expression in: https://www.python.org/dev/peps/pep-0263/

'ascii' codec can't encode character u'\xe9'

I have already tried all the previous answers and solutions.
I am trying to use this value, which gave me encoding related error.
ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']
So I tried,
d = [x.decode('utf-8') for x in ar]
which gives:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)
I tried out
d = [x.encode('utf-8') for x in ar]
which removes error but changes the original content
original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' which converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno' while using encode
what is correct way to deal with this scenario?
Edit
Error comes when I feed these links in
req = urllib2.Request()
The second version of your string is the correct utf-8 representation of your original unicode string. If you want to have a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use Unicode strings internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (storage subsystem and user inputs subsystem).
Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.
Unicode strings in python are "raw" unicode, so make sure to .encode() and .decode() them as appropriate. Using utf8 encoding is considered a best practice among multiple dev groups all over the world.
To encode use the quote function from the urllib2 library:
from urllib2 import quote
escaped_string = quote(unicode_string.encode('utf-8'))
To decode, use unquote:
from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
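A version-tolerant sketch of the same round trip follows. The import fallback is an assumption added here so the snippet should run on Python 3 as well (where quote and unquote live in urllib.parse), and safe=':/' is chosen so that the scheme separator is left alone.

```python
try:
    from urllib import quote, unquote        # Python 2
except ImportError:
    from urllib.parse import quote, unquote  # Python 3

url = u"http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno"

# Encode to UTF-8 bytes first, then percent-escape the non-ASCII bytes.
escaped = quote(url.encode("utf-8"), safe=":/")
print(escaped)

# Reverse the process; on Python 2 unquote returns bytes that still need decoding.
raw = unquote(escaped)
restored = raw.decode("utf-8") if isinstance(raw, bytes) else raw
print(restored == url)
```

The escaped form contains only ASCII characters, so it can be handed to a request object without triggering the implicit ASCII encode.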
Also, if you're more interested in Unicode and UTF-8 work, check out Unicode HOWTO and
In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno
Your .encode("UTF-8") is correct and would have looked ok in a UTF-8 editor or browser. What you saw after the encode was an ASCII safe representation of UTF-8.
For example, your trouble chars were é and í.
é = 00E9 Unicode = C3A9 UTF-8
í = 00ED Unicode = C3AD UTF-8
In short, your .encode() method is correct and should be used for writing to files or to a browser.
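The table above can be verified with a few lines. This is only an illustrative check, with variable names chosen for the example; it shows that the encoded bytes decode back to exactly the original text.

```python
name = u"Jos\xe9_El\xedas_Moreno"

encoded = name.encode("utf-8")
print(repr(encoded))            # ASCII-safe view of the UTF-8 bytes

# Each accented letter is one character but two bytes in UTF-8.
e_acute = u"\xe9".encode("utf-8")
i_acute = u"\xed".encode("utf-8")
print(len(name))
print(len(encoded))

# Decoding restores the original unicode string, so nothing was lost.
round_tripped = encoded.decode("utf-8")
print(round_tripped == name)
```

The repr output looks different from the original only because it escapes the non-ASCII bytes for display.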

Python string opts with unicode, UnicodeDecodeError

Suppose I had a string with some unicode characters inside it and needed to do operations on it. What would be the best way to do so?
s = u"blah ascii_word etc شاهد word1 word 2" # Delimited by spaces
words = s.split(u' ')
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 91: ordinal not in range(128)
Any clues?
Also, If I wanted to write this code into a text file and read it back later, what would be the procedure?
When you declare a string the way you do, Python assumes it is in your default system encoding. You have to add u before the string to make it unicode and add an encoding declaration at the top of your file. If you do this, you won't get any errors:
# -*- coding: utf-8 -*-
s = u"blah ascii_word etc شاهد word1 word 2"
words = s.split(u' ')
print words
# no error even though my default system's encoding is ascii
I've checked this now and you don't even need the u - adding encoding is enough to fix the problem.
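As for writing the words to a text file and reading them back, one possible sketch is shown below. It assumes UTF-8 as the file encoding and uses io.open with a temporary file, neither of which comes from the question; the u prefix is kept so the literal is a unicode string on Python 2 as well.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

s = u"blah ascii_word etc شاهد word1 word 2"
words = s.split(u" ")

path = os.path.join(tempfile.mkdtemp(), "words.txt")

# Write one word per line; io.open encodes the unicode text as UTF-8.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"\n".join(words))

# Read the file back; io.open decodes the bytes to unicode again.
with io.open(path, encoding="utf-8") as handle:
    restored = handle.read().split(u"\n")

print(restored == words)
```

Opening the file with an explicit encoding in both directions avoids relying on the system default.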
If you want to do things with unicode strings in the terminal you have to check your system encoding and change it if necessary:
>>> import sys
>>> sys.getdefaultencoding()
'ascii' #I have ascii
You can then manipulate this by using sys.setdefaultencoding(). But this is a tricky issue which depends on your operating system.

How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?

For example, if I have a unicode string, I can encode it as an ASCII string like so:
>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'
However, I have e.g. this ASCII string:
'\u003cfoo/\u003e'
... that I want to turn into the same ASCII string as in my first example above:
'<foo/>'
It took me a while to figure this one out, but this page had the best answer:
>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'
There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-savvy).
EDIT: See also Python Standard Encodings.
On Python 2.5 the correct encoding is "unicode_escape", not "unicode-escape" (note the underscore).
I'm not sure whether a newer version of Python changed the codec name, but here it only worked with the underscore.
Anyway, this is it.
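For reference, here is a small sketch of the same decoding step written so that it should also run on Python 3, where only bytes objects have a decode method. The helper name unescape is hypothetical, and it assumes the input contains nothing but ASCII characters and backslash escapes.

```python
def unescape(value):
    """Turn literal \\uXXXX escapes in ASCII input into real characters."""
    # Accept either text or bytes; the codec itself works on bytes.
    if not isinstance(value, bytes):
        value = value.encode("ascii")
    return value.decode("unicode_escape")


escaped = b"\\u003cfoo/\\u003e"   # backslash, 'u003c', ... as literal bytes
text = unescape(escaped)
ascii_bytes = text.encode("ascii")

print(text)
```

Using the codec keeps the conversion a pure data transformation, with no evaluation of the input as code.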
At some point you will run into issues when you encounter special characters like Chinese characters or emoticons in a string you want to decode i.e. errors that look like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)
For my case (twitter data processing), I decoded as follows to allow me to see all characters with no errors:
>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
'<foo>'
Ned Batchelder said:
It's a little dangerous depending on where the string is coming from,
but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
Actually this method can be made safe like so:
>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]
Mind the triple-quote string and the dash right before the closing 3-quotes.
Using a 3-quoted string will ensure that if the user enters ' \\" ' (spaces added for visual clarity) in the string it would not disrupt the evaluator;
The dash at the end is a failsafe in case the user's string ends with a ' \" ' . Before we assign the result we slice the inserted dash with [:-1]
So there would be no need to worry about what the users enter, as long as it is captured in raw format.
