Troubles in printing unicode string that I have in byte format - python

Reading from a database I get the following value
b'd\xe2\x80\x99int'
How can I print it to get the string d’int (note that this is different from d'int)?
I tried with print(b'd\xe2\x80\x99int'.decode('utf-8')) but I get the error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 1: ordinal not in range(128)
EDIT: Thanks to the comment, I understood that the problem is not in my Python code but in Emacs. I am having exactly the same problem as described in "Unicode conversion issue using Python in Emacs".
I will close the question

You can use bytes.decode()
>>> bytes.decode(b'd\xe2\x80\x99int', 'utf8')
'd’int'
or analogously, the .decode() method over the bytes object itself:
>>> b'd\xe2\x80\x99int'.decode('utf-8')
'd’int'
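Note that the UnicodeEncodeError in the question is raised when the decoded text is written out, not by decode() itself. A minimal sketch that reproduces this, assuming the output stream is configured for ASCII (the in-memory streams below are stand-ins for a real terminal or Emacs buffer, not the asker's actual setup):

```python
import io

text = b'd\xe2\x80\x99int'.decode('utf-8')  # decoding itself succeeds

# Stand-in for a terminal whose encoding is ASCII.
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding='ascii', newline='\n')
try:
    print(text, file=ascii_out)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True  # U+2019 has no ASCII representation

# The same text written to a UTF-8 stream goes through.
utf8_buffer = io.BytesIO()
utf8_out = io.TextIOWrapper(utf8_buffer, encoding='utf-8', newline='\n')
print(text, file=utf8_out)
utf8_out.flush()
written = utf8_buffer.getvalue()
```

If the stream encoding really is the cause, setting the environment variable PYTHONIOENCODING=utf-8 before starting Python is one common workaround.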

Related

Python: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 3-4: ordinal not in range(256)

I found a site that fixes my mojibake using the Python package ftfy. I tried reproducing the steps it gives, although it seems to pre-convert the string before running those steps.
The string I am trying to fix is EvðŸ’👸ðŸ», although the site seems to pre-convert it to EvðŸâ\x80\x99Â\x9dðŸâ\x80\x98¸ðŸÂ\x8f» before attempting to fix it with the same steps as I am below.
My question is, how can I get my string into the same state as the site does before running the fix_broken_unicode function, to hopefully avoid the error I am facing?
When running my script, (probably due to me not pre-converting) I receive:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 3-4: ordinal not in range(256)
The source code for mentioned website can be found at: https://github.com/simonw/ftfy-web/blob/master/ftfy_app.py, although because I am primarily a C++ developer I can't understand it.
My script:
import ftfy.bad_codecs  # importing this registers the 'sloppy-windows-1252' codec

def fix_broken_unicode(string):
    string = string.encode('latin-1')
    string = string.decode('utf-8')
    string = string.encode('sloppy-windows-1252')
    string = string.decode('utf-8')
    return string

print(fix_broken_unicode("EvðŸ’👸ðŸ»"))
Updates since answer:
My input: "EvðŸ’👸ðŸ»", expected outcome: Ev💝👸🏻
Your data string might be missing some non-printable characters:
>>> s = 'EvðŸ’\x9d👸ðŸ\x8f»' # \x9d and \x8f aren't printable.
>>> print(s) # This looks like your mojibake.
EvðŸ’👸ðŸ»
>>> s.encode('mbcs').decode('utf8')
'Ev💝👸🏻'
Note that Python's mbcs codec corresponds to Windows default ANSI codec.
It matches "sloppy-windows-1252" only if Windows-1252 is the default ANSI codec (US- and Western European-localized versions of Windows), which is what I am running.
The other option is your original UTF-8 data was decoded with .decode('cp1252',errors='ignore'). If this is the case the two bytes were lost and the string isn't reversible.
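The difference between the recoverable and the unrecoverable case can be sketched without Windows or ftfy. The snippet below is only an illustration: sloppy_encode is a hypothetical helper that approximates what a "sloppy" Windows-1252 codec does (falling back to Latin-1 for the byte values that cp1252 leaves undefined), and only the first emoji of the example is used.

```python
original = 'Ev\U0001F49D'                 # 'Ev' followed by U+1F49D
raw = original.encode('utf-8')            # b'Ev\xf0\x9f\x92\x9d'

# Strict cp1252 decoding fails because byte 0x9d is undefined in that codec.
try:
    raw.decode('cp1252')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='ignore' silently drops 0x9d, so the mojibake is one byte short.
lossy = raw.decode('cp1252', errors='ignore')


def sloppy_encode(text):
    """Hypothetical helper: cp1252, with a latin-1 fallback per character."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode('cp1252')
        except UnicodeEncodeError:
            out += ch.encode('latin-1')
    return bytes(out)


# With the non-printable \x9d still present, the round trip works.
intact = 'Ev\u00f0\u0178\u2019\x9d'
recovered = sloppy_encode(intact).decode('utf-8')

# Without it, the UTF-8 sequence is truncated and cannot be decoded.
try:
    sloppy_encode(lossy).decode('utf-8')
    lossy_recoverable = True
except UnicodeDecodeError:
    lossy_recoverable = False
```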

Unexpected recurrence of Python unicode string ascii codec error

After days and months of desperation, I recently found a solution to the infamous UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 18: ordinal not in range(128). It was dealing with multilingual strings pretty well until recently, when I bumped into this error AGAIN!
I tried type(thatstring) and it returned Unicode.
So I tried:
thatstring=thatstring.decode('utf-8')
This was handling those multilanguage strings pretty well but it came back now. I also tried
thatstring=thatstring.decode('utf-8','ignore')
No use.
thatstring=thatstring.encode('utf-8','ignore')
bounces with the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 48: ordinal not in range(128) faster than its counterpart.
Please help me. Thanks.
You did the right thing by trying type(thatstring), but you didn't draw the right conclusion from the result.
A unicode string has already been decoded, so trying to decode it again will produce an error if it contains non-ascii characters. When you use decode() on a unicode object, you effectively force python to do something like this:
temp = thatstring.encode('ascii') # convert unicode to bytes first
thatstring = temp.decode('utf-8') # now decode bytes back to unicode
Obviously, the first line will blow up as soon as it finds a non-ascii character, which explains why you see a unicode encode error, even though you are trying to decode the string. So the simple answer to your problem is: don't do that!
Instead, whenever your program receives string inputs, and wants to make sure they're converted to unicode, it should do something like this:
if isinstance(thatstring, bytes):
    thatstring = thatstring.decode(encoding)
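Wrapped in a function, that check might look like the sketch below. The name to_text and the UTF-8 default are assumptions made for illustration; use whatever encoding your input actually has.

```python
def to_text(value, encoding='utf-8'):
    """Return text, decoding only when the input is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already decoded, leave it alone


from_bytes = to_text(b'caf\xc3\xa9')   # bytes are decoded once
from_text = to_text('caf\xe9')         # text passes through unchanged
```

Because already-decoded text is returned untouched, calling the helper twice on the same value is harmless.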

exceptions.UnicodeDecodeError - 'ascii' codec can't decode byte

I keep getting this error:
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
args = ('ascii', '\xe2\x9d\xb6 Senn =)', 0, 1, 'ordinal not in range(128)')
encoding = 'ascii'
end = 1
message = ''
object = '\xe2\x9d\xb6 Senn =)'
reason = 'ordinal not in range(128)'
start = 0
Using this code:
steamFriend = data['response']['players'][i]
n = steamUser(steamFriend['personaname'].encode("utf-8"), steamFriend['steamid'], steamFriend['avatarfull'], steamFriend['profileurl'], steamFriend['personastate'], False)
Some things to note here:
steamFriend is a JSON object
I get this error only sometimes, because the steamFriend['personaname'] contains some weird symbols (for example ❶), and I don't know how to parse this correctly so I don't get errors.
Any help is greatly appreciated.
Also, \xe2\x9d\xb6 Senn =) is supposed to represent ❶ Senn =), if that helps.
Without seeing the full code it is hard to tell, but it seems that steamUser expects ascii input. If that is the problem, you can solve it by:
steamFriend['personaname'].encode("ascii", errors="ignore")
or
steamFriend['personaname'].encode("ascii", errors="replace")
Obviously you will lose unicode characters in the process.
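To see what each error handler does to the name from the question, here is a small Python 3 sketch (the variable names are only for illustration):

```python
name = '\u2776 Senn =)'  # the persona name from the question

ignored = name.encode('ascii', errors='ignore')    # non-ASCII character dropped
replaced = name.encode('ascii', errors='replace')  # non-ASCII character becomes '?'
```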
If the quoted error is occurring on the n=... line, the implication is that steamFriend['personaname'] is a byte string, not a Unicode string.
Consequently when you ask to .encode it, Python has to decode the string to Unicode in order to be able to encode it back to bytes. An implicit decoding happens using the default encoding, which is ASCII, so because the byte string does not contain only ASCII you get a failure.
Are you sure you didn't mean to do:
steamFriend['personaname'].decode("utf-8")
Decoding the byte string '\xe2\x9d\xb6 Senn =)' using UTF-8 would give you the Unicode string u'\u2776 Senn =)', where U+2776 is ❶, so that would seem more like what you are after.
(Normally, however, JSON strings are explicitly Unicode, so it's not clear where you would have got the byte string from. How are you loading the JSON content?)
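As a quick check of that last point, the standard json module returns decoded text for string values. The payload below is a made-up stand-in shaped like the data in the question, and the sketch assumes Python 3:

```python
import json

# Hypothetical payload, not real API output.
payload = '{"response": {"players": [{"personaname": "\\u2776 Senn =)"}]}}'

data = json.loads(payload)
name = data['response']['players'][0]['personaname']

name_is_text = isinstance(name, str)  # JSON strings come back as text
encoded = name.encode('utf-8')        # encoding text to bytes is safe here
```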

Weird unicode issue

I've got the following problem. If I run my app in Eclipse it works OK, but when I run it in a standalone debugger I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0144' in position 7: ordinal not in range(128)
How can I fix it?
My code fragment:
x = x.replace("Ł", "L")
Guess, based on (insufficient) information provided:
You are running Python 2.x.
[Guess] x is a str object.
[Guess] Eclipse sets the default encoding to UTF-8.
The standalone debugger sets the default encoding to ascii.
Result: splat.
Solution (standard operating procedure for working with Unicode):
On input, convert all str objects to unicode.
Work in Unicode.
On output, encode all unicode objects using whatever encoding the consumer of the output is expecting.
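The three steps above can be sketched as follows. This is written for Python 3, where text is str and encoded data is bytes; the sample input and the choice of UTF-8 on both ends are assumptions for illustration.

```python
# Input: bytes arriving from outside (here, a UTF-8 encoded sample word).
raw_in = '\u0141\u00f3d\u017a'.encode('utf-8')

# 1. Decode on input.
text = raw_in.decode('utf-8')

# 2. Work in Unicode: replace U+0141 (L with stroke) with a plain 'L'.
text = text.replace('\u0141', 'L')

# 3. Encode on output, using the encoding the consumer expects.
raw_out = text.encode('utf-8')
```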
Important update Actually if x was a UTF-8-encoded str object, you should have got a message like UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 7: etc etc.
Note that your actual error message says UnicodeEncodeError: 'ascii' codec can't encode character u'\u0144' in position 7: etc etc. This indicates that whatever it is complaining about is (a) a unicode object and (b) at least 8 characters long. However, you are saying in effect that x is not a unicode object (otherwise x.decode('utf8') would fail), and the other two args of replace are only 1 character long. Consequently we have an impossibility.
To help resolve this:
print type(x), repr(x) # for Python 2.x
Lstroke = "Ł"
print type(Lstroke), repr(Lstroke)
y = x.replace(Lstroke, 'L')
and edit your question to show the actual code that you ran plus the full error message and the traceback.
By the way: u'\u0144' is LATIN SMALL LETTER N WITH ACUTE; does that info help at all?
Try adding # -*- coding: utf-8 -*- to the top of the file to make the Python interpreter aware of which encoding the file uses (UTF-8 in my example). You can also do this by saving the file with a BOM header. I'm not sure how Eclipse hints about the encoding, but maybe it uses sys.setdefaultencoding() somehow?
You can read more details in the Python manual about source code encoding.

How can encode('ascii', 'ignore') throw a UnicodeDecodeError?

This line
data = get_url_contents(r[0]).encode('ascii', 'ignore')
produces this error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11450: ordinal not in range(128)
Why? I assumed that because I'm using 'ignore', it should be impossible to get decode errors when saving the output to a string variable.
Due to a quirk of Python 2, you can call encode on a byte string (i.e. text that's already encoded). In this case, it first tries to convert it to a unicode object by decoding with ascii. So, if get_url_contents is returning a byte string, your line effectively does this:
get_url_contents(r[0]).decode('ascii').encode('ascii', 'ignore')
In Python 3, byte strings don't have an encode method, so the same problem would just cause an AttributeError.
(Of course, I don't know that this is the problem - it could be related to the get_url_contents function. But what I've described above is my best guess)
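A Python 3 sketch of the explicit version: decode the bytes with their real encoding first, then encode to ASCII. The sample bytes and the assumption that the content is UTF-8 are illustrative; the real get_url_contents may return something else.

```python
raw = b'caf\xc3\xa9'  # stand-in for the bytes a download might return

# Python 3 bytes objects have no encode method at all.
bytes_have_encode = hasattr(raw, 'encode')

# Decode explicitly (UTF-8 is assumed here), then drop non-ASCII characters.
text = raw.decode('utf-8')
data = text.encode('ascii', 'ignore')
```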
