I have this problem: characters like ñ or ŕ í á ú etc. are discarded when I apply
text = text.encode('ascii', 'ignore')
to a function that needs the input to be ASCII.
Is there a way to force it to ASCII without losing those characters, or should I change the function to accept Unicode characters?
http://dpaste.com/601417/
The 'ascii' encoding can't represent the characters you refer to. You have to choose a different encoding — perhaps 'cp850' or 'latin_1' — but then you have to be sure that your output terminal interprets 8-bit codes using the relevant code page.
On balance, life is easier if you just go Unicode all the way.
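For illustration, a quick Python 2 session (the sample string is my own) showing the difference between dropping the characters, picking an 8-bit encoding, and staying in Unicode:
>>> text = u'a\xf1o r\xe1pido'        # "año rápido", an arbitrary example
>>> text.encode('ascii', 'ignore')    # ñ and á are silently dropped
'ao rpido'
>>> text.encode('latin_1')            # latin_1 (or cp850) can represent them
'a\xf1o r\xe1pido'
>>> print text                        # or stay in Unicode, if the terminal can display it
año rápido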
Yes, you should go for another encoding if you need those characters (Unicode, for example). See an ASCII table for all the characters that ASCII includes.
I have some text that has html escape codes in it that I am struggling to fully decode / encode to display properly with Python (ultimately in a Django application).
""Coup d'Ãtat"" being a troublesome snippet.
I have used html.unescape() to successfully unescape most of the html codes, but I am struggling with the decoding of the special characters, "Ã", in this example. Ideally this would display as "Coup d'État", but despite trying some decoding/encoding combinations I am getting "Coup d'Ãtat".
What is the correct way to convert ""Coup d'Ãtat"" into "Coup d'État"?
Thanks for your help, and apologies if this has been answered elsewhere. I've tried searching, but no success.
You have a Mojibake, double-encoded data. You not only have HTML entities, your data was incorrectly decoded from bytes to text before the HTML entities were applied.
For your example, the two entities &Atilde; and &#137; decode to the Unicode characters Ã and ‰. Those two characters are also known (from the Unicode standard) as U+00C3 LATIN CAPITAL LETTER A WITH TILDE and U+2030 PER MILLE SIGN. This is typical of UTF-8 data being mis-interpreted as a Latin-variant encoding (such as ISO 8859-1 or a Windows Latin codepage variant).
If we assume that the original character was meant to be É, or U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, then the original would have been encoded to the bytes C3 and 89 if using UTF-8. That Ã (U+00C3!) shows up here is not a coincidence; it is typical of UTF-8 -> Latin-variant Mojibakes to end up with such combinations. The 89 mapping tells us that the most likely candidate for the wrong encoding is the Windows CP-1252 encoding, which maps the hex value 89 to U+2030 PER MILLE SIGN.
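The mis-decoding step is easy to reproduce; here is a short Python 3 session of mine illustrating the mechanism described above:
>>> 'É'.encode('utf-8')                    # É is stored as the two bytes C3 89
b'\xc3\x89'
>>> 'É'.encode('utf-8').decode('cp1252')   # reading those bytes as CP-1252
'Ã‰'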
You could manually encode to bytes then decode as the correct encoding, but the trick is to know what encoding was used incorrectly, and sometimes that mistake leads to data loss, because the CP-1252 codepage doesn't have a Unicode character mapping for 5 specific byte values. That's not a direct problem for the example in your question, but can be for other text. Manually decoding would work like this:
>>> import html
>>> broken = "&quot;Coup d&#39;&Atilde;&#137;tat&quot;"
>>> html.unescape(broken)
'"Coup d\'État"'
>>> html.unescape(broken).encode("cp1252")
b'"Coup d\'\xc3\x89tat"'
>>> html.unescape(broken).encode("cp1252").decode("utf-8")
'"Coup d\'État"'
A better option is to use the special ftfy library (the name is an acronym for Fixed That For You), which uses detailed knowledge about how to recognize such mistakes and undo the damage.
ftfy also handles the HTML-entity decoding, all in one step:
>>> import ftfy
>>> ftfy.fix_text(""Coup d'Ãtat"")
'"Coup d\'État"'
The library includes sloppy variants of text codecs often found in a Mojibake to help with repairing. It also encodes information about how to recognize the specific errors that a given wrong codec choice produces, so it knows what to do to reverse the damage.
I have this function in Python:
Str = "ü";
print Str
def correctText( str ):
str = str.upper()
correctedText = str.decode('UTF8').encode('Windows-1252')
return correctedText;
corText = correctText(Str);
print corText
It works and converts characters like Ã¼ and Ã©, however it fails when I try Ã? and Â¶.
Is there a way I can fix it?
The problem is that Ã? and Â¶ do not form the valid UTF-8 byte sequences your decode step expects. What you need to do is either use some other kind of encoding, or strip the errors out of your str by calling the unicode() function with an error handler. I recommend the latter.
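For what it's worth, a minimal Python 2 sketch of the error-stripping approach mentioned above (the sample byte string is made up):
# decode as UTF-8 and silently drop any byte sequences that are not valid UTF-8
raw = 'abc \xc3\xbc def \xc3\x3f'   # contains a valid (C3 BC) and an invalid (C3 3F) sequence
cleaned = unicode(raw, 'utf-8', errors='ignore')
print cleaned                       # abc ü def ?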
What you are trying to do is to compose valid UTF-8 codes by several consecutive Windows-1252 codes.
For example, for Ã¼, the Windows-1252 code of Ã is C3 and for ¼ it is BC. Together, the code C3BC happens to be the UTF-8 code of ü.
Now, for Ã?, the Windows-1252 code is C33F, which is not a valid UTF-8 code (because the second byte does not start with the bit pattern 10).
Are you sure this sequence occurs in your text? For example, for à, the Windows-1252 decoding of its UTF-8 code (C3A0) is Ã followed by a non-printable character (a non-breaking space). So, if this second character is not printed, the ? might be a regular character of the text.
For Â¶ the Windows-1252 encoding is C2B6, which happens to be valid UTF-8 for ¶. Shouldn't it be Ã¶, for which the Windows-1252 encoding is C3B6, which equals the UTF-8 code of ö?
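The byte arithmetic above is easy to verify in a Python 3 session (the byte values are the ones discussed in this answer):
>>> b'\xc3\xbc'.decode('cp1252')   # the two Windows-1252 characters Ã and ¼
'Ã¼'
>>> b'\xc3\xbc'.decode('utf-8')    # the same two bytes are valid UTF-8 for ü
'ü'
>>> b'\xc3\x3f'.decode('utf-8')    # Ã followed by ? is not valid UTF-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: invalid continuation byte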
According to the documentation, the following command
'Brückenspinne'.encode("utf-8", errors='replace')
should give me the byte sequence b'Br??ckenspinne'. However, unicode characters are not replaced but encoded nevertheless:
b'Br\xc3\xbcckenspinne'
Can you tell me how I actually eliminate unicode characters? (I use replace for testing purposes; I intend to use 'xmlcharrefreplace' later. To be totally honest, I want to convert the unicode characters to their xmlcharref, keeping everything as a string.)
Thank you.
The utf-8 encoding can represent the character ü, so no replacement occurs.
Use another encoding that cannot represent the character. For example, ascii:
>>> 'Brückenspinne'.encode("ascii", errors='replace')
b'Br?ckenspinne'
>>> 'Brückenspinne'.encode("ascii", errors='xmlcharrefreplace')
b'Br&#252;ckenspinne'
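Since the goal is to keep working with a str rather than bytes, the character-reference result can simply be decoded back (it is pure ASCII at that point); a small follow-up to the example above:
>>> 'Brückenspinne'.encode("ascii", errors='xmlcharrefreplace').decode("ascii")
'Br&#252;ckenspinne'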
I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position
0: ordinal not in range(128)
How can I fix this?
Now, encoding is a minefield, and I might be off on this one - please correct me if that's the case.
From what I've gathered over the years, Python 2 assumes ASCII unless you define an encoding at the top of your script, mainly because either it's compiled that way or the OS/terminal uses ASCII as its primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
is the ASCII representation of a unicode string. Somehow Python needs to tell you that there's an ö in there - but it can't with ASCII, because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
You're trying to find an ASCII character/string called \xf6 that is outside the accepted byte range of ASCII, so that will raise an exception. And you're trying to replace it with another invalid ASCII character, which will cause the same exception.
Hence why you get the "'ascii' codec can't decode byte..." message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output will be the same as in the example above, because \xf6 is ö in unicode.
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of to convert your representation into an ASCII ö, because there really isn't such a thing. There are just different ways Python will do this encoding magic for you to make you believe it's possible to "just write ö".
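(If an unaccented o would be good enough for your use case, the usual workaround is to decompose the string and drop the combining marks; that gives you plain ASCII, but it is an approximation, not an "ASCII ö":)
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'Ganztags ge\xf6ffnet').encode('ascii', 'ignore')
'Ganztags geoffnet'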
In Python3 most of the strings you encounter will either be bytes data or treated a bit differently from Python2. And for the most part it's a lot easier.
There are numerous ways to change the encoding that are not part of standard practice, but there are ways to do it.
The closest to "good" practice would be the locale:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back; it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assumes/uses ASCII as its encoder/decoder.
ö is not a part of ASCII, therefore you'll always see \xf6 if you've somehow gotten Unicode characters into your data. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown as ö because of automatic encoding. If you need to verify whether some input matches that string, you have to compare them as Unicode: u'\xf6' == u'ö'. If other systems depend on this data, encode it with something they understand, e.g. .encode('UTF-8'). But replacing \xf6 with ö is the same thing: ö doesn't exist in ASCII, so you need to write it as u'ö', which results in the same data in the end.
As you are using the German language, you should be aware of non-ASCII characters. You should know whether your system prefers Latin1 (the Windows console and some Unixes), UTF8 (most Linux variants), or native Unicode (Windows GUI).
If you can process everything as native unicode, things are cleaner, and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independent of the Python source file charset.
If you have to output byte strings or store them in files, you should encode them in UTF8 (which can represent any Unicode character, but characters with code points above 127 use more than one byte) or Latin1 (one byte per character, but it only supports code points below 256).
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.
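For example, the raw byte strings produced by the two encodings look like this (which one prints correctly then depends on the encoding your terminal expects):
>>> u'Ganztags ge\xf6ffnet'.encode('Latin1')
'Ganztags ge\xf6ffnet'
>>> u'Ganztags ge\xf6ffnet'.encode('utf8')
'Ganztags ge\xc3\xb6ffnet'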
I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about unicode; however, I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines indicate various options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode( unicodedata.normalize(
# 'NFKD',contact.message).encode('ascii','ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
    dmessage,
])
Does anyone have any advice on removing unicode characters so I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe
You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like ’ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage = contact.message.encode('cp1252', 'ignore')
Alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage = contact.message.encode('ascii', 'ignore')
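To make the difference concrete, here is a quick check (my own example string) of what these calls do with the quotation mark from your traceback:
>>> u'It\u2019s fine'.encode('cp1252', 'ignore')
'It\x92s fine'
>>> u'It\u2019s fine'.encode('ascii', 'ignore')
'Its fine'
>>> u'It\u2019s fine'.encode('ascii', 'replace')
'It?s fine'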
Encoding is a pain, but if you're working in Django, have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in Python encode() and decode() for strings, but you have to specify the encoding for those and, honestly, it's a pain.
[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
    0x2019: u"'",
    # etc
}

smashed = input_string.translate(smashcii)
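Usage, with the table above, would look something like this (the input string is just an example):
>>> u'It\u2019s here'.translate(smashcii)
u"It's here"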