Python decoding issue with Chinese characters

Python decoding issue with Chinese characters - python

I'm using Python 3.5, and I'm trying to take a block of byte text that may or may not contain special Chinese characters and output it to a file. It works for entries that do not contain Chinese characters, but breaks when they do. The Chinese characters are always a person's name, and are always in addition to the English spelling of their name. The text is JSON formatted and needs to be decoded before I can load it. The decoding seems to go fine and doesn't give me any errors. When I try and write the decoded text to a file it gives me the following error message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-18: character maps to undefined
Here is an example of the raw data that I get before I do anything to it:
b' "isBulkRecipient": "false",\r\n "name": "Name in, English \xef'
b'\xab\x62\xb6\xe2\x15\x8a\x8b\x8a\xee\xab\x89\xcf\xbc\x8a",\r\n
Here is the code that I am using:
recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
recipientName = recipientData['signers'][0]['name']
pprint(recipientName)
with open('envelope recipient list.csv', 'a', newline='') as fp:
a = csv.writer(fp, delimiter=',')
csvData = [[recipientName]]
a.writerows(csvData)
The recipientContent is obtained from an API call. I do not need to have the Chinese characters in the output file. Any advice will be greatly appreciated!
Update:
I've been doing some manual workarounds for each entry that breaks, and came other entries that didn't contain Chinese special characters, but had them from other languages, and the broke the program as well. The special characters are only in the name field. So a name could be something like "Ałex" where it is a mixture of normal and special characters. Before i decode the string that contains this information i am able to print it out to the screen and it looks like this: b'name": "A\xc5ex",\r\n
But after i decode it into utf-8 it will give me an error if i try to output it. The error message is: UnicodeEncodeError: 'charmap' codec can't encode character 'u0142' in position 2- character maps to -undefined-
I looked up what \u0142 was and it is the ł special character.

The error you're getting is when you're writing to the file.
In Python 3.x, when you open() in text mode (the default) without specifying an encoding=, Python will use an encoding most suitable to your locale or language settings.
If you're on Windows, this will use the charmap codec to map to your language encoding.
Although you could just write bytes straight to a file, you're doing the right thing by decoding it first. As others have said, you should really decode using the encoding specified by the web server. You could also use Python Requests module, which does this for you. (You example doesn't decode as UTF-8, so I assume your example isn't correct)
To solve your immediate error, simply pass an encoding to open(), which supports the characters you have in your data. Unicode in UTF-8 encoding is the obvious choice. Therefore, you should change your code to read:
with open('envelope recipient list.csv', 'a', encoding='utf-8', newline='') as fp:

Warning: shotgun solution ahead
Assuming you just want to get rid of all foreign character in all your file ( that is they are not important for your future processing of all other fields), you can simply ignore all non ascii characters
recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
by
recipientData = json.loads(recipientContent.decode('ascii', 'ignore'))
like this you remove all non ascii characters before future processing.
I called it shotgun solution because it might not work correctly under certain circumstances:
Obviously if non ascii characters are needed to keep for future use
If b'\' or b" characters appears for example from part of an utf-16 character.

Add this line to your code :
from __future__ import unicode_literals

Related

Best way to remove '\xad' in Python?

I'm trying to build a corpus from the .txt file found at this link.
I believe the instances of \xad are supposedly 'soft-hyphens', but do not appear to be read correctly under UTF-8 encoding. I've tried encoding the .txt file as iso8859-15, using the code:
with open('Harry Potter 3 - The Prisoner Of Azkaban.txt', 'r',
encoding='iso8859-15') as myfile:
data=myfile.read().replace('\n', '')
data2 = data.split(' ')
This returns an array of 'words', but '\xad' remains attached to many entries in data2. I've tried
data_clean = data.replace('\\xad', '')
and
data_clean = data.replace('\\xad|\\xad\\xad','')
but this doesn't seem to remove the instances of '\xad'. Has anyone ran into a similar problem before? Ideally I'd like to encode this data as UTF-8 to avail of the nltk library, but it won't read the file with UTF-8 encoding as I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 471: invalid start byte
Any help would be greatly appreciated!
Additional context: This is a recreational project with the aim of being able to generate stories based on the txt file. Everything I've generated thus far has been permeated with '\xad', which ruins the fun!

Your file almost certainly has actual U+00AD soft-hyphen characters in it.
These are characters that mark places where a word could be split when fitting lines to a page. The idea is that the soft hyphen is invisible if the word doesn't need to be split, but printed the same as a U+2010 normal hyphen if it does.
Since you don't care about rendering this text in a book with nicely flowing text, you're never going to hyphenate anything, so you just want to remove these characters.
The way to do this is not to fiddle with the encoding. Just remove them from the Unicode text, using whichever of these you find most readable:
data = data.replace('\xad', '')
data = data.replace('\u00ad', '')
data = data.replace('\N{SOFT HYPHEN}', '')
Notice the single backslash. We're not replacing a literal backslash, x, a, d, we're replacing a literal soft-hyphen character, that is, the character whose code point is hex 0xad.
You can either do this to the whole file before splitting into words, or do it once per word after splitting.
Meanwhile, you seem to be confused about what encodings are and what to do with them:
I've tried encoding the .txt file as iso8859-15
No, you've tried decoding the file as ISO-8859-15. It's not clear why you tried ISO-8859-15 in the first place. But, since the ISO-8859-15 encoding for the character '\xad' is the byte b'\xad', maybe that's correct.
Ideally I'd like to encode this data as UTF-8 to avail of the nltk library
But NLTK doesn't want UTF-8 bytes, it wants Unicode strings. You don't need to encode it for that.
Plus, you're not trying to encode your Unicode text to UTF-8, you're trying to decode your bytes from UTF-8. If that's not what those bytes are… if you're lucky, you'll get an error like this one; if not, you'll get mojibake that you don't notice until you've screwed up a 500GB corpus and thrown away the original data.1
1. UTF-8 is specifically designed so you'll get early errors whenever possible. In this case, reading ISO-8859-15 text with soft hyphens as if it were UTF-8 raises exactly the error you're seeing, but reading UTF-8 text with soft hyphens as if it were ISO-8859-15 will silently succeed, but with an extra 'Â' character before each soft hyphen. The error is usually more helpful.

character set issue UTF-8

I have a document,in which the word "doesn't" contains apostrophe as shown below.
When i tried to process that via a python program it is showing the word as " doesnÆt" and exiting with the error as mentioned below.
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 70: invalid start byte
I opened the document in notepad and changed the encoding to UTF-8 from ANSI(found somewhere on the web), now it is working fine.
But can some one throw light on ,what all these things are about and how can i type the kind of apostrophe with my laptop keyboard.

MS Word famously converts quotes to "smart-quotes", so that they properly wrap around words or point in the right direction as an apostrophe.
You haven't been exactly faithful with your copy-paste so it's hard to be sure we're talking about the same thing.
For example here's the smart-quotes compared to plain ascii:
Doesn’t vs. Doesn't
or
“hello” vs. "hello"
Notice how the smart quotes on the left are curlier. In your screenshot, ’ will have been mapped to the Unicode point U+2019 ('RIGHT SINGLE QUOTATION MARK'). You can't easily manually type smart quotes with using a Windows key combination and typing the Unicode value.
You've then likely saved this text as in Windows-1252 (Western Europe) encoding (aka ANSI), which assigned the byte 0x92. Then, you loaded this into Python but passed the incorrect encoding of UTF-8. That's when you saw the exception.
The way to deal with this in the future is to specify the correct encoding when opening the file in Python. E.g.
with io.open("myfile.txt", 'r', encoding="windows-1252") as my_file:
my_data = my_file.read()

Python UnicodeEncodeError when Outputting Parsed Data from a Webpage

I have a program that parses webpages and then writes the data out somewhere else. When I am writing the data, I get
"UnicodeEncodeError: 'ascii' codec can't encode characters in position
19-21: ordinal not in range(128)"
I am gathering the data using lxml.
name = apiTree.xpath("//boardgames/boardgame/name[#primary='true']")[0].text
worksheet.goog["Name"].append(name)
Upon reading, http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm, it suggests I record all of my variables in unicode. This means I need to know what encoding the site is using.
My final line that actually writes the data out somewhere is:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], (str(worksheet.goog[value][row])).encode('ascii', 'ignore'))
How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out?

You error is because of:
str(worksheet.goog[value][row])
Calling str you are trying to encode the ascii, what you should be doing is encoding to utf-8:
worksheet.goog[value][row].encode("utf-8")
As far as How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out? goes, you can't there is no ascii latin ă etc... unless you want to get the the closest ascii equivalent using something like Unidecode.

I think I may have figured my own problem out.
apiTree.xpath("//boardgames/boardgame/name[#primary='true']")[0].text
Actually defaults to unicode. So what I did was change this line to:
name = (apiTree.xpath("//boardgames/boardgame/name[#primary='true']")[0].text).encode('ascii', errors='ignore')
And I just output without changing anything:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], worksheet.goog[value][row])
Due to the nature of the data, ASCII only is mostly fine. Although, I may be able to use UTF-8 and catch some extra characters...but this is not relevant to the question.
:)

Identify garbage unicode string using python

My script is reads data from csv file, the csv file can have multiple strings of English or non English words.
Some time the text file has garbage strings , i want to identify those string and skip those string and process others
doc = codecs.open(input_text_file, "rb",'utf_8_sig')
fob = csv.DictReader(doc)
for row, entry in enumerate(f):
if is_valid_unicode_str(row['Name']):
process_futher
def is_valid_unicode_str(value):
try:
function
return True
except UnicodeEncodeError:
return false
csv input:
"Name"
"Ã¨Â¢â€¹Ã¨Â¢âdcx€¹Ã¤Â¸Å½Ã¦Å“â€¹Ã¥Ââ€¹Ã¤Â»Â¬Ã§â€ÂµÃ¥ÂÂÃ¥â€¢â€"
"元大寶來證券"
"John Dove"
I want to defile function is_valid_unicode_str() which will identify the garbage string and process valid one only.
I tried to use decode is but it doesnt failed while decoding garbage strings
value.decode('utf8')
The expected output are string with Chinese and English string to be process
could you please guide me how can i implement function to filter valid Unicode files?.

(ftfy developer here)
I've figured out that the text is likely to be '袋袋与朋友们电子商'. I had to guess at the characters 友, 子, and 商, because some unprintable characters are characters missing in the string in your question. When guessing, I picked the most common character from the small number of possibilities. And I don't know where the "dcx" goes or why it's there.
Google Translate is not very helpful here but it seems to mean something about e-commerce.
So here's everything that happened to your text:
It was encoded as UTF-8 and decoded incorrectly as sloppy-windows-1252, twice
It had the letters "dcx" inserted into the middle of a UTF-8 sequence
Characters that don't exist in windows-1252 -- with byte values 81, 8d, 8f, 90, and 9d -- were removed
A non-breaking space (byte value a0) was removed from the end
If just the first problem had happened, ftfy.fix_text_encoding would be able to fix it. It's possible that the remaining problems just happened while you were trying to get the string onto Stack Overflow.
So here's my recommendation:
Find out who keeps decoding the data incorrectly as sloppy-windows-1252, and get them to decode it as UTF-8 instead.
If you end up with a string like this again, try ftfy.fix_text_encoding on it.

You have Mojibake strings; text encoded to one (correct) codec, then decoded as another.
In this case, your text was decoded with the Windows 1252 codepage; the U+20AC EURO SIGN in the text is typical of CP1252 Mojibakes. The original encoding could be one of the GB* family of Chinese encodings, or a multiple roundtrip UTF-8 - CP1252 Mojibake. Which one I cannot determine, I cannot read Chinese, nor do I have your full data; CP1252 Mojibakes include un-printable characters like 0x81 and 0x8D bytes that might have gotten lost when you posted your question here.
I'd install the ftfy project; it won't fix GB* encodings (I requested the project add support), but it includes a new codec called sloppy-windows-1252 that'll let you reverse an erroneous decode with that codec:
>>> import ftfy # registers extra codecs on import
>>> text = u'Ã¨Â¢â€¹Ã¨Â¢âdcx€¹Ã¤Â¸Å½Ã¦Å“â€¹Ã¥Ââ€¹Ã¤Â»Â¬Ã§â€ÂµÃ¥ÂÂÃ¥â€¢â€'
>>> print text.encode('sloppy-windows-1252').decode('gb2312', 'replace')
猫垄�姑�⑩dcx�盲赂沤忙��姑ヂ�姑ぢ宦�р�得ヂ�氓�⑩�
>>> print text.encode('sloppy-windows-1252').decode('gbk', 'replace')
猫垄鈥姑�⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦�р�得ヂ�氓鈥⑩�
>>> print text.encode('sloppy-windows-1252').decode('gb18030', 'replace')
猫垄鈥姑⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р�得ヂ氓鈥⑩�
>>> print text.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'replace')
袋�dcx与朋�们���
The � U+FFFD REPLACEMENT CHARACTER shows the decoding wasn't entirely successful, but that could be due to the fact that your copied string here is missing anything not printable or using the 0x81 or 0x8D bytes.
You can try to fix your data this way; from the file data, try to decode to one of the GB* codecs after encoding to sloppy-windows-1252, or roundtrip from UTF-8 twice and see what fits best.
If that's not good enough (you cannot fix the data) you can use the ftfy.badness.sequence_weirdness() function to try and detect the issue:
>>> from ftfy.badness import sequence_weirdness
>>> sequence_weirdness(text)
9
>>> sequence_weirdness(u'元大寶來證券')
0
>>> sequence_weirdness(u'John Dove')
0
Mojibakes score high on the sequence weirdness scale. You'd could try and find an appropriate threshold for your data by which time you'd call the data most likely to be corrupted.
However, I think we can use a non-zero return value as a starting point for another test. English text should score 0 on that scale, and so should Chinese text. Chinese mixed with English can still score over 0, but you could not then encode that Chinese text to the CP-1252 codec while you can with the broken text:
from ftfy.badness import sequence_weirdness
def is_valid_unicode_str(text):
if not sequence_weirdness(text):
# nothing weird, should be okay
return True
try:
text.encode('sloppy-windows-1252')
except UnicodeEncodeError:
# Not CP-1252 encodable, probably fine
return True
else:
# Encodable as CP-1252, Mojibake alert level high
return False

Python Unicode CSV export (using Django)

I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about unicode, however, I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines indicate some various options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode( unicodedata.normalize(
# 'NFKD',contact.message).encode('ascii','ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
dmessage,
])
Does anyone have any advice in removing unicode characters to I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe

You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like â€™ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage= contact.message.encode('cp1252', 'ignore')
alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage= contact.message.encode('ascii', 'ignore')

Encoding is a pain, but if you're working in django have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.

[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
0x2019 : u"'",
# etc
#
smashed = input_string.translate(smashcii)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.