UnicodeEncodeError, can't seem to set errors='ignore' - python

I'm fairly new to Python, so I'm hoping this is something simple that I'm just missing.
I'm running Python 2.7 on Windows 7
I'm trying to run a basic twitter scraping program through the command line. However I keep getting the following error:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 79: character maps to (undefined)
I understand basically what's happening here, that it's trying to print to the console in cp437 and it's getting confused by the unicode characters in the tweets that it's grabbing.
All I'm trying to do is either get it to replace those characters with "?" or just get it to drop those characters altogether. I have read a bunch of posts about this and I can't figure out how to do it.
I opened the cp437.py file that's referenced in the error and I changed all the errors='strict' to errors='ignore' but that didn't solve the problem.
I then tried to go into the C:\Python27\Lib\codecs.py file and change all the errors='strict' to errors='ignore' but that didn't solve the problem either.
Any ideas? Like I said, hopefully I'm just missing something basic but I've read a bunch of posts on this and I can't seem to puzzle it out.
Thanks a lot.
Seth

I would not suggest changing the built-in libraries: they are designed to let you handle encoding errors without being modified (and once you have changed them, it is no longer clear that a solution that works for everyone else will work for you).
You may just want to pass errors='ignore' into whatever encoding function you are using to skip the problem characters, or errors='replace' to substitute a placeholder for each of them ('?' when encoding, u'\ufffd' when decoding) to signify there was a problem. [errors='strict' is the default if you don't pass any value.]
However, if you are printing to the command line, you probably don't want to be encoding as Unicode anyway, but as ASCII instead, since Unicode includes characters that the command line can't print (and I suspect that is what is causing the errors, rather than there being non-standard Unicode characters in the response you are getting from Twitter).
Try e.g.
print original_data.encode('ascii', 'ignore')
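The error handlers behave the same way in Python 3, where they are easy to try out interactively. A quick sketch (not from the original answer) using the same U+2019 character from the traceback:

```python
# U+2019 is the right single quotation mark from the traceback above
tweet = "It\u2019s here"

# 'ignore' simply drops characters the target charset can't represent
print(tweet.encode('ascii', 'ignore'))   # b'Its here'

# 'replace' substitutes '?' when encoding; cp437 is the Windows console codec
print(tweet.encode('cp437', 'replace'))  # b'It?s here'
```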

Related

Why do Unicode characters trigger an EncodeError in some environments and not in others?

I am making a Python project that needs to print, edit and return strings containing Greek characters.
On my main PC, which has the Greek language installed, everything runs fine, but when I run the same program with the same version of Python on my English laptop, an encode error is triggered. Specifically this one:
EncodeError: 'ascii' codec can't encode characters in position 0-2:
ordinal not in range(128)
The error happens due to this code
my_string = "Δίας"
print(my_string)
Why is this happening and what do I need to do to fix it?
Why is this happening? You are using Python 2 and although it supports Unicode, it makes you jump through a few more hoops explicitly than Python 3 does. The string you provide contains characters that fall outside the normal first 128 ASCII characters, which is what is causing the problem.
The print statement tries to encode the string as standard ascii, but it runs into characters it doesn't understand and by that point, it does not know what encoding the characters are supposed to be in. You might think this is obvious: "the same encoding the file is in!" or "always UTF-8!", but Python 2 wants you to make it explicit.
What do you need to do to fix it? One solution would be to use Python 3 and not worry about it, if all you need is a quick solution. Python 3 really is the way forward at this point and using Python 2 makes you solve problems that many Python programmers today don't have to solve (although they should be able to, in the end).
If you want to keep using Python 2, you should change your code to this:
# coding=utf-8
my_string = u"Δίας"
print(my_string.encode('utf-8'))
The first line tells the interpreter explicitly what encoding the source file was written in. This helps your IDE as well, to make sure it is showing you the code correctly. The second line has the u in front of the string, telling Python my_string is in fact a unicode string. And the third line explicitly tells Python that you want the output to be utf-8 encoded as well.
A more complete explanation of all this is in the Python 2 Unicode HOWTO: https://docs.python.org/2/howto/unicode.html
If you're wondering why it works on your Greek computer but not on your English computer: the default encoding on the Greek computer actually has code points for the characters you're using, while the English one does not. Python can store the string as a series of Unicode code points either way, but by the time it needs to encode them for output it doesn't know which encoding to use, and the default (English) encoding doesn't contain the characters in the string.
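The round trip can be checked directly. This sketch uses Python 3 syntax, where string literals are unicode by default:

```python
my_string = u"Δίας"

# encoding to UTF-8 yields two bytes per Greek letter
encoded = my_string.encode('utf-8')
print(encoded)  # b'\xce\x94\xce\xaf\xce\xb1\xcf\x82'

# decoding with the same codec restores the original string
print(encoded.decode('utf-8') == my_string)  # True
```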

Python Dict to JSON: json.dumps unicode error, but ord(str(dict)) yields nothing over 128

I have a task where I needed to generate a portable version of a data dictionary, with some extra fields inserted. I ended up building a somewhat large Python dictionary, which I then wanted to convert to JSON. However, when I attempt this conversion...
with open('CPS14_data_dict.json', 'w') as f:
    json.dump(data_dict, f, indent=4, encoding='utf-8')
I get smacked with an exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 17: invalid start byte
I know this is a common error, but I cannot, for the life of me, find an instance of how one locates the problematic portion in the dictionary. The only plausible thing I have seen is to convert the dictionary to a string and run ord() on each character:
for i, c in enumerate(str(data_dict)):
    if ord(c) > 128:
        print i, '|', c
The problem is, this operation returns nothing at all. Am I missing something about how ord() works? Alternatively, a position (17) is reported, but it's not clear to me what this refers to. There do not appear to be any problems at the 17th character, row, or entry.
I should say that I know about the ensure_ascii=False option. Indeed, it will write to disk (and it's beautiful). This approach, however, seems to just kick the can down the road. I get the same encoding error when I try to read the file back in. Since I will want to use this file for multiple purposes (converted back to a dictionary), this is an issue.
It would also be helpful to note that this is my work computer with Windows 7, so I don't have my shell tools to explore the file (and my VM is on the fritz).
Any help would be greatly appreciated.
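One way to locate the offending entry is to attempt the decode on each value individually rather than on the whole dictionary, and let the exception tell you where it failed. A Python 3 sketch with a hypothetical dictionary (0x92 is the cp1252 right single quote, a common Windows artifact):

```python
# hypothetical data: one value holds a stray cp1252 byte
data_dict = {"title": b"good", "note": b"doesn\x92t"}

for key, value in data_dict.items():
    try:
        value.decode('utf-8')
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset json reports as "position"
        print(key, exc.start, repr(value))
        # the byte is valid cp1252, so re-decoding recovers the text
        print(value.decode('cp1252'))  # doesn’t (with a curly apostrophe)
```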

Python Unicode Handling Errors - How To Simply Remove Unicode

There are literally dozens, maybe even hundreds of questions on this site about unicode handling errors with python. Here is an example of what I am talking about:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2310: ordinal not in range(128)
A great many of the questions indicate that the OP just wants the offending content to GO AWAY. The responses they receive are uniformly full of mumbo jumbo about codecs, character sets, and all sorts of things that do not address this one basic question:
"I am trying to process a text file with some unicode in it, I could not possibly care any less what this stuff is, it is just noise in the context of the problem I am trying to solve."
So, I have a file with a zillion JSON encoded tweets, I have no interest at all in these special characters, I want them REMOVED from the line.
fh = open('file-full-of-unicode.txt')
for line in fh:
    print zap_unicode(line)
Given a variable called 'line', how do I simply print it minus any unicode it might contain?
There, I have repeated the question in several different fashions so it can not be misconstrued - unicode is junk in the context of what I am trying to do, I want to convert it to something innocuous or simply remove it entirely. How is this most easily accomplished?
You can do line.decode('ascii', 'ignore'). This will decode as ASCII everything that it can, ignoring any errors.
However, if you do this, prepare for pain. Unicode exists for a reason. Throwing away parts of your data without even knowing what you're throwing away will almost always cause problems down the road. It's like saying "I don't care if some of the gizmos inside my car engine don't work, I just want to drive it. Just take out anything that doesn't work!" That's all well and good until it explodes.
The problem can't "go away", the code that can't handle unicodes properly must be fixed. There is no trivial way to do this, but the simplest way is to decode the bytes on input to make sure that the application is using text everywhere.
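One way to "decode on input" while silently dropping anything undecodable is to pass an errors argument to the text layer that reads the file. A Python 3 sketch, using an in-memory stream in place of the question's file:

```python
import io

# simulate a file of tweets containing stray non-ASCII bytes
raw = "caf\u00e9 tweet\nplain tweet\n".encode('utf-8')

# decode on input, ignoring any byte that isn't valid ASCII;
# open('file.txt', encoding='ascii', errors='ignore') works the same way
fh = io.TextIOWrapper(io.BytesIO(raw), encoding='ascii', errors='ignore')
for line in fh:
    print(line, end='')  # 'caf tweet' then 'plain tweet'
```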
As @BrenBarn suggested, the result will be spurious English words...
My suggestion is: why not remove the entire word that contains a unicode character?
That is, a word is space+characters+space; if the characters include a unicode character, then remove everything after the previous space and before the next space. The result will be much better than removing just the unicode characters.
If you just want to remove names (@username) and tags (#tags), you can use a filter ([@, #, ..]) instead of brute-force searching for all unicode characters.
I have low reputation so I can't comment, so I made this an answer. :-(

django + unicode constant errors

I built a django site last year that utilises both a dashboard and an API for a client.
They are, on occasion, putting unicode information (usually via a Microsoft keyboard and a single quote character!) into the database.
It's fine to change this one instance for everything, but what I constantly get is something like this error when a new character is added that I haven't "converted":
UnicodeDecodeError at /xx/xxxxx/api/xxx.json
'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
The issue is actually that I need to be able to convert this unicode (from the model) into HTML.
# if a char breaks the system, replace it here (duplicate line)
text = unicode(str(text).replace('\xa3', '£'))
I duplicate this line here, but it just breaks otherwise.
Tearing my hair out because I know this is straight forward and I'm doing something remarkably silly somewhere.
Have searched elsewhere and realised that while my issue is not new, I can't find the answer elsewhere.
I assume that text is unicode (which seems a safe assumption, as \xa3 is the Unicode code point for the £ character).
I'm not sure why you need to encode it at all, seeing as the text will be converted to utf-8 on output in the template, and all browsers are perfectly capable of displaying that. There is likely another point further down the line where something (probably your code, unfortunately) is assuming ASCII, and the implicit conversion is breaking things.
In that case, you could just do this:
text = text.encode('ascii', 'xmlcharrefreplace')
which converts the non-ASCII characters into HTML/XML numeric character references like &#163; (£).
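A quick check of that handler (it behaves the same in Python 2 and 3; sketch):

```python
text = u"\u00a3100 \u2019quoted\u2019"

# xmlcharrefreplace turns each non-ASCII character into a numeric entity
html_safe = text.encode('ascii', 'xmlcharrefreplace')
print(html_safe)  # b'&#163;100 &#8217;quoted&#8217;'
```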
Tell the JSON-decoder that it shall decode the json-file as unicode. When using the json module directly, this can be done using this code:
json.JSONDecoder(encoding='utf8').decode(
    json.JSONEncoder(encoding='utf8').encode('blä'))
If the JSON decoding takes place via some other modules (django, ...) maybe you can pass the information through this other module into the json stuff.
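Note that the encoding= argument above is Python 2 only. In Python 3 the json module works on unicode strings directly, and with the default ensure_ascii=True the serialized text is pure ASCII (non-ASCII characters become \uXXXX escapes), so it round-trips safely. A sketch:

```python
import json

payload = {"name": "bl\u00e4"}

# non-ASCII characters are escaped, so the JSON text itself is pure ASCII
text = json.dumps(payload)
print(text)                         # {"name": "bl\u00e4"}
print(json.loads(text) == payload)  # True
```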

Python Unicode CSV export (using Django)

I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about unicode, however, I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines indicate some various options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode(unicodedata.normalize(
#    'NFKD', contact.message).encode('ascii', 'ignore'))
#dmessage = (contact.message).encode('utf-8', 'ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
    dmessage,
])
Does anyone have any advice on removing unicode characters so I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe
You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like ’ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage = contact.message.encode('cp1252', 'ignore')
alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage = contact.message.encode('ascii', 'ignore')
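The difference between the two options is easy to see side by side (Python 3 syntax; sketch with made-up message text):

```python
message = u"It\u2019s \u00a35 \u2014 ok"  # smart quote, pound sign, em dash

# cp1252 keeps all three characters, since they exist in that code page
print(message.encode('cp1252'))           # b'It\x92s \xa35 \x97 ok'

# ascii + ignore drops every non-ASCII character outright
print(message.encode('ascii', 'ignore'))  # b'Its 5  ok'
```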
Encoding is a pain, but if you're working in django have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.
[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
    0x2019: u"'",
    # etc
}
smashed = input_string.translate(smashcii)
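In Python 3, str.translate takes the mapping of code points directly. A sketch of the same idea with a few common "smart punctuation" characters:

```python
# map code points to acceptable ASCII stand-ins (extend as needed)
smashcii = {
    0x2019: "'",   # right single quote -> apostrophe
    0x201c: '"',   # left double quote
    0x201d: '"',   # right double quote
    0x2014: '-',   # em dash
}

print(u"\u201cIt\u2019s fine\u201d \u2014 really".translate(smashcii))
# "It's fine" - really
```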
