Prevent encoding errors in Python - python

I have scripts that print messages through the logging system or, sometimes, with print commands. On the Windows console I get error messages like:
Traceback (most recent call last):
File "C:\Python32\lib\logging\__init__.py", line 939, in emit
stream.write(msg)
File "C:\Python32\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 4537: character maps to <undefined>
Is there a general way to make all encodings in the logging system, print commands, etc. fail-safe (ignore errors)?

The problem is that your terminal/shell (cmd, as you are on Windows) cannot print every Unicode character.
You can encode your strings in a fail-safe way with the errors argument of the str.encode method. For example, you can replace unsupported characters with ? by setting errors='replace'.
>>> s = u'\u2019'
>>> print s
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 0: character maps to <undefined>
>>> print s.encode('cp850', errors='replace')
?
See the documentation for other options.
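As a rough illustration of those other options, the sketch below encodes the same character with several of the standard error handlers. It is written for Python 3 (the session above is Python 2), and the variable names are only illustrative.

```python
s = "\u2019"  # RIGHT SINGLE QUOTATION MARK, which cp850 cannot represent

replaced = s.encode("cp850", errors="replace")           # b'?'
ignored = s.encode("cp850", errors="ignore")             # b''
escaped = s.encode("cp850", errors="backslashreplace")   # b'\\u2019'
xml_ref = s.encode("cp850", errors="xmlcharrefreplace")  # b'&#8217;'

print(replaced, ignored, escaped, xml_ref)
```

Which handler fits depends on whether the lost character should vanish, show up as a placeholder, or stay recoverable from the output.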
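Edit: If you want a general solution for logging, you can subclass StreamHandler and sanitize the formatted message (a LogRecord itself has no encode method):
import logging
class CustomStreamHandler(logging.StreamHandler):
    def format(self, record):
        # Replace characters that cp850 cannot represent with '?'.
        msg = logging.StreamHandler.format(self, record)
        return msg.encode('cp850', errors='replace').decode('cp850')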
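The handler idea can be tried without a Windows console. The following is a minimal Python 3 sketch, not a definitive implementation: ReplacingStreamHandler, the logger name and the in-memory "console" are all hypothetical names chosen for the demo. It sanitizes the formatted message text, since a LogRecord object has no encode method.

```python
import io
import logging


class ReplacingStreamHandler(logging.StreamHandler):
    """Hypothetical handler: swaps characters the target encoding lacks for '?'."""

    def __init__(self, stream=None, encoding="cp850"):
        super().__init__(stream)
        self.encoding = encoding

    def format(self, record):
        msg = super().format(record)
        # Round-trip through the console encoding; unencodable characters become '?'.
        return msg.encode(self.encoding, errors="replace").decode(self.encoding)


# Stand-in for a cp850 console: a strict text stream over an in-memory buffer.
raw = io.BytesIO()
console = io.TextIOWrapper(raw, encoding="cp850", errors="strict", newline="\n")

logger = logging.getLogger("encoding_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(ReplacingStreamHandler(console))

logger.info("It\u2019s done")  # would raise inside a plain StreamHandler
console.flush()
logged = raw.getvalue()  # b'It?s done\n'
```

This only covers output that goes through the logging system; plain print calls still write to sys.stdout with its own error handling.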

Related

'–'.encode('utf-8').decode('iso8859_15') , different output in python2.7 and python3.7

I am migrating a software product and ran into this problem.
s = '–' # https://www.fileformat.info/info/unicode/char/0096/index.htm
In Python 2:
s.encode('iso8859_15').decode('iso8859_15') # u'-'
s.encode('utf-8').decode('iso8859_15') # u'-'
In Python 3:
s.encode('iso8859_15').decode('iso8859_15')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python37\lib\encodings\iso8859_15.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 0: character maps to <undefined>
s.encode('utf-8').decode('iso8859_15') # 'â\x80\x93'
Could somebody please explain why this happens and what the solution is?
Thanks in advance.
I tried replacing this character with '-' (hyphen), but the automation test cases failed, and modifying the automation test cases is prohibited.

How do we remove all emoji values from strings in python 3?

I am trying to write a program that will get tweets and then insert them into a csv file but I get this error:
Traceback (most recent call last):
File "c:/Users/Fateh Aliyev/Desktop/Python/AI/Data Mining/data.py", line 30, in <module>
csv.writerow([text, 0])
File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f44c' in position 41: character maps to <undefined>
I am sure that this is from the emojis that are in the strings. I tried this solution but I got the same error. Is this caused by python not being able to encode the string in the first place or something else? How do we get rid of the emojis?
You can remove the emoji by ignoring it when it cannot be encoded:
import codecs
codecs.charmap_encode('\U0001f44c', 'ignore')
# outputs: (b'', 1)
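Applied to whole strings, the same "ignore" idea might look like the sketch below. The helper name strip_unencodable, the sample tweet and the temporary file path are assumptions made for the demo. The second half shows an alternative that keeps the emoji by writing the CSV file as UTF-8 instead of the default cp1252.

```python
import csv
import os
import tempfile


def strip_unencodable(text, encoding="cp1252"):
    """Drop every character that the given encoding cannot represent."""
    return text.encode(encoding, errors="ignore").decode(encoding)


tweet = "Nice work \U0001f44c caf\u00e9"
cleaned = strip_unencodable(tweet)  # emoji removed, the accented letter kept

# Alternative: keep the emoji and choose an encoding that can store it.
path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow([tweet, 0])

with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
```

Stripping loses information, so opening the file with an explicit encoding is usually the gentler fix when the consumer of the CSV can read UTF-8.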

Python - cannot decode html (urllib)

I'm trying to write the HTML of a web page to a file, but I have a problem decoding characters:
import urllib.request
response = urllib.request.urlopen("https://www.google.com")
charset = response.info().get_content_charset()
print(response.read().decode(charset))
The last line causes an error:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in position 6079: ordinal not in range(128)
response.info().get_content_charset() returns iso-8859-2, but if I check the content of the response without decoding (print(response.read())), the HTML meta tag declares "utf-8" as the encoding. If I use "utf-8" in the decode call, I get a similar problem:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 6111: invalid start byte
What's going on?
You can ignore invalid characters using
response.read().decode("utf-8", 'ignore')
Instead of ignore there are other options, e.g. replace
https://www.tutorialspoint.com/python/string_encode.htm
https://docs.python.org/3/howto/unicode.html#the-string-type
(There is also str.encode(encoding='UTF-8',errors='strict') for strings.)
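To make the difference between the handlers concrete, here is a small sketch using a made-up HTML fragment encoded as ISO-8859-2, where the letter ś is the single byte 0xb6 from the second traceback. The sample text and variable names are assumptions for illustration. Note that the first traceback in the question is a UnicodeEncodeError raised while printing, so decoding with iso-8859-2 had already succeeded at that point.

```python
original = "<p>\u015bwiat</p>"
raw = original.encode("iso-8859-2")  # b'<p>\xb6wiat</p>'

# Strict UTF-8 decoding rejects the lone 0xb6 byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", "ignore")    # the bad byte is silently dropped
replaced = raw.decode("utf-8", "replace")  # the bad byte becomes U+FFFD
correct = raw.decode("iso-8859-2")         # the declared charset round-trips
```

Both "ignore" and "replace" lose the original character, so they are a fallback for when the real encoding is unknown rather than a substitute for using the right one.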

Syntax Error in ArcGIS Field Calculator

I got a tiny script in ArcGIS which creates a hyperlink.
My code:
def Befahrung(value1, value2):
    if value1 is '':
        return ''
    else:
        return "G:\\Example\\" + str(value1) + "\\File_" + str(value2) + ".pdf"
The error (only when !Bezeichnun! contains a special character):
ERROR 000539: Error running expression: Befahrung(u" ",u"1155Mönch1")
Traceback (most recent call last):
File "<expression>", line 1 in <module>
File "<string>", line 5 in Befahrung
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 5: ordinal not in range(128)
!Bezeichnun! and !Auftrag! are both strings. It works very well until !Bezeichnun! contains a special character. I can't change the characters, I need to save them.
What do I have to change?
In Befahrung, you convert a string (Unicode in this case) to a byte string, which implicitly uses the ASCII codec:
str(value1)
str(value2)
This cannot work if value1 or value2 contains non-ASCII characters. You want to use
unicode(value1)
or better, use string formatting:
return u"G:\\Example\\{}\\File_{}.pdf".format(value1, value2)
(works in Python 2.7 and above)
I recommend reading the Python Unicode HOWTO. The error can be distilled to
>>> str(u"1155Mönch1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 5: ordinal not in range(128)
If you know what character encoding you need (e.g., UTF-8), you can encode it like
value1.encode('utf-8')
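Putting the formatting suggestion together, the whole function might look like the sketch below. The lower-case name befahrung and the sample arguments are assumptions, and the empty-string check uses == instead of the original identity test, which is a change of mine. It is written so that it should run unchanged under Python 2.7 and Python 3.3+.

```python
def befahrung(value1, value2):
    # Compare by value; "is" tests object identity and is unreliable for strings.
    if value1 == u"":
        return u""
    # Formatting into a Unicode template avoids the implicit ASCII conversion of str().
    return u"G:\\Example\\{}\\File_{}.pdf".format(value1, value2)


link = befahrung(u"A12", u"1155M\u00f6nch1")
empty = befahrung(u"", u"1155M\u00f6nch1")
```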

Another python unicode error

I'm getting errors such as
UnicodeEncodeError('ascii', u'\x01\xff \xfeJ a z z', 1, 2, 'ordinal not in range(128)'
I'm also getting sequences such as
u'\x17\x01\xff \xfeA r t B l a k e y'
I recognize \x01\xff\xfe as a BOM, but how do I transform these into the obvious output (Jazz and Art Blakey)?
These are coming from a program that reads music file tags.
I've tried various encodings, such as s.encode('utf8'), and various decodes followed by encodes, without success.
As requested:
from hsaudiotag import auto
inf = 'test.mp3'
song = auto.File(inf)
print song.album, song.artist, song.title, song.genre
Traceback (most recent call last):
File "audio2.py", line 4, in <module>
print song.album, song.artist, song.title, song.genre
File "C:\program files\python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfe' in position 4: character maps to <undefined>
If I change the print statement to
with open('x', 'wb') as f:
    f.write(song.genre)
I get
Traceback (most recent call last):
File "audio2.py", line 6, in <module>
f.write(song.genre)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 1: ordinal not in range(128)
For your actual question, you need to write bytes, not characters, to files. Call:
f.write(song.genre.encode('utf-8'))
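and you won't get the error. You can use io.open to get a character stream that you can write to with the encoding done automatically. The file must be opened in text mode ('w'), because binary mode does not accept an encoding argument:
import io
with io.open('x', 'w', encoding='utf-8') as f:
    f.write(song.genre)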
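A quick way to check the text-mode behaviour is to write a sample value and read the raw bytes back. In this sketch the genre string and the temporary path are stand-ins for song.genre and the file 'x'. The file is opened with 'w', since io.open raises a ValueError when an encoding is combined with binary mode.

```python
import io
import os
import tempfile

genre = u"Jazz \u00e9"  # stand-in for song.genre
path = os.path.join(tempfile.mkdtemp(), "x")

# Text mode: characters go in, UTF-8 bytes land on disk.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(genre)

# Binary mode: read back exactly what was stored.
with io.open(path, "rb") as f:
    raw = f.read()  # b'Jazz \xc3\xa9'
```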
Getting Unicode to the Console can be a matter of some difficulty (under Windows in particular)—see PrintFails.
However, as discussed in the comments, what you've got doesn't look like a working tag value... it looks more like a mangled ID3v2 frame data block, which it might not be possible to recover. I don't know if this is a bug in your tag reading library or you just have a file with rubbish tags.
