Python encoding error

What does one do with this kind of error? You are reading lines from a file. You don't know the encoding.
What does "byte 0xed" mean?
What does "position 3792" mean?
I'll try to answer this myself and repost but I'm slightly annoyed that I'm spending as long as I am figuring this out. Is there a clobber/ignore and continue method for getting past unknown encodings? I just want to read a text file!
Traceback (most recent call last):
File "./test.py", line 8, in <module>
for x in fin:
File "/bns/rma/local/lib/python3.1/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 3792: ordinal not in range(128)

0xed is the value of the byte that could not be decoded (hex ED, decimal 237). ASCII only covers 0-127, hence "ordinal not in range(128)". In Latin-1, and as a Unicode code point, 0xed is í. Position 3792 is the zero-based offset of that byte in the input being decoded, so counting from one it is the 3793rd byte.
You are using the ascii codec to decode the file, but the file is not ASCII-encoded. Try a Unicode-aware codec instead (utf_8 maybe?), or, if you know the encoding used to write the file, choose the appropriate encoding from the full list of available codecs.
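To make the two numbers in the error message concrete, here is a minimal sketch on a made-up byte string (not the asker's file). The sample bytes and variable names are assumptions for illustration, and Latin-1 is only a guess at an encoding that happens to cover the byte.

```python
# Hypothetical sample: "caf" followed by the single byte 0xed.
raw = b"caf\xed"

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # exc.start is the zero-based offset of the offending byte.
    offset = exc.start
    bad_byte = raw[offset]

print(offset, hex(bad_byte))

# Latin-1 maps every byte 0x00-0xff to the code point with the same number,
# so 0xed becomes U+00ED. This is a guess at the encoding, not a diagnosis.
text = raw.decode("latin-1")
print(ascii(text))
```

If the file really were UTF-8, a lone 0xed would fail there too, so the right codec still has to be determined from where the file came from.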

I think I found the way to be dumb :) :
fin = (x.decode('ascii', 'ignore') for x in fin)
for x in fin: print(x)
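where 'ignore' could be 'replace' or whatever. (This needs fin to be opened in binary mode, 'rb', so that each x is a bytes object with a decode method.) This at least follows the idiom "garbage in, garbage out" that I am seeking.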
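The 'ignore' and 'replace' handlers behave as sketched below. This is a small illustration on a made-up byte string, assuming Python 3; the variable names are not from the original post.

```python
raw = b"ni\xf1o\n"  # hypothetical line containing one non-ASCII byte

# 'ignore' silently discards the byte that cannot be decoded.
dropped = raw.decode("ascii", "ignore")

# 'replace' substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("ascii", "replace")

print(ascii(dropped))
print(ascii(replaced))
```

In Python 3 the same handlers can also be given to open() directly, for example open(path, encoding="ascii", errors="replace"), which avoids decoding each line by hand.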

Related

UnicodeEncodeError in BeautifulSoup webscraper

I'm having a unicode encode error with the following code for a simple web scraper.
print 'JSON scraper initializing'
from bs4 import BeautifulSoup
import json
import requests
import geocoder
# Set page variable
page = 'https://www.bandsintown.com/?came_from=257&page='
urlBucket = []
for i in range (1,3):
    uniqueUrl = page + str(i)
    urlBucket.append(uniqueUrl)
# Build response container
responseBucket = []
for i in urlBucket:
    uniqueResponse = requests.get(i)
    responseBucket.append(uniqueResponse)
# Build soup container
soupBucket = []
for i in responseBucket:
    individualSoup = BeautifulSoup(i.text, 'html.parser')
    soupBucket.append(individualSoup)
# Build events container
allSanFranciscoEvents = []
for i in soupBucket:
    script = i.find_all("script")[4]
    eventsJSON = json.loads(script.text)
    allSanFranciscoEvents.append(eventsJSON)
with open("allSanFranciscoEvents.json", "w") as writeJSON:
    json.dump(allSanFranciscoEvents, writeJSON, ensure_ascii=False)
print ('end')
The odd thing is that sometimes this code works and doesn't give an error. It has to do with the for i in range line of the code. For example, if I put in (2,4) for the range, it works fine. If I change it to (1,3), it reads:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 12: ordinal not in range(128)
Can anyone tell me how to fix this issue within my code? If I print allSanFranciscoEvents, it is reading in all the data, so I believe the issue is happening in the final piece of code, with the JSON dump. Thanks so much.
Best Fix
Use Python 3! Python 2 is going EOL very soon. New code written in legacy python today will have a very short shelf life.
The only thing I had to change to make your code work in python 3 was to call the print() function instead of the print keyword. Your example code then worked without any error.
Persisting with Python 2
The odd thing is that sometimes, this code works, and doesn't give an
error. It has to do with the for i in range line of the code. For
example, if I put in (2,4) for the range, it works fine.
That is because you are requesting different pages with those different ranges, and not every page has a character that can't be converted to str using the ascii codec. I had to go to page 5 of the response to get the same error that you did. In my case, it was the artist name, u'Mø' that caused the issue. So here's a 1 liner that reproduces the issue:
>>> str(u'Mø')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 0: ordinal not in range(128)
Your error explicitly singles out the character u'\xe9':
>>> str(u'\xe9')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
Same issue, just a different character. The character is Latin small letter e with acute. Python is trying to use the default encoding, 'ascii', to convert the Unicode string to str, but ASCII has no representation for that code point.
I believe the issue is happening in the final piece of code, with the
JSON dump.
Yes, it is:
>>> with open('tmp.json', 'w') as f:
...     json.dump(u'\xe9', f, ensure_ascii=False)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
And from the traceback, you can see that it's actually coming from writing to the file (fp.write(chunk)).
file.write() writes a string to a file, but u'\xe9' is a unicode object. The error message: 'ascii' codec can't encode character... tells us that python is trying to encode that unicode object to turn it into a str type, so it can write it to the file. Calling encode on the unicode string uses the "default string encoding", which is defined here to be 'ascii'.
To fix, don't leave it up to python to use the default encoding:
>>> with open('tmp.json', 'w') as f:
...     json.dump(u'\xe9'.encode('utf-8'), f, ensure_ascii=False)
...
# No error :)
In your specific example, you can fix the intermittent error by changing this:
allSanFranciscoEvents.append(eventsJSON)
to this:
allSanFranciscoEvents.append(eventsJSON.encode('utf-8'))
That way, you are explicitly using the 'utf-8' codec to convert the Unicode strings to str, so that python doesn't try to apply the default encoding, 'ascii' when writing to the file.
eventsJSON is an object (the parsed JSON), so eventsJSON.encode('utf-8') cannot be used on it. In Python 2.7, to write the file as UTF-8 you can use codecs, or open the file in binary mode with the 'wb' flag.
with open("allSanFranciscoEvents.json", "wb") as writeJSON:
    jsStr = json.dumps(allSanFranciscoEvents)
    # encode() is needed because a binary-mode file expects bytes, not text
    writeJSON.write(jsStr.encode('utf-8'))
print ('end')
# and read it normally
with open("allSanFranciscoEvents.json", "r") as readJson:
    data = json.load(readJson)
print(data[0][0]["startDate"])
# 2019-02-04
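For comparison, here is a sketch of the Python 3 route recommended in the first answer: open the file in text mode with an explicit encoding and let json.dump write the non-ASCII characters directly. The sample data and the temporary path are invented for illustration and are not the asker's scraped events.

```python
import json
import os
import tempfile

# Made-up sample data standing in for the scraped events.
events = [{"artist": "M\u00f8", "venue": "Caf\u00e9 du Nord"}]
path = os.path.join(tempfile.mkdtemp(), "events.json")

# With an explicit encoding, the platform default (ASCII, cp1252, ...) is never consulted.
with open(path, "w", encoding="utf-8") as fh:
    json.dump(events, fh, ensure_ascii=False)

# Read the raw bytes back to confirm the characters were stored as UTF-8.
with open(path, "rb") as fh:
    raw = fh.read()

with open(path, "r", encoding="utf-8") as fh:
    loaded = json.load(fh)

print(loaded == events)
```

Leaving ensure_ascii at its default of True is another option: the output then contains only \uXXXX escapes, so the file encoding no longer matters.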

Python - UnicodeEncodeError: 'charmap' codec can't encode characters in position 85-89: character maps to <undefined>

I am trying to see if I can transfer the output of urllib.request.urlopen() to a text file just to look at it. I tried decoding the output into a string so I can write into a file, but apparently the original output included some Korean characters that are not translating properly into the string.
So far I have:
from urllib.request import urlopen
openU = urlopen(myUrl)
pageH = openU.read()
openU.close()
stringU = pageH.decode("utf-8")
f=open("test.txt", "w+")
f.write(stringU)
I do not get any errors until the last step at which point it says:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Chae\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 85-89: character maps to <undefined>
Is there a way to get the string to also include Korean or if not, how do I skip the characters causing problems and write the rest of the string into the file?
Does it matter to you what the file encoding is? If not, then use utf-8 encoding:
f=open("test.txt", "w+", encoding="utf-8")
f.write(stringU)
If you want the file to be cp1252-encoded, which apparently is the default on your system, and to ignore unencodable values, add errors="ignore":
f=open("test.txt", "w+", errors="ignore")
f.write(stringU)
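The difference between the two options can be seen without touching the network. The sketch below uses a made-up string in place of the downloaded page and assumes Python 3; the temporary file path is only for illustration.

```python
import os
import tempfile

text = "\ud55c\uad6d\uc5b4 text"  # three Hangul syllables followed by " text" (made-up sample)

# cp1252 has no Hangul, so these handlers decide what happens to those characters.
ignored = text.encode("cp1252", errors="ignore")    # characters are dropped
replaced = text.encode("cp1252", errors="replace")  # characters become "?"

# UTF-8 can represent every character, so nothing is lost.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(ignored, replaced, round_trip == text)
```

Using a with block also closes the file, which the original snippet never does explicitly.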

Python 2.7 UnicodeDecodeError: 'ascii' codec can't decode byte

I've been parsing some docx files (UTF-8 encoded XML) with special characters (Czech alphabet). When I try to output to stdout, everything goes smoothly, but I'm unable to output data to the file:
Traceback (most recent call last):
File "./test.py", line 360, in
ofile.write(u'\t\t\t\t\t\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)
Although I explicitly cast the word variable to the unicode type (type(word) returned unicode) and tried to encode it with .encode('utf-8'), I'm still stuck with this error.
Here is a sample of the code as it looks now:
for word in word_list:
    word = unicode(word)
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...
I also tried the following:
for word in word_list:
    word = word.encode('utf-8')
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...
Even the combination of these two:
word = unicode(word)
word = word.encode('utf-8')
I was kind of desperate so I even tried to encode the word variable inside the ofile.write()
ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')
I would appreciate any hints of what I'm doing wrong.
ofile is a bytestream, which you are writing a character string to. Therefore, it tries to handle your mistake by encoding to a byte string. This is only generally safe with ASCII characters. Since word contains non-ASCII characters, it fails:
>>> open('/dev/null', 'wb').write(u'ä')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0:
ordinal not in range(128)
Make ofile a text stream by opening the file with io.open, with a mode like 'wt', and an explicit encoding:
>>> import io
>>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
1L
Alternatively, you can also use codecs.open with pretty much the same interface, or encode all strings manually with encode.
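Applied to the line from the question, the io.open approach might look like the sketch below. The sample word and the temporary path are made up, and the snippet is written to run on Python 3 (io.open is the same function as the built-in open there).

```python
import io
import os
import tempfile

word = u"p\u0159\u00edklad"  # made-up Czech sample word, not from the asker's data
path = os.path.join(tempfile.mkdtemp(), "out.xml")

# A text stream with an explicit encoding accepts unicode strings directly.
with io.open(path, "wt", encoding="utf-8") as ofile:
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n')

# Read the bytes back to confirm the non-ASCII letters were stored as UTF-8.
with io.open(path, "rb") as check:
    raw = check.read()

print(raw)
```

With a text stream, none of the strings passed to write() should be encoded by hand; mixing encoded byte strings with unicode strings is what triggers the implicit ASCII conversion.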
Phihag's answer is correct. I just want to propose converting the unicode to a byte string manually with an explicit encoding:
ofile.write((u'\t\t\t\t\t<feat att="writtenForm" val="' +
word + u'"/>\n').encode('utf-8'))
(Maybe you like to know how it's done using basic mechanisms instead of advanced wizardry and black magic like io.open.)
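I've had a similar error when writing to Word documents (.docx), specifically with the Euro symbol (€).
x = "€".encode()
Which gave the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
In Python 2, "€" is a byte string, so encode() first decodes it implicitly with the ascii codec, and that step fails. How I solved it was by decoding explicitly instead (assuming the source is UTF-8):
x = "€".decode('utf-8')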
I hope this helps!
The best solution I found on Stack Overflow is in this post:
How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"
Put this at the beginning of the code (Python 2 only) and the default encoding will be utf8:
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

Pythonic way of reading NUL in a file

I'm using python to read a text file with the segment below
(I can't post a screenshot since I'm a noob), but this is what it looks like in Notepad++:
NULSOHSOHNULNULNULSUBMessage-ID:
error:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print(f.readline())
File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7673: character maps to <undefined>
Opening the file as binary:
f = open('file.txt','rb')
f.readline()
gives me the text as binary
b'\x00\x01\x01\x00\x00\x00\x1a\xb7Message-ID:
but how do I get the text as ASCII? And what's the easiest/most Pythonic way of handling this?
The problem is with "byte 0x8f in position 7673", not with "byte 0x00 in position 1". I.e., your NUL is not the problem. If you look at the cp-1252 codepage on wikipedia, you can see that 0x8f has no corresponding character.
The larger issue is that your file is not in a single encoding: it appears to be a mix of binary framing of text segments. What you really need to do is figure out the format of this file and parse it into binary pieces (or perhaps some richer data structure, like a tuple, list, dict, object, etc), then decode the text pieces into unicode if you need to process further.
When opening a file in text mode, you can specifically tell which encoding to use:
f = open('file.txt','r',encoding='ascii')
However, your real problem is different: the binary piece that you cited cannot be read as ASCII, because the byte \xb7 is outside the ASCII range (0-127). The exception traceback shows that Python is using the cp1252 codec by default, which cannot decode your file either.
You need either to figure out which encoding the file has, or to handle it as binary all the time.
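Handling the data as binary and decoding only the text parts could look like the sketch below. The real file format is unknown, so splitting on the Message-ID: marker is purely an assumption for illustration, and the sample record is made up from the bytes shown in the question.

```python
# Made-up record: the binary prefix from the question plus an invented header value.
raw = b"\x00\x01\x01\x00\x00\x00\x1a\xb7Message-ID: <id@example.invalid>\n"

marker = b"Message-ID:"  # hypothetical boundary between framing and text
start = raw.find(marker)

framing = raw[:start]               # kept as bytes, never decoded
text = raw[start:].decode("ascii")  # only the text slice is decoded

print(framing)
print(text)
```

A real parser would follow the file's actual record layout (lengths, type codes, and so on) rather than searching for a marker.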
Perhaps open it in the correct read mode?
f = open('file.txt','r')
f.readline()

Python 3 chokes on CP-1252/ANSI reading

I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguments. Can I pass extra arguments to open() or use something in the codecs module to open these differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.
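The exception handling described in the update could be sketched roughly as follows. The helper name and the sample bytes are hypothetical; this only illustrates treating a decode failure as "not parseable", and is not the asker's actual parser code.

```python
def decodes_as_text(data, encoding="cp1252"):
    """Return True if `data` decodes cleanly with `encoding` (hypothetical helper)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


zip_like = b"PK\x03\x04\x14\x00\x81"  # made-up bytes: zip signature plus the byte 0x81
plain = b"plain text\n"

print(decodes_as_text(zip_like), decodes_as_text(plain))
```

A clean decode does not prove the input is text, since binary data can happen to contain only decodable bytes, so this check is a filter rather than a guarantee.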
Byte 0x81 is unassigned in Windows-1252 (aka cp1252). In Latin-1 (aka ISO 8859-1) it is assigned to the U+0081 HIGH OCTET PRESET (HOP) control character. I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between the Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
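The difference between the two codecs shows up in the 0x80-0x9f range. A short sketch, assuming Python 3; the variable names are only for illustration.

```python
smart_quote = b"\x93"

# In cp1252, 0x93 is the left double quotation mark (U+201C).
as_cp1252 = smart_quote.decode("cp1252")

# In Latin-1, the same byte is the C1 control character U+0093.
as_latin1 = smart_quote.decode("latin-1")

# 0x81 is undefined in cp1252 but decodes to U+0081 in Latin-1.
try:
    b"\x81".decode("cp1252")
    cp1252_accepts_81 = True
except UnicodeDecodeError:
    cp1252_accepts_81 = False
latin1_81 = b"\x81".decode("latin-1")

print(ascii(as_cp1252), ascii(as_latin1), cp1252_accepts_81, ascii(latin1_81))
```

Because Latin-1 accepts every byte value, decoding with it never fails, which is why a successful Latin-1 decode says nothing about whether the guess was right.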
You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it is encoded in latin1, the "\x81" that it is complaining about is a C1 control character that doesn't even have a name (in Unicode). Consider latin1 extremely unlikely to be valid.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
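As an illustration of the parser honouring the declared encoding, the sketch below feeds raw bytes to xml.dom.minidom. The document is made up; it assumes the bytes are passed to the parser undecoded, so that the XML declaration decides how they are interpreted.

```python
from xml.dom import minidom

# Made-up document: the declaration says ISO-8859-1, and 0xe9 is e-acute in that encoding.
xml_bytes = b'<?xml version="1.0" encoding="iso-8859-1"?><word>caf\xe9</word>'

doc = minidom.parseString(xml_bytes)
value = doc.documentElement.firstChild.data

print(ascii(value))
```

The same applies to minidom.parse() when given a filename or a file opened in binary mode: no encoding argument is needed in the calling code.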
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
Note: please edit your question with clarifying information; don't answer in the comments.
