UnicodeEncodeError in BeautifulSoup webscraper - python

I'm having a unicode encode error with the following code for a simple web scraper.
print 'JSON scraper initializing'
from bs4 import BeautifulSoup
import json
import requests
import geocoder
# Set page variable
page = 'https://www.bandsintown.com/?came_from=257&page='
urlBucket = []
for i in range (1,3):
    uniqueUrl = page + str(i)
    urlBucket.append(uniqueUrl)
# Build response container
responseBucket = []
for i in urlBucket:
    uniqueResponse = requests.get(i)
    responseBucket.append(uniqueResponse)
# Build soup container
soupBucket = []
for i in responseBucket:
    individualSoup = BeautifulSoup(i.text, 'html.parser')
    soupBucket.append(individualSoup)
# Build events container
allSanFranciscoEvents = []
for i in soupBucket:
    script = i.find_all("script")[4]
    eventsJSON = json.loads(script.text)
    allSanFranciscoEvents.append(eventsJSON)
with open("allSanFranciscoEvents.json", "w") as writeJSON:
    json.dump(allSanFranciscoEvents, writeJSON, ensure_ascii=False)
print ('end')
The odd thing is that sometimes this code works and doesn't give an error. It has to do with the for i in range line of the code. For example, if I put in (2,4) for the range, it works fine. If I change it to (1,3), it reads:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 12: ordinal not in range(128)
Can anyone tell me how to fix this issue within my code? If I print allSanFranciscoEvents, it is reading in all the data, so I believe the issue is happening in the final piece of code, with the JSON dump. Thanks so much.

Best Fix
Use Python 3! Python 2 is going EOL very soon. New code written in legacy python today will have a very short shelf life.
The only thing I had to change to make your code work in python 3 was to call the print() function instead of the print keyword. Your example code then worked without any error.
Persisting with Python 2
The odd thing is that sometimes this code works and doesn't give an
error. It has to do with the for i in range line of the code. For
example, if I put in (2,4) for the range, it works fine.
That is because you are requesting different pages with those different ranges, and not every page has a character that can't be converted to str using the ascii codec. I had to go to page 5 of the response to get the same error that you did. In my case, it was the artist name, u'Mø', that caused the issue. So here's a one-liner that reproduces the issue:
>>> str(u'Mø')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 0: ordinal not in range(128)
Your error explicitly singles out the character u'\xe9':
>>> str(u'\xe9')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
Same issue, just a different character. The character is Latin small letter e with acute. Python is trying to use the default encoding, 'ascii', to convert the Unicode string to str, but ASCII only covers code points 0 to 127, so it has no way to represent this one.
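The same failure can be reproduced without str(), by asking for the ASCII encoding explicitly. This is a minimal sketch rather than anything taken from the question; the variable names are illustrative, and it is written so that it should behave the same on Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
text = u"\xe9"  # LATIN SMALL LETTER E WITH ACUTE, code point 233

# ASCII only covers code points 0-127, so this encode call raises.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every code point; U+00E9 becomes two bytes.
utf8_bytes = text.encode("utf-8")

print(ascii_failed)
print(len(utf8_bytes))
```

The point is that the codec, not the character, decides whether the conversion succeeds.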
I believe the issue is happening in the final piece of code, with the
JSON dump.
Yes, it is:
>>> with open('tmp.json', 'w') as f:
...     json.dump(u'\xe9', f, ensure_ascii=False)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
And from the traceback, you can see that it's actually coming from writing to the file (fp.write(chunk)).
file.write() writes a str to a file, but u'\xe9' is a unicode object. The error message ('ascii' codec can't encode character...) tells us that Python is trying to encode that unicode object to turn it into a str so that it can write it to the file. That implicit conversion uses the "default string encoding", which in Python 2 is 'ascii'.
To fix, don't leave it up to python to use the default encoding:
>>> with open('tmp.json', 'w') as f:
...     json.dump(u'\xe9'.encode('utf-8'), f, ensure_ascii=False)
...
# No error :)
In your specific example, you can fix the intermittent error by changing this:
allSanFranciscoEvents.append(eventsJSON)
to this:
allSanFranciscoEvents.append(eventsJSON.encode('utf-8'))
That way, you are explicitly using the 'utf-8' codec to convert the Unicode strings to str, so that python doesn't try to apply the default encoding, 'ascii' when writing to the file.

eventsJSON is a parsed JSON object (a list or dict), not a string, so it has no encode method and eventsJSON.encode('utf-8') will not work. In Python 2.7, to write the file as UTF-8 you can use codecs, or open the file in binary mode with the wb flag.
with open("allSanFranciscoEvents.json", "wb") as writeJSON:
    # json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
    # so jsStr is a plain ASCII str that can be written in binary mode as-is
    jsStr = json.dumps(allSanFranciscoEvents)
    writeJSON.write(jsStr)
print ('end')
# and read it normally
with open("allSanFranciscoEvents.json", "r") as readJson:
    data = json.load(readJson)
print(data[0][0]["startDate"])
# 2019-02-04
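If you would rather have the accented characters stored as real UTF-8 instead of \uXXXX escapes, one option is to let the file object do the encoding. The following is a minimal sketch for Python 3, not the asker's code: the events list is made-up sample data standing in for the scraped result, and the temporary path is only there to keep the example self-contained.

```python
import io
import json
import os
import tempfile

# Hypothetical stand-in for the scraped data; the real structure comes from the site.
events = [{"artist": u"M\xf8", "venue": u"Caf\xe9"}]

path = os.path.join(tempfile.mkdtemp(), "events.json")

# The text-mode file encodes to UTF-8, so json.dump may emit raw non-ASCII text.
with io.open(path, "w", encoding="utf-8") as f:
    json.dump(events, f, ensure_ascii=False)

# Read it back as text to confirm the round trip.
with io.open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

# Inspect the bytes on disk: the artist name is stored as UTF-8, not as \u00f8.
with io.open(path, "rb") as f:
    raw = f.read()

print(loaded == events)
```

With ensure_ascii left at its default, the file would contain only ASCII and any encoding would do; the explicit encoding matters as soon as ensure_ascii=False is used.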

Related

Writing Non Ascii Text on Python

Here is a simple script. I'm confused about how to write non-ASCII text. I want to write some characters to a file. As far as I know, write() takes a plain str by default. For instance, chr(128) has type str, so I concluded that writing it to a file should not be a problem: its type is str, not bytes, and the file is opened in the default text mode, not binary.
#python 3
#v.1
print(type(chr(128))) #The type is str
f = open('tep.txt','w')
f.write('\n')
for i in range(128,1000):
    f.write(chr(i))
f.close()
The code above fails, but this version works:
#v.2
f = open('tep.txt','wb')
f.write('\n'.encode('utf-8'))
for i in range(128,1000):
    f.write(chr(i).encode('utf-8'))
f.close()
I don't understand why writing in binary mode makes the difference.
Characters with code points of 128 and above still have type str, so writing them to a file opened in the default mode should be fine, yet it fails.
What's happening here?
Python 3 strings are Unicode and must be encoded when written to a file. The default encoding for open on some OSes is not UTF-8, so it is best to be explicit. According to the open() documentation, the default is whatever locale.getpreferredencoding(False) returns; on my Windows system that is:
>>> import locale
>>> locale.getpreferredencoding(False)
'cp1252'
>>> chr(128).encode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python38\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>
That is probably the error you see (post it next time so we don't have to guess!).
It is best to be explicit about encoding whenever opening a file for reading and writing, since it varies by OS, and some encodings don't support every Unicode code point.
Not Explicit - Note the error complains about using CP1252 and not supporting that character:
>>> with open('tep.txt','w') as f: # NOT explicit
...     f.write(chr(128))
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>
Explicit - UTF-8 supports every valid Unicode code point:
>>> with open('tep.txt','w',encoding='utf8') as f: # Explicit!
...     f.write(chr(128))
...
1
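Applied to the loop from the question, the explicit-encoding approach might look like the sketch below. The temporary path and the variable names are assumptions made so the example can run anywhere; newline="" is passed only so that the text comes back exactly as written.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "tep.txt")

# Same characters as the question's loop: a newline, then code points 128..999.
expected = "\n" + "".join(chr(i) for i in range(128, 1000))

# Text mode plus an explicit encoding: no manual .encode() calls are needed.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(expected)

# Read back with the same encoding to check nothing was lost.
with open(path, "r", encoding="utf-8", newline="") as f:
    round_tripped = f.read()

print(round_tripped == expected)
```

This keeps the file in text mode, so the binary-mode workaround from the question is not required.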
Further Reading:
The Absolute Minimum Every Software Developer...Must Know About Unicode....
Pragmatic Unicode

Python - UnicodeEncodeError: 'charmap' codec can't encode characters in position 85-89: character maps to <undefined>

I am trying to see if I can transfer the output of urllib.request.urlopen() to a text file just to look at it. I tried decoding the output into a string so I can write into a file, but apparently the original output included some Korean characters that are not translating properly into the string.
So far I have:
from urllib.request import urlopen
openU = urlopen(myUrl)
pageH = openU.read()
openU.close()
stringU = pageH.decode("utf-8")
f=open("test.txt", "w+")
f.write(stringU)
I do not get any errors until the last step at which point it says:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Chae\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 85-89: character maps to <undefined>
Is there a way to get the string to also include Korean or if not, how do I skip the characters causing problems and write the rest of the string into the file?
Does it matter to you what the file encoding is? If not, then use utf-8 encoding:
f=open("test.txt", "w+", encoding="utf-8")
f.write(stringU)
If you want the file to be cp1252-encoded, which apparently is the default on your system, and to ignore unencodable values, add errors="ignore":
f=open("test.txt", "w+", errors="ignore")
f.write(stringU)
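To see what the two options do to the data, it can help to encode a short string by hand. This is a small sketch with a made-up sample string; errors="replace" is shown alongside errors="ignore" as another standard error handler, not something the answer above proposed.

```python
text = "abc 한국어 xyz"  # made-up sample mixing ASCII and Korean

# cp1252 has no Hangul, so "ignore" silently drops those three characters.
ignored = text.encode("cp1252", errors="ignore")

# "replace" keeps a visible "?" placeholder for each unencodable character.
replaced = text.encode("cp1252", errors="replace")

# UTF-8 keeps everything and decodes back to the original string.
utf8 = text.encode("utf-8")

print(ignored)
print(replaced)
print(utf8.decode("utf-8") == text)
```

Only the UTF-8 route preserves the Korean text; the other two are lossy by design.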

Encoding error using csv.reader on io file object with non-ascii encoding

I am trying to read a csv file with cp1252 encoding like this:
import io
import csv
csvr = csv.reader(io.open('data.csv', encoding='cp1252'))
for row in csvr:
    print row
The relevant content of 'data.csv' is
Curva IV
Fecha: 27-Jul-2016 16:22:40
Muestra: 1
Tensión Corriente Ig
0.000000e+000 1.154330e-004 -2.984730e-004
...
and I get the following output
['Curva IV']
['Fecha: 27-Jul-2016 16:22:40']
['Muestra: 1']
Traceback (most recent call last):
  File "D:/sandbox/bla.py", line 347, in <module>
    mist()
  File "D:/sandbox/bla.py", line 343, in mist
    for row in csvr:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 5: ordinal not in range(128)
which I do not understand at all. Obviously the critical line is that with the accent on the 'o'. It seems like the iterator of the object returned by csv.reader is attempting to do a conversion. The exception is raised before the print statement, so it is not a problem with my terminal encoding. Any ideas what is going on here?
From the docs:
Note
This version of the csv module doesn’t support Unicode input. Also,
there are currently some issues regarding ASCII NUL characters.
Accordingly, all input should be UTF-8 or printable ASCII to be safe;
see the examples in section Examples.
The input has to be converted to UTF-8 before passing it to csv.reader.
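For comparison, the csv module in Python 3 works on text rather than bytes, so there the decoding can be left to the file object. The sketch below takes that different route (Python 3, not the Python 2 conversion described above); the tab-separated sample is a made-up approximation of the file in the question, since the real delimiter is not shown.

```python
import csv
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.csv")

# Hypothetical tab-separated sample modelled on the question's file, saved as cp1252.
sample = u"Curva IV\nMuestra: 1\nTensi\xf3n\tCorriente\tIg\n"
with io.open(path, "w", encoding="cp1252", newline="") as f:
    f.write(sample)

# Python 3's csv.reader consumes decoded text, so the file's encoding goes on open().
with io.open(path, encoding="cp1252", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))

print(rows[2])
```

The newline="" argument follows the csv documentation's recommendation for files handed to the reader.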

Python 2.7 UnicodeDecodeError: 'ascii' codec can't decode byte

I've been parsing some docx files (UTF-8 encoded XML) with special characters (Czech alphabet). When I try to output to stdout, everything goes smoothly, but I'm unable to output data to a file:
Traceback (most recent call last):
  File "./test.py", line 360, in
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)
Although I explicitly cast the word variable to unicode (type(word) returned unicode) and tried encoding it with .encode('utf-8'), I'm still stuck with this error.
Here is a sample of the code as it looks now:
for word in word_list:
    word = unicode(word)
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...
I also tried the following:
for word in word_list:
    word = word.encode('utf-8')
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...
Even the combination of these two:
word = unicode(word)
word = word.encode('utf-8')
I was kind of desperate so I even tried to encode the word variable inside the ofile.write()
ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')
I would appreciate any hints of what I'm doing wrong.
ofile is a bytestream, which you are writing a character string to. Therefore, it tries to handle your mistake by encoding to a byte string. This is only generally safe with ASCII characters. Since word contains non-ASCII characters, it fails:
>>> open('/dev/null', 'wb').write(u'ä')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
Make ofile a text stream by opening the file with io.open, with a mode like 'wt', and an explicit encoding:
>>> import io
>>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
1L
Alternatively, you can also use codecs.open with pretty much the same interface, or encode all strings manually with encode.
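The difference between the two kinds of stream is easier to see in Python 3, which refuses to guess an encoding at all. The sketch below is written for Python 3 under that assumption; the file name and the sample line are illustrative, loosely modelled on the write call in the question.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.xml")
line = u'\t<feat att="writtenForm" val="\xed"/>\n'

# Binary stream: Python 3 raises TypeError for text instead of encoding it implicitly.
try:
    with io.open(path, "wb") as f:
        f.write(line)
    binary_rejected_text = False
except TypeError:
    binary_rejected_text = True

# Text stream with an explicit encoding: the file object encodes for us.
with io.open(path, "wt", encoding="utf-8", newline="") as f:
    f.write(line)

# The bytes on disk match a manual encode of the same string.
with io.open(path, "rb") as f:
    raw = f.read()

print(binary_rejected_text)
print(raw == line.encode("utf-8"))
```

Either way, the rule is the same as in Python 2: decide on one encoding and apply it at the boundary where text becomes bytes.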
Phihag's answer is correct. I would just add that you can also convert the unicode to a byte string manually with an explicit encoding:
ofile.write((u'\t\t\t\t\t<feat att="writtenForm" val="' +
             word + u'"/>\n').encode('utf-8'))
(Maybe you like to know how it's done using basic mechanisms instead of advanced wizardry and black magic like io.open.)
I've had a similar error when writing to Word documents (.docx), specifically with the Euro symbol (€).
x = "€".encode()
Which gave the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
I solved it by decoding the byte string with an explicit encoding instead (this assumes the source is UTF-8):
x = "€".decode('utf-8')
I hope this helps!
The best solution I found on Stack Overflow is in this post:
How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"
Put this at the beginning of the code and the default encoding will be UTF-8:
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

UnicodeEncodeError when trying to convert Django models to XML

I found a Python program, Export Django database to xml file, that converts Django models to an XML representation. I get these errors when trying to run the program. My models contain some text written in French.
Traceback (most recent call last):
  File "xml_export.py", line 71, in <module>
    writer.content(value)
  File "xml_export.py", line 41, in content
    self.output += str(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
It looks like your variable text contains a non-ASCII string.
See:
>>> mystring = u"élève"
>>> mystring
u'\xe9l\xe8ve'
>>> str(mystring)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
So, you first need to encode your string into UTF-8:
>>> str(mystring.encode("utf-8"))
'\xc3\xa9l\xc3\xa8ve'
Or, if (as the comments show) text may contain other variable types besides strings, use
self.output += unicode(text).encode("utf-8")
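In Python 3 the same idea is spelled with str instead of unicode. The helper below is a hypothetical illustration, not part of the linked exporter: it renders any value as text first and only then encodes it as UTF-8.

```python
def to_utf8(value):
    """Hypothetical helper: render any value as text, then encode it as UTF-8."""
    return str(value).encode("utf-8")


# Works for text containing accents as well as for non-string values.
encoded_text = to_utf8("élève")
encoded_number = to_utf8(42)

print(encoded_text)
print(encoded_number)
```

Converting to text first is what lets the same call handle numbers and other field values.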
Seriously, don't use the linked code. It's terrible, and appears to have been written by someone with absolutely no knowledge of unicode, character encodings, or even how to build up an XML document. String concatenation? Really?
Just don't use it.
Did you try the built-in command:
./manage.py dumpdata --format xml
The way you used unicode in u'élève' is OK, so this should work (normally...).
