I am trying the following code with a particular HTML file:
from BeautifulSoup import BeautifulSoup
import re
import codecs
import sys
f = open('test1.html')
html = f.read()
soup = BeautifulSoup(html)
body = soup.body.contents
para = soup.findAll('p')
print str(para).encode('utf-8')
I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 9: ordinal not in range(128)
How do I debug this?
I do not get any error when I remove the print call.
The str(para) call is trying to apply the default (ascii) encoding to the unicode in para. This happens before the encode() call:
>>> s=u'123\u2019'
>>> str(s)
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> s.encode("utf-8")
'123\xe2\x80\x99'
>>>
Try encoding para directly, maybe by applying encode("utf-8") to each list element.
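For example, a minimal sketch along those lines (assuming Python 2 and the BeautifulSoup 3 import from the question; the join/print step is just one way to display the result):
from BeautifulSoup import BeautifulSoup

html = open('test1.html').read()
soup = BeautifulSoup(html)
para = soup.findAll('p')
# Encode each tag to a UTF-8 bytestring ourselves, so str() is never
# asked to apply the default ascii codec to the whole list.
encoded = [unicode(p).encode('utf-8') for p in para]
print '\n'.join(encoded)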
Related
I wanted to write a Python script that fills out the tariffs form (water, heat, electricity).
But first I wanted to web-scrape the page (it looks super ugly, just don't mind that). I did:
import requests
from bs4 import BeautifulSoup
req = requests.get(
'https://domm132.wixsite.com/tsnmichurina/forma-sdachi-pokazanij')
soup = BeautifulSoup(req.text, 'html.parser')
print(soup)
Output is:
Traceback (most recent call last):
File "C:\Users\ska19\sublime_projects\script.py", line 16, in <module>
print(soup)
File "C:\Users\ska19\Anaconda3\lib\encodings\cp1251.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xe9' in position 12506: character maps to <undefined>
[Finished in 2.8s]
I have tried to specify the encoding and ignore errors:
req = requests.get(
'https://domm132.wixsite.com/tsnmichurina/forma-sdachi-pokazanij')
text = req.text.encode('utf-8', 'ignore')
soup = BeautifulSoup(text, 'lxml')
print(soup)
Again the error appears:
UnicodeEncodeError: 'charmap' codec can't encode character '\xe9' in position 12504: character maps to <undefined>
Unfortunately, I have no idea either how to decode the character correctly or how to skip it.
Could anyone tell me how to fix this? Sorry for a simple question.
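For what it's worth, a rough sketch of two common workarounds (assuming Python 3.7+ for reconfigure(); page.html is just a hypothetical output filename). The traceback shows print() failing because the Windows console stream uses cp1251, so either re-encode stdout or skip the console and write to a UTF-8 file:
import sys
import requests
from bs4 import BeautifulSoup

req = requests.get('https://domm132.wixsite.com/tsnmichurina/forma-sdachi-pokazanij')
soup = BeautifulSoup(req.text, 'html.parser')

# Option 1: make stdout emit UTF-8 (characters the console font lacks may
# still render oddly, but print() no longer raises).
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
print(soup)

# Option 2: bypass the console entirely and write to a UTF-8 file.
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(str(soup))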
I'm trying to write the HTML from a webpage to a file, but I have a problem decoding characters:
import urllib.request
response = urllib.request.urlopen("https://www.google.com")
charset = response.info().get_content_charset()
print(response.read().decode(charset))
The last line causes an error:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in position 6079: ordinal not in range(128)
response.info().get_content_charset() returns iso-8859-2, but if I check the content of the response without decoding (print(response.read())), the HTML meta tag says the encoding is "utf-8". If I use "utf-8" in the decode call, there is a similar problem:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 6111: invalid start byte
What's going on?
You can ignore invalid characters using
response.read().decode("utf-8", 'ignore')
Instead of ignore there are other options, e.g. replace
https://www.tutorialspoint.com/python/string_encode.htm
https://docs.python.org/3/howto/unicode.html#the-string-type
(There is also str.encode(encoding='UTF-8',errors='strict') for strings.)
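A tiny illustration of the difference between the two error handlers (using a made-up byte string containing the same invalid 0xb6 byte):
data = b'abc \xb6 def'                   # 0xb6 is not valid UTF-8 on its own
print(data.decode('utf-8', 'ignore'))    # 'abc  def'       - the bad byte is dropped
print(data.decode('utf-8', 'replace'))   # 'abc \ufffd def' - the bad byte becomes U+FFFD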
I've searched for a while with no result. Python seems to be able to handle some, but not all, webpages:
import requests, webbrowser, bs4
res = requests.get('http://www.reddit.com')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print soup.prettify()
Surprisingly, this is able to print the Amazon.com homepage, but not Reddit. The error I get is:
Traceback (most recent call last):
File "testweb.py", line 7, in <module>
print soup.prettify()
File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xd7' in position 37769: character maps to <undefined>
My question: How can I write a program that can encode for any webpage? Where am I going wrong?
EDIT: Further testing shows google.com also does not work. It's a similar error message:
Traceback (most recent call last):
File "testweb.py", line 7, in <module>
print soup.prettify()
File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 9651: character maps to <undefined>
EDIT 2: Tried decoding res.text to utf-8 but got this error:
Traceback (most recent call last):
File "testweb.py", line 5, in <module>
soup = bs4.BeautifulSoup(res.text.decode('utf-8'), 'html.parser')
File "C:\PYTHON27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 9358: ordinal not in range(128)
EDIT 3: Tried encoding res.text to utf-8 but got this error:
Traceback (most recent call last):
File "testweb.py", line 8, in <module>
print soup.prettify()
File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 9622: character maps to <undefined>
Change the output encoding to utf-8 so the script emits UTF-8 encoded text, and try encoding the request text instead of decoding it.
Example:
# -*- coding: utf-8 -*-
import requests, webbrowser, bs4
res = requests.get('http://www.reddit.com')
soup = bs4.BeautifulSoup(res.text.encode('utf-8'), 'html.parser')
print (soup.prettify())
Or try passing an encoding directly to prettify():
print (soup.prettify('latin-1')) or print (soup.prettify('utf-8'))
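Roughly (a sketch assuming Python 2 and bs4, as in the question): prettify(encoding) returns a bytestring already encoded in that encoding, so printing it no longer goes through the console's cp437 codec.
import requests, bs4

res = requests.get('http://www.reddit.com')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# prettify('utf-8') hands back UTF-8 bytes instead of a unicode object,
# so Python does not try to re-encode it with the console codec on print.
print soup.prettify('utf-8')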
I have a problem with the html2text module... it shows me a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)
Example :
#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib
h = html2text.HTML2Text()
h.ignore_links = True
html = urllib.urlopen( "http://google.com" ).read()
print h.handle( html )
...I have also tried h.handle(unicode(html, "utf-8")) with no success. Any help?
EDIT:
Traceback (most recent call last):
File "test.py", line 12, in <module>
print h.handle(html)
File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
return self.optwrap(self.close())
File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)
The issue is easily reproducible when you don't decode, but everything works just fine when you decode your source correctly. You also get the error if you reuse the parser!
You can try this out with a known good Unicode source, such as http://www.ltg.ed.ac.uk/~richard/unicode-sample.html.
If you don't decode the response to unicode, the library fails:
>>> h = html2text.HTML2Text()
>>> h.handle(html)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Now, if you reuse the HTML2Text object, its state is not cleared up; it still holds the incorrect data, so even passing in Unicode will now fail:
>>> h.handle(html.decode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
You need to use a new object and it'll work just fine:
>>> h = html2text.HTML2Text()
>>> result = h.handle(html.decode('utf8'))
>>> len(result)
12750
>>> type(result)
<type 'unicode'>
Here's what I did:
>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>>
>>> soup.find('div')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>>
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>
How can I simply remove the troubling unicode characters from the html?
Or is there a cleaner solution?
Try this way:
soup = BeautifulSoup (html.decode('utf-8', 'ignore'))
The error you see occurs because repr(soup) tries to mix Unicode and bytestrings. Mixing Unicode and bytestrings frequently leads to errors.
Compare:
>>> u'1' + '©'
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
And:
>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'
Here's an example for classes:
>>> class A:
... def __repr__(self):
... return u'copyright ©'.encode('utf-8')
...
>>> A()
copyright ©
>>> class B:
... def __repr__(self):
... return u'copyright ©'
...
>>> B()
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
... def __repr__(self):
... return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
Similar thing happens with BeautifulSoup:
>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)
To work around it:
>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'
First of all, "troubling" unicode characters could be letters in some language, but assuming you won't have to worry about non-English characters, you can use a Python library to convert Unicode to ASCII. Check out the answer to this question:
How do I convert a file's format from Unicode to ASCII using Python?
The accepted answer there seems like a good solution (that I didn't know about beforehand).
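For instance, one common Unicode-to-ASCII approach looks roughly like this (a sketch; not necessarily the exact library the linked answer recommends):
import unicodedata

def to_ascii(text):
    # Decompose accented characters, then drop anything that still isn't ASCII.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

print(to_ascii(u'caf\xe9 \xae'))   # -> 'cafe '  (the accent collapses to 'e', the (R) sign is dropped)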
I had the same problem and spent hours on it. Notice that the error occurs whenever the interpreter has to display content; that's because the interpreter is trying to convert it to ascii, which causes problems. Take a look at the top answer here:
UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2