urlopen, BeautifulSoup and UTF-8 Issue - python

I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. This character is not visible when I use "View Source."
import urllib2
from BeautifulSoup import BeautifulSoup
isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()
html = BeautifulSoup(page)
html  # This line causes the error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)
I also tried...
html = BeautifulSoup(page.encode('utf-8'))
How can I read this web page into BeautifulSoup without getting this error?

This error is probably happening when you try to print the representation of the BeautifulSoup object, which happens automatically if, as I suspect, you are working in the interactive console.
# This code will work fine. Note that we assign the BeautifulSoup object
# to a variable, which prevents the console from printing it immediately.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(u'\xa0')
# This will probably show the error you saw
print soup
# And this would probably be fine
print soup.encode('utf-8')
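For readers on Python 3, the same failure can be reproduced without BeautifulSoup or a network request. This is only a minimal sketch; the sample string is made up for illustration, and the ASCII codec stands in for whatever a console limited to ASCII would use.

```python
# Python 3 sketch: the sample text is hypothetical, not taken from the page.
text = "price:\xa0100"  # \xa0 is NO-BREAK SPACE, which HTML writes as &nbsp;

try:
    text.encode("ascii")      # roughly what an ASCII-only console has to do
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False          # same error class as in the traceback above

# Encoding explicitly as UTF-8 succeeds for this text.
utf8_bytes = text.encode("utf-8")
print(ascii_ok, utf8_bytes)
```

The point is that parsing the page is not what fails; turning the parsed text into bytes for an ASCII-only destination is.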

You could try the following:
import codecs
from bs4 import BeautifulSoup
# filename is the path to the HTML file saved on disk
f = codecs.open(filename, 'r', 'utf-8')
soup = BeautifulSoup(f.read(), "html.parser")
I encountered a similar problem with bs4.

Related

Using the requests and Beautiful Soup modules in Python

My code is very short, but it raises an unexpected error:
import bs4
import requests
url = "https://www." + input("Enter The Name of the website: ") + ".com"
req = requests.get(url)
html_text = req.text
htmls = bs4.BeautifulSoup(html_text, "html.parser").prettify()
with open("facebook.html", "w+") as file:
    file.write(htmls)
Traceback (most recent call last):
File "C:\Users\Tanay Mishra\PycharmProjects\First_Project\Web Scraping\web.py", line 9, in <module>
file.write(htmls)
File "C:\Users\Tanay Mishra\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined>
\u200b is ZERO WIDTH SPACE in Unicode, and the cp1252 codec shown in the traceback has no mapping for it. Specify encoding="utf-8" in the open() call. It is also good practice to pass the .content property of the Response object to BeautifulSoup and let it detect the encoding:
import bs4
import requests
url = "https://www."+input("Enter The Name of the website: ")+".com"
req = requests.get(url)
htmls = bs4.BeautifulSoup(req.content, "html.parser").prettify()
with open("facebook.html", "w+", encoding="utf-8") as file:
    file.write(htmls)
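To see why the encoding argument matters, here is a small sketch that needs no network access. The HTML snippet and the temporary output path are made up for illustration; only the behaviour of the two codecs is the point.

```python
import os
import tempfile

html = "<p>zero\u200bwidth</p>"  # hypothetical markup containing ZERO WIDTH SPACE

# cp1252 has no mapping for U+200B, so encoding fails as in the traceback.
try:
    html.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# With an explicit UTF-8 encoding the character is written and read back intact.
path = os.path.join(tempfile.mkdtemp(), "page.html")  # hypothetical output path
with open(path, "w", encoding="utf-8") as fh:
    fh.write(html)
with open(path, encoding="utf-8") as fh:
    round_trip = fh.read()
```

Without the encoding argument, open() falls back to a platform-dependent default, which is why the same script can work on one machine and fail on another.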

Is there a way to retrieve the HTML content of a web page by casting it into a string in Python?

I am trying to retrieve the HTML content of a web page and read it as a string. However, whenever I run my code I get a bytes-like object instead of a string, and decode() does not seem to work in this case.
My code is the following:
import urllib.request
money_request = urllib.request.urlopen('website-url-here').read()
print(money_request.decode('utf-8'))
Running the above script will yield the following error:
Traceback (most recent call last):
File "E:\University Stuff\Licenta\gas_station_service.py", line 12, in <module>
print(money_request.decode())
File "C:\Python38\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u02bb' in position 143288: character maps to <undefined>
>>>
I would also like to specify that I have checked if the website uses utf-8 encoding using the Chrome console and the command document.characterSet.
I need to retrieve this as a string in order to perform a search on the lines of code to get a value from a span tag.
Any help is appreciated.
It may be better to use Beautiful Soup, because it parses the HTML for you.
If you don't have the module, install it with pip install bs4 on Windows, or pip3 install bs4 on macOS or Linux. requests is a third-party package, so install it the same way if it is missing, and if you don't have the lxml module, install it with pip install lxml.
import requests
from bs4 import BeautifulSoup
res = requests.get('website-url-here')
src = res.content
soup = BeautifulSoup(src, 'lxml')
markup = soup.prettify()
print(markup)
This gives you the entire scraped page, which may make it easier to extract the useful parts by finding the content you want.
soup.find_all('div', {'class': 'classname'})
returns a list of all matching elements, while
soup.find('div', {'class': 'classname'})
returns only the first match. The choice is yours.
You can simply use the .text attribute of the response to get the page's HTML as a string:
import requests
response = requests.get('website-url-here')
print(response.text)
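Note that the traceback in the question comes from print(), not from decode(): the string was decoded fine, but the cp1252 console could not encode '\u02bb'. If changing the console encoding is not an option, one possible workaround is to escape the characters the console cannot show. The helper name console_safe and the sample string are assumptions for illustration only.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent escaped."""
    # Fall back to UTF-8 when stdout has no encoding (e.g. when it is replaced).
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

sample = "Hawai\u02bbi"  # hypothetical text containing U+02BB
print(console_safe(sample, "cp1252"))
```

The string itself stays intact for searching and parsing; only the copy sent to the console is escaped.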

Ascii codec can't encode character error, content.decode('utf-8','ignore') won't work. Any other options?

Example code:
import requests
import bs4 as bs
import urllib
import urllib.request
from urllib.request import Request
req = Request("https://www.openbugbounty.org/researchers/Spam404/vip/page/1/",
headers ={'User-Agent':'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce,'lxml')
print(soup)
My output gives me the following error:
File "/Users/student/Desktop/AutoBots/kbfza2.py", line 15, in <module>
print(soup)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5194: ordinal not in range(128)
After searching for a solution online for a while, it seemed that changing my soup line to:
soup = bs.BeautifulSoup(sauce.decode('utf-8', 'ignore'), 'lxml')
would solve this for me but this hasn't fixed anything for me.
Am I mistaken in thinking that the decode function with the ignore argument should allow me to print(soup) without error even if it isn't successfully decoded completely?
Just re-read your question and I believe you are trying to print Unicode text to a console that doesn't support that character set (I couldn't reproduce the error with the code you posted).
You may need to force your console output to UTF-8 or, if you are running the script from an editor such as Sublime Text, configure it to render output as UTF-8.
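One way to force UTF-8 output from inside the script is to wrap the underlying binary stream in a text wrapper with an explicit encoding. The sketch below uses an in-memory buffer as a stand-in for sys.stdout.buffer so that it runs anywhere; the helper name make_utf8_stream is made up for illustration.

```python
import io

def make_utf8_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", errors="strict")

raw = io.BytesIO()            # stands in for sys.stdout.buffer in this sketch
out = make_utf8_stream(raw)
out.write("caf\xe9\xa0!")     # includes the \xa0 character from the traceback
out.flush()
encoded = raw.getvalue()
```

In a real script on Python 3.7 or later, sys.stdout.reconfigure(encoding="utf-8") has a similar effect, as does setting the PYTHONIOENCODING=utf-8 environment variable before starting Python. Whether the characters then display correctly still depends on the terminal.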

unicode error in .prettify() python 3

I'm trying to scrape websites using requests and BeautifulSoup4 packages.
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los+Angeles%2C+CA')
>>> r.content  # shows the page source (a mess) as bytes
>>> soup = BeautifulSoup(r.content, 'html.parser')
When I try to prettify and display the html code of the page with
print(soup.prettify())
I get the error
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 44379: character maps to <undefined>
I also tried
>>> soupbytes = soup.prettify(encoding='utf-8')  # this is bytes
>>> soupstr = soupbytes.decode('utf-8')  # this is str
With the first one I don't get any problem to print (print(soupbytes)), but it doesn't print the text 'pretty', and it is bytes format. If I try to print the second one (print(soupstr)) I get the error again but I get the object in str type.
I should also say that I don't get any error in the IDE (Spyder). That is, if I run the following code in Spyder:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los+Angeles%2C+CA')
r.content  # shows the HTML of the page
soup = BeautifulSoup(r.content,'html.parser')
print(soup.prettify())
I don't have any error and it is printed nicely.
Why is there a difference, and how can I avoid the error in the terminal?
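A plausible explanation is that the IDE and the terminal use different output encodings, which can be checked with sys.stdout.encoding in each. The sketch below tests which codecs can represent the en dash from the error message. The codec list is an assumption: cp437 is only an example of a legacy console code page, since the question does not say which one the terminal uses.

```python
def can_encode(text, codec):
    """Return True if every character in text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

en_dash = "\u2013"  # the character named in the error message
for codec in ("utf-8", "cp1252", "cp437", "ascii"):
    # Prints only ASCII, so this loop is safe on any console.
    print(codec, can_encode(en_dash, codec))
```

If the terminal reports an encoding that cannot represent the character, print() fails there while the same code works in an environment that uses UTF-8.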

Encountering Error while using BeautifulSoup

I am trying to extract the words (verbs) starting with R from this page, but executing the following code raises an error:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.usingenglish.com/reference/phrasal-verbs/r.html"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
print soup.prettify()
The Error thrown was something like this:
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 57801: character maps to <undefined>
Can someone please tell me what the error is and how to fix and proceed?
It would be much easier if you showed us the whole stack trace or, at least, at which line it points.
Anyway, I bet, the problem is with the last line. Change it to:
print(soup.prettify().encode('utf-8'))
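That line suits the Python 2 code in the question. On Python 3, printing a bytes object shows its b'...' representation instead of the text, so a rough equivalent is to write the encoded bytes to a binary stream. In this sketch an in-memory buffer stands in for sys.stdout.buffer, and the helper name write_utf8 and the sample text are made up for illustration.

```python
import io

def write_utf8(text, binary_stream):
    """Encode text as UTF-8 and write the raw bytes, bypassing the console codec."""
    data = text.encode("utf-8")
    binary_stream.write(data)
    return len(data)

buf = io.BytesIO()  # stand-in for sys.stdout.buffer
written = write_utf8("\xa9 2014", buf)  # \xa9 is the character from the error
```

In a script, calling write_utf8(soup.prettify(), sys.stdout.buffer) would send UTF-8 bytes straight to the terminal, which then has to be able to display UTF-8 itself.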
