I'm using python 3.4 with Beautiful Soup 4 and requests.
I am trying to grab a webpage and print its text using Beautiful Soup. It can fetch the page and print the title, and it can even prettify the output if I provide the encoding, which is utf-8, but when I try to print the text from the page it fails with an encoding error.
from bs4 import BeautifulSoup
import requests
sparknotesSearch = requests.get("http://www.sparknotes.com/search?q=Sonnet")
soup = BeautifulSoup(sparknotesSearch.text)
print (soup.title)
#Can't print this?
print(soup.get_text())
The error/output I get is this:
<title>SparkNotes Search Results: sONNET</title>
Traceback (most recent call last):
File "C:\Users\Cayle J. Elsey\Dropbox\Programming\Salient_Point\networking.py", line 10, in <module>
print(soup.get_text())
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 6238: character maps to <undefined>
[Finished in 0.5s]
Just encode your string to UTF-8 and your problem will be solved:
html = soup.prettify()
html = html.encode('UTF-8')
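Alternatively, the traceback shows the failure happens while printing to a cp1252 console, so you can keep the text as-is and only make the output safe. A minimal sketch (the output file name is just an example):
# Replace characters the cp1252 console cannot represent instead of crashing
print(soup.get_text().encode("cp1252", errors="replace").decode("cp1252"))
# Or skip the console entirely and write the text to a UTF-8 file
with open("sparknotes.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())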
My code is very short, but it is giving an unexpected error.
import bs4
import requests as re  # 're' here is the requests library, not the standard-library regex module

url = "https://www."+input("Enter The Name of the website: ")+".com"
req = re.get(url)
html_text = req.text
htmls = bs4.BeautifulSoup(html_text, "html.parser").prettify()
with open("facebook.html", "w+") as file:
    file.write(htmls)
Traceback (most recent call last):
File "C:\Users\Tanay Mishra\PycharmProjects\First_Project\Web Scraping\web.py", line 9, in <module>
file.write(htmls)
File "C:\Users\Tanay Mishra\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined>
\u200b is the ZERO WIDTH SPACE character in Unicode. Try specifying encoding="utf-8" in the open() function. It is also good practice to pass the .content property of the Response object and let BeautifulSoup guess the encoding:
import bs4
import requests
url = "https://www."+input("Enter The Name of the website: ")+".com"
req = requests.get(url)
htmls = bs4.BeautifulSoup(req.content, "html.parser").prettify()
with open("facebook.html", "w+", encoding="utf-8") as file:
    file.write(htmls)
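If BeautifulSoup's guess is ever wrong for a particular site, bs4 also accepts a from_encoding argument to force the decoding yourself; a small sketch assuming the page really is UTF-8:
htmls = bs4.BeautifulSoup(req.content, "html.parser", from_encoding="utf-8").prettify()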
Example code:
import requests
import bs4 as bs
import urllib
import urllib.request
from urllib.request import Request
req = Request("https://www.openbugbounty.org/researchers/Spam404/vip/page/1/",
              headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce,'lxml')
print(soup)
My output gives me the following error:
File "/Users/student/Desktop/AutoBots/kbfza2.py", line 15, in
print(soup)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5194: ordinal not in range(128)
After searching for a solution online for a while, it seemed that changing my soup line to:
soup = bs.BeautifulSoup(sauce.decode('utf-8','ignore'),'lxml')
would solve this for me, but it hasn't fixed anything.
Am I mistaken in thinking that the decode function with the ignore argument should allow me to print(soup) without error even if it isn't successfully decoded completely?
I just re-read your question, and I believe you are trying to print Unicode text to a console that doesn't support that character set (I couldn't reproduce the error with the code you posted).
You may need to force your console output to UTF-8 encoding, or, if you are using an editor like Sublime Text, change it to render UTF-8.
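A minimal sketch of forcing UTF-8 output from the script itself, assuming Python 3.7 or newer:
import sys
# Re-encode stdout as UTF-8 so print() no longer depends on the console's default codec
sys.stdout.reconfigure(encoding="utf-8", errors="replace")
print(soup)
On older Python 3 versions, setting the PYTHONIOENCODING=utf-8 environment variable before running the script has a similar effect.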
I am trying to scrape the paragraphs from a Wikipedia page.
I am getting this error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013'
in position 530: character maps to <undefined>
For example, I used this Wikipedia page and wrote the following script in Python with BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests

# The request itself was omitted from the question; url stands in for the Wikipedia page mentioned above
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
for i in soup.find_all("p"):
    print i.text
    print "\n"
I'm a Python beginner. I wrote code as follows:
from bs4 import BeautifulSoup
import requests
url = "http://www.google.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find_all("a")
for link in links:
    print(link.text)
When I run this .py file in Windows PowerShell, the print(link.text) call causes the following error:
error: UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 5:
illegal multibyte sequence.
I know the error is caused by some Chinese characters, and it seems like I should use 'decode' or 'ignore', but I don't know how to fix my code. Help please! Thanks!
If you don't wish to display those special characters, you can ignore them with:
print(link.text.encode(errors="ignore"))
You can encode the string as UTF-8:
for link in links:
    print(link.text.encode('utf8'))
But a better approach is:
response = requests.get(url)
soup = BeautifulSoup(response.text.encode("utf8"), "html.parser")
To understand more about the problem you are facing, you should look at this Stack Overflow answer.
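If the goal is only to print to the GBK console without crashing, another option (a sketch, not part of the original answer) is to drop the characters GBK cannot represent before printing:
for link in links:
    # Encode to GBK, dropping un-encodable characters, then decode back to str for printing
    safe_text = link.text.encode("gbk", errors="ignore").decode("gbk")
    print(safe_text)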
I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. This character is not visible when I use "View Source."
# Python 2 with the old BeautifulSoup 3 package
import urllib2
from BeautifulSoup import BeautifulSoup

isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()
html = BeautifulSoup(page)
html  # This line causes the error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)
I also tried...
html = BeautifulSoup(page.encode('utf-8'))
How can I read this web page into BeautifulSoup without getting this error?
This error is probably actually happening when you try to print the representation of the BeautifulSoup object, which will happen automatically if, as I suspect, you are working in the interactive console.
# This code will work fine; note we assign the result of the
# BeautifulSoup call to a variable to prevent it from printing immediately.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(u'\xa0')
# This will probably show the error you saw
print soup
# And this would probably be fine
print soup.encode('utf-8')
You could try the following:
import codecs
f = codecs.open(filename,'r','utf-8')
soup = BeautifulSoup(f.read(),"html.parser")
I encountered a similar problem with bs4.
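For what it's worth, in Python 3 the built-in open() accepts an encoding argument directly, so the codecs module is not strictly required; a minimal equivalent, where filename is assumed to point at a UTF-8 encoded HTML file:
from bs4 import BeautifulSoup

with open(filename, "r", encoding="utf-8") as f:  # filename: path to a UTF-8 HTML file
    soup = BeautifulSoup(f.read(), "html.parser")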