I'm trying to scrape websites using requests and BeautifulSoup4 packages.
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los+Angeles%2C+CA')
>>> r.content  # shows the page source as bytes (hard to read)
>>> soup = BeautifulSoup(r.content, 'html.parser')
When I try to prettify and display the html code of the page with
print(soup.prettify())
I get the error
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013'
in position 44379: character maps to <undefined>
I also tried
>>> soupbytes = soup.prettify(encoding='utf-8')  # bytes
>>> soupstr = soupbytes.decode('utf-8')          # str
Printing the first one (print(soupbytes)) raises no error, but the output is a bytes literal rather than nicely formatted text. Printing the second one (print(soupstr)) gives me a str object but raises the error again.
I should also say that I don't get any error in the IDE (Spyder). That is, if I run the following code in Spyder:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los+Angeles%2C+CA')
r.content  # shows the page's html
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())
I don't have any error and it is printed nicely.
Why the difference? And how can I avoid the error in the terminal?
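One workaround that is often suggested for the terminal case (a minimal sketch, assuming Python 3.7+, where sys.stdout supports reconfigure()) is to force standard output to UTF-8 before printing:
import sys
import requests
from bs4 import BeautifulSoup

# Switch the terminal's stdout stream to UTF-8 so characters such as
# '\u2013' (the en dash from the traceback) no longer hit the 'charmap' codec.
sys.stdout.reconfigure(encoding='utf-8')

r = requests.get('https://www.yellowpages.com/search'
                 '?search_terms=coffee&geo_location_terms=Los+Angeles%2C+CA')
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())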
Related
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('https://stats.swehockey.se/Teams/Info/TeamRoster/10333').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
This code producing the following error
UnicodeEncodeError: 'ascii' codec can't encode character '\xbb' in position 1509: ordinal not in range(128)
I have tried several workarounds, but they all have some drawback. After searching on Stack Overflow, I found the suggestion of redirecting sys.stdout, like this:
import bs4 as bs
import urllib.request
import sys
import codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stderr = codecs.getwriter('utf-8')(sys.stderr)
sauce = urllib.request.urlopen('https://stats.swehockey.se/Teams/Info/TeamRoster/10333').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
I no longer get the error; however, the output no longer appears in the terminal, and I'm not sure why. Using the .prettify('utf-8') method also gets rid of the error and produces output; however, the result is a bytes object, not a BeautifulSoup object, so it has none of the associated bs4 methods (e.g. .find_all()). A similar problem arises with the .encode('utf-8') approach.
Also, I've noticed that in the output, there are many \r and \n characters still in the beautiful soup object instead of the pure html content.
I want a beautiful soup object without any of the \r or \n characters that I can print to the terminal.
In your code:
sauce = urllib.request.urlopen('https://stats.swehockey.se/Teams/Info/TeamRoster/10333').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
sauce is of type bytes. When you pass that into bs.BeautifulSoup(), those bytes end up being treated as ascii, which fails, because the content is actually utf-8 encoded -- according to both the Content-Type response header (text/html; charset=utf-8) and the meta tag at the start of the html document (<meta charset="utf-8" />).
The first argument for bs.BeautifulSoup(), markup, takes a string or a file-like object representing markup to be parsed. You should explicitly decode those bytes as a utf-8 encoded string, and use that instead of the raw bytes, like so:
sauce = urllib.request.urlopen('https://stats.swehockey.se/Teams/Info/TeamRoster/10333').read().decode('utf-8')
soup = bs.BeautifulSoup(sauce, 'lxml')
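If you would rather not hard-code utf-8, a hedged variant is to read the charset from the Content-Type header instead; get_content_charset() is part of the standard library's message API that urllib responses expose:
import urllib.request
import bs4 as bs

resp = urllib.request.urlopen('https://stats.swehockey.se/Teams/Info/TeamRoster/10333')
# get_content_charset() extracts the charset from the Content-Type header,
# e.g. 'text/html; charset=utf-8'; fall back to utf-8 if none is declared.
charset = resp.headers.get_content_charset() or 'utf-8'
sauce = resp.read().decode(charset)
soup = bs.BeautifulSoup(sauce, 'lxml')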
Also, I've noticed that in the output, there are many \r and \n characters still in the beautiful soup object instead of the pure html content.
I want a beautiful soup object without any of the \r or \n characters that I can print to the terminal.
The \r and \n characters are just representations of the newline character. If you were to print these, or view them in a text editor, they would show up as actual newlines.
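A tiny self-contained demonstration of that point:
# '\r' (carriage return) and '\n' (line feed) are escape sequences;
# repr() shows them literally, while print() renders them as line breaks.
s = 'first line\r\nsecond line'
print(repr(s))  # 'first line\r\nsecond line'
print(s)        # prints two separate lines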
Example code:
import urllib.request
from urllib.request import Request

import bs4 as bs

req = Request("https://www.openbugbounty.org/researchers/Spam404/vip/page/1/",
              headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
My output gives me the following error:
File "/Users/student/Desktop/AutoBots/kbfza2.py", line 15, in
print(soup)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5194: ordinal not in range(128)
After searching for a solution online for a while, it seemed that changing my soup line to:
soup = bs.BeautifulSoup(sauce.decode('utf-8', 'ignore'), 'lxml')
would solve this for me but this hasn't fixed anything for me.
Am I mistaken in thinking that the decode function with the ignore argument should allow me to print(soup) without error, even if the decoding isn't completely successful?
I just re-read your question, and I believe you are trying to print Unicode text to a console that doesn't support that character set (I couldn't reproduce the error with the code you posted).
You may need to force your console output to utf-8 encoding, or, if you are using an editor like Sublime Text, change it to render utf-8 encoding.
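As a sketch of the first option (assuming a normal Python 3 console, where sys.stdout.buffer is available): encode the markup yourself and write the raw bytes, so the console's text codec is never consulted. The \xa0 below is the same non-breaking space as in the traceback:
import sys
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>price:\xa0100\xa0kr</p>', 'html.parser')
# Encoding explicitly and writing bytes bypasses the console's
# (possibly ASCII-only) output encoding entirely.
sys.stdout.buffer.write(soup.prettify().encode('utf-8'))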
I am using bs4 to extract text from a web document, but its output is very strange, like:
Ú©Ø¨Ú¾Û ÛÛ Ø¨Ø§Øª سÙجھ ÙÛÚº ÙÛ Ø§ÙØªÛ ØªÚ¾ÛÛ Ù¾Ú¾Ø± اÙÛØ³ØªÛ Ø§ÙÛØ³ØªÛ Ø¬Ø¨ Ú©ÚÚ¾ عÙ٠اÙÙÛ Ø´Ø±Ùع ÛÙØ¦Û ØªÙ Ø¨Ø§Øª Ú©ÚÚ¾ Ù¾ÙÛ Ù¾ÚÛÛÙÛک٠اب ÛÛ Ø¨Ø§Øª اÛØ³Û Ø³Ùجھ ÙÛÚº اÙØ¦Û Ú©Û Ø³ÙÚا اÙÙ¾ Ú©ÛÙÙÚ¯ÙÚº Ú©Û Ø¨Ú¾Û Ø´Ûئر کرÙÚºÛ ÚÙد رÙز Ùب٠ÙÛرا 8 Ù
I think this is some encoding issue. I am a new user of bs4; please guide me on how to decode this so it displays as Urdu text.
Here is a document source whose title I want to extract. The following code is what I am using:
from bs4 import BeautifulSoup
import requests

url = "http://blog.jang.com.pk/blog_details.asp?id=11058"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
print str(soup.title)
Burhan Khalid's answer works, but because the original web page is encoded in utf-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
You should update the requests response's encoding attribute to match the original page's encoding:
from bs4 import BeautifulSoup
import requests

url = "http://blog.jang.com.pk/blog_details.asp?id=11058"
r = requests.get(url)
# Update encoding to match the source page
r.encoding = "utf-8"
data = r.text
soup = BeautifulSoup(data, 'lxml')
print str(soup.title)
Now any field you access will have the correct encoding, rather than you having to handle the Urdu text on a per-field basis.
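As a side note, if you don't want to hard-code the charset, requests can also guess it from the response body; a hedged sketch using apparent_encoding, which is backed by requests' bundled charset detection:
from bs4 import BeautifulSoup
import requests

url = "http://blog.jang.com.pk/blog_details.asp?id=11058"
r = requests.get(url)
# apparent_encoding is detected from the body itself, which helps when the
# Content-Type header is missing or declares the wrong charset.
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, 'lxml')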
If you simply try to print the string, you'll get garbage characters out:
>>> import requests
>>> from bs4 import BeautifulSoup as bs4
>>> r = requests.get('http://blog.jang.com.pk/blog_details.asp?id=11058')
>>> s = bs4(r.text, 'lxml')
>>> print s.title.text
Ú©ÚÚ¾ تÙØ¬Û Ø§Ø³ طر٠بھÛ!
You need to re-encode the text, because requests guessed the wrong codec when decoding the page: the mojibake above is UTF-8 bytes that were decoded as ISO-8859-1, so encoding it back to iso-8859-1 recovers the original UTF-8 bytes.
>>> print s.title.text.encode('iso-8859-1')
کچھ توجہ اس طرف بھی!
If it displays the glyphs correctly but in the wrong order (i.e. they are not right-to-left), then that is a problem with the operating system/terminal/shell/program you are using to run the application. The output above is from gnome-terminal, which doesn't support Arabic RTL properly; running the same code in mlterm renders the text in the correct order. A white box may still appear in place of some glyphs, because I am using an Arabic font, which doesn't have all the characters in the Urdu language.
I think what is happening is that there is some badly formed Unicode in the website's response:
----> 1 r.content.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 1106: invalid continuation byte
Hence the text is being decoded with the ANSI codec, which is obviously wrong. You can work around this issue by calling decode with the option errors='ignore' (we use content rather than text because content is the raw binary response from the website):
data = r.content.decode(errors='ignore')
soup = BeautifulSoup(data, 'lxml')
print(str(soup.title))
<title>کچھ توجہ اس طرف بھی!</title>
I am trying to extract the words (verbs) starting with R from this page. But on executing the following code:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.usingenglish.com/reference/phrasal-verbs/r.html"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
print soup.prettify()
The Error thrown was something like this:
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 57801: character maps to <undefined>
Can someone please tell me what the error is and how to fix and proceed?
It would be much easier if you showed us the whole stack trace, or at least which line it points to.
Anyway, I'd bet the problem is with the last line. Change it to:
print(soup.prettify().encode('utf-8'))
I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. This character is not visible when I use "View Source."
import urllib2
from BeautifulSoup import BeautifulSoup

isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()
html = BeautifulSoup(page)
html  # This line causes the error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)
I also tried...
html = BeautifulSoup(page.encode('utf-8'))
How can I read this web page into BeautifulSoup without getting this error?
This error is probably actually happening when you try to print the representation of the BeautifulSoup object, which happens automatically if, as I suspect, you are working in the interactive console.
# This code will work fine, note we are assigning the result
# of the BeautifulSoup object to prevent it from printing immediately.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(u'\xa0')
# This will probably show the error you saw
print soup
# And this would probably be fine
print soup.encode('utf-8')
I encountered a similar problem with bs4. You could try the following, reading the page from a local file opened with an explicit utf-8 encoding:
import codecs
f = codecs.open(filename, 'r', 'utf-8')  # filename: path to your saved HTML file
soup = BeautifulSoup(f.read(), "html.parser")