Using the requests and Beautiful Soup modules in Python

My code is very short but it is giving an unexpected error:
import bs4
import requests

url = "https://www." + input("Enter The Name of the website: ") + ".com"
req = requests.get(url)
html_text = req.text
htmls = bs4.BeautifulSoup(html_text, "html.parser").prettify()
with open("facebook.html", "w+") as file:
    file.write(htmls)
Traceback (most recent call last):
File "C:\Users\Tanay Mishra\PycharmProjects\First_Project\Web Scraping\web.py", line 9, in <module>
file.write(htmls)
File "C:\Users\Tanay Mishra\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined>

\u200b means ZERO WIDTH SPACE in Unicode. Try specifying encoding="utf-8" in the open() function. It is also good practice to use the .content property of the Response object and let BeautifulSoup guess the encoding:
import bs4
import requests
url = "https://www."+input("Enter The Name of the website: ")+".com"
req = requests.get(url)
htmls = bs4.BeautifulSoup(req.content, "html.parser").prettify()
with open("facebook.html", "w+", encoding="utf-8") as file:
    file.write(htmls)
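A minimal, stdlib-only sketch of why this happens: U+200B simply has no mapping in the cp1252 code page Windows defaults to, while UTF-8 can represent it fine.

```python
# U+200B (ZERO WIDTH SPACE) has no slot in the cp1252 code page, so
# encoding it raises UnicodeEncodeError -- the same failure open()
# triggers implicitly on Windows when no encoding= is given.
text = "hello\u200bworld"

try:
    text.encode("cp1252")
except UnicodeEncodeError as e:
    print("cp1252 failed:", e.reason)

# UTF-8 can represent every Unicode code point, so this always works.
encoded = text.encode("utf-8")
print("utf-8 byte count:", len(encoded))  # 10 ASCII bytes + 3 for U+200B
```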

Related

Is there any way to decode this webpage properly? I got the error shown in the question.

Attempted to debug with no success.
# Scrape data from website
import urllib.request
from bs4 import BeautifulSoup

site_url = 'https://finance.yahoo.com/quote/DUK?p=DUK'
r = urllib.request.urlopen(site_url)
site_content = r.read().decode('utf-8')

# Saving scraped HTML to .html file (for later processing)
with open('saved_page.html', 'w') as f:
    f.write(site_content)

# Use html.parser to create soup
s = BeautifulSoup(site_content, 'html.parser')
The outcome of this code is listed below:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-42-486ff45635ec> in <module>
3 site_url='https://finance.yahoo.com/quote/DUK?p=DUK'
4 r = urllib.request.urlopen(site_url)
----> 5 site_content = r.read().decode('utf-8')
6
7 # Saving scraped HTML to .html file (for later processing)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
UTF-8 can't decode 0x8b as a valid start byte because it is part of the header b'\x1f\x8b\x08', which indicates this is gzip data. So in order to fetch this HTML page you will have to decompress it first:
import gzip # to decompress
from io import BytesIO # to do the operations
Applying that to your code:
import gzip
from io import BytesIO
import urllib.request
from bs4 import BeautifulSoup
site_url='https://finance.yahoo.com/quote/DUK?p=DUK'
r = urllib.request.urlopen(site_url)
site_content = r.read() # took off decode('utf-8')
buffer = BytesIO(site_content)
file = gzip.GzipFile(fileobj=buffer)
site_content = file.read().decode('utf-8')
s = BeautifulSoup(site_content, 'html.parser')
But I highly recommend that you have a look at the requests library; it's not built-in like urllib, but it's much simpler (and it decompresses gzip responses for you).
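As a side note, `gzip.decompress()` (Python 3.2+) avoids the `BytesIO` + `GzipFile` dance entirely. A sketch, with a locally compressed payload standing in for the HTTP response body:

```python
import gzip

# Stand-in for the raw, gzip-compressed response body from the server.
raw = gzip.compress(b"<html><body>DUK quote page</body></html>")

# The magic header mentioned above: 1f 8b 08
print(raw[:3].hex())

# One call replaces BytesIO + GzipFile + read().
site_content = gzip.decompress(raw).decode("utf-8")
print(site_content)
```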

Ascii codec can't encode character error, content.decode('utf-8','ignore') won't work. Any other options?

Example code:
import requests
import bs4 as bs
import urllib
import urllib.request
from urllib.request import Request
req = Request("https://www.openbugbounty.org/researchers/Spam404/vip/page/1/",
              headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce,'lxml')
print(soup)
My output gives me the following error:
File "/Users/student/Desktop/AutoBots/kbfza2.py", line 15, in
print(soup)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5194: ordinal not in range(128)
After searching for a solution online for a while, it seemed that changing my soup line to:
soup = bs.BeautifulSoup(sauce.decode('utf-8','ignore'),'lxml')
would solve this for me but this hasn't fixed anything for me.
Am I mistaken in thinking that the decode function with the ignore argument should allow me to print(soup) without error even if it isn't successfully decoded completely?
Just re-read your question and I believe you are trying to print Unicode text to a console that doesn't support that character set (I couldn't reproduce the error with the code you posted).
You may need to force your console output to utf-8 encoding, or if you are using an IDE like sublime text, change it to render utf-8 encoding.
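On Python 3.7+ the stream encoding can be forced from inside the script itself (a sketch; older versions need the PYTHONIOENCODING environment variable instead):

```python
import sys

# Python 3.7+: re-open stdout as UTF-8 no matter what the console reports.
# Guarded, because replaced/captured stdout objects may lack reconfigure().
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

print("\xa0 and \u2192 now print instead of raising UnicodeEncodeError")
```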

Beautiful Soup 4 not printing text from a webpage

I'm using python 3.4 with Beautiful Soup 4 and requests.
I am trying to grab the webpage and print the text from it using Beautiful Soup. It can grab the webpage and print the title; it can even prettify if I provide it with the encoding, which is utf-8. But when I try to print the text from the page, it fails with an encoding error.
from bs4 import BeautifulSoup
import requests
sparknotesSearch = requests.get("http://www.sparknotes.com/search?q=Sonnet")
soup = BeautifulSoup(sparknotesSearch.text)
print (soup.title)
#Can't print this?
print(soup.get_text())
The error/output I get is this:
<title>SparkNotes Search Results: sONNET</title>
Traceback (most recent call last):
File "C:\Users\Cayle J. Elsey\Dropbox\Programming\Salient_Point\networking.py", line 10, in <module>
print(soup.get_text())
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 6238: character maps to <undefined>
[Finished in 0.5s]
Just encode your string to UTF-8, and your problem will be solved:
html= soup.prettify()
html=html.encode('UTF-8')
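The same idea without a live page (a stdlib-only sketch; U+2192 is the RIGHTWARDS ARROW from the traceback):

```python
text = "Sonnet \u2192 results"   # U+2192 is the arrow from the traceback

# Encoding to UTF-8 produces bytes that any file or pipe will accept.
data = text.encode("UTF-8")
print(data)

# For consoles stuck on ascii/cp1252, 'backslashreplace' keeps printing safe.
safe = text.encode("ascii", errors="backslashreplace").decode("ascii")
print(safe)
```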

UnicodeDecodeError Python Error

I'm trying to code a python google api. Getting some unicode issues. My really basic PoC so far is:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
query = "filetype%3Apdf"
url = "http://www.google.com/search?sclient=psy-ab&hl=en&site=&source=hp&q="+query+"&btnG=Search"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
data = response.read()
data = data.decode('UTF-8', 'ignore')
data = data.encode('UTF-8', 'ignore')
soup = BeautifulSoup(data)
print u""+soup.prettify('UTF-8')
My traceback is:
Traceback (most recent call last):
File "./google.py", line 22, in <module>
print u""+soup.prettify('UTF-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 48786: ordinal not in range(128)
Any ideas?
You are converting your soup tree to UTF-8 (an encoded byte string), then try to concatenate this to an empty u'' unicode string.
Python will automatically try and decode your encoded byte string, using the default encoding, which is ASCII, and fails to decode the UTF-8 data.
You need to explicitly decode the prettify() output:
print u"" + soup.prettify('UTF-8').decode('UTF-8')
The Python Unicode HOWTO explains this better, including about default encodings. I really, really recommend you read Joel Spolsky's article on Unicode as well.
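The Python 2 pitfall described above, mixing an encoded byte string with a unicode literal, became a hard TypeError in Python 3, which makes the rule easy to demonstrate (a Python 3 sketch):

```python
encoded = "café".encode("UTF-8")   # bytes, like prettify('UTF-8') returns

# In Python 3 the implicit ASCII decode is gone: str + bytes is a TypeError.
try:
    result = "" + encoded
except TypeError as e:
    print("TypeError:", e)

# Decode explicitly first -- the fix given above -- then concatenation works.
print("" + encoded.decode("UTF-8"))
```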

urlopen, BeautifulSoup and UTF-8 Issue

I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. This character is not visible when I use "View Source."
import urllib2
from BeautifulSoup import BeautifulSoup

isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()
html = BeautifulSoup(page)
html  # This line causes the error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)
I also tried...
html = BeautifulSoup(page.encode('utf-8'))
How can I read this web page into BeautifulSoup without getting this error?
This error is probably actually happening when you try to print the representation of the BeautifulSoup object, which will happen automatically if, as I suspect, you are working in the interactive console.
# This code will work fine, note we are assigning the result
# of the BeautifulSoup object to prevent it from printing immediately.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(u'\xa0')
# This will probably show the error you saw
print soup
# And this would probably be fine
print soup.encode('utf-8')
You could try the following:
import codecs
f = codecs.open(filename,'r','utf-8')
soup = BeautifulSoup(f.read(),"html.parser")
I encountered a similar problem with bs4.
