error: UnicodeEncodeError: 'gbk' codec can't encode character - python

I'm a Python beginner. I wrote the following code:
from bs4 import BeautifulSoup
import requests
url = "http://www.google.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find_all("a")
for link in links:
    print(link.text)
When I run this .py file in Windows PowerShell, the print(link.text) call raises the following error:
UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 5:
illegal multibyte sequence
I know the error is caused by some Chinese characters, and it seems like I should use 'decode' or 'ignore', but I don't know how to fix my code. Help please! Thanks!

If you don't wish to display those special characters, you can ignore them when encoding:
print(link.text.encode(errors="ignore"))

You can encode the string as UTF-8:
for link in links:
    print(link.text.encode('utf8'))
But a better approach is:
response = requests.get(url)
soup = BeautifulSoup(response.text.encode("utf8"), "html.parser")
To understand more about the problem you are facing, you should look at this Stack Overflow answer.
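A further option, independent of the scraping code, is to make the console stream itself tolerant of unencodable characters. This is a sketch assuming Python 3.7+, where text streams expose reconfigure():

```python
# Sketch: make print() tolerate characters the console codec (e.g. GBK)
# cannot represent, by replacing them instead of raising.
import sys

sys.stdout.reconfigure(errors="replace")  # Python 3.7+

print("caf\xe9 \xbb")  # unencodable characters come out as "?" instead of crashing
```

This changes only how print() copes with the console codec; the scraped strings themselves are untouched.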

Related

web scraping accented and other special characters in beautiful soup

I made a script to parse SpanishDict for the verb conjugations, but I always get the error
UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 3: ordinal not in range(128)
when I try to print the characters with accents. Looking at other similar questions, I realized that I need to encode and decode the accented characters, but I don't know how to do this in my code.
Here is my code:
from requests import get
from bs4 import BeautifulSoup
url = "http://www.spanishdict.com/translate/hacer"
response = get(url)
#print(response.text[:500])
soup = BeautifulSoup(response.text,'html.parser')
container = soup.find_all("a", class_="vtable-word-text")
for n in range(0, 30):
    try:
        print(container[n].text)
    except Exception as e:
        print("accented character")
I have also added a try and except to tell me which ones are accented characters. Could anyone help me with my problem? Thanks in advance.
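A sketch of the usual fix for this kind of error: rather than catching the exception per line, encode with errors="replace" so characters the console codec cannot represent become "?". The safe_print helper below is illustrative, not part of the question's code:

```python
# Sketch: replace characters the target codec cannot encode instead of
# crashing; safe_print is a hypothetical helper for illustration.
def safe_print(text, encoding="ascii"):
    # Round-trip through the codec, substituting "?" for what it cannot encode.
    print(text.encode(encoding, errors="replace").decode(encoding))

safe_print("hacer")     # hacer
safe_print("hac\xeda")  # hac?a
```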

Encoding error when printing XPATH result

I am trying to fetch recurring information (product names) from an e-commerce website. To do this, I use XPath. I have followed this tutorial to do so.
from lxml import html
import requests
page = requests.get("https://search.rakuten.co.jp/search/mall/-/565210/tg1000768/")
tree = html.fromstring(page.content)
urls = tree.xpath('//div[@class="image"]/a/img/@src')
titles = tree.xpath('//div[@class="content title"]/h2/a/text()')
print(len(titles))
print(titles)
The print(len(titles)) call displays the correct number. However, print(titles) raises an error:
print(titles)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-15: ordinal not in range(128)
What am I supposed to do?
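The list itself is fine, which is why len() works; only printing it to an ASCII console fails. One sketch of a workaround, assuming you mainly need the titles saved rather than shown on screen, is to write them to a UTF-8 file (the file name and the sample titles are illustrative):

```python
# Sketch: write scraped titles to a UTF-8 file instead of printing them
# to an ASCII-only console. The titles list stands in for the
# tree.xpath(...) result.
titles = ["\u30ab\u30e1\u30e9 A", "\u30ab\u30e1\u30e9 B"]

with open("titles.txt", "w", encoding="utf-8") as f:
    for title in titles:
        f.write(title + "\n")
```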

Ascii codec can't encode character error, content.decode('utf-8','ignore') won't work. Any other options?

Example code:
import requests
import bs4 as bs
import urllib
import urllib.request
from urllib.request import Request
req = Request("https://www.openbugbounty.org/researchers/Spam404/vip/page/1/",
              headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce,'lxml')
print(soup)
My output gives me the following error:
File "/Users/student/Desktop/AutoBots/kbfza2.py", line 15, in
print(soup)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5194: ordinal not in range(128)
After searching online for a while, it seemed that changing my soup line to:
soup = bs.BeautifulSoup(sauce.decode('utf-8','ignore'),'lxml')
would solve this, but it hasn't fixed anything for me.
Am I mistaken in thinking that the decode function with the ignore argument should allow me to print(soup) without error even if it isn't successfully decoded completely?
I just re-read your question, and I believe you are trying to print Unicode text to a console that doesn't support that character set (I couldn't reproduce the error with the code you posted).
You may need to force your console output to UTF-8 encoding, or, if you are using an editor like Sublime Text, change it to render UTF-8.
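One way to force UTF-8 output regardless of the console's text encoding, sketched here, is to skip the text layer and write bytes directly (html_text stands in for str(soup)):

```python
# Sketch: write UTF-8 bytes straight to the binary layer of stdout,
# bypassing the console codec that raised on '\xa0'.
import sys

html_text = "a\xa0b"  # stand-in for str(soup); \xa0 is a no-break space
sys.stdout.buffer.write(html_text.encode("utf-8") + b"\n")
sys.stdout.buffer.flush()
```

Whether the terminal then renders those bytes correctly depends on its own configuration, but Python will no longer raise.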

Python: UnicodeDecodeError: 'utf-8' codec can't decode byte...invalid continuation byte

I'm building a web scraper using BeautifulSoup on Python 3.3
However, I've run into a problem that prevents me from getting a valid string to pass to BeautifulSoup. That is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 7047: invalid continuation byte
I know there are dozens of similar questions but I haven't so far found a method that can help me to diagnose what's wrong with the following code:
import urllib.request
URL = "<url>" # sorry, I cannot show the url for privacy reasons, but it's a normal html document
page = urllib.request.urlopen(URL)
page = page.read().decode("utf-8") # from bytes to <source encodings>
As far as I can tell, this error occurs only with some URLs and not with others. Even with this same URL I wasn't getting this error until yesterday; then today I ran the program again and the error popped up.
Any clue on how to diagnose the error?
You should not decode the response. First of all, you are incorrectly assuming the response is UTF-8 encoded (it is not, as the error shows), but more importantly, BeautifulSoup will detect the encoding for you. See the Encodings section of the BeautifulSoup documentation.
Pass a byte string to BeautifulSoup and it'll use any <meta> header proclaiming the correct encoding, or do a great job of autodetecting the encoding for you.
In the event that auto-detection fails, you can always fall back to the server-provided encoding:
encoding = page.info().get_charset()
page = page.read()
soup = BeautifulSoup(page)
if encoding is not None and soup.original_encoding != encoding:
    print('Server and BeautifulSoup disagree')
    print('Content-Type states it is {}, BS4 thinks it is {}'.format(encoding, soup.original_encoding))
    print('Forcing encoding to server-supplied codec')
    soup = BeautifulSoup(page, from_encoding=encoding)
This still leaves the actual decoding to BeautifulSoup, but if the server included a charset parameter in the Content-Type header then the above assumes that the server is correctly configured and forces BeautifulSoup to use that encoding.
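The same fallback logic in miniature, with names chosen purely for illustration (decode_body is a hypothetical helper, not a BeautifulSoup API):

```python
# Sketch: decode with the server's declared charset when there is one,
# and fall back defensively otherwise. decode_body is illustrative.
def decode_body(raw, declared_charset):
    if declared_charset:
        return raw.decode(declared_charset)
    # No declared charset: try UTF-8, replacing invalid bytes rather
    # than raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")

body = "caf\xe9".encode("latin-1")   # b'caf\xe9' -- not valid UTF-8
print(decode_body(body, "latin-1"))  # café
print(decode_body(body, None))       # caf followed by the U+FFFD replacement character
```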

Encountering Error while using BeautifulSoup

I am trying to extract the words (verbs) starting with R from this page. But on executing the following code:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.usingenglish.com/reference/phrasal-verbs/r.html"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
print soup.prettify()
The Error thrown was something like this:
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 57801: character maps to <undefined>
Can someone please tell me what the error is and how to fix and proceed?
It would be much easier if you showed us the whole stack trace or, at least, which line it points at.
Anyway, I bet the problem is with the last line. Change it to:
print(soup.prettify().encode('utf-8'))
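On Python 3 (where print is a function and strings are Unicode), a comparable sketch uses errors="backslashreplace" so the offending character survives as an escape sequence instead of crashing the 'charmap' codec. ASCII stands in for the Windows console codec here, and the sample string for soup.prettify():

```python
# Sketch: Python 3 variant of the fix above; unencodable characters
# are kept as backslash escapes rather than dropped.
pretty = "copyright \xa9 symbol"  # stand-in for soup.prettify()
print(pretty.encode("ascii", errors="backslashreplace").decode("ascii"))
# copyright \xa9 symbol
```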
