Python Selenium - UnicodeEncodeError: 'charmap' codec can't encode - python

I keep getting this error when I try to print or use driver.page_source:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1494-1498: character maps to <undefined>
This happens not only with Selenium but also with requests and urllib3. I tried several solutions, but none of them worked, such as str(), .encode("utf-8-sig") or .encode("utf-8"), and BeautifulSoup(source, from_encoding="utf-8").
my code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://www.example.com/")
source = driver.page_source
driver.close()
with open("test.html", "wb") as W:
    W.write(source.encode("utf-8"))  # "wb" mode needs bytes, not str
soup = BeautifulSoup(source, "html.parser")
print(soup.find_all("img"))
Any idea how to make it work?

Finally, I solved it!
The problem here is caused by the Arabic text on the page.
Simply drop the characters that can't be encoded by using:
.encode("utf-8", "ignore")
but first you have to convert the soup to a string, so:
str(soup.find("span", "captcha-container")).encode("utf-8", "ignore")
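As a sketch of what that "ignore" handler actually does (the Arabic string below is a stand-in for the scraped content, not text from the original page):

```python
# Sketch: what encode(..., "ignore") does to text a charmap console
# cannot represent. "مثال" stands in for the Arabic text on the page.
text = "captcha: مثال <ok>"

# Dropping unencodable characters, as in the fix above:
ascii_safe = text.encode("ascii", "ignore").decode("ascii")
print(ascii_safe)  # the Arabic letters are silently removed

# Writing the file as UTF-8 bytes instead keeps every character:
data = text.encode("utf-8")
print(data.decode("utf-8") == text)
```

Note that "ignore" loses data permanently; encoding the file as UTF-8 avoids the error while keeping the Arabic text intact.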

Related

web scraping accented and other special characters in beautiful soup

I made a script to parse SpanishDict for verb conjugations, but I always get the error
UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 3: ordinal not in range(128)
when I try to print the characters with accents. Looking at other similar questions, I realized that I need to encode and decode the accented characters, but I don't know how to do this in my code.
Here is my code:
from requests import get
from bs4 import BeautifulSoup
url = "http://www.spanishdict.com/translate/hacer"
response = get(url)
#print(response.text[:500])
soup = BeautifulSoup(response.text,'html.parser')
container = soup.find_all("a", class_ = "vtable-word-text")
for n in range(30):
    try:
        print(container[n].text)
    except Exception:
        print("accented character")
I have also added a try/except to tell me which entries contain accented characters. Could anyone help me with my problem? Thanks in advance.
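For reference, a minimal sketch of the mechanism behind that error (the word below is a hypothetical conjugation containing '\xed', the character from the traceback):

```python
# Sketch: the 'ascii' codec error comes from the output stream's
# encoding, not from BeautifulSoup. "hacía" contains '\xed' (í).
word = "hac\xeda"

# errors="backslashreplace" keeps the data visible without crashing
# on an ASCII-only console:
safe = word.encode("ascii", "backslashreplace").decode("ascii")
print(safe)

# On Python 3.7+ the stream itself can be switched instead:
# import sys
# sys.stdout.reconfigure(encoding="utf-8")
```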

Encoding error when printing XPATH result

I am trying to fetch recurring information (product names) from an e-commerce website. To do this, I use XPath. I have followed this tutorial to do so.
from lxml import html
import requests
page = requests.get("https://search.rakuten.co.jp/search/mall/-/565210/tg1000768/")
tree = html.fromstring(page.content)
urls = tree.xpath('//div[@class="image"]/a/img/@src')
titles = tree.xpath('//div[@class="content title"]/h2/a/text()')
print(len(titles))
print(titles)
The print(len(titles)) displays a correct number. However, the print(titles) raises an error
print(titles)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-15: ordinal not in range(128)
What am I supposed to do?
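One likely explanation, sketched with stand-in data (the Japanese string below is hypothetical, not scraped from Rakuten): print(titles) pushes every title through the console's encoding, and an ASCII console cannot represent the Japanese product names.

```python
# Sketch: printing the list forces each title through the console's
# encoding; on an ASCII console the Japanese titles raise the error.
# "楽天ブック" is a stand-in for a scraped product name.
titles = ["楽天ブック", "plain title"]

# Replacing unencodable characters keeps the script running anywhere:
safe = [t.encode("ascii", "replace").decode("ascii") for t in titles]
print(safe)
```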

Ascii codec can't encode character error, content.decode('utf-8','ignore') won't work. Any other options?

Example code:
import requests
import bs4 as bs
import urllib
import urllib.request
from urllib.request import Request
req = Request("https://www.openbugbounty.org/researchers/Spam404/vip/page/1/",
              headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce,'lxml')
print(soup)
My output gives me the following error:
File "/Users/student/Desktop/AutoBots/kbfza2.py", line 15, in
print(soup)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5194: ordinal not in range(128)
After searching for a solution online for a while, it seemed that changing my soup line to:
soup = bs.BeautifulSoup(sauce.decode('utf-8','ignore'),'lxml')
would solve this for me but this hasn't fixed anything for me.
Am I mistaken in thinking that the decode function with the ignore argument should allow me to print(soup) without error even if it isn't successfully decoded completely?
Just re-reading your question, I believe you are trying to print Unicode text to a console that doesn't support that character set (I couldn't reproduce the error with the code you posted).
You may need to force your console output to UTF-8 encoding, or, if you are using an editor like Sublime Text, change it to render UTF-8 encoding.
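A sketch of that suggestion, shown on an explicit byte stream so the effect is visible (on Python 3.7+ the real console can be switched with sys.stdout.reconfigure(encoding="utf-8")):

```python
import io

# A UTF-8 text stream accepts '\xa0' (the non-breaking space from the
# traceback) without raising UnicodeEncodeError:
buf = io.BytesIO()
stream = io.TextIOWrapper(buf, encoding="utf-8")
print("\xa0", file=stream)
stream.flush()
print(buf.getvalue())  # the UTF-8 bytes for the non-breaking space
```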

How to encode '–' in Python using Beautiful Soup

I am trying to scrape the paragraphs from a wikipedia page.
I am getting this error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013'
in position 530: character maps to <undefined>
For example, I used this Wikipedia page and wrote the following script in Python with BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests
r = requests.get(url)  # url is the Wikipedia page linked above
soup = BeautifulSoup(r.content, "html.parser")
for i in soup.find_all("p"):
    print i.text
    print "\n"
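For context, a sketch of why u'\u2013' (an en dash in the article text) trips a charmap codec, using cp437 (a common Windows console code page) as an example:

```python
# Sketch: cp437 has no en dash, so its charmap lookup fails with
# "character maps to <undefined>", while UTF-8 can encode it fine.
dash = "\u2013"
try:
    dash.encode("cp437")
except UnicodeEncodeError as e:
    print("cannot encode:", e.reason)

print(dash.encode("utf-8"))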
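For context, a sketch of why u'\u2013' (an en dash in the article text) trips a charmap codec, using cp437 (a common Windows console code page) as an example:

```python
# Sketch: cp437 has no en dash, so its charmap lookup fails with
# "character maps to <undefined>", while UTF-8 can encode it fine.
dash = "\u2013"
try:
    dash.encode("cp437")
except UnicodeEncodeError as e:
    print("cannot encode:", e.reason)

print(dash.encode("utf-8"))
```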

error: UnicodeEncodeError: 'gbk' codec can't encode character

I'm a Python beginner. I wrote the following code:
from bs4 import BeautifulSoup
import requests
url = "http://www.google.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find_all("a")
for link in links:
    print(link.text)
When I run this .py file in Windows PowerShell, the print(link.text) line causes the following error:
UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 5: illegal multibyte sequence
I know the error is caused by some Chinese characters, and it seems like I should use 'decode' or 'ignore', but I don't know how to fix my code. Help please! Thanks!
If you don't wish to display those special characters, you can ignore them with:
print(link.text.encode(errors="ignore"))
Or you can encode the string in UTF-8:
for link in links:
    print(link.text.encode('utf8'))
But a better approach is:
response = requests.get(url)
soup = BeautifulSoup(response.text.encode("utf8"), "html.parser")
To understand more about the problem you are facing, you should look at this stackoverflow answer.
