Web scraping accented and other special characters in Beautiful Soup - Python

I made a script to parse SpanishDict for verb conjugations, but I always get the error
UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 3: ordinal not in range(128)
when I try to print the characters with accents. Looking at other similar questions, I realized that I need to encode and decode the accented characters, but I don't know how to do this with my code.
Here is my code:
from requests import get
from bs4 import BeautifulSoup

url = "http://www.spanishdict.com/translate/hacer"
response = get(url)
#print(response.text[:500])
soup = BeautifulSoup(response.text, 'html.parser')
container = soup.find_all("a", class_="vtable-word-text")
for n in range(0, 30):
    try:
        print(container[n].text)
    except Exception as e:
        print("accented character")
I have also added a try/except to tell me which ones are accented characters. Could anyone help me with my problem? Thanks in advance.
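No answer survives in this dump, but since the error comes from printing to an ASCII-configured stdout rather than from the scraping itself, one common fix (a sketch, assuming Python 3.7+ where `sys.stdout.reconfigure` exists) is to switch the output stream to UTF-8 once, before printing:

```python
import sys

# sys.stdout.reconfigure is available from Python 3.7; it switches
# the stream's encoding so accented characters no longer hit the
# 'ascii' codec at print time.
sys.stdout.reconfigure(encoding="utf-8")

print("hacía")  # contains '\xed' (í), the character from the traceback
```

With stdout reconfigured, the original loop can print the conjugations unchanged.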

Related

Python Selenium - UnicodeEncodeError 'charmap' codec can't encode

I keep getting this error when I try to print or use driver.page_source.
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1494-1498: character maps to <undefined>
Not only with Selenium; it also happens with requests and urllib3. I tried several solutions, but none of them worked,
such as str(), .encode("utf-8-sig") or .encode("utf-8"), and
BeautifulSoup(source, from_encoding="utf-8").
my code:
import base64
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.example.com/")
source = driver.page_source
driver.close()
with open("test.html", "wb") as W:
    W.write(source.encode("utf-8"))  # page_source is str; binary mode needs bytes
soup = BeautifulSoup(source.encode("utf-8"), "html.parser")
print(soup.find_all("img"))
Any idea how to make it work?
Finally I solved it!
The problem here is caused by the Arabic text in the page.
Simply ignore the Arabic text in the content by using:
.encode("utf-8", "ignore")
But first you have to turn the soup into a string, so:
str(soup.find("span", "captcha-container")).encode("utf-8", "ignore")
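For illustration, here is what errors="ignore" actually does. Note that UTF-8 can encode every Unicode character, so "ignore" only drops text when the target codec is narrower; a minimal sketch with ASCII (not tied to the asker's page) makes the effect visible:

```python
# "مرحبا" (Arabic for "hello") cannot be represented in ASCII;
# errors="ignore" silently drops those characters instead of raising.
mixed = "captcha \u0645\u0631\u062d\u0628\u0627 ok"
print(mixed.encode("ascii", "ignore"))  # b'captcha  ok'

# With UTF-8 nothing is dropped, because every character is encodable.
print(mixed.encode("utf-8", "ignore") == mixed.encode("utf-8"))  # True
```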

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 261060: character maps to <undefined>

I'm currently trying to extract the href (emails) from HTML files provided by a client of my company. They sent me 6 months' worth of data, but I'm unable to extract the emails from 2 particular files. I keep getting the same UnicodeDecodeError every time, no matter what I try. According to my analysis, these files are encoded as UTF-8. I'll leave the code down below:
from bs4 import BeautifulSoup as bsoup

url = r"C:\Users\Maximiliano\Documents\enero.html"
soup = bsoup(open(url).read())
data = []
for p in soup.find_all("a"):
    datos = p.get("href")
    if datos[0] != "m":
        pass
    else:
        data.append(datos)
print(data)
I've already tried adding .decode("utf-8") after the read, but it doesn't help.
Please help me!
file: https://gofile.io/?c=SFM1T3
As suggested in the comments, you simply have to add the encoding parameter:
soup = bsoup(open(url, encoding="utf-8").read())
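The same fix reads a little more safely with a context manager, so the file handle is closed even if parsing fails. A self-contained sketch (the file name and mailto address are hypothetical stand-ins for the asker's enero.html):

```python
import os
import tempfile
from bs4 import BeautifulSoup as bsoup

# Stand-in for the asker's file: a UTF-8 HTML file with an accented
# link text (hypothetical address).
path = os.path.join(tempfile.gettempdir(), "enero_demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write('<a href="mailto:maria@example.com">María</a>')

# An explicit encoding= prevents open() from falling back to the
# platform default ('charmap'/cp1252 on Windows), which is what
# raised the UnicodeDecodeError.
with open(path, encoding="utf-8") as f:
    soup = bsoup(f.read(), "html.parser")

print(soup.find("a").get("href"))  # → mailto:maria@example.com
```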

Encoding error when printing XPATH result

I am trying to fetch recurring information (product names) from an e-commerce website. To do this, I use XPath. I have followed this tutorial to do so.
from lxml import html
import requests

page = requests.get("https://search.rakuten.co.jp/search/mall/-/565210/tg1000768/")
tree = html.fromstring(page.content)
urls = tree.xpath('//div[@class="image"]/a/img/@src')
titles = tree.xpath('//div[@class="content title"]/h2/a/text()')
print(len(titles))
print(titles)
The print(len(titles)) displays the correct number. However, print(titles) raises an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-15: ordinal not in range(128)
What am I supposed to do?
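No answer survives in this dump. Since the Japanese product titles fail only at print time, one common workaround (a sketch; the script name is a hypothetical placeholder) is to tell Python's I/O layer to use UTF-8 before the script starts:

```shell
# Make Python encode stdout/stderr as UTF-8 regardless of the locale,
# so print(titles) no longer falls back to the 'ascii' codec.
export PYTHONIOENCODING=UTF-8        # PowerShell: $env:PYTHONIOENCODING = "UTF-8"
python scrape_rakuten.py             # hypothetical script name
```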

error: UnicodeEncodeError: 'gbk' codec can't encode character

I'm a Python beginner. I wrote the following code:
from bs4 import BeautifulSoup
import requests

url = "http://www.google.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find_all("a")
for link in links:
    print(link.text)
When I run this .py file in Windows PowerShell, print(link.text) causes the following error:
UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 5:
illegal multibyte sequence
I know the error is caused by some Chinese characters, and it seems like I should use 'decode' or 'ignore', but I don't know how to fix my code. Help please! Thanks!
If you don't wish to display those special characters, you can ignore them:
print(link.text.encode(errors="ignore"))
Or you can encode the string as UTF-8:
for link in links:
    print(link.text.encode('utf8'))
But a better approach is:
response = requests.get(url)
soup = BeautifulSoup(response.text.encode("utf8"), "html.parser")
To understand more about the problem you are facing, you should look at this Stack Overflow answer.
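If the goal is to capture the link text rather than display it in PowerShell, another sketch (file name is illustrative) is to write the results to a UTF-8 file, so the console's 'gbk' codec is never involved:

```python
# link_texts stands in for [link.text for link in links]; it mixes a
# character outside GBK's comfort zone ('»') with Chinese text.
link_texts = ["\u00bb next", "\u4e2d\u6587\u94fe\u63a5", "plain link"]

# An explicitly UTF-8 file accepts every character, so nothing needs
# to be ignored or dropped.
with open("links.txt", "w", encoding="utf-8") as f:
    for text in link_texts:
        f.write(text + "\n")
```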

Use soup.get_text() with UTF-8

I need to get all the text from a page using BeautifulSoup. BeautifulSoup's documentation shows that you can call soup.get_text() to do this. When I tried doing this on reddit.com, I got this error:
UnicodeEncodeError in soup.py:16
'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence
I get errors like that on most of the sites I checked.
I got similar errors when I did soup.prettify() too, but I fixed it by changing it to soup.prettify('UTF-8'). Is there any way to fix this? Thanks in advance!
Update June 24
I've found a bit of code that seems to work for other people, but I still need to use UTF-8 instead of the default. Code:
import re

texts = soup.findAll(text=True)

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    elif re.match('\n', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)
print visible_texts
Error is different, though. Progress?
UnicodeEncodeError in soup.py:29
'ascii' codec can't encode character u'\xbb' in position 1: ordinal not in range
(128)
soup.get_text() returns a Unicode string; that's why you're getting the error.
You can solve this in a number of ways, including setting the encoding at the shell level:
export PYTHONIOENCODING=UTF-8
Or you can reload sys and set the default encoding by including this in your script:
import sys

if __name__ == "__main__":
    reload(sys)
    sys.setdefaultencoding("utf-8")
Or you can encode the string as UTF-8 in code. For your reddit problem, something like the following would work:
import urllib
from bs4 import BeautifulSoup
url = "https://www.reddit.com/r/python"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
# get text
text = soup.get_text()
print(text.encode('utf-8'))
You can't do str(text) if you may be dealing with unicode on the page. Instead of str(), use unicode().