I am trying to fetch recurring information (product names) from an e-commerce website using XPath, following this tutorial.
from lxml import html
import requests
page = requests.get("https://search.rakuten.co.jp/search/mall/-/565210/tg1000768/")
tree = html.fromstring(page.content)
urls = tree.xpath('//div[@class="image"]/a/img/@src')
titles = tree.xpath('//div[@class="content title"]/h2/a/text()')
print(len(titles))
print(titles)
The print(len(titles)) displays a correct number. However, print(titles) raises an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-15: ordinal not in range(128)
What am I supposed to do?
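For reference, a minimal standalone sketch of one common fix: the error comes from print() encoding to an ASCII-configured stdout, so switching the stream to UTF-8 avoids it. The title string below is an assumed placeholder, not data actually fetched from Rakuten.

```python
import sys

# Assumed placeholder title; the real values come from the XPath query.
titles = ["\u30b7\u30e3\u30fc\u30d7 \u52a0\u6e7f\u5668"]  # "シャープ 加湿器"

# print() fails because sys.stdout is using the ASCII codec. On
# Python 3.7+ the stream can be reconfigured to UTF-8 in place:
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

for t in titles:
    print(t)  # now encoded as UTF-8 instead of ASCII
```

On older Python 3 versions, setting the PYTHONIOENCODING=utf-8 environment variable before running the script has the same effect.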
Related
I keep getting this error when I try to print or use driver.page_source.
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1494-1498: character maps to <undefined>
This happens not only with Selenium, but also with requests and urllib3. I tried several solutions, but none of them worked:
str(), .encode("utf-8-sig"), .encode("utf-8"),
BeautifulSoup(source, from_encoding="utf-8").
My code:
import base64
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://www.example.com/")
source = driver.page_source
driver.close()
with open("test.html", "wb") as W:
    W.write(source.encode("utf-8"))  # encode: the file is opened in binary mode
soup = BeautifulSoup(source.encode("utf-8"), "html.parser")
print(soup.find_all("img"))
Any idea about making it work?
Finally, I solved it!
The problem here is caused by the Arabic text.
Simply ignore the Arabic text in the content by using:
.encode("utf-8", "ignore")
but first you have to convert the soup to a string:
str(soup.find("span", "captcha-container")).encode("utf-8", "ignore")
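As a side note, a quick standalone check (the sample string below is made up) shows what errors="ignore" actually does: UTF-8 can encode every character, so nothing is dropped there; characters are only discarded when the target codec cannot represent them.

```python
arabic = "abc \u0645\u0631\u062d\u0628\u0627"  # "abc مرحبا" (made-up sample)

# UTF-8 can represent all characters, so "ignore" drops nothing here;
# the call just returns bytes, and printing bytes never touches the
# console codec, which is why the workaround stops the crash:
assert arabic.encode("utf-8", "ignore").decode("utf-8") == arabic

# "ignore" only discards characters the target codec cannot encode:
assert arabic.encode("ascii", "ignore") == b"abc "
```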
I'm currently trying to extract the href values (emails) from HTML files provided by a client of my company. They sent me 6 months' worth of data, but I'm unable to extract the emails from 2 particular files. I keep getting the same UnicodeDecodeError every time, no matter what I try. According to my analysis, these files are encoded as UTF-8. I'll leave the code below:
from bs4 import BeautifulSoup as bsoup
url = r"C:\Users\Maximiliano\Documents\enero.html"
soup = bsoup(open(url).read())
data = []
for p in soup.find_all("a"):
    datos = p.get("href")
    if datos[0] != "m":
        pass
    else:
        data.append(datos)
print(data)
I've already tried adding a .decode("utf-8") after the read(), but it does nothing.
Please help me!
file: https://gofile.io/?c=SFM1T3
As suggested in the comments, you simply have to add the encoding parameter:
soup = bsoup(open(url, encoding="utf-8").read())
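A small self-contained illustration of why the encoding parameter matters. It writes a throwaway temp file rather than touching enero.html, and the snippet and email address in it are invented placeholders.

```python
import os
import tempfile

# Invented sample content; the address is a placeholder.
text = 'correo: <a href="mailto:maximiliano@example.com">enero</a> \u00f1'

# Write a small HTML snippet as UTF-8, the encoding the files use:
path = os.path.join(tempfile.mkdtemp(), "enero.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Without the encoding parameter, open() falls back to the locale
# default (often cp1252 on Windows), which can raise UnicodeDecodeError
# on multi-byte UTF-8 sequences; passing it explicitly is reliable:
with open(path, encoding="utf-8") as f:
    assert f.read() == text
```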
I made a script to parse SpanishDict for verb conjugations, but I always get the error
UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 3: ordinal not in range(128)
when I try to print the characters with accents. Looking at other similar questions, I realized that I need to encode and decode the accented characters, but I don't know how to do this in my code.
Here is my code:
from requests import get
from bs4 import BeautifulSoup
url = "http://www.spanishdict.com/translate/hacer"
response = get(url)
#print(response.text[:500])
soup = BeautifulSoup(response.text,'html.parser')
container = soup.find_all("a", class_="vtable-word-text")
for n in range(0, 30, 1):
    try:
        print(container[n].text)
    except Exception as e:
        print("accented character")
I have also added a try/except to tell me which ones are accented characters. Could anyone help me with my problem? Thanks in advance.
I am trying to scrape the paragraphs from a wikipedia page.
I am getting this error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 530: character maps to <undefined>
For example, I used this Wikipedia page and wrote the following script in Python with BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests

r = requests.get(url)  # url is the Wikipedia page linked above
soup = BeautifulSoup(r.content, "html.parser")
for i in soup.find_all("p"):
    print i.text
    print "\n"
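For context, the 'charmap' codec in the traceback is the legacy Windows console code page (often cp437 or cp850), and a standalone check, independent of the scraping code above, shows it really cannot represent U+2013:

```python
ch = u"\u2013"  # EN DASH, the character from the traceback

# cp437 (a common Windows console code page) has no mapping for it:
try:
    ch.encode("cp437")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# UTF-8 has no such gaps, so writing the paragraphs to a UTF-8 file
# (or a reconfigured stdout) sidesteps the error:
assert ch.encode("utf-8") == b"\xe2\x80\x93"
```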
I'm a Python beginner. I wrote the following code:
from bs4 import BeautifulSoup
import requests
url = "http://www.google.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find_all("a")
for link in links:
    print(link.text)
When I run this .py file in Windows PowerShell, print(link.text) causes the following error:
UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 5: illegal multibyte sequence
I know the error is caused by some Chinese characters, and it seems like I should use 'decode' or 'ignore', but I don't know how to fix my code. Help, please! Thanks!
If you don't wish to display those special characters, you can ignore them:
print(link.text.encode(errors="ignore"))
You can encode the string as UTF-8:
for link in links:
    print(link.text.encode('utf8'))
But a better approach is:
response = requests.get(url)
soup = BeautifulSoup(response.text.encode("utf8"), "html.parser")
To understand more about the problem you are facing, you should look at this Stack Overflow answer.
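One caveat about both suggestions, shown in a standalone sketch (the string below is a made-up sample containing the '\xbb' character from the traceback): encoding before printing avoids the gbk codec, but print() then shows a bytes literal rather than readable text.

```python
s = "caf\xe9 \xbb"  # made-up sample containing the '\xbb' character

# str.encode defaults to UTF-8, so errors="ignore" drops nothing here;
# the result is simply bytes, which print() can always display:
out = s.encode(errors="ignore")
assert out == s.encode("utf8") == b"caf\xc3\xa9 \xc2\xbb"

# Printing bytes shows the repr (a b'...' literal), which never
# touches the console's gbk codec:
print(out)
```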