How to encode '-' in python using Beautiful Soup - python

I am trying to scrape the paragraphs from a Wikipedia page.
I am getting this error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 530: character maps to <undefined>
For example, I used this Wikipedia page and wrote the following script in Python with BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests
r = requests.get(url)  # url: the Wikipedia page mentioned above (not shown in the original)
soup = BeautifulSoup(r.content, "html.parser")
for i in soup.find_all("p"):
    print(i.text)
    print("\n")
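One hedged way to see what is going on, assuming Python 3 and a console whose code page lacks the en dash (cp437 is used below only as an assumed example): the sketch encodes the text with the target encoding and replaces whatever it cannot represent, so printing no longer raises. The helper name console_safe and the sample string are made up for illustration.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to the real console encoding, then to UTF-8, if none is given.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


# Sample paragraph text containing U+2013 (en dash), the character from the error.
sample = "World War I (1914\u20131918)"

# cp437 has no en dash, so it is swapped for '?' instead of raising.
print(console_safe(sample, "cp437"))
```

In the scraping loop, this would mean printing console_safe(i.text) instead of i.text. The trade-off is that unsupported characters are lost in the output.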

Related

Python Selenium - UnicodeEncodeError 'charmap' codec can't encode

I keep getting this error when I try to print or use driver.page_source:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1494-1498: character maps to <undefined>
It happens not only with Selenium but also with requests and urllib3. I tried several solutions, such as str(), .encode("utf-8-sig") or .encode("utf-8"), and BeautifulSoup(source, from_encoding="utf-8"), but none of them worked.
My code:
import base64
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://www.example.com/")
source = driver.page_source
driver.close()
# page_source is text, so encode it before writing to a file opened in binary mode
with open("test.html", "wb") as W:
    W.write(source.encode("utf-8"))
soup = BeautifulSoup(source.encode("utf-8"), "html.parser")
print(soup.find_all("img"))
Any idea how to make it work?
I finally solved it!
The problem was caused by the Arabic text in the page.
Encoding the content with the "ignore" error handler gets rid of the error:
.encode("utf-8","ignore")
But first you have to convert the soup to a string:
str(soup.find("span","captcha-container")).encode("utf-8","ignore")
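A small sketch of what the encode error handlers do, using a made-up sample string. Note that UTF-8 can represent Arabic, so "ignore" has nothing to drop there; characters are only discarded with a narrower codec. cp1252 is used here as an assumed stand-in for a Windows console encoding.

```python
# Sample text: an ASCII word followed by an Arabic word (illustrative only).
text = "captcha \u0645\u0631\u062d\u0628\u0627"

# UTF-8 covers every Unicode character, so nothing is ignored here.
utf8 = text.encode("utf-8", "ignore")

# cp1252 has no Arabic letters, so the error handler decides what happens.
narrow_ignore = text.encode("cp1252", "ignore")             # drops the letters
narrow_replace = text.encode("cp1252", "replace")           # one '?' per letter
narrow_escape = text.encode("cp1252", "backslashreplace")   # keeps \uXXXX escapes

print(utf8.decode("utf-8") == text)
print(narrow_ignore)
print(narrow_replace)
print(narrow_escape)
```

So the error probably goes away because the result is a bytes object, which print shows as an escaped literal rather than sending the raw characters through the console's codec.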

Web scraping accented and other special characters in Beautiful Soup

I made a script to parse SpanishDict for verb conjugations, but I always get the error
UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 3: ordinal not in range(128)
when I try to print characters with accents. Looking at other similar questions, I realized that I need to encode and decode the accented characters, but I don't know how to do this in my code.
Here is my code:
from requests import get
from bs4 import BeautifulSoup
url = "http://www.spanishdict.com/translate/hacer"
response = get(url)
#print(response.text[:500])
soup = BeautifulSoup(response.text, 'html.parser')
container = soup.find_all("a", class_="vtable-word-text")
for n in range(0, 30, 1):
    try:
        print(container[n].text)
    except Exception as e:
        print("accented character")
I have also added a try and except to tell me which ones are accented characters. Could anyone help me with my problem? Thanks in advance.
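As a hedged alternative to the catch-all try/except, the entries containing non-ASCII characters can be found directly. This is a minimal sketch assuming Python 3.7+ (for str.isascii); the word list is hypothetical sample data standing in for the scraped container texts.

```python
# Hypothetical conjugations standing in for container[n].text values.
words = ["hago", "hac\u00eda", "hice", "har\u00e9"]

# Keep only the entries that contain at least one non-ASCII character.
accented = [w for w in words if not w.isascii()]

# ascii() returns an escaped representation that any console can print.
escaped = [ascii(w) for w in accented]

for item in escaped:
    print(item)
```

This only identifies and displays the affected strings safely; it does not change how the console itself encodes output.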

Encoding error when printing XPATH result

I am trying to fetch recurring information (product names) from an e-commerce website. To do this, I use XPath. I have followed this tutorial to do so.
from lxml import html
import requests
page = requests.get("https://search.rakuten.co.jp/search/mall/-/565210/tg1000768/")
tree = html.fromstring(page.content)
urls = tree.xpath('//div[@class="image"]/a/img/@src')
titles = tree.xpath('//div[@class="content title"]/h2/a/text()')
print(len(titles))
print(titles)
print(len(titles)) displays the correct number. However, print(titles) raises an error:
print(titles)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-15: ordinal not in range(128)
What am I supposed to do?
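If the console cannot display the characters at all, one option is to skip it and write the results to a UTF-8 file. A minimal sketch, assuming Python 3; the titles list and the file name are placeholders, not data from the site.

```python
import os
import tempfile

# Placeholder titles: a Japanese name and an accented word.
titles = ["\u697d\u5929\u5e02\u5834", "caf\u00e9"]

# Write one title per line, stating the file encoding explicitly.
path = os.path.join(tempfile.mkdtemp(), "titles.txt")
with open(path, "w", encoding="utf-8") as f:
    for title in titles:
        f.write(title + "\n")

# Reading back with the same encoding returns the original strings.
with open(path, encoding="utf-8") as f:
    read_back = f.read().splitlines()

print(len(read_back))
```

Passing encoding="utf-8" explicitly matters because the default file encoding depends on the platform and locale.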

Ascii codec can't encode character error, content.decode('utf-8','ignore') won't work. Any other options?

Example code:
import requests
import bs4 as bs
import urllib
import urllib.request
from urllib.request import Request
req = Request("https://www.openbugbounty.org/researchers/Spam404/vip/page/1/",
              headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
My output gives me the following error:
File "/Users/student/Desktop/AutoBots/kbfza2.py", line 15, in <module>
print(soup)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5194: ordinal not in range(128)
After searching online for a while, it seemed that changing my soup line to:
soup = bs.BeautifulSoup(sauce.decode('utf-8','ignore'),'lxml')
would solve this, but it hasn't fixed anything.
Am I mistaken in thinking that decode with the 'ignore' argument should allow me to print(soup) without error, even if the content isn't decoded completely?
I just re-read your question, and I believe you are trying to print Unicode text to a console that doesn't support that character set (I couldn't reproduce the error with the code you posted).
You may need to force your console output to UTF-8 encoding or, if you are using an IDE like Sublime Text, change it to render UTF-8 encoding.
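To illustrate forcing the output encoding, here is a sketch that uses an in-memory stream as an assumed stand-in for an ASCII-only console, then switches it to UTF-8 with reconfigure (available on text streams in Python 3.7+). The sample string is made up; it contains '\xa0', the character from the traceback.

```python
import io

# Stand-in for a console whose encoding is ASCII (an assumption for the demo).
raw = io.BytesIO()
console = io.TextIOWrapper(raw, encoding="ascii")

try:
    # '\xa0' (non-breaking space) is outside ASCII, so this write raises.
    console.write("price:\xa0100\n")
    console.flush()
    failed = False
except UnicodeEncodeError:
    failed = True

# Switch the same stream to UTF-8 and write again.
console.reconfigure(encoding="utf-8")
console.write("price:\xa0100\n")
console.flush()
output = raw.getvalue()

print(failed, output)
```

On a real console the equivalent call would be sys.stdout.reconfigure(encoding="utf-8"), or setting the PYTHONIOENCODING=utf-8 environment variable before starting Python. Whether the characters then display correctly still depends on the terminal or editor.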

Beautiful Soup 4 not printing text from a webpage

I'm using Python 3.4 with Beautiful Soup 4 and requests.
I am trying to grab a webpage and print its text using Beautiful Soup. It can grab the webpage and print the title, and it can even prettify the page if I provide the encoding (UTF-8), but when I try to print the text from the page, it fails with an encoding error.
from bs4 import BeautifulSoup
import requests
sparknotesSearch = requests.get("http://www.sparknotes.com/search?q=Sonnet")
soup = BeautifulSoup(sparknotesSearch.text, "html.parser")  # name the parser explicitly
print(soup.title)
# Can't print this?
print(soup.get_text())
The error/output I get is this:
<title>SparkNotes Search Results: sONNET</title>
Traceback (most recent call last):
File "C:\Users\Cayle J. Elsey\Dropbox\Programming\Salient_Point\networking.py", line 10, in <module>
print(soup.get_text())
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 6238: character maps to <undefined>
[Finished in 0.5s]
Just encode your string into UTF-8 and your problem will be solved:
html = soup.prettify()
html = html.encode('UTF-8')
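One caveat worth sketching, assuming Python 3: encode returns a bytes object, and print shows bytes as an escaped literal rather than as the original characters. Writing the bytes to a binary stream keeps the text intact. Below, io.BytesIO stands in for sys.stdout.buffer, and the sample string is made up apart from U+2192, the arrow from the traceback.

```python
import io

# Sample text containing U+2192 (rightwards arrow).
text = "Sonnet \u2192 results"
data = text.encode("utf-8")

# This is what print(data) would display: an escaped bytes literal.
shown = repr(data)


def write_bytes(stream, payload):
    """Write already-encoded bytes plus a newline to a binary stream."""
    stream.write(payload + b"\n")


# BytesIO stands in for sys.stdout.buffer so the result can be inspected.
buffer = io.BytesIO()
write_bytes(buffer, data)

print(shown)
```

With a real console, write_bytes(sys.stdout.buffer, data) would send the UTF-8 bytes through unchanged; they only render properly if the terminal itself interprets output as UTF-8.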
