UnicodeDecodeError Python Error - python

I'm trying to write a Python client for the Google search API and I'm running into some Unicode issues. My really basic PoC so far is:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
query = "filetype%3Apdf"
url = "http://www.google.com/search?sclient=psy-ab&hl=en&site=&source=hp&q="+query+"&btnG=Search"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
data = response.read()
data = data.decode('UTF-8', 'ignore')
data = data.encode('UTF-8', 'ignore')
soup = BeautifulSoup(data)
print u""+soup.prettify('UTF-8')
My traceback is:
Traceback (most recent call last):
File "./google.py", line 22, in <module>
print u""+soup.prettify('UTF-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 48786: ordinal not in range(128)
Any ideas?

You are converting your soup tree to UTF-8 (an encoded byte string), then trying to concatenate it to an empty u'' unicode string.
Python 2 automatically tries to decode your encoded byte string using the default codec, which is ASCII, and fails because the data is UTF-8.
You need to explicitly decode the prettify() output:
print u"" + soup.prettify('UTF-8').decode('UTF-8')
The Python Unicode HOWTO explains this in more depth, including default encodings. I really, really recommend you read Joel Spolsky's article on Unicode as well.
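A minimal Python 3 sketch of the same pitfall (in Python 3 the implicit ASCII decode is gone, so mixing bytes and str fails loudly instead of half-working):

```python
# In Python 2, "" + some_bytes triggered an implicit ASCII decode that blew up
# on non-ASCII data; Python 3 refuses to mix str and bytes at all.
data = "naïve café".encode("utf-8")   # an encoded byte string, like prettify('UTF-8')

try:
    text = "" + data                  # TypeError in Python 3
except TypeError:
    text = "" + data.decode("utf-8")  # explicit decode, as suggested above
```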

Related

using request and beautiful soup module in python

My code is very short, but it is giving an unexpected error:
import bs4
import requests
url = "https://www." + input("Enter The Name of the website: ") + ".com"
req = requests.get(url)
html_text = req.text
htmls = bs4.BeautifulSoup(html_text, "html.parser").prettify()
with open("facebook.html", "w+") as file:
    file.write(htmls)
Traceback (most recent call last):
File "C:\Users\Tanay Mishra\PycharmProjects\First_Project\Web Scraping\web.py", line 9, in <module>
file.write(htmls)
File "C:\Users\Tanay Mishra\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined>
\u200b means ZERO WIDTH SPACE in Unicode. Try specifying encoding="utf-8" in the open() call. It is also good practice to pass the .content property of the Response object and let BeautifulSoup guess the encoding:
import bs4
import requests
url = "https://www."+input("Enter The Name of the website: ")+".com"
req = requests.get(url)
htmls = bs4.BeautifulSoup(req.content, "html.parser").prettify()
with open("facebook.html", "w+", encoding="utf-8") as file:
    file.write(htmls)
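For illustration, a small stdlib-only check of why the original write failed: U+200B simply has no slot in cp1252 (the Windows default for text files), while UTF-8 can encode any code point. This is a sketch, not part of the answer's code:

```python
ch = "\u200b"  # ZERO WIDTH SPACE

# cp1252 has no mapping for U+200B, so encoding raises ...
try:
    ch.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# ... but UTF-8 handles it fine
utf8_bytes = ch.encode("utf-8")
```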

Ascii codec can't encode character error, content.decode('utf-8','ignore') won't work. Any other options?

Example code:
import requests
import bs4 as bs
import urllib
import urllib.request
from urllib.request import Request
req = Request("https://www.openbugbounty.org/researchers/Spam404/vip/page/1/",
headers ={'User-Agent':'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce,'lxml')
print(soup)
My output gives me the following error:
File "/Users/student/Desktop/AutoBots/kbfza2.py", line 15, in
print(soup)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5194: ordinal not in range(128)
After searching for a solution online for a while, it seemed that changing my soup line to:
soup = bs.BeautifulSoup(sauce.decode('utf-8','ignore'),'lxml')
would solve this, but it hasn't fixed anything for me.
Am I mistaken in thinking that the decode function with the ignore argument should allow me to print(soup) without error even if it isn't successfully decoded completely?
Just re-read your question and I believe you are trying to print Unicode text to a console that doesn't support that character set (I couldn't reproduce the error with the code you posted).
You may need to force your console output to utf-8 encoding, or if you are using an editor like Sublime Text, change it to render utf-8 encoding.
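A stdlib sketch of what forcing UTF-8 output means: the stream's text layer is rebuilt with a UTF-8 codec, so '\xa0' encodes to bytes instead of raising. On Python 3.7+ you can do this to the real console with `sys.stdout.reconfigure(encoding="utf-8")`; the same mechanism is demonstrated here on an in-memory byte stream so it works anywhere:

```python
import io

# Wrap a byte stream in a UTF-8 text layer - the same thing
# sys.stdout.reconfigure(encoding="utf-8") does to the console stream.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="utf-8", newline="\n")
print("non-breaking\xa0space", file=out)  # would raise under an ASCII codec
out.flush()
encoded = buf.getvalue()  # '\xa0' became the UTF-8 bytes b'\xc2\xa0'
```

Alternatively, setting the `PYTHONIOENCODING=utf-8` environment variable before launching the script achieves the same without code changes.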

python - Cant encode 'windows-1255' page

I'm using BeautifulSoup and trying to read a site which is written in Hebrew and encoded in windows-1255, according to this line:
<meta http-EQUIV="Content-Type" Content="text/html; charset=windows-1255">
when I'm trying to encode it, I get the following error:
> UnicodeEncodeError: 'charmap' codec can't encode characters in position 6949-6950: character maps to <undefined>
The code:
from bs4 import BeautifulSoup
import requests
r = requests.get('http://www.plonter.co.il')
soup = BeautifulSoup(r.text)
print soup.prettify().encode('windows-1255')
If the site is already encoded in windows-1255 you should decode it to get unicode or just use it with the current encoding.
--edit
I didn't know r.text was already decoded.
>>> import requests
>>> r = requests.get('http://www.plonter.co.il')
>>> isinstance(r.text, unicode)
True
>>> isinstance(r.content, unicode)
False
>>> isinstance(r.content, str)
True
>>> r.encoding
'ISO-8859-1'
>>> r.content.decode(r.encoding).encode('utf-8') # works
>>> r.content.decode(r.encoding).encode('windows-1255') # fails
>>> r.content.decode(r.encoding).encode('windows-1255', 'ignore') # works
>>> r.content.decode(r.encoding).encode('windows-1252') # works
So, I think you got the encoding "wrong": 'windows-1255' can't encode the content without errors. On the other hand, 'utf-8', 'iso-8859-1' and 'windows-1252' seem to be able to handle it.
>>> r.content.decode(r.encoding) == r.text
True
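A small sketch of the decode/encode distinction discussed above. The Hebrew sample bytes here are a hypothetical stand-in for r.content:

```python
# windows-1255 bytes for the word "shalom" - a stand-in for r.content
raw = b"\xf9\xec\xe5\xed"

text = raw.decode("windows-1255")   # bytes -> unicode: decoding
utf8 = text.encode("utf-8")         # unicode -> bytes: encoding

# Decoding the same bytes as ISO-8859-1 (requests' fallback when the HTTP
# header carries no charset) "succeeds" but yields mojibake, which is why
# a later .encode('windows-1255') on r.text can then fail.
mojibake = raw.decode("iso-8859-1")
```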

Decoding in utf-8 in parsed data from website via Python

I am trying to parse data from a website and I am getting an error.
Here's my python code
import urllib.request
import re
url = "http://ihned.cz"
req = urllib.request.Request(url)
resp = urllib.request.urlopen(req)
respData = resp.read().decode('utf-8')
#print(respData) # html code
authors = re.findall(r'data-author="(.*?)"', str(respData))
for author in authors:
    print(author)
And here's the error.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 368: invalid continuation byte
Can you please help me?
Thank you.
The source of that website says charset="windows-1250". Try decode('windows-1250').
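A sketch of why that byte trips the UTF-8 decoder but not windows-1250 (the two bytes below stand in for the page content):

```python
# 0xe1 is 'á' in windows-1250 (fitting the Czech text on ihned.cz), but as a
# UTF-8 lead byte it announces a 3-byte sequence, so decoding fails when a
# valid continuation byte doesn't follow.
raw = b"\xe1 "

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:   # 'invalid continuation byte', as in the question
    utf8_ok = False

text = raw.decode("windows-1250")
```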

Python: UnicodeDecodeError: 'utf-8' codec can't decode byte...invalid continuation byte

I'm building a web scraper using BeautifulSoup on Python 3.3
However I get a problem which prevents me from getting a valid string that I can use with BeautifulSoup. That is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 7047: invalid continuation byte
I know there are dozens of similar questions but I haven't so far found a method that can help me to diagnose what's wrong with the following code:
import urllib.request
URL = "<url>" # sorry, I cannot show the url for privacy reasons, but it's a normal html document
page = urllib.request.urlopen(URL)
page = page.read().decode("utf-8") # from bytes to <source encodings>
As I guessed, I notice this error occurs only with some URLs and not with others. I wasn't having this error until yesterday; then today I ran the program again and the error popped up.
Any clue on how to diagnose the error?
You should not decode the response. First of all, you are incorrectly assuming the response is UTF-8 encoded (it is not, as the error shows), but more importantly, BeautifulSoup will detect the encoding for you. See the Encodings section of the BeautifulSoup documentation.
Pass a byte string to BeautifulSoup and it'll use any <meta> header proclaiming the correct encoding, or do a great job of autodetecting the encoding for you.
In the event that auto-detection fails, you can always fall back to the server-provided encoding:
encoding = page.info().get_content_charset()
page = page.read()
soup = BeautifulSoup(page)
if encoding is not None and soup.original_encoding != encoding:
    print('Server and BeautifulSoup disagree')
    print('Content-Type states it is {}, BS4 thinks it is {}'.format(encoding, soup.original_encoding))
    print('Forcing encoding to server-supplied codec')
    soup = BeautifulSoup(page, from_encoding=encoding)
This still leaves the actual decoding to BeautifulSoup, but if the server included a charset parameter in the Content-Type header then the above assumes that the server is correctly configured and forces BeautifulSoup to use that encoding.