Why is Python able to parse Amazon but not Google/Reddit?

Why is Python able to parse Amazon but not Google/Reddit? - python

I've searched for awhile to no result. Python seems to be able to handle some-- but not all--webpages:
import requests, webbrowser, bs4
res = requests.get('http://www.reddit.com')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print soup.prettify()
Surprisingly, this is able to print the Amazon.com homepage, but not Reddit. The error I get is:
Traceback (most recent call last):File "testweb.py", line 7, in <module>
print soup.prettify()File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)UnicodeEncodeError: 'charmap' codec can't encode character u'\xd7' in position 37769: character maps to <undefined>
My question: How can I write a program that can encode for any webpage? Where am I going wrong?
EDIT: Further testing shows google.com also does not work. It's a similar error message:
Traceback (most recent call last):File "testweb.py", line 7, in <module>
print soup.prettify()File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 9651: character maps to <undefined>
EDIT 2: Tried decoding res.text to utf-8 but got this error:
Traceback (most recent call last):File "testweb.py", line 5, in <module>
soup = bs4.BeautifulSoup(res.text.decode('utf-8'), 'html.parser')File "C:\PYTHON27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 9358: ordinal not in range(128)
Edit 3: Tried encoding res.text to utf-8 but got this error:
Traceback (most recent call last):File "testweb.py", line 8, in <module>
print soup.prettify()File "C:\PYTHON27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 9622: character maps to <undefined>

Change the output encoding to utf-8, so it'll output utf-8 encoded text, and try to encode the request text, instead of decoding it.
Example:
# -*- coding: utf-8 -*-
import requests, webbrowser, bs4
res = requests.get('http://www.reddit.com')
soup = bs4.BeautifulSoup(res.text.encode('utf-8'), 'html.parser')
print (soup.prettify())
Try to encode directly in prettify:
print (soup.prettify('latin-1')) or print (soup.prettify('utf-8'))

Related

UnicodeEncodeError: 'charmap' codec can't encode character '\u221e' in position 4297: character maps to <undefined>

Background details: -Using atom with installed package script-Python version is 3.8.3 -Trying to web scrape from a URL which is an online directory
-Would like to know more about this error and solve it-image link of script and error : https://i.stack.imgur.com/QQWtE.jpg
from bs4 import BeautifulSoup
import requests
url = "https://www.timesbusinessdirectory.com/company-listings"
source_url=requests.get(url).text
html=BeautifulSoup(source_url, 'html.parser')
print(html.prettify())
Traceback (most recent call last):
File "C:\Users\User\Desktop\beautiful\scrap.py", line 8, in <module>
print(html)
File "C:\Users\User\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u221e' in position 5103: character maps to <undefined>

Python - cannot decode html (urllib)

I'm trying to write html from webpage to file, but I have problem with decode characters:
import urllib.request
response = urllib.request.urlopen("https://www.google.com")
charset = response.info().get_content_charset()
print(response.read().decode(charset))
Last line causes error:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in
position 6079: ordinal not in range(128)
response.info().get_content_charset() returns iso-8859-2, but if i check content of response without decoding (print(resposne.read())) there is "utf-8" encoding as html metatag. If i use "utf-8" in decode function there is also similar problem:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position
6111: invalid start byte
What's going on?

You can ignore invalid characters using
response.read().decode("utf-8", 'ignore')
Instead of ignore there are other options, e.g. replace
https://www.tutorialspoint.com/python/string_encode.htm
https://docs.python.org/3/howto/unicode.html#the-string-type
(There is also str.encode(encoding='UTF-8',errors='strict') for strings.)

"UnicodeEncodeError: 'charmap' codec can't encode characters" when trying to parse .xlsx by openpyxl

--- update ---
I think this console log nails the issue, however it's still not clear how to fix it:
>>> workbook = openpyxl.load_workbook('data.xlsx')
>>> worksheet = workbook.active
>>> worksheet['A2'].value
u'\u041c\u0435\u0448\u043e\u043a \u0434\u0435\u043d\u0435\u0433'
>>> print worksheet['A2'].value
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
--- end update ---
I'm trying to print the values of some .xlsx cells using openpyxl:
import openpyxl
workbook = openpyxl.load_workbook(filename='puzzles.xlsx')
worksheet = workbook.active
for row in worksheet.iter_rows('A2:K5'):
print row[0].value
Which results in the following error:
Traceback (most recent call last):
File "xls_import.py", line 8, in <module>
print row[0].value
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
As far as I know, XLSX is encoded as UTF-8, however:
print row[0].value.decode('utf-8')
does not help either:
Traceback (most recent call last):
File "xls_import.py", line 8, in <module>
print row[0].value.decode('utf-8')
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
Any suggestions?
I'm running Python 2.7 and openpyxl 2.2.5.

openpyxl returns unicode strings (XML itself is encoded in UTF-8) so you don't need to decode them (decoding goes from an encoding to unicode) but encode them in encoding of your choice.

Why html2text module throws UnicodeDecodeError?

I have problem with html2text module...shows me UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte
0xbe in position 6: ordinal not in range(128)
Example :
#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib
h = html2text.HTML2Text()
h.ignore_links = True
html = urllib.urlopen( "http://google.com" ).read()
print h.handle( html )
...also have tried h.handle( unicode( html, "utf-8" ) with no success. Any help.
EDIT :
Traceback (most recent call last):
File "test.py", line 12, in <module>
print h.handle(html)
File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
return self.optwrap(self.close())
File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)

The issue is easily reproducable when not decoding, but works just fine when you decode your source correctly. You also get the error if you reuse the parser!
You can try this out with a known good Unicode source, such as http://www.ltg.ed.ac.uk/~richard/unicode-sample.html.
If you don't decode the response to unicode, the library fails:
>>> h = html2text.HTML2Text()
>>> h.handle(html)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Now, if you reuse the HTML2Text object, its state is not cleared up, it still holds the incorrect data, so even passing in Unicode will now fail:
>>> h.handle(html.decode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
You need to use a new object and it'll work just fine:
>>> h = html2text.HTML2Text()
>>> result = h.handle(html.decode('utf8'))
>>> len(result)
12750
>>> type(result)
<type 'unicode'>

Beautiful Soup Unicode encode error

I am trying the following code with a particular HTML file
from BeautifulSoup import BeautifulSoup
import re
import codecs
import sys
f = open('test1.html')
html = f.read()
soup = BeautifulSoup(html)
body = soup.body.contents
para = soup.findAll('p')
print str(para).encode('utf-8')
I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 9: ordinal not in range(128)
How do I debug this?
I do not get any error when I remove the call to print function.

The str(para) builtin is trying to use the default (ascii) encoding for the unicode in para.
This is done before the encode() call:
>>> s=u'123\u2019'
>>> str(s)
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> s.encode("utf-8")
'123\xe2\x80\x99'
>>>
Try encoding para directly, maybe by applying encode("utf-8") to each list element.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why is Python able to parse Amazon but not Google/Reddit? - python

Related

UnicodeEncodeError: 'charmap' codec can't encode character '\u221e' in position 4297: character maps to <undefined>

Python - cannot decode html (urllib)

"UnicodeEncodeError: 'charmap' codec can't encode characters" when trying to parse .xlsx by openpyxl

Why html2text module throws UnicodeDecodeError?

Beautiful Soup Unicode encode error

Categories

Resources