UnicodeEncodeError: handling special characters - python

I am trying to scrape a web page. To take care of all characters other than ASCII, I have written this code:
mydata = ''.join([i if ord(i) < 128 else ' ' for i in response.text])
and processed it further using the Beautiful Soup Python library. But this is not handling some special characters that are on the webpage, like [tick] and [star] (can't show a picture here).
Any clue on how to escape these characters and replace them with a space?
Right now I have this error
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 62: character maps to <undefined>

It's always preferable to process everything in Unicode, and convert to any specific encoding only before storage or transfer. For example,
s = u"Hi, привет, ciao"
> s
u'Hi, \u043f\u0440\u0438\u0432\u0435\u0442, ciao'
> s.encode('ascii', 'ignore')
'Hi, , ciao'
> s.encode('ascii', 'replace')
'Hi, ??????, ciao'
If you need to replace non-ASCII chars specifically with spaces, you can write and register your own conversion error handler; see codecs.register_error().
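For instance, a minimal sketch of such a handler (the name 'space_replace' is just an illustration):
import codecs

def space_replace(error):
    # error.start/error.end delimit the run of offending characters;
    # emit one space per character and resume right after the run
    return (u' ' * (error.end - error.start), error.end)

codecs.register_error('space_replace', space_replace)

s = u"Hi, привет, ciao"
print(s.encode('ascii', 'space_replace'))  # -> b'Hi,       , ciao'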

fp = open("output.txt","w")
gives you a file open for writing text using the default encoding, which in your case is an encoding that doesn't have the character ✓ (probably cp1252), hence the error. Open the file with an encoding that supports it and you'll be fine:
fp = open('output.txt', 'w', encoding='utf-8')
Note also that:
print("result: "+ str(ele))
can fail if your console doesn't support Unicode, which under Windows it likely will not. Use print(ascii(...)) to get an ASCII-safe representation for debugging purposes.
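For example, ascii() escapes anything outside ASCII:
print(ascii("result: ✓"))  # prints: 'result: \u2713'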
The probable reason your attempt to get rid of non-ASCII characters fails is that you are removing them before parsing the HTML, rather than from the values you get after parsing. So a literal ✓ in the markup would be removed, but if a character reference like &#10003; were used instead, it would pass your ASCII filter untouched, get parsed by bs4, and end up as ✓.
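To illustrate, a character reference is plain ASCII in the raw markup, so a pre-parse filter never sees anything to remove:
from bs4 import BeautifulSoup

# '&#10003;' passes an ASCII-only filter untouched...
soup = BeautifulSoup("<p>done &#10003;</p>", "html.parser")
print(soup.get_text())  # done ✓ -- parsed into a real U+2713 character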
(I am sad that the default reaction to Unicode errors always seems to be to try to get rid of non-ASCII characters completely, instead of fixing the code to handle them correctly.)
You're also extracting text in a pretty weird way, using str() to get markup and then trying to pick out the tags and remove them. This is unreliable—HTML is not that straightforward to parse, which is why BeautifulSoup is a thing—and needless because you already have a perfectly good HTML parser that can give you the pure text in an element (get_text()).

Most of your code is not necessary: requests is already doing the correct decoding for you, BeautifulSoup is doing the text extraction for you, and Python is doing the correct encoding for you when writing to a file:
import requests
from bs4 import BeautifulSoup
#keyterm = input("Enter a keyword to search:")
URL = 'https://www.google.com/search?q=jaguar&num=30'
#NO_OF_LINKS_TO_BE_EXTRACTED = 10
print("Requesting data from %s" % URL)
response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")
#print(soup.prettify())
metaM = soup.findAll("span","st")
#metaM = soup.find("div", { "class" : "f slp" })
with open("output.txt", "w", encoding='utf8') as fp:
for ele in metaM:
print("result: %r" % ele)
fp.write(ele.get_text().replace('\n', ' ') + '\n')

Related

Python - decode ('utf-8') issue

I am very new to Python. Please help me fix this issue.
I am trying to get the revenue from the link below :
https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898
I am using the commands below:
import re
import urllib.request
data=urllib.request.urlopen(url).read()
data1=data.decode("utf-8")
Issue :
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10798: invalid start byte
Maybe better with requests:
import requests
url = "https://www.google.co.in/?gfe_r...."
req = requests.get(url)
req.encoding = "utf-8"
data = req.text
The result of downloading the specific URL given in the question is HTML code. I was able to use BeautifulSoup to scrape the page after using the following Python code to get the data:
import requests
url = "https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898"
response = requests.get(url)
data = response.content.decode('utf-8', errors="replace")
print (data)
Please note that I used Python 3 in my code example. The syntax for print() may vary a little in Python 2.
0xa0, or in Unicode notation U+00A0, is the character NO-BREAK SPACE. In UTF-8 it is represented as b'\xc2\xa0'. If you find it as a raw byte, it probably means that your input is not UTF-8 encoded but Latin-1 encoded.
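An interactive check (Python 3) makes the difference visible:
>>> '\u00a0'.encode('utf-8')   # NO-BREAK SPACE is two bytes in UTF-8
b'\xc2\xa0'
>>> b'\xa0'.decode('latin1')   # a lone 0xa0 byte is valid Latin-1...
'\xa0'
>>> b'\xa0'.decode('utf-8')    # ...but not valid UTF-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte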
A quick look at the linked page shows that it is indeed Latin-1 encoded - but I got a French version...
The rule when you are not sure of the exact conversion is to use the 'replace' error handling:
data1=data.decode("utf-8", errors="replace")
then all offending characters are replaced with the REPLACEMENT CHARACTER (U+FFFD) (displayed as �). If only a few are found, that means the page contains erroneous characters; but if almost all non-ASCII characters are replaced, it means that the encoding is not UTF-8. It is commonly Latin-1 for West European languages, but your mileage may vary for other languages.
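As a rough sketch of that rule (the helper name decode_lenient and the 1% threshold are just illustrative, not any standard API):
def decode_lenient(raw):
    # Try UTF-8 first; if more than ~1% of the result is the
    # replacement character, assume the page is Latin-1 instead.
    text = raw.decode('utf-8', errors='replace')
    if text.count('\ufffd') > max(1, len(text) // 100):
        text = raw.decode('latin1')  # Latin-1 maps every byte, never fails
    return text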

BeautifulSoup4 cannot get the printing right. Python3

I'm currently learning Python 3. I am scraping a site for some data, which works fine, but when it comes to printing out the p tags I just can't get it to work as I expect.
import urllib
import lxml
from urllib import request
from bs4 import BeautifulSoup
data = urllib.request.urlopen('http://www.site.com').read()
soup = BeautifulSoup(data, 'lxml')
stat = soup.find('div', {'style': 'padding-left: 10px'})
dialog = stat.findChildren('p')
childlist = []
for child in dialog:
    childtext = child.get_text()
    #have tried child.string as well (exactly the same result)
    childlist.append(childtext.encode('utf-8', 'ignore'))
    #Have tried with str(childtext.encode('utf-8', 'ignore'))
print(childlist)
That all works, but the printing is "bytes"
b'This is a ptag.string'
b'\xc2\xa0' (probably a &nbsp;)
b'this is anotherone'
Real sample text that is ascii encoded:
b"Announcementb'Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"
Note that "Announcement" is in a p and the rest is in a 'strong' under a p tag.
The same sample with utf-8 encoding:
b"Announcement\xc2\xa0\xe2\x80\x93\xc2\xa0b'Firefox users may encounter browser warnings encountering SSL SHA-1 "
I WISH to get:
"Announcement"
(newline / new item in list)
"Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"
As you see, the incorrect chars are stripped in "ascii", but as some of them are &nbsp; that destroys some linebreaks, and I have yet to figure out how to print those correctly; also, the b's are still there!
I really can't figure out how to remove the b's and encode or decode properly. I have tried every "solution" that I can google up.
HTML Content = utf-8
I would rather not change the full data before processing, because it will mess up my other work and I don't think it is needed.
Prettify does not work.
Any suggestions?
First, you're getting output of the form b'stuff' because you are calling .encode(), which returns a bytes object. If you want to print strings for reading, keep them as strings!
As a guess, I assume you're looking to print strings from HTML nicely, pretty much as they would be seen in a browser. For that, you need to decode the HTML entities, as described in this SO answer, which for Python 3.5 means:
import html
html.unescape(childtext)
Among other things, this will convert any &nbsp; sequences in the HTML string into '\xa0' characters, which are printed as spaces. However, if you want to break lines on these characters despite them literally meaning "non-breaking space", you'll have to replace them with actual spaces before printing, e.g. using x.replace('\xa0', ' ').
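For example, with a string like the one in the question:
import html

s = 'Announcement&nbsp;&#8211;&nbsp;Firefox users may encounter browser warnings'
text = html.unescape(s)           # entities become real characters
print(text.replace('\xa0', ' '))  # Announcement – Firefox users ...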

LXML ValueError and UTF strings

I am making a little Python script for mass-editing of HTML files (replacing links to images etc.). Now, the HTML files contain some Cyrillic, which means I have to encode the string as UTF-8. I replace all the links in the HTML, call tag.set(data), and BOOM, the console displays:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters.
How can I fix this? I'm pretty sure that there aren't any control characters or NULL bytes. I'm using Python 2.7.11.
value = tag.get('value').encode('utf-8')
#h = HTMLParser.HTMLParser()
#value = h.unescape(value)
urls = regex.finditer(value)
if urls is None: continue
for turl in urls:
    ufile = turl.group().rsplit('/', 1)[-1]
    value = value.replace(turl.group(), '/'+newsrc+'/'+ufile)
#value = cgi.escape(value, True)
value = value.replace('\0', '')
tag.set('value', value)
It's easy: you only need to remove the encode('utf-8') part. You see, lxml doesn't like people messing with the character encoding of strings. Just leave it to lxml to convert the text into a suitable encoding and everything will be fine. :)
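In other words, a sketch of the loop from the question with the encode() call dropped (regex and newsrc as defined elsewhere in the script):
value = tag.get('value')  # keep it as a text string; no .encode('utf-8')
for turl in regex.finditer(value):
    ufile = turl.group().rsplit('/', 1)[-1]
    value = value.replace(turl.group(), '/' + newsrc + '/' + ufile)
value = value.replace('\0', '')
tag.set('value', value)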

output html in terminal error

I'm trying to print out HTML content in the following way:
from lxml import html
import requests
url = 'http://www.amazon.co.uk/Membrane-Protectors-FujiFilm-FinePix-SL1000/dp/B00D2UVI9C/ref=pd_sim_ph_3?ie=UTF8&refRID=06BDVRBE6TT4DNRFWFVQ'
page = requests.get(url)
print page.text
Then I execute python print_url.py > out, and I get the following error:
print page.text
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 113525: ordinal not in range(128)
Could anyone give me some idea? I had this problem before, but I couldn't figure it out.
Thanks
Your page.text is not in your local encoding. Instead it is probably unicode. To print the contents of page.text you must first encode them in the encoding that stdout expects:
import sys
print page.text.encode(sys.stdout.encoding)
The page contains non-ASCII unicode characters. You may get this error if you try to print to a shell that doesn't support them, or because you are redirecting the output to a file and an ASCII encoding is assumed for the output. I specify this because some shells will have no problem, while others will (my current shell/terminal defaults to utf8, for instance).
If you want the output to be encoded as utf8, you should explicitly encode it:
print page.text.encode('utf8')
If you want it to be encoded as something the shell can handle, or as ascii with non-printable characters removed or replaced, use one of these:
print page.text.encode(sys.stdout.encoding or "ascii", 'xmlcharrefreplace') - replace nonprintable characters with numeric entities
print page.text.encode(sys.stdout.encoding or "ascii", 'replace') - replace nonprintable characters with "?"
print page.text.encode(sys.stdout.encoding or "ascii", 'ignore') - replace nonprintable characters with nothing (delete them)
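For example, with the pound sign from the traceback:
>>> print u'Price: \xa3 10'.encode('ascii', 'xmlcharrefreplace')
Price: &#163; 10
>>> print u'Price: \xa3 10'.encode('ascii', 'replace')
Price: ? 10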

How to return plain text from Beautiful Soup instead of unicode

I am using BeautifulSoup4 to scrape this web page, however I'm getting the weird unicode text that BeautifulSoup returns.
Here is my code:
site = "http://en.wikipedia.org/wiki/"+a+"_"+str(b)
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
req.add_header('Accept-enconding', 'gzip') #Header to check for gzip
page = urllib2.urlopen(req)
if page.info().get('Content-Encoding') == 'gzip': #IF checks gzip
data = page.read()
data = StringIO.StringIO(data)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()
soup = BeautifulSoup(html, fromEncoding='gbk')
else:
soup = BeautifulSoup(page)
section = soup.find('span', id='Events').parent
events = section.find_next('ul').find_all('li')
print soup.originalEncoding
for x in events:
print x
Basically I want x to be in plain English. I get, instead, things that look like this:
<li>153 BC – Roman consuls begin their year in office.</li>
There's only one example in this particular string, but you get the idea.
Related: I go on to cut up this string with some regex and other string-cutting methods; should I switch this to plain text before or after I cut it up? I'm assuming it doesn't matter, but seeing as I'm deferring to SO anyways, I thought I'd ask.
If anyone knows how to fix this, I'd appreciate it. Thanks
EDIT: Thanks J.F. for the tip; I now use this after my for loop:
for x in events:
    x = x.encode('ascii')
    x = str(x)
    #Find Content
    regex2 = re.compile(">[^>]*<")
    textList = re.findall(regex2, x)
    text = "".join(textList)
    text = text.replace(">", "")
    text = text.replace("<", "")
    contents.append(text)
However, I still get things like this:
2013 – At least 60 people are killed and 200 injured in a stampede after celebrations at Félix Houphouët-Boigny Stadium in Abidjan, Ivory Coast.
EDIT:
Here is how I make my Excel spreadsheet (csv) and send in my list:
rows = zip(days, contents)
with open("events.csv", "wb") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
So the csv file is created during the program and everything is imported after the lists are generated. I just need it to be readable text at that point.
fromEncoding (which has been renamed to from_encoding for compliance with PEP8) tells the parser how to interpret the data in the input. What you (your browser or urllib) receive from the server is just a stream of bytes. In order to make sense of it, i.e. in order to build a sequence of abstract characters from this stream of bytes (this process is called decoding), one has to know how the information was encoded. This piece of information is required and you have to provide it in order to make sure that your code behaves properly. Wikipedia tells you how they encode the data, it's stated right at the top of the source of each of their web pages, e.g.
<meta charset="UTF-8" />
Hence, the bytestream received from Wikipedia's web servers should be interpreted with the UTF-8 codec. You should invoke
soup = BeautifulSoup(html, from_encoding='utf-8')
instead of BeautifulSoup(html, fromEncoding='gbk'), which tries to decode the bytestream with some Chinese character codec (I guess you blindly copied that piece of code from here).
You really need to make sure that you understand the basic concept of text encodings. Actually, you want unicode in the output, which is an abstract representation of a sequence of characters/symbols. In this context, there is no such thing as "plain English".
There is no such thing as plain text. What you see are bytes interpreted as text using an incorrect character encoding, i.e., the encoding of the strings is different from the one used by your terminal, unless the error was introduced earlier by using an incorrect character encoding for the web page.
print x calls str(x), which returns a UTF-8 encoded string for BeautifulSoup objects.
Try:
print unicode(x)
Or:
print x.encode('ascii')
