UnicodeDammit: detwingle crashes on a website - python

I’m scraping websites and using BeautifulSoup4 to parse them. As the websites can have really random charsets, I use UnicodeDammit.detwingle to ensure that I feed proper data to BeautifulSoup. It worked fine... until it crashed. One website causes the code to break. The code to build "soup" looks like this:
u = bs.UnicodeDammit.detwingle( html_blob )   # <--- here it crashes
u = bs.UnicodeDammit( u.decode('utf-8'),
                      smart_quotes_to='html',
                      is_html = True )
u = u.unicode_markup
soup = bs.BeautifulSoup( u )
And the error (the standard Python/Unicode hell duo):
  File ".../something.py", line 92, in load_bs_from_html_blob
    u = bs.UnicodeDammit.detwingle( html_blob )
  File ".../beautifulsoup4-4.1.3-py2.7.egg/bs4/dammit.py", line 802, in detwingle
    return b''.join(byte_chunks)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0:
ordinal not in range(128)
The offending website is this one
Question: How to make a proper and bulletproof website source decoding?

This website is not a special case in terms of character encoding at all: it is perfectly valid UTF-8, and even the HTTP header sets the charset correctly. It follows that your code would have crashed on any UTF-8 website containing code points beyond ASCII.
It is also evident from the documentation that UnicodeDammit.detwingle takes a unicode string. You are passing it html_blob, and the variable naming suggests that it's not a decoded unicode string; that is the misunderstanding.
Handling arbitrary website encodings is not trivial when the HTTP header or the markup lies about the encoding, or omits it altogether. You need to apply various heuristics, and even then you won't always get it right. But this website sends the charset header correctly and is encoded correctly in that charset.
Interesting trivia: the only beyond-ASCII text on the website is in these JavaScript comments (after being decoded as UTF-8):
image = new Array(4); //¶¨ÒåimageΪͼƬÊýÁ¿µÄÊý×é
image[0] = 'sample_BG_image01.png' //±³¾°Í¼ÏóµÄ·¾¶
If you then encode those as ISO-8859-1 and decode the result as GB2312, you get:
image = new Array(4); //定义image为图片数量的数组
image[0] = 'sample_BG_image01.png' //背景图象的路径
Which Google Translate (Chinese → English) renders as:
image = new Array(4); //Defined image of the array of the number of images
image[0] = 'sample_BG_image01.png' //The path of the background image
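The round trip described above can be reproduced directly. This is a Python 3 sketch (the original question is Python 2); the sample text is the start of the first comment quoted above:

```python
# The mis-decoded text as it appeared after the wrong UTF-8/Latin-1 round trip:
mojibake = '¶¨Òå'
# Re-encoding as ISO-8859-1 recovers the original raw bytes...
raw = mojibake.encode('iso-8859-1')
# ...which decode correctly as GB2312:
restored = raw.decode('gb2312')  # 定义 ("define")
```

This works because ISO-8859-1 maps each code point below 0x100 straight back to its byte value, so no information is lost on the way through.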

Related

Codec error while reading a file in python - 'charmap' codec can't decode byte 0x81 in position 3124: character maps to <undefined>

I am working on a Machine Learning Project which filters spam/phishing emails out of all emails. For this, I am using the SpamAssassin dataset. The dataset contains different mails in this format:
For identifying phishing emails, first thing I have to do is finding out how many web-links the email has. For doing that, I have written the following code:
import os
import re
import csv

wordsInLine = []
tempWord = []
urlList = []
base_dir = "C:/Users/keert/Downloads/Spam_Assassin/spam"

def count():
    flag = 0
    print("Reading all file names in sorted order")
    for filename in sorted(os.listdir(base_dir)):
        file = open(os.path.join(base_dir, filename))
        count1 = 0
        for line in file:
            wordsInLine = line.split(' ')
            for word in wordsInLine:
                if re.search('href="http', word, re.I):
                    count1 = count1 + 1
        file.close()
        urlList.append(count1)
        if flag != 0:
            print("File Name = " + filename)
            print("Number of links = ", count1)
        flag = flag + 1

count()
final = urlList[1:]
print("List of number of links in each email")
print(final)
with open('count_links.csv', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for val in final:
        wr.writerow([val])
print("CSV file generated")
print("CSV file generated")
But this code is giving me an error: 'charmap' codec can't decode byte 0x81 in position 3124: character maps to <undefined>
I have even tried opening the file with the encoding='utf8' option. But the crash remains, and I get an error like: 'utf-8' codec can't decode byte 0x81 in position 3124
I guess this is due to the special characters in the file. Is there any way to deal with this? I cannot skip the special characters, as they are also important. Please suggest a way of doing this. Thank you in advance.
You have to open and read the file using the same encoding that was used to write it. In this case, that might be a bit difficult, since you are dealing with e-mails, and they can be in any encoding, depending on the sender. In the example file you showed, the message is encoded using 'iso-8859-1' encoding.
However, e-mails are a bit strange, since they consist of a header (which is in ASCII format as far as I know), followed by an empty line and the body. The body is encoded in the encoding that was specified in the header. So two different encodings could be used in the same file!
If you're sure that all the e-mails use iso-8859-1 encoding and you're looking for a quick-and-dirty solution, then you could also just open the file using 'iso-8859-1' encoding, since e-mail headers are compatible with iso-8859-1. However, be prepared that you will have to deal with other e-mail formatting/encoding/escaping issues as well, or your script might not work completely as expected.
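A quick sketch of why the quick-and-dirty route never crashes (Python 3; the byte values are invented for illustration): latin-1 assigns every byte value 0x00-0xFF a code point, so decoding with it cannot raise, even for bytes like 0x81 that stricter codecs reject:

```python
# Hypothetical mail fragment, including the 0x81 byte that 'charmap' and 'utf-8' reject:
data = b'Caf\xe9 \x81'
# latin-1 maps every byte 0x00-0xFF to a code point, so this never raises:
text = data.decode('iso-8859-1')
# Nothing is lost, though bytes outside the file's real charset may show as wrong characters.
```

The trade-off is exactly that last comment: you always get *some* string back, but only the bytes that really were latin-1 come out as the intended characters.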
I think the best solution would be to look for a Python module that can handle e-mails, so it will deal with all the decoding stuff and you don't have to worry about that. It will also solve other problems such as escape characters and line breaks.
I don't have experience with this myself, but it seems that Python has built-in support for parsing e-mails via the email package. I recommend taking a look at it.
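A minimal sketch of that approach in Python 3 (the message bytes here are invented for illustration): email.message_from_bytes parses the headers, and get_content() decodes the body using the charset declared in them, so you never pick the codec yourself:

```python
import email
from email import policy

# A made-up message; the body bytes are ISO-8859-1, as the header declares.
raw = (b"Subject: test\r\n"
       b"Content-Type: text/plain; charset=iso-8859-1\r\n"
       b"\r\n"
       b"Caf\xe9\r\n")
msg = email.message_from_bytes(raw, policy=policy.default)
body = msg.get_content()  # decoded using the charset declared in the header
```

With policy=policy.default the parser returns an EmailMessage, whose get_content() handles the charset (and transfer encoding) for you; that is the part you would otherwise have to reimplement by hand.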

python UnicodeDecodeEorror, even I use decode('utf-8')

I use Python to read some HTML; the page contains some Japanese and Chinese characters. The code is:
response = urllib.urlopen(pageurl).read()
When I print the response, Python gives me a decode error,
so I changed the code to:
response = urllib.urlopen(pageurl).read().decode("utf-8")
Python still tells me that:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd1 in position...
What should I do?
By the way, the HTML charset is gb2312...
If the page is using GB2312, it is not compatible with UTF-8, but it can be considered a subset of GBK, which is supported by the Python decoder. Therefore, you should try response = urllib.urlopen(pageurl).read().decode("gbk") instead.
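The difference is easy to demonstrate (a Python 3 sketch; the sample text is invented): GB2312 byte sequences are invalid as UTF-8, but GBK, being a superset of GB2312, decodes them without trouble:

```python
# These two characters encode to b'\xb6\xa8\xd2\xe5' in GB2312.
data = '定义'.encode('gb2312')
try:
    data.decode('utf-8')        # fails: 0xb6 is not a valid UTF-8 start byte
except UnicodeDecodeError:
    pass
text = data.decode('gbk')       # works: GBK covers every GB2312 sequence
```

This is why decoding as "gbk" is the usual recommendation even when the page declares gb2312: you get the same result for genuine GB2312 text, plus coverage for the extra characters sloppy pages mix in.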

Save file with any language using unicode

I'm creating a simple script that takes a list of images as input and outputs a pdf file, using the Reportlab pdf-generation module. The script takes the filename as shown below:
from reportlab.pdfgen import canvas
filename = raw_input("Enter pdf filename: ")
c = canvas.Canvas(filename + ".pdf")
c.save()
Everything is awesome until the user inputs a non-English filename (Hebrew, Arabic), which causes the code to throw the following exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf9 in position 0: invalid start byte
So, I decided to use unicode instead, but when I use unicode() it throws another exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf9 in position 0: ordinal not in range(128)
However, when I decode the string with the right encoding, it works like a charm (Hebrew example):
from reportlab.pdfgen import canvas
filename = raw_input("Enter pdf filename: ")
filename = filename.decode("windows-1255")
c = canvas.Canvas(filename + ".pdf")
c.save()
I continued to try other methods, and found that if I prefix the string with u, as in the example below, it works in any language:
from reportlab.pdfgen import canvas
filename = u"أ" #arabic
c = canvas.Canvas(filename + ".pdf")
c.save()
The problem is that I don't know what encoding I should use; the input string could be in any language. What can I do to fix it? In other words: how can I add u before a string without specifying the encoding?
PS: If you have a better title, please write it down below.
Edit: The filename is actually provided from a website (I use urllib). I didn't think it mattered, and I used raw_input() to make the problem clearer. Sorry for that.
raw_input() strings are encoded by the terminal or console, so you'd have to ask the terminal or console for the right codec to use.
Python has already done this at startup time, and stored the codec in sys.stdin.encoding:
import sys
filename = raw_input("Enter pdf filename: ")
filename = filename.decode(sys.stdin.encoding)
From the comments you indicated that the filename is not actually sourced from raw_input(). For different sources, you'll need to use different techniques to detect the character set used.
For example, HTTP responses may include a charset parameter in the Content-Type header; a urllib or urllib2 response lets you extract that with:
encoding = response.info().getparam('charset')
This can still return None, at which point it depends on the exact mimetype returned. The default for text/ mimetypes (such as HTML) is Latin-1, but the HTML standard also allows for <meta> headers in the document itself to tell you the character set used. For HTML, I'd use BeautifulSoup to parse the response; it'll detect the character set for you.
Without more information on how you actually load the filename from a URL, however, I cannot say anything more specific.
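In Python 3 the same information comes from response.headers.get_content_charset(). The header parsing can be sketched without a network call, since urllib responses expose their headers as an email.message.Message (the header value here is invented for illustration):

```python
from email.message import Message

# urllib responses expose their headers as an email.message.Message;
# a made-up Content-Type header stands in for a real response:
headers = Message()
headers['Content-Type'] = 'text/html; charset=gb2312'
charset = headers.get_content_charset()  # the charset parameter, lowercased, or None
```

When get_content_charset() returns None you fall back to the mimetype default or to sniffing the document itself, exactly as described above.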
OK, I got the solution! Once I got the text from the server, I parsed it using BeautifulSoup (thank you @Martijn Pieters!), which has a charset-detection library:
resp = urllib2.urlopen("http://example.com").read()
soup = BeautifulSoup(resp)
string = soup.find_all("span")[0].text
And then I just used string as the file name:
c = canvas.Canvas(path + "/" + string + ".pdf")
The full credit goes to @Martijn Pieters, who recommended me to use BS.
This is not the first HTML-parsing script I've written, and I've always used regex. I highly recommend anyone to use BeautifulSoup instead; trust me, it's much better than regex.

Python ─ UTF-8 filename from HTML form via CherryPy

Python Header: #!/usr/bin/env python
# -*- coding: utf-8 -*-
# image_upload.py
Cherrypy Config: cherrypy.config.update(
    {'tools.encode.on': True,
     'tools.encode.encoding': 'utf-8',
     'tools.decode.on': True,
     },)
HTML Header: <head><meta http-equiv="Content-Type"
content="text/html;charset=ISO-8859-1"></head>
""" Python 2.7.3
Cherrypy 3.2.2
Ubuntu 12.04
"""
With an HTML form, I'm uploading an image file to a database. That works so far without problems. However, if the filename is not 100% in ASCII, there seems to be no way to retrieve it in UTF-8. This is weird, because with the HTML text input fields it works without problems, from saving until showing. Therefore I assume that it's an encoding or decoding problem with the web application framework CherryPy, because the upload is handled by it, like here.
How it works:
The HTML form POSTs the uploaded file to another Python function, which receives the file in the standard dictionary **kwargs. From here you get the filename with extension, like this: filename = kwargs['file'].filename. But it already has the wrong encoding. Up to this point the image hasn't been processed, stored, or used in any way.
I'm asking for a solution that avoids parsing the filename and changing it back "manually". I guess the result is already in UTF-8, which makes it cumbersome to get right. That's why getting CherryPy to do it might be the best way. But maybe it's even an HTML issue, because the file comes from a form.
Here are the wrongly decoded umlauts; what I need is the input as the result.
input → result input → result
ä → ä Ä → Ä
ö → ö Ö → Ö
ü → ü Ü → Ãœ
Following are the failed attempts to get the right result, which would be: "Würfel"
NOTE: img_file = kwargs['file']
original attempt:
result = img_file.filename.rsplit('.',1)[0]
result: "Würfel"
change system encoding:
reload(sys)
sys.setdefaultencoding('utf-8')
result: "Würfel"
encoding attempt 1:
result = img_file.filename.rsplit('.',1)[0].encode('utf-8')
result: "Würfel"
encoding attempt 2:
result = unicode(img_file.filename.rsplit('.',1)[0], 'utf-8')
Error Message:
TypeError: decoding Unicode is not supported
decoding attempt:
result = img_file.filename.rsplit('.',1)[0].decode('utf-8')
Error Message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
cast attempt:
result = str(img_file.filename.rsplit('.',1)[0])
Error Message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)
Trying with your string, it seems I can get the filename back using latin1 encoding:
>>> s = u'W\xc3\xbcrfel.jpg'
>>> print s.encode('latin1')
Würfel.jpg
>>>
You simply need to use that .encode('latin1') before splitting.
But the problem here is broader. You really need to figure out why your web encoding is latin1 instead of utf8. I don't know CherryPy, but try to ensure you use utf8, or you could run into other glitches when serving your application through a webserver like Apache or nginx.
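The repair can also be written as an explicit round trip (a Python 3 sketch of the same idea; in the Python 2 snippet above, printing the latin1-encoded bytes in a UTF-8 terminal has the same visible effect):

```python
# 'Würfel.jpg' whose UTF-8 bytes were mis-decoded as latin1 somewhere in the stack:
mojibake = 'W\xc3\xbcrfel.jpg'
# Encoding as latin1 recovers the original bytes; decoding as UTF-8 restores the text:
fixed = mojibake.encode('latin1').decode('utf-8')  # 'Würfel.jpg'
```

Adding the .decode('utf-8') step gives you a proper text string rather than raw bytes, which is what you want before splitting off the extension.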
The problem is that you serve your HTML with charset ISO-8859-1; this confuses browsers, and they use that charset when sending data back to the server as well. Serve all your HTML always as UTF-8, code in UTF-8, and set your terminal to UTF-8, and you shouldn't have problems.

How to open an ascii-encoded file as UTF8?

My files are in US-ASCII, and a command like a = file('main.html') followed by a.read() loads them as ASCII text. How do I get it to load as UTF-8?
The problem I am trying to solve is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)
I was using the content of the files for templating, as in template_str.format(attrib=val). But the string to interpolate contains characters from a superset of ASCII.
Our team's version control and text editors do not care about the encoding. So how do I handle it in the code?
You are opening files without specifying an encoding, which means that Python uses the default value (ASCII).
You need to decode the byte-string explicitly, using the .decode() function:
template_str = template_str.decode('utf8')
The val variable you tried to interpolate into your template is itself a unicode value, and Python wants to automatically convert your byte-string template (read from the file) into a unicode value too, so that it can combine both; it'll use the default encoding to do so.
Did I mention already that you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.
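A Python 3 sketch of that fix, with an invented template and value: decode the byte string yourself before formatting, and the implicit ASCII conversion never happens:

```python
# A made-up template and value for illustration.
template_bytes = b'Hello, {attrib}!'   # what you'd read from the ASCII file in binary mode
val = 'caf\xe9'                        # a unicode value with a beyond-ASCII character
# Decode the template explicitly before mixing it with unicode values:
template_str = template_bytes.decode('ascii')
result = template_str.format(attrib=val)  # 'Hello, café!'
```

The key point is that the decode happens at a boundary you control, with a codec you chose, instead of deep inside format() with the ASCII default.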
A solution working in Python2:
import codecs
fo = codecs.open('filename.txt', 'r', 'ascii')
content = fo.read() ## returns unicode
assert type(content) == unicode
fo.close()
utf8_content = content.encode('utf-8')
assert type(utf8_content) == str
I suppose that you are sure that your files are encoded in ASCII. Are you? :) As ASCII is included in UTF-8, you can decode this data using UTF-8 without expecting problems. However, when you are sure that the data is just ASCII, you should decode the data using just ASCII and not UTF-8.
"How do I get it to load as UTF8?"
I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type unicode. In Python 3, the resulting data will be of type str.
You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.