Python - UnicodeDecodeError when trying to parse HTML file which is ASCII - python

Am using Python 2.7.6.
Have an HTML file which contains values prepended with "$". Wrote a program which takes in JSON data and replaces the values prepended with $ with the JSON values.
This was working fine until someone opened up the set of HTML files with a different editor and changed it from UTF-8 to ASCII.
class FileUtil:
#staticmethod
def replace_all(output_file, data):
homedir = os.path.expanduser("~")
dest_dir = homedir + "/dest_dir"
with open(output_file, "r") as my_file:
contents = my_file.read()
destination_file = dest_dir + "/" + data["filename"]
fp = open(destination_file, "w")
for key, value in data.iteritems():
contents = contents.replace("$" + str(key), value)
fp.write(contents)
fp.close()
Whenever my program encounters a file which is in ASCII it throws this error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 239, in process
return self.handle()
File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 230, in handle
return self._delegate(fn, self.fvars, args)
File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 420, in _delegate
return handle_class(cls)
File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 396, in handle_class
return tocall(*args)
FileUtil.replace_all(output_file, data)
File "/home/devuser/demo/utils/fileutils.py", line 11, in replace_all
contents = contents.replace("$" + str(key), value)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 54826: ordinal not in range(128)
Question(s):
Is there a way to make the contents value to be strictly UTF-8 in python?
Is it better to use a command line utility in Ubuntu Linux to convert the file before running this python script?
Is the error an encoding problem (e.g. file is ASCII and not UTF8)?

#Apalala
Thank you very much regarding chardet! It was a very useful tool.
#Ulrich Ekhardt
You are right, it is UTF-8 and not ASCII.
This was the solution:
iconv --from-code UTF-8 --to-code US-ASCII -c hello.htm > hello.html

Related

How do I determine which text in a corpus contains an error generated by the NLTK suite in Python?

I am trying to do some rudimentary corpus analysis with Python. I am getting the following error message(s):
Traceback (most recent call last):
File "<pyshell#28>", line 2, in <module>
print(len(poems.words(f)), f)
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 240, in __len__
for tok in self.iterate_from(self._toknum[-1]):
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 306, in iterate_from
tokens = self.read_block(self._stream)
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\plaintext.py", line 134, in _read_word_block
words.extend(self._word_tokenizer.tokenize(stream.readline()))
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1220, in readline
new_chars = self._read(readsize)
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1458, in _read
chars, bytes_decoded = self._incr_decode(bytes)
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1489, in _incr_decode
return self.decode(bytes, 'strict')
File "C:\Python38-32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 12: invalid start byte
My assumption is that there is a UTF error in one of the 202 text files I am looking at.
Is there any way of telling, from the error messages, which file or files have the problem?
Assuming that you know the files ids (the paths of your corpus files) you can open all of them with encoding="utf-8"
If you don't know the paths, assuming that you are using the nltk corpus loader, you can get them by using:
poems.fileids()
After that, for every file in your list of files (for example fileids) you can try:
for file_ in fileids:
try:
with open(file_, encoding="utf-8") a f_i:
f_i.readlines()
except:
print("You got problems with the file: ", file_)
Anyway, your loader has also a parameter named "encoding" that you can use for the correct encoding of your corpus. By default is set to "utf-8"
More details here: nltk corpus loader

Umlauts in JSON files lead to errors in Python code created by ANTLR4

I've created python modules from the JSON grammar on github / antlr4 with
antlr4 -Dlanguage=Python3 JSON.g4
I've written a main program "JSON2.py" following this guide: https://github.com/antlr/antlr4/blob/master/doc/python-target.md
and downloaded the example1.json also from github.
python3 ./JSON2.py example1.json # works perfectly, but
python3 ./JSON2.py bookmarks-2017-05-24.json # the bookmarks contain German Umlauts like "ü"
...
File "/home/xyz/lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 227: ordinal not in range(128)
The offending line in JSON2.py is
input = FileStream(argv[1])
I've searched stackoverflow and tried this instead of using the above FileStream:
fp = codecs.open(argv[1], 'rb', 'utf-8')
try:
input = fp.read()
finally:
fp.close()
lexer = JSONLexer(input)
stream = CommonTokenStream(lexer)
parser = JSONParser(stream)
tree = parser.json() # This is line 39, mentioned in the error message
Execution of this program ends with an error message, even if the input file doesn't contain Umlauts:
python3 ./JSON2.py example1.json
Traceback (most recent call last):
File "./JSON2.py", line 46, in <module>
main(sys.argv)
File "./JSON2.py", line 39, in main
tree = parser.json()
File "/home/x/Entwicklung/antlr/links/JSONParser.py", line 108, in json
self.enterRule(localctx, 0, self.RULE_json)
File "/home/xyz/lib/python3.5/site-packages/antlr4/Parser.py", line 358, in enterRule
self._ctx.start = self._input.LT(1)
File "/home/xyz/lib/python3.5/site-packages/antlr4/CommonTokenStream.py", line 61, in LT
self.lazyInit()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
self.setup()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 189, in setup
self.sync(0)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 111, in sync
fetched = self.fetch(n)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
t = self.tokenSource.nextToken()
File "/home/xyz/lib/python3.5/site-packages/antlr4/Lexer.py", line 111, in nextToken
tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'
This parses correctly:
javac *.java
grun JSON json -gui bookmarks-2017-05-24.json
So the grammar itself is not the problem.
So finally the question: How should I process the input file in python, so that lexer and parser can digest it?
Thanks in advance.
Make sure your input file is actually encoded as UTF-8. Many problems with character recognition by the lexer are caused by using other encodings. I just took a testbed application, added ëto the list of available characters for an IDENTIFIER and it works again. UTF-8 is the key -- and make sure your grammar also allows these characters where you want to accept them.
I solved it by passing the encoding info:
input = FileStream(sys.argv[1], encoding = 'utf8')
If without the encoding info, I will have the same issue as yours.
Traceback (most recent call last):
File "test.py", line 20, in <module>
main()
File "test.py", line 9, in main
input = FileStream(sys.argv[1])
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 20, in __init__
super().__init__(self.readDataFrom(fileName, encoding, errors))
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)
Where my input data is
[今明]天(台南|高雄)的?天氣如何

Terrible "invalid start byte" Unicode Error with Opening a CSV file

Please please please help. I've been strugglign with this for a while and ran into problem after problem. I'm just trying to make a loop that opens every csv file in a folder. Here's the loop:
folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams2/'
for file in os.listdir (folder):
with codecs.open(file, mode='rU', encoding='utf-8') as f:
m=min(int(line[1]) for line in csv.reader(f))
f.seek(0)
for line in csv.reader(f):
if int(line[1])==m:
print line
Here's the error:
Traceback (most recent call last):
File "findfirsttrigram.py", line 11, in <module>
m=min(int(line[1]) for line in csv.reader(f))
File "findfirsttrigram.py", line 11, in <genexpr>
m=min(int(line[1]) for line in csv.reader(f))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 684, in next
return self.reader.next()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 615, in next
line = self.readline()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 530, in readline
data = self.read(readsize, firstline=True)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x87 in position 0: invalid start byte
I got here because first I had a "Null Byte" error, which I solved with this post: "Line contains NULL byte" in CSV reader (Python)
Then I got an integer error, which I solved with this post "an integer is required" when open()'ing a file as utf-8?
Then I got an error that said: 'UnicodeException: UTF-16 stream does not start with BOM' which I solved with this post utf-16 file seeking in python. how?
Then I realized that the csv module requires utf-8 so here I am.
But I've finally hit the limit of the existing questions. I can't figure out what is going on. Please please help.
I'm not sure why but this ultimately worked:
import csv
import os
import unicodecsv
folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams3/'
for file in os.listdir (folder):
with open(os.path.join(folder, file), mode='rU') as f:
try:
m=min(int(line[1]) for line in unicodecsv.reader(f, encoding='utf-8', errors='replace'))
except:
print "one no work"
continue
f.seek(0)
for line in unicodecsv.reader(f):
if int(line[1])==m:
print line
Perhaps try using a os.walk along with using for files in files?
folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams2/'
for subdir, dirs, files in os.walk(folder):
for file in files:
with codecs.open(file, mode='rU', encoding='utf-16-be') as f:
#Your code here
Clearly then your file is not encoded in UTF-8. Try another encoding. If you're using Windows, 'mbcs' will use the default encoding for your version of Windows.

UnicodeDecodeError is raised when getting a cookie in Google App Engine

I have a GAE project in Python where I am setting a cookie in one of my RequestHandlers with this code:
self.response.headers['Set-Cookie'] = 'app=ABCD; expires=Fri, 31-Dec-2020 23:59:59 GMT'
I checked in Chrome and I can see the cookie listed, so it appears to be working.
Then later in another RequestHandler, I get the cookie to check it:
appCookie = self.request.cookies['app']
This line gives the following error when executed:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
It seems that it is trying to decode the incoming cookie info using an ASCII codec rather than UTF-8.
How do I force Python to use UTF-8 to decode this?
Are there any other Unicode-related gotchas that I need to be aware of as a newbie to Python and Google App Engine (but an experienced programmer in other languages)?
Here is the full Traceback:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4144, in _HandleRequest
self._Dispatch(dispatcher, self.rfile, outfile, env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4049, in _Dispatch
base_env_dict=env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 616, in Dispatch
base_env_dict=base_env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3120, in Dispatch
self._module_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3024, in ExecuteCGI
reset_modules = exec_script(handler_path, cgi_path, hook)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2887, in ExecuteOrImportScript
exec module_code in script_module.__dict__
File "/Users/ken/hgdev/juicekit/main.py", line 402, in <module>
main()
File "/Users/ken/hgdev/juicekit/main.py", line 399, in main
run_wsgi_app(application)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
run_bare_wsgi_app(add_wsgi_middleware(application))
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 116, in run_bare_wsgi_app
result = application(env, _start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 721, in __call__
response.wsgi_write(start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 296, in wsgi_write
body = self.out.getvalue()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/StringIO.py", line 270, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
You're looking to use the decode function somewhat like this (cred #agf:):
self.request.cookies['app'].decode('utf-8')
From official python documentation (plus a couple added details):
Python’s 8-bit strings have a .decode([encoding], [errors]) method that interprets the string using the given encoding. The following example shows the string as it goes to unicode and then back to 8-bit string:
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> type(u), u # Examine
(<type 'unicode'>, u'\ua000abcd\u07b4')
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version # Examine
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
First, encode any unicode value you set in the cookies. You also need to quote them in case they can break the header:
import urllib
# This is the value we want to set.
initial_value = u'äëïöü'
# WebOb version that comes with SDK doesn't quote cookie values
# in the Response, neither webapp.Response. So we have to do it.
quoted_value = urllib.quote(initial_value.encode('utf-8'))
rsp = webapp.Response()
rsp.headers['Set-Cookie'] = 'app=%s; Path=/' % quoted_value
Now let's read the value. To test it, create a fake Request to test the cookie we have set. This code was extracted from a real unittest:
cookie = rsp.headers.get('Set-Cookie')
req = webapp.Request.blank('/', headers=[('Cookie', cookie)])
# The stored value is the same quoted value from before.
# Notice that here we use .str_cookies, not .cookies.
stored_value = req.str_cookies.get('app')
self.assertEqual(stored_value, quoted_value)
Our value is still encoded and quoted. We must do the reverse to get the initial one:
# And we can get the initial value unquoting and decoding.
final_value = urllib.unquote(stored_value).decode('utf-8')
self.assertEqual(final_value, initial_value)
If you can, consider using webapp2. webob.Response does all the hard work of quoting and setting cookies, and you can set unicode values directly. See a summary of these issues here.

UnicodeDecodeError when reading dictionary words file with simple Python script

First time doing Python in a while, and I'm having trouble doing a simple scan of a file when I run the following script with Python 3.0.1,
with open("/usr/share/dict/words", 'r') as f:
for line in f:
pass
I get this exception:
Traceback (most recent call last):
File "/home/matt/install/test.py", line 2, in <module>
for line in f:
File "/home/matt/install/root/lib/python3.0/io.py", line 1744, in __next__
line = self.readline()
File "/home/matt/install/root/lib/python3.0/io.py", line 1817, in readline
while self._read_chunk():
File "/home/matt/install/root/lib/python3.0/io.py", line 1565, in _read_chunk
self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
File "/home/matt/install/root/lib/python3.0/io.py", line 1299, in decode
output = self.decoder.decode(input, final=final)
File "/home/matt/install/root/lib/python3.0/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1692: invalid data
The line in the file it blows up on is "Argentinian", which doesn't seem to be unusual in any way.
Update: I added,
encoding="iso-8559-1"
to the open() call, and it fixed the problem.
How have you determined from "position 1689-1692" what line in the file it has blown up on? Those numbers would be offsets in the chunk that it's trying to decode. You would have had to determine what chunk it was -- how?
Try this at the interactive prompt:
buf = open('the_file', 'rb').read()
len(buf)
ubuf = buf.decode('utf8')
# splat ... but it will give you the byte offset into the file
buf[offset-50:60] # should show you where/what the problem is
# By the way, from the error message, looks like a bad
# FOUR-byte UTF-8 character ... interesting
Can you check to make sure it is valid UTF-8? A way to do that is given at this SO question:
iconv -f UTF-8 /usr/share/dict/words -o /dev/null
There are other ways to do the same thing.

Categories