How do I catch Python codec errors? - python

File "/usr/lib/python3.1/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 805: invalid start byte
Hi, I get this exception. How do I catch it and continue reading my files when it occurs?
My program has a loop that reads a text file line by line and tries to do some processing. However, some of the files I encounter may not be text files, or may have lines that are not properly formatted (foreign languages, etc.). I want to ignore those lines.
The following is not working:
for line in sys.stdin:
    if line != "":
        try:
            matched = re.match(searchstuff, line, re.IGNORECASE)
            print(matched)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue

Look at http://docs.python.org/py3k/library/codecs.html. When you open the codecs stream, you probably want to use the additional argument errors='ignore'
In Python 3, sys.stdin is by default opened as a text stream (see http://docs.python.org/py3k/library/sys.html), and has strict error checking.
You need to reopen it as an error-tolerant utf-8 stream. Something like this will work:
sys.stdin = codecs.getreader('utf8')(sys.stdin.detach(), errors='ignore')
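Putting that together with the loop from the question, a minimal sketch (searchstuff is a placeholder here; the question never shows its value):
import codecs
import re
import sys

searchstuff = r'...'  # placeholder pattern, not from the original question

# Reopen stdin as a UTF-8 reader that silently drops undecodable bytes,
# so the loop never raises UnicodeDecodeError in the first place.
sys.stdin = codecs.getreader('utf8')(sys.stdin.detach(), errors='ignore')

for line in sys.stdin:
    if line != "":
        matched = re.match(searchstuff, line, re.IGNORECASE)
        print(matched)
With errors='ignore' in place, the try/except from the question is no longer needed.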

Related

Encoding 'UTF-8' throws an exception on macOS

I'm trying to read a GloVe file, glove.twitter.27B.200d.txt. I have the following function to read the file:
def glove_reader(glove_file):
    glove_dict = {}
    with open(glove_file, 'rt', encoding='utf-8') as glove_reader:
        for line in glove_reader:
            tokens = line.rstrip().split()
            vect = [float(token) for token in tokens[1:]]
            glove_dict[tokens[0]] = vect
    return glove_dict
The problem is that I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 0: invalid continuation byte
I tried latin-1, but it didn't work; it throws the following error:
ValueError: could not convert string to float: 'Ù\x86'
I also tried changing 'rt' to 'r' and 'rb'. I think it's a macOS problem, because on Windows it doesn't throw this error. Can someone please help me understand why I can't read this file?
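A common workaround (a sketch in the spirit of the errors='ignore' advice above, not an answer from the original thread) is to let the decoder drop bytes that are not valid UTF-8 and skip any lines that end up malformed:
def glove_reader(glove_file):
    glove_dict = {}
    # errors='ignore' silently drops undecodable bytes instead of raising
    with open(glove_file, 'rt', encoding='utf-8', errors='ignore') as reader:
        for line in reader:
            tokens = line.rstrip().split()
            if len(tokens) < 2:
                continue  # skip lines mangled by the dropped bytes
            try:
                vect = [float(token) for token in tokens[1:]]
            except ValueError:
                continue  # a non-numeric token slipped through; skip the line
            glove_dict[tokens[0]] = vect
    return glove_dict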

UTF-8 decoding an ANSI encoded file throws an error

Here's something I'm trying to understand. I was under the impression that UTF-8 was backwards compatible, so that I can always decode a text file with UTF-8, even if it's an ANSI file. But that doesn't seem to be the case:
In [1]: ansi_str = 'éµaØc'
In [2]: with open('test.txt', 'w', encoding='ansi') as f:
   ...:     f.write(ansi_str)
   ...:
In [3]: with open('test.txt', 'r', encoding='utf-8') as f:
   ...:     print(f.read())
   ...:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-3-b0711b7b947e> in <module>
1 with open('test.txt', 'r', encoding='utf-8') as f:
----> 2 print(f.read())
3
c:\program files\python37\lib\codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
So it looks like if my code expects UTF-8, and is likely to encounter an ANSI-encoded file, I need to handle the UnicodeDecodeError. That's fine, but I would appreciate if anyone could throw some light on my initial misunderstanding.
Thanks!
UTF-8 is backwards compatible with ASCII. Not ANSI. "ANSI" doesn't even describe any one particular encoding. And those characters you're testing with are well outside the ASCII range, so unless you actually encode them with UTF-8, you can't read them as UTF-8.
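You can see the mismatch directly in the byte values (a minimal sketch; cp1252 stands in for the Windows "ANSI" code page here):
>>> 'é'.encode('cp1252')  # one byte, outside the ASCII range
b'\xe9'
>>> 'é'.encode('utf-8')   # two bytes; UTF-8 only matches ASCII byte-for-byte
b'\xc3\xa9'
>>> b'\xe9'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: unexpected end of data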

Why does Python3 get a UnicodeDecodeError reading a text file where Python2 does not?

I'm reading in a text file. I've been doing it just fine with python2, but I decided to run my code with python3 instead.
My code for reading the text file is:
neg_words = []
with open('negative-words.txt', 'r') as f:
    for word in f:
        neg_words.append(word)
When I run this code on python 3 I get the following error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-14-1e2ff142b4c1> in <module>()
3 pos_words = []
4 with open('negative-words.txt', 'r') as f:
----> 5 for word in f:
6 neg_words.append(word)
7 with open('positive-words.txt', 'r') as f:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/codecs.py in
decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 3988: invalid continuation byte
It seems to me that there is a certain form of text that Python 2 decodes without any issue but Python 3 can't.
Could someone please explain what the difference is between Python 2 and Python 3 with respect to this error? Why does it occur in one version but not the other? How can I stop it?
Your file is not UTF-8 encoded. Figure out what encoding is used and specify it explicitly when opening the file:
with open('negative-words.txt', 'r', encoding="<correct codec>") as f:
In Python 2, str is a binary string containing encoded data, not Unicode text, so reading the file never decodes anything. If you were to use io.open() (via import io), you'd get the same issues, as you would if you tried to decode the data you read with word.decode('utf8').
You probably want to read up on Unicode and Python. I strongly recommend Ned Batchelder's Pragmatic Unicode.
Or we can simply read the file in binary mode:
with open(filename, 'rb') as f:
    pass
'r' open for reading (default)
'b' binary mode
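If you don't know the codec up front, one crude approach in the spirit of the answer above (a sketch; the candidate list is an assumption, not part of the original answer) is to try likely codecs in order:
def read_lines(filename, candidates=('utf-8', 'cp1252', 'latin-1')):
    for codec in candidates:
        try:
            with open(filename, 'r', encoding=codec) as f:
                return f.readlines()
        except UnicodeDecodeError:
            continue  # try the next candidate codec
    raise ValueError('none of the candidate codecs fit %s' % filename)
Note that latin-1 maps all 256 byte values, so it always "succeeds" and acts as a last-resort fallback; it just may not give you the characters the author intended.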

UnicodeDecodeError (UTF-8) for JSON

BLUF: Why is the decode() method on a bytes object failing to decode ç?
I am receiving a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position..... Upon tracking down the character, it is the ç character. So when I get to reading the response from the server:
conn = http.client.HTTPConnection(host = 'something.com')
conn.request('GET', url = '/some/json')
resp = conn.getresponse()
content = resp.read().decode() # throws error
I am unable to get the content. If I just do content = resp.read(), it succeeds, and I can write it to a file using wb, but then wherever the ç should be, the file contains the raw byte 0xE7. Even if I open the file in Notepad++ and set the encoding to UTF-8, the character only shows as the hex value.
Why am I not able to decode this UTF-8 character from an HTTPResponse? Am I not correctly writing it to file either?
When you have issues with encoding/decoding, you should take a look at the UTF-8 Encoding Debugging Chart.
If you look in the chart for the Windows-1252 code point 0xE7, you find the expected character is ç, which shows that the encoding is actually CP1252, not UTF-8.
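Given that, the fix implied by the answer is to decode the response explicitly as cp1252 (a minimal sketch; out.json is a made-up filename):
content = resp.read().decode('cp1252')  # 0xE7 -> 'ç' under Windows-1252
with open('out.json', 'w', encoding='utf-8') as f:
    f.write(content)  # re-saved as genuine UTF-8; Notepad++ will now show ç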

DecodeError in Paramiko Remote File

I have a large remote file that is generated automatically each day. I have no control over how the file is generated. I'm using Paramiko to open the file and then search through it to find if a given line matches a line in the file.
However, I'm receiving the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 57: invalid start byte
My code:
self.ssh = paramiko.SSHClient()
self.ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
self.ssh.connect(host, username=user, password=pass)
self.sftp_client = self.ssh.open_sftp()
self.remote_file = self.sftp_client.open(filepath, mode='r')
def checkPhrase(self, phrase):
    found = 0
    self.remote_file.seek(0)
    for line in self.remote_file:
        if phrase in line:
            found = 1
            break
    return found
I'm receiving the error at the line: for line in self.remote_file: Obviously there is a character in the file that is out of the range for utf8.
Is there a way to re-encode the line as it's read or to simply ignore the error?
So, files are bytes. They may or may not have some particular encoding. Additionally, paramiko is always returning bytes, since it ignores the 'b' flag that normal open functions take.
Instead, you should try and decode each line yourself. First, open the file in binary mode, then read a line, then try to decode that line as utf-8. If that fails, just skip the line.
def checkPhrase(self, phrase):
    self.remote_file.seek(0)
    for line in self.remote_file:
        try:
            decoded_line = line.decode('utf-8')  # decode from bytes to str
        except UnicodeDecodeError:
            continue  # keep processing the other lines; this one isn't valid utf-8
        if phrase in decoded_line:
            return True  # we found the phrase, so return True (instead of 1)
    return False  # we never found the phrase, so return False (instead of 0)
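For this to work, the lines have to come back as bytes; per the answer's first step, that means opening the remote file in binary mode (the answer notes Paramiko returns bytes either way, but being explicit doesn't hurt):
self.remote_file = self.sftp_client.open(filepath, mode='rb')  # iterating now yields bytes lines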
Additionally, I've found Ned Batchelder's Unipain PyCon talk immensely helpful in understanding bytes vs. unicode.
