UTF-8 decoding an ANSI encoded file throws an error - python

Here's something I'm trying to understand. I was under the impression that UTF-8 was backwards compatible, so that I can always decode a text file with UTF-8, even if it's an ANSI file. But that doesn't seem to be the case:
In [1]: ansi_str = 'éµaØc'
In [2]: with open('test.txt', 'w', encoding='ansi') as f:
   ...:     f.write(ansi_str)
   ...:
In [3]: with open('test.txt', 'r', encoding='utf-8') as f:
   ...:     print(f.read())
   ...:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-3-b0711b7b947e> in <module>
1 with open('test.txt', 'r', encoding='utf-8') as f:
----> 2 print(f.read())
3
c:\program files\python37\lib\codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
So it looks like if my code expects UTF-8 and is likely to encounter an ANSI-encoded file, I need to handle the UnicodeDecodeError. That's fine, but I'd appreciate it if anyone could shed some light on my initial misunderstanding.
Thanks!

UTF-8 is backwards compatible with ASCII. Not ANSI. "ANSI" doesn't even describe any one particular encoding. And those characters you're testing with are well outside the ASCII range, so unless you actually encode them with UTF-8, you can't read them as UTF-8.
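If your code has to accept both, here's a minimal sketch of the fallback the question describes, assuming the legacy files are Windows-1252 (which is what "ANSI" usually means on US/Western European Windows):
def read_text(path):
    # Try UTF-8 first; fall back to cp1252 for legacy "ANSI" files.
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    except UnicodeDecodeError:
        with open(path, 'r', encoding='cp1252') as f:
            return f.read()

print(read_text('test.txt'))  # éµaØc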

Related

Unicode Encode Error : 'charmap' codec can't encode character '\ufb01' in position 2090: character maps to <undefined>

This is the code I'm trying to execute for extracting text from images and saving it to a path.
import os
from PIL import Image
import pytesseract

def main():
    path = r"D drive where images are stored"
    fullTempPath = r"D drive where extracted texts are stored in xls file"
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
        text = pytesseract.image_to_string(img, lang="eng")
        file1 = open(fullTempPath, "a+")
        file1.write(imageName + "\n")
        file1.write(text + "\n")
        file1.close()
        file2 = open(fullTempPath, 'r')
        file2.close()

if __name__ == '__main__':
    main()
I'm getting the error below; can someone help me with this?
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-7-fb69795bce29> in <module>
13 file2.close()
14 if __name__ == '__main__':
---> 15 main()
<ipython-input-7-fb69795bce29> in main()
8 file1 = open(fullTempPath, "a+")
9 file1.write(imageName+"\n")
---> 10 file1.write(text+"\n")
11 file1.close()
12 file2 = open(fullTempPath, 'r')
~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 2090: character maps to <undefined>
Tesseract is returning a string containing a character that your output file's encoding can't represent. It is possible to tell Python to ignore encoding errors. Try changing the line that opens the output file to the following:
file1 = open(fullTempPath, "a+", errors="ignore")
You may also see the suggestion to decode the text yourself before writing it:
text = text.decode('utf-8')
but that only applies to Python 2 byte strings; in Python 3 pytesseract already returns str, so the fix belongs on the output file's encoding instead.
The default file encoding used for open is the value returned by locale.getpreferredencoding(False), which on Windows is generally a legacy encoding that doesn't support all Unicode characters. In this case the error message indicates it was cp1252 (a.k.a. Windows-1252). Best to specify the encoding you want explicitly. UTF-8 handles all Unicode characters:
file1 = open(fullTempPath, "a+", encoding='utf8')
FYI, U+FB01 is LATIN SMALL LIGATURE FI (fi) if that makes any sense on the image being processed.
Also, Windows editors tend to assume the same legacy encoding unless the encoding is utf-8-sig which adds an encoded BOM character to the beginning of the file as an encoding hint that it is UTF-8.
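Putting the two points together, a sketch that opens the output once with an explicit encoding instead of reopening it per image (names are those from the question):
with open(fullTempPath, "w", encoding="utf-8") as out:   # or "utf-8-sig" for legacy Windows editors
    for imageName in os.listdir(path):
        img = Image.open(os.path.join(path, imageName))
        text = pytesseract.image_to_string(img, lang="eng")
        out.write(imageName + "\n")
        out.write(text + "\n")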

Error when loading an entire video file into memory

I am trying to load an entire video file into RAM for quick access. I want the file as it is, without any decoding; I just want to point to a location in RAM instead of a remote drive. The file is only 2 GB and I have 128 GB of RAM. I need to do frame-by-frame analysis, and reading from a server takes forever.
I thought I would do something like this:
with open('my_file.txt', 'r') as f:
    file_content = f.read()  # Read whole file into the file_content string
print(file_content)
But I get an error. Is there another way to do it? Like using the io library?
In [11]: u = open("/net/server/raw/2020.04.02/IMG_9261.MOV",'r')
In [12]: data = u.read()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-12-eecbc439fbf0> in <module>
----> 1 data = u.read()
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 31: invalid continuation byte
This example uses requests.get, but that only works for HTTP; I have a local server I can mount with NFS.
import requests
from pygame import mixer
import io

r = requests.get("http://example.com/somesmallmp3file.mp3")
inmemoryfile = io.BytesIO(r.content)
mixer.init()  # initialize the mixer before loading
mixer.music.load(inmemoryfile)
mixer.music.play()
Adding a 'b' for binary mode should make it work.
u = open("/net/server/raw/2020.04.02/IMG_9261.MOV",'rb')
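If you just want the bytes resident in RAM as a seekable file-like object, a minimal sketch (io is from the standard library; the path is the one from the question):
import io

with open("/net/server/raw/2020.04.02/IMG_9261.MOV", "rb") as f:
    inmemoryfile = io.BytesIO(f.read())   # one read over NFS, then everything is local

inmemoryfile.seek(0)             # rewind before handing it to a decoder
header = inmemoryfile.read(12)   # e.g. peek at the container header bytes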

Why does Python3 get a UnicodeDecodeError reading a text file where Python2 does not?

I'm reading in a text file. I've been doing it just fine with python2, but I decided to run my code with python3 instead.
My code for reading the text file is:
neg_words = []
with open('negative-words.txt', 'r') as f:
    for word in f:
        neg_words.append(word)
When I run this code on python 3 I get the following error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-14-1e2ff142b4c1> in <module>()
3 pos_words = []
4 with open('negative-words.txt', 'r') as f:
----> 5 for word in f:
6 neg_words.append(word)
7 with open('positive-words.txt', 'r') as f:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 3988: invalid continuation byte
It seems to me that there is a certain form of text that Python 2 decodes without any issue but Python 3 can't.
Could someone please explain the difference between Python 2 and Python 3 with respect to this error? Why does it occur in one version but not the other? How can I stop it?
Your file is not UTF-8 encoded. Figure out what encoding is used and specify that explicitly when opening the file:
with open('negative-words.txt', 'r', encoding="<correct codec>") as f:
In Python 2, str is a binary string containing encoded data, not Unicode text, which is why the mismatch goes unnoticed there. You'd hit the same issues in Python 2 if you used io.open(), or if you tried to decode the data you read with word.decode('utf8').
You probably want to read up on Unicode and Python. I strongly recommend Ned Batchelder's Pragmatic Unicode.
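If you don't know the codec, one way to guess it is the third-party chardet package (a sketch, assuming chardet is installed via pip install chardet):
import chardet

with open('negative-words.txt', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}; 'encoding' may be None
with open('negative-words.txt', 'r', encoding=guess['encoding']) as f:
    neg_words = [word for word in f]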
Or we can simply read the file in binary mode:
with open(filename, 'rb') as f:
    pass
'r' open for reading (default)
'b' binary mode
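You then get bytes objects back, so decode each line yourself; a sketch using an error handler so undecodable bytes become U+FFFD markers instead of exceptions:
with open('negative-words.txt', 'rb') as f:
    neg_words = [line.decode('utf-8', errors='replace') for line in f]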

UnicodeEncodeError: 'ascii' codec can't encode

I have the following data container, which is constantly being updated:
data = []
for val, track_id in zip(values, list(track_ids)):
    if val < threshold:
        # structure data as a dictionary
        pre_data = {"artist": sp.track(track_id)['artists'][0]['name'],
                    "track": sp.track(track_id)['name'],
                    "feature": filter_name,
                    "value": val}
        data.append(pre_data)

# write to file
with open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
but I am getting a lot of errors like this:
json.dump(data,f, ensure_ascii=False, indent=4, sort_keys=True)
File"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 6: ordinal not in range(128)
Is there a way I can get rid of this encoding problem once and for all?
I was told that this would do it:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
but many people do not recommend it.
I use python 2.7.10
any clues?
When you write to a file that was opened in text mode, Python encodes the string for you. The default encoding is ascii, which generates the error you see; there are a lot of characters that can't be encoded to ASCII.
The solution is to open the file in a different encoding. In Python 2 you must use the codecs module, in Python 3 you can add the encoding= parameter directly to open. utf-8 is a popular choice since it can handle all of the Unicode characters, and for JSON specifically it's the standard; see https://en.wikipedia.org/wiki/JSON#Data_portability_issues.
import codecs
with codecs.open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
Your object has unicode strings, and Python 2.x's support for unicode can be a bit spotty. First, let's make a short example that demonstrates the problem:
>>> obj = {"artist":u"Björk"}
>>> import json
>>> with open('deleteme', 'w') as f:
... json.dump(obj, f, ensure_ascii=False)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)
From the json.dump help text:
If ``ensure_ascii`` is true (the default), all non-ASCII characters in the
output are escaped with ``\uXXXX`` sequences, and the result is a ``str``
instance consisting of ASCII characters only. If ``ensure_ascii`` is
``False``, some chunks written to ``fp`` may be ``unicode`` instances.
This usually happens because the input contains unicode strings or the
``encoding`` parameter is used. Unless ``fp.write()`` explicitly
understands ``unicode`` (as in ``codecs.getwriter``) this is likely to
cause an error.
Ah! There is the solution. Either use the default ensure_ascii=True and get ASCII-escaped Unicode characters, or use the codecs module to open the file with the encoding you want. This works:
>>> import codecs
>>> with codecs.open('deleteme', 'w', encoding='utf-8') as f:
... json.dump(obj, f, ensure_ascii=False)
...
>>>
Why not encode the specific string instead? Try the .encode('utf-8') method on the string that is raising the exception.
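A sketch of that idea for this Python 2 case: dump to a string first, then encode it yourself before writing (outfile here stands in for the path built in the question):
# json.dumps returns unicode when ensure_ascii=False and the data
# contains non-ASCII text; encode it explicitly before fp.write().
json_str = json.dumps(data, ensure_ascii=False, indent=4, sort_keys=True)
with open(outfile, 'w') as f:
    f.write(json_str.encode('utf-8'))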

Except Python codec errors?

File "/usr/lib/python3.1/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 805: invalid start byte
Hi, I get this exception. How do I catch it and continue reading my files when it occurs?
My program has a loop that reads a text file line by line and tries to do some processing. However, some files I encounter may not be text files, or may have lines that are not properly formatted (foreign languages, etc.). I want to ignore those lines.
The following is not working:
for line in sys.stdin:
    if line != "":
        try:
            matched = re.match(searchstuff, line, re.IGNORECASE)
            print(matched)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue
Look at http://docs.python.org/py3k/library/codecs.html. When you open the codecs stream, you probably want to use the additional argument errors='ignore'
In Python 3, sys.stdin is by default opened as a text stream (see http://docs.python.org/py3k/library/sys.html), and has strict error checking.
You need to reopen it as an error-tolerant utf-8 stream. Something like this will work:
sys.stdin = codecs.getreader('utf8')(sys.stdin.detach(), errors='ignore')
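An equivalent sketch with io.TextIOWrapper, which avoids the codecs module:
import io
import sys

# Reopen stdin's underlying binary buffer as tolerant UTF-8 text (Python 3).
sys.stdin = io.TextIOWrapper(sys.stdin.detach(), encoding='utf-8', errors='ignore')

for line in sys.stdin:
    print(line, end='')   # undecodable bytes have been silently dropped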
