Encoding 'UTF-8' throws an exception in macOS - python

I'm trying to read a GloVe file: glove.twitter.27B.200d.txt. I use the following function to read the file:
def glove_reader(glove_file):
    glove_dict = {}
    with open(glove_file, 'rt', encoding='utf-8') as glove_reader:
        for line in glove_reader:
            tokens = line.rstrip().split()
            vect = [float(token) for token in tokens[1:]]
            glove_dict[tokens[0]] = vect
    return glove_dict
The problem is that I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 0: invalid continuation byte
I tried latin-1, but it didn't work either; it throws the following error:
ValueError: could not convert string to float: 'Ù\x86'
I also tried changing 'rt' to 'r' and 'rb'. I think it is a macOS problem, because Windows didn't throw this error. Can someone please help me understand why I can't read this file?
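The error message says the very first byte, 0xea, is an invalid continuation byte, so whatever is on disk does not start with valid UTF-8. The published GloVe files are plain UTF-8 text, so this usually points to a corrupted or partially downloaded copy rather than a macOS problem. One way to check is to read a few raw bytes without decoding; a minimal diagnostic sketch (the path is a placeholder for your local copy):
glove_file = 'glove.twitter.27B.200d.txt'  # placeholder path

# Binary mode performs no decoding, so this cannot raise UnicodeDecodeError.
with open(glove_file, 'rb') as f:
    head = f.read(32)
print(head)  # a healthy file starts with an ASCII token followed by numbers
If the bytes look like binary garbage, re-download the file; as a stopgap, errors='replace' in the text-mode open lets the read continue at the cost of mangling the affected characters.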

Related

unicode decode error while importing Medical Data on pandas

I tried importing some medical data and ran into this unicode error. Here is my code:
import glob
import os
import pandas as pd

output_path = r"C:/Users/muham/Desktop/AI projects/cancer doc classification"
my_file = glob.glob(os.path.join(output_path, '*.csv'))
for files in my_file:
    data = pd.read_csv(files)
    print(data)
My error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 3314: invalid start byte
Try other encodings; the default one is utf-8. For example:
import pandas
pandas.read_csv(path, encoding="cp1252")
or ascii, latin1, etc.
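If you don't know the encoding up front, one common approach is to try a few candidates in order and keep the first one that decodes cleanly. A minimal sketch, assuming the CSV paths from the question:
import pandas as pd

# Candidate encodings, tried in order. cp1252 is a common culprit for
# Windows-exported CSVs (0x93 is a curly quote there), and latin1 maps
# every byte value, so it always succeeds as a last resort.
candidates = ['utf-8', 'cp1252', 'latin1']

def read_csv_any(path):
    for enc in candidates:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next one
    raise ValueError('no candidate encoding could decode ' + path)
Bear in mind that latin1 never fails, so a clean read does not guarantee the text was interpreted correctly; inspect the result.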

Decode compressed bytes to utf-8 in Python?

I have a variable in base64 format:
variable = b'gAN9cQAoGdS4='
To store it in a database I decode it:
st1 = variable.decode('utf-8')
st1
Out[35]: 'gAN9cQAoGdS4='
Now I have a very large variable, more than 4 GB, so I compress it using zlib:
import zlib
variable_comp = zlib.compress(variable)
Now, to store it in the database, I can't decode it:
st1 = variable_comp.decode('utf-8')
I get
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte
so I tried
st2 = variable_comp.decode('utf-8', errors="ignore")
But when I decompress it, I get an error:
variable_decomp = zlib.decompress(st2)
TypeError: a bytes-like object is required, not 'str'
May I know the fix for this? Would gzip fix it?
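Compressed output is arbitrary binary data, not UTF-8 text, so decoding it either fails or, with errors="ignore", silently destroys bytes; gzip output has the same property, so switching to gzip will not help. The usual fix is to base64-encode the compressed bytes, which turns them into plain ASCII that can safely be stored as text. A minimal sketch (the payload is a placeholder):
import base64
import zlib

payload = b'some large payload' * 1000  # placeholder data

# Compress, then base64-encode: base64 output is pure ASCII, so it can be
# decoded to str and stored in a text column safely.
variable_comp = zlib.compress(payload)
st1 = base64.b64encode(variable_comp).decode('ascii')

# Reverse the steps to recover the original bytes.
restored = zlib.decompress(base64.b64decode(st1))
assert restored == payload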

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8e in position 1

This is my first post here, so excuse me if I miss anything.
I have some data in my CSV file that I'm trying to import into my production database, and I'm getting a UnicodeDecodeError. The CSV file contains some French words.
Code:
open_csv = csv.DictReader(open('filename', 'rb'))
for i in open_csv:
    x = find(where={})  # mongodb query
    x.something = i.get(row_header)
    x.save()
I am getting UnicodeDecodeError: 'utf8' codec can't decode byte 0x8e in position 1 while saving the data.
I would suggest you try the following code:
import codecs

# Open with an explicit encoding; substitute the file's real encoding
# (0x8e is not valid UTF-8, so latin-1 here is only a common guess).
open_csv = csv.DictReader(codecs.open('filename', 'r', encoding='latin-1'))
for i in open_csv:
    x = find(where={})  # mongodb query
    x.something = i.get(row_header)
    x.save()
I work in Python 3.x but this should work in 2.x too if that is what you are using.
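In Python 3 you can also skip codecs entirely and pass the encoding straight to the built-in open; the csv docs recommend newline='' for files handed to a reader. A sketch along those lines (latin-1 is an assumption, substitute the file's real encoding):
import csv

# Python 3: open the file as text with an explicit encoding; newline=''
# is the csv module's recommended setting for its readers and writers.
with open('filename', 'r', encoding='latin-1', newline='') as f:
    for row in csv.DictReader(f):
        print(row)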

Error reading file with accented vowels

The following statement to fill a list from a file:
action = []
with open(os.getcwd() + "/files/" + "actions.txt") as temp:
    action = list(temp)
gives me the following error:
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 67: invalid continuation byte
If I add errors='ignore':
action = []
with open(os.getcwd() + "/files/" + "actions.txt", errors='ignore') as temp:
    action = list(temp)
The file is read, but the ñ and the accented vowels á-é-í-ó-ú are lost, even though Python 3, as I understand it, defaults to 'utf-8'.
I've been looking for a solution for two or more days, and I'm only getting more confused.
Thank you very much in advance for any suggestions.
You should use codecs to open the file with the correct encoding.
import codecs

with codecs.open(os.getcwd() + "/files/" + "actions.txt", "r", encoding="utf8") as temp:
    action = list(temp)
See the codecs docs
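For what it's worth, byte 0xf1 is 'ñ' in Latin-1 and cp1252, so the file was most likely saved in one of those encodings rather than UTF-8. Opening with the matching codec keeps the accented characters instead of discarding them; a minimal sketch:
import os

path = os.path.join(os.getcwd(), 'files', 'actions.txt')

# 0xf1 decodes to 'ñ' in latin-1, so á-é-í-ó-ú and ñ come through intact
# instead of being dropped by errors='ignore'.
with open(path, encoding='latin-1') as temp:
    action = list(temp)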
As #Bogdan pointed out, you're likely not dealing with utf-8 data. You can leverage a module like chardet to try to determine the encoding. If you're on a unix-y environment, you can also try running the file command on it to guess the encoding.
Using your error message character:
>>> import chardet
>>> sample_string = b'\xf1'
>>> chardet.detect(sample_string)
{'confidence': 0.5, 'encoding': 'windows-1252'}
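In practice you would feed chardet a larger sample than a single byte, since its confidence improves with more data. A sketch using the actions.txt file from the question:
import chardet

# Read a raw byte sample; more bytes give chardet a better guess.
with open('actions.txt', 'rb') as f:
    raw = f.read(100000)

guess = chardet.detect(raw)
print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}

# Reopen as text with the detected encoding.
with open('actions.txt', encoding=guess['encoding']) as f:
    lines = list(f)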

Except Python codec errors?

File "/usr/lib/python3.1/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 805: invalid start byte
Hi, I get this exception. How do I catch it and continue reading my files when it occurs?
My program has a loop that reads a text file line by line and tries to do some processing. However, some files I encounter may not be text files, or may have lines that are not properly formatted (foreign languages, etc.). I want to ignore those lines.
The following is not working:
for line in sys.stdin:
    if line != "":
        try:
            matched = re.match(searchstuff, line, re.IGNORECASE)
            print(matched)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue
Look at http://docs.python.org/py3k/library/codecs.html. When you open the codecs stream, you probably want to use the additional argument errors='ignore'.
In Python 3, sys.stdin is by default opened as a text stream (see http://docs.python.org/py3k/library/sys.html), and has strict error checking.
You need to reopen it as an error-tolerant utf-8 stream. Something like this will work:
import codecs
sys.stdin = codecs.getreader('utf8')(sys.stdin.detach(), errors='ignore')
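Note that the try/except in the question could never catch the error: the UnicodeDecodeError is raised while the for loop fetches the next line, before the try block is even entered. With the reopened stream the loop no longer needs it; a sketch (searchstuff is a stand-in for the question's pattern):
import codecs
import re
import sys

# Undecodable bytes are now dropped during iteration instead of raising.
sys.stdin = codecs.getreader('utf8')(sys.stdin.detach(), errors='ignore')

searchstuff = r'pattern'  # stand-in for the question's pattern

for line in sys.stdin:
    if line != "":
        matched = re.match(searchstuff, line, re.IGNORECASE)
        print(matched)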
