Error reading file with accented vowels - python

The following statement, which fills a list from a file:
action = []
with open(os.getcwd() + "/files/" + "actions.txt") as temp:
    action = list(temp)
gives me the following error:
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 67: invalid continuation byte
If I add errors='ignore':
action = []
with open(os.getcwd() + "/files/" + "actions.txt", errors='ignore') as temp:
    action = list(temp)
The file is read, but the ñ and the accented vowels á-é-í-ó-ú are lost, even though Python 3, as I understand it, defaults to 'utf-8'.
I've been looking for a solution for two days or more, and I'm only getting more confused.
Thank you very much in advance for any suggestions.

You should use codecs to open the file with the correct encoding.
import codecs
with codecs.open(os.getcwd() + "/files/" + "actions.txt", "r", encoding="utf8") as temp:
    action = list(temp)
See the codecs docs
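Note that in Python 3 the built-in open() accepts an encoding argument directly, so codecs isn't strictly required. Also, byte 0xf1 is 'ñ' in Latin-1/Windows-1252, which suggests the file is not UTF-8 at all; a minimal sketch, assuming the file is Latin-1 encoded:
import os

# Build the path the same way the question does
path = os.path.join(os.getcwd(), "files", "actions.txt")
with open(path, encoding="latin-1") as temp:  # or "cp1252"
    action = list(temp)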

As @Bogdan pointed out, you're likely not dealing with utf-8 data. You can leverage a module like chardet to try to determine the encoding. If you're on a unix-y environment, you can also try running the file command on it to guess the encoding.
Using your error message character:
>>> import chardet
>>> sample_string = '\xf1'
>>> chardet.detect(sample_string)
{'confidence': 0.5, 'encoding': 'windows-1252'}
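Note that in Python 3 chardet.detect() expects bytes, not str, so read the file in binary mode first. A sketch, assuming the file path from the question:
import chardet

with open("files/actions.txt", "rb") as f:  # binary mode: raw, undecoded bytes
    raw = f.read()
guess = chardet.detect(raw)    # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
encoding = guess["encoding"]   # can be None if detection fails entirely
text = raw.decode(encoding)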

Related

UnicodeDecodeError - Add encoding to custom function input

Could you tell me where I'm going wrong with my current way of thinking? This is my function:
def replace_line(file_name, line_num, text):
    lines = open(f"realfolder/files/{item}.html", "r").readlines()
    lines[line_num] = text
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
This is an example of it being called:
replace_line(f'./files/{item}.html', 9, f'text {item} wordswordswords' + '\n')
I need to encode the text input as utf-8. I'm not sure why I haven't been able to do this already. I also need to retain the f-string value.
I've been trying things like adding:
str.encode(text)
# or
text.encode(encoding='utf-8')
to the top of my replace_line function. This hasn't worked. I have tried dozens of different methods, but each leaves me with this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2982: character maps to <undefined>
You need to set the encoding to utf-8 both when opening the file to read from:
lines = open(f"realfolder/files/{item}.html", "r", encoding="utf-8").readlines()
and when opening the file to write to:
out = open(file_name, 'w', encoding="utf-8")
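Putting both changes together, a sketch of the corrected function; note the original reads from a hardcoded path using item (undefined inside the function) where it presumably should use the file_name parameter:
def replace_line(file_name, line_num, text):
    # Read and write with an explicit codec so the platform default
    # ('charmap' on Windows) is never used
    with open(file_name, "r", encoding="utf-8") as f:
        lines = f.readlines()
    lines[line_num] = text
    with open(file_name, "w", encoding="utf-8") as out:
        out.writelines(lines)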

Encoding 'UTF-8' throws an exception in macOS

I'm trying to read a GloVe file: glove.twitter.27B.200d.txt. I have the following function to read the file:
def glove_reader(glove_file):
    glove_dict = {}
    with open(glove_file, 'rt', encoding='utf-8') as glove_reader:
        for line in glove_reader:
            tokens = line.rstrip().split()
            vect = [float(token) for token in tokens[1:]]
            glove_dict[tokens[0]] = vect
    return glove_dict
The problem is that I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 0: invalid continuation byte
I tried latin-1, but it didn't work either; it throws the following error:
ValueError: could not convert string to float: 'Ù\x86'
I also tried replacing 'rt' with 'r' and 'rb'. I think this is a macOS problem, because on Windows it didn't throw this error. Can someone please help me understand why I can't read this file?
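As with the first question above, one way to narrow this down is to inspect the raw bytes at the failing position and let chardet guess the encoding; a minimal diagnostic sketch (file name taken from the question):
import chardet

with open("glove.twitter.27B.200d.txt", "rb") as f:
    raw = f.read(100000)     # a sample is enough for a guess
print(raw[:20])              # the bytes that failed to decode at position 0
print(chardet.detect(raw))   # e.g. {'encoding': ..., 'confidence': ...}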

unable to decode this string using python

I have this text.ucs file which I am trying to decode using python.
file = open('text.ucs', 'r')
content = file.read()
print content
My result is
\xf\xe\x002\22
I tried decoding with utf-16 and utf-8:
content.decode('utf-16')
and getting error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 32-33: illegal encoding
Please let me know if I am missing anything or if my approach is wrong.
Edit: a screenshot has been requested.
The string is encoded as UTF-16-BE (big endian), so this works:
content.decode("utf-16-be")
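For reference, the same fix in Python 3, where the codec can be passed straight to open() (a sketch):
with open('text.ucs', 'rb') as f:
    content = f.read().decode('utf-16-be')

# or equivalently, let open() do the decoding:
with open('text.ucs', 'r', encoding='utf-16-be') as f:
    content = f.read()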
As I understand it, you are using Python 2.x, but the encoding parameter of open() was only added in Python 3.x. In Python 2 you can use io.open instead (note that you need to import the io module first), for example:
import io

file = io.open('text.ucs', 'r', encoding='utf-8')
content = file.read()
print content
You can specify which encoding to use with the encoding argument:
with open('text.ucs', 'r', encoding='utf-16') as f:
    text = f.read()
Your file needs to be decoded with the utf-8 codec; you can open it like this:
f = open('text.ucs', 'r', encoding='utf-8')
print(f.read())

Why does Python3 get a UnicodeDecodeError reading a text file where Python2 does not?

I'm reading in a text file. I've been doing it just fine with python2, but I decided to run my code with python3 instead.
My code for reading the text file is:
neg_words = []
with open('negative-words.txt', 'r') as f:
    for word in f:
        neg_words.append(word)
When I run this code on python 3 I get the following error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-14-1e2ff142b4c1> in <module>()
3 pos_words = []
4 with open('negative-words.txt', 'r') as f:
----> 5 for word in f:
6 neg_words.append(word)
7 with open('positive-words.txt', 'r') as f:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/codecs.py in
decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 3988: invalid continuation byte
It seems to me that there is a certain form of text that python2 decodes without any issue, which python3 can't.
Could someone please explain what the difference is between python2 and python3 with respect to this error. Why does it occur in one version but not the other? How can I stop it?
Your file is not UTF-8 encoded. Figure out what encoding is used and specify that explicitly when opening the file:
with open('negative-words.txt', 'r', encoding="<correct codec>") as f:
In Python 2, str is a binary string containing encoded data, not Unicode text, so reading in text mode does no decoding. You'd hit the same issue if you used io.open(), or if you tried to decode the data you read with word.decode('utf8').
You probably want to read up on Unicode and Python. I strongly recommend Ned Batchelder's Pragmatic Unicode.
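A small sketch of the difference; the failing byte 0xef shows the data is not valid UTF-8, and the Latin-1 codec here is only an assumption for illustration:
>>> data = b'caf\xe9'        # 'café' encoded as Latin-1, not valid UTF-8
>>> data.decode('utf8')      # raises UnicodeDecodeError in Python 2 and 3 alike
>>> data.decode('latin-1')   # u'café': decoding only fails when it is attempted
Python 2's open().read() returns these raw bytes untouched, which is why the error only appears once Python 3 tries to decode them.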
Or we can simply read the file in binary mode:
with open(filename, 'rb') as f:
    pass
'r' open for reading (default)
'b' binary mode
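Binary mode sidesteps the error, but you get bytes rather than text, so at some point you still have to decode with the right codec. A sketch, with the codec as an assumption:
with open('negative-words.txt', 'rb') as f:
    raw = f.read()
text = raw.decode('latin-1')   # substitute whichever codec actually matches the file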

UnicodeEncodeError: 'ascii' codec can't encode

I have the following data container, which is constantly being updated:
data = []
for val, track_id in zip(values, list(track_ids)):
    # below threshold
    if val < threshold:
        # structure data as a dictionary
        pre_data = {"artist": sp.track(track_id)['artists'][0]['name'],
                    "track": sp.track(track_id)['name'],
                    "feature": filter_name,
                    "value": val}
        data.append(pre_data)

# write to file
with open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
but I am getting a lot of errors like this:
  json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 6: ordinal not in range(128)
Is there a way I can get rid of this encoding problem once and for all?
I was told that this would do it:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
but many people do not recommend it.
I use Python 2.7.10. Any clues?
When you write to a file that was opened in text mode, Python encodes the string for you. The default encoding is ascii, which generates the error you see; there are a lot of characters that can't be encoded to ASCII.
The solution is to open the file in a different encoding. In Python 2 you must use the codecs module, in Python 3 you can add the encoding= parameter directly to open. utf-8 is a popular choice since it can handle all of the Unicode characters, and for JSON specifically it's the standard; see https://en.wikipedia.org/wiki/JSON#Data_portability_issues.
import codecs
with codecs.open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
Your object has unicode strings and Python 2.x's support for unicode can be a bit spotty. First, let's make a short example that demonstrates the problem:
>>> obj = {"artist":u"Björk"}
>>> import json
>>> with open('deleteme', 'w') as f:
... json.dump(obj, f, ensure_ascii=False)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)
From the json.dump help text:
If ``ensure_ascii`` is true (the default), all non-ASCII characters in the
output are escaped with ``\uXXXX`` sequences, and the result is a ``str``
instance consisting of ASCII characters only. If ``ensure_ascii`` is
``False``, some chunks written to ``fp`` may be ``unicode`` instances.
This usually happens because the input contains unicode strings or the
``encoding`` parameter is used. Unless ``fp.write()`` explicitly
understands ``unicode`` (as in ``codecs.getwriter``) this is likely to
cause an error.
Ah! There is the solution. Either use the default ensure_ascii=True and get ascii-escaped unicode characters, or use the codecs module to open the file with the encoding you want. This works:
>>> import codecs
>>> with codecs.open('deleteme', 'w', encoding='utf-8') as f:
... json.dump(obj, f, ensure_ascii=False)
...
>>>
Why not encode the specific string instead? Try the .encode('utf-8') method on the string that is raising the exception.
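A sketch of that approach in Python 2, using the character from the error message and assuming f is the file object from the question:
s = u'It\u2019s'               # u'\u2019' is a right single quotation mark
f.write(s.encode('utf-8'))     # write the bytes yourself, bypassing the ascii codec
Note this only helps if every string written gets the same treatment; opening the file with an explicit encoding, as shown above, is usually simpler.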
