Encode 'UCS-2 Little Endian' file to 'utf8' using Python: error

I'm trying to convert a UCS-2 Little Endian file to UTF-8 using Python, and I'm getting a weird error.
The code I'm using:
file=open("C:/AAS01.txt", 'r', encoding='utf8')
lines = file.readlines()
file.close()
And I'm getting the following error:
Traceback (most recent call last):
File "C:/Users/PycharmProjects/test.py", line 18, in <module>
main()
File "C:/Users/PycharmProjects/test.py", line 7, in main
lines = file.readlines()
File "C:\Python34\lib\codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I tried using the codecs module as well, but that didn't work either.
Any idea what I can do?

The encoding argument to open sets the input encoding. Use encoding='utf_16_le'.
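For example, a minimal sketch of the corrected call, reusing the path from the question (the output file name is made up):
with open("C:/AAS01.txt", 'r', encoding='utf_16_le') as src:
    lines = src.readlines()
# Write the same text back out as UTF-8 (hypothetical output path).
with open("C:/AAS01_utf8.txt", 'w', encoding='utf8') as dst:
    dst.writelines(lines)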

If you're trying to read UCS-2, why are you telling Python it's UTF-8? The 0xff is most likely the first byte of a little endian byte order marker:
>>> codecs.BOM_UTF16_LE
b'\xff\xfe'
UCS-2 is also deprecated, for the simple reason that Unicode outgrew it. The typical replacement would be UTF-16.
More info linked in Python 3: reading UCS-2 (BE) file
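As a small illustration of that BOM (a sketch on made-up data, not the asker's file):
import codecs
raw = codecs.BOM_UTF16_LE + 'hello'.encode('utf_16_le')
print(repr(raw.decode('utf-16')))     # 'hello'        -- 'utf-16' consumes the BOM
print(repr(raw.decode('utf_16_le')))  # '\ufeffhello'  -- 'utf_16_le' keeps U+FEFF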

Related

How to avoid problems with UTF-8 encoding errors

I've got a problem with reading text files. When I start the program and add a file, it throws an error:
Traceback (most recent call last):
File "c:/Users/Marcin/Desktop/python/graf_menu.py", line 38, in <module>
main_func()
File "c:/Users/Marcin/Desktop/python/graf_menu.py", line 32, in main_func
read_file()
File "c:/Users/Marcin/Desktop/python/graf_menu.py", line 15, in read_file
for i in f.read():
File "C:\Users\Marcin\AppData\Local\Programs\Python\Python38-32\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 19: invalid start byte
In my code there is a line with encoding="UTF-8". How can I solve the problem? The code is below:
files = input("File name: ")
try:
    with open(files, "r", encoding="UTF-8") as f:
        for i in f.read():
            print(i, end='')
except FileNotFoundError:
    print("FileNotFoundError")
There is nothing wrong with the program itself. You are getting this error because you are trying to read a file which is not encoded as UTF-8 as UTF-8-encoded. You have to either convert the contents of the file to UTF-8 or specify a different encoding (the one that the file actually uses) in the call to open.
This file is not encoded as UTF-8; try using encoding="iso-8859-1".
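A sketch of the adjusted call (iso-8859-1 is only a guess; substitute whatever encoding the file really uses):
files = input("File name: ")
try:
    # iso-8859-1 maps every byte to a character, so it never raises UnicodeDecodeError,
    # but the text will be garbled if the file actually uses a different encoding.
    with open(files, "r", encoding="iso-8859-1") as f:
        for i in f.read():
            print(i, end='')
except FileNotFoundError:
    print("FileNotFoundError")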

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

import os
import shutil
import codecs
directory = '~/Desktop/ra/clean_tokenized/1987'
for filename in os.listdir(directory):
    full_name = directory + '/' + filename
    with open(full_name, 'r') as article:
        for line in article:
            print(line)
Here's the traceback:
Traceback (most recent call last):
File "~/Desktop/corpus_filter/01_corpus.py", line 11, in
for line in article:
File "~/.conda/envs/MangerRA/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
The files contain Japanese characters, and I'm just trying to make a CSV file with all the words that have come up in them. But I can't get past this error.
Python is trying to open your file using the UTF-8 encoding (which is the default most of the time these days). Unfortunately, your file is using some other encoding (or is otherwise corrupted), and so the decoding fails.
Unfortunately, I can't tell what encoding your file uses. You'll have to investigate that yourself. You might try another encoding like Shift JIS (using open(full_name, 'r', encoding='shift-jis')), and see if you get valid text or mojibake.
If all else fails, you can open the file in binary mode ('rb' rather than just 'r'), and check out what is located at byte 3131 and immediately afterwards. It may be just a messed up bit of data in the file that you can delete or fix manually.
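A sketch combining both suggestions (shift_jis is only a guess for Japanese text; the directory is the asker's path, expanded so the '~' resolves):
import os
directory = os.path.expanduser('~/Desktop/ra/clean_tokenized/1987')
for filename in os.listdir(directory):
    full_name = os.path.join(directory, filename)
    try:
        # Guess a Japanese encoding instead of the default UTF-8.
        with open(full_name, 'r', encoding='shift_jis') as article:
            for line in article:
                print(line)
    except UnicodeDecodeError:
        # Last resort: inspect the raw bytes around the failing offset (3131 in the traceback).
        with open(full_name, 'rb') as raw:
            print(filename, raw.read()[3120:3140])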

Not able to splitlines from a file (Python 3)

So I'm doing what I always do when I read the file:
code:
f = open(filename, 'r')
t = f.read().splitlines()
print(t)
but I'm getting a UnicodeDecodeError and I don't know why.
The error:
Traceback (most recent call last):
File "try.py", line 21, in <module>
t= f.read().splitlines()
File "/Users/jamilaldani/miniconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 307: invalid start byte
As described in a few places on this site, and in AndreaConte's comment, this is likely a file encoded with a different encoding (i.e. not UTF-8).
This answer may help: https://stackoverflow.com/a/19706723/70131,
as may this one if you're willing to lose some data: https://stackoverflow.com/a/12468274/70131
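In short, either pass the file's real encoding to open, or keep UTF-8 and tell it how to handle bad bytes (a sketch; cp1252 is only a guess, and filename is the same variable as in the question):
# Option 1: name the encoding the file actually uses.
# (In cp1252, the offending byte 0x96 is an en dash, a common Windows artifact.)
with open(filename, 'r', encoding='cp1252') as f:
    t = f.read().splitlines()
# Option 2: stay with UTF-8 but replace undecodable bytes (lossy).
with open(filename, 'r', encoding='utf-8', errors='replace') as f:
    t = f.read().splitlines()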

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 1072: invalid start byte

I'm trying to open and read a CSV file with Python, but I keep getting this error.
Traceback (most recent call last):
File "openfile.py", line 7 in <module>
data = csvfile("1.csv")
File "openfile.py", line 4, in csvfile
data = np.loadtxt(filename, delimiter = ",", skiprows =9)
File "/Users/ZEN/anaconda/lib/python3.6/site-packages/numpy/lib/npyio.py", line 880, in loadtxt
next(fh)
File "/Users/ZEN/anaconda/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 1072: invalid start byte
I don't know why I can't run this code normally. I'm on Python 3.6.0 and Anaconda 4.3.1 (x86_64) on a Mac. I recently upgraded from Python 2.x to 3.x and downloaded NumPy.
This is the code I'm trying to run:
import numpy as np
def csvfile(filename):
    data = np.loadtxt(filename, delimiter=",", skiprows=9)
    return data
data = csvfile("1.csv")
print(data)
It would be great if anyone could help me!
I am not sure exactly what encoding you are using.
If your data contains Chinese or some special characters, try this
data = np.loadtxt(path,encoding="gbk", delimiter=",",dtype="str")
If not, generally, using
data = np.loadtxt(path,encoding="utf8", delimiter=",",dtype="str")
can solve the problem.
It is important to know the encoding that your file is using.
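If the encoding is not obvious, one way to narrow it down is to peek at the raw bytes near the offset in the traceback before picking an argument (a sketch; cp1252 is only a guess, and loadtxt's encoding parameter needs NumPy 1.14 or newer):
import numpy as np
# Look at the bytes the UTF-8 decoder choked on (it reported position 1072,
# though that offset can be relative to an internal read chunk rather than the file start).
with open("1.csv", "rb") as f:
    print(f.read()[1060:1085])
# 0x95 is a bullet character in Windows cp1252, so that is one candidate.
data = np.loadtxt("1.csv", delimiter=",", skiprows=9, encoding="cp1252")
print(data)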

Python fails when parsing a file using re

I have a file that is mostly an ASCII file, but some non-ASCII characters appear occasionally. I want to parse these files and extract the lines that are marked in a certain way. Previously I used sed for this, but now I need to do the same in Python. (Of course I could still use os.system, but I'm hoping for something more convenient.)
I'm doing the following.
p = re.compile(".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
f = open("capture_8_8_8__1_2_3.log", encoding="ascii")
fl = filter(lambda line: p.match(line), f)
len(list(fl))
And on the last line I get the following error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 2227: ordinal not in range(128)
If I remove the encoding parameter from the second line, i.e. use the default encoding, which is UTF-8, the error is the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 2227: invalid start byte
Could you please help me with what I can do here, other than calling sed from Python?
UPD:
Thanks to @Wooble I found the answer.
The correct code looks like the following:
p = re.compile(rb".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
f = open("capture_8_8_8__1_2_3.log", "rb")
fl = filter(lambda line: p.match(line), f)
len(list(fl))
I opened the file in binary mode and also compiled the regex from a bytes pattern.
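If you would rather stay in text mode, another common workaround (not what the asker used) is to open the file with an error handler so undecodable bytes are smuggled through instead of raising:
import re
p = re.compile(r".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
# surrogateescape maps each undecodable byte to a surrogate code point,
# so the ASCII parts of every line can still be matched as text.
f = open("capture_8_8_8__1_2_3.log", encoding="ascii", errors="surrogateescape")
fl = filter(lambda line: p.match(line), f)
print(len(list(fl)))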
