Python fails when parsing a file using re - python

I have a file that is mostly ASCII, but it occasionally contains non-ASCII characters. I want to parse this file and extract the lines that are marked in a certain way. Previously I used sed for this, but now I need to do the same in Python. (Of course I can still use os.system, but I'm hoping for something more convenient.)
I'm doing the following:
import re
p = re.compile(".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
f = open("capture_8_8_8__1_2_3.log", encoding="ascii")
fl = filter(lambda line: p.match(line), f)
len(list(fl))
And on the last line I get the following error message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 2227: ordinal not in range(128)
If I remove the encoding parameter from the open call, i.e. use the default encoding, which is utf-8 here, the error is the following:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 2227: invalid start byte
Could you please tell me what I can do here, other than calling sed from Python?
UPD.
Thanks to @Wooble I found the answer.
The correct code looks as follows:
import re
p = re.compile(rb".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
f = open("capture_8_8_8__1_2_3.log", "rb")
fl = filter(lambda line: p.match(line), f)
len(list(fl))
I opened the file in binary mode and compiled the regex from a bytes literal, so both the lines and the pattern are bytes and no decoding takes place.
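A minimal, self-contained sketch of the same bytes-based approach, using an in-memory buffer in place of the log file. The sample data and the names `pattern`, `sample`, and `matches` are made up for illustration; only the regex comes from the question.

```python
import io
import re

# Same pattern as in the question, compiled from a bytes literal so it matches bytes lines.
pattern = re.compile(rb".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")

# Hypothetical log content: one unmarked line and two marked lines,
# with a stray non-ASCII byte (0x81) that would break text-mode decoding.
sample = io.BytesIO(
    b"noise \x81 more noise\n"
    b"prefix STATWAH 8:8:8 :1 2 3 STATWAH suffix\n"
    b"\x81 STATWAH 1:2:3:4 STATWAH\n"
)

# Iterating a binary stream yields bytes lines; keep only the ones that match.
matches = [m for m in map(pattern.match, sample) if m]

for m in matches:
    # The captured groups are digits and spaces only, so decoding them as ASCII is safe.
    print([g.decode("ascii") for g in m.groups()])
```

Because nothing is decoded until the ASCII-only groups have been isolated, the stray bytes elsewhere on the line never reach a codec.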

Related

error Traceback (most recent call last): after reading .txt file in Python

This is my first time trying to program in Python.
with open('/Users/solidaneziri/Downloads/Data_Exercise_1.txt') as infile:
    for line in infile:
        print(line.split()[0])
This is the code I wrote to read the file. It compiled and ran the first time, but since then I keep getting this error and I don't know how to fix it:
/usr/bin/python3 /Users/solidaneziri/IdeaProjects/Abgabe1/src/BonusAufgabe/aufgabe1.py
Traceback (most recent call last):
  File "/Users/solidaneziri/IdeaProjects/Abgabe1/src/BonusAufgabe/aufgabe1.py", line 2, in <module>
    for line in infile:
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9f in position 190: invalid start byte
I replaced the accented letters with unaccented ones and it worked. This is not the optimal solution, but it worked.

How to avoid problem with encode UTF-8 error

I've got a problem reading text files. When I start the program and add a file, it throws an error:
Traceback (most recent call last):
  File "c:/Users/Marcin/Desktop/python/graf_menu.py", line 38, in <module>
    main_func()
  File "c:/Users/Marcin/Desktop/python/graf_menu.py", line 32, in main_func
    read_file()
  File "c:/Users/Marcin/Desktop/python/graf_menu.py", line 15, in read_file
    for i in f.read():
  File "C:\Users\Marcin\AppData\Local\Programs\Python\Python38-32\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 19: invalid start byte
My code already passes encoding="UTF-8" to open. How can I solve the problem? The code is below:
files = input("File name: ")
try:
    with open(files, "r", encoding="UTF-8") as f:
        for i in f.read():
            print(i, end='')
except FileNotFoundError:
    print("FileNotFoundError")
There is nothing wrong with the program itself. You are getting this error because you are reading the file as UTF-8 when it is not actually encoded as UTF-8. You have to either convert the contents of the file to UTF-8 or specify a different encoding (the one that the file actually uses) in the call to open.
This file is not encoded as UTF-8; try encoding="iso-8859-1".
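One way to act on that advice when the real encoding is unknown is to try a short list of candidates in order. The sketch below is only an illustration: the helper name `read_text_with_fallback` and the candidate list are assumptions, and cp1250 is a guess based on byte 0xb3, which is "ł" in that code page. Note that iso-8859-1 maps every byte to some character, so it never fails and can silently produce wrong text; it belongs last, as a catch-all.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1250", "iso-8859-1")):
    """Return (text, encoding) for the first candidate encoding that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            # This candidate cannot decode the file; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the file")


# Demo with a throwaway file written in cp1250 (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write("żółw\n".encode("cp1250"))
    text, used = read_text_with_fallback(path)
    print(used, text.strip())
```

A clean decode does not prove the guess was right, so it is worth checking the result by eye before trusting it.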

LookupError: unknown encoding: utf8r

When I try the code:
f = open("xronia.txt", "r")
for x in f:
    print(x)
I always get this error:
Traceback (most recent call last):
  File "C:\Users\Desktop\PYTHON\Προγραμματισμός Σταύρος\disekta.py", line 2, in <module>
    lines=fo.readlines()
  File "C:\Users\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1253.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 0: character maps to <undefined>
I have tried to use encoding='utf8', but it didn't work. The file is an Excel file saved as .txt (as I read on a site). I am new to this, so any help is appreciated.

Not capable to splitlines from a file (python3)

So I'm doing what I always do when I read a file:
f = open(filename, 'r')
t = f.read().splitlines()
print(t)
But I'm getting a UnicodeDecodeError and I don't know why. The error:
Traceback (most recent call last):
  File "try.py", line 21, in <module>
    t= f.read().splitlines()
  File "/Users/jamilaldani/miniconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 307: invalid start byte
As described in a few places on this site, and in AndreaConte's comment, this is likely a file in a different encoding (i.e. not UTF-8).
This answer may help: https://stackoverflow.com/a/19706723/70131,
as may this one if you're willing to lose some data: https://stackoverflow.com/a/12468274/70131
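The two general options mentioned here, decoding with the file's real encoding or accepting some data loss, might look like the sketch below. It works on an in-memory bytes value rather than a file, and cp1252 is only a guess (byte 0x96 is an en dash in that code page); the sample bytes and variable names are made up for illustration. The built-in open accepts the same encoding and errors parameters.

```python
# Hypothetical file content: 0x96 is an en dash in cp1252 but an invalid UTF-8 start byte.
raw = b"pages 10\x9620\n"

# Option 1: decode with the encoding the data actually uses (cp1252 is an assumption here).
lossless = raw.decode("cp1252").splitlines()
print(lossless)

# Option 2: keep UTF-8 and replace each undecodable byte with U+FFFD (this loses data).
lossy = raw.decode("utf-8", errors="replace").splitlines()
print(lossy)
```

For a file, the equivalent calls would be open(filename, 'r', encoding='cp1252') and open(filename, 'r', encoding='utf-8', errors='replace').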

encode 'UCS-2 Little Endian' file to 'utf8' using python error

I'm trying to convert a file from UCS-2 Little Endian to UTF-8 using Python and I'm getting a weird error.
The code I'm using:
file=open("C:/AAS01.txt", 'r', encoding='utf8')
lines = file.readlines()
file.close()
And I'm getting the following error:
Traceback (most recent call last):
  File "C:/Users/PycharmProjects/test.py", line 18, in <module>
    main()
  File "C:/Users/PycharmProjects/test.py", line 7, in main
    lines = file.readlines()
  File "C:\Python34\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I tried using the codecs module, but that didn't work either.
Any idea what I can do?
The encoding argument to open sets the input encoding. Use encoding='utf_16_le'.
If you're trying to read UCS-2, why are you telling Python it's UTF-8? The 0xff is most likely the first byte of a little-endian byte order mark (BOM):
>>> codecs.BOM_UTF16_LE
b'\xff\xfe'
UCS-2 is also deprecated, for the simple reason that Unicode outgrew it. The typical replacement would be UTF-16.
More info linked in Python 3: reading UCS-2 (BE) file
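As a rough sketch of the whole conversion, the example below creates a small stand-in input file (BOM plus UTF-16 LE text), reads it back, and writes it out as UTF-8. The file names and sample text are made up. It uses the "utf-16" codec rather than "utf_16_le": "utf-16" reads the BOM to pick the byte order and strips it, whereas "utf_16_le" would leave a BOM, if present, as U+FEFF at the start of the first line.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "ucs2_le.txt")  # hypothetical input file
    dst = os.path.join(tmp, "utf8.txt")     # hypothetical output file

    # Create a stand-in for the input: a little-endian BOM followed by UTF-16 LE text.
    with open(src, "wb") as f:
        f.write(codecs.BOM_UTF16_LE + "first\nsecond\n".encode("utf-16-le"))

    # Decode on read: "utf-16" consumes the BOM and uses it to choose the byte order.
    with open(src, "r", encoding="utf-16") as f:
        lines = f.readlines()

    # Encode on write; newline="" keeps the line endings exactly as they are in `lines`.
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.writelines(lines)

    with open(dst, "rb") as f:
        utf8_bytes = f.read()

print(lines)
print(utf8_bytes)
```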
