LookupError: unknown encoding: utf8r - python

When I try this code:
f = open("xronia.txt", "r")
for x in f:
    print(x)
I always get this error:
Traceback (most recent call last):
File "C:\Users\Desktop\PYTHON\Προγραμματισμός Σταύρος\disekta.py", line 2, in <module>
lines=fo.readlines()
File "C:\Users\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1253.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 0: character maps to <undefined>
I tried passing encoding='utf8', but it didn't work. The file is an Excel file saved as .txt (as I read on a site). I am new to this, so any help is appreciated.
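A byte 0xff at position 0 is consistent with a UTF-16 little-endian byte order mark (FF FE), which Excel's "Unicode Text" export typically writes. That is a guess about this file, not something the traceback proves. A minimal sketch under that assumption: look at the first bytes, then choose the encoding. The names sniff_bom_encoding and read_lines and the utf-8 fallback are hypothetical.

```python
import codecs


def sniff_bom_encoding(path, default="utf-8"):
    """Return an encoding name based on a leading byte order mark, if any."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    # UTF-8 BOM first, then either UTF-16 byte order (UTF-32 is not handled here).
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec consumes the BOM and picks the byte order itself.
        return "utf-16"
    return default


def read_lines(path):
    """Read all lines using whatever encoding the BOM suggests."""
    with open(path, "r", encoding=sniff_bom_encoding(path)) as fh:
        return [line.rstrip("\n") for line in fh]
```

If the file has no BOM, the fallback may still be wrong, so the error can reappear; in that case the file needs to be re-exported or its real encoding found some other way.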

Related

Python read text

I am simply trying to read a text file that has 4000+ lines of nouns, all in a single column, and I’m getting an error:
Traceback (most recent call last):
File "/private/var/mobile/Library/Mobile Documents/iCloud~com~omz-software~Pythonista3/Documents/nouns.py", line 4, in <module>
for i in nouns_file:
File "/var/containers/Bundle/Application/107074CD-03B1-4FB3-809A-CBD44D6CF245/Pythonista3.app/Frameworks/Py3Kit.framework/pylib/encodings/ascii.py", line 27, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2241: ordinal not in range(128)
With code:
with open("nounlist.txt", "r") as nouns_file:
    for i in nouns_file:
        print(i)
I’m not sure what’s causing this. I would think that it would just output all of the nouns from my nounlist.txt file.
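The traceback shows the file being decoded as ASCII, and 0xc3 is a common lead byte in UTF-8 (accented Latin letters such as é encode as C3 xx). So a plausible fix, assuming the file really is UTF-8, is to pass the encoding explicitly instead of relying on the platform default. A small sketch; the function name read_nouns is hypothetical.

```python
def read_nouns(path, encoding="utf-8"):
    """Read one noun per line, decoding with an explicit encoding."""
    with open(path, "r", encoding=encoding) as nouns_file:
        # Strip the trailing newline and skip blank lines.
        return [line.strip() for line in nouns_file if line.strip()]
```

Usage would look like `for noun in read_nouns("nounlist.txt"): print(noun)`.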

How to read and understand the .hcc file with Python?

I have a .hcc file that I am trying to read, but I am getting an error.
This is what I tried:
chardetect 2016.hcc
2016.hcc: windows-1253 with confidence 0.2724130248827703
I have tried the following:
>>> with open("2016.hcc","r",encoding="windows-1253") as f:
...     print(f.read())
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python35\lib\encodings\cp1253.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9c in position 232: character maps to <undefined>
Then I tried it without specifying an encoding:
>>> with open("2016.hcc","r") as f:
...     print(f.read())
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python35\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 284: character maps to <undefined>
After opening the file in byte mode, I was able to read it, but none of the content was understandable.
Here is the sample file: 2016.hcc
Please let me know how I can do that.
UPDATED ATTEMPT:
>>> with open("2016.hcc","r",encoding="utf-16") as f:
...     print(f.read())
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python35\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
File "C:\Python35\lib\encodings\utf_16.py", line 61, in _buffer_decode
codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 15390-15391: illegal encoding
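Three different text codecs failing at three different positions, plus a chardetect confidence of about 0.27, suggests the file may not be plain text at all. That is only an assumption; nothing above says what .hcc is. One hedged way to check is to look at the raw leading bytes and see how many are printable ASCII. The helper name describe_bytes is hypothetical.

```python
def describe_bytes(path, count=32):
    """Return a hex dump of the first bytes and the share that is printable ASCII."""
    with open(path, "rb") as fh:
        head = fh.read(count)
    # Printable ASCII plus tab, line feed and carriage return.
    printable = sum(1 for b in head if 32 <= b < 127 or b in (9, 10, 13))
    ratio = printable / len(head) if head else 0.0
    hex_dump = " ".join("{:02x}".format(b) for b in head)
    return hex_dump, ratio
```

A low ratio hints at a binary or compressed format, in which case no choice of encoding= will produce readable text and the file format itself has to be identified first.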

Unicode Decode Error while trying to extract data from text file

I am writing code to extract specific lines from a text file into an output file. Here is what my code looks like:
with open('output.txt', 'w') as outfile:
    with open('testfile.txt') as infile:
        for line in infile:
            if 'Emotion' in line or 'PANS.RESP' in line:
                outfile.write(line)
                outfile.write('-----------------------------------------------------')
I keep getting this error:
Traceback (most recent call last):
File "/Users/Sam/Desktop/readstuff.py", line 3, in <module>
for line in infile:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 154: ordinal not in range(128)
What should I do?
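Byte 0xe2 is the lead byte of many three-byte UTF-8 sequences (curly quotes and dashes, for example), so the input is plausibly UTF-8 being read with an ASCII default. A sketch assuming that: open both files with an explicit encoding, and use errors="replace" so a stray bad byte becomes U+FFFD instead of aborting the run. The function name extract_lines and the separator are hypothetical choices.

```python
def extract_lines(src, dst, keywords=("Emotion", "PANS.RESP"), encoding="utf-8"):
    """Copy lines containing any keyword from src to dst; return how many were kept."""
    separator = "-" * 40 + "\n"
    kept = 0
    with open(src, "r", encoding=encoding, errors="replace") as infile, \
            open(dst, "w", encoding=encoding) as outfile:
        for line in infile:
            if any(keyword in line for keyword in keywords):
                outfile.write(line)
                outfile.write(separator)
                kept += 1
    return kept
```

errors="replace" hides decoding problems rather than fixing them, so it is worth dropping once the real encoding is confirmed.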

Python fails with parsing file using re

I have a file that is mostly ASCII, but some non-ASCII characters appear occasionally. I want to parse this file and extract the lines that are marked in a certain way. Previously I used sed for this, but now I need to do the same in Python. (Of course I can still use os.system, but I'm hoping for something more convenient.)
I'm doing the following:
p = re.compile(".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
f = open("capture_8_8_8__1_2_3.log", encoding="ascii")
fl = filter(lambda line: p.match(line), f)
len(list(fl))
On the last line I get the following error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 2227: ordinal not in range(128)
If I remove the encoding parameter from the second line, i.e. use the default encoding, which is utf-8, the error is the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 2227: invalid start byte
Could you please help me with what I can do here, other than calling sed from Python?
UPD.
Thanks to @Wooble I found the answer.
The correct code looks like this:
p = re.compile(rb".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")
f = open("capture_8_8_8__1_2_3.log", "rb")
fl = filter(lambda line: p.match(line), f)
len(list(fl))
I opened the file in binary mode and also compiled the regex from a bytes pattern.
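If str results are preferred over bytes, one alternative (not the approach from the update above) is to decode as latin-1. That codec maps every byte 0x00-0xff to a code point, so it cannot raise UnicodeDecodeError. Non-ASCII characters may come out as the wrong letters, which is acceptable here only because the pattern matches ASCII text. The function name matching_lines is hypothetical.

```python
import re

PATTERN = re.compile(r".*STATWAH ([0-9]*):([0-9]*):([0-9 ]*):([0-9 ]*) STATWAH.*")


def matching_lines(path):
    """Return the lines of a mostly-ASCII log that match the STATWAH marker."""
    # latin-1 decodes any byte sequence, so stray bytes like 0x81 do not abort the read.
    with open(path, "r", encoding="latin-1") as f:
        return [line for line in f if PATTERN.match(line)]
```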

Why html2text module throws UnicodeDecodeError?

I have a problem with the html2text module. It raises a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)
Example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib
h = html2text.HTML2Text()
h.ignore_links = True
html = urllib.urlopen( "http://google.com" ).read()
print h.handle( html )
I have also tried h.handle( unicode( html, "utf-8" ) ) with no success. Any help?
EDIT:
Traceback (most recent call last):
File "test.py", line 12, in <module>
print h.handle(html)
File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
return self.optwrap(self.close())
File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)
The issue is easily reproducible when you don't decode, but it works just fine when you decode your source correctly. You also get the error if you reuse the parser!
You can try this out with a known good Unicode source, such as http://www.ltg.ed.ac.uk/~richard/unicode-sample.html.
If you don't decode the response to unicode, the library fails:
>>> h = html2text.HTML2Text()
>>> h.handle(html)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Now, if you reuse the HTML2Text object, its state is not cleared and it still holds the incorrect data, so even passing in Unicode will now fail:
>>> h.handle(html.decode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
You need to use a new object and it'll work just fine:
>>> h = html2text.HTML2Text()
>>> result = h.handle(html.decode('utf8'))
>>> len(result)
12750
>>> type(result)
<type 'unicode'>
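The transcripts above are Python 2, where joining byte strings with unicode triggers an implicit ASCII decode. In Python 3 the same mix-up fails earlier, with a TypeError, but the remedy is the same: decode the response bytes once, at the boundary, before any text processing. A minimal sketch using only built-ins (it does not call html2text); the helper name to_text and the utf-8 default are assumptions.

```python
def to_text(payload, charset="utf-8"):
    """Decode bytes once at the boundary; pass str through unchanged."""
    if isinstance(payload, bytes):
        return payload.decode(charset)
    return payload


raw = "caf\u00e9".encode("utf-8")  # stands in for a downloaded response body

try:
    "".join(["<p>", raw])          # mixing str and bytes
except TypeError as exc:
    mixed_error = type(exc).__name__

joined = "".join(["<p>", to_text(raw)])  # decoded first, so the join succeeds
```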
