How to read and understand the .hcc file with Python? - python

I have a .hcc file that I am trying to read, but I keep getting errors.
This is what I tried:
chardetect 2016.hcc
2016.hcc: windows-1253 with confidence 0.2724130248827703
I have tried the following:
>>> with open("2016.hcc","r",encoding="windows-1253") as f:
...     print(f.read())
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python35\lib\encodings\cp1253.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9c in position 232: character maps to <undefined>
Then I tried this without specifying an encoding:
>>> with open("2016.hcc","r") as f:
...     print(f.read())
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python35\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 284: character maps to <undefined>
After opening the file in binary mode, I was able to read it, but none of the content was understandable.
Here is the sample file: 2016.hcc
Please let me know how I can do that.
**UPDATED ATTEMPT:**
>>> with open("2016.hcc","r",encoding="utf-16") as f:
...     print(f.read())
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python35\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
File "C:\Python35\lib\encodings\utf_16.py", line 61, in _buffer_decode
codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 15390-15391: illegal encoding
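A chardet confidence of about 0.27, together with failures under every codec tried, suggests the file may not be plain text at all. Here is a minimal sketch for looking at the raw bytes instead of decoding them. It assumes nothing about the .hcc format, and `hexdump_head` is a hypothetical helper name, not part of any library:

```python
def hexdump_head(path, count=64, width=16):
    """Return a hex + ASCII dump of the first `count` bytes of a file."""
    with open(path, "rb") as f:  # binary mode: no decoding is attempted
        data = f.read(count)
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("{:02x}".format(b) for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("{:08x}  {:<{w}} {}".format(offset, hex_part, text_part, w=width * 3))
    return "\n".join(lines)

# Usage (file name taken from the question):
# print(hexdump_head("2016.hcc"))
```

If the first bytes show a recognizable signature or readable field names, that is usually a better clue to the format than trying further text encodings.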

Related

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 886: invalid start byte: jsonlines

I am trying to read lines from a jsonl file, but I am getting the following error.
Traceback (most recent call last):
File "insertion_script.py", line 12, in <module>
for line in f.iter():
File "C:\Users\Administrator\Anaconda3\lib\site-packages\jsonlines\jsonlines.py", line 204, in iter
skip_empty=skip_empty)
File "C:\Users\Administrator\Anaconda3\lib\site-packages\jsonlines\jsonlines.py", line 143, in read
lineno, line = next(self._line_iter)
File "C:\Users\Administrator\Anaconda3\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 886: invalid start byte
import jsonlines
BH_data = []
with jsonlines.open('2401659.jsonl','r') as f:
    for line in f.iter():
        BH_data.append(line)
The error implies that your data is not actually UTF-8. Byte 0xA3 is the pound sterling sign (£) in Windows code page 1252, so you should try:
import codecs
import jsonlines
with codecs.open('2401659.jsonl', 'r', encoding='cp1252') as jfile:
    with jsonlines.Reader(jfile) as f:
        BH_data = list(f)  # each line is decoded as cp1252, then parsed as JSON
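To see why the byte fails under UTF-8 but decodes under cp1252, here is a small sketch using a made-up JSON line (the `raw` value is a hypothetical sample, not data from the question's file):

```python
raw = b'{"price": "\xa310"}'  # hypothetical JSON line containing byte 0xA3

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xA3 is a UTF-8 continuation byte, so it cannot start a character.
    utf8_ok = False

decoded = raw.decode("cp1252")  # cp1252 maps 0xA3 to the pound sign
print(utf8_ok, decoded)
```

Note that cp1252 is only a guess based on one byte. If other characters come out wrong, the file may be in a different single-byte encoding.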

LookupError: unknown encoding: utf8r

When I try this code:
f = open("xronia.txt", "r")
for x in f:
    print(x)
I always get this error:
Traceback (most recent call last):
File "C:\Users\Desktop\PYTHON\Προγραμματισμός Σταύρος\disekta.py", line 2, in <module>
lines=fo.readlines()
File "C:\Users\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1253.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 0: character maps to <undefined>
I have tried encoding='utf8', but it didn't work. The file is an Excel file saved as .txt (as I read on a website). I am new to this, so any help is welcome.
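A byte 0xff at position 0 is consistent with a UTF-16 little-endian byte order mark (FF FE), which some Excel text exports may write. Here is a minimal sketch that checks for a BOM before choosing an encoding. `guess_bom_encoding` is a hypothetical helper name, and the default of UTF-8 is an assumption:

```python
import codecs

def guess_bom_encoding(path, default="utf-8"):
    """Pick a codec name from the file's byte order mark, if it has one."""
    with open(path, "rb") as f:
        head = f.read(4)
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default  # no BOM found: fall back to the caller's guess

# Usage (file name taken from the question):
# enc = guess_bom_encoding("xronia.txt")
# with open("xronia.txt", "r", encoding=enc) as f:
#     for x in f:
#         print(x)
```

The "utf-16" and "utf-32" codecs read the BOM themselves to determine byte order, so the helper does not need to distinguish the LE and BE variants.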

UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 11: invalid start byte

I am trying to read this xlsx file:
tran=np.loadtxt("s.xlsx",delimiter=",")
x_train=tran[:,1:45]
y_train=tran[:,45]
test=np.loadtxt("s.xlsx",delimiter=",")
x_test=test[:,45]
y_test=test[:,:]
Traceback (most recent call last):
File "Example_1.py", line 11, in <module>
tran=np.loadtxt("s.xlsx",delimiter=",")
File "/home/siddharth/.local/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1086, in loadtxt
first_line = next(fh)
File "/usr/lib/python2.7/codecs.py", line 314, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 11: invalid start byte
This is the error I get when I run the program.
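An .xlsx file is a ZIP archive of XML parts rather than delimited text, which would explain why a text loader such as np.loadtxt hits undecodable bytes near the start. Here is a minimal sketch for checking this before choosing a reader. `looks_like_xlsx` is a hypothetical helper name, and it assumes the conventional part name xl/workbook.xml:

```python
import zipfile

def looks_like_xlsx(path):
    """Return True if `path` is a ZIP archive containing an Excel workbook part."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        # Assumption: the workbook lives at the conventional location.
        return "xl/workbook.xml" in zf.namelist()

# Usage (file name taken from the question):
# print(looks_like_xlsx("s.xlsx"))
```

If this returns True, the file needs a spreadsheet reader (for example pandas.read_excel), or it has to be exported to CSV before a text loader can parse it.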

Not capable to splitlines from a file (python3)

So I'm doing what I always do when I read the file:
code:
f= open(filename,'r')
t= f.read().splitlines()
print(t)
but I'm getting a UnicodeDecodeError and I don't know why.
The error:
Traceback (most recent call last):
File "try.py", line 21, in <module>
t= f.read().splitlines()
File "/Users/jamilaldani/miniconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 307: invalid start byte
As described in a few places on this site, and in AndreaConte's comment, this is likely a file stored in a different encoding (i.e. not UTF-8).
This answer may help: https://stackoverflow.com/a/19706723/70131
This one may also help if you're willing to lose some data: https://stackoverflow.com/a/12468274/70131
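One common way to handle this, sketched here without claiming it is what the linked answers recommend, is to try a short list of encodings and only fall back to lossy decoding at the end. The encoding list is an assumption (byte 0x96 is an en dash in cp1252, which makes that codec a plausible second guess), and `read_lines` is a hypothetical helper name:

```python
def read_lines(path, encodings=("utf-8", "cp1252")):
    """Read a text file, trying each encoding in order."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc).splitlines()
        except UnicodeDecodeError:
            continue  # this codec rejected some byte: try the next one
    # Last resort (lossy): undecodable bytes become U+FFFD.
    return raw.decode(encodings[0], errors="replace").splitlines()

# Usage:
# t = read_lines(filename)
# print(t)
```

A single-byte codec such as cp1252 accepts almost any input, so it should come after stricter codecs such as UTF-8 in the list.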

Why html2text module throws UnicodeDecodeError?

I have a problem with the html2text module. It raises a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)
Example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib
h = html2text.HTML2Text()
h.ignore_links = True
html = urllib.urlopen( "http://google.com" ).read()
print h.handle( html )
I have also tried h.handle( unicode( html, "utf-8" ) ) with no success. Any help would be appreciated.
EDIT:
Traceback (most recent call last):
File "test.py", line 12, in <module>
print h.handle(html)
File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
return self.optwrap(self.close())
File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)
The issue is easily reproducible when the source is not decoded, but everything works fine when you decode your source correctly. You also get the error if you reuse the parser!
You can try this out with a known good Unicode source, such as http://www.ltg.ed.ac.uk/~richard/unicode-sample.html.
If you don't decode the response to unicode, the library fails:
>>> h = html2text.HTML2Text()
>>> h.handle(html)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Now, if you reuse the HTML2Text object, its state is not cleared; it still holds the incorrect data, so even passing in Unicode will fail:
>>> h.handle(html.decode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
You need to use a new object and it'll work just fine:
>>> h = html2text.HTML2Text()
>>> result = h.handle(html.decode('utf8'))
>>> len(result)
12750
>>> type(result)
<type 'unicode'>
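The session above is Python 2. The same two points (decode first, use a fresh parser) carry over to Python 3; here is a minimal sketch of the decoding step. `to_text` is a hypothetical helper name, and the UTF-8 fallback is an assumption rather than something html2text requires:

```python
def to_text(raw, declared_charset=None, fallback="utf-8"):
    """Decode raw response bytes to str before handing them to a parser."""
    if isinstance(raw, str):
        return raw  # already decoded
    try:
        return raw.decode(declared_charset or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong charset: decode lossily instead of failing.
        return raw.decode(fallback, errors="replace")

# Usage sketch (requires html2text and network access):
# import html2text, urllib.request
# with urllib.request.urlopen("http://google.com") as resp:
#     html = to_text(resp.read(), resp.headers.get_content_charset())
# text = html2text.HTML2Text().handle(html)  # fresh parser for each document
```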
