Why does the html2text module throw UnicodeDecodeError? - python

I have a problem with the html2text module: it raises a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte
0xbe in position 6: ordinal not in range(128)
Example :
#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib
h = html2text.HTML2Text()
h.ignore_links = True
html = urllib.urlopen( "http://google.com" ).read()
print h.handle( html )
I have also tried h.handle( unicode( html, "utf-8" ) ) with no success. Any help would be appreciated.
EDIT :
Traceback (most recent call last):
File "test.py", line 12, in <module>
print h.handle(html)
File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
return self.optwrap(self.close())
File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)

The issue is easily reproducible when you don't decode the source, but everything works fine when you decode it correctly. You also get the error if you reuse the parser!
You can try this out with a known good Unicode source, such as http://www.ltg.ed.ac.uk/~richard/unicode-sample.html.
If you don't decode the response to unicode, the library fails:
>>> h = html2text.HTML2Text()
>>> h.handle(html)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Now, if you reuse the HTML2Text object, its state is not cleared; it still holds the incorrect data, so even passing in Unicode will now fail:
>>> h.handle(html.decode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
You need to use a new object and it'll work just fine:
>>> h = html2text.HTML2Text()
>>> result = h.handle(html.decode('utf8'))
>>> len(result)
12750
>>> type(result)
<type 'unicode'>
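The "decode first, then parse" advice can be sketched in Python 3 roughly as follows. The helper `decode_html` and its parameters are hypothetical names, not part of html2text or urllib; the sketch assumes you know (or can guess) the charset and want a fallback instead of an exception.

```python
def decode_html(raw, declared=None, fallbacks=("utf-8", "latin-1")):
    """Return (text, encoding_used) for a bytes payload.

    `declared` is the charset named by the server or the page, if known.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Only reached if latin-1 (which accepts any byte) was removed.
    return raw.decode("utf-8", "replace"), "utf-8 (lossy)"


page = "<p>caf\u00e9 \u00be</p>".encode("utf-8")
text, used = decode_html(page, declared="utf-8")
print(used, text)
# `text` is now a str and is what should be handed to a fresh parser object.
```

Note that latin-1 never fails, so it is only a last resort: it always produces some text, but not necessarily the right characters.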

Related

LookupError: unknown encoding: utf8r

When I try this code:
f = open("xronia.txt", "r")
for x in f:
    print(x)
I always get this error:
Traceback (most recent call last):
File "C:\Users\Desktop\PYTHON\Προγραμματισμός Σταύρος\disekta.py", line 2, in <module>
lines=fo.readlines()
File "C:\Users\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1253.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 0: character maps to <undefined>
I have tried encoding='utf8', but it didn't work. The file is an Excel file saved as .txt (according to a site I read). I am new to this, so any help is welcome.
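A byte 0xff at position 0 is how a UTF-16 little-endian byte order mark (FF FE) begins, so one thing worth checking is whether the file was saved as UTF-16. This is a guess from the error message, not something the question confirms. The sketch below sniffs the first bytes for a known BOM; `sniff_bom_encoding` is a hypothetical helper, and the demo writes its own temporary file rather than touching xronia.txt.

```python
import codecs
import os
import tempfile

# Longest signatures first: the UTF-32 LE BOM starts with the UTF-16 LE one.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom_encoding(path, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if none."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


# Demo: write a small UTF-16 LE file with a BOM, then read it back.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as fh:
    fh.write(codecs.BOM_UTF16_LE + "2016\t2020\n".encode("utf-16-le"))

encoding = sniff_bom_encoding(demo_path)
with open(demo_path, "r", encoding=encoding) as fh:
    lines = fh.read().splitlines()
os.remove(demo_path)
print(encoding, lines)
```

If no BOM is found the helper just returns the default, so a failure to decode after that still means the encoding has to be identified some other way.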

How to read and understand the .hcc file with Python?

I have a .hcc file that I am trying to read, but I am getting an error.
This is what I tried:
chardetect 2016.hcc
2016.hcc: windows-1253 with confidence 0.2724130248827703
I have tried the following:
>>> with open("2016.hcc","r",encoding="windows-1253") as f:
...     print(f.read())
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python35\lib\encodings\cp1253.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9c in position 232: character maps to <undefined>
Then I tried this without specifying an encoding:
>>> with open("2016.hcc","r") as f:
...     print(f.read())
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python35\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 284: character maps to <undefined>
After opening the file in binary mode, I was able to read it, but none of the content was understandable.
Here is the sample file: 2016.hcc
Please let me know how I can do that.
UPDATED ATTEMPT:
>>> with open("2016.hcc","r",encoding="utf-16") as f:
...     print(f.read())
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python35\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
File "C:\Python35\lib\encodings\utf_16.py", line 61, in _buffer_decode
codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 15390-15391: illegal encoding
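When several codecs fail and chardet reports low confidence, it may be that the file is not text at all. A rough way to check, offered only as a sketch, is to read the bytes and see how many are printable; `text_ratio` is a hypothetical helper and the sample bytes below are made up, not taken from 2016.hcc.

```python
def text_ratio(data):
    """Fraction of bytes that are printable ASCII or common whitespace."""
    if not data:
        return 1.0
    textual = sum(1 for byte in data if 32 <= byte < 127 or byte in (9, 10, 13))
    return textual / len(data)


# Made-up sample; for a real file use: data = open(path, "rb").read(4096)
sample = b"\x9c\x9d\x00\x01abc"
ratio = text_ratio(sample)
preview = sample[:16].hex()  # hex dump of the first bytes, useful for spotting a file signature
print(round(ratio, 2), preview)
```

A low ratio suggests a binary or compressed format, in which case no text encoding will help and the format itself has to be identified.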

Python - cannot decode html (urllib)

I'm trying to write the HTML of a web page to a file, but I have a problem decoding the characters:
import urllib.request
response = urllib.request.urlopen("https://www.google.com")
charset = response.info().get_content_charset()
print(response.read().decode(charset))
Last line causes error:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in
position 6079: ordinal not in range(128)
response.info().get_content_charset() returns iso-8859-2, but if I check the content of the response without decoding (print(response.read())), the HTML meta tag declares "utf-8". If I use "utf-8" in the decode call, there is a similar problem:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position
6111: invalid start byte
What's going on?
You can ignore invalid characters using
response.read().decode("utf-8", 'ignore')
Instead of ignore there are other options, e.g. replace
https://www.tutorialspoint.com/python/string_encode.htm
https://docs.python.org/3/howto/unicode.html#the-string-type
(There is also str.encode(encoding='UTF-8',errors='strict') for strings.)
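To make the error handlers concrete, here is a small Python 3 sketch using the same byte (0xb6) and character ('\u015b') that appear in the tracebacks; the word itself is an invented example. Note that the first traceback in the question is a UnicodeEncodeError: decoding with iso-8859-2 succeeded, and the failure came when print had to encode the text for an ASCII output stream.

```python
raw = "\u015bwiat".encode("iso-8859-2")  # b'\xb6wiat'

# Strict UTF-8 decoding fails, because 0xb6 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

ignored = raw.decode("utf-8", "ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD
correct = raw.decode("iso-8859-2")         # the charset the server declared

# Encoding for an ASCII-only output fails in strict mode; an error handler
# such as backslashreplace keeps the information instead of raising.
escaped = correct.encode("ascii", "backslashreplace")

print(strict_error, repr(ignored), repr(replaced), correct, escaped)
```

Both 'ignore' and 'replace' lose data, so they are best kept for cases where the real encoding cannot be determined.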

Resolving ascii codec can't decode byte in position ordinal not in range

I've seen all of the other posts and done quite a bit of research but I am still scratching my head.
Here is the problem:
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a=u'My Mate\u2019s'
>>> b='\xe2\x80\x99s BBQ'
>>> print a
My Mate’s
>>> print b
’s BBQ
So, the variables print fine on their own, but printing a concatenation:
>>> print a+b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
gives a decode error. So, I try to decode the string:
>>> print a.decode('utf-8')+b.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
The error changes into an encode error. So, I try a couple of ways to inform the encoding:
>>> print a.decode('utf-8').encode('utf-8')+b.decode('utf-8').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>> print a.decode('ascii','ignore')+b.decode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>> print a.decode('utf-8').encode('ascii','ignore') +b.decode('utf-8').encode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>>
The error persists no matter what I try.
I suppose the problem might be very simple. I'd appreciate someone helping with an explanation of what's going on, and how to resolve this.
I have python 2.7 on ubuntu.
b is encoded as UTF-8 so you have to .decode it to Unicode.
print a + b.decode('utf-8')
Tested in Python 2.7.6 on Ubuntu.
If you want both in UTF-8 you can do:
print a.encode('utf-8') + b
I'll explain why each one of your attempts doesn't work:
a + b # the default decoding is ascii which cannot decode UTF-8
a.decode('utf-8')+b.decode('utf-8') # a is already Unicode; to decode it, Python 2 first encodes it implicitly with ASCII, hence the UnicodeEncodeError
Again you don't need to decode Unicode.
a.decode('utf-8').encode('utf-8')+b.decode('utf-8').encode('utf-8')
You keep trying to decode Unicode. What you should do instead is to encode it, or to decode b.
a.decode('ascii','ignore')+b.decode('ascii','ignore')
And finally you again try to decode Unicode. The point to be made here is that UTF-8 is an encoding. You decode from UTF-8 to Unicode.
a.decode('utf-8').encode('ascii','ignore') +b.decode('utf-8').encode('ascii','ignore')
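For comparison, here is the same pair of values in Python 3, where text (str) and bytes are separate types. This is a sketch of the equivalent situation, not the Python 2.7 session from the question: Python 3 refuses to mix the two types instead of attempting an implicit ASCII decode, and str has no decode method at all.

```python
a = "My Mate\u2019s"          # text (str)
b = b"\xe2\x80\x99s BBQ"      # UTF-8 encoded bytes

# Mixing the two raises TypeError rather than guessing an encoding.
try:
    a + b
    mix_error = None
except TypeError as exc:
    mix_error = type(exc).__name__

as_text = a + b.decode("utf-8")    # decode the bytes, result is str
as_bytes = a.encode("utf-8") + b   # or encode the text, result is bytes

print(mix_error, as_text, as_bytes)
```

Either direction works as long as both operands end up as the same type, which is the same rule the answer applies in Python 2.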

Error while using urllib.request.urlopen in Python

What's wrong with this code?
>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
        print(line.decode("utf-8"))
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=windows-1251"><title>Google</title><script>window.google={kEI:"XMECT7XyDcGn0AWFk7ywAQ",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},https:function(){return window.location.protocol=="https:"},kEXPI:"33492,35300",kCSI:{e:"33492,35300",ei:"XMECT7XyDcGn0AWFk7ywAQ"},authuser:0,
ml:function(){},kHL:"uk",time:function(){return(new Date).getTime()},log:function(a,b,c,e){var d=new Image,g=google,h=g.lc,f=g.li,j="";d.onerror=(d.onload=(d.onabort=function(){delete h[f]}));h[f]=d;if(!c&&b.search("&ei=")==-1)j="&ei="+google.getEI(e);var i=c||"/gen_204?atyp=i&ct="+a+"&cad="+b+j+"&zx="+google.time(),k=/^http:/i;if(k.test(i)&&google.https()){google.ml(new Error("GLMM"),false,{src:i});
delete h[f];return}d.src=i;g.li=f+1},lc:[],li:0,Toolbelt:{},y:{},x:function(a,b){google.y[a.id]=
[a,b];return false}};
window.google.sn="webhp";window.google.timers={};window.google.startTick=function(a,b){window.google.timers[a]={t:{start:(new Date).getTime()},bfr:!(!b)}};window.google.tick=function(a,b,c){if(!window.google.timers[a])google.startTick(a);window.google.timers[a].t[b]=c||(new Date).getTime()};google.startTick("load",true);try{}catch(u){}
var _gjwl=location;function _gjuc(){var e=_gjwl.href.indexOf("#");if(e>=0){var a=_gjwl.href.substring(e);if(a.indexOf("&q=")>0||a.indexOf("#q=")>=0){a=a.substring(1);if(a.indexOf("#")==-1){for(var c=0;c<a.length;){var d=c;if(a.charAt(d)=="&")++d;var b=a.indexOf("&",d);if(b==-1)b=a.length;var f=a.substring(d,b);if(f.indexOf("fp=")==0){a=a.substring(0,c)+a.substring(b,a.length);b=c}else if(f=="cad=h")return 0;c=b}_gjwl.href="/search?"+a+"&cad=h";return 1}}}return 0}function _gjp(){!(window._gjwl.hash&&
window._gjuc())&&setTimeout(_gjp,500)};
Traceback (most recent call last):
File "<pyshell#109>", line 2, in <module>
print(line.decode("utf-8"))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 2364: invalid continuation byte
Google sends you the text in windows-1251 encoding, as declared in the meta tag. This will work:
>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
        print(line.decode("cp1251"))
That's your failing line (last part of it):
>>> line
b'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Im\xe1genes</a>'
>>> line.decode()
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
line.decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 62: invalid continuation byte
The failing byte comes from a Spanish word that has an accent:
>>> bite = 0xe1
>>> bite
225
>>> chr(225)
'á'
Decoding it as Latin-1 works:
>>> line.decode('latin-1')
'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Imágenes</a>'
By the way, Imágenes is Spanish for images.
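The two answers suggest different codecs, and both decode without raising, so it may help to see what each one produces for the same byte. A short Python 3 sketch, using only the accented word from the failing line:

```python
line = b"Im\xe1genes"

results = {}
for name in ("utf-8", "latin-1", "cp1251"):
    try:
        results[name] = line.decode(name)
    except UnicodeDecodeError as exc:
        results[name] = "error: " + exc.reason

for name, value in results.items():
    print(name, "->", value)
```

A single-byte codec such as latin-1 or cp1251 maps every byte above 0x7f to some character, so the absence of an exception does not prove the choice is right; the charset declared by the page or the Content-Type header is the better guide.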
