HTML Unescape by converting custom elements to ASCII? - python

I need to unescape some HTML Entities from a string.
However for few characters like “’“”' I would like to replace with nearest ASCII. All the others needs to be stripped off.
How can I do that in Python ? I tried the following snippet but it doesn't do "nearest" the way I want it.
import HTMLParser
import unicodedata
parser = HTMLParser.HTMLParser()
parsed = parser.unescape("‘")
nearest = unicodedata.normalize('NFKD', parsed).encode('ascii','ignore')
nearest is empty in the above code. Can I supply an argument to HTMLParser.unescape to convert it to ASCII quotes? I want to supply custom mapping like this : {'&lsquo':'"','&rsquo':'"'} where the items in maps should be converted to ASCII.
xml.sax.parse has some an API unescape(html_text, entities={' ': ' ', """: '"'}), does HTMLParser have something similar.

Related

Python converting characters to URL encoding

I am trying to convert texts into URLs, but certain characters are not being converted as I'm expecting. For example:
>>> import urllib
>>> my_text="City of Liège"
>>> my_url=urllib.parse.quote(my_text,safe='')
>>> my_url
'City%20of%20Li%C3%A8ge'
The spaces get converted properly, however, the "è" should get converted into %E8, but it is returned as %C3%A8. What am I missing ?
I am using Python 3.6.
Your string is UTF-8 encoded, and the URL encoded string reflects this.
0xC3A8 is the UTF-8 encoding of the Unicode value U+00E8, which is described as "LATIN SMALL LETTER E WITH GRAVE".
In order to get the string you are after, you need to let Python know which codepage you're using, like this:
my_text=bytes("City of Liège",'cp1252')

Is there a function in python to convert html entities to percent encoding?

I am retrieving Japanese and Chinese text from a website in the form of JSON using urllib2 and converting them to HTML entities using encode(xmlcharrefreplace).
I then use curl to post the same content(after making minor changes) back on the website using percent encoding. My code works fine for English text with special characters, but I need to convert all Japanese/Chinese characters from html encoding to percent encoding.
Is there a function in Python which could do this magic?
PS: For English text, I have my own function to convert special chars to percent encoding. I cannot use this method for the Japanese/Chinese characters as there are too many of them.
You want to combine two things:
HTML decoding
URL encoding
Here is an example (Python3):
>>> import html
>>> html.unescape('{')
'{'
>>> import urllib.parse
>>> urllib.parse.quote('{')
'%7B'

Converting html source content into readable format with Python 2.x

Python 2.7
I have a program that gets video titles from the source code of a webpage but the titles are encoded in some HTML format.
This is what I've tried so far:
>>> import urllib2
>>> urllib2.unquote('£')
'£'
So that didn't work...
Then I tried:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('£')
u'\xa3'
as you can see that doesn't work either nor any combination of the two.
I managed to find out that '£' is an HTML character entity name. The '\xa3' I wasn't able to find out.
Does anyone know how to do this, how to convert HTML content into a readable format in python?
£ is the html character entity for the POUND SIGN, which is unicode character U+00A3. You can see this if you print it:
>>> print u'\xa3'
£
When you use unescape(), you converted the character entity to it's native unicode character, which is what u'\xa3' means--a single U+00A3 unicode character.
If you want to encode this into another format (e.g. utf-8), you would do so with the encode method of strings:
>>> u'\xa3'.encode('utf-8')
'\xc2\xa3'
You get a two-byte string representing the single "POUND SIGN" character.
I suspect that you are a bit unclear about how string encodings work in general. You need to convert your string from bytes to unicode (see this answer for one way to do that with urllib2), then unescape the html, then (possibly) convert the unicode into whatever output encoding you need.
Why doesn't that work?
In [1]: s = u'\xa3'
In [2]: s
Out[2]: u'\xa3'
In [3]: print s
£
When it comes to unescaping html entities I always used: http://effbot.org/zone/re-sub.htm#unescape-html.
The video title strings use HTML entities to encode special characters, such as ampersands and pound signs.
The \xa3 is the Python Unicode character literal for the pound sign (£). In your example, Python is displaying the __repr__() of a Unicode string, which is why you see the escapes. If you print this string, you can see it represents the pound sign:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('£')
u'\xa3'
>>> print h.unescape('£')
£
lxml, BeautifulSoup or PyQuery does the job pretty well. Or combination of these ;)

Converting html entity to text

I have ’ in my HTML file (which is a right curly quote) and I want to convert it to text (if possible).
I tried using HTMLParser and BeautifulSoup but to no success.
>>> h = HTMLParser.HTMLParser()
>>> h.unescape("'")
u"'"
>>> h.unescape("’")
u'\x92' # I was hoping for a right curly quote here.
My goal is very simple: Take the html input and output all the text (without any html codes).
"right curly quote" is not an ascii character. u'\x92' is the python representation of the unicode character representing it and not some "html code".
To display it properly in your terminal, use print h.unescape("’").encode('utf-8') (or whatever you terminal's charset is).

Unescape Python Strings From HTTP

I've got a string from an HTTP header, but it's been escaped.. what function can I use to unescape it?
myemail%40gmail.com -> myemail#gmail.com
Would urllib.unquote() be the way to go?
I am pretty sure that urllib's unquote is the common way of doing this.
>>> import urllib
>>> urllib.unquote("myemail%40gmail.com")
'myemail#gmail.com'
There's also unquote_plus:
Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
In Python 3, these functions are urllib.parse.unquote and urllib.parse.unquote_plus.
The latter is used for example for query strings in the HTTP URLs, where the space characters () are traditionally encoded as plus character (+), and the + is percent-encoded to %2B.
In addition to these there is the unquote_to_bytes that converts the given encoded string to bytes, which can be used when the encoding is not known or the encoded data is binary data. However there is no unquote_plus_to_bytes, if you need it, you can do:
def unquote_plus_to_bytes(s):
if isinstance(s, bytes):
s = s.replace(b'+', b' ')
else:
s = s.replace('+', ' ')
return unquote_to_bytes(s)
More information on whether to use unquote or unquote_plus is available at URL encoding the space character: + or %20.
Yes, it appears that urllib.unquote() accomplishes that task. (I tested it against your example on codepad.)

Categories