Converting html entity to text - python

I have &#146; in my HTML file (which is a right curly quote) and I want to convert it to text (if possible).
I tried using HTMLParser and BeautifulSoup, but with no success.
>>> h = HTMLParser.HTMLParser()
>>> h.unescape("&#39;")
u"'"
>>> h.unescape("&#146;")
u'\x92' # I was hoping for a right curly quote here.
My goal is very simple: Take the html input and output all the text (without any html codes).

"right curly quote" is not an ascii character. u'\x92' is the python representation of the unicode character representing it and not some "html code".
To display it properly in your terminal, use print h.unescape("’").encode('utf-8') (or whatever you terminal's charset is).
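As a quick check of that unescape-then-encode approach, here is a minimal sketch assuming a UTF-8 terminal; &#8217; (the numeric reference for U+2019 RIGHT SINGLE QUOTATION MARK) is used purely for illustration and is not the entity from the question:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> print h.unescape("&#8217;").encode('utf-8')
’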

Related

Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters

I use python 2.7 and I'm receiving a string from a server (not in unicode!).
Inside that string I find text with unicode escape sequences. For example like this:
<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>
How do I convert those \uxxxx escapes back to utf-8? The answers I found either dealt with &# or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.
Edit:
<\a> is a typo, but I want tolerance for such typos as well; there should only be a reaction to \u.
The example text is meant in proper python syntax like this:
"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
The desired output is in proper python syntax
"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"
Try
>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'
And then you can encode to utf8 as usual.
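For example, chaining that decode with the usual utf-8 encode reproduces the desired output from the question:
>>> s.decode("raw_unicode_escape").encode("utf-8")
'<a href = "http://www.mypage.com/\xd1\x81andmoretext">\xc2\xb2<\\a>'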
Python does contain some special string codecs for cases like this.
In this case, if there are no other characters outside the 32-127 range, you can safely decode your byte string using the "unicode_escape" codec to get a proper Unicode text object in Python (on which your program should perform all textual operations).
Whenever you output that text again, you convert it to utf-8 as usual:
rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here
...
output_text = text.encode("utf-8")
If there are other bytes outside the 32-127 range, the unicode_escape codec
assumes them to be in the latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences you have to do the following (sketched in code right after the steps):
decode the original string using utf-8
encode back to latin1
decode using "unicode_escape"
work on the text
encode back to utf-8
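A rough Python 2 sketch of those steps (the mixed_bytes value is made up for illustration):
# bytes that mix real UTF-8 (the pound sign) with literal \uXXXX escape sequences
mixed_bytes = 'price: \xc2\xa3100, \\u0441 and \\u00b2'
text = mixed_bytes.decode('utf-8')      # 1. decode the original string using utf-8
text = text.encode('latin1')            # 2. encode back to latin1
text = text.decode('unicode_escape')    # 3. decode using "unicode_escape"
# 4. work on the text here
output_text = text.encode('utf-8')      # 5. encode back to utf-8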

Is there a function in python to convert html entities to percent encoding?

I am retrieving Japanese and Chinese text from a website in the form of JSON using urllib2 and converting them to HTML entities using encode(xmlcharrefreplace).
I then use curl to post the same content (after making minor changes) back to the website using percent encoding. My code works fine for English text with special characters, but I need to convert all Japanese/Chinese characters from HTML encoding to percent encoding.
Is there a function in Python which could do this magic?
PS: For English text, I have my own function to convert special chars to percent encoding. I cannot use this method for the Japanese/Chinese characters as there are too many of them.
You want to combine two things:
HTML decoding
URL encoding
Here is an example (Python3):
>>> import html
>>> html.unescape('&#123;')
'{'
>>> import urllib.parse
>>> urllib.parse.quote('{')
'%7B'
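Chained together for CJK text it looks like this (the numeric references below are purely illustrative; they spell 日本語):
>>> import html, urllib.parse
>>> urllib.parse.quote(html.unescape('&#26085;&#26412;&#35486;'))
'%E6%97%A5%E6%9C%AC%E8%AA%9E'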

Ascii to html encoding

How can I encode a string (ASCII) to HTML code?
For example, string = "encoding html"
The result after encoding should be string_encoded = "encoding&#32;html"
I think you just need to use cgi.escape to replace the characters <, > and &. For most cases that will be all you need. Example:
>>> import cgi
>>> cgi.escape("<Foo & Bar>")
'&lt;Foo &amp; Bar&gt;'
The &#32; symbol isn't really needed unless you are forcefully adding a space to the markup, which no library will naturally do for you.
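If the string may also contain double quotes (say, for use inside an attribute value), cgi.escape accepts a quote argument; a small sketch:
>>> import cgi
>>> cgi.escape('<Foo & "Bar">', quote=True)
'&lt;Foo &amp; &quot;Bar&quot;&gt;'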

HTML Unescape by converting custom elements to ASCII?

I need to unescape some HTML Entities from a string.
However, for a few characters like “’“”' I would like to replace them with the nearest ASCII equivalent. All the others need to be stripped out.
How can I do that in Python? I tried the following snippet but it doesn't do "nearest" the way I want it.
import HTMLParser
import unicodedata
parser = HTMLParser.HTMLParser()
parsed = parser.unescape("&lsquo;")
nearest = unicodedata.normalize('NFKD', parsed).encode('ascii','ignore')
nearest is empty in the above code. Can I supply an argument to HTMLParser.unescape to convert it to ASCII quotes? I want to supply a custom mapping like this: {'&lsquo':'"','&rsquo':'"'}, where the items in the map should be converted to ASCII.
xml.sax.saxutils has an API unescape(html_text, entities={'&nbsp;': ' ', '&quot;': '"'}); does HTMLParser have something similar?
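For what it's worth, a minimal sketch of that xml.sax.saxutils call with a custom mapping (the nearest-ASCII values in the mapping are my own choice for illustration):
>>> from xml.sax.saxutils import unescape
>>> unescape("&lsquo;quoted&rsquo;", entities={"&lsquo;": "'", "&rsquo;": "'"})
"'quoted'"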

Converting html source content into readable format with Python 2.x

Python 2.7
I have a program that gets video titles from the source code of a webpage but the titles are encoded in some HTML format.
This is what I've tried so far:
>>> import urllib2
>>> urllib2.unquote('&pound;')
'&pound;'
So that didn't work...
Then I tried:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('&pound;')
u'\xa3'
As you can see, that doesn't work either, nor does any combination of the two.
I managed to find out that '&pound;' is an HTML character entity name. What the '\xa3' is, I wasn't able to find out.
Does anyone know how to do this, how to convert HTML content into a readable format in python?
&pound; is the HTML character entity for the POUND SIGN, which is Unicode character U+00A3. You can see this if you print it:
>>> print u'\xa3'
£
When you use unescape(), you converted the character entity to its native Unicode character, which is what u'\xa3' means: a single U+00A3 Unicode character.
If you want to encode this into another format (e.g. utf-8), you would do so with the encode method of strings:
>>> u'\xa3'.encode('utf-8')
'\xc2\xa3'
You get a two-byte string representing the single "POUND SIGN" character.
I suspect that you are a bit unclear about how string encodings work in general. You need to convert your string from bytes to unicode (see this answer for one way to do that with urllib2), then unescape the html, then (possibly) convert the unicode into whatever output encoding you need.
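A compressed Python 2 sketch of that pipeline (the raw byte string and the utf-8 input charset here are assumptions for illustration):
>>> import HTMLParser
>>> raw = 'Price: &pound;100 &amp; up'             # bytes as received, e.g. from urllib2
>>> text = raw.decode('utf-8')                      # bytes -> unicode
>>> text = HTMLParser.HTMLParser().unescape(text)   # entities -> unicode characters
>>> print text.encode('utf-8')                      # unicode -> whatever output encoding you need
Price: £100 & up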
Why doesn't that work?
In [1]: s = u'\xa3'
In [2]: s
Out[2]: u'\xa3'
In [3]: print s
£
When it comes to unescaping html entities I always used: http://effbot.org/zone/re-sub.htm#unescape-html.
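The linked page takes a regex-based approach, roughly along these lines (a sketch of that style of unescape, not a verbatim copy of the linked function):
import re, htmlentitydefs

def unescape(text):
    # replace &#nnn;, &#xhh; and named &xxx; references with unicode characters
    def fixup(m):
        ref = m.group(1)
        if ref.startswith('#'):
            try:
                if ref[1:2] in ('x', 'X'):
                    return unichr(int(ref[2:], 16))
                return unichr(int(ref[1:]))
            except ValueError:
                pass
        else:
            try:
                return unichr(htmlentitydefs.name2codepoint[ref])
            except KeyError:
                pass
        return m.group(0)   # leave unrecognised references untouched
    return re.sub(r'&(#?\w+);', fixup, text)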
The video title strings use HTML entities to encode special characters, such as ampersands and pound signs.
The \xa3 is the Python Unicode character literal for the pound sign (£). In your example, Python is displaying the __repr__() of a Unicode string, which is why you see the escapes. If you print this string, you can see it represents the pound sign:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('&pound;')
u'\xa3'
>>> print h.unescape('&pound;')
£
lxml, BeautifulSoup or PyQuery do the job pretty well. Or a combination of these ;)
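For instance, a rough BeautifulSoup sketch (the HTML snippet is made up; the parser decodes the entities for you):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>Price: &pound;100 &amp; up</p>', 'html.parser')
>>> print soup.get_text().encode('utf-8')
Price: £100 & up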
