Converting html source content into readable format with Python 2.x - python

Python 2.7
I have a program that gets video titles from the source code of a webpage but the titles are encoded in some HTML format.
This is what I've tried so far:
>>> import urllib2
>>> urllib2.unquote('£')
'£'
So that didn't work...
Then I tried:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('£')
u'\xa3'
as you can see that doesn't work either nor any combination of the two.
I managed to find out that '£' is an HTML character entity name. The '\xa3' I wasn't able to find out.
Does anyone know how to do this, how to convert HTML content into a readable format in python?

£ is the html character entity for the POUND SIGN, which is unicode character U+00A3. You can see this if you print it:
>>> print u'\xa3'
£
When you use unescape(), you converted the character entity to it's native unicode character, which is what u'\xa3' means--a single U+00A3 unicode character.
If you want to encode this into another format (e.g. utf-8), you would do so with the encode method of strings:
>>> u'\xa3'.encode('utf-8')
'\xc2\xa3'
You get a two-byte string representing the single "POUND SIGN" character.
I suspect that you are a bit unclear about how string encodings work in general. You need to convert your string from bytes to unicode (see this answer for one way to do that with urllib2), then unescape the html, then (possibly) convert the unicode into whatever output encoding you need.

Why doesn't that work?
In [1]: s = u'\xa3'
In [2]: s
Out[2]: u'\xa3'
In [3]: print s
£
When it comes to unescaping html entities I always used: http://effbot.org/zone/re-sub.htm#unescape-html.

The video title strings use HTML entities to encode special characters, such as ampersands and pound signs.
The \xa3 is the Python Unicode character literal for the pound sign (£). In your example, Python is displaying the __repr__() of a Unicode string, which is why you see the escapes. If you print this string, you can see it represents the pound sign:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('£')
u'\xa3'
>>> print h.unescape('£')
£

lxml, BeautifulSoup or PyQuery does the job pretty well. Or combination of these ;)

Related

Parse json with \uxxx chracters using python

I have JSON data, which contains a text data field with escape characters such as \n, \u4e0d etc.
Using Python 2.7, my goal is to write it to CSV "as-is" i.e. \n as \n and \u4e0d as \u4e0d. (raw strings)
str(data["text"]).encode('string_escape') works as expected for \n but not for \u, giving the error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e0d' in position 32
If I try data["text"]).encode('utf-8').encode('string_escape') it works but mangles the \u in input like \xe4\xb8\x8d
data = json.loads(line)
writer.writerow(data["text"]).encode('utf-8').encode('string_escape'))
Is there a way to achieve what I need?
Many thanks
One of the challenges of programming is how to write non-display characters such as newline that perform an action instead of displaying a glyph. Python uses the backslash plus additional characters to represent these characters. For strings, the python repr function gives you the backslash-escaped representation of a string as if you were typing it in.
If I type in your example string and print it, ... well I get the new line and the unicode glyph but writing to an ascii csv would result in a unicode decode error.
>>> test = u'\n hello \u4e0d'
>>> print test
hello 不
>>>
But if I print the string representation, its what I originally typed in
>>> print repr(test)
u'\n hello \u4e0d'
>>>
If I don't want the python string part, I can just strip it out
>>> print repr(test)[2:-1]
\n hello \u4e0d
>>>
Which is better depends on what happens to that string next. If you want to get back to the real string later, stick with the python representation and then ast.literal_eval to get it back again.
>>> test2 = repr(test)
>>> original = ast.literal_eval(test2)
>>> original == test
True
You have a unicode string. You want to write it into csv file as it is. Since you can't write a unicode string on file you tried to encoded it and it got some unwanted character like '\x'. Try this solution which will convert unicode string to string without adding any unwanted character -
import ast
data = u' \n \u4e0d'
str_data = ast.literal_eval(json.dumps(data))
writer.writerow(str_data.encode('string_escape'))
Try this technique to write data to your file. First encode the data using base64, and when you want to write to the file just decode it and write that data.
>>> import base64
>>> encoded_data = '\n \u4e0d'
>>> data = base64.b64encode(encoded_data)
>>> data
'CiBcdTRlMGQ='
>>> base64.b64decode(data)
'\n \\u4e0d'
>>>

Python: decoding a string that consists of both unicode code points and unicode text

Parsing some HTML content I got the following string:
АБВ\u003d\"res
The common advice on handling it appears to be to decode using unicode_escape. However, this results in the following:
ÐÐÐ="res
The escaped characters get correctly decoded, but cyrillic letters for some reason get mangled. Other than using regexes to extract everything that looks like a unicode string, decoding only them using unicode_escape and then putting everything into a new string, which other methods exist to decode strings with unicode code points in Python?
unicode_escape treats the input as Latin-1 encoded; any bytes that do not represent a Python string literal escape sequence are decoded mapping bytes directly to Unicode codepoints. You gave it UTF-8 bytes, so the cyrillic characters are represented with 2 bytes each where decoded to two Latin-1 characters each, one of which is U+00D0 Ð, the other unprintable:
>>> print repr('АБВ\\u003d\\"res')
'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print repr('АБВ\\u003d\\"res'.decode('latin1'))
u'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print 'АБВ\\u003d\\"res'.decode('latin1')
ÐÐÐ\u003d\"res
This kind of mis-decoding is called a Mojibake, and can be repaired by re-encoding to Latin-1, then decoding from the correct codec (UTF-8 in your case):
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape')
ÐÐÐ="res
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape').encode('latin1').decode('utf8')
АБВ="res
Note that this will fail if the \uhhhh escape sequences encode codepoints outside of the Latin-1 range (U+0000-U+00FF).
The Python 3 equivalent of the above uses codecs.encode():
>>> import codecs
>>> codecs.decode('АБВ\\u003d\\"res', 'unicode_escape').encode('latin1').decode('utf8')
'АБВ="res'
The regex really is the easiest solution (Python 3):
text = 'АБВ\\u003d\\"re'
re.sub(r'(?i)(?<!\\)(?:\\\\)*\\u([0-9a-f]{4})', lambda m: chr(int(m.group(1), 16)), text)
This works fine with any 4-nibble Unicode escape, and can be pretty easily extended to other escapes.
For Python 2, make all strings u'' strings, and use unichr.

Python convert unicode to ASCII

I have a list of strings with various different characters that are similar to latin ones, I get these from a website that I download from using urllib2. The website is encoded in utf-8. However, after trying quite a few variations, I can't figure out how to convert this to simple ASCII equivalent. So for example, one of the strings I have is:
u'Atl\xc3\xa9tico Madrid'
In plain text it's "Atlético Madrid", what I want, is to change it to just "Atletico Madrid".
If I use simple unidecode on this, I get "AtlA(c)tico Madrid". What am I doing wrong?
You have UTF-8 bytes in a Unicode string. That's not a proper Unicode string, that's a Mojibake:
>>> print u'Atl\xc3\xa9tico Madrid'
Atlético Madrid
Repair your string first:
>>> u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
u'Atl\xe9tico Madrid'
>>> print u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
Atlético Madrid
and Unidecode will give you what you expected:
>>> import unidecode
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid')
'AtlA(c)tico Madrid'
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8'))
'Atletico Madrid'
Better still would be to read your data correctly in the first place; you appear to have decoded the data as Latin-1 (or perhaps the Windows CP-1252 codepage) rather than as UTF-8.

Converting html entity to text

I have ’ in my HTML file (which is a right curly quote) and I want to convert it to text (if possible).
I tried using HTMLParser and BeautifulSoup but to no success.
>>> h = HTMLParser.HTMLParser()
>>> h.unescape("'")
u"'"
>>> h.unescape("’")
u'\x92' # I was hoping for a right curly quote here.
My goal is very simple: Take the html input and output all the text (without any html codes).
"right curly quote" is not an ascii character. u'\x92' is the python representation of the unicode character representing it and not some "html code".
To display it properly in your terminal, use print h.unescape("’").encode('utf-8') (or whatever you terminal's charset is).

How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?

For example, if I have a unicode string, I can encode it as an ASCII string like so:
>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'
However, I have e.g. this ASCII string:
'\u003foo\u003e'
... that I want to turn into the same ASCII string as in my first example above:
'<foo/>'
It took me a while to figure this one out, but this page had the best answer:
>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'
There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-saavy).
EDIT: See also Python Standard Encodings.
On Python 2.5 the correct encoding is "unicode_escape", not "unicode-escape" (note the underscore).
I'm not sure if the newer version of Python changed the unicode name, but here only worked with the underscore.
Anyway, this is it.
At some point you will run into issues when you encounter special characters like Chinese characters or emoticons in a string you want to decode i.e. errors that look like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)
For my case (twitter data processing), I decoded as follows to allow me to see all characters with no errors
>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
>>> <foo>
Ned Batchelder said:
It's a little dangerous depending on where the string is coming from,
but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
Actually this method can be made safe like so:
>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]
Mind the triple-quote string and the dash right before the closing 3-quotes.
Using a 3-quoted string will ensure that if the user enters ' \\" ' (spaces added for visual clarity) in the string it would not disrupt the evaluator;
The dash at the end is a failsafe in case the user's string ends with a ' \" ' . Before we assign the result we slice the inserted dash with [:-1]
So there would be no need to worry about what the users enter, as long as it is captured in raw format.
It's a little dangerous depending on where the string is coming from, but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'

Categories