Python - Best way to detect accent HTML escape in a string? - python

Python has some good libraries for converting accented Unicode characters to their closest ASCII equivalents, as well as libraries for turning a code point into its Unicode character.
However, what options are there for checking whether a string contains a Unicode code point or an HTML escape? For example, this string:
Rialta te Venice&#199
It contains &#199, which translates to Ç (LATIN CAPITAL LETTER C WITH CEDILLA). Is there a Python library that detects code points/escapes within a string and outputs the Unicode equivalent?

It's not quite clear to me what you're asking, but here is my best try:
&#199 is an HTML escape, which you can unescape like so:
>>> s = 'Rialta te Venice&#199'
>>> import html
>>> s2 = html.unescape(s); s2
'Rialta te VeniceÇ'
As you've said, there are libraries for normalizing/removing accents:
>>> import unidecode
>>> unidecode.unidecode(s2)
'Rialta te VeniceC'
You don't really need to check whether the string contains non-ASCII characters, as this function leaves unaccented characters unchanged. But you could check anyway using s2.isascii() (available since Python 3.7).
So the complete solution is to use unidecode.unidecode(html.unescape(s)).
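If you do want an explicit "does this string contain an HTML escape?" check, one option is to compare the string with its unescaped form. This is only a minimal sketch using the standard library; the helper name contains_html_escape is made up for illustration.

```python
import html

def contains_html_escape(s):
    """Hypothetical helper: True if html.unescape() would change s."""
    return html.unescape(s) != s

s = 'Rialta te Venice&#199'
if contains_html_escape(s):
    s = html.unescape(s)   # the numeric reference becomes a real character

# isascii() (Python 3.7+) tells you whether any non-ASCII characters remain
print(s, s.isascii())
```

The unescaped result can then be passed on to unidecode.unidecode() as described above.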

Related

How can I make a Python string to include unicode code points?

I want to have an ASCII representation of a string that could contain non-ascii characters such as German umlauts. The way the non-ascii characters should be encoded is as unicode code points, e.g. ß would be \u00df.
The problem is that I have those escape sequences in my database. It gets displayed like I want it, but when the user searches for something, he enters ß and not \u00df. For ß, it works for me to simply make search_query.replace('ß', r'\u00df'), but there are (many) more possible escape sequences.
What I tried
>>> name = 'Ein Spaß'
>>> name.encode('ascii', 'backslashreplace')
b'Ein Spa\\xdf'
>>> name.encode('ascii', 'xmlcharrefreplace')
b'Ein Spa&#223;'
What I want to get:
'Ein Spa\\u00df'
As a dumb workaround, stdlib json encoding will use the 4 digit unicode escapes:
>>> import ast, json
>>> name = 'Ein Spaß'
>>> json.dumps(name)
'"Ein Spa\\u00df"'
>>> ast.literal_eval(json.dumps(name)) == name
True
However, this will not really solve your search problem robustly. You'll need to normalize the query text before searching. And you'll want to normalize unicode data on the way into the database, too - or use a db + ORM which handles such details for you.
See this answer for details about a better tool for the job here: unicodedata.normalize.
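As a rough sketch of what "normalize both the query and the stored text" could look like: the function name normalize_for_search and the choice of NFKC plus casefold() are my assumptions here, not something prescribed above.

```python
import unicodedata

def normalize_for_search(text):
    """Hypothetical helper: bring text into one canonical, caseless form."""
    # NFKC composes characters, so 'e' + COMBINING ACUTE ACCENT becomes 'é';
    # casefold() is an aggressive lowercase intended for caseless matching.
    return unicodedata.normalize('NFKC', text).casefold()

stored = 'Cafe\u0301 Mu\u0308ller'   # decomposed form, as it might sit in a database
query = 'café müller'                # composed form, as a user might type it
print(normalize_for_search(stored) == normalize_for_search(query))
```

Note that casefold() maps ß to ss, which may or may not be what you want for German search terms.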
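Encode each character as ASCII if possible; otherwise replace it with its code point, written as a \uXXXX escape.
ord() returns a character's code point as an integer:
new = []
for e in name:
    try:
        # ASCII characters pass through unchanged
        new.append(e.encode("ascii").decode())
    except UnicodeEncodeError:
        # everything else becomes a backslash-u escape
        new.append("\\u%04x" % ord(e))
"".join(new)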
If the data in your database is stored as escaped unicode, you can use codecs.decode with encoding set to unicode_escape:
>>> import codecs
>>> name = "Ein Spa\\u00df"
>>> codecs.decode(name, "unicode_escape")
'Ein Spaß'
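One caveat, as far as I know: the unicode_escape codec treats bytes outside of escapes as Latin-1, so a string that already contains non-ASCII characters may come out mangled. A more conservative alternative is to touch only the escape sequences themselves. The sketch below pairs such a decoder with a matching encoder; both helper names are hypothetical, and literal backslashes in the input are not handled specially.

```python
import re

_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{8})')

def to_ascii_escapes(text):
    """Hypothetical helper: keep ASCII, write everything else as an escape."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)
        elif cp <= 0xFFFF:
            parts.append('\\u%04x' % cp)   # four hex digits
        else:
            parts.append('\\U%08x' % cp)   # eight hex digits beyond the BMP
    return ''.join(parts)

def decode_escapes(text):
    """Hypothetical helper: replace each escape with its character."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

escaped = to_ascii_escapes('Ein Spaß')
print(escaped)
print(decode_escapes(escaped))
```

With these two, the search query can be escaped before it is compared with the database content, or the database content can be decoded once and stored as real Unicode.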

How to cope with diacritics while trying to match with regex in Python

Trying to use regular expression with unicode html escapes for diacritics:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
htmlstring=u'''/">čćđš</a>.../">España</a>'''
print re.findall( r'/">(.*?)</a', htmlstring, re.U )
produces :
[u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
Any help, please?
This appears to be an encoding question. Your code is working as it should. Were you expecting something different? Strings prefixed with u are unicode literals. In their repr, \u introduces an escape with four hex digits and \x an escape with two hex digits; each escape stands for a single Unicode character. If you print your results (instead of looking at their repr), you will see that you already have the result you were looking for:
results = [u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
for result in results:
    print result
čćđš
España
In your code (i.e. in your list), you see the representation of these unicode literals:
for result in results:
    print result.__repr__()
u'\u010d\u0107\u0111\u0161' # what shows up in your list
u'Espa\xf1a'
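The same distinction exists in Python 3, although there repr() no longer escapes printable non-ASCII characters; the built-in ascii() produces the escaped form instead. A small sketch:

```python
results = ['\u010d\u0107\u0111\u0161', 'Espa\xf1a']
for result in results:
    print(result)          # the characters themselves
    print(ascii(result))   # the escaped representation of the same string
```

In both cases the string is identical; only the way it is displayed differs.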
Incidentally, it appears that you are trying to parse html with regexes. You should try BeautifulSoup or something similar instead. It will save you a major headache down the road.
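If you would rather not install anything, the standard library's html.parser (Python 3) can do the same job for simple cases. The following is only an illustrative sketch: the class name LinkTextParser and the sample markup are made up, and a real page will probably need more care.

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collect the text content of every <a> element (illustrative only)."""

    def __init__(self):
        super().__init__()    # convert_charrefs=True by default
        self.links = []
        self._buffer = None   # None means "not inside an <a>"

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._buffer = []

    def handle_data(self, data):
        # Character references have already been unescaped at this point
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._buffer is not None:
            self.links.append(''.join(self._buffer))
            self._buffer = None

parser = LinkTextParser()
parser.feed('<a href="/1/">čćđš</a>...<a href="/2/">Espa&ntilde;a</a>')
parser.close()
print(parser.links)
```

Because the parser unescapes entities for you, the diacritics arrive as ordinary Unicode characters regardless of how the page spelled them.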

Python convert unicode to ASCII

I have a list of strings containing various characters that are similar to Latin ones. I get them from a website that I download using urllib2. The website is encoded in UTF-8. However, after trying quite a few variations, I can't figure out how to convert them to a simple ASCII equivalent. For example, one of the strings I have is:
u'Atl\xc3\xa9tico Madrid'
In plain text it's "Atlético Madrid"; what I want is to change it to just "Atletico Madrid".
If I use simple unidecode on this, I get "AtlA(c)tico Madrid". What am I doing wrong?
You have UTF-8 bytes in a Unicode string. That's not a proper Unicode string, that's Mojibake:
>>> print u'Atl\xc3\xa9tico Madrid'
AtlÃ©tico Madrid
Repair your string first:
>>> u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
u'Atl\xe9tico Madrid'
>>> print u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
Atlético Madrid
and Unidecode will give you what you expected:
>>> import unidecode
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid')
'AtlA(c)tico Madrid'
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8'))
'Atletico Madrid'
Better still would be to read your data correctly in the first place; you appear to have decoded the data as Latin-1 (or perhaps the Windows CP-1252 codepage) rather than as UTF-8.
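For reference, here is roughly how both routes look in Python 3. This is a sketch only: fix_mojibake is a hypothetical helper, and the round trip through Latin-1 is a heuristic that could also "repair" text that was never broken.

```python
def fix_mojibake(text):
    """Hypothetical helper: undo a UTF-8-read-as-Latin-1 mix-up if possible."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except UnicodeError:
        # Not this kind of mojibake; leave the text alone.
        return text

raw = b'Atl\xc3\xa9tico Madrid'                  # bytes as downloaded
decoded = raw.decode('utf-8')                    # correct from the start
repaired = fix_mojibake('Atl\xc3\xa9tico Madrid')  # repaired after the fact
print(decoded, repaired)
```

Decoding the downloaded bytes as UTF-8 right away is the cleaner route; the repair step is a fallback for data that has already been decoded incorrectly.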

Converting html source content into readable format with Python 2.x

Python 2.7
I have a program that gets video titles from the source code of a webpage but the titles are encoded in some HTML format.
This is what I've tried so far:
>>> import urllib2
>>> urllib2.unquote('&pound;')
'&pound;'
So that didn't work...
Then I tried:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('&pound;')
u'\xa3'
As you can see, that doesn't work either, nor does any combination of the two.
I managed to find out that '&pound;' is an HTML character entity name. I wasn't able to find out what '\xa3' is.
Does anyone know how to convert HTML content into a readable format in Python?
&pound; is the HTML character entity for the POUND SIGN, which is Unicode character U+00A3. You can see this if you print it:
>>> print u'\xa3'
£
When you use unescape(), you convert the character entity to its native unicode character, which is what u'\xa3' means: a single U+00A3 unicode character.
If you want to encode this into another format (e.g. utf-8), you would do so with the encode method of strings:
>>> u'\xa3'.encode('utf-8')
'\xc2\xa3'
You get a two-byte string representing the single "POUND SIGN" character.
I suspect that you are a bit unclear about how string encodings work in general. You need to convert your string from bytes to unicode (see this answer for one way to do that with urllib2), then unescape the html, then (possibly) convert the unicode into whatever output encoding you need.
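In Python 3 terms, those three steps could be sketched like this. The sample bytes are invented, and UTF-8 is assumed as both the input and the output encoding.

```python
import html

raw = b'Price: &pound;5 &amp; up'   # bytes as they might arrive from the network
text = raw.decode('utf-8')          # 1. bytes -> str
title = html.unescape(text)         # 2. resolve the HTML entities
encoded = title.encode('utf-8')     # 3. optional: back to bytes for output
print(title)
print(encoded)
```

Step 3 is only needed when you write to something that expects bytes; for printing or further string processing, the str from step 2 is what you want.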
Why doesn't that work?
In [1]: s = u'\xa3'
In [2]: s
Out[2]: u'\xa3'
In [3]: print s
£
When it comes to unescaping html entities I always used: http://effbot.org/zone/re-sub.htm#unescape-html.
The video title strings use HTML entities to encode special characters, such as ampersands and pound signs.
The \xa3 is the Python Unicode character literal for the pound sign (£). In your example, Python is displaying the __repr__() of a Unicode string, which is why you see the escapes. If you print this string, you can see it represents the pound sign:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('&pound;')
u'\xa3'
>>> print h.unescape('&pound;')
£
lxml, BeautifulSoup or PyQuery does the job pretty well. Or combination of these ;)

Python and character normalization

Hello
I retrieve UTF-8 text data from a foreign source that contains special characters such as u"ıöüç", and I want to normalize them to English, e.g. "ıöüç" -> "iouc". What would be the best way to achieve this?
I recommend using the Unidecode module:
>>> from unidecode import unidecode
>>> unidecode(u'ıöüç')
'iouc'
Note how you feed it a unicode string and it outputs a byte string. The output is guaranteed to be ASCII.
It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg) then unidecode is the way to go.
If you just want to remove accents from accented letters, then you could try decomposing your string using normalization form NFKD (this converts the accented letter á to a plain letter a followed by U+0301 COMBINING ACUTE ACCENT) and then discarding the accents (which belong to the Unicode character class Mn — "Mark, nonspacing").
import unicodedata
def remove_nonspacing_marks(s):
    "Decompose the unicode string s and remove non-spacing marks."
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')
The simplest way I found:
unicodedata.normalize('NFKD', s).encode("ascii", "ignore")
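One thing to watch for with the normalization-based approaches: they only help with characters that actually decompose into a base letter plus a combining mark. The Turkish dotless ı from the question has no such decomposition, so it either survives unchanged or is dropped by the "ignore" error handler. A small Python 3 sketch (strip_marks is just an illustrative name):

```python
import unicodedata

def strip_marks(s):
    """Illustrative helper: decompose s and drop non-spacing marks (Mn)."""
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

kept = strip_marks('ıöüç')       # the dotless i is still there
dropped = unicodedata.normalize('NFKD', 'ıöüç').encode('ascii', 'ignore')
print(strip_marks('Atlético'), kept, dropped)
```

Here kept is 'ıouc' and dropped is b'ouc' (bytes in Python 3), so if you need "iouc" for this input, Unidecode is the more suitable tool.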
import unicodedata
unicodedata.normalize(form, unistr)  # form is one of 'NFC', 'NFKC', 'NFD', 'NFKD'
http://docs.python.org/library/unicodedata.html
