Python and character normalization

Hello
I retrieve text-based UTF-8 data from a foreign source which contains special characters such as u"ıöüç", and I want to normalize them to their closest English equivalents, e.g. "ıöüç" -> "iouc". What would be the best way to achieve this?

I recommend using the Unidecode module:
>>> from unidecode import unidecode
>>> unidecode(u'ıöüç')
'iouc'
Note how you feed it a unicode string and it outputs a byte string (on Python 3, unidecode returns a str instead). Either way, the output is guaranteed to contain only ASCII characters.

It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg) then unidecode is the way to go.
If you just want to remove accents from accented letters, you could instead decompose your string using normalization form NFKD (this converts the accented letter á to a plain letter a followed by U+0301 COMBINING ACUTE ACCENT) and then discard the accents, which belong to the Unicode character category Mn ("Mark, nonspacing").
import unicodedata

def remove_nonspacing_marks(s):
    """Decompose the unicode string s and remove non-spacing marks."""
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')

The simplest way I found:
unicodedata.normalize('NFKD', s).encode("ascii", "ignore")
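Note that on Python 3 this returns bytes, not str, and any character that has no decomposition (such as the dotless ı from the question) is silently dropped by the "ignore" handler rather than transliterated:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u"ıöüç").encode("ascii", "ignore")
b'ouc'
If you need every character mapped to something, unidecode is the safer choice.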

import unicodedata
unicodedata.normalize()
http://docs.python.org/library/unicodedata.html
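A minimal demonstration of what the module gives you for this problem:
>>> import unicodedata
>>> unicodedata.name(u'ç')
'LATIN SMALL LETTER C WITH CEDILLA'
>>> unicodedata.normalize('NFD', u'ç')      # decomposes to 'c' + U+0327
'ç'
>>> unicodedata.category(u'\u0327')         # combining marks are category 'Mn'
'Mn'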

Related

Python - Best way to detect accent HTML escape in a string?

Python has some good libraries to convert Unicode accented characters to their closest ASCII equivalents, as well as libraries to encode a code point to its Unicode character.
However, what options are there to check whether a string contains a Unicode code point or an HTML escape? For example, this string:
Rialta te Venice&#199
has the &#199, which translates to Ç (LATIN CAPITAL LETTER C WITH CEDILLA). Is there a Python library that detects code points/escapes within a string and outputs the Unicode equivalent?
It's not quite clear to me what you're asking, but here is my best try:
&#199 is an HTML escape, which you can unescape like so:
>>> s = 'Rialta te Venice&#199'
>>> import html
>>> s2 = html.unescape(s); s2
'Rialta te VeniceÇ'
As you've said, there are libraries for normalizing/removing accents:
>>> import unidecode
>>> unidecode.unidecode(s2)
'Rialta te VeniceC'
You don't really need to check whether it has non-ASCII code points, as unidecode won't change non-accented characters. But you could check anyway using s2.isascii() (available since Python 3.7).
So the complete solution is to use unidecode.unidecode(html.unescape(s)).
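Putting both steps together on the example string (assuming the unidecode package is installed):
>>> import html, unidecode
>>> unidecode.unidecode(html.unescape('Rialta te Venice&#199'))
'Rialta te VeniceC'
html.unescape follows HTML5 parsing rules, so it accepts the numeric reference even without the trailing semicolon.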

Char to Unicode in Python [duplicate]

I am trying to use the encode method of Python strings to return the unicode escape codes for characters, like this:
>>> print( 'ф'.encode('unicode_escape').decode('utf8') )
\u0444
This works fine with non-ASCII characters, but for ASCII characters it just returns the ASCII characters themselves:
>>> print( 'f'.encode('unicode_escape').decode('utf8') )
f
The desired output would be \u0066. This script is for pedagogical purposes.
How can I get the unicode hex codes for ALL characters?
ord can be used for this; there is no need for encoding/decoding at all:
>>> '"\\U{:08x}"'.format(ord('f')) # ...or \u{:04x} if you prefer
'"\\U00000066"'
>>> eval(_)
'f'
You'd have to do so manually; if you assume that all your input is within the Unicode BMP, then a straightforward regex will probably be fastest. This replaces every character with its \uhhhh escape:
import re

def unicode_escaped(s, _pattern=re.compile(r'[\x00-\uffff]')):
    return _pattern.sub(lambda m: '\\u{:04x}'.format(
        ord(m.group(0))), s)
I've explicitly limited the pattern to the BMP, so that non-BMP code points (which would need \U00hhhhhh escapes) pass through untouched rather than being mangled.
Demo:
>>> print(unicode_escaped('foo bar ф'))
\u0066\u006f\u006f\u0020\u0062\u0061\u0072\u0020\u0444
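If your input may contain characters outside the BMP, a variant of the same idea (my sketch; the name unicode_escaped_full is made up) can fall back to \U escapes for astral code points:
import re

def unicode_escaped_full(s, _pattern=re.compile(r'.', re.DOTALL)):
    # \u for code points inside the BMP, \U for everything above it
    def escape(m):
        cp = ord(m.group(0))
        if cp <= 0xffff:
            return '\\u{:04x}'.format(cp)
        return '\\U{:08x}'.format(cp)
    return _pattern.sub(escape, s)
Demo:
>>> print(unicode_escaped_full('f𝄞'))
\u0066\U0001d11e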

How can I make a Python string include unicode code points?

I want to have an ASCII representation of a string that could contain non-ASCII characters such as German umlauts. Those non-ASCII characters should be encoded as unicode code points, e.g. ß would become \u00df.
The problem is that I have those escape sequences in my database. The text gets displayed the way I want, but when users search for something, they enter ß and not \u00df. For ß alone, search_query.replace('ß', r'\u00df') works, but there are (many) more possible escape sequences.
What I tried
>>> name = 'Ein Spaß'
>>> name.encode('ascii', 'backslashreplace')
b'Ein Spa\\xdf'
>>> name.encode('ascii', 'xmlcharrefreplace')
b'Ein Spa&#223;'
What I want to get:
'Ein Spa\\u00df'
As a dumb workaround, the stdlib json encoder will use 4-digit \u escapes:
>>> import json, ast
>>> name = 'Ein Spaß'
>>> json.dumps(name)
'"Ein Spa\\u00df"'
>>> ast.literal_eval(json.dumps(name)) == name
True
However, this will not really solve your search problem robustly. You'll need to normalize the query text before searching, and you'll want to normalize unicode data on the way into the database too - or use a db + ORM which handles such details for you.
See this answer for details on a better tool for the job: unicodedata.normalize.
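For example, NFC normalization makes the composed and decomposed spellings of the same character compare equal, which is what you want before matching a query against stored text:
>>> import unicodedata
>>> 'e\u0301' == '\u00e9'        # 'é' spelled two different ways
False
>>> unicodedata.normalize('NFC', 'e\u0301') == unicodedata.normalize('NFC', '\u00e9')
True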
Encode each character as ASCII if possible; otherwise, replace it with its code point written as a unicode escape. (ord returns a character's code point as a base-10 integer.)
new = []
for e in name:
    try:
        new.append(e.encode("ascii").decode())
    except UnicodeEncodeError:
        new.append(u"\\u%04x" % ord(e))
"".join(new)
If the data in your database is stored as escaped unicode, you can use codecs.decode with encoding set to unicode_escape:
>>> import codecs
>>> name = "Ein Spa\\u00df"
>>> codecs.decode(name, "unicode_escape")
'Ein Spaß'
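One caveat worth knowing (my note, not part of the original answer): when codecs.decode is given a str rather than bytes, the unicode_escape codec first encodes the input as Latin-1, so stored strings that already contain characters above U+00FF will raise an error instead of decoding:
>>> codecs.decode("€ Spa\\u00df", "unicode_escape")
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)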

How can I convert a unicode string into string literals in Python 2.7?

Python 2.7:
I would like to do something unusual. Most people want to convert string literals to more human-readable strings. I would like to convert the following list of unicode strings into their literal forms:
hallöchen
Straße
Gemüse
freø̯̯nt
to their codepoint forms that look something like this:
\u3023\u2344
You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert especially that word into its literal form, such that I can use REGEX to remove the extra breve.
I am not sure what the terminology is for these things; please correct me if I am mistaken.
You can use the str.encode([encoding[, errors]]) function with the unicode_escape encoding:
>>> s = u'freø̯̯nt'
>>> print(s.encode('unicode_escape'))
b'fre\\xf8\\u032f\\u032fnt'
You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert especially that word into its literal form, such that I can use REGEX to remove the extra breve.
You don't need codecs.encode(unicode_string, 'unicode-escape') in this case. There are no string literals in memory, only string objects.
A Unicode string is a sequence of Unicode code points in Python. The same user-perceived character can be written using different code points, e.g., 'Ç' could be written as u'\u00c7' or as u'\u0043\u0327'.
You can use the NFKD Unicode normalization form to make sure the "breves" are separate code points, so you don't miss any when they are duplicated:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import unicodedata
s = u"freø̯̯nt"
# remove consecutive duplicate "breves"
print(re.sub(u'\u032f+', u'\u032f', unicodedata.normalize('NFKD', s)))
Could you explain why your re.sub command does not have a \1+ backreference to ensure that the breves are consecutive characters (like @Paulo Freitas's answer)?
re.sub('c+', 'c', text) makes sure that there is no 'cc', 'ccc', 'cccc', etc. in the text. Sometimes the regex does unnecessary work by replacing a single 'c' with 'c', but the result is the same: no consecutive duplicate 'c's in the text.
The regex from @Paulo Freitas's answer should also work:
no_duplicates = re.sub(u'(\u032f)\\1+', r'\1', unicodedata.normalize('NFKD', s))
It performs the replacement only for duplicates. You can measure the time performance and see which regex runs faster if this is a bottleneck in your application.
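A minimal sketch of such a measurement with timeit (illustrative only; the numbers depend on your data and machine):
import re
import timeit
import unicodedata

s = unicodedata.normalize('NFKD', u'freø̯̯nt') * 10000

print(timeit.timeit(lambda: re.sub(u'\u032f+', u'\u032f', s), number=100))
print(timeit.timeit(lambda: re.sub(u'(\u032f)\\1+', r'\1', s), number=100))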

Approximately converting unicode string to ascii string in python

I don't know whether this is trivial or not, but I'd need to convert a unicode string to an ascii string, and I wouldn't like to have all those escape chars around. I mean, is it possible to have an "approximate" conversion to some quite similar ascii character?
For example: Gavin O’Connor gets converted to Gavin O\x92Connor, but I'd really like it to be just converted to Gavin O'Connor. Is this possible? Did anyone write some util to do it, or do I have to manually replace all the chars?
Thank you very much!
Marco
Use the Unidecode package to transliterate the string.
>>> import unidecode
>>> unidecode.unidecode(u'Gavin O’Connor')
"Gavin O'Connor"
import unicodedata
unicode_string = u"Gavin O’Connor"
print unicodedata.normalize('NFKD', unicode_string).encode('ascii','ignore')
Output:
Gavin OConnor
Note that U+2019 (the curly apostrophe) has no compatibility decomposition, so the 'ignore' handler simply drops it; this approach strips combining accents but does not transliterate punctuation the way Unidecode does.
Here's the document that describes the normalization forms: http://unicode.org/reports/tr15/
b = a.encode('utf-8').decode('ascii', 'ignore')
should also work, but note that it simply drops every non-ASCII byte, so the apostrophe disappears entirely rather than being replaced.
There is a technique to strip accents from characters, but other characters need to be directly replaced. Check this article: http://effbot.org/zone/unicode-convert.htm
Try simple character replacement
str1 = "“I am the greatest”, said Gavin O’Connor"
print(str1)
print(str1.replace("’", "'").replace("“","\"").replace("”","\""))
PS: add # -*- coding: utf-8 -*- to the top of your .py file if you get an encoding error (this is only needed on Python 2).
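A table-driven alternative (my sketch, not from the answer) that scales better as the list of replacements grows:
# map curly punctuation to ASCII equivalents with str.translate
PUNCT_MAP = str.maketrans({'’': "'", '‘': "'", '“': '"', '”': '"'})

str1 = "“I am the greatest”, said Gavin O’Connor"
print(str1.translate(PUNCT_MAP))   # "I am the greatest", said Gavin O'Connor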
