Combined diacritics do not normalize with unicodedata.normalize (Python)

I understand that unicodedata.normalize converts diacritics to their non-diacritic counterparts:
import unicodedata
''.join(c for c in unicodedata.normalize('NFD', u'B\u0153uf')
        if unicodedata.category(c) != 'Mn')
My question is (and can be seen in this example): does unicodedata have a way to replace combined-character diacritics with their counterparts? (u'œ' becomes 'oe')
If not, I assume I will have to hunt these down one by one, but then I might as well compile my own dict with all such characters and their counterparts and forget about unicodedata altogether...

There's a bit of confusion about terminology in your question. A diacritic is a mark that can be added to a letter or other character but generally does not stand on its own. (Unicode also uses the more general term combining character.) What normalize('NFD', ...) does is to convert precomposed characters into their components.
Anyway, the answer is that œ is not a precomposed character. It's a typographic ligature:
>>> unicodedata.name(u'\u0153')
'LATIN SMALL LIGATURE OE'
The unicodedata module provides no method for splitting ligatures into their parts. But the data is there in the character names:
import re
import unicodedata

_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})')

def split_ligatures(s):
    """
    Split the ligatures in `s` into their component letters.
    """
    def untie(l):
        m = _ligature_re.match(unicodedata.name(l))
        if not m: return l
        elif m.group(1): return m.group(2)
        else: return m.group(2).lower()
    return ''.join(untie(l) for l in s)
>>> split_ligatures(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
u'Boeuf IJsselmeer ffotograff'
(Of course you wouldn't do it like this in practice: you'd preprocess the Unicode database to generate a lookup table as you suggest in your question. There aren't all that many ligatures in Unicode.)
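As a sketch of that preprocessing step (the `$`-anchored variant of the name pattern and the BMP-only scan are my assumptions, not part of the answer above), the lookup table can be generated once from the Unicode database and applied with str.translate:

```python
import re
import unicodedata

# Scan the Basic Multilingual Plane once and collect every Latin
# ligature by name. The '$' anchor skips names such as
# 'LATIN SMALL LIGATURE LONG S T', whose last word is not the full
# letter sequence.
_lig_name = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})$')

table = {}
for cp in range(0x10000):
    if 0xD800 <= cp <= 0xDFFF:      # skip the surrogate range
        continue
    m = _lig_name.match(unicodedata.name(chr(cp), ''))
    if m:
        table[chr(cp)] = m.group(2) if m.group(1) else m.group(2).lower()

# str.maketrans accepts a dict mapping characters to replacement strings.
_translation = str.maketrans(table)

def split_ligatures(s):
    return s.translate(_translation)

print(split_ligatures('B\u0153uf \u0132sselmeer'))  # Boeuf IJsselmeer
```

The scan runs once at import time; after that, split_ligatures is a single table-driven translate call.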

Related

Separate accents from their letters

I'm looking for a function that will take a compound letter and split it as if you had to type it on a US-INTL keyboard, like so:
'ȯ' becomes ".o"
'â' becomes "^a"
'ë' becomes "\"e"
'è' becomes "`e"
'é' becomes "'e"
'ñ' becomes "~n"
'ç' becomes ",c"
etc.
But when searching for this issue I can only find functions to remove accents entirely, which is not what I want.
Here's what I want to accomplish:
Expand this string:
ër íí àha lá eïsch
into this string:
"er 'i'i `aha l'a e"isch
You can possibly use a dictionary to match the characters with their replacements and then iterate over the string to do the actual replacement.
word_rep = dict(zip(['ȯ', 'â', 'ë', 'è', 'é', 'ñ', 'ç'],
                    ['.o', '^a', '"e', '`e', "'e", '~n', ',c']))
mystr = 'ër íí àha lá eïsch'
for key, value in word_rep.items():
    mystr = mystr.replace(key, value)
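One caveat with the plain str.replace approach: a key like 'ë' only matches input in the same (precomposed) form. A sketch, under the assumption that input may arrive decomposed, which normalizes to NFC before looking anything up:

```python
import unicodedata

# Same table as above; NFC-normalize the input first so that a
# decomposed 'e' + COMBINING DIAERESIS still matches the precomposed key.
word_rep = {'ȯ': '.o', 'â': '^a', 'ë': '"e', 'è': '`e',
            'é': "'e", 'ñ': '~n', 'ç': ',c'}

def expand(text):
    text = unicodedata.normalize('NFC', text)
    for key, value in word_rep.items():
        text = text.replace(key, value)
    return text

print(expand('e\u0308r'))   # decomposed input still becomes "er
```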
The following uses Unicode decomposition to separate combining marks from Latin letters, a regular expression to swap the combining character and its letter, and then a translation table to convert each combining mark to the key used on the international keyboard:
import unicodedata as ud
import re

replacements = {'\N{COMBINING DOT ABOVE}': '.',
                '\N{COMBINING CIRCUMFLEX ACCENT}': '^',
                '\N{COMBINING DIAERESIS}': '"',
                '\N{COMBINING GRAVE ACCENT}': '`',
                '\N{COMBINING ACUTE ACCENT}': "'",
                '\N{COMBINING TILDE}': '~',
                '\N{COMBINING CEDILLA}': ','}

combining = ''.join(replacements.keys())
typing = ''.join(replacements.values())
translation = str.maketrans(combining, typing)

s = 'ër íí àha lá eïsch'
s = ud.normalize('NFD', s)
s = re.sub(rf'([aeiounc])([{combining}])', r'\2\1', s)
s = s.translate(translation)
print(s)
Output:
"er 'i'i `aha l'a e"isch

Remove special characters such as smileys from a string but keep German special characters

I know how to remove unwanted characters from a string, like smileys etc. However, some languages like German have special characters, too.
This is my current code:
import unicodedata
string = "süß 😆😋😉"
uni_str = str(unicodedata.normalize('NFKD', string).encode('ascii', 'ignore'))
Is there a possibility to keep the German special characters but delete the other unwanted characters, such as smileys like 😆😋😉, so that uni_str will hold the letters "süß" at the end?
Currently, the smileys get deleted, but the German characters either get transformed into other letters or deleted, too.
The smileys in the example are just exemplary and can be any kind of unwanted character.
I am using Python 3.6 and Windows 10
You could do something simple like this (just add the German letters):
def filter_characters(value):
    allowed_characters = " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return ''.join(c for c in value if c in allowed_characters)
Edit:
Another possibility is to create the allowed_characters with the help of the string module:
import string
allowed_characters = string.printable + 'öäüß'
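A runnable sketch of that whitelist idea (the exact set of "wanted" characters is an assumption; adjust it to your data):

```python
import string

# Whitelist: printable ASCII plus the German letters; everything else
# (emoji included) is dropped.
allowed = set(string.printable + 'äöüÄÖÜß')

def keep_allowed(text):
    return ''.join(c for c in text if c in allowed)

print(keep_allowed('süß 😆😋😉'))  # the smileys are gone
```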
I don't know which characters are special or otherwise unwanted in your world, but maybe removing all characters with a "Symbol, Other" property is something useful:
import unicodedata as ud

def remove_symbols(text):
    return ''.join(c for c in text if ud.category(c) != 'So')
This will keep any letters, digits, punctuation symbols, and white space characters, so the following example won't lose the fancy quotes or the "é":
>>> remove_symbols('Was für ein «süßes» Café! 😆')
'Was für ein «süßes» Café! '
Have a look at the available categories on Wikipedia or search for it on unicode.org.
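If currency and math signs should go as well, the same idea extends to all Symbol subcategories ('Sm', 'Sc', 'Sk', 'So'); a sketch:

```python
import unicodedata as ud

# Drop every character whose general category starts with 'S':
# math symbols (Sm), currency signs (Sc), modifier symbols (Sk),
# and the 'So' category that covers emoji.
def remove_all_symbols(text):
    return ''.join(c for c in text if not ud.category(c).startswith('S'))

print(remove_all_symbols('5€ + «süß» 😆'))
```

Letters, digits, punctuation (including the « » quotes), and whitespace all survive, since their categories start with L, N, P, and Z.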

Python generic function to replace special characters

I have been looking for a while now but I am not able to find a proper solution.
I have a database with Dutch, French and German words which all have their special characters. e.g. é, è, ß, ç, etc...
For some cases, like in a URL, I would like to replace these with alphanumeric characters: e, e, ss, c, etc., respectively.
Is there a generic function or Python package that does this?
I could do this with Regex of course, but something generic would be great here.
Thanks.
Try this package: https://pypi.python.org/pypi/Unidecode
>>> import unidecode
>>> unidecode.unidecode(u'çß')
'css'
As you say, this could be done using a Regex sub. You would of course need to include upper and lowercase variants.
import re

data = "é, è, ß, ç, äÄ"
lookup = {'é': 'e', 'è': 'e', 'ß': 'ss', 'ç': 'c', 'ä': 'a', 'Ä': 'A'}
print(re.sub(r'([éèßçäÄ])', lambda x: lookup[x.group(1)], data))
This would display the following:
e, e, ss, c, aA
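To keep the pattern and the table from drifting apart, the character class can be derived from the dict itself; a sketch:

```python
import re

lookup = {'é': 'e', 'è': 'e', 'ß': 'ss', 'ç': 'c', 'ä': 'a', 'Ä': 'A'}

# Build the character class from the dict keys, so adding a new
# replacement is a one-line change to the dict only.
pattern = re.compile('[%s]' % re.escape(''.join(lookup)))

def replace_special(text):
    return pattern.sub(lambda m: lookup[m.group(0)], text)

print(replace_special('é, è, ß, ç, äÄ'))  # e, e, ss, c, aA
```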
You can almost get away with the built-in Unicode data (unfortunately, a few of your characters break it):
>>> import unicodedata
>>> s = "é, è, ß, ç"
>>> unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
b'e, e, , c'
Here is a solution with the code points hard-coded, taken from http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/:
def latin1_to_ascii(unicrap):
    """This takes a Unicode string and replaces Latin-1 characters with
    something equivalent in 7-bit ASCII. It returns a plain ASCII string.
    This function makes a best effort to convert Latin-1 characters into
    ASCII equivalents. It does not just strip out the Latin-1 characters.
    All characters in the standard 7-bit ASCII range are preserved.
    In the 8th bit range all the Latin-1 accented letters are converted
    to unaccented equivalents. Most symbol characters are converted to
    something meaningful. Anything not converted is deleted.
    """
    xlate = {0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
             0xc6:'Ae', 0xc7:'C',
             0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
             0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
             0xd0:'Th', 0xd1:'N',
             0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
             0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
             0xdd:'Y', 0xde:'th', 0xdf:'ss',
             0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
             0xe6:'ae', 0xe7:'c',
             0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
             0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
             0xf0:'th', 0xf1:'n',
             0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
             0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
             0xfd:'y', 0xfe:'th', 0xff:'y',
             0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
             0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
             0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
             0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
             0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
             0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
             0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
             0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
             0xd7:'*', 0xf7:'/'
             }
    r = ''
    for i in unicrap:
        if ord(i) in xlate:
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += str(i)
    return r
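In modern Python 3 the same table-driven conversion collapses to str.translate, which accepts a dict keyed by code points and deletes characters mapped to None. A sketch with an abbreviated table (only a few sample mappings, not the full recipe above):

```python
# Abbreviated table: a few mappings from the recipe above, with the
# rest of the Latin-1 upper half mapped to None (i.e. deleted).
xlate = {0xc4: 'Ae', 0xe4: 'ae', 0xdf: 'ss', 0xe9: 'e'}
xlate.update({cp: None for cp in range(0x80, 0x100) if cp not in xlate})

def latin1_to_ascii3(s):
    # Characters >= U+0100 are left untouched here; extend the table
    # (or normalize first) if they can occur in your input.
    return s.translate(xlate)

print(latin1_to_ascii3('Ä ß é'))  # Ae ss e
```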
Of course, you could just as easily use a regex, as indicated in the other answers.

How to efficiently remove non-ASCII characters and numbers, but keep accented ASCII characters

I have several strings like this:
s = u'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
s
"awëerwq مرحباмир bròn 1990 23x4 + &23 'we' we's mexicqué"
I couldn't find a way to remove non-Latin text like 'مرحباми' while keeping Latin characters like 'óé', etc. Also, numbers (like '1990') are undesirable in my case. I have used the ASCII flag from re, but I don't know what's wrong with it, because it removes 'óëé' too. It is the same problem with string.printable.
I don't know why
ord('ë')
235
given that the extended ASCII table assigns it 137. The result I would expect is something like this:
x = some_method(s)
"awëerwq bròn 23x4 we we s mexicqué"
Also, I would like the code not to depend on any particular encoding.
Here's a way that might help (Python 3.4):
import unicodedata

def remove_nonlatin(s):
    # The '' default avoids a ValueError for characters without a name.
    s = (ch for ch in s
         if unicodedata.name(ch, '').startswith(('LATIN', 'DIGIT', 'SPACE')))
    return ''.join(s)
>>> s = 'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
>>> remove_nonlatin(s)
'awëerwqbròn 1990 23x4 23 we wes mexicqué'
This grabs the Unicode names of the characters in the string and keeps the characters whose names start with LATIN, DIGIT, or SPACE.
For example, this would match:
>>> unicodedata.name('S')
'LATIN CAPITAL LETTER S'
And this would not:
>>> unicodedata.name('م')
'ARABIC LETTER MEEM'
I'm reasonably sure that latin characters all have unicode names starting with 'LATIN', so this should filter out other writing scripts, while keeping digits and spaces. There's not a convenient one-liner for punctuation, so in this example, exclamation points and such are also filtered out.
You could presumably filter by code point by using something like ord(c) < 0x250, though you may get some things that you aren't expecting. Or, you could try filtering by unicodedata.category. However, the 'letter' category includes letters from a lot of scripts, so you will still end up with some of these: 'م'.
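A sketch of those two suggestions combined, assuming "Latin" is approximated as "code point below U+0250" and that only letters, digits, and whitespace should survive:

```python
import unicodedata

# Keep characters below U+0250 (the end of Latin Extended-B) whose
# general category is Letter (L), Number (N), or Separator (Z).
def keep_latin(text):
    return ''.join(
        c for c in text
        if ord(c) < 0x250 and unicodedata.category(c)[0] in 'LNZ'
    )

s = 'awëerwq\u0645\u0631bròn 1990 + &23'
print(keep_latin(s))
```

Punctuation and symbols are filtered out along with the non-Latin scripts; loosen the category check if you want to keep them.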
I have used ASCII flag from re but I don't know what's wrong with that because it removes 'óëé,...'.
I think you are asking your question wrong. ASCII does not have the characters óëé in it. Take a look here to see the set of all ASCII characters and see how basic it is:
https://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart
It appears that the string you are using is in Unicode since it can support both "مرحباми" as well as "'óëé" at the same time.
In that case, you can find the character ranges you want using
http://jrgraphix.net/research/unicode_blocks.php
and include only the Latin ones (this will filter out Arabic characters for example).
Here's an example:
import re
s = u"مرحباми123"
# prints "123" by keeping all characters from the following ranges:
# 0020 — 007F Basic Latin
# 00A0 — 00FF Latin-1 Supplement
# 0100 — 017F Latin Extended-A
# 0180 — 024F Latin Extended-B
print(''.join(re.findall(r'[\u0020-\u007F\u00A0-\u00FF\u0100-\u017F\u0180-\u024F]+', s)))

How to account for accent characters for regex in Python?

I currently use re.findall to find and isolate words after the '#' character for hash tags in a string:
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
It searches str1 and finds all the hashtags. This works; however, it doesn't account for accented characters like these, for example: áéíóúñü¿.
If one of these letters are in str1, it will save the hashtag up until the letter before it. So for example, #yogenfrüz would be #yogenfr.
I need to be able to account for all accented letters that range from German, Dutch, French and Spanish so that I can save hashtags like #yogenfrüz
How can I go about doing this?
Try the following:
hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)
Regex101 Demo
EDIT
Check the useful comment below from Martijn Pieters.
I know this question is a little outdated, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex. Note that this range also includes the two non-letters × (215) and ÷ (247).
hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)
which will return ['yogenfrüz']
Hope this'll help anyone else.
You may also want to use
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?
Assuming you have loaded your Unicode into a variable called my_unicode, normalizing à into a is this simple:
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
Explicit example:
>>> myfoo = 'àà'
>>> myfoo
'àà'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
b'aa'
check this answer it helped me a lot: How to convert unicode accented characters to pure ascii without accents?
Here's an update to Ibrahim Najjar's original answer based on the comment Martijn Pieters made to the answer and another answer Martijn Pieters gave in https://stackoverflow.com/a/16467505/5302861:
import re
import unicodedata
s = "#ábá123"
n = unicodedata.normalize('NFC', s)
print(n)
c = ''.join(re.findall(r'#\w+', n, re.UNICODE))
print(s, len(s), c, len(c))
