Separate accents from their letters - python

I'm looking for a function that will take a compound letter and split it as if you had to type it on a US-INTL keyboard, like so:
'ȯ' becomes ".o"
'â' becomes "^a"
'ë' becomes "\"e"
'è' becomes "`e"
'é' becomes "'e"
'ñ' becomes "~n"
'ç' becomes ",c"
etc.
But when searching for this issue I can only find functions to remove accents entirely, which is not what I want.
Here's what I want to accomplish:
Expand this string:
ër íí àha lá eïsch
into this string:
"er 'i'i `aha l'a e"isch

You can use a dictionary to match the characters with their replacements and then iterate over the mapping to do the actual replacement:
word_rep = dict(zip(['ȯ', 'â', 'ë', 'è', 'é', 'ñ', 'ç'],
                    ['.o', '^a', '"e', '`e', "'e", '~n', ',c']))
mystr = 'ër íí àha lá eïsch'
for key, value in word_rep.items():
    mystr = mystr.replace(key, value)
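Note that the sample string contains accented letters ('í', 'à', 'á', 'ï') that are not in word_rep, so only the 'ë' would actually be replaced. A sketch with the mapping extended to cover them (assuming the input uses precomposed characters):
# extended mapping covering every accented letter in the sample
word_rep = dict(zip(['ȯ', 'â', 'ë', 'è', 'é', 'ñ', 'ç', 'í', 'à', 'á', 'ï'],
                    ['.o', '^a', '"e', '`e', "'e", '~n', ',c', "'i", '`a', "'a", '"i']))
mystr = 'ër íí àha lá eïsch'
for key, value in word_rep.items():
    mystr = mystr.replace(key, value)
print(mystr)  # "er 'i'i `aha l'a e"isch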

Below uses Unicode decomposition to separate combining marks from latin letters, a regular expression to swap the combining character and its letter, then a translation table to convert the combining mark to the key used on the international keyboard:
import unicodedata as ud
import re

replacements = {'\N{COMBINING DOT ABOVE}': '.',
                '\N{COMBINING CIRCUMFLEX ACCENT}': '^',
                '\N{COMBINING DIAERESIS}': '"',
                '\N{COMBINING GRAVE ACCENT}': '`',
                '\N{COMBINING ACUTE ACCENT}': "'",
                '\N{COMBINING TILDE}': '~',
                '\N{COMBINING CEDILLA}': ','}
combining = ''.join(replacements.keys())
typing = ''.join(replacements.values())
translation = str.maketrans(combining, typing)

s = 'ër íí àha lá eïsch'
s = ud.normalize('NFD', s)                              # decompose into letter + combining mark
s = re.sub(rf'([aeiounc])([{combining}])', r'\2\1', s)  # swap each mark with its letter
s = s.translate(translation)                            # map marks to keyboard keys
print(s)
Output:
"er 'i'i `aha l'a e"isch

Related

Remove special characters from string such as smileys but keep German special characters

I know how to remove unwanted characters in a string, like smileys etc. However, some languages like German have special characters, too.
This is my current code:
import unicodedata
string = "süß 😆😋😉"
uni_str = str(unicodedata.normalize('NFKD', string)
              .encode('ascii', 'ignore'))
Is there a possibility to keep the German special characters but delete the other unwanted characters, such as smileys like 😆😋😉, so that uni_str will hold the letters "süß" at the end?
Currently, the smileys get deleted, but the German characters are either transformed into other vowels or deleted, too.
The smileys in the example are just exemplary and can be any kind of unwanted character.
I am using Python 3.6 and Windows 10
You could do something simple like this (just add the German letters):
def filter_characters(value):
    allowed_characters = " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return ''.join(c for c in value if c in allowed_characters)
Edit:
Another possibility is to create the allowed_characters with the help of the string module:
import string
allowed_characters = string.printable + 'öäüß'
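A quick usage sketch of this approach (adding the uppercase umlauts too, as an assumption):
import string

allowed_characters = string.printable + 'öäüßÖÄÜ'
s = "süß 😆😋😉"
print(''.join(c for c in s if c in allowed_characters))  # 'süß ' - the smileys are dropped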
I don't know which characters are special or otherwise unwanted in your world, but maybe removing all characters with a "Symbol, Other" property is something useful:
import unicodedata as ud

def remove_symbols(text):
    return ''.join(c for c in text if ud.category(c) != 'So')
This will keep any letters, digits, punctuation symbols, and white space characters, so the following example won't lose the fancy quotes or the "é":
>>> remove_symbols('Was für ein «süßes» Café! 😆')
'Was für ein «süßes» Café! '
Have a look at the available categories on Wikipedia or search for it on unicode.org.

Splitting a string using re module of python

I have a string:
import re

s = 'count_EVENT_GENRE in [1,2,3,4,5]'
# I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
# o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]', o/p = 'sum_EVENT_GENRE'
which is fine
My doubt is that for any character in (in)(like), it splits the string s at that character and gives me the first slice (after "cou" it finds a matching char, i.e. 'n'). This happens for any string that contains any character from (in)(like).
Ex: 'percentage_AMOUNT' gives o/p = 'p', as it finds a matching char 'e' after 'p'.
So I want some advice on how to treat (in)(like) as words, not as characters, when the splitting occurs.
Please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
Your code would look like:
import re

ss = ['count_EVENT_GENRE in [1,2,3,4,5]', 'coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
    print(field)
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
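For instance, a minimal sketch of the re.match with r'\w+' alternative mentioned above:
import re

s = 'count_EVENT_GENRE in [1,2,3,4,5]'
m = re.match(r'\w+', s)  # matches letters, digits and underscores at the start
print(m.group())         # count_EVENT_GENRE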
If your value is at the start of the string and starts with lowercase ASCII letters, followed by any number of sequences of _ plus uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re

ss = ['count_EVENT_GENRE in [1,2,3,4,5]', 'coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
    if fieldObj:
        print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

How do I iterate through unicode symbols, not bytes in python?

Given an accented unicode word like u'кни́га', I need to strip the acute (u'книга'), and also change the accent format to u'кни+га', where '+' represents the acute over the preceding letter.
What I do now is using a dictionary of acuted and not acuted symbols:
accented_list = [u'я́', u'и́', u'ы́', u'у́', u'э́', u'а́', u'е́', u'ю́', u'о́']
regular_list = [u'я', u'и', u'ы', u'у', u'э', u'а', u'е', u'ю', u'о']
accent_dict = dict(zip(accented_list, regular_list))
I want to do something like this:
def changeAccentFormat(word):
    for letter in accent_dict:
        if letter in word:
            its_index = word.index(letter)
            word = word[:its_index + 1] + u'+' + word[its_index + 1:]
    return word
But of course it does not work as desired. I noticed that this code:
>>> word = u'кни́га'
>>> for letter in word:
... print letter
gives
к
н
и
´
г
а
(Well, I didn't expect the blank symbol to appear, but nevertheless). So I wonder, what is the simplest way to produce [u'к', u'н', u'и́', u'г', u'а']? Or maybe there is some way to solve my problem without it?
First of all, in regard to iterating over characters instead of bytes, you're already doing it right - your word is a unicode object, not an encoded bytestring.
Now, for combining characters in Unicode:
Many characters that carry combining marks have both a composed and a decomposed form of writing them, the composed form being one code point, and the decomposed form a sequence of two (or more) code points:
See U+00E7, U+0063 and U+0327
So in Python, you could write either form; it will get composed at display time into the same character:
>>> combining_cedilla = u'\u0327'
>>> c_with_cedilla = u'\u00e7'
>>> letter_c = u'\u0063'
>>>
>>> print c_with_cedilla
ç
>>> print letter_c + combining_cedilla
ç
In order to convert between composed and decomposed forms, you can use unicodedata.normalize():
>>> import unicodedata
>>> comp = unicodedata.normalize('NFC', letter_c + combining_cedilla)
>>> decomp = unicodedata.normalize('NFD', c_with_cedilla)
>>>
>>> print comp
ç
>>> print decomp
ç
(NFC stands for "normal form C" (composed), and NFD for "normal form D" (decomposed).)
They still are different forms though - one consisting of one code point, the other of two:
>>> comp == decomp
False
>>> len(comp)
1
>>> len(decomp)
2
However, in your case there simply does not seem to be a precomposed character for the lowercase и with an acute accent (there is one for и with a grave accent).
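A quick check of that claim (the print() calls below also work on Python 2):
import unicodedata

# и + combining acute has no precomposed form, so NFC leaves two code points
print(len(unicodedata.normalize('NFC', u'\u0438\u0301')))  # 2
# и + combining grave composes to U+045D, a single code point
print(len(unicodedata.normalize('NFC', u'\u0438\u0300')))  # 1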
You can produce [u'к', u'н', u'и́', u'г', u'а'] with the regex module.
Here is your word, split into user-perceived characters:
>>> import regex
>>> word = u'кни́га'
>>> len(word)
6
>>> regex.findall(r'\X', word)
['к', 'н', 'и́', 'г', 'а']
>>> len(regex.findall(r'\X', word))
5
Acutes are represented by code point U+0301, COMBINING ACUTE ACCENT, so a simple string replacement should suffice:
>>> print u'кни́га'.replace(u'\u0301', u'+')
кни+га
If you encounter accented characters that are not encoded with a combining accent, unicodedata.normalize should do the trick
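A minimal sketch of that fallback: decompose first so every accent becomes a combining mark, then replace it:
import unicodedata

word = u'café'                                   # precomposed é (U+00E9)
decomposed = unicodedata.normalize('NFD', word)  # é -> e + U+0301
print(decomposed.replace(u'\u0301', u'+'))       # cafe+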

Stripping a unicode text of whatever is not a character

I'm trying to write a simple Python script which takes a text file as an input, deletes every non-literal character, and writes the output in another file.
Normally I would have done two ways:
use a regular expression combined with re.sub to replace every non letter character with empty strings
examine every char in every line and write it to the output only if it was in string.lowercase
But this time the text is The Divine Comedy in Italian (I'm Italian), so there are some Unicode characters like
èéï
and some others. I wrote # -*- coding: utf-8 -*- as the first line of the script, but all that achieves is that Python doesn't signal errors when Unicode chars appear inside the script itself.
Then I tried to include Unicode chars in my regular expression, writing them as, for example:
u'\u00AB'
and it seems to work, but Python, when reading input from a file, doesn't rewrite what it read the same way it read it. For example, some characters get converted into a square root symbol.
What should I do?
unicodedata.category(unichr) will return the category of that code-point.
You can find a description of the categories at unicode.org but the ones relevant to you are the L, N, P, Z and maybe S groups:
Lu Uppercase_Letter an uppercase letter
Ll Lowercase_Letter a lowercase letter
Lt Titlecase_Letter a digraphic character, with first part uppercase
Lm Modifier_Letter a modifier letter
Lo Other_Letter other letters, including syllables and ideographs
...
You might also want to normalize your string first so that diacriticals that can attach to letters do so:
unicodedata.normalize(form, unistr)
Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
Putting all this together:
import unicodedata

file_bytes = ...  # However you read your input
file_text = file_bytes.decode('UTF-8')
normalized_text = unicodedata.normalize('NFC', file_text)
allowed_categories = set([
    'Ll', 'Lu', 'Lt', 'Lm', 'Lo',  # Letters
    'Nd', 'Nl',                    # Digits
    'Po', 'Ps', 'Pe', 'Pi', 'Pf',  # Punctuation
    'Zs',                          # Breaking spaces
])
filtered_text = ''.join(
    [ch for ch in normalized_text
     if unicodedata.category(ch) in allowed_categories])
filtered_bytes = filtered_text.encode('UTF-8')  # ready to be written to a file
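For completeness, a hedged sketch of the surrounding file I/O (the file names are placeholders):
with open('input.txt', 'rb') as fin:
    file_bytes = fin.read()

# ... run the filtering above ...

with open('output.txt', 'wb') as fout:
    fout.write(filtered_bytes)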
import codecs

f = codecs.open('FILENAME', encoding='utf-8')
for line in f:
    print repr(line)  # 1. gives you the Unicode representation
    print line        # 2. gives you the line as written in your file
Hopefully it will help you :)

Combined diacritics do not normalize with unicodedata.normalize (PYTHON)

I understand that unicodedata.normalize converts diacritics to their non-diacritic counterparts:
import unicodedata
''.join(c for c in unicodedata.normalize('NFD', u'B\u0153uf')
        if unicodedata.category(c) != 'Mn')
My question is (and it can be seen in this example, which returns u'Bœuf'): does unicodedata have a way to replace combined char diacritics with their counterparts? (u'œ' becomes 'oe')
If not I assume I will have to put a hit out for these, but then I might as well compile my own dict with all uchars and their counterparts and forget about unicodedata altogether...
There's a bit of confusion about terminology in your question. A diacritic is a mark that can be added to a letter or other character but generally does not stand on its own. (Unicode also uses the more general term combining character.) What normalize('NFD', ...) does is to convert precomposed characters into their components.
Anyway, the answer is that œ is not a precomposed character. It's a typographic ligature:
>>> unicodedata.name(u'\u0153')
'LATIN SMALL LIGATURE OE'
The unicodedata module provides no method for splitting ligatures into their parts. But the data is there in the character names:
import re
import unicodedata
_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})')

def split_ligatures(s):
    """
    Split the ligatures in `s` into their component letters.
    """
    def untie(l):
        m = _ligature_re.match(unicodedata.name(l))
        if not m: return l
        elif m.group(1): return m.group(2)
        else: return m.group(2).lower()
    return ''.join(untie(l) for l in s)
>>> split_ligatures(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
u'Boeuf IJsselmeer ffotograff'
(Of course you wouldn't do it like this in practice: you'd preprocess the Unicode database to generate a lookup table as you suggest in your question. There aren't all that many ligatures in Unicode.)
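A minimal sketch of that preprocessing idea (Python 3; one pass over the code space builds the lookup table, reusing the same ligature regex as above):
import re
import sys
import unicodedata

_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})')

ligature_table = {}
for cp in range(sys.maxunicode + 1):
    m = _ligature_re.match(unicodedata.name(chr(cp), ''))
    if m:
        # keep the case of the original ligature
        ligature_table[chr(cp)] = m.group(2) if m.group(1) else m.group(2).lower()

print(ligature_table.get(u'\u0153'))  # oe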
