Normally I use unicodedata to normalize other Latin-ish text. However, I've come across this and am not sure what to do:
>>> import unicodedata
>>> s = 'Nguyễn Văn Trỗi'
>>> unicodedata.normalize('NFD', s)
'Nguyễn Văn Trỗi'
Is there another module that can normalize more accents than unicodedata? The output I want is:
Nguyen Van Troi
normalize doesn't mean "remove accents"; it converts between composed and decomposed forms:
>>> import unicodedata as ud
>>> a = 'ă'
>>> print(ascii(ud.normalize('NFD',a))) # LATIN SMALL LETTER A + COMBINING BREVE
'a\u0306'
>>> print(ascii(ud.normalize('NFC',a))) # LATIN SMALL LETTER A WITH BREVE
'\u0103'
One way to remove them is then to encode the decomposed form as ASCII, ignoring errors; this works because combining characters are not ASCII. Note, however, that not all international characters have decomposed forms; đ, for example, does not.
>>> s = 'Nguyễn Văn Trỗi'
>>> ud.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'Nguyen Van Troi'
>>> s = 'Ngô Đình Diệm'
>>> ud.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'Ngo inh Diem' # error: Đ has no decomposed form, so it was dropped entirely
You can work around the exceptions with a translation table:
>>> table = {ord('Đ'): 'D', ord('đ'): 'd'}
>>> ud.normalize('NFD',s).translate(table).encode('ascii',errors='ignore').decode('ascii')
'Ngo Dinh Diem'
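As for the actual question (is there another module): outside the standard library, the unidecode package is one option I'm aware of; it transliterates to ASCII and also handles characters like đ that have no decomposed form (pip install unidecode first):
>>> from unidecode import unidecode
>>> unidecode(u'Ngô Đình Diệm')
'Ngo Dinh Diem'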
I have a simple syntax-related question that I would be grateful if someone could answer. I currently have character labels in string format: '0941'.
To print out unicode characters in Python, I can just use the command:
print(u'\u0941')
Now, my question is: how can I convert the label I have ('0941') into the Unicode escape form (u'\u0941')?
Thank you so much!
>>> chr(int('0941',16)) == '\u0941'
True
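Going the other way, ord plus format recovers the original hex label; a small sketch of the round trip:
>>> char = chr(int('0941', 16))
>>> format(ord(char), '04x')
'0941'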
One way to accomplish this without fussing with your numeric keypad is to simply print the character and then copy/paste it as a label.
>>> print("lower case delta: \u03B4")
lower case delta: δ
>>> δ = 42 # copy the lower case delta symbol and paste it to use it as a label
>>> δδ = δ ** 2 # paste it twice to define another label.
>>> δ # at this point, they are just normal labels...
42
>>> δδ
1764
>>> δabc = 737 # using paste, it's just another character in a label
>>> δ123 = 456
>>> δabc, δ123 # exactly like any other alpha character.
(737, 456)
I have encountered a very odd behavior of the built-in function lstrip.
I will explain with a few examples:
print 'BT_NAME_PREFIX=MUV'.lstrip('BT_NAME_PREFIX=') # UV
print 'BT_NAME_PREFIX=NUV'.lstrip('BT_NAME_PREFIX=') # UV
print 'BT_NAME_PREFIX=PUV'.lstrip('BT_NAME_PREFIX=') # UV
print 'BT_NAME_PREFIX=SUV'.lstrip('BT_NAME_PREFIX=') # SUV
print 'BT_NAME_PREFIX=mUV'.lstrip('BT_NAME_PREFIX=') # mUV
As you can see, the function sometimes trims one additional character.
I tried to model the problem, and noticed that it persisted if I:
Changed BT_NAME_PREFIX to BT_NAME_PREFIY
Changed BT_NAME_PREFIX to BT_NAME_PREFIZ
Changed BT_NAME_PREFIX to BT_NAME_PREF
Further attempts have made it even more weird:
print 'BT_NAME=MUV'.lstrip('BT_NAME=') # UV
print 'BT_NAME=NUV'.lstrip('BT_NAME=') # UV
print 'BT_NAME=PUV'.lstrip('BT_NAME=') # PUV - different from before!!!
print 'BT_NAME=SUV'.lstrip('BT_NAME=') # SUV
print 'BT_NAME=mUV'.lstrip('BT_NAME=') # mUV
Could someone please explain what on earth is going on here?
I know I might as well just use array-slicing, but I would still like to understand this.
Thanks
You're misunderstanding how lstrip works. It treats the characters you pass in as a set, and it strips characters from the left for as long as they are in that set, stopping at the first character that isn't. That explains your results: 'M', 'N', and 'P' all occur in 'BT_NAME_PREFIX=', so they get stripped along with the prefix, while 'S' and 'm' do not. With 'BT_NAME=' the set no longer contains 'P', which is why 'PUV' suddenly survives.
Consider:
'abc'.lstrip('ba') # 'c'
It is not removing a substring from the start of the string. To do that, you need something like:
if s.startswith(prefix):
    s = s[len(prefix):]
e.g.:
>>> s = 'foobar'
>>> prefix = 'foo'
>>> if s.startswith(prefix):
...     s = s[len(prefix):]
...
>>> s
'bar'
Or, I suppose you could use a regular expression:
>>> s = 'foobar'
>>> import re
>>> re.sub('^foo', '', s)
'bar'
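As an aside, if you are on Python 3.9 or newer, str.removeprefix does exactly what the startswith/slice idiom above does:
>>> 'BT_NAME_PREFIX=MUV'.removeprefix('BT_NAME_PREFIX=')
'MUV'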
The argument given to lstrip is a set of characters to remove from the left of the string, considered on a character-by-character basis. The phrase as a whole is never matched; only the individual characters are.
S.lstrip([chars]) -> string or unicode

Return a copy of the string S with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping.
You could solve this in a flexible way using regular expressions (the re module):
>>> import re
>>> re.sub('^BT_NAME_PREFIX=', '', 'BT_NAME_PREFIX=MUV')
'MUV'
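To see why the examples in the question behave as they do, a quick membership check (purely illustrative) shows which first letters fall inside the strip set:
>>> 'M' in 'BT_NAME_PREFIX='   # in the set, so 'MUV' loses its 'M'
True
>>> 'S' in 'BT_NAME_PREFIX='   # not in the set, so 'SUV' is left intact
False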
Given an accented Unicode word like u'кни́га', I need to strip the acute (u'книга'), and also change the accent format to u'кни+га', where '+' represents the acute over the preceding letter.
What I do now is use a dictionary of accented and non-accented symbols:
accented_list = [u'я́', u'и́', u'ы́', u'у́', u'э́', u'а́', u'е́', u'ю́', u'о́']
regular_list = [u'я', u'и', u'ы', u'у', u'э', u'а', u'е', u'ю', u'о']
accent_dict = dict(zip(accented_list, regular_list))
I want to do something like this:
def changeAccentFormat(word):
    for letter in accent_dict:
        if letter in word:
            its_index = word.index(letter)
            word = word[:its_index + 1] + u'+' + word[its_index + 1:]
    return word
But of course it does not work as desired. I noticed that this code:
>>> word = u'кни́га'
>>> for letter in word:
...     print letter
gives
к
н
и
´
г
а
(Well, I didn't expect the blank symbol to appear, but nevertheless.) So I wonder, what is the simplest way to produce [u'к', u'н', u'и́', u'г', u'а']? Or maybe there is some way to solve my problem without it?
First of all, in regard to iterating over characters instead of bytes, you're already doing it right: your word is a unicode object, not an encoded bytestring.
Now, for combining characters in Unicode:
Many characters that carry a diacritic have both a composed and a decomposed written form, the composed being one code point, and the decomposed a sequence of two (or more) code points:
See U+00E7, U+0063 and U+0327
In Python you can write either form, and both get rendered as the same character at display time:
>>> combining_cedilla = u'\u0327'
>>> c_with_cedilla = u'\u00e7'
>>> letter_c = u'\u0063'
>>>
>>> print c_with_cedilla
ç
>>> print letter_c + combining_cedilla
ç
In order to convert between composed and decomposed forms, you can use unicodedata.normalize():
>>> import unicodedata
>>> comp = unicodedata.normalize('NFC', letter_c + combining_cedilla)
>>> decomp = unicodedata.normalize('NFD', c_with_cedilla)
>>>
>>> print comp
ç
>>> print decomp
ç
(NFC stands for "normal form C" (composed), and NFD for "normal form D" (decomposed).)
They are still different forms, though: one consists of one code point, the other of two:
>>> comp == decomp
False
>>> len(comp)
1
>>> len(decomp)
2
However, in your case there simply is no precomposed character for the lowercase и with an acute accent (there is one, ѝ, for и with a grave accent).
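You can check this with normalize itself; a small sketch showing that NFC leaves и + combining acute as two code points (no precomposed form exists), but composes и + combining grave into the single code point U+045D:
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0438\u0301')  # и + COMBINING ACUTE ACCENT
u'\u0438\u0301'
>>> unicodedata.normalize('NFC', u'\u0438\u0300')  # и + COMBINING GRAVE ACCENT
u'\u045d'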
You can produce [u'к', u'н', u'и́', u'г', u'а'] with the regex module.
Here is your word split into user-perceived characters:
>>> import regex
>>> word = u'кни́га'
>>> len(word)
6
>>> regex.findall(r'\X', word)
['к', 'н', 'и́', 'г', 'а']
>>> len(regex.findall(r'\X', word))
5
Acute accents are represented by code point U+0301, COMBINING ACUTE ACCENT, so a simple string replacement should suffice:
>>> print u'кни́га'.replace(u'\u0301', u'+')
кни+га
If you encounter accented characters that are not encoded with a combining accent, unicodedata.normalize('NFD', ...) should do the trick.
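Putting both of the asker's desired outputs together, a short sketch (NFD-normalizing first so that any precomposed accents become combining ones):
>>> import unicodedata
>>> word = unicodedata.normalize('NFD', u'кни́га')
>>> print word.replace(u'\u0301', u'')   # strip the acute
книга
>>> print word.replace(u'\u0301', u'+')  # mark the acute with '+'
кни+га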
After parsing a web page with UTF-8 encoding, I realize that I obtain characters that I can't manipulate, though they are readable by means of print.
>>> print data
Ａ　Ｄｅｕｃｅ
>>> data
u'\uff21\u3000\uff24\uff45\uff55\uff43\uff45'
How can I get this into a decent encoding using Python? I would like to obtain:
>>> my_variable
'A Deuce'
(I mean being able to store that text in a variable as a "regular" string.)
I saw several solutions related to this topic but did not find a relevant answer (mainly ones based on encoding/decoding to other charsets).
This functionality is built into the unicodedata module: NFKC normalization applies compatibility decomposition, which maps the full-width letters and the ideographic space to their ASCII equivalents.
>>> import unicodedata
>>> unicodedata.normalize('NFKC', data)
u'A Deuce'
With a little help from this answer:
>>> table = dict([(x + 0xFF00 - 0x20, unichr(x)) for x in xrange(0x21, 0x7F)] + [(0x3000, unichr(0x20))])
>>> data.translate(table)
u'A Deuce'
The translate method takes a dictionary that maps one Unicode code point to another. In this case, it maps the full-width Latin alphabet (which is essentially part of the ASCII character set shifted up to the range 0xFF01-0xFF5E) to the "normal" ASCII character set. For example, 0xFF21 (full-width A) maps to 0x41 (ASCII A), 0xFF22 (full-width B) maps to 0x42 (ASCII B), etc.
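For reference, a Python 3 sketch of the same mapping: chr and range replace unichr and xrange, and str.translate still accepts a dict keyed by code point:
>>> table = {x + 0xFF00 - 0x20: chr(x) for x in range(0x21, 0x7F)}
>>> table[0x3000] = ' '
>>> '\uff21\u3000\uff24\uff45\uff55\uff43\uff45'.translate(table)
'A Deuce'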
Consider using Python 3, which has better printing support for Unicode characters. Here's a sample:
>>> s = u'\uff21\u3000\uff24\uff45\uff55\uff43\uff45'
>>> print(s)
Ａ　Ｄｅｕｃｅ
>>> s
'Ａ\u3000Ｄｅｕｃｅ'
>>> import unicodedata as ud
>>> ud.name('\u3000')
'IDEOGRAPHIC SPACE'
>>> print(ascii(s))
'\uff21\u3000\uff24\uff45\uff55\uff43\uff45'
I understand that unicodedata.normalize converts characters with diacritics to their non-diacritic counterparts:
import unicodedata
''.join(c for c in unicodedata.normalize('NFD', u'B\u0153uf')
        if unicodedata.category(c) != 'Mn')
My question is (and can be seen in this example): does unicodedata have a way to replace combined characters with their multi-letter counterparts? (u'œ' becomes 'oe')
If not, I assume I will have to special-case these, but then I might as well compile my own dict of all such characters and their counterparts and forget about unicodedata altogether...
There's a bit of confusion about terminology in your question. A diacritic is a mark that can be added to a letter or other character but generally does not stand on its own. (Unicode also uses the more general term combining character.) What normalize('NFD', ...) does is to convert precomposed characters into their components.
Anyway, the answer is that œ is not a precomposed character. It's a typographic ligature:
>>> unicodedata.name(u'\u0153')
'LATIN SMALL LIGATURE OE'
The unicodedata module provides no method for splitting ligatures into their parts. But the data is there in the character names:
import re
import unicodedata
_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})')

def split_ligatures(s):
    """
    Split the ligatures in `s` into their component letters.
    """
    def untie(l):
        m = _ligature_re.match(unicodedata.name(l))
        if not m: return l
        elif m.group(1): return m.group(2)
        else: return m.group(2).lower()
    return ''.join(untie(l) for l in s)
>>> split_ligatures(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
u'Boeuf IJsselmeer ffotograff'
(Of course you wouldn't do it like this in practice: you'd preprocess the Unicode database to generate a lookup table as you suggest in your question. There aren't all that many ligatures in Unicode.)
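For the curious, here is a rough sketch of that preprocessing idea (Python 3; the scan over all code points is a one-time cost, and unicodedata.name raises ValueError for unnamed code points, which we skip):
import re
import sys
import unicodedata

_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})')

# Scan the whole code space once and cache every Latin ligature.
ligature_table = {}
for cp in range(sys.maxunicode + 1):
    try:
        name = unicodedata.name(chr(cp))
    except ValueError:
        continue  # unassigned or unnamed code point
    m = _ligature_re.match(name)
    if m:
        parts = m.group(2) if m.group(1) else m.group(2).lower()
        ligature_table[chr(cp)] = parts

def split_ligatures(s):
    # A dict lookup per character instead of a name lookup plus regex match.
    return ''.join(ligature_table.get(c, c) for c in s)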