Remove accents or tildes from a string in Python [duplicate]

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).
I found on the web an elegant way to do this (in Java):
convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
remove all the characters whose Unicode type is "diacritic".
Do I need to install a library such as PyICU, or is this possible with just the Python standard library? And what about Python 3?
Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.

Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.
Example:
>>> from unidecode import unidecode
>>> unidecode('kožušček')
'kozuscek'
>>> unidecode('北亰')
'Bei Jing '
>>> unidecode('François')
'Francois'

How about this:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
This works on greek letters, too:
>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>
The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
And keep in mind that these manipulations may significantly alter the meaning of the text. Accents, umlauts, etc. are not "decoration".

I just found this answer on the Web:
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii
It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than by dropping the non-ASCII characters, because that will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the Unicode characters that are tagged as diacritics.
Edit: this does the trick:
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
unicodedata.combining(c) returns a non-zero (truthy) value if the character c can be combined with the preceding character, that is, mainly if it's a diacritic.
Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:
encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café" # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)

I work on a project that must be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user entries.
Thanks to you, I have created this function that works wonders.
import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from the input string.

    :param text: The input string.
    :type text: String.

    :returns: The processed string.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError):  # unicode is the default on Python 3
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to an id.

    :param text: The input string.
    :type text: String.

    :returns: The processed string.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text
Result:
>>> text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
'montreal_uber_1289_mere_francoise_noel_889'

This handles not only accents, but also "strokes" (as in ø etc.):
import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char
This is the most elegant way I can think of (and it was mentioned by alexis in a comment on this page), although I don't think it is actually very elegant. In fact, it's more of a hack, as pointed out in the comments, since Unicode names are really just names; they give no guarantee of being consistent or anything.
There are still special letters that are not handled by this, such as turned and inverted letters, since their Unicode name does not contain 'WITH'. It depends on what you want to do anyway; I sometimes needed accent stripping to achieve dictionary sort order.
EDIT NOTE:
Incorporated suggestions from the comments (handling lookup errors, Python-3 code).

In my view, the proposed solutions should NOT be accepted answers. The original question asks for the removal of accents, so the correct answer should only do that, not that plus other, unspecified changes.
Simply observe the result of this code, from the accepted answer, where I have changed "Málaga" to "Málagueña":
accented_string = u'Málagueña'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaguena' and is of type 'str'
There is an additional change (ñ -> n), which was not requested in the OQ.
A simple function that does only the requested task, on the lowercase form:
import re

def f_remove_accents(old):
    """
    Removes common accent characters, lower form.
    Uses: regex.
    """
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    return new

gensim.utils.deaccent(text) from Gensim - topic modelling for humans:
>>> from gensim.utils import deaccent
>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
'Sef chomutovskych komunistu dostal postou bily prasek'
Another solution is unidecode.
Note that the suggested solution with unicodedata typically removes accents only in characters that decompose into a base letter plus combining marks; e.g., 'ł' has no decomposition, so the ASCII-encoding variant turns it into '' rather than into 'l'.

In response to @MiniQuark's answer:
I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats.
As a test, I created a test.txt file that looked like this:
Montréal, über, 12.89, Mère, Françoise, noël, 889
I had to include lines 2 and 3 to get it to work (which I found in a Python ticket), as well as incorporate @Jabba's comment:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)
The result:
Montreal
uber
12.89
Mere
Francoise
noel
889
(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)

For performance, here is a perfplot benchmark comparing the regex, join, and unidecode approaches:
import unicodedata
from random import choice

import perfplot
import regex
import text_unidecode

def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))

def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://stackoverflow.com/a/517974/7966259
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])

perfplot.show(
    setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
    kernels=[
        remove_accent_chars_regex,
        remove_accent_chars_join,
        text_unidecode.unidecode,
    ],
    labels=['regex', 'join', 'unidecode'],
    n_range=[2 ** k for k in range(22)],
    equality_check=None, relative_to=0, xlabel='str len',
)

Some languages have combining diacritics as letters of the language, plus accent diacritics that mark stress. I think it is safer to specify explicitly which diacritics you want to strip:
import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))

Here is a short function which strips the diacritics but keeps the non-Latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") rely on the given parallel strings.
Code
from unicodedata import combining, normalize

LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü"
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))
NB. The default argument outliers is evaluated once and not meant to be provided by the caller.
Intended usage
As a key to sort a list of strings in a more “natural” order:
sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)
Output:
['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']
If your strings mix texts and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.
Tests
To make sure the behavior meets your needs, take a look at the pangrams below:
examples = [
    ("hello, world", "hello, world"),
    ("42", "42"),
    ("你好,世界", "你好,世界"),
    (
        "Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
        "des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
    ),
    (
        "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
        "falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
    ),
    (
        "Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
        "љубазни фењерџија чађавог лица хоће да ми покаже штос.",
    ),
    (
        "Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
        "ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
    ),
    (
        "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
        "quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
    ),
    (
        "Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
        "kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
    ),
    (
        "Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
        "glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
    ),
]

for (given, expected) in examples:
    assert remove_diacritics(given) == expected
Case-preserving variant
LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü  Ä  Æ  Ǽ  Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö  Œ  ẞ  Ŧ Ü"
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    # note: the capital sharp s must be the single character "ẞ" (U+1E9E); a
    # two-character "SS" key would make str.maketrans raise a ValueError
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))

There are already many answers here, but this was not previously considered: using sklearn
from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode
accented_string = u'Málagueña®'
print(strip_accents_unicode(accented_string)) # output: Malaguena®
print(strip_accents_ascii(accented_string)) # output: Malaguena
This is particularly useful if you are already using sklearn to process text. These are the functions internally called by classes like CountVectorizer to normalize strings: with strip_accents='ascii', strip_accents_ascii is called, and with strip_accents='unicode', strip_accents_unicode is called.
More details
Finally, consider those details from its docstring:
Signature: strip_accents_ascii(s)
Transform accentuated unicode symbols into ascii or nothing
Warning: this solution is only suited for languages that have a direct
transliteration to ASCII symbols.
and
Signature: strip_accents_unicode(s)
Transform accentuated unicode symbols into their simple counterpart
Warning: the python-level loop and join operations make this
implementation 20 times slower than the strip_accents_ascii basic
normalization.

If you are hoping to get functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is [itself]...
A Python port of the Apache Lucene ASCII Folding Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into ASCII equivalents, if they exist.
Here's an example from the page mentioned above:
from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
> u'Astroturf pate'
fold(s, u'?')
> u'Astroturf? pate'
EDIT: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets; however, unmappable characters are removed, which means that this module will reduce Chinese text, for example, to empty strings. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation, above.

Related

Replacing Unicode Characters with actual symbols

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
I want to get rid of the <U+2019> and replace it with '. Is there a way to do this in Python?
Edit: I also have instances of <U+2014>, <U+201C>, etc. I'm looking for something which can replace all of these with the appropriate characters.
Replace them all at once with re.sub:
import re
string = "testing<U+2019> <U+2014> <U+201C>testing<U+1F603>"
result = re.sub(r'<U\+([0-9a-fA-F]{4,6})>', lambda x: chr(int(x.group(1),16)), string)
print(result)
Output:
testing’ — “testing😃
The regular expression matches <U+hhhh> where hhhh can be 4-6 hexadecimal characters. Note that Unicode defines code points from U+0000 to U+10FFFF so this accounts for that. The lambda replacement function converts the string hhhh to an integer using base 16 and then converts that number to a Unicode character.
Here's my solution for all code points denoted as U+0000 through U+10FFFF ("U+" followed by the code point value in hexadecimal, padded with leading zeros to a minimum of four digits):
import re

def UniToChar(unicode_notation):
    # hex digits are 0-9 and a-f (the original class [a-hA-H] also matched g and h, which are not hex)
    return chr(int(re.findall(r'<U\+([a-fA-F0-9]{4,5})>', unicode_notation)[0], 16))

xx = '''
At Donald<U+2019>s <U+2016>Elect<U+2016> in <U+2017>2019<U+2017>
<U+00C0> la Donald<U+2019>s friend <U+1F986>. <U+1F929><U+1F92A><U+1F601>
'''

for x in xx.split('\n'):
    abc = re.findall(r'<U\+[a-fA-F0-9]{4,5}>', x)
    if len(abc) > 0:
        for uniid in set(abc):
            x = x.replace(uniid, UniToChar(uniid))
        print(repr(x).strip("'"))
Output:
At Donald’s ‖Elect‖ in ‗2019‗
À la Donald’s friend 🦆. 🤩🤪😁
In fact, the private-use range from U+100000 to U+10FFFD (Plane 16) isn't detected by the simplified regex above… Improved code follows:
import re

def UniToChar(unicode_notation):
    aux = int(re.findall(r'<U\+([a-fA-F0-9]{4,6})>', unicode_notation)[0], 16)
    # circumvent the "ValueError: chr() arg not in range(0x110000)"
    if aux <= 0x10FFFD:
        return chr(aux)
    else:
        return chr(0xFFFD)  # Replacement Character

xx = '''
At Donald<U+2019>s <U+2016>Elect<U+2016> in <U+2017>2019<U+2017>
<U+00C0> la Donald<U+2019>s friend <U+1F986>. <U+1F929><U+1F92A><U+1F601>
Unassigned: <U+05ff>; out of Unicode range: <U+110000>.
'''

for x in xx.split('\n'):
    abc = re.findall(r'<U\+[a-fA-F0-9]{4,6}>', x)
    if len(abc) > 0:
        for uniid in set(abc):
            x = x.replace(uniid, UniToChar(uniid))
        print(repr(x).strip("'"))
Output:
At Donald’s ‖Elect‖ in ‗2019‗
À la Donald’s friend 🦆. 🤩🤪😁
Unassigned: \u05ff; out of Unicode range: �.
I guess this solves the problem if it's just one or two of these characters.
>>> string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
>>> string.replace("<U+2019>","'")
"At Donald Trump's Properties, a Showcase for a Brand and a President-Elect"
If there are many of these substitutions to be done, consider using map().
Source: Removing \u2018 and \u2019 character
You can replace using .replace()
print(string.replace('<U+2019>', "'"))
Or, if the code point changes, you can use re. But make it more robust than mine.
import re
string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President-Elect"
rep = re.search(r'[<][U][+]\d{4}[>]', string).group()
print(string.replace(rep, "'"))
What version of Python are you using?
I edited my answer so it can be used with multiple code points in the same string.
Well, you need to convert the Unicode code point that is between < and > to a Unicode char.
I used a regex to get the Unicode code point and then converted it to the corresponding Unicode char.
import re

string = "At Donald Trump<U+2019>s Properties, a Showcase for a Brand and a President<U+2014>Elect"
repbool = re.search(r'[<][U][+]\d{4}[>]', string)
while repbool:
    rep = repbool.group()
    string = string.replace(rep, chr(int(rep[1:-1][2:], 16)))
    repbool = re.search(r'[<][U][+]\d{4}[>]', string)
print(string)

How do I convert Unicode text to plain text that Python can match, so that I can find a specific word in web-scraping results?

I am trying to scrape text from Instagram and check whether certain keywords appear in the bio, but the users write in special fonts, so I cannot match the specific word. How can I remove the font styling from the text so that I can search for the word?
import re

test = "𝙄𝙣𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙛𝙪𝙩𝙪𝙧𝙚 𝙩𝙝𝙚𝙣 𝙚𝙭𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙥𝙖𝙨𝙩. "
x = re.findall(re.compile('past'), test)
if x:
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")
TEXT NOT FOUND
Another example:
import re

test = "ғʀᴇᴇʟᴀɴᴄᴇ ɢʀᴀᴘʜɪᴄ ᴅᴇsɪɢɴᴇʀ"
test = test.lower()
x = re.findall(re.compile('graphic'), test)
if x:
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")
TEXT NOT FOUND
You can use unicodedata.normalize, which returns the normal form of a Unicode string. For your examples, see the following code snippet:
import re
import unicodedata

test = "𝙄𝙣𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙛𝙪𝙩𝙪𝙧𝙚 𝙩𝙝𝙚𝙣 𝙚𝙭𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙥𝙖𝙨𝙩. "
formatted_test = unicodedata.normalize('NFKD', test).encode('ascii', 'ignore').decode('utf-8')
x = re.findall(re.compile('past'), formatted_test)
if x:
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")
and the output will be:
TEXT FOUND
Problem 1:
Take care if you are dealing with texts in Portuguese.
If you have:
string = """𝓿𝓲𝓫𝓻𝓪𝓷𝓽𝓮𝓼 orçamento"""
And you use:
unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8')
You will lose the cedilla (ç); that is, orçamento becomes orcamento.
On the other hand, if you use:
unicodedata.normalize('NFKC', string)
you will keep the cedilla.
Note that I changed NFKD to NFKC, besides dropping the encode/decode step.
Problem 2:
Take this examples (they are real examples that I found in Instagram):
string2 = """ᴍᴇᴜ ᴄᴏʀᴀçãᴏ ᴀᴛé ᴘᴜʟᴏᴜ ǫᴜᴀɴᴅᴏ ᴇʟᴀ ᴘᴀssᴏᴜ, ᴍᴀs ᴏ ǫᴜᴇ ғᴇᴢ ᴇʟᴇ ᴘᴀʀᴀʀ ғᴏɪ sᴇᴜ ᴀʙʀᴀçᴏ"""
string3 = """🅒🅤🅘🅓🅐🅓🅞"""
string4 = """(n̶ã̶o̶ ̶u̶s̶e̶ ̶á̶g̶u̶a̶ ̶d̶o̶c̶e̶!̶)"""
The unicodedata lib is not able to normalize them.
Note that string2 looks "normal", but it is written using LATIN LETTER SMALL CAPITAL characters instead of plain Latin letters; besides, the letter F is not an F, it is CYRILLIC SMALL LETTER GHE WITH STROKE.
One alternative is Unidecode: https://pypi.org/project/Unidecode/
from unidecode import unidecode

print(unidecode(string2))
MEU CORAcaO ATe PULOU oUAnDO ELA PAssOU, MAs O oUE g'EZ ELE PARAR g'OI sEU ABRAcO
print(unidecode(string3))
(C)(U)(I)(D)(A)(D)(O)
print(unidecode(string4))
nao use agua doce!
But Unidecode will normalize everything to ASCII, so we are back to problem 1.

Clean a list of strings containing escape sequences in Python

I'm working on an OCR pipeline, and the text extracted from the image gets appended to a list that contains a lot of escape sequences.
How can I clean a list of strings like this
extracted = ["b'i)\\nSYRUP\\na\\n\\x0c'",
"b'mi.\\n\\x0c'",
"b'100\\n\\x0c'",
"b'Te eT ran\\nSYRUP\\n\\x0c'",
"b'tamol, Ambroxol k\\n\\x0c'",
"b'Guaiphenesin\\n\\x0c'",
"b'Syrup\\n\\x0c'",
"b'ol HCl &\\n\\x0c'",
"b'quantity.\\n\\x0c'"]
to this
cleaned= ["SYRUP",
"mi",
"100",
"Te eT ran SYRUP",
"tamol, Ambroxol k",
"Guaiphenesin",
"Syrup",
"ol HCl &"
"quantity"]
I tried replacing them but nothing works out and it goes back to how it was when extracted. Any suggestions? Thanks in advance.
For a start you could try:
for i, s in enumerate(extracted):
    extracted[i] = (s.replace("b'", '')
                     .replace("i)", '')
                     .replace('\\na', '')
                     .replace('\\n', '')
                     .replace("\\x0c'", '')
                     .replace('.', ''))
These seem to be string representations of byte strings, which you can decode as UTF-8. We use literal_eval from ast for safe evaluation.
This will get you most of the way there; oddities from the OCR, like i), you'll need to fix manually by replacing.
import ast

extracted = [
    "b'i)\\nSYRUP\\na\\n\\x0c'",
    "b'mi.\\n\\x0c'",
    "b'100\\n\\x0c'",
    "b'Te eT ran\\nSYRUP\\n\\x0c'",
    "b'tamol, Ambroxol k\\n\\x0c'",
    "b'Guaiphenesin\\n\\x0c'",
    "b'Syrup\\n\\x0c'",
    "b'ol HCl &\\n\\x0c'",
    "b'quantity.\\n\\x0c'"]

def fix_string(s):
    eval_str = ast.literal_eval(s)  # safely turn "b'...'" into a real bytes object
    dec_str = eval_str.decode('utf-8')
    fix_str = dec_str.strip().replace('\n', ' ')
    return fix_str

for e in extracted:
    print(fix_string(e))
Output:
i) SYRUP a
mi.
100
Te eT ran SYRUP
tamol, Ambroxol k
Guaiphenesin
Syrup
ol HCl &
quantity.
Here is an answer that assumes the substring you are looking for in each string is either between two newlines or at the beginning of a string and followed by a newline.
import re

def find_substring(string):
    string = eval(string).decode('UTF-8')  # note: ast.literal_eval would be safer than eval here
    pattern = r"\n?.*\.?\n"
    lst = re.findall(pattern, string)
    if len(lst) == 1:
        substring = lst[0].strip(".\n")
    else:
        pattern2 = r"\n.*\n"
        lst2 = re.findall(pattern2, "".join(lst))
        substring = lst2[0].strip("\n")
    return substring
Then, map it over the list like so.
list(map(find_substring,extracted))
This outputs:
['SYRUP',
'mi',
'100',
'SYRUP',
'tamol, Ambroxol k',
'Guaiphenesin',
'Syrup',
'ol HCl &',
'quantity']

How to convert fancy/artistic Unicode text to ASCII?

I have a Unicode string like "𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊" and would like to convert it to the ASCII form "thug life".
I know I can achieve this in Python by
import unidecode
print(unidecode.unidecode('𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊'))
# thug life
However, this would also asciify other Unicode characters (such as Chinese/Japanese characters, emojis, accented characters, etc.), which I want to preserve.
Is there a way to detect these type of "artistic" unicode characters?
Some more examples:
𝓽𝓱𝓾𝓰 𝓵𝓲𝓯𝓮
𝓉𝒽𝓊𝑔 𝓁𝒾𝒻𝑒
𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖
thug life
Thanks for your help!
import unicodedata

strings = [
    '𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊',
    '𝓽𝓱𝓾𝓰 𝓵𝓲𝓯𝓮',
    '𝓉𝒽𝓊𝑔 𝓁𝒾𝒻𝑒',
    '𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖',
    'thug life',
]

for x in strings:
    print(unicodedata.normalize('NFKC', x), x)
Output:
thug life 𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊
thug life 𝓽𝓱𝓾𝓰 𝓵𝓲𝓯𝓮
thug life 𝓉𝒽𝓊𝑔 𝓁𝒾𝒻𝑒
thug life 𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖
thug life thug life
Resources:
unicodedata — Unicode Database
Normalization forms for Unicode text

Converting a word into another word while keeping the original case format

def translate(string, translations):
    '''
    >>> translations = {'he':'she', 'brother':'sister'}
    >>> translate('he', translations)
    'she'
    >>> translate('HE', translations)
    'SHE'
    >>> translate('He', translations)
    'She'
    >>> translate('brother', translations)
    'sister'
    >>> translate('my', translations)
    'my'
    '''
I have inputs like this. I used translations.get(string) to get 'she' and 'sister', and it worked well. But the thing is that I can't convert the strings to 'She' or 'SHE' (keeping the original format).
How can I do this in Python?
You are going to need either a bigger, case-sensitive dictionary, or your translate function will have to be modified to:
Detect the case of the original word or phrase: all lower, all upper, sentence or title case.
Look up the translation case-insensitively.
Re-case the translated text to match the original.
But with some languages you will still have issues. E.g., in some languages all-caps text keeps certain letters lower case, or the second letter is capitalised rather than the first (a prefix such as d' is always lower case), and capitalisation rules differ by locale: UK SI-unit rules say that a unit named after a person is always capitalised, but other countries do this differently.
Just as you have a data structure of translations, we can create a data structure of case tests and corrections:
def iscapitalized(s):
    return s and s[0].isupper() and s[1:].islower()

def translate(string, translations):
    translation = translations.get(string.lower(), string)
    for test, correction in corrections.items():
        if test(string):
            translation = correction(translation)
            break
    return translation

translations = {'he': 'she', 'brother': 'sister'}
corrections = {str.isupper: str.upper, str.islower: str.lower, iscapitalized: str.capitalize}

print(translate('he', translations))
print(translate('HE', translations))
print(translate('He', translations))
print(translate('brother', translations))
print(translate('my', translations))
OUTPUT
> python3 test.py
she
SHE
She
sister
my
>
