I have an OCR program (not very accurate) that outputs strings, which I append to a list. So, my ss list looks like this:
ss = [
'성 벼 | 5 번YAO LIAO거 CHINA P R체류자격 결혼이민F-1)말급일자', # 'YAO LIAO'
'성 별 F 등록번호명 JAO HALJUNGCHINA P R격 결혼이민(F-6)밥급인자', # 'JAO HALJUNG'
'성 별 F명 CHENG HAIJING국 가 CHINA P R 역체 가차격 결혼이민(C-4) 박급인자', # 'CHENG HAIJING'
'KOa MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE움첫;자격 거주(F-2)발급일자', # 'DOVUD TAREEQ SAID HAFIZULLAH'
'KOn 별 MDOVUD TAREEQ SAID- IIAFIZULLAH 감 TURKIYE동체나자격 거주F-2) 발급일자', # 'DOVUD TAREEQ SAID- IIAFIZULLAH'
'등록번호IN" 성 별 M명 TAREEQ SAD IIAFIZULLAH 값 TURKIYE8체주자격 거주-2)발급일자' # 'TAREEQ SAD IIAFIZULLAH'
]
I need some way to at least remove the country names; an even better solution would be to extract clean full names, as shown in the comments above.
Here, the ss list stores the worst outputs, so if I can handle all 6 strings here with one universal solution, I hope the rest will be easier.
So far, I have tried looping through each element, keeping only uppercase English letters, and filtering out empty strings and any substring shorter than 2 characters, since I assume each name part has at least 2 letters:
new_string_list = []
for s in ss:
    # replace everything that is not an uppercase ASCII letter with a space
    eng_parts = ''.join([i if 64 < ord(i) < 91 else ' ' for i in s])
    # print("English-only strings: {}".format(eng_parts))
    new_string = ''
    spaced_string_list = eng_parts.split(" ")
    for spaced_string in spaced_string_list:
        if len(spaced_string) >= 2:
            new_string += spaced_string + " "
    new_string_list.append(new_string)
where new_string_list is ['YAO LIAO CHINA ', 'JAO HALJUNGCHINA ', 'CHENG HAIJING CHINA ', 'KO MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE ', 'KO MDOVUD TAREEQ SAID IIAFIZULLAH TURKIYE ', 'IN TAREEQ SAD IIAFIZULLAH TURKIYE ']
Could this result be improved further?
EDIT:
The desired name string can consist of up to 5 space-separated substrings, and each part of the name is at least two uppercase English letters. In some cases a name substring may end with a - (see the SAID- case) when it reaches the edge of the ID card from which the whole string was originally extracted.
It is a great idea to postulate that a name is always built of at least two upper-case words of Latin characters separated by a space (or more).
So you can loop through the elements and look for that pattern. re is the module to use =):
import re

for el in ss:
    m = re.search(r'[A-Z]{2,}(\s+[A-Z\-]{2,})+', el)
    if m:
        print(m.group())
YAO LIAO
JAO HALJUNGCHINA
CHENG HAIJING
MDOVUD TAREEQ SAID HAFIZULLAH TURKIYE
MDOVUD TAREEQ SAID- IIAFIZULLAH
TAREEQ SAD IIAFIZULLAH
Let's examine the pattern in detail:
[A-Z]{2,} searches for upper-case Latin characters, two or more in a row. The square brackets indicate a character set and the curly brackets a repetition range.
\s+ looks for one or more (+) whitespace characters (\s).
Add special characters to the set of allowed characters if necessary. Note that e.g. a dash needs to be escaped as \- (or placed last in the set) because it otherwise signifies a range.
Group parts of the pattern to make them repeatable: ( )+
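If the remaining country tokens (CHINA, TURKIYE in some matches above) are still a problem, one option is to filter the match against a known list of countries. A sketch, assuming you can enumerate the country names that appear on these cards; note it cannot fix fused cases like HALJUNGCHINA, where OCR dropped the space:
import re

# Assumed country list; extend it for your data.
COUNTRIES = {'CHINA', 'TURKIYE'}

def extract_name(el):
    m = re.search(r'[A-Z]{2,}(\s+[A-Z\-]{2,})+', el)
    if not m:
        return None
    # drop whole tokens that are known country names
    tokens = [t for t in m.group().split() if t not in COUNTRIES]
    return ' '.join(tokens)

for el in ss:
    print(extract_name(el))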
I have a Unicode string in Python, and I would like to remove all the accents (diacritics).
I found on the web an elegant way to do this (in Java):
convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
remove all the characters whose Unicode type is "diacritic".
Do I need to install a library such as pyICU or is this possible with just the Python standard library? And what about python 3?
Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.
Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.
Example:
>>> from unidecode import unidecode
>>> unidecode('kožušček')
'kozuscek'
>>> unidecode('北亰')
'Bei Jing '
>>> unidecode('François')
'Francois'
How about this:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
This works on Greek letters, too:
>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>
The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".
I just found this answer on the Web:
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii
It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.
Edit: this does the trick:
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
unicodedata.combining(c) returns a nonzero (truthy) value if the character c combines with the preceding character, which is mainly the case when it is a diacritic.
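For instance (combining() actually returns the character's combining class, an integer that is nonzero, hence truthy, for diacritics):
>>> import unicodedata
>>> # NFD splits 'é' into 'e' plus U+0301 COMBINING ACUTE ACCENT
>>> [unicodedata.combining(c) for c in unicodedata.normalize('NFD', u'é')]
[0, 230]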
Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:
encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café" # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)
Actually, I work on a project that must be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free user entries.
Thanks to you, I have created this function that works wonders.
import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError):  # unicode is a default on python 3
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text
result:
text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'
This handles not only accents, but also "strokes" (as in ø etc.):
import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char
This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant indeed.
In fact, it's more of a hack, as pointed out in comments, since Unicode names are really just names; they give no guarantee to be consistent or anything.
There are still special letters that are not handled by this, such as turned and inverted letters, since their unicode name does not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.
EDIT NOTE:
Incorporated suggestions from the comments (handling lookup errors, Python-3 code).
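Since rmdiacritics works one character at a time, applying it to a whole string could look like this (a small sketch on top of the answer's function):
def rmdiacritics_str(s):
    # apply the per-character lookup to every character
    return ''.join(rmdiacritics(c) for c in s)

>>> rmdiacritics_str('Søren Kierkegaard')
'Soren Kierkegaard'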
In my view, the proposed solutions should NOT be accepted answers. The original question is asking for the removal of accents, so the correct answer should only do that, not that plus other, unspecified, changes.
Simply observe the result of this code, which is from the accepted answer, where I have changed "Málaga" to "Málagueña":
accented_string = u'Málagueña'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaguena' and is of type 'str'
There is an additional change (ñ -> n), which is not requested in the OQ.
A simple function that does the requested task, in lower form:
import re

def f_remove_accents(old):
    """
    Removes common accent characters, lower form.
    Uses: regex.
    """
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    return new
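Applied to the example discussed above, this removes the vowel accents but, by design, leaves ñ alone:
>>> f_remove_accents("Málagueña")
'malagueña'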
gensim.utils.deaccent(text) from Gensim - topic modelling for humans:
>>> from gensim.utils import deaccent
>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
'Sef chomutovskych komunistu dostal postou bily prasek'
Another solution is unidecode.
Note that the suggested solution with unicodedata typically removes accents only in some characters; e.g. 'ł' has no decomposition mapping, so the ASCII-encode step turns it into '' (drops it) rather than into 'l'.
In response to @MiniQuark's answer:
I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats.
As a test, I created a test.txt file that looked like this:
Montréal, über, 12.89, Mère, Françoise, noël, 889
I had to include lines 2 and 3 of the code below to get it to work (which I found in a Python ticket), as well as incorporate @Jabba's comment:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)
The result:
Montreal
uber
12.89
Mere
Francoise
noel
889
(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)
import unicodedata
from random import choice

import perfplot
import regex
import text_unidecode

def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))

def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://stackoverflow.com/a/517974/7966259
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])

perfplot.show(
    setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
    kernels=[
        remove_accent_chars_regex,
        remove_accent_chars_join,
        text_unidecode.unidecode,
    ],
    labels=['regex', 'join', 'unidecode'],
    n_range=[2 ** k for k in range(22)],
    equality_check=None, relative_to=0, xlabel='str len'
)
Some languages have combining diacritics as language letters, plus accent diacritics to specify accent.
I think it is safer to specify explicitly which diacritics you want to strip:
import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
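A usage sketch: after NFD, 'é' decomposes into 'e' plus COMBINING ACUTE ACCENT (which is in the default list), while the ring in 'å' is a COMBINING RING ABOVE, not in the list, and therefore survives:
>>> strip_accents('Wikipédia')
'Wikipedia'
>>> strip_accents('ñ and å')
'n and å'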
Here is a short function which strips the diacritics, but keeps the non-latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") rely on the given parallel strings.
Code
from unicodedata import combining, normalize
LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"
def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))
NB. The default argument outliers is evaluated once and not meant to be provided by the caller.
Intended usage
As a key to sort a list of strings in a more “natural” order:
sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)
Output:
['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']
If your strings mix texts and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.
Tests
To make sure the behavior meets your needs, take a look at the pangrams below:
examples = [
    ("hello, world", "hello, world"),
    ("42", "42"),
    ("你好,世界", "你好,世界"),
    (
        "Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
        "des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
    ),
    (
        "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
        "falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
    ),
    (
        "Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
        "љубазни фењерџија чађавог лица хоће да ми покаже штос.",
    ),
    (
        "Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
        "ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
    ),
    (
        "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
        "quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
    ),
    (
        "Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
        "kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
    ),
    (
        "Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
        "glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
    ),
]

for (given, expected) in examples:
    assert remove_diacritics(given) == expected
Case-preserving variant
LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü Ä Æ Ǽ Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö Œ SS Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"
def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))
There are already many answers here, but this was not previously considered: using sklearn
from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode
accented_string = u'Málagueña®'
print(strip_accents_unicode(accented_string)) # output: Malaguena®
print(strip_accents_ascii(accented_string)) # output: Malaguena
This is particularly useful if you are already using sklearn to process text. Those are the functions internally called by classes like CountVectorizer to normalize strings: when using strip_accents='ascii' then strip_accents_ascii is called and when strip_accents='unicode' is used, then strip_accents_unicode is called.
More details
Finally, consider those details from its docstring:
Signature: strip_accents_ascii(s)
Transform accentuated unicode symbols into ascii or nothing
Warning: this solution is only suited for languages that have a direct
transliteration to ASCII symbols.
and
Signature: strip_accents_unicode(s)
Transform accentuated unicode symbols into their simple counterpart
Warning: the python-level loop and join operations make this
implementation 20 times slower than the strip_accents_ascii basic
normalization.
If you are hoping to get functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is [itself]...
A Python port of the Apache Lucene ASCII Folding Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into ASCII equivalents, if they exist.
Here's an example from the page mentioned above:
from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
> u'Astroturf pate'
fold(s, u'?')
> u'Astroturf? pate'
EDIT: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets; however, unmappable characters are removed, which means that this module will reduce Chinese text, for example, to empty strings. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation, above.
I have the following text; each line has two phrases separated by a tab ("\t"):
RoadTunnel RouteOfTransportation
LaunchPad Infrastructure
CyclingLeague SportsLeague
Territory PopulatedPlace
CurlingLeague SportsLeague
GatedCommunity PopulatedPlace
What I want is to insert _ between the words; the results should be:
Road_Tunnel Route_Of_Transportation
Launch_Pad Infrastructure
Cycling_League Sports_League
Territory Populated_Place
Curling_League Sports_League
Gated_Community Populated_Place
There are no cases such as "ABTest" or "aBTest", but there are cases where three words appear together, such as "RouteOfTransportation". I tried several ways but did not succeed.
One of my tries was:
textProcessed = re.sub(r"([A-Z][a-z]+)(?=([A-Z][a-z]+))", r"\1_", text)
But it did not produce the desired result.
Use a regular expression and re.sub.
>>> import re
>>> s = '''LaunchPad Infrastructure
... CyclingLeague SportsLeague
... Territory PopulatedPlace
... CurlingLeague SportsLeague
... GatedCommunity PopulatedPlace'''
>>> subbed = re.sub('([A-Z][a-z]+)([A-Z])', r'\1_\2', s)
>>> print(subbed)
Launch_Pad Infrastructure
Cycling_League Sports_League
Territory Populated_Place
Curling_League Sports_League
Gated_Community Populated_Place
edit: Here's another one, since your test cases don't cover enough to be sure what exactly you want:
>>> re.sub('([a-zA-Z])([A-Z])([a-z])', r'\1_\2\3', 'ABThingThing')
'AB_Thing_Thing'
Combining re.findall and str.join:
>>> "_".join(re.findall(r"[A-Z]{1}[^A-Z]*", text))
Depending on your needs, a slightly different solution can be this:
import re
result = re.sub(r"([a-zA-Z])(?=[A-Z])", r"\1_", s)
It will insert a _ before any upper case letter that follows another letter (whether it is upper or lower case).
"TheRabbit IsBlue" => "The_Rabbit Is_Blue"
"ABThing ThingAB" => "A_B_Thing Thing_A_B"
It does not support special chars.
If all I have is a string of 10 or more digits, how can I format this as a phone number?
Some trivial examples:
555-5555
555-555-5555
1-800-555-5555
I know those aren't the only ways to format them, and it's very likely I'll leave things out if I do it myself. Is there a python library or a standard way of formatting phone numbers?
For a library: phonenumbers (PyPI, source)
Python version of Google's common library for parsing, formatting, storing and validating international phone numbers.
The readme is insufficient, but I found the code well documented.
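A minimal usage sketch with one of the question's examples (the output strings come from the library's region metadata, so treat them as illustrative):
import phonenumbers

n = phonenumbers.parse("1-800-555-5555", "US")
print(phonenumbers.format_number(n, phonenumbers.PhoneNumberFormat.NATIONAL))  # e.g. (800) 555-5555
print(phonenumbers.format_number(n, phonenumbers.PhoneNumberFormat.E164))      # e.g. +18005555555
print(phonenumbers.is_valid_number(n))  # validity check against real numbering plans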
It seems your examples are formatted in groups of three digits (plus a final group of four), so you can write a simple function that applies the thousands separator to all but the last digit and then appends that digit:
>>> def phone_format(n):
... return format(int(n[:-1]), ",").replace(",", "-") + n[-1]
...
>>> phone_format("5555555")
'555-5555'
>>> phone_format("5555555")
'555-5555'
>>> phone_format("5555555555")
'555-555-5555'
>>> phone_format("18005555555")
'1-800-555-5555'
Here's one adapted from utdemir's solution and this solution that will work with Python 2.6, as the "," formatter is new in Python 2.7.
import re

def phone_format(phone_number):
    clean_phone_number = re.sub(r'[^0-9]+', '', phone_number)
    formatted_phone_number = re.sub(r"(\d)(?=(\d{3})+(?!\d))", r"\1-", "%d" % int(clean_phone_number[:-1])) + clean_phone_number[-1]
    return formatted_phone_number
You can use the function clean_phone() from the library DataPrep. Install it with pip install dataprep.
>>> from dataprep.clean import clean_phone
>>> df = pd.DataFrame({'phone': ['5555555', '5555555555', '18005555555']})
>>> clean_phone(df, 'phone')
Phone Number Cleaning Report:
3 values cleaned (100.0%)
Result contains 3 (100.0%) values in the correct format and 0 null values (0.0%)
phone phone_clean
0 5555555 555-5555
1 5555555555 555-555-5555
2 18005555555 1-800-555-5555
More verbose, one dependency, but guarantees consistent output for most inputs and was fun to write:
import re

def format_tel(tel):
    tel = tel.removeprefix("+")
    tel = tel.removeprefix("1")      # remove leading +1 or 1
    tel = re.sub("[ ()-]", '', tel)  # remove space, (), -
    assert len(tel) == 10
    tel = f"{tel[:3]}-{tel[3:6]}-{tel[6:]}"
    return tel
Output:
>>> format_tel("1-800-628-8737")
'800-628-8737'
>>> format_tel("800-628-8737")
'800-628-8737'
>>> format_tel("18006288737")
'800-628-8737'
>>> format_tel("1800-628-8737")
'800-628-8737'
>>> format_tel("(800) 628-8737")
'800-628-8737'
>>> format_tel("(800) 6288737")
'800-628-8737'
>>> format_tel("(800)6288737")
'800-628-8737'
>>> format_tel("8006288737")
'800-628-8737'
Without magic numbers; ...if you're not into the whole brevity thing:
def format_tel(tel):
    AREA_BOUNDARY = 3      # 800.6288737
    SUBSCRIBER_SPLIT = 6   # 800628.8737

    tel = tel.removeprefix("+")
    tel = tel.removeprefix("1")      # remove leading +1, or 1
    tel = re.sub("[ ()-]", '', tel)  # remove space, (), -
    assert len(tel) == 10

    tel = (f"{tel[:AREA_BOUNDARY]}-"
           f"{tel[AREA_BOUNDARY:SUBSCRIBER_SPLIT]}-{tel[SUBSCRIBER_SPLIT:]}")
    return tel
A simple solution might be to start at the back and insert the hyphen after four numbers, then do groups of three until the beginning of the string is reached. I am not aware of a built in function or anything like that.
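For illustration, that back-to-front grouping might look like this (a sketch, assuming the input is already digits only, at least seven of them):
def hyphenate(digits: str) -> str:
    groups = [digits[-4:]]        # last group of four
    rest = digits[:-4]
    while rest:
        groups.append(rest[-3:])  # groups of three, moving left
        rest = rest[:-3]
    return "-".join(reversed(groups))

print(hyphenate("5555555"))      # 555-5555
print(hyphenate("18005555555"))  # 1-800-555-5555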
You might find this helpful:
http://www.diveintopython3.net/regular-expressions.html#phonenumbers
Regular expressions will be useful if you are accepting user input of phone numbers. I would not use the exact approach followed at the above link. Something simpler, like just stripping out digits, is probably easier and just as good.
Also, inserting commas into numbers is an analogous problem that has been solved efficiently elsewhere and could be adapted to this problem.
In my case, I needed to get a phone pattern like "*** *** ***" by country.
So I reused the phonenumbers package in our project:
import re

from phonenumbers import country_code_for_region, format_number, PhoneMetadata, PhoneNumberFormat, parse as parse_phone

def get_country_phone_pattern(country_code: str):
    mobile_number_example = PhoneMetadata.metadata_for_region(country_code).mobile.example_number
    formatted_phone = format_number(parse_phone(mobile_number_example, country_code), PhoneNumberFormat.INTERNATIONAL)
    without_country_code = " ".join(formatted_phone.split()[1:])
    return re.sub(r"\d", "*", without_country_code)

get_country_phone_pattern("KG")  # *** *** ***