Hello I have a string of full names.
string='Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'
I would like to split it by first and last name to have an output like this
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
I tried using this code:
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', string))
that returns this result
['Christof', 'Koch', 'Jonathan', 'Harel', 'Moran', 'Cerf', 'Wolfgang', 'Einhaeuser']
I would like to have each full name as an item.
Any suggestions? Thanks
You can use a lookahead after any lowercase to see if it's followed by an immediate uppercase or end-of-line such as [a-zA-Z\s]+?[a-z](?=[A-Z]|$) (more specific) or even .+?[a-z](?=[A-Z]|$) (more broad).
import re
string = 'Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'
print(re.findall(r".+?[a-z](?=[A-Z]|$)", string))
# -> ['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
Having provided this answer, definitely check out Falsehoods Programmers Believe About Names; depending on your data, it might be erroneous to assume that your format will be parseable using the lower->upper assumption.
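To see how the lower->upper assumption can break, here is a small sketch on a made-up input (the names are illustrative and not taken from the question's data). A surname with an internal capital letter, such as McDonald, is cut in the middle:

```python
import re

# Hypothetical input: "McDonald" has a lowercase letter directly followed by an uppercase one.
sample = 'Ronald McDonaldJane Doe'
parts = re.findall(r".+?[a-z](?=[A-Z]|$)", sample)
print(parts)
# -> ['Ronald Mc', 'Donald', 'Jane Doe']
```

Whether this matters depends on how often such names occur in your data.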
For your list of strings in this format from the comments, just add a list comprehension. The regex I provided above happens to be robust to the middle initials without modification (but I have to emphasize that if your dataset is enormous, that might not hold).
import re
names = ['Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser', 'Za?d HarchaouiC?line Levy-leduc', 'David A. ForsythDuan Tran', 'Arnold SmeuldersSennay GhebreabPieter Adriaans', 'Peter L. BartlettAmbuj Tewari', 'Javier R. MovellanPaul L. RuvoloIan Fasel', 'Deli ZhaoXiaoou Tang']
result = [re.findall(r".+?[a-z](?=[A-Z]|$)", x) for x in names]
for name in result:
    print(name)
Output:
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
['Za?d Harchaoui', 'C?line Levy-leduc']
['David A. Forsyth', 'Duan Tran']
['Arnold Smeulders', 'Sennay Ghebreab', 'Pieter Adriaans']
['Peter L. Bartlett', 'Ambuj Tewari']
['Javier R. Movellan', 'Paul L. Ruvolo', 'Ian Fasel']
['Deli Zhao', 'Xiaoou Tang']
And if you want all of these names in one list, add
flattened = [x for y in result for x in y]
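As an alternative to the nested comprehension, the standard library's itertools.chain.from_iterable does the same flattening. A minimal sketch, using a shortened result list as stand-in data:

```python
from itertools import chain

# Shortened stand-in for the `result` list of lists produced above.
result = [['Christof Koch', 'Jonathan Harel'], ['Deli Zhao', 'Xiaoou Tang']]

# chain.from_iterable walks each inner list in order and yields its items.
flattened = list(chain.from_iterable(result))
print(flattened)
# -> ['Christof Koch', 'Jonathan Harel', 'Deli Zhao', 'Xiaoou Tang']
```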
It will most likely produce false positives and false negatives, but it may be OK to start with:
[A-Z][^A-Z\r\n]*\s+[A-Z][^A-Z\r\n]*
Test
import re
expression = r"[A-Z][^A-Z]*\s+[A-Z][^A-Z]*"
string = """
Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser
"""
print(re.findall(expression, string))
Output
['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
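To make the limits of this expression concrete, here is a small sketch that runs it on an input with a middle initial (the input string is borrowed from the other answer's sample data). The first name is matched only partially and the surname is lost:

```python
import re

expression = r"[A-Z][^A-Z]*\s+[A-Z][^A-Z]*"

# A middle initial counts as the "second capitalized word", so the match stops early.
sample = 'David A. ForsythDuan Tran'
matches = re.findall(expression, sample)
print(matches)
# -> ['David A. ', 'Duan Tran']
```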
If you wish to explore, simplify, or modify the expression, it is explained in the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
I am wrangling with a dataset and I have ended up having a list of names of the following form:
s = ['DR. James Coffins',
'Zacharias Pallefas',
'Matthew Ebnel',
'Ranzzith Redly',
'GEORGE GEORGIADAKIS',
'HARISH KUMARAN K',
'Christiaan Kraanlen, CFA',
'Mary K. Lein, CFA, COL',
'Alexandre Cegra, CFA, CAIA'
'Anna Bely']
I must extract the last names and place them in a separate list (or column in a pandas dataframe). However I am puzzled with the polymorphism of the Full Names and I am novice in Python.
A possible algorithm would be the following:
Loop through the elements of the list. For each element:
split the element into subelements using spaces. Then:
a) If there are four or fewer subelements, start from the beginning and examine the first four subelements.
a1) If the first subelement is longer than 2 letters, then: if the second subelement is longer than one letter, return the second subelement. Otherwise, return the third subelement.
a2) If the first subelement is 2 letters long, drop it and repeat step a1.
How about always grabbing the second element of each line, after skipping words that contain a . or that appear in an exclude list ['dr', 'mr', 'mrs', 'mrs', 'miss', 'prof']?
>>> exclude_tags = ['dr', 'mr', 'mrs', 'mrs', 'miss', 'prof']
>>> [[y for y in x.split() if '.' not in y and y.lower() not in exclude_tags][1].rstrip(',').capitalize() for x in s]
['Coffins', 'Pallefas', 'Ebnel', 'Redly', 'Georgiadakis', 'Kumaran', 'Kraanlen', 'Lein', 'Cegra']
For anyone else coming across this question, keep in mind that it is impossible in general to perfectly extract a person's surname from their full name, and go read Falsehoods Programmers Believe About Names.
Sunitha's solution will fail for anyone whose last name is composed of more than one token (van Gogh), has more than one last name (Gonzalez Ramirez), has a first name that has more than one token (Mary Jane Watson), chose to put their middle name in whatever system created this list, is from an Asian culture where the order of given name / surname is sometimes reversed, etc.
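A short sketch of those failure modes, wrapping the same second-token logic in a helper (the function name second_token and the sample names are made up for illustration):

```python
exclude_tags = ['dr', 'mr', 'mrs', 'miss', 'prof']

def second_token(full_name):
    # Same idea as the one-liner above: drop tokens containing '.' or listed titles,
    # then take the second remaining token as the "surname".
    tokens = [t for t in full_name.split()
              if '.' not in t and t.lower() not in exclude_tags]
    return tokens[1].rstrip(',').capitalize()

tricky = ['Vincent van Gogh', 'Maria Gonzalez Ramirez', 'Mary Jane Watson']
guesses = [second_token(n) for n in tricky]
print(guesses)
# -> ['Van', 'Gonzalez', 'Jane']
```

Note that a single-token name would raise an IndexError here, since there is no second token to return.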
I know this question has been asked a few times, but what I'm asking is not how to do it, but which delimiter should be used.
So I have a very long string and I want to split it into words. The result is not what I wanted, so I thought to add another delimiter.
The problem is there are words like vs. and U.S. in the string. If I use . as a delimiter, I will get vs but U.S. becomes U and S. This is not what I wanted.
Another example, there are words brainf*ck *7 F***ing x*x+y*y works* f*k in the string. If I use * as a delimiter, the result will be very messy (brainf*ck becomes brainf and ck, F***ing becomes F and ing, and so on)
The ' delimiter has the same problem (don't, 'starting out', what's, do's, dont's).
- = + ( ) also cause some minor problems, but I can handle those delimiters. The problem is with . * '.
Does anyone have any idea how to tackle this problem?
What about using re:
import re
text = 'U.S. vs. brainf*ck *7 F***ing x*x+y*y works* f*k'
get = re.split(r'\s', text)
# ['U.S.', 'vs.', 'brainf*ck', '*7', 'F***ing', 'x*x+y*y', 'works*', 'f*k']
#Example
print(get[0]) # U.S.
print(get[1]) # vs.
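If whitespace is the only delimiter, str.split() with no arguments may be enough. A small sketch of the difference, assuming the text can contain runs of several spaces (the sample string is made up):

```python
import re

text = 'U.S. vs.  brainf*ck'  # note the two spaces after "vs."

# Splitting on every single whitespace character leaves an empty string in the result.
by_regex = re.split(r'\s', text)
print(by_regex)   # -> ['U.S.', 'vs.', '', 'brainf*ck']

# str.split() without arguments treats a run of whitespace as one separator.
by_split = text.split()
print(by_split)   # -> ['U.S.', 'vs.', 'brainf*ck']
```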
When I need to split a line of data I get following result:
>>> s="MS Dhoni cricket captain 10000"
>>> val=s.split()
>>> print(val)
['MS', 'Dhoni', 'cricket', 'captain', '10000']
But I expect code in the below manner:
['MS Dhoni', 'cricket', 'captain', '10000']
Though there is space in a specific position it must be skipped. How can I modify the code?
That code does what you want
import re
s="MS Dhoni cricket captain 10000"
print(re.split(r"\s(?=[a-z0-9])",s))
output:
['MS Dhoni', 'cricket', 'captain', '10000']
Explanation: split on spaces, but only if followed by a lowercase letter or a digit (which is not consumed in the split operation, thanks to the ?= lookahead construct).
BUT this is cheating: had MS Dhoni been in the middle of the string, it wouldn't have worked. You assume that python knows how to read a distinction (Mr, ...) or group words containing only capital letters together with the next word. That is only in your mind.
It answers your question, but you have to be more specific if you want the answer to be useful for your projects.
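One way to make the rule explicit, as a sketch only: assume that any all-uppercase alphabetic token (such as MS) should be glued to the token that follows it, wherever it appears. That rule and the helper name split_keep_initials are assumptions for illustration, not something stated in the question:

```python
def split_keep_initials(s):
    """Split on whitespace, joining an all-caps alphabetic token with the next token."""
    tokens = s.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Assumed rule: tokens like 'MS' belong with the word after them.
        if tok.isalpha() and tok.isupper() and i + 1 < len(tokens):
            out.append(tok + ' ' + tokens[i + 1])
            i += 2
        else:
            out.append(tok)
            i += 1
    return out

print(split_keep_initials("MS Dhoni cricket captain 10000"))
# -> ['MS Dhoni', 'cricket', 'captain', '10000']
print(split_keep_initials("cricket captain MS Dhoni 10000"))
# -> ['cricket', 'captain', 'MS Dhoni', '10000']
```

This will also merge acronyms such as USA with the next word, so the rule needs adjusting to the actual data.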
I have a dictionary named dicionario1. I need to replace the content of dicionario1[chave][1], which is a list, with the list lista_atributos.
lista_atributos uses the content of dicionario1[chave][1] to get a list where:
All the information is separated by "," except when it finds the characters "(#" and ")". In this case, it should create a list with the content between those characters (also separated by ","). It can find one or more occurrences of '(#' and I need to handle every one of them.
Although this might be easy, I'm stuck with the following code:
dicionario1 = {'#998' : [['IFCPROPERTYSET'],["'0siSrBpkjDAOVD99BESZyg',#41,'Geometric Position',$,(#977,#762,#768,#754,#753,#980,#755,#759,#757)"]],
'#1000' : [['IFCRELDEFINESBYPROPERTIES'],["'1dEWu40Ab8zuK7fuATUuvp',#41,$,$,(#973,#951),#998"]]}
for chave in dicionario1:
    lista_atributos = []
    ini = 0
    for i in dicionario1[chave][1][0][ini:]:
        if i == '(' and dicionario1[chave][1][0][dicionario1[chave][1][0].index(i) + 1] == '#':
            ini = dicionario1[chave][1][0].index(i) + 1
            fim = dicionario1[chave][1][0].index(')')
            lista_atributos.append(dicionario1[chave][1][0][:ini-2].split(','))
            lista_atributos.append(dicionario1[chave][1][0][ini:fim].split(','))
            lista_atributos.append(dicionario1[chave][1][0][fim+2:].split(','))
    print(lista_atributos)
Result:
[["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'], ['#973', '#951'], ['#998']]
[["'0siSrBpkjDAOVD99BESZyg'", '#41', "'Geometric Position'", '$'], ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757'], ['']]
Unfortunately I can't figure out how to iterate over dicionario1[chave][1][0] to get this result:
[["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'], ['#973', '#951'], ['#998']]
[["'0siSrBpkjDAOVD99BESZyg'"], ['#41'], ["'Geometric Position'"], ['$'], ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757']]
I also need the ["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'] part of the result to turn into ["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'].
Also If I modify "Geometric Position" to "(Geometric Position)" the result becomes:
[["'1dEWu40Ab8zuK7fuATUuvp'", '#41', '$', '$'], ['#973', '#951'], ['#998']]
SOLUTION: (thanks to Rob Watts)
import re
dicionario1 = ["'0siSrBpkjDAOVD99BESZyg',#41,'(Geometric) (Position)',$,(#977,#762,#768,#754,#753,#980,#755,#759,#757)"]
# Match either a whole parenthesized group or a run of non-comma characters.
dicionario1 = re.findall(r'\([^)]*\)|[^,]+', dicionario1[0])
for i in range(len(dicionario1)):
    # Only groups that start with '(#' are turned into sub-lists.
    if dicionario1[i].startswith('(#'):
        dicionario1[i] = dicionario1[i][1:-1].split(',')
print(dicionario1)
["'0siSrBpkjDAOVD99BESZyg'", '#41', "'(Geometric) (Position)'", '$', ['#977', '#762', '#768', '#754', '#753', '#980', '#755', '#759', '#757']]
One problem I see with your code is the use of index:
ini = dicionario1[chave][1][0].index(i) + 2
fim = dicionario1[chave][1][0].index(')')
index returns the index of the first occurrence of the character. So if you have two ('s in your string, then both times it will give you the index of the first one. That (and your break statement) is why in your example you've got ['2.1', '2.2', '2.3'] correctly but also have '(#5.1', '5.2', '5.3)'.
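The pitfall, and the optional start argument of str.index that avoids it, can be seen in a small sketch (the test string is made up):

```python
s = "a,(#1,#2),b,(#3,#4)"

first_open = s.index('(')
same_open = s.index('(')                   # no start given: the first '(' again
second_open = s.index('(', first_open + 1)  # search resumes after the first '('
second_close = s.index(')', second_open)

print(first_open, same_open, second_open)   # -> 2 2 12
print(s[second_open + 1:second_close])      # -> #3,#4
```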
You can get around this by specifying a starting index to the index method, but I'd suggest a different strategy. If you don't have any commas in the parsed strings, you can use a fairly simple regex to find all your groups:
r'\([^)]*\)|[^,]+'
This will find everything inside parentheses and also every run of characters that doesn't contain a comma. For example:
>>> import re
>>> teststr = "'1',$,#41,(#10,#5)"
>>> re.findall(r'\([^)]*\)|[^,]+', teststr)
["'1'", '$', '#41', '(#10,#5)']
This leaves you with everything grouped appropriately. You still have to do a little bit of processing on each entry, but it should be fairly straightforward.
During your processing, the startswith method should be helpful. For example:
>>> '(something)'.startswith('(')
True
>>> '(something)'.startswith('(#')
False
>>> '(#1,#2,#3)'.startswith('(#')
True
This will make it easy for you to distinguish between (...) and (#...). If there are commas in the (...), you could always split on comma after you've used the regex.
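Putting the regex and startswith together, here is a sketch that produces the nested shape the question asks for, where every plain entry is wrapped in its own one-element list. The helper name parse_attributes is hypothetical, and the sketch assumes no commas appear inside quoted strings:

```python
import re

def parse_attributes(raw):
    """Split on commas, turning each '(#...)' group into its own sub-list."""
    parsed = []
    for token in re.findall(r'\([^)]*\)|[^,]+', raw):
        if token.startswith('(#'):
            # Strip the surrounding parentheses, then split the group on commas.
            parsed.append(token[1:-1].split(','))
        else:
            parsed.append([token])
    return parsed

raw = "'1dEWu40Ab8zuK7fuATUuvp',#41,$,$,(#973,#951),#998"
print(parse_attributes(raw))
# -> [["'1dEWu40Ab8zuK7fuATUuvp'"], ['#41'], ['$'], ['$'], ['#973', '#951'], ['#998']]
```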
Example:
names = ['James John', 'Robert David', 'Paul' ... the list has 5K items]
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1,names) # this returns False; 'James' is not in the list
is_name_in_text(text2,names) # this returns 'James John'
is_name_in_text(text3,names) # this returns 'Paul'
is_name_in_text() checks whether any name from the list appears in the text.
The easy way to do this is to just check if the name is in the list by using the in operator, but the list has 5,000 items, so it is not efficient. I can just split the text into words and check if the words are in the list, but this is not going to work if you have more than one word matching. Line number 7 will fail in this case.
Make names into a set and use the in-operator for fast O(1) lookup.
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
...     for possible_name in set(findnames.findall(text)):
...         if possible_name in names:
...             return possible_name
...     return False
...
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'
Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re
names_re = re.compile(r'\b' +
                      r'\b|\b'.join(re.escape(name) for name in names) +
                      r'\b')
print(names_re.search('I saw James today'))
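One thing to watch with an alternation: if one name is a prefix of another, the regex engine takes the first alternative that matches. A sketch of sorting the names longest first, assuming a made-up list in which both 'James' and 'James John' appear:

```python
import re

# Hypothetical list where one entry is a prefix of another.
names = ['James', 'James John', 'Paul']

def build_names_re(name_list):
    # Group the alternatives so the word boundaries apply to the whole alternation.
    return re.compile(r'\b(?:' + '|'.join(re.escape(n) for n in name_list) + r')\b')

unsorted_re = build_names_re(names)
sorted_re = build_names_re(sorted(names, key=len, reverse=True))

text = 'I saw James John today'
print(unsorted_re.search(text).group(0))  # -> James
print(sorted_re.search(text).group(0))    # -> James John
```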
You may use Python's set in order to get good performance while using the in operator.
If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.
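As one possible mechanism for pulling candidate names out of a phrase, here is a sketch that checks every run of one or two consecutive words against the set. The helper name find_name_by_ngrams and the max_words limit are assumptions, and punctuation attached to words is not handled:

```python
def find_name_by_ngrams(text, names, max_words=2):
    """Return the first word run (longest runs tried first) found in `names`, else False."""
    words = text.split()
    for n in range(max_words, 0, -1):
        for i in range(len(words) - n + 1):
            candidate = ' '.join(words[i:i + n])
            if candidate in names:  # O(1) on average when `names` is a set
                return candidate
    return False

names = set(['James John', 'Robert David', 'Paul'])
print(find_name_by_ngrams('I saw James today', names))       # -> False
print(find_name_by_ngrams('I saw James John today', names))  # -> James John
print(find_name_by_ngrams('I met Paul', names))              # -> Paul
```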