Regex - removing everything after first word following a comma - python

I have a column with name variations that I'd like to clean up. I'm having trouble writing a regex that removes everything after the first word following a comma.
d = {'names':['smith,john s','smith, john', 'brown, bob s', 'brown, bob']}
x = pd.DataFrame(d)
Tried:
x['names'] = [re.sub(r'/.\s+[^\s,]+/','', str(x)) for x in x['names']]
Desired Output:
['smith,john','smith, john', 'brown, bob', 'brown, bob']
Not sure why my regex isn't working, but any help would be appreciated.

You could try using a regex that looks for a comma, then an optional space, then only keeps the remaining word:
x["names"].str.replace(r"^([^,]*,\s*[^\s]*).*", r"\1", regex=True)
0 smith,john
1 smith, john
2 brown, bob
3 brown, bob
Name: names, dtype: object

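Try re.sub(r'(,\s*\w+).*$', r'\1', str(x)). Python's re module takes the pattern without /.../ delimiters and refers to groups as \1 rather than $1.
Put the part you want to keep into capture group 1, then restore it in the replacement.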
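As a minimal sketch of the capture-group approach using only Python's re module (clean_name is a hypothetical helper name, not something from the answers), applied to the sample names from the question:

```python
import re

def clean_name(name):
    # Group 1 keeps the comma, any whitespace, and the first word after it;
    # the trailing .*$ consumes the rest so it is dropped by the replacement.
    return re.sub(r'(,\s*\w+).*$', r'\1', name)

names = ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']
cleaned = [clean_name(n) for n in names]
print(cleaned)
```

Note that \w does not match hyphens or apostrophes, so a first name such as mary-ann would be cut at the hyphen; \S+ in place of \w+ may suit such data better.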

Related

Remove and replace multiple commas in string

I have this dataset
df = pd.DataFrame({'name':{0: 'John,Smith', 1: 'Peter,Blue', 2:'Larry,One,Stacy,Orange' , 3:'Joe,Good' , 4:'Pete,High,Anne,Green'}})
yielding:
name
0 John,Smith
1 Peter,Blue
2 Larry,One,Stacy,Orange
3 Joe,Good
4 Pete,High,Anne,Green
I would like to:
remove commas (replace them by one space)
wherever I have 2 persons in one cell, insert the "&" symbol after the first person's family name and before the second person's name.
Desired output:
name
0 John Smith
1 Peter Blue
2 Larry One & Stacy Orange
3 Joe Good
4 Pete High & Anne Green
I tried the code below, but it simply removes the commas. I could not work out how to insert the "&" symbol in the same code.
df['name']= df['name'].str.replace(r',', '', regex=True)
Disclaimer: all names in this table are fictitious. No identification with actual persons (living or deceased) is intended or should be inferred.
I would do it the following way:
import pandas as pd
df = pd.DataFrame({'name':{0: 'John,Smith', 1: 'Peter,Blue', 2:'Larry,One,Stacy,Orange' , 3:'Joe,Good' , 4:'Pete,High,Anne,Green'}})
df['name'] = df['name'].str.replace(',',' ').str.replace(r'(\w+ \w+) ', r'\1 & ', regex=True)
print(df)
gives output
name
0 John Smith
1 Peter Blue
2 Larry One & Stacy Orange
3 Joe Good
4 Pete High & Anne Green
Explanation: first replace every comma with a space. Then use replace again: the pattern matches one or more word characters, a space, one or more word characters, and a trailing space. The replacement is the content of the capturing group (everything except the trailing space), followed by a space, the & character, and another space.
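The two steps can also be tried on plain strings with re.sub, as a sketch. The join_pairs helper and the three-person input are assumptions added for illustration; they are not part of the question's data.

```python
import re

def join_pairs(text):
    # Step 1: commas become spaces.
    spaced = text.replace(',', ' ')
    # Step 2: every "word word " (note the trailing space, so more text
    # must follow) gets " & " inserted after the pair.
    return re.sub(r'(\w+ \w+) ', r'\1 & ', spaced)

two_people = join_pairs('Larry,One,Stacy,Orange')
one_person = join_pairs('Joe,Good')
three_people = join_pairs('Al,Ash,Bo,Birch,Cy,Cedar')  # hypothetical input
print(two_people, one_person, three_people, sep='\n')
```

Because the pattern requires a trailing space, the last pair in a cell is never followed by "&", which is why single-person cells come out unchanged.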
With single regex replacement:
df['name'].str.replace(r',([^,]+)(,)?', lambda m:f" {m.group(1)}{' & ' if m.group(2) else ''}", regex=True)
0 John Smith
1 Peter Blue
2 Larry One & Stacy Orange
3 Joe Good
4 Pete High & Anne Green
This should work:
import re
def separate_names(original_str):
    # Replace every odd-numbered comma (1st, 3rd, ...) with a space
    spaces = re.sub(r',([^,]*(?:,|$))', r' \1', original_str)
    # The remaining commas separate two people
    return spaces.replace(',', ' & ')
df['spaced'] = df.name.map(separate_names)
df
I created a function called separate_names, which uses a regex to replace the odd-numbered commas (first, third, ...) with spaces. The remaining even-numbered commas are then replaced by " & " using str.replace. Finally, map applies separate_names to each row, so the new spaced column holds the desired output.
In your replace statement, replace the comma with a space rather than an empty string: put a space between the quotes, so that '' becomes ' '.
df['name'] = df['name'].str.replace(r',', ' ', regex=True)  # note the space in the replacement string

Python -- split a string with multiple occurrences of same delimiter

How can I take a string that looks like this
string = 'Contact name: John Doe    Contact phone: 222-333-4444'
and split the string on both colons? Ideally the output would look like:
['Contact Name', 'John Doe', 'Contact phone','222-333-4444']
The real issue is that the name can be of arbitrary length. However, I think it might be possible to use re to split the string after a certain number of space characters (say at least 4, since there will likely always be at least 4 spaces between the end of any name and the beginning of Contact phone), but I'm not that good with regex. If someone could provide a possible solution (and an explanation so I can learn), that would be much appreciated.
You can use re.split:
import re
s = 'Contact name: John Doe    Contact phone: 222-333-4444'
new_s = re.split(r':\s|\s{2,}', s)
Output:
['Contact name', 'John Doe', 'Contact phone', '222-333-4444']
Regex explanation:
:\s => matches an occurrence of ': '
| => evaluated as 'or', attempts to match either the pattern before or after it
\s{2,} => matches two or more whitespace characters
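A small sketch of how the two alternatives behave. It assumes the name and the next label are separated by several spaces, as the question describes; the single_gap string is a hypothetical counter-example showing what happens when that assumption does not hold.

```python
import re

pattern = r':\s|\s{2,}'

# Several spaces between "John Doe" and "Contact phone" (assumed layout).
wide_gap = 'Contact name: John Doe    Contact phone: 222-333-4444'
# Hypothetical input where only a single space separates them.
single_gap = 'Contact name: John Doe Contact phone: 222-333-4444'

wide_parts = re.split(pattern, wide_gap)
single_parts = re.split(pattern, single_gap)
print(wide_parts)
print(single_parts)
```

With a single space there is nothing for \s{2,} to match, so the name and the following label stay glued together; only the ': ' separators are split.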

How do I extract characters from a string in Python?

I need to make some name formats match for merging later on in my script. My column 'Name' is imported from a csv and contains names like the following:
Antonio Brown
LeSean McCoy
Le'Veon Bell
For my script, I would like to get the first letter of the first name and combine it with the last name as such....
A.Brown
L.McCoy
L.Bell
Here's what I have right now, which returns NaN every time:
ff['AbbrName'] = ff['Name'].str.extract('([A-Z]\s[a-zA-Z]+)', expand=True)
Thanks!
Another option using str.replace method with ^([A-Z]).*?([a-zA-Z]+)$; ^([A-Z]) captures the first letter at the beginning of the string; ([a-zA-Z]+)$ matches the last word, then reconstruct the name by adding . between the first captured group and second captured group:
df['Name'].str.replace(r'^([A-Z]).*?([a-zA-Z]+)$', r'\1.\2', regex=True)
#0 A.Brown
#1 L.McCoy
#2 L.Bell
#Name: Name, dtype: object
You could also just apply() a function that splits on the first space, takes the first character of the first word, and appends the rest:
import pandas as pd
def abbreviate(row):
    # Split on the first space only, so the rest of the name stays together
    first_word, rest = row['Name'].split(" ", 1)
    return first_word[0] + ". " + rest
df = pd.DataFrame({'Name': ['Antonio Brown', 'LeSean McCoy', "Le'Veon Bell"]})
df['AbbrName'] = df.apply(abbreviate, axis=1)
print(df)
Prints:
Name AbbrName
0 Antonio Brown A. Brown
1 LeSean McCoy L. McCoy
2 Le'Veon Bell L. Bell
This should be simple enough to do, even without regex. Use a combination of string splitting and concatenation.
df.Name.str[0] + '.' + df.Name.str.split().str[-1]
0 A.Brown
1 L.McCoy
2 L.Bell
Name: Name, dtype: object
If there is a possibility of the Name column having leading spaces, replace df.Name.str[0] with df.Name.str.strip().str[0].
Caveat: Columns must have two names at the very least.
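You get NaN because your regular expression cannot match the names: [A-Z]\s requires an uppercase letter immediately followed by whitespace, which none of them contain.
Instead I would try the following:
parts = ff['Name'].str.split(' ')
ff['AbbrName'] = parts.str[0].str[0] + '.' + parts.str[1]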
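To see why the original pattern finds nothing, here is a sketch that uses the re module directly rather than pandas (abbreviate_name is a hypothetical helper name). A failed re.search returns None, which is the plain-Python counterpart of the NaN that str.extract reports.

```python
import re

names = ['Antonio Brown', 'LeSean McCoy', "Le'Veon Bell"]

# [A-Z]\s needs an uppercase letter directly before whitespace; in these
# names the character before the space is always lowercase, so no match.
original_matches = [re.search(r'[A-Z]\s[a-zA-Z]+', n) for n in names]

def abbreviate_name(name):
    # Keep the leading capital letter and the final run of letters,
    # joined by a dot.
    return re.sub(r'^([A-Z]).*?([a-zA-Z]+)$', r'\1.\2', name)

abbreviated = [abbreviate_name(n) for n in names]
print(original_matches)
print(abbreviated)
```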

Iterate and match all elements with regex

So I have something like this:
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
I want to replace every element with the first name so it would look like this:
data = ['Alice Smith', 'Tim', 'Uncle Neo']
So far I got:
for i in range(len(data)):
    if re.match('(.*) and|with|\&', data[i]):
        a = re.match('(.*) and|with|\&', data[i])
        data[i] = a.group(1)
But it doesn't seem to work, I think it's because of my pattern but I can't figure out the right way to do this.
Use a list comprehension with re.split:
result = [re.split(r' (?:and|with|&) ', x)[0] for x in data]
In your attempt, the | alternation needs to be grouped with parentheses; as written, the pattern means "(.*) and" or "with" or "&". Anyway, it's too complex.
I would just use re.sub to remove the separation word & the rest:
data = [re.sub(" (and|with|&) .*","",d) for d in data]
result:
['Alice Smith', 'Tim', 'Uncle Neo']
You can try this:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
final_data = [re.sub(r'\sand.*?$|\s&.*?$|\swith.*?$', '', i) for i in data]
Output:
['Alice Smith', 'Tim', 'Uncle Neo']
Simplify your approach to the following:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
data = [re.search(r'.*(?= (and|with|&))', i).group() for i in data]
print(data)
The output:
['Alice Smith', 'Tim', 'Uncle Neo']
.*(?= (and|with|&)) - positive lookahead assertion, ensures that name/surname .* is followed by any item from the alternation group (and|with|&)
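One caveat with calling .group() directly: re.search returns None when no separator is present, so a string such as 'Jerry' would raise AttributeError. A hedged sketch with a fallback follows; first_party is a hypothetical helper name, and the 'Jerry' input is added for illustration.

```python
import re

SEPARATOR = re.compile(r'.*(?= (and|with|&))')

def first_party(text):
    # .* is greedy, so this returns the text before the LAST " and",
    # " with" or " &"; when none is present, return the string unchanged.
    match = SEPARATOR.search(text)
    return match.group() if match else text

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31', 'Jerry']
result = [first_party(d) for d in data]
print(result)
```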
Brief
I would suggest using Casimir's answer if possible, but, if you are not sure what word might follow (that is to say that and, with, and & are dynamic), then you can use this regex.
Note: This regex will not work for some special cases such as names with apostrophes ' or dashes -, but you can add them to the character list that you're searching for. This answer also depends on the name beginning with an uppercase character and the "union word" as I'll name it (and, with, &, etc.) not beginning with an uppercase character.
Code
Regex
^((?:[A-Z][a-z]*\s*)+)\s.*
Substitution
$1
Result
Input
Alice Smith and Bob
Tim with Sam Dunken
Uncle Neo & 31
Output
Alice Smith
Tim
Uncle Neo
Explanation
Assert position at the beginning of the string ^
Match a capital alpha character [A-Z]
Match any number of lowercase alpha characters [a-z]*
Match any number of whitespace characters \s* (you can use a literal space followed by * instead if you'd prefer)
Match the above conditions between one and unlimited times, all captured into capture group 1 (...)+: where ... contains everything above
Match a whitespace character, followed by any character (except new line) any number of times
$1: Replace with capture group 1
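In Python's re module the same substitution is written with \1 instead of $1. A sketch, under the same assumptions as the note above (capitalised names, and a union word that does not start with an uppercase letter); NAME_PREFIX is a name chosen here for illustration:

```python
import re

NAME_PREFIX = re.compile(r'^((?:[A-Z][a-z]*\s*)+)\s.*')

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
# Replace the whole match with capture group 1 (the capitalised words).
result = [NAME_PREFIX.sub(r'\1', d) for d in data]
print(result)
```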

Using str.split for pandas dataframe values based on parentheses location

Let's say I have the following dataframe series df['Name'] column:
Name
'Jerry'
'Adam (and family)'
'Paul and Hellen (and family):\n'
'John and Peter (and family):/n'
How would I remove all the contents in Name after the first parentheses?
df['Name']= df['Name'].str.split("'(").str[0]
doesn't seem to work and I don't understand why?
The output I want is
Name
'Jerry'
'Adam'
'Paul and Hellen'
'John and Peter'
so everything after the parentheses is deleted.
Solution with split; it is necessary to escape ( with \:
df['Name'] = df['Name'].str.split(r"\s+\(").str[0]
print (df)
Name
0 'Jerry'
1 'Adam
2 'Paul and Hellen
3 'John and Peter
Solution with regex and replace:
df['Name'] = df['Name'].str.replace(r"\s+\(.*$", "", regex=True)
print (df)
Name
0 'Jerry'
1 'Adam
2 'Paul and Hellen
3 'John and Peter
\s+\(.*$ means: replace everything from one or more whitespace characters and the first ( up to the end of the string $ with "" (an empty string).
Use a regular expression:
>>> import re
>>> name = 'Adam (and family)'
>>> result = re.sub(r"( \().*$", '', name)
>>> print(result)
Adam
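A possible reason the original split("'(") fails, sketched with the re module alone. This assumes pandas treats a multi-character split pattern as a regular expression by default, which is my understanding of its behaviour rather than something stated in the thread. An unescaped ( opens a group that is never closed, so the pattern does not even compile; and even escaped, the sequence '( does not occur in the data.

```python
import re

names = ["Jerry", "Adam (and family)", "Paul and Hellen (and family):\n"]

# An unescaped "(" starts a group, so the pattern "'(" is invalid.
try:
    re.compile("'(")
    compiled = True
except re.error:
    compiled = False

# Escaping the parenthesis and splitting on "whitespace + (" works.
cleaned = [re.split(r"\s+\(", n)[0] for n in names]
print(compiled)
print(cleaned)
```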
