Remove and replace multiple commas in string - python

I have this dataset
df = pd.DataFrame({'name':{0: 'John,Smith', 1: 'Peter,Blue', 2:'Larry,One,Stacy,Orange' , 3:'Joe,Good' , 4:'Pete,High,Anne,Green'}})
yielding:
name
0 John,Smith
1 Peter,Blue
2 Larry,One,Stacy,Orange
3 Joe,Good
4 Pete,High,Anne,Green
I would like to:
remove commas (replace them by one space)
wherever a cell contains two persons, insert the "&" symbol after the first person's family name and before the second person's first name.
Desired output:
name
0 John Smith
1 Peter Blue
2 Larry One & Stacy Orange
3 Joe Good
4 Pete High & Anne Green
I tried the code below, but it simply removes the commas. I could not work out how to insert the "&" symbol in the same code.
df['name']= df['name'].str.replace(r',', '', regex=True)
Disclaimer: all names in this table are fictitious. No identification with actual persons (living or deceased) is intended or should be inferred.

I would do it the following way:
import pandas as pd
df = pd.DataFrame({'name':{0: 'John,Smith', 1: 'Peter,Blue', 2:'Larry,One,Stacy,Orange' , 3:'Joe,Good' , 4:'Pete,High,Anne,Green'}})
df['name'] = df['name'].str.replace(',',' ').str.replace(r'(\w+ \w+) ', r'\1 & ', regex=True)
print(df)
gives output
name
0 John Smith
1 Peter Blue
2 Larry One & Stacy Orange
3 Joe Good
4 Pete High & Anne Green
Explanation: first replace each comma with a space, then call replace again to find one or more word characters, a space, one or more word characters and a trailing space. The capturing group holds everything except that trailing space, so the replacement \1 & re-emits the group followed by a space, the & character and another space.
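The same two-step idea can be sketched with the standard re module on plain strings, which may help when checking the pattern outside pandas. The helper name join_pairs is made up for this illustration, and the sketch assumes every person is written as exactly two single-word parts.

```python
import re

def join_pairs(cell):
    # Hypothetical helper, not part of the original answer.
    # Step 1: turn every comma into a single space.
    spaced = cell.replace(',', ' ')
    # Step 2: after each "word word" pair that is followed by a space
    # (meaning another person follows), insert "& " after that space.
    return re.sub(r'(\w+ \w+) ', r'\1 & ', spaced)

names = ['John,Smith', 'Larry,One,Stacy,Orange']
print([join_pairs(n) for n in names])
# ['John Smith', 'Larry One & Stacy Orange']
```

Because the pattern consumes one pair at a time, a cell with three persons would come out as "A B & C D & E F".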

With single regex replacement:
df['name'].str.replace(r',([^,]+)(,)?', lambda m:f" {m.group(1)}{' & ' if m.group(2) else ''}", regex=True)
0 John Smith
1 Peter Blue
2 Larry One & Stacy Orange
3 Joe Good
4 Pete High & Anne Green

This should work:
import re
def separate_names(original_str):
    # Replace the 1st, 3rd, ... comma with a space
    spaces = re.sub(r',([^,]*(?:,|$))', r' \1', original_str)
    # Any comma that is left separates two persons
    return spaces.replace(',', ' & ')
df['spaced'] = df.name.map(separate_names)
df
I created a function called separate_names which replaces the odd-numbered commas (first, third, ...) with spaces using a regex. The remaining, even-numbered commas are then replaced by & using the string replace method. Finally I used the map function to apply separate_names to each row. The output is as follows:

In the replace call, the replacement string should be a space rather than an empty string. Put a space between the quotes, so that '' becomes ' ':
df['name']= df['name'].str.replace(r',', ' ', regex=True)
The second argument is now ' ' (a single space).

Related

Regex - removing everything after first word following a comma

I have a column that has name variations that I'd like to clean up. I'm having trouble with the regex expression to remove everything after the first word following a comma.
d = {'names':['smith,john s','smith, john', 'brown, bob s', 'brown, bob']}
x = pd.DataFrame(d)
Tried:
x['names'] = [re.sub(r'/.\s+[^\s,]+/','', str(x)) for x in x['names']]
Desired Output:
['smith,john','smith, john', 'brown, bob', 'brown, bob']
Not sure why my regex isn't working, but any help would be appreciated.
You could try using a regex that looks for a comma, then an optional space, then only keeps the remaining word:
x["names"].str.replace(r"^([^,]*,\s*[^\s]*).*", r"\1", regex=True)
0 smith,john
1 smith, john
2 brown, bob
3 brown, bob
Name: names, dtype: object
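Try re.sub(r'(,\s*\w+).*$', r'\1', str(x)). Note that Python patterns take no / delimiters and that the backreference is written \1 rather than $1.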
Put the triggered pattern into capture group 1 and then restore it in what gets replaced.
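A minimal sketch of the capture-group idea in Python syntax, applied to the sample names from the question. The variable name cleaned is only illustrative.

```python
import re

names = ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']

# Group 1 keeps the comma, any whitespace after it and the first word;
# the trailing .*$ (everything else on the line) is dropped because the
# replacement only puts group 1 back.
cleaned = [re.sub(r'(,\s*\w+).*$', r'\1', n) for n in names]
print(cleaned)
# ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```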

Split Column on regex

I really struggle with regex, and I'm hoping for some help.
I have columns that look like this
import pandas as pd
data = {'Location': ['Building A, 100 First St City, State', 'Fire Station # 100, 2 Apple Row, City, State Zip', 'Church , 134 Baker Rd City, State']}
df = pd.DataFrame(data)
Location
0 Building A, 100 First St City, State
1 Fire Station # 100, 2 Apple Row, City, State Zip
2 Church , 134 Baker Rd City, State
I would like to get it into the form shown below by splitting wherever there is a comma followed by a space and then a number. However, I'm running into an issue where the split also removes the number.
Location Name Address
0 Building A 100 First St City, State
1 Fire Station # 100 2 Apple Row, City, State, Zip
2 Church 134 Baker Rd City, State
This is the code I've been using
df['Location Name']= df['Location'].str.split('.,\s\d', expand=True)[0]
df['Address']= df['Location'].str.split('.,\s\d', expand=True)[1]
You can use Series.str.extract:
df[['Location Name','Address']] = df['Location'].str.extract(r'^(.*?),\s(\d.*)', expand=True)
The ^(.*?),\s(\d.*) regex matches
^ - start of string
(.*?) - Group 1 ('Location Name'): any zero or more chars other than line break chars as few as possible
,\s - comma and whitespace
(\d.*) - Group 2 ('Address'): a digit and the rest of the line.
Another simple solution to your problem is to use a positive lookahead. You want to check if there is a number ahead of your pattern, while not including the number in the match. Here's an example of a regex that solves your problem:
\s?,\s(?=\d)
Here, we optionally match one whitespace character before the comma (so that it is removed along with the comma), then match the comma followed by whitespace.
The (?= ) is a positive lookahead, in this case we check for a following digit. If that's matched, the split will remove the comma and whitespace only.
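As a rough sketch of how the lookahead pattern behaves, here it is applied with re.split from the standard library to the three sample rows. The names splitter and pairs are illustrative, and maxsplit=1 is an added assumption so that an address containing another ", <digit>" is not split a second time.

```python
import re

locations = [
    'Building A, 100 First St City, State',
    'Fire Station # 100, 2 Apple Row, City, State Zip',
    'Church , 134 Baker Rd City, State',
]

# The lookahead (?=\d) checks for a digit without consuming it, so the
# digit stays at the start of the address part.
splitter = re.compile(r'\s?,\s(?=\d)')
pairs = [splitter.split(loc, maxsplit=1) for loc in locations]

for name, address in pairs:
    print(repr(name), repr(address))
# 'Building A' '100 First St City, State'
# 'Fire Station # 100' '2 Apple Row, City, State Zip'
# 'Church' '134 Baker Rd City, State'
```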

Regex Names which have received a B

I have the following lines
John SMith: A
Pedro Smith: B
Jonathan B: A
John B: B
Luis Diaz: A
Scarlet Diaz: B
I need to get all student names which have received a B.
I tried this but it doesn't work 100%
x = re.findall(r'\b(.*?)\bB', grades)
You can use
\b([^:\n]*):\s*B
Details:
\b - a word boundary
([^:\n]*) - Group 1: any zero or more chars other than : and line feed
: - a colon
\s* - zero or more whitespaces
B - a B char.
See the Python demo:
import re
# Make sure you use
# with open(fpath, 'r') as f: text = f.read()
# for this to work if you read the data from a file
text = """John SMith: A
Pedro Smith: B
Jonathan B: A
John B: B
Luis Diaz: A
Scarlet Diaz: B"""
print( re.findall(r'\b([^:\n]*):\s*B', text) )
# => ['Pedro Smith', 'John B', 'Scarlet Diaz']
Matching on ': B' is a pretty simple way to do it, for example re.findall(r'(.*): B', grades), assuming there will always be exactly one space between the colon and the letter grade (which, assuming you're doing the Data Science in Python course on Coursera, there is for this assignment).
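If every record is known to sit on its own line, anchoring the pattern to line boundaries is another way to sketch this. The example assumes exactly one space after the colon and nothing after the grade; the variable name b_students is illustrative.

```python
import re

grades = """John SMith: A
Pedro Smith: B
Jonathan B: A
John B: B
Luis Diaz: A
Scarlet Diaz: B"""

# Under re.MULTILINE, ^ and $ match at the start and end of each line,
# so a "B" inside a name (as in "Jonathan B: A") is never taken for the grade.
b_students = re.findall(r'^(.+): B$', grades, flags=re.MULTILINE)
print(b_students)
# ['Pedro Smith', 'John B', 'Scarlet Diaz']
```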

Insert space between specific characters but not if followed by specific characters regex

Using Python regex, I wish to insert a space between alpha characters and numerals (alpha will always precede numeral), but not between (numerals and hyphens) or between (numerals and underscores).
Ideally, I'd like it to replace all such examples on the line (see the 3rd sample string, below), but even just doing the first one is great.
I've gotten this far:
import re
item = "Bob Ro1-1 Fred"
txt = re.sub(r"(.*)(\d)", r"\1 \2", item)
print(txt) #prints Bob Ro1 -1 Fred (DESIRED WOULD BE Bob Ro 1-1 Fred)
I've tried sticking a ? in various places to ungreedify the search, but haven't yet found the magic.
Sample strings: Original ==> Desired output
1. "Bob Ro1 Sam cl3" ==> "Bob Ro 1 Sam cl 3"
2. "Some Guy ro1-1 Sam" ==> "Some Guy ro 1-1 Sam"
3. "ribbet ribbit ro3_2 bob wow cl1-3" ==> "ribbet ribbit ro 3_2 bob wow cl 1-3"
You may use
re.sub(r'([^\W\d_])(\d)', r'\1 \2', s)
A variation using lookarounds:
re.sub(r'(?<=[^\W\d_])(?=\d)', ' ', s)
The ([^\W\d_])(\d) regex matches and captures into Group 1 any single letter and into Group 2 the next digit. Then, the \1 \2 replacement pattern inserts the letter in Group 1, a space, and the digit in Group 2 into the resulting string.
The (?<=[^\W\d_])(?=\d) matches a location in between a letter and a digit, and thus, the replacement string only contains a space.
See the Python demo:
import re
strs = [ 'Bob Ro1-1 Fred', 'Bob Ro1 Sam cl3', 'Some Guy ro1-1 Sam', 'ribbet ribbit ro3_2 bob wow cl1-3' ]
rx = re.compile(r'([^\W\d_])(\d)')
for s in strs:
    print(rx.sub(r'\1 \2', s))                     # capture-group version
    print(re.sub(r'(?<=[^\W\d_])(?=\d)', ' ', s))  # lookaround version
Output:
Bob Ro 1-1 Fred
Bob Ro 1-1 Fred
Bob Ro 1 Sam cl 3
Bob Ro 1 Sam cl 3
Some Guy ro 1-1 Sam
Some Guy ro 1-1 Sam
ribbet ribbit ro 3_2 bob wow cl 1-3
ribbet ribbit ro 3_2 bob wow cl 1-3
You need a look ahead following a look behind:
(?<=[a-zA-Z])(?=[0-9])
The code should be re.sub(r"(?<=[a-zA-Z])(?=[0-9])", r" ", item)
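The two lookaround variants in this thread differ in what counts as a letter: [^\W\d_] accepts any Unicode letter in Python 3 string patterns, while [a-zA-Z] only accepts ASCII letters. A small sketch with a made-up sample string shows the difference.

```python
import re

# Made-up sample: one accented name and one plain ASCII name.
s = 'Zoé1-2 bob3_4'

unicode_aware = re.sub(r'(?<=[^\W\d_])(?=\d)', ' ', s)
ascii_only = re.sub(r'(?<=[a-zA-Z])(?=[0-9])', ' ', s)

print(unicode_aware)  # Zoé 1-2 bob 3_4
print(ascii_only)     # Zoé1-2 bob 3_4  (no space after the accented letter)
```

For purely ASCII input such as the sample strings in the question, both patterns give the same result.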

How do I extract characters from a string in Python?

I need to make some name formats match for merging later on in my script. My column 'Name' is imported from a csv and contains names like the following:
Antonio Brown
LeSean McCoy
Le'Veon Bell
For my script, I would like to get the first letter of the first name and combine it with the last name as such....
A.Brown
L.McCoy
L.Bell
Here's what I have right now, which returns NaN every time:
ff['AbbrName'] = ff['Name'].str.extract('([A-Z]\s[a-zA-Z]+)', expand=True)
Thanks!
Another option is the str.replace method with the pattern ^([A-Z]).*?([a-zA-Z]+)$. Here ^([A-Z]) captures the first letter at the beginning of the string and ([a-zA-Z]+)$ captures the last word; the replacement rebuilds the name by placing a . between the first and second captured groups:
df['Name'].str.replace(r'^([A-Z]).*?([a-zA-Z]+)$', r'\1.\2', regex=True)
#0 A.Brown
#1 L.McCoy
#2 L.Bell
#Name: Name, dtype: object
What if you would just apply() a function that would split by the first space and get the first character of the first word adding the rest:
import pandas as pd
def abbreviate(row):
    first_word, rest = row['Name'].split(" ", 1)
    return first_word[0] + ". " + rest
df = pd.DataFrame({'Name': ['Antonio Brown', 'LeSean McCoy', "Le'Veon Bell"]})
df['AbbrName'] = df.apply(abbreviate, axis=1)
print(df)
Prints:
Name AbbrName
0 Antonio Brown A. Brown
1 LeSean McCoy L. McCoy
2 Le'Veon Bell L. Bell
This should be simple enough to do, even without regex. Use a combination of string splitting and concatenation.
df.Name.str[0] + '.' + df.Name.str.split().str[-1]
0 A.Brown
1 L.McCoy
2 L.Bell
Name: Name, dtype: object
If there is a possibility of the Name column having leading spaces, replace df.Name.str[0] with df.Name.str.strip().str[0].
Caveat: Columns must have two names at the very least.
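To illustrate the caveat, here is a plain-Python sketch that also copes with single-word names. The function name abbreviate_name is hypothetical, and returning a single-word name unchanged is an assumption about the desired behaviour, not something the question specifies.

```python
def abbreviate_name(name):
    # Hypothetical helper; split() with no argument also ignores
    # leading and trailing whitespace.
    parts = name.split()
    if len(parts) < 2:
        # Assumption: leave empty or single-word names as they are
        # (apart from surrounding whitespace).
        return name.strip()
    # First letter of the first word, a dot, then the last word.
    return parts[0][0] + '.' + parts[-1]

samples = ['Antonio Brown', "Le'Veon Bell", 'Pele', '  LeSean McCoy']
print([abbreviate_name(n) for n in samples])
# ['A.Brown', 'L.Bell', 'Pele', 'L.McCoy']
```

On a DataFrame this could be applied with df['Name'].map(abbreviate_name).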
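You get NaN because your regular expression cannot match the names: [A-Z]\s[a-zA-Z]+ requires an uppercase letter immediately followed by whitespace, which never occurs in these names.
Instead I would try the following:
parts = ff['Name'].str.split(' ')
ff['AbbrName'] = parts.str[0].str[0] + '.' + parts.str[1]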
