Conditionally update dataframe column if character exists in another column - python

I have a dataframe which consists of two columns, full name and last name. Sometimes the last name column is not filled in properly. In such cases, the last name can be found as the last word between parentheses in the full name column. For the rows where parentheses are found, I would like to set the last name column to the word between the parentheses.
Code
import pandas as pd
df = pd.DataFrame({
'full':['bob john smith','sam alan (james)','zack joe mac', 'alan (gracie) jacob (arnold)'],
'last': ['ross', '-', 'mac', '-']
})
result_to_be = pd.DataFrame({
'full':['bob john smith','sam alan (james)','zack joe mac', 'alan (gracie) jacob (arnold)'],
'last': ['ross', 'james', 'mac', 'arnold']
})
print(df)
print(result_to_be)
I have tried to use the contains function as a mask, but it seems to treat its argument as a regex and fails when checking whether the string contains a ')' or '(' character:
df['full'].str.contains(')')
The error it shows is
re.error: unbalanced parenthesis at position 0

You can use .str.findall to get the values between the parentheses, take the last one, and use df.loc to assign it where last is -:
df.loc[df['last'] == '-', 'last'] = df['full'].str.findall(r'\((.+?)\)').str[-1]
Output:
>>> df
                           full    last
0                bob john smith    ross
1              sam alan (james)   james
2                  zack joe mac     mac
3  alan (gracie) jacob (arnold)  arnold

For a slightly different syntax, you could also use extract:
df.loc[df['last'] == '-', 'last'] = df['full'].str.extract(r'.*\((.*)\)', expand=False)
Output:
                           full    last
0                bob john smith    ross
1              sam alan (james)   james
2                  zack joe mac     mac
3  alan (gracie) jacob (arnold)  arnold
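
The error in the question comes from str.contains interpreting ')' as a regular expression. As a minimal sketch (not part of the answers above), the mask the question was aiming for can be built by passing regex=False, and then combined with the findall approach. The names has_paren and in_parens are only illustrative, and this version updates every row that contains a parenthesis rather than every row where last is '-':

```python
import pandas as pd

df = pd.DataFrame({
    'full': ['bob john smith', 'sam alan (james)', 'zack joe mac',
             'alan (gracie) jacob (arnold)'],
    'last': ['ross', '-', 'mac', '-'],
})

# regex=False makes str.contains treat ')' as a literal character,
# which avoids the "unbalanced parenthesis" error.
has_paren = df['full'].str.contains(')', regex=False)

# Last parenthesised word of each row; rows without one become NaN.
in_parens = df['full'].str.findall(r'\(([^()]+)\)').str[-1]

# Assignment aligns on the index, so only the masked rows are updated.
df.loc[has_paren, 'last'] = in_parens
print(df)
```

Escaping the character instead, as in str.contains(r'\)'), should work equally well if you prefer to keep regex matching on.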

Related

Moving Names in to First Name the Last Name

I have imported a csv dataset into python that is being cleaned up. There is no consistency in the names: some are "John Doe" and others are "Doe, John". I need them all to be "First name Last name" without the comma:
Doe, John
Smith, John
Snow, John
John Cena
Steve Smith
When I want:
John Doe
John Smith
John Snow
John Cena
Steve Smith
I tried doing:
if ',' in df['names']:
    df['names'] = ' '.join(df['names'].split(',')[::-1]).strip()
I get
AttributeError: 'Series' object has no attribute 'split'
I also tried converting the column to a list before running the code above, but that didn't work:
df['name'] = df['name'].to_list()
You can use str.replace and use capture groups to swap values:
df['names'] = df['names'].str.replace(r'([^,]+),\s*(.+)', r'\2 \1', regex=True)
print(df)
# Output
names
0 John Doe
1 John Smith
2 John Snow
3 John Cena
4 Steve Smith
Note: in your code you have to use the str accessor (although this alone does not fix the rest of your code):
# Replace
df['names'].split(',')
# With
df['names'].str.split(',')
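
To see what the capture groups in the replacement above are doing, here is a small sketch using the standard re module on plain strings. The helper name swap and the sample list are only for illustration:

```python
import re

# Group 1: everything before the comma; group 2: everything after it
# (the \s* swallows the whitespace that follows the comma).
pattern = re.compile(r'([^,]+),\s*(.+)')

def swap(name):
    # Names without a comma do not match and are returned unchanged.
    return pattern.sub(r'\2 \1', name)

names = ['Doe, John', 'Smith, John', 'John Cena']
swapped = [swap(n) for n in names]
print(swapped)
```

The pandas call applies the same substitution to every element of the column.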
You can use a lambda function to process each name
df['names'] = df['names'].apply(
    lambda x: (x.split(',')[1] + ' ' + x.split(',')[0]).strip()
    if ',' in x else x
)
Using split(',') you split the name into two strings and access them by index. You then concatenate part [1] with part [0] and finally remove leading and trailing whitespace using strip(). All of this happens only if x (remember, x is each individual name) contains a comma; if not, we leave x as is.
You can try this:
df['Name'].str.split(',').str[::-1].str.join(' ').str.strip()
Output:
0 John Doe
1 John Smith
2 John Snow
3 John Cena
4 Steve Smith
Name: Name, dtype: object
Split on the comma, reverse the element order, join the elements with a space, and strip leading and trailing spaces (the part after the comma keeps its leading space).
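
The chain can be easier to follow when each step is stored separately. This is only a sketch on a two-row sample Series, and the intermediate variable names are made up for illustration:

```python
import pandas as pd

s = pd.Series(['Doe, John', 'John Cena'], name='Name')

parts = s.str.split(',')               # lists: ['Doe', ' John'] and ['John Cena']
reversed_parts = parts.str[::-1]       # lists: [' John', 'Doe'] and ['John Cena']
joined = reversed_parts.str.join(' ')  # strings: ' John Doe' and 'John Cena'
result = joined.str.strip()            # drops the leading space left by the split

print(result)
```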

Python - Remove words from variable

I am trying to remove any occurrence of 'Doctor', 'Honorable', and 'Professor' from a variable in a dataframe. Here is an example of the dataframe:
Name
professor Rick Smith
Mark M. Tarleton
Doctor Charles M. Alexander
Professor doctor Todd Mckenzie
Carl L. Darla
Honorable Billy Darlington
Observations could have multiple, one, or none of: 'Doctor', 'Honorable', or 'Professor'. Also, the terms could be upper case or lower case.
Any help would be much appreciated!
Use a regex with str.replace:
regex = r'(?:Doctor|Honorable|Professor)\s*'
df['Name'] = df['Name'].str.replace(regex, '', regex=True, case=False)
Output:
Name
0 Rick Smith
1 Mark M. Tarleton
2 Charles M. Alexander
3 Todd Mckenzie
4 Carl L. Darla
5 Billy Darlington
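
A possible variation, sketched here as an assumption rather than as part of the answer: building the pattern from a list and adding \b word boundaries, so that a title is only removed when it stands as a whole word (for example, a surname that merely starts with "Doctor" would be left alone). The list name titles is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Name': [
    'professor Rick Smith',
    'Mark M. Tarleton',
    'Doctor Charles M. Alexander',
    'Professor doctor Todd Mckenzie',
    'Carl L. Darla',
    'Honorable Billy Darlington',
]})

titles = ['Doctor', 'Honorable', 'Professor']

# \b on both sides restricts the match to whole words.
pattern = r'\b(?:' + '|'.join(titles) + r')\b\s*'

df['Name'] = (
    df['Name']
    .str.replace(pattern, '', regex=True, case=False)
    .str.strip()
)
print(df)
```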

Extract last word in DataFrame column

This has to be so simple - but I can't figure it out. I have a "name" column within a DataFrame and I'm trying to reverse the order of ['First Name', 'Middle Name', 'Last Name'] to ['Last Name', 'First Name', 'Middle Name'].
Here is my code:
for i in range(2114):
    bb = a['Approved by User'][i].split(" ",2)[2]
    aa = a['Approved by User'][i].split(" ",2)[0]
    a['Full Name]'] = bb+','+aa
Unfortunately I keep getting IndexError: list index out of range with the current code.
This is what I want:
Old column Name| Jessica Mary Simpson
New column Name| Simpson Jessica Mary
One way to do it is to split the string and join it again later on in a function,
like so:
import pandas as pd
d = {"name": ["Jessica Mary Simpson"]}
df = pd.DataFrame(d)
a = df.name.str.split()
a = a.apply(lambda x: " ".join(x[::-1])).reset_index()
print(a)
output:
index name
0 0 Simpson Mary Jessica
With your shown samples, you could try following.
Let's say following is the df:
fullname
0 Jessica Mary Simpson
1 Ravinder avtar singh
2 John jonny janardan
Here is the code:
df['fullname'].replace(r'^([^ ]*) ([^ ]*) (.*)$', r'\3 \1 \2',regex=True)
OR
df['fullname'].replace(r'^(\S*) (\S*) (.*)$', r'\3 \1 \2',regex=True)
output will be as follows:
0 Simpson Jessica Mary
1 singh Ravinder avtar
2 janardan John jonny
I think the problem is in your data. Here is a solution using the pandas text functions Series.str.split, indexing, and Series.str.join:
df['Full Name'] = df['Approved by User'].str.split(n=2).str[::-1].str.join(' ')
print (df)
       Approved by User             Full Name
0  Jessica Mary Simpson  Simpson Mary Jessica
1              John Doe              Doe John
2                  Mary                  Mary
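
Reversing the split pieces puts the middle name before the first name, whereas the question asks for "Simpson Jessica Mary". A minimal sketch of one way to move only the last word to the front, which also copes with names of fewer than three words (the likely cause of the IndexError), could look like this. The helper name last_name_first is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    'Approved by User': ['Jessica Mary Simpson', 'John Doe', 'Mary'],
})

def last_name_first(name):
    words = name.split()
    # Slices never raise IndexError, so one-word names are returned unchanged.
    return ' '.join(words[-1:] + words[:-1])

df['Full Name'] = df['Approved by User'].apply(last_name_first)
print(df)
```

This assumes every entry is a string; missing values would need to be handled before calling apply.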

Removing last words in each row in pandas dataframe

A dataframe contains a column named 'full_name' and the rows look like this:
full_name
Peter Eli Smith
Vanessa Mary Ellen
Raul Gonzales
Kristine S Lee
How do I remove the last words and add an additional column called 'first_middle_name' which will result like this?:
full_name first_middle_name
Peter Eli Smith Peter Eli
Vanessa Mary Ellen Vanessa Mary
Raul Gonzales Raul
Kristine S Lee Kristine S
Thank you
We can try using replace with regex=True here:
df["first_middle_name"] = df["full_name"].replace(r"\s+\S+$", "", regex=True)
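Use the str accessor (on recent pandas versions, pass regex=True so the pattern is treated as a regular expression):
df["first_middle_name"] = df["full_name"].str.replace(r"\s+\S+$", "", regex=True)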
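
As an alternative sketch that avoids regular expressions altogether, the last word can be dropped by splitting once from the right and keeping the first piece. The sample frame below simply reproduces the data from the question:

```python
import pandas as pd

df = pd.DataFrame({'full_name': [
    'Peter Eli Smith',
    'Vanessa Mary Ellen',
    'Raul Gonzales',
    'Kristine S Lee',
]})

# rsplit(n=1) splits on whitespace once, starting from the right,
# so element 0 is everything except the last word.
df['first_middle_name'] = df['full_name'].str.rsplit(n=1).str[0]
print(df)
```

Like the regex version, this leaves a one-word name unchanged.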

Extract sub-string between 2 special characters from one column of Pandas DataFrame

I have a Python Pandas DataFrame like this:
Name
Jim, Mr. Jones
Sara, Miss. Baker
Leila, Mrs. Jacob
Ramu, Master. Kuttan
I would like to extract only name title from Name column and copy it into a new column named Title. Output DataFrame looks like this:
Name Title
Jim, Mr. Jones Mr
Sara, Miss. Baker Miss
Leila, Mrs. Jacob Mrs
Ramu, Master. Kuttan Master
I am trying to find a solution with regex but failed to find a proper result.
In [157]: df['Title'] = df.Name.str.extract(r',\s*([^\.]*)\s*\.', expand=False)
In [158]: df
Out[158]:
Name Title
0 Jim, Mr. Jones Mr
1 Sara, Miss. Baker Miss
2 Leila, Mrs. Jacob Mrs
3 Ramu, Master. Kuttan Master
or
In [163]: df['Title'] = df.Name.str.split(r'\s*,\s*|\s*\.\s*').str[1]
In [164]: df
Out[164]:
Name Title
0 Jim, Mr. Jones Mr
1 Sara, Miss. Baker Miss
2 Leila, Mrs. Jacob Mrs
3 Ramu, Master. Kuttan Master
Have a look at str.extract.
The regexp you are looking for is (?<=, )\w+(?=\.). In words: take the substring that is preceded by ", " (not included in the match), consists of at least one word character, and is followed by a "." (also not included). Note that str.extract requires a capture group, so wrap the word part in parentheses: (?<=, )(\w+)(?=\.). In future, use an online regexp tester such as regex101; regexps become rather trivial that way.
This is assuming each entry in the Name column is formatted the same way.
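
A short sketch of the lookaround pattern used with str.extract, assuming every name follows the "First, Title. Last" layout shown in the question:

```python
import pandas as pd

df = pd.DataFrame({'Name': [
    'Jim, Mr. Jones',
    'Sara, Miss. Baker',
    'Leila, Mrs. Jacob',
    'Ramu, Master. Kuttan',
]})

# (?<=, ) requires ", " right before the title; (?=\.) requires a literal
# dot right after it. The parentheses around \w+ form the capture group
# that str.extract returns.
df['Title'] = df['Name'].str.extract(r'(?<=, )(\w+)(?=\.)', expand=False)
print(df)
```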
