I am trying to remove any occurrence of 'Doctor', 'Honorable', and 'Professor' from a variable in a dataframe. Here is an example of the dataframe:
Name
professor Rick Smith
Mark M. Tarleton
Doctor Charles M. Alexander
Professor doctor Todd Mckenzie
Carl L. Darla
Honorable Billy Darlington
Observations could have multiple, one, or none of: 'Doctor', 'Honorable', or 'Professor'. Also, the terms could be upper case or lower case.
Any help would be much appreciated!
Use a regex with str.replace:
regex = r'(?:Doctor|Honorable|Professor)\s*'
df['Name'] = df['Name'].str.replace(regex, '', regex=True, case=False)
Output:
Name
0 Rick Smith
1 Mark M. Tarleton
2 Charles M. Alexander
3 Todd Mckenzie
4 Carl L. Darla
5 Billy Darlington
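For anyone who wants to run this end to end, here is a minimal self-contained sketch built on the sample data from the question. The `\b` word boundaries and the trailing `.str.strip()` are my own additions (an assumption, not part of the answer above); they are meant to keep the pattern from matching inside a longer word and to tidy up leftover whitespace.

```python
import pandas as pd

df = pd.DataFrame({"Name": [
    "professor Rick Smith",
    "Mark M. Tarleton",
    "Doctor Charles M. Alexander",
    "Professor doctor Todd Mckenzie",
    "Carl L. Darla",
    "Honorable Billy Darlington",
]})

# ASSUMPTION: \b is added here so that only whole words are removed.
pattern = r"\b(?:Doctor|Honorable|Professor)\b\s*"

# case=False makes the match case-insensitive; regex=True is required
# because str.replace treats the pattern as a literal string by default
# in recent pandas versions.
cleaned = df["Name"].str.replace(pattern, "", case=False, regex=True).str.strip()
print(cleaned.tolist())
```

Every occurrence is replaced, so a row with several titles ("Professor doctor ...") is handled in one pass.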
I have a dataframe which consists of two columns, full name and last name. Sometimes the last name column is not filled in properly. In such cases, the last name appears as the last parenthesized word in the full name column. For the rows where parentheses are found, I would like to set the last name column to the word inside the parentheses.
Code
import pandas as pd
df = pd.DataFrame({
'full':['bob john smith','sam alan (james)','zack joe mac', 'alan (gracie) jacob (arnold)'],
'last': ['ross', '-', 'mac', '-']
})
result_to_be = pd.DataFrame({
'full':['bob john smith','sam alan (james)','zack joe mac', 'alan (gracie) jacob (arnold)'],
'last': ['ross', 'james', 'mac', 'arnold']
})
print(df)
print(result_to_be)
I tried to use the contains function as a mask, but the regex check fails when the pattern is a ')' or '(' character:
df['full'].str.contains(')')
The error it shows is
re.error: unbalanced parenthesis at position 0
You can use .str.findall to get the value between the parentheses and df.loc to assign that where last is -:
df.loc[df['last'] == '-', 'last'] = df['full'].str.findall(r'\((.+?)\)').str[-1]
Output:
>>> df
full last
0 bob john smith ross
1 sam alan (james) james
2 zack joe mac mac
3 alan (gracie) jacob (arnold) arnold
For a slightly different syntax, you could also use extract:
df.loc[df['last'] == '-', 'last'] = df['full'].str.extract(r'.*\((.*)\)', expand=False)
Output:
full last
0 bob john smith ross
1 sam alan (james) james
2 zack joe mac mac
3 alan (gracie) jacob (arnold) arnold
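On the re.error from the question: str.contains treats its argument as a regular expression by default, so a bare ')' is not a valid pattern. A minimal sketch of two ways around that, rebuilding the question's frame (the names `has_paren` and `has_paren_re` are only for illustration):

```python
import re
import pandas as pd

df = pd.DataFrame({
    "full": ["bob john smith", "sam alan (james)", "zack joe mac",
             "alan (gracie) jacob (arnold)"],
    "last": ["ross", "-", "mac", "-"],
})

# Option 1: treat the pattern as a literal string, not a regex.
has_paren = df["full"].str.contains(")", regex=False)

# Option 2: stay in regex mode but escape the metacharacter.
has_paren_re = df["full"].str.contains(re.escape(")"))

print(has_paren.tolist())
print(has_paren_re.tolist())
```

Either mask can then be used with df.loc in the same way as the `df['last'] == '-'` mask above.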
I have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ['Steve Smith', 'Joe Nadal',
'Roger Federer'],
'birthdat/company': ['1995-01-26Sharp, Reed and Crane',
'1955-08-14Price and Sons',
'2000-06-28Pruitt, Bush and Mcguir']})
df[['data_time','full_company_name']] = df['birthdat/company'].str.split('[0-9]{4}-[0-9]{2}-[0-9]{2}', expand=True)
df
With my code I get the following:
____|____Name______|__birthdat/company_______________|_birthdate_|____company___________
0 |Steve Smith |1995-01-26Sharp, Reed and Crane | |Sharp, Reed and Crane
1 |Joe Nadal |1955-08-14Price and Sons | |Price and Sons
2 |Roger Federer |2000-06-28Pruitt, Bush and Mcguir| |Pruitt, Bush and Mcguir
What I want is for the text matching this regex ('[0-9]{4}-[0-9]{2}-[0-9]{2}') to go into its own column and the rest to go into the column "full_company_name":
____|____Name______|_birthdate_|____company_name_______
0 |Steve Smith |1995-01-26 |Sharp, Reed and Crane
1 |Joe Nadal |1955-08-14 |Price and Sons
2 |Roger Federer |2000-06-28 |Pruitt, Bush and Mcguir
Updated Question:
How could I handle missing values for the birthdate or the company name?
Example: birthdate/company = "NaApple" or birthdate/company = "2003-01-15Na". The missing-value markers are not limited to "Na".
You may use
df[['data_time','full_company_name']] = df['birthdat/company'].str.extract(r'^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)', expand=False)
>>> df
Name Age ... data_time full_company_name
0 Steve Smith 32 ... 1995-01-26 Sharp, Reed and Crane
1 Joe Nadal 34 ... 1955-08-14 Price and Sons
2 Roger Federer 36 ... 2000-06-28 Pruitt, Bush and Mcguir
[3 rows x 5 columns]
The Series.str.extract is used here because you need to get two parts without losing the date.
The regex is
^ - start of string
([0-9]{4}-[0-9]{2}-[0-9]{2}) - your date pattern captured into Group 1
(.*) - the rest of the string captured into Group 2.
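To see the two capture groups in isolation, here is a small sketch that applies the same pattern to a single string with the standard re module (the variable names are illustrative only):

```python
import re

pattern = re.compile(r"^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)")

match = pattern.match("1995-01-26Sharp, Reed and Crane")

# groups() returns one entry per capture group, in order.
date_part, company_part = match.groups()
print(date_part)
print(company_part)
```

Series.str.extract does the same thing row by row and puts each group into its own column.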
split splits the string on the separator and discards the separator itself, which is why the date is lost. I think you want extract with two capture groups:
df[['data_time','full_company_name']] = \
df['birthdat/company'].str.extract('^([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)')
Output:
Name birthdat/company data_time full_company_name
-- ------------- --------------------------------- ----------- -----------------------
0 Steve Smith 1995-01-26Sharp, Reed and Crane 1995-01-26 Sharp, Reed and Crane
1 Joe Nadal 1955-08-14Price and Sons 1955-08-14 Price and Sons
2 Roger Federer 2000-06-28Pruitt, Bush and Mcguir 2000-06-28 Pruitt, Bush and Mcguir
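For the missing-value case raised in the updated question, one possible sketch is to let the first group match either a date or a known marker, and then convert the markers to NaN. The `PLACEHOLDERS` list is purely hypothetical (the question says the markers are not limited to "Na" but does not list them), so it would need to be replaced with the real ones.

```python
import re
import numpy as np
import pandas as pd

# ASSUMPTION: hypothetical list of missing-value markers; adjust to the real data.
# Longer markers come first so that "NaN" is tried before "Na".
PLACEHOLDERS = ["NaN", "None", "Na"]
marker = "(?:" + "|".join(map(re.escape, PLACEHOLDERS)) + ")"

df = pd.DataFrame({"birthdat/company": [
    "1995-01-26Sharp, Reed and Crane",
    "NaApple",
    "2003-01-15Na",
]})

# Group 1: a date or one of the markers. Group 2: whatever is left.
pattern = rf"^([0-9]{{4}}-[0-9]{{2}}-[0-9]{{2}}|{marker})(.*)$"
parts = df["birthdat/company"].str.extract(pattern)
parts.columns = ["data_time", "full_company_name"]

# Turn markers (and empty strings) into real missing values.
parts = parts.replace(PLACEHOLDERS + [""], np.nan)
print(parts)
```

Rows that start with neither a date nor a listed marker do not match at all, so both of their columns come back as NaN. That may or may not be what you want.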
A dataframe contains a column named 'full_name' and the rows look like this:
full_name
Peter Eli Smith
Vanessa Mary Ellen
Raul Gonzales
Kristine S Lee
How do I remove the last words and add an additional column called 'first_middle_name' which will result like this?:
full_name first_middle_name
Peter Eli Smith Peter Eli
Vanessa Mary Ellen Vanessa Mary
Raul Gonzales Raul
Kristine S Lee Kristine S
Thank you
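We can try a regex replacement with replace here (regex=True is needed, otherwise the pattern is compared against whole values):
df["first_middle_name"] = df["full_name"].replace(r"\s+\S+$", "", regex=True)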
Use the .str accessor, passing regex=True so the pattern is not treated as a literal string:
df["first_middle_name"] = df["full_name"].str.replace(r"\s+\S+$", "", regex=True)
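A regex-free alternative, as a sketch on the question's sample data: split once from the right on whitespace and keep the left-hand part.

```python
import pandas as pd

df = pd.DataFrame({"full_name": [
    "Peter Eli Smith",
    "Vanessa Mary Ellen",
    "Raul Gonzales",
    "Kristine S Lee",
]})

# rsplit(n=1) splits at most once, starting from the right, on whitespace;
# .str[0] keeps everything before the last word.
df["first_middle_name"] = df["full_name"].str.rsplit(n=1).str[0]
print(df)
```

As with the regex versions, a single-word name is left unchanged.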
I have two columns in a DataFrame. crewname is a list of the crew members who worked on a film. Director_loc is the position of the director within that list.
I want to create a new column which has the name of the director.
crewname Director_loc
[John Lasseter, Joss Whedon, Andrew Stanton, J... 0
[Larry J. Franco, Jonathan Hensleigh, James Ho... 3
[Howard Deutch, Mark Steven Johnson, Mark Stev... 0
[Forest Whitaker, Ronald Bass, Ronald Bass, Ez... 0
[Alan Silvestri, Elliot Davis, Nancy Meyers, N... 5
[Michael Mann, Michael Mann, Art Linson, Micha... 0
[Sydney Pollack, Barbara Benedek, Sydney Polla... 0
[David Loughery, Stephen Sommers, Peter Hewitt... 2
[Peter Hyams, Karen Elise Baldwin, Gene Quinta... 0
[Martin Campbell, Ian Fleming, Jeffrey Caine, ... 0
I've tried a number of codes using list comprehension, enumerate etc. I'm a bit embarrassed to put them here.
Any help will be appreciated.
Use indexing with list comprehension:
df['name'] = [a[b] for a, b in zip(df['crewname'], df['Director_loc'])]
print(df)
crewname Director_loc \
0 [John Lasseter, Joss Whedon, Andrew Stanton] 2
1 [Larry J. Franco, Jonathan Hensleigh] 1
name
0 Andrew Stanton
1 Jonathan Hensleigh
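If some Director_loc values might point past the end of the list, plain indexing raises an IndexError. A small sketch of a guarded variant follows; the helper `pick` and the out-of-range value 5 are made up for illustration and are not from the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "crewname": [
        ["John Lasseter", "Joss Whedon", "Andrew Stanton"],
        ["Larry J. Franco", "Jonathan Hensleigh"],
    ],
    "Director_loc": [2, 5],  # ASSUMPTION: 5 is deliberately out of range
})

def pick(crew, loc):
    """Return crew[loc], or None when loc is not a valid index."""
    return crew[loc] if 0 <= loc < len(crew) else None

# Same zip-based comprehension as above, with the lookup guarded.
df["name"] = [pick(crew, loc) for crew, loc in zip(df["crewname"], df["Director_loc"])]
print(df["name"])
```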
My python function definition is as follows:
import re

def name_extractor(dirty_name):
    print(dirty_name)
    # Replace every non-word character with a space
    clean_name = re.sub(r'\W', " ", dirty_name)
    print(clean_name)
Samples of the dirty names:
(10) Johny Doe
Eric E. Shelby
(1) Chris Melton - ಕಿರಿಕ್ ಕೀರ್ತಿ
Jonas Alexander Bay
Christopher Rockstar - An awesome guy
Jones Collier
I want the output to print just:
Johny Doe
Eric E. Shelby
Chris Melton
Jonas Alexander Bay
Christopher Rockstar
Jones Collier
How can I tweak the regular expression so that it prints only the names as they are and excludes everything (random characters or normal ASCII characters) after the "-"?
You don't need regular expressions for that. Split each line on ' - ' and then filter out the characters you don't want, stripping the extra whitespace:
>>> l = '''(10) Johny Doe
... Eric E. Shelby
... (1) Chris Melton - ಕಿರಿಕ್ ಕೀರ್ತಿ
... Jonas Alexander Bay
... Christopher Rockstar - An awesome guy
... Jones Collier'''.splitlines()
>>> for line in l:
... print(''.join(c for c in line.split(' - ')[0] if c.isalpha() or c in ' .').strip())
...
Johny Doe
Eric E. Shelby
Chris Melton
Jonas Alexander Bay
Christopher Rockstar
Jones Collier
To exclude all non-ASCII characters and everything that follows the ' - ' separator, it is enough to replace them with an empty string "". A short solution using a regex pattern:
import re

dirty_name = '''
(10) Johny Doe
Eric E. Shelby
(1) Chris Melton - ಕಿರಿಕ್ ಕೀರ್ತಿ
Jonas Alexander Bay
Christopher Rockstar - An awesome guy
Jones Collier'''
# Remove non-ASCII characters, digits and parentheses, and everything from ' - ' to the end of the line
clean_name = '\n'.join(l.lstrip() for l in re.sub(r'[^\x00-\x7f]|[\d()]| - .*', "", dirty_name).split('\n'))
print(clean_name)
The output:
Johny Doe
Eric E. Shelby
Chris Melton
Jonas Alexander Bay
Christopher Rockstar
Jones Collier
Edit: removed leading spaces, since @TigerhawkT3 is "space-sensitive".
P.S. \x00-\x7f is the ASCII character range.
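Since the question is about a function, here is a sketch of how name_extractor could look if it returned the cleaned name for one line at a time. It assumes (my assumption, based only on the samples shown) that the unwanted prefix is always a parenthesized number and that the separator is always ' - '.

```python
import re

def name_extractor(dirty_name):
    """Return the bare name from one line of input."""
    # Keep only the text before the first " - " separator.
    name = dirty_name.split(" - ", 1)[0]
    # Drop a leading counter such as "(10)".
    name = re.sub(r"^\s*\(\d+\)\s*", "", name)
    return name.strip()

samples = [
    "(10) Johny Doe",
    "Eric E. Shelby",
    "Christopher Rockstar - An awesome guy",
    "Jones Collier",
]
for sample in samples:
    print(name_extractor(sample))
```

Unlike the character-filtering approaches, this leaves any non-ASCII letters inside the name itself untouched.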