A dataframe contains a column named 'full_name' and the rows look like this:
full_name
Peter Eli Smith
Vanessa Mary Ellen
Raul Gonzales
Kristine S Lee
How do I remove the last word and add an additional column called 'first_middle_name' so the result looks like this?:
full_name first_middle_name
Peter Eli Smith Peter Eli
Vanessa Mary Ellen Vanessa Mary
Raul Gonzales Raul
Kristine S Lee Kristine S
Thank you
We can try using str.replace here:
df["first_middle_name"] = df["full_name"].str.replace(r"\s+\S+$", "", regex=True)
Use str.replace:
df["first_middle_name"] = df["full_name"].str.replace(r"\s+\S+$", "", regex=True)
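A self-contained sketch of the above, using the sample frame from the question (the regex drops the final whitespace-delimited word):

```python
import pandas as pd

df = pd.DataFrame({"full_name": ["Peter Eli Smith", "Vanessa Mary Ellen",
                                 "Raul Gonzales", "Kristine S Lee"]})

# Strip the last whitespace-delimited word, leaving first and middle names.
df["first_middle_name"] = df["full_name"].str.replace(r"\s+\S+$", "", regex=True)
print(df)
```

Note that `regex=True` is needed explicitly in recent pandas, where `str.replace` defaults to literal string replacement.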
I am trying to remove any occurrence of 'Doctor', 'Honorable', and 'Professor' from a variable in a dataframe. Here is an example of the dataframe:
Name
professor Rick Smith
Mark M. Tarleton
Doctor Charles M. Alexander
Professor doctor Todd Mckenzie
Carl L. Darla
Honorable Billy Darlington
Observations could have multiple, one, or none of: 'Doctor', 'Honorable', or 'Professor'. Also, the terms could be upper case or lower case.
Any help would be much appreciated!
Use a regex with str.replace:
regex = r'(?:Doctor|Honorable|Professor)\s*'
df['Name'] = df['Name'].str.replace(regex, '', regex=True, case=False)
Output:
Name
0 Rick Smith
1 Mark M. Tarleton
2 Charles M. Alexander
3 Todd Mckenzie
4 Carl L. Darla
5 Billy Darlington
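The same replacement as a self-contained sketch with the sample data (`case=False` works here because the pattern is an uncompiled string):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["professor Rick Smith", "Mark M. Tarleton",
                            "Doctor Charles M. Alexander",
                            "Professor doctor Todd Mckenzie",
                            "Carl L. Darla", "Honorable Billy Darlington"]})

# case=False makes the alternation match regardless of capitalisation;
# \s* also consumes the space after each removed title.
regex = r"(?:Doctor|Honorable|Professor)\s*"
df["Name"] = df["Name"].str.replace(regex, "", regex=True, case=False)
print(df)
```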
I have a dataframe which consists of two columns, full name and last name. Sometimes the last name column is not filled properly. In such cases, the last name can be found as the last word in the full name column, between parentheses. I would like to update my last name column, for those rows where parentheses are found, to the word between the parentheses.
Code
import pandas as pd

df = pd.DataFrame({
    'full': ['bob john smith', 'sam alan (james)', 'zack joe mac',
             'alan (gracie) jacob (arnold)'],
    'last': ['ross', '-', 'mac', '-']
})

result_to_be = pd.DataFrame({
    'full': ['bob john smith', 'sam alan (james)', 'zack joe mac',
             'alan (gracie) jacob (arnold)'],
    'last': ['ross', 'james', 'mac', 'arnold']
})

print(df)
print(result_to_be)
I have tried to use str.contains to build a mask, but the unescaped parenthesis breaks the regex when checking for ')' or '(' characters:
df['full'].str.contains(')')
The error it shows is
re.error: unbalanced parenthesis at position 0
You can use .str.findall to get the value between the parentheses and df.loc to assign that where last is -:
df.loc[df['last'] == '-', 'last'] = df['full'].str.findall(r'\((.+?)\)').str[-1]
Output:
>>> df
full last
0 bob john smith ross
1 sam alan (james) james
2 zack joe mac mac
3 alan (gracie) jacob (arnold) arnold
For a slightly different syntax, you could also use str.extract:
df.loc[df['last'] == '-', 'last'] = df['full'].str.extract(r'.*\((.*)\)', expand=False)
Output:
full last
0 bob john smith ross
1 sam alan (james) james
2 zack joe mac mac
3 alan (gracie) jacob (arnold) arnold
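Either variant can be verified with a self-contained sketch (here the findall version, using the frame from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'full': ['bob john smith', 'sam alan (james)', 'zack joe mac',
             'alan (gracie) jacob (arnold)'],
    'last': ['ross', '-', 'mac', '-'],
})

# findall collects every parenthesised group; .str[-1] keeps the last match.
# The assignment only touches rows where last is '-', thanks to the boolean mask.
df.loc[df['last'] == '-', 'last'] = df['full'].str.findall(r'\((.+?)\)').str[-1]
print(df)
```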
Input:
I have a Dataframe as follows
Full_Name Name1 Name2
John Mathew Davidson John Davidson
Paul Theodre Luther Paul Theodre
Victor George Mary George Mary
Output:
I need to find the Remaining_name column as shown below
Full_Name Name1 Name2 Remaining_name
John Mathew Davidson John Davidson Mathew
Paul Theodre Luther Paul Theodre Luther
Victor George Mary George Mary Victor
Clarification:
I need to compare more than one column's value (word) in another column's value (sentence) and find the unmatched words which could be in any position of the whole string.
This is the data you provided:
import pandas as pd
full_name = ['John Mathew Davidson', 'Paul Theodre Luther', 'Victor George Mary']
name_1 = ['John', 'Paul', 'George']
name_2 = ['Davidson', 'Theodre', 'Mary']
df = pd.DataFrame({'Full_Name':full_name, 'Name1':name_1, 'Name2':name_2 })
In order to perform an action on multiple columns of a row, it is best to define the function separately; it makes the code more readable and easier to debug.
The function will take a DataFrame row as an input:
def find_missing_name(row):
    ## add the known names to a list so we can check against it later
    known_names = [row['Name1'], row['Name2']]
    ## convert the full name to a list by splitting on spaces
    ## WARNING: this only works if 'Full_Name' items are separated by spaces
    full_name_list = row['Full_Name'].split(' ')
    ## loop through the full name list, keeping only the names missing from known_names
    missing_name = [x for x in full_name_list if x not in known_names]
    ## in case there is more than one missing name, join them into a comma-separated string
    missing_name = ','.join(missing_name)
    return missing_name
Now apply the function to the existing DataFrame:
df['missing_name'] = df.apply(find_missing_name, axis=1) ## axis=1 means 'apply to each row', where axis=0 means 'apply to each column'
Output:
Full_Name Name1 Name2 missing_name
0 John Mathew Davidson John Davidson Mathew
1 Paul Theodre Luther Paul Theodre Luther
2 Victor George Mary George Mary Victor
Hope this helps :)
You can do so in one line with:
df['Remaining_name'] = df.apply(lambda x: [i for i in x['Full_Name'].split() if all(i not in x[c] for c in df.columns[1:])], axis=1)
This will return your Remaining_name column as a list, but this functionality will be helpful in the case that you have names with more than three sub-strings, for example:
Full_Name Name1 Name2 Remaining_name
0 John Mathew Davidson John Davidson [Mathew]
1 Paul Theodre Luther Paul Theodre [Luther]
2 Victor George Mary George Mary [Victor]
3 Henry Patrick John Harrison Patrick Henry [John, Harrison]
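If plain strings are preferred over lists, the list column can be joined back afterwards; a sketch using the three-name sample from the question:

```python
import pandas as pd

df = pd.DataFrame({'Full_Name': ['John Mathew Davidson', 'Paul Theodre Luther',
                                 'Victor George Mary'],
                   'Name1': ['John', 'Paul', 'George'],
                   'Name2': ['Davidson', 'Theodre', 'Mary']})

# Keep each word of Full_Name that does not occur in any Name column.
df['Remaining_name'] = df.apply(
    lambda x: [i for i in x['Full_Name'].split()
               if all(i not in x[c] for c in df.columns[1:])],
    axis=1)
# Join each list back into a single space-separated string.
df['Remaining_name'] = df['Remaining_name'].str.join(' ')
print(df)
```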
A solution using replace within apply (row-wise, so not truly vectorized, but concise):
df['Remaining_name'] = df.apply(lambda x: x['Full_Name'].replace(x['Name1'], '').replace(x['Name2'], ''), axis=1).str.strip()
Full_Name Name1 Name2 Remaining_name
0 John Mathew Davidson John Davidson Mathew
1 Paul Theodre Luther Paul Theodre Luther
2 Victor George Mary George Mary Victor
Edit: If you have many columns starting with Name, you can select that slice and replace the values in Full_Name based on a regex pattern
df['tmp'] = df[df.columns[df.columns.str.startswith('Name')]].apply('|'.join, axis=1)
df['Remaining_name'] = df.apply(lambda x: x.replace(x['tmp'], '', regex=True), axis=1)['Full_Name'].str.strip()
df.drop('tmp', axis=1, inplace=True)
Full_Name Name1 Name2 Remaining_name
0 John Mathew Davidson John Davidson Mathew
1 Paul Theodre Luther Paul Theodre Luther
2 Victor George Mary George Mary Victor
3 Henry Patrick John Harrison Henry John Patrick Harrison
I have three tsv files named file1.tsv, file2.tsv and file3.tsv.
The three tsv files have the following column names:
ID
Comment
Now I want to create a tsv file, where each ID gets a concatenated 'comment' string by checking the three files.
For example;
file1.tsv
ID Comment
Anne Smith Comment 1 of Anne smith
Peter Smith Comment 1 of peter smith
file2.tsv
ID Comment
John Cena Comment 2 of john cena
Peter Smith Comment 2 of peter smith
file3.tsv
ID Comment
John Cena Comment 3 of john cena
Peter Smith Comment 3 of peter smith
The results file should be;
results.tsv
ID Comment
Anne Smith Comment 1 of Anne smith
John Cena Comment 2 of john cena. Comment 3 of john cena.
Peter Smith Comment 1 of peter smith. Comment 2 of peter smith. Comment 3 of peter smith
I am new to pandas. Just wondering if I can use pandas or another suitable library to perform the concatenation rather than writing it from scratch.
Assuming you read your tsv files into df1, df2 and df3:
df = pd.concat([df1, df2, df3]).groupby('ID').Comment.apply('. '.join)
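A self-contained sketch of this approach, with the three frames built inline and a reset_index() so the result is a two-column frame ready to write out:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['Anne Smith', 'Peter Smith'],
                    'Comment': ['Comment 1 of Anne smith', 'Comment 1 of peter smith']})
df2 = pd.DataFrame({'ID': ['John Cena', 'Peter Smith'],
                    'Comment': ['Comment 2 of john cena', 'Comment 2 of peter smith']})
df3 = pd.DataFrame({'ID': ['John Cena', 'Peter Smith'],
                    'Comment': ['Comment 3 of john cena', 'Comment 3 of peter smith']})

# Stack the frames, then join each ID's comments with '. '.
result = (pd.concat([df1, df2, df3])
            .groupby('ID')['Comment']
            .apply('. '.join)
            .reset_index())
result.to_csv('results.tsv', sep='\t', index=False)
```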
You can just use Pandas' read_csv function, but with the sep argument set to \t.
If you use this on all three TSV files, you should end up with three dataframes. You can then use the merge function to combine them how you wish.
To further expand on Wen's answer: the last loop is not very pandas-ic, but it works...
file1 = '''ID\tComment
Anne Smith\tComment 1 of Anne smith
Peter Smith\tComment 1 of peter smith
'''
file2 = '''ID\tComment
John Cena\tComment 2 of john cena
Peter Smith\tComment 2 of peter smith
'''
file3 = '''ID\tComment
John Cena\tComment 3 of john cena
Peter Smith\tComment 3 of peter smith
'''
flist = []
# write each sample string to its own file (file1.tsv, file2.tsv, file3.tsv)
for i, content in enumerate([file1, file2, file3], start=1):
    fname = f'file{i}.tsv'
    with open(fname, 'w') as f:
        f.write(content)
    flist.append(fname)

import pandas as pd

dflist = []
for fname in flist:
    df = pd.read_csv(fname, delimiter='\t')
    dflist.append(df)

grouped = pd.concat(dflist).groupby('ID')
data = []
for row in grouped:
    data.append({'ID': row[0], 'Comments': '. '.join(row[1].Comment)})
pd.DataFrame(data, columns=['ID', 'Comments']).to_csv('concat.tsv', sep='\t', index=False)
I have a Python Pandas DataFrame like this:
Name
Jim, Mr. Jones
Sara, Miss. Baker
Leila, Mrs. Jacob
Ramu, Master. Kuttan
I would like to extract only name title from Name column and copy it into a new column named Title. Output DataFrame looks like this:
Name Title
Jim, Mr. Jones Mr
Sara, Miss. Baker Miss
Leila, Mrs. Jacob Mrs
Ramu, Master. Kuttan Master
I am trying to find a solution with regex but failed to find a proper result.
In [157]: df['Title'] = df.Name.str.extract(r',\s*([^\.]*)\s*\.', expand=False)
In [158]: df
Out[158]:
Name Title
0 Jim, Mr. Jones Mr
1 Sara, Miss. Baker Miss
2 Leila, Mrs. Jacob Mrs
3 Ramu, Master. Kuttan Master
or
In [163]: df['Title'] = df.Name.str.split(r'\s*,\s*|\s*\.\s*').str[1]
In [164]: df
Out[164]:
Name Title
0 Jim, Mr. Jones Mr
1 Sara, Miss. Baker Miss
2 Leila, Mrs. Jacob Mrs
3 Ramu, Master. Kuttan Master
Have a look at str.extract.
The regexp you are looking for is (?<=, )\w+(?=\.). In words: take the substring that is preceded by ", " (but do not include it), consists of at least one word character, and is followed by a "." (but do not include it). In future, use an online regexp tester such as regex101; regexps become rather trivial that way.
This is assuming each entry in the Name column is formatted the same way.
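As a sketch with str.extract (note the dot in the lookahead needs escaping, `(?=\.)`, to match a literal period):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jim, Mr. Jones', 'Sara, Miss. Baker',
                            'Leila, Mrs. Jacob', 'Ramu, Master. Kuttan']})

# Title = word characters preceded by ", " and followed by a literal ".".
df['Title'] = df['Name'].str.extract(r'(?<=, )(\w+)(?=\.)', expand=False)
print(df)
```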