Moving names into "First Name Last Name" order - python

I have imported a CSV dataset into Python that I am cleaning up. There is no consistency in the names: some are "John Doe" and others are "Doe, John". I need them all to be "First name Last name" without the comma:
Doe, John
Smith, John
Snow, John
John Cena
Steve Smith
When I want:
John Doe
John Smith
John Snow
John Cena
Steve Smith
I tried doing:
if ',' in df['names']:
    df['names'] = ' '.join(df['names'].split(',')[::-1]).strip()
I get
AttributeError: 'Series' object has no attribute 'split'
I also tried converting the column to a list beforehand with the code below, but that didn't work either:
df['name'] = df['name'].to_list()

You can use str.replace and use capture groups to swap values:
df['names'] = df['names'].str.replace(r'([^,]+),\s*(.+)', r'\2 \1', regex=True)
print(df)
# Output
names
0 John Doe
1 John Smith
2 John Snow
3 John Cena
4 Steve Smith
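To reproduce this end to end, here is a minimal self-contained sketch (the DataFrame construction is assumed from the sample data in the question):
import pandas as pd

df = pd.DataFrame({'names': ['Doe, John', 'Smith, John', 'Snow, John',
                             'John Cena', 'Steve Smith']})

# Swap "Last, First" into "First Last"; rows without a comma never match
# the pattern and are left untouched.
df['names'] = df['names'].str.replace(r'([^,]+),\s*(.+)', r'\2 \1', regex=True)
print(df['names'].tolist())
# ['John Doe', 'John Smith', 'John Snow', 'John Cena', 'Steve Smith']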
Note: in your own code you have to use the str accessor (though that alone does not solve the rest of the problem):
# Replace
df['names'].split(',')
# With
df['names'].str.split(',')
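If you would rather keep your original conditional idea, a vectorized sketch could look like this (the comma check becomes a boolean mask; the sample column is again assumed):
import pandas as pd

df = pd.DataFrame({'names': ['Doe, John', 'John Cena']})  # illustrative sample

# Boolean mask: which rows actually contain a comma.
has_comma = df['names'].str.contains(',', regex=False)

# Split on the comma, reverse the pieces, re-join with a space, and strip.
swapped = df['names'].str.split(',').str[::-1].str.join(' ').str.strip()

# Only overwrite the rows that had a comma.
df.loc[has_comma, 'names'] = swapped[has_comma]
print(df['names'].tolist())  # ['John Doe', 'John Cena']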

You can use a lambda function with apply to process each name:
df['names'] = df['names'].apply(
    lambda x: (x.split(',')[1] + ' ' + x.split(',')[0]).strip()
    if ',' in x else x
)
Using split(',') you split each name into two strings and access them by index: [0] is the part before the comma and [1] the part after it. You then concatenate [1] with [0] and finally remove leading and trailing whitespace with strip(). All of this happens only if x (remember, x is each individual name) contains a comma; otherwise x is left as is.
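As a quick self-contained check (the sample column below is assumed from the question):
import pandas as pd

df = pd.DataFrame({'names': ['Doe, John', 'Smith, John', 'John Cena']})

# If a value contains a comma, swap the two halves; otherwise leave it alone.
df['names'] = df['names'].apply(
    lambda x: (x.split(',')[1] + ' ' + x.split(',')[0]).strip()
    if ',' in x else x
)
print(df['names'].tolist())  # ['John Doe', 'John Smith', 'John Cena']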

You can try this:
df['Name'].str.split(',').str[::-1].str.join(' ').str.strip()
Output:
0 John Doe
1 John Smith
2 John Snow
3 John Cena
4 Steve Smith
Name: Name, dtype: object
Split on the comma, reverse the element order, join the elements back with a space, and strip any leftover leading/trailing whitespace (the space that followed the comma ends up at the front after the reversal).
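A minimal end-to-end sketch, assuming the column is literally named 'Name' and assigning the result back:
import pandas as pd

df = pd.DataFrame({'Name': ['Doe, John', 'Snow, John', 'Steve Smith']})

# Split on the comma, reverse, join with a space, strip stray whitespace,
# and write the result back to the column.
df['Name'] = df['Name'].str.split(',').str[::-1].str.join(' ').str.strip()
print(df['Name'].tolist())  # ['John Doe', 'John Snow', 'Steve Smith']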

Related

Conditionally update dataframe column if character exists in another column

I have a dataframe which consists of two columns, full name and last name. Sometimes the last name column is not filled properly. In such cases the last name can be found as the last word in the full name column, between parentheses. I would like to update my last name column, for those rows where parentheses are found, to the word between the parentheses.
Code
import pandas as pd

df = pd.DataFrame({
    'full': ['bob john smith', 'sam alan (james)', 'zack joe mac', 'alan (gracie) jacob (arnold)'],
    'last': ['ross', '-', 'mac', '-']
})
result_to_be = pd.DataFrame({
    'full': ['bob john smith', 'sam alan (james)', 'zack joe mac', 'alan (gracie) jacob (arnold)'],
    'last': ['ross', 'james', 'mac', 'arnold']
})
print(df)
print(result_to_be)
I have tried to use the contains function as a mask, but the regex check breaks when checking for the ')' or '(' characters:
df['full'].str.contains(')')
The error it shows is
re.error: unbalanced parenthesis at position 0
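As an aside, the unbalanced-parenthesis error happens because str.contains treats ')' as a regular expression; two common workarounds are shown in this small sketch:
import re
import pandas as pd

s = pd.Series(['sam alan (james)', 'zack joe mac'])

# Option 1: treat the pattern as a literal string rather than a regex.
print(s.str.contains(')', regex=False))

# Option 2: keep regex matching but escape the special character.
print(s.str.contains(re.escape(')')))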
You can use .str.findall to get the value between the parentheses and df.loc to assign that where last is -:
df.loc[df['last'] == '-', 'last'] = df['full'].str.findall(r'\((.+?)\)').str[-1]
Output:
>>> df
full last
0 bob john smith ross
1 sam alan (james) james
2 zack joe mac mac
3 alan (gracie) jacob (arnold) arnold
For a slightly different syntax, you could also use extract
df.loc[df['last'] == '-', 'last'] = df['full'].str.extract(r'.*\((.*)\)', expand=False)
Output:
full last
0 bob john smith ross
1 sam alan (james) james
2 zack joe mac mac
3 alan (gracie) jacob (arnold) arnold
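Either variant can be run end to end against the sample frame from the question; a minimal sketch using the findall version:
import pandas as pd

df = pd.DataFrame({
    'full': ['bob john smith', 'sam alan (james)', 'zack joe mac', 'alan (gracie) jacob (arnold)'],
    'last': ['ross', '-', 'mac', '-']
})

# findall returns every parenthesised group; .str[-1] keeps the last one,
# and .loc only overwrites the rows where last is '-'.
df.loc[df['last'] == '-', 'last'] = df['full'].str.findall(r'\((.+?)\)').str[-1]
print(df)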

Extract last word in DataFrame column

This has to be so simple - but I can't figure it out. I have a "name" column within a DataFrame and I'm trying to reverse the order of ['First Name', 'Middle Name', 'Last Name'] to ['Last Name', 'First Name', 'Middle Name'].
Here is my code:
for i in range(2114):
    bb = a['Approved by User'][i].split(" ", 2)[2]
    aa = a['Approved by User'][i].split(" ", 2)[0]
    a['Full Name'] = bb + ',' + aa
Unfortunately I keep getting IndexError: list index out of range with the current code.
This is what I want:
Old column Name| Jessica Mary Simpson
New column Name| Simpson Jessica Mary
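A likely cause of the IndexError is rows whose name has fewer than three words, so index [2] does not exist after split(" ", 2); a tiny illustration:
# A one-word name gives a one-element list, so [2] is out of range.
parts = "Mary".split(" ", 2)
print(parts)     # ['Mary']
print(parts[2])  # IndexError: list index out of range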
One way to do it is to split the string and join it back in reverse order inside a function,
like so:
import pandas as pd
d = {"name": ["Jessica Mary Simpson"]}
df = pd.DataFrame(d)
a = df.name.str.split()
a = a.apply(lambda x: " ".join(x[::-1])).reset_index()
print(a)
output:
index name
0 0 Simpson Mary Jessica
With your shown samples, you could try the following.
Let's say the following is the df:
fullname
0 Jessica Mary Simpson
1 Ravinder avtar singh
2 John jonny janardan
Here is the code:
df['fullname'].replace(r'^([^ ]*) ([^ ]*) (.*)$', r'\3 \1 \2',regex=True)
OR
df['fullname'].replace(r'^(\S*) (\S*) (.*)$', r'\3 \1 \2',regex=True)
output will be as follows:
0 Simpson Jessica Mary
1 singh Ravinder avtar
2 janardan John jonny
I think the problem is in your data; here is a solution using the pandas text functions Series.str.split, indexing and Series.str.join:
df['Full Name'] = df['Approved by User'].str.split(n=2).str[::-1].str.join(' ')
print (df)
Approved by User Full Name
0 Jessica Mary Simpson Simpson Mary Jessica
1 John Doe Doe John
2 Mary Mary

Get string instead of list in Pandas DataFrame

I have a column Name of string data type. I want to keep every word except the last one and put the result in a new column FName, which I could achieve like this:
df = pd.DataFrame({'Name': ['John A Sether', 'Robert D Junior', 'Jonny S Rab'],
                   'Age': [32, 34, 36]})
df['FName'] = df['Name'].str.split(' ').str[0:-1]
Name Age FName
0 John A Sether 32 [John, A]
1 Robert D Junior 34 [Robert, D]
2 Jonny S Rab 36 [Jonny, S]
But the new column FName looks like a list, which I don't want. I want it to be like: John A.
I tried converting the list to a string, but it did not seem right.
Any suggestions?
You can use .str.rsplit:
df['FName'] = df['Name'].str.rsplit(n=1).str[0]
Or you can use .str.extract:
df['FName'] = df['Name'].str.extract(r'(\S+\s?\S*)', expand=False)
Or, you can chain .str.join after .str.split:
df['FName'] = df['Name'].str.split().str[:-1].str.join(' ')
Name Age FName
0 John A Sether 32 John A
1 Robert D Junior 34 Robert D
2 Jonny S Rab 36 Jonny S
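If it helps to see what rsplit(n=1) does: it splits at most once, starting from the right, so element [0] is everything before the last space; a small sketch:
import pandas as pd

s = pd.Series(['John A Sether'])
print(s.str.rsplit(n=1).tolist())         # [['John A', 'Sether']]
print(s.str.rsplit(n=1).str[0].tolist())  # ['John A']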

Removing last words in each row in pandas dataframe

A dataframe contains a column named 'full_name' and the rows look like this:
full_name
Peter Eli Smith
Vanessa Mary Ellen
Raul Gonzales
Kristine S Lee
How do I remove the last word in each row and add an additional column called 'first_middle_name', so that the result looks like this?
full_name first_middle_name
Peter Eli Smith Peter Eli
Vanessa Mary Ellen Vanessa Mary
Raul Gonzales Raul
Kristine S Lee Kristine S
Thank you
We can try using str.replace here:
df["first_middle_name"] = df["full_name"].replace("\s+\S+$", "")
Use the str accessor:
df["first_middle_name"] = df["full_name"].str.replace("\s+\S+$", "")

How to concatenate rows with similar IDs in multiple TSV files in python?

I have three TSV files named file1.tsv, file2.tsv and file3.tsv.
The three TSV files have the following column names:
ID
Comment
Now I want to create a tsv file, where each ID gets a concatenated 'comment' string by checking the three files.
For example;
file1.tsv
ID Comment
Anne Smith Comment 1 of Anne smith
Peter Smith Comment 1 of peter smith
file2.tsv
ID Comment
John Cena Comment 2 of john cena
Peter Smith Comment 2 of peter smith
file3.tsv
ID Comment
John Cena Comment 3 of john cena
Peter Smith Comment 3 of peter smith
The results file should be;
results.tsv
ID Comment
Anne Smith Comment 1 of Anne smith
John Cena Comment 2 of john cena. Comment 3 of john cena.
Peter Smith Comment 1 of peter smith. Comment 2 of peter smith. Comment 3 of peter smith
I am new to pandas. Just wondering if I can use pandas or any other suitable library to perform the concatenation rather than writing it from scratch.
Assuming you read your tsv into df1, df2, df3
df = pd.concat([df1, df2, df3]).groupby('ID').Comment.apply('. '.join)
You can just use Pandas' read_csv function, but with the sep argument set to \t.
If you use this on all three TSV files, you should end up with three dataframes. You can then use the merge function to combine them how you wish.
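Putting those two answers together, a short self-contained sketch (the file names are assumed from the question):
import pandas as pd

# File names assumed from the question.
files = ['file1.tsv', 'file2.tsv', 'file3.tsv']

# Read each tab-separated file, stack the frames, then join the comments per ID.
frames = [pd.read_csv(f, sep='\t') for f in files]
result = pd.concat(frames).groupby('ID')['Comment'].apply('. '.join).reset_index()

result.to_csv('results.tsv', sep='\t', index=False)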
To further expand on Wen's answer, the last loop is not very idiomatic pandas, but it works...
file1 = '''ID\tComment
Anne Smith\tComment 1 of Anne smith
Peter Smith\tComment 1 of peter smith
'''
file2 = '''ID\tComment
John Cena\tComment 2 of john cena
Peter Smith\tComment 2 of peter smith
'''
file3 = '''ID\tComment
John Cena\tComment 3 of john cena
Peter Smith\tComment 3 of peter smith
'''
flist = []
# Write each sample string out to its own .tsv file.
for i, content in enumerate([file1, file2, file3], start=1):
    fname = 'file{}.tsv'.format(i)
    with open(fname, 'w') as f:
        f.write(content)
    flist.append(fname)
import pandas as pd

dflist = []
for fname in flist:
    df = pd.read_csv(fname, delimiter='\t')
    dflist.append(df)

grouped = pd.concat(dflist).groupby('ID')

data = []
for row in grouped:
    data.append({'ID': row[0], 'Comments': '. '.join(row[1].Comment)})

pd.DataFrame(data, columns=['ID', 'Comments']).to_csv('concat.tsv', sep='\t', index=False)
