Python: remove text in a column if it starts with the value of another column

In my dataframe, I want to remove the text at the start of one column when it matches the value of another column.
Example of dataframe:
name        var1
John Smith  John Smith Hello world
Mary Jane   Mary Jane Python is cool
James Bond  My name is James Bond
Peter Pan   Nothing happens here
Dataframe that I want:
name        var1
John Smith  Hello world
Mary Jane   Python is cool
James Bond  My name is James Bond
Peter Pan   Nothing happens here
Something as simple as:
df[~df.var1.str.contains(df.var1)]
does not work. How should I write my Python code?

Try using apply with a lambda:
df["var1"] = df.apply(
    lambda x: x["var1"][len(x["name"]):].strip()
    if x["name"] == x["var1"][:len(x["name"])]
    else x["var1"],
    axis=1,
)

How about this?
df['var1'] = [df.loc[i, 'var1'].replace(df.loc[i, 'name'], "") for i in df.index]
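Note that str.replace removes the name wherever it appears, so "My name is James Bond" would also be changed. A prefix-only variant might look like this (removeprefix needs Python 3.9+; on older versions, slice as in the first answer):
df['var1'] = [
    df.loc[i, 'var1'].removeprefix(df.loc[i, 'name']).strip()
    for i in df.index
]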

Related

Strip column values if they start with a specific string - pandas

I have a pandas dataframe (sample below).
id  name
1   Mr-Mrs-Jon Snow
2   Mr-Mrs-Jane Smith
3   Mr-Mrs-Darth Vader
I'm looking to strip the "Mr-Mrs-" prefix from the dataframe, i.e. the output should be:
id  name
1   Jon Snow
2   Jane Smith
3   Darth Vader
I tried using
df['name'] = df['name'].str.lstrip("Mr-Mrs-")
but in doing so, some letters of the names in some rows also get stripped out.
I don't want to run a loop and do .loc for every row; is there a better/optimized way to achieve this?
Don't strip: lstrip treats its argument as a set of characters to remove, not as a prefix, which is why letters were being lost. Instead, replace using a start-of-string anchor (^):
df['name'] = df['name'].str.replace(r"^Mr-Mrs-", "", regex=True)
Or removeprefix:
df['name'] = df['name'].str.removeprefix("Mr-Mrs-")
Output:
id  name
1   Jon Snow
2   Jane Smith
3   Darth Vader
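To illustrate the lstrip pitfall (the name here is made up): lstrip removes any leading characters that belong to the set {'M', 'r', '-', 's'}, not the literal prefix, so a name starting with one of those letters loses characters too:
print("Mr-Mrs-Maria Ross".lstrip("Mr-Mrs-"))        # 'aria Ross' -- the leading 'M' of 'Maria' is stripped as well
print("Mr-Mrs-Maria Ross".removeprefix("Mr-Mrs-"))  # 'Maria Ross'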

How do I match two data frames in pandas with multiple matches?

I have two data frames. I want to match some data from one data frame and append it to the other.
df1 looks like this:
sourceId  firstName   lastName
1234      John        Doe
5678      Sally       Green
9101      Chlodovech  Anderson
df2 looks like this:
sourceId   agentId
123456789  1234,5678
987654321  9101
143216546  1234,5678
I want my Final Data Frame to look like this:
sourceId  firstName   lastName  agentId
1234      John        Doe       123456789,143216546
5678      Sally       Green     123456789,143216546
9101      Chlodovech  Anderson  987654321
Usually appending data is easy, but I'm not quite sure how to match this data up and then join the matches with commas in between. I'm fairly new to pandas, so any help is appreciated.
This works. It's long and not the most elegant, but it works well :)
tmp = (
    df2.assign(agentId=df2['agentId'].str.split(','))
       .explode('agentId')
       .set_index('agentId')['sourceId']
       .astype(str)
       .groupby(level=0)
       .agg(list)
       .str.join(',')
       .reset_index()
)
df1['sourceId'] = df1['sourceId'].astype(str)
new_df = (
    df1.merge(tmp, left_on='sourceId', right_on='agentId')
       .drop('agentId', axis=1)
       .rename({'sourceId_x': 'sourceId', 'sourceId_y': 'agentId'}, axis=1)
)
Output:
>>> new_df
sourceId firstName lastName agentId
0 1234 John Doe 123456789,143216546
1 5678 Sally Green 123456789,143216546
2 9101 Chlodovech Anderson 987654321
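If you want to reproduce this, the input frames from the question can be built roughly like so (a sketch; the dtypes are a guess, with sourceId numeric and agentId stored as a comma-separated string, as in the example):
import pandas as pd

df1 = pd.DataFrame({
    'sourceId': [1234, 5678, 9101],
    'firstName': ['John', 'Sally', 'Chlodovech'],
    'lastName': ['Doe', 'Green', 'Anderson'],
})
df2 = pd.DataFrame({
    'sourceId': [123456789, 987654321, 143216546],
    'agentId': ['1234,5678', '9101', '1234,5678'],
})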

Match name and surname from two data frames, extract the middle name from one data frame and append it to the other

I have two almost identical data frames, A and B. In reality they are two data frames with 1000+ names each.
I want to match on name and surname in both data frames and then copy the middle name from data frame B into data frame A.
data frame A
name         surname
John         Doe
Tom          Sawyer
Huckleberry  Finn
data frame B
name   middle_name  surname
John   `O           Doe
Tom    Philip       Sawyer
Lilly  Tomas        Finn
The result I seek:
name  middle_name  surname
John  `O           Doe
Tom   Philip       Sawyer
You can use df.merge with how='inner' and on=['name','surname']. To get the correct column order, use df.reindex over axis 1.
# assuming df is data frame A and df1 is data frame B
df = df.merge(df1, how='inner', on=['name', 'surname'])
df.reindex(['name', 'middle_name', 'surname'], axis=1)
name middle_name surname
0 John `O Doe
1 Tom Philip Sawyer
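A self-contained sketch of the same idea, using dfA and dfB as hypothetical names for data frames A and B:
import pandas as pd

dfA = pd.DataFrame({'name': ['John', 'Tom', 'Huckleberry'],
                    'surname': ['Doe', 'Sawyer', 'Finn']})
dfB = pd.DataFrame({'name': ['John', 'Tom', 'Lilly'],
                    'middle_name': ['`O', 'Philip', 'Tomas'],
                    'surname': ['Doe', 'Sawyer', 'Finn']})

# inner merge keeps only the name/surname pairs present in both frames
result = dfA.merge(dfB, how='inner', on=['name', 'surname'])
result = result.reindex(['name', 'middle_name', 'surname'], axis=1)
print(result)  # two rows: John `O Doe and Tom Philip Sawyer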

How to delete Pandas rows that have been seen before

If I have a table as follows in a pandas DataFrame:
Date Name
15/12/01 John Doe
15/12/01 John Doe
15/12/01 John Doe
15/12/02 Mary Jean
15/12/02 Mary Jean
15/12/02 Mary Jean
I would like to delete all instances of John Doe/Mary Jean (or whatever name may be there) with the same date and only keep the latest one. After the operation it would look like this:
Date Name
15/12/01 John Doe
15/12/02 Mary Jean
Here the third instance of both John Doe and Mary Jean has been kept and the rest have been deleted. How could I do this in an efficient and fast way in pandas?
Thanks!
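A minimal sketch of one way to do this (assuming "latest" means the last row in the frame's current order) uses drop_duplicates with keep='last':
# keep only the last row for each Date/Name combination
df = df.drop_duplicates(subset=['Date', 'Name'], keep='last')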

Group by pandas data frame unique first values - numpy array returned

From a pandas data frame with two string columns, looking like:
d = {'SCHOOL': ['Yale', 'Yale', 'LBS', 'Harvard', 'UCLA', 'Harvard', 'HEC'],
     'NAME': ['John', 'Marc', 'Alex', 'Will', 'Will', 'Miller', 'Tom']}
df = pd.DataFrame(d)
Notice that the relationship between NAME and SCHOOL is n to 1.
I want to get the last school in case one person has gone to two different schools (see the "Will" case).
So far I got:
df = df.groupby('NAME')['SCHOOL'].unique().reset_index()
Return:
NAME SCHOOL
0 Alex [LBS]
1 John [Yale]
2 Marc [Yale]
3 Miller [Harvard]
4 Tom [HEC]
5 Will [Harvard, UCLA]
PROBLEMS:
unique() returns both schools, not only the last school.
This line returns the SCHOOL column as a np.array instead of a string, which makes the df very difficult to work with further.
Both problems were solved based on #IanS comments.
Using last() instead of unique():
df = df.groupby('NAME')['SCHOOL'].last().reset_index()
Return:
NAME SCHOOL
0 Alex LBS
1 John Yale
2 Marc Yale
3 Miller Harvard
4 Tom HEC
5 Will UCLA
Use drop_duplicates with keep='last', specifying the column to check for duplicates:
df = df.drop_duplicates('NAME', keep='last')
print (df)
NAME SCHOOL
0 John Yale
1 Marc Yale
2 Alex LBS
4 Will UCLA
5 Miller Harvard
6 Tom HEC
Also, if you need sorting, add sort_values:
df = df.drop_duplicates('NAME', keep='last').sort_values('NAME')
print (df)
NAME SCHOOL
2 Alex LBS
0 John Yale
1 Marc Yale
5 Miller Harvard
6 Tom HEC
4 Will UCLA
