Clear column based on a value appearing in another column - python

I feel like this should be really simple but I have been stuck on this all day.
I have a dataframe that looks like:
Name Date Graduated Really?
Bob 2014 Yes
Bob 2020 Yes
Sally 1995 Yes
Sally 1999
Sally 1999 No
Robert 2005 Yes
Robert 2005 Yes
I am grouping by Name and Date. In each group, if "Yes" appears in Graduated, then clear the Really? column. And if "Yes" doesn't appear in the group, then leave as is. The output should look like:
Name Date Graduated Really?
Bob 2014 Yes
Bob 2020 Yes
Sally 1995 Yes
Sally 1999
Sally 1999 No
Robert 2005
Robert 2005 Yes
I keep trying different variations of mask = df.groupby(['Name','Date'])['Graduated'].isin('Yes') before doing df.loc[mask, "Really?"] = None, but I receive an AttributeError (I assume my syntax is incorrect).
Edited expected output.

Try this:
s = df['Graduated'].eq('Yes').groupby([df['Name'],df['Date']]).transform('any')
df['Really?'] = df['Really?'].mask(s,'')
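A minimal, self-contained sketch of the approach above. The question's table lost its column alignment, so the placement of "Yes"/"No" in the sample frame below is an assumption made purely for illustration, and the name has_yes is hypothetical. The AttributeError in the question is likely because the grouped object does not expose isin; comparing first with eq and then grouping sidesteps that.

```python
import pandas as pd

# Assumed layout: which column each "Yes"/"No" belongs to is a guess,
# because the original table's alignment was lost.
df = pd.DataFrame({
    "Name": ["Bob", "Bob", "Sally", "Sally", "Sally", "Robert", "Robert"],
    "Date": [2014, 2020, 1995, 1999, 1999, 2005, 2005],
    "Graduated": ["Yes", "", "Yes", "", "", "", "Yes"],
    "Really?": ["", "Yes", "", "", "No", "Yes", ""],
})

# True on every row whose (Name, Date) group has at least one "Yes" in Graduated
has_yes = (
    df["Graduated"].eq("Yes")
    .groupby([df["Name"], df["Date"]])
    .transform("any")
)

# Blank out Really? only in those groups; other groups are left as they were
df["Really?"] = df["Really?"].mask(has_yes, "")
print(df)
```

With this sample, the Robert/2005 group loses its "Yes" in Really?, while Bob/2020 and Sally/1999 are untouched because neither group has "Yes" in Graduated.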

Related

Add column to DataFrame and assign number to each row

I have the following table
Father Son Year
James Harry 1999
James Alfi 2001
Corey Kyle 2003
I would like to add a fourth column that makes the table look like below. It's supposed to show which child of each father was born first, second, third, and so on. How can I do that?
Father Son Year Child
James Harry 1999 1
James Alfi 2001 2
Corey Kyle 2003 1
Here is one way to do it, using cumcount:
# group by Father and take a cumcount, offset by 1
df['Child']=df.groupby(['Father'])['Son'].cumcount()+1
df
Father Son Year Child
0 James Harry 1999 1
1 James Alfi 2001 2
2 Corey Kyle 2003 1
This assumes that df is already sorted by Father and Year. If it is not, sort first:
df['Child']=df.sort_values(['Father','Year']).groupby(['Father'] )['Son'].cumcount()+1
df
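The sort-then-cumcount variant works on unsorted data because the assignment aligns on the index, so the original row order is kept. A small sketch, with the question's rows deliberately shuffled (the shuffled order is an assumption for illustration):

```python
import pandas as pd

# Same rows as in the question, but deliberately out of order
df = pd.DataFrame({
    "Father": ["James", "Corey", "James"],
    "Son": ["Alfi", "Kyle", "Harry"],
    "Year": [2001, 2003, 1999],
})

# Number the children in Year order within each Father; the result is a
# Series indexed like df, so assigning it back keeps df's row order.
df["Child"] = (
    df.sort_values(["Father", "Year"])
    .groupby("Father")
    .cumcount()
    + 1
)
print(df)
```

Here Alfi (2001) gets 2 and Harry (1999) gets 1 even though Alfi appears first in the frame.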
Here is one way to solve this using the groupby and cumsum functions.
This assumes that the rows are ordered so that a younger sibling is always below their elder sibling and that all children of the same father sit in a contiguous block of rows.
Assume we have the following setup
import pandas as pd
df = pd.DataFrame({'Father': ['James', 'James', 'Corey'],
'Son': ['Harry', 'Alfi', 'Kyle'],
'Year': [1999, 2001, 2003]})
Then here is the trick: we group the siblings with the same father into a groupby object and compute the cumulative sum of a column of ones, which assigns a sequential number to each row.
df['temp_column'] = 1
df['Child'] = df.groupby('Father')['temp_column'].cumsum()
# drop returns a new frame, so assign it back to actually remove the helper column
df = df.drop(columns='temp_column')
The result would look like this
Father Son Year Child
0 James Harry 1999 1
1 James Alfi 2001 2
2 Corey Kyle 2003 1
To make the solution more general, consider reordering the rows to satisfy these preconditions before applying it and then, if necessary, restoring the dataframe to its original order.
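The reorder, number, restore idea can be sketched roughly as follows. The extra row for "Liam" and the helper names ordered, restored and _one are hypothetical, added only to show rows that start out in the wrong order.

```python
import pandas as pd

# Rows are intentionally unordered; "Liam" is a made-up extra row
df = pd.DataFrame({
    "Father": ["Corey", "James", "James", "Corey"],
    "Son": ["Kyle", "Alfi", "Harry", "Liam"],
    "Year": [2003, 2001, 1999, 2000],
})

# 1. Reorder so that siblings are contiguous and sorted by birth year
ordered = df.sort_values(["Father", "Year"])

# 2. Cumulative sum of ones within each father gives 1, 2, 3, ...
ordered["Child"] = ordered.assign(_one=1).groupby("Father")["_one"].cumsum()

# 3. Restore the original row order using the untouched index
restored = ordered.sort_index()
print(restored)
```

Sorting by index at the end only restores the original order if the index was not reset along the way.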

Change Values in Pandas Dataframe

I have two dataframes. I want to change some of the values.
I know how to change values one at a time using isin and where, but I don't know how to apply a large list of changes.
df1
Name Type
David Staff
Jones Pilot
Jack Pilot
Susan Steward
John Staff
Leroy Staff
Steve Staff
df2
Name Type
David Captain
Leroy Pilot
Steve Pilot
How do I change the "type" column on df1 by using df2?
df_desired
Name Type
David Captain
Jones Pilot
Jack Pilot
Susan Steward
John Staff
Leroy Pilot
Steve Pilot
You can map the Type column of df2 onto df1 by Name, then update:
df1['Type'].update(df1['Name'].map(df2.set_index('Name')['Type']))
print(df1)
Name Type
0 David Captain
1 Jones Pilot
2 Jack Pilot
3 Susan Steward
4 John Staff
5 Leroy Pilot
6 Steve Pilot
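Calling update on a column selected from df1 modifies that selected Series in place; depending on the pandas version and whether Copy-on-Write is active, the change may not be written back to df1. A hedged alternative that assigns the result explicitly, using map plus fillna (the name lookup is hypothetical):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["David", "Jones", "Jack", "Susan", "John", "Leroy", "Steve"],
    "Type": ["Staff", "Pilot", "Pilot", "Steward", "Staff", "Staff", "Staff"],
})
df2 = pd.DataFrame({
    "Name": ["David", "Leroy", "Steve"],
    "Type": ["Captain", "Pilot", "Pilot"],
})

# Name -> new Type lookup built from df2 (assumes Name is unique in df2)
lookup = df2.set_index("Name")["Type"]

# Names missing from df2 map to NaN, so fall back to the existing Type
df1["Type"] = df1["Name"].map(lookup).fillna(df1["Type"])
print(df1)
```

This produces the same table as above while avoiding any reliance on in-place modification of a selected column.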

Removing values from column within groups based on conditions

I am really struggling with this even though I feel like it should be extremely easy.
I have a dataframe that looks like this:
Title Release Date Released In Stores
Seinfeld 1995
Seinfeld 1999 Yes
Seinfeld 1999 Yes
Friends 2000 Yes
Friends 2004 Yes
Friends 2004
I am first grouping by Title, and then Release Date and then observing the values of Released and In Stores. If both Released and In Stores have a value of "Yes" in the same Release Date year, then remove the In Stores value.
So in the above dataframe, the category Seinfeld --> 1999 would have the "Yes" removed from In Stores, but the "Yes" would stay in the In Stores category for "2004" since it is the only "Yes" in the Friends --> 2004 category.
I am starting by using
df.groupby(['Title', 'Release Date'])[['Released', 'In Stores']].count()
But I cannot figure out the syntax for removing values from In Stores.
Desired output:
Title Release Date Released In Stores
Seinfeld 1995
Seinfeld 1999 Yes
Seinfeld 1999
Friends 2000 Yes
Friends 2004 Yes
Friends 2004
EDIT: I have tried this line given in the top comment:
flag = (df.groupby(['Title', 'Release Date']).transform(lambda x: (x == 'Yes').any()) .all(axis=1))
but the kernel runs indefinitely.
You can use groupby.transform to flag rows where In Stores needs to be removed, based on whether the row's ['Title', 'Release Date'] group has at least one value of 'Yes' in column Released, and also in column In Stores.
flag = (df.groupby(['Title', 'Release Date'])
.transform(lambda x: (x == 'Yes').any())
.all(axis=1))
print(flag)
0 False
1 True
2 True
3 False
4 False
5 False
dtype: bool
import numpy as np  # needed for np.nan
df.loc[flag, 'In Stores'] = np.nan
Result:
Title Release Date Released In Stores
Seinfeld 1995 nan nan
Seinfeld 1999 Yes nan
Seinfeld 1999 nan nan
Friends 2000 Yes nan
Friends 2004 nan Yes
Friends 2004 nan nan
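The lambda-based transform runs Python code for every group and column, which can be slow on a large frame and might explain the kernel appearing to hang. A sketch of a variant that uses the built-in "any" aggregation on each column instead. It assumes blanks are NaN and that the input layout is the one implied by the printed flag and result above; the names keys, released_any and in_stores_any are hypothetical.

```python
import numpy as np
import pandas as pd

# Input layout inferred from the answer's printed result (an assumption)
df = pd.DataFrame({
    "Title": ["Seinfeld"] * 3 + ["Friends"] * 3,
    "Release Date": [1995, 1999, 1999, 2000, 2004, 2004],
    "Released": [np.nan, "Yes", np.nan, "Yes", np.nan, np.nan],
    "In Stores": [np.nan, np.nan, "Yes", np.nan, "Yes", np.nan],
})

keys = [df["Title"], df["Release Date"]]

# Per-row flags: does this row's group contain a "Yes" in each column?
released_any = df["Released"].eq("Yes").groupby(keys).transform("any")
in_stores_any = df["In Stores"].eq("Yes").groupby(keys).transform("any")

# Clear In Stores only where the group has "Yes" in both columns
flag = released_any & in_stores_any
df.loc[flag, "In Stores"] = np.nan
print(df)
```

Only the Seinfeld/1999 group is flagged, so the "Yes" in In Stores for Friends/2004 is kept.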

Applying values to a column based on value comparison in other columns across different rows in Pandas

I have already searched the Internet for my problem but found nothing quite the same. I am quite new to pandas.
I have a huge dataframe of around 800K rows. Of those, 200K are duplicates that indicate an owner who owns multiple cars under the same SSN (possibly with a different name due to spelling and such). For example, below is my dataframe.
SSN is the key for determining that they are the same person, even though the name might be different (or slightly different):
SSN_ID Name Registration_Number Brand Car Year Eligible Status Channel
00001 Baron Zemo SKV2017 Toyota 86 2020 1 2 Call
00001 Baron Zimo SKV1999 Subaru BRZ 2012 1 0 Call
00002 Steve Rogers SHD2012 Cadillac deVille 1970 1 0 Call
00003 Bucky Barnes MTL9841 Ford Boss 429 1970 1 0 Call
00004 Tony Stark IRN0007 Audi R8 2013 1 1 Apps
00005 Wanda Maximoff SCR1080 Hyundai i-30N 2020 1 1 Apps
00004 Tony Stank ILY3000 Audi e-Tron GT 2020 1 0 Call
00001 Beron Zemo SKV0800 Audi TT-RS 2018 1 1 Apps
The column 'Channel' is the channel where advertisement for insurance promotion will be done, and column 'Status' is the status of customer engagement.
'Status' = 0, No call attempted
'Status' = 1, Answered, rejected/accepted the offer
'Status' = 2, Unanswered, line busy/not pick-up
Previously, calls and promotions were made per car, so an owner could be called multiple times, once for each car. In the example above, Baron Zemo would be called 3 times at separate times/days, once for each of his 3 cars. Now management wants to make sure each owner is called only once, regardless of how many cars they own.
I want to update the column 'Channel' in the dataframe based on the 'Status' column value. The logic is supposed to be simply like this :
If 'Status' == 0 or 2, df[Channel] = 'Call'
If 'Status' == 1, df[Channel] = 'Apps'
The complication is that owners with multiple cars have multiple 'Status' values across rows. Take Zemo (SSN_ID: 00001) and Stark (SSN_ID: 00004), for example: each has several 'Status' values because each owns several cars. So I need to update the 'Channel' column based on the 'Status' values in other rows as well.
Using .loc, I can split the dataframe in two: one for owners with multiple cars and one for owners with a single car.
df1= df.loc[df.duplicated(subset=['SSN_ID'], keep=False)].sort_values(by='SSN_ID', ascending=True)
df2= df.loc[~(df.duplicated(subset=['SSN_ID'], keep=False))]
df1 is like below :
SSN_ID Name Registration_Number Brand Car Year Eligible Status Channel
00001 Baron Zemo SKV2017 Toyota 86 2020 1 2 Call
00001 Baron Zimo SKV1999 Subaru BRZ 2012 1 0 Call
00001 Beron Zemo SKV0800 Audi TT-RS 2018 1 1 Apps
00004 Tony Stark IRN0007 Audi R8 2013 1 2 Apps
00004 Tony Stank ILY3000 Audi e-Tron GT 2020 1 0 Call
Even though Zemo has 3 statuses (2, 0, 1), we have already called him about his Audi TT-RS ('Status' == 1) and he rejected the offer, so we should not call him again (even though he has 2 other cars). His 'Channel' should therefore be set to 'Apps'.
As for Stark, he has 2 statuses (2, 0). Since he did not answer the call ('Status' == 2), we would keep trying to call him until he answers and either rejects or accepts the offer, so his 'Channel' should be set to 'Call'.
However, I do not know how to apply those logic from above.
The final desired result for df1 is like below :
SSN_ID Name Registration_Number Brand Car Year Eligible Status Channel
00001 Baron Zemo SKV2017 Toyota 86 2020 1 2 Apps
00001 Baron Zimo SKV1999 Subaru BRZ 2012 1 0 Apps
00001 Beron Zemo SKV0800 Audi TT-RS 2018 1 1 Apps
00004 Tony Stark IRN0007 Audi R8 2013 1 2 Call
00004 Tony Stank ILY3000 Audi e-Tron GT 2020 1 0 Call
Is there a way to compare across rows and update only the value of the 'Channel' column, without changing the dataframe's shape (since it is still needed for something else)?
Thank you so much.
Disclaimer: I know that if the dataframe were keyed on SSN_ID instead of car/registration number this would be easier, but this is for data manipulation practice.
Hopefully this will help you get started. This should give you the channel column you are looking for.
d = {0:'Call',
1:'Apps'}
df['Channel'] = df['Status'].eq(1).groupby(df['SSN_ID']).transform('any').astype(int).map(d)
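As a rough, self-contained check of that one-liner, the sketch below uses only the SSN_ID, Status and Channel columns from the df1 sample in the question. Mapping the boolean directly is a small variation on the answer's astype(int).map(d), and the name answered is hypothetical.

```python
import pandas as pd

# Subset of the question's df1: only the columns the logic depends on
df = pd.DataFrame({
    "SSN_ID": ["00001", "00001", "00001", "00004", "00004"],
    "Status": [2, 0, 1, 2, 0],
    "Channel": ["Call", "Call", "Apps", "Apps", "Call"],
})

# True on every row of an owner who has at least one Status == 1
answered = df["Status"].eq(1).groupby(df["SSN_ID"]).transform("any")

# Owners who already answered go to 'Apps'; everyone else stays on 'Call'
df["Channel"] = answered.map({True: "Apps", False: "Call"})
print(df)
```

The frame keeps its shape: each owner's rows all receive the same Channel, matching the desired result for Zemo (Apps) and Stark (Call).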

Duplicates using python, if any create a new column when there's a match

I'm currently trying to conduct an analysis where people might be doing something to avoid the system. So, I created a new field inside my dataFrame where I appended the Issue Date and the Name of the potential offender. What I want is: if any of the rows have the same Audit ID, say yes, if not, NaN.
So for example, I have:
Offender Name Issue Date Audit ID
Joe 12/02/2020 Joe-12/02/20
Nic 20/02/2020 Nic-20/02/20
Mat 01/02/2020 Mat-01/02/20
Joe 12/02/2020 Joe-12/02/20
And I want something like:
Offender Name Issue Date Audit ID Matches
Joe 12/02/2020 Joe-12/02/20 Yes
Nic 20/02/2020 Nic-20/02/20 No
Mat 01/02/2020 Mat-01/02/20 No
Joe 12/02/2020 Joe-12/02/20 Yes
I'd appreciate any insights anyone can give me
You can mark duplicates with 'Yes' and 'No'
df['Matches'] = df.duplicated('Audit ID', keep=False).map({True: 'Yes',False: 'No'})
df
Out:
Offender Name Issue Date Audit ID Matches
0 Joe 12/02/2020 Joe-12/02/20 Yes
1 Nic 20/02/2020 Nic-20/02/20 No
2 Mat 01/02/2020 Mat-01/02/20 No
3 Joe 12/02/2020 Joe-12/02/20 Yes
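The keep=False argument is what makes both Joe rows come out as 'Yes'. A short sketch contrasting it with the default keep='first', using only the Audit ID column from the question (the names first_only and all_copies are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Audit ID": ["Joe-12/02/20", "Nic-20/02/20", "Mat-01/02/20", "Joe-12/02/20"],
})

# Default (keep='first'): the first occurrence is NOT flagged as a duplicate
first_only = df.duplicated("Audit ID")

# keep=False: every row that has a twin is flagged, including the first one
all_copies = df.duplicated("Audit ID", keep=False)

df["Matches"] = all_copies.map({True: "Yes", False: "No"})
print(df)
```

With the default, only the last Joe row would be marked, which is usually not what is wanted when flagging every record involved in a match.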
The Audit ID column is redundant: the same information is already in your dataframe, so you can check the original columns directly:
df['Matches'] = df.duplicated(['Offender Name','Issue Date'], keep=False).map({True: 'Yes',False: 'No'})
df
Out:
Offender Name Issue Date Audit ID Matches
0 Joe 12/02/2020 Joe-12/02/20 Yes
1 Nic 20/02/2020 Nic-20/02/20 No
2 Mat 01/02/2020 Mat-01/02/20 No
3 Joe 12/02/2020 Joe-12/02/20 Yes
