Drop duplicates where column value of duplicated row is zero - Python

I am attempting to drop duplicates where the value of a specific column of the duplicated row is zero.
Name Division Clients
0 Dave Sales 0
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
5 Dan HR 0
The output I'm hoping to achieve is shown below:
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
Any assistance anyone could provide would be greatly appreciated.

You can build a mask checking whether Clients == 0, flag all duplicates based on Name and Division with duplicated(keep=False), combine the two masks with &, invert the result, and use it as a boolean mask:
c = df['Clients'].eq(0)
df[~(df.duplicated(['Name','Division'],keep=False) & c)]
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
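For reference, here is a self-contained version of this approach, rebuilding the sample frame from the question so it can be run as-is (a minimal sketch; names follow the question):
import pandas as pd

df = pd.DataFrame({'Name': ['Dave', 'Dave', 'Karen', 'Rachel', 'Dan', 'Dan'],
                   'Division': ['Sales', 'Sales', 'Sales', 'HR', 'HR', 'HR'],
                   'Clients': [0, 15, 10, 20, 45, 0]})

c = df['Clients'].eq(0)                                # rows with zero Clients
dup = df.duplicated(['Name', 'Division'], keep=False)  # all (Name, Division) duplicates
print(df[~(dup & c)])                                  # drop only duplicated zero rows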
Thanks to Seabean, consider the following df (built with pd.concat, since DataFrame.append has been removed in recent pandas):
df1 = pd.concat([df, pd.DataFrame([['Dave','HR',0]], columns=df.columns)], ignore_index=True)
print(df1)
Name Division Clients
0 Dave Sales 0
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
5 Dan HR 0
6 Dave HR 0
c = df1['Clients'].eq(0)
print(df1[~(df1.duplicated(['Name','Division'],keep=False) & c)])
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
6 Dave HR 0

It depends on how your data is organized... If you're reading in from a CSV, you could do something like this:
import pandas as pd
# Get the data:
data = pd.read_csv("employees.csv")
# Sort by Clients ascending so each person's zero rows come first:
data.sort_values("Clients", inplace=True)
# Drop duplicates by Name, keeping the last (highest-Clients) row;
# keep=False would discard every row of a duplicated name:
data.drop_duplicates(subset="Name", keep="last", inplace=True)
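One caveat: de-duplicating on Name alone would also collapse two different people who share a name across divisions. A minimal sketch using both key columns (same idea, only the subset changes):
data.sort_values("Clients", inplace=True)
data.drop_duplicates(subset=["Name", "Division"], keep="last", inplace=True)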

Related

Numbering occurrences within groups in python

I have a pandas dataframe with information about students and test dates. I would like to create a variable that takes on a new value for each student, but also takes on a new value for the same student if 5 years have passed without a test attempt. The desired column is "group" below. How can I do this in Python?
Student test_date group
Bob 1995 1
Bob 1997 1
Bob 2020 2
Bob 2020 2
Mary 2020 3
Mary 2021 3
Mary 2021 3
The initial, very clunky idea I had was to sort by name, then by date, calculate the difference in date, create an indicator where the diff > 5, and then somehow number the groups.
ds = pd.read_excel('../students.xlsx')
ds = ds.sort_values(by=['student','test_date'])
ds['time'] = ds['test_date'].diff()
ds['break'] = 0
ds.loc[(ds['time'] > 5),'break'] = 1
Student test_date time break
Bob 1995 na na
Bob 1997 2 0
Bob 2020 23 1
Bob 2020 0 0
Mary 2020 na na
Mary 2021 1 0
Mary 2021 0 0
df = df.sort_values(["Student", "test_date"])
((df.Student != df.Student.shift()) | (df.test_date.diff().gt(5))).cumsum()
# 0 1
# 1 1
# 2 2
# 3 2
# 4 3
# 5 3
# 6 3
# dtype: int32
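An equivalent way to spell the gap test, assuming the same frame and column names, is a per-student diff via groupby, which guarantees dates are never compared across two different students:
df = df.sort_values(["Student", "test_date"])
gap = df.groupby("Student")["test_date"].diff().gt(5)  # NaN diffs (first test) compare as False
df["group"] = ((df["Student"] != df["Student"].shift()) | gap).cumsum()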

Assign values (1 to N) for similar rows in a dataframe Pandas [duplicate]

This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe (4 answers)
Closed last year.
I have a dataframe df:
Name    Place   Price
Bob     NY      15
Jack    London  27
John    Paris   5
Bill    Sydney  3
Bob     NY      39
Jack    London  9
Bob     NY      2
Dave    NY      7
I need to assign an incremental value (from 1 to N) for each row which has the same name and place (price can be different).
df_out:
Name    Place   Price  Value
Bob     NY      15     1
Jack    London  27     1
John    Paris   5      1
Bill    Sydney  3      1
Bob     NY      39     2
Jack    London  9      2
Bob     NY      2      3
Dave    NY      7      1
I could do this by sorting the dataframe (on Name and Place) and then iteratively checking if they match between two consecutive rows. Is there a smarter/faster pandas way to do this?
You can use a grouped (on Name, Place) cumulative count and add 1 as it starts from 0:
df['Value'] = df.groupby(['Name','Place']).cumcount().add(1)
prints:
Name Place Price Value
0 Bob NY 15 1
1 Jack London 27 1
2 John Paris 5 1
3 Bill Sydney 3 1
4 Bob NY 39 2
5 Jack London 9 2
6 Bob NY 2 3
7 Dave NY 7 1
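A self-contained reproduction, in case you want to verify the behaviour (rebuilt from the question's data):
import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'Jack', 'John', 'Bill', 'Bob', 'Jack', 'Bob', 'Dave'],
                   'Place': ['NY', 'London', 'Paris', 'Sydney', 'NY', 'London', 'NY', 'NY'],
                   'Price': [15, 27, 5, 3, 39, 9, 2, 7]})
df['Value'] = df.groupby(['Name', 'Place']).cumcount().add(1)
print(df)
Note that cumcount numbers rows in their original order within each group, so no prior sorting is needed.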

Pandas comparing dataframes and changing column value based on number of similar rows in another dataframe

Suppose I have two dataframes:
df1:
Person Number Type
0 Kyle 12 Male
1 Jacob 15 Male
2 Jacob 15 Male
df2:
A much larger dataset with a similar format, except there is a Count column that needs to increment based on df1:
Person Number Type Count
0 Kyle 12 Male 0
1 Jacob 15 Male 0
3 Sally 43 Female 0
4 Mary 15 Female 5
What I am looking to do is increase the count column based on the number of occurrences of the same person in df1
Expected output for this example:
Person Number Type Count
0 Kyle 12 Male 1
1 Jacob 15 Male 2
3 Sally 43 Female 0
4 Mary 15 Female 5
Increase Count to 1 for Kyle because there is one instance, and to 2 for Jacob because there are two instances. Leave the values for Sally and Mary unchanged.
How do I do this? I have tried using .loc but I can't figure out how to account for two instances of the same row, meaning I can only get Count to increase by one for Jacob even though there are two Jacobs in df1.
I have tried
df2.loc[df2['Person'].values == df1['Person'].values, 'Count'] += 1
However this does not account for duplicates.
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df1 = df1.set_index(['Person','Number','Type'])
df2 = df2.set_index(['Person','Number','Type'])
df1.add(df2, fill_value=0).reset_index()
Or
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df2.merge(df1, on=['Person','Number','Type'], how='left').set_index(['Person','Number','Type']).sum(axis=1).to_frame('Count').reset_index()
value_counts + Index alignment.
u = df2.set_index("Person")
u.assign(Count=df1["Person"].value_counts().add(u["Count"], fill_value=0))
Number Type Count
Person
Kyle 12 Male 1.0
Jacob 15 Male 2.0
Sally 43 Female 0.0
Mary 15 Female 5.0
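If you'd rather keep the integer dtype and mutate df2 in place, a map-based variant should also work (a sketch, assuming the same df1 and df2 as above):
counts = df1['Person'].value_counts()  # occurrences per person in df1
df2['Count'] += df2['Person'].map(counts).fillna(0).astype(int)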

pandas to find earliest occurrence of statement and set to starter

Consider the following df
data = {'Name': ['John','John','Lucy','Lucy','Lucy'],
        'Payroll': [15,15,75,75,75],
        'Week': [1,2,1,2,3]}
df = pd.DataFrame(data)
Name Payroll Week
0 John 15 1
1 John 15 2
2 Lucy 75 1
3 Lucy 75 2
4 Lucy 75 3
What I'm attempting to do is apply a Boolean throughout a DataFrame very similar to this one (2m+ rows, 20+ columns) to find out when someone started.
To find if someone is active or not I pass a condition to another df:
df2 = df.loc[df.Week == df.Week.max()]
This gives me the final week. I then use an isin filter to find out whether the person is active or has left:
df['Status'] = np.where(df['Payroll'].isin(df2['Payroll']), 'Active','Leaver')
Using the above code I get the following, which tells me that since John is not in the latest week, he has left the company:
Name Payroll Week Status
0 John 15 1 Leaver
1 John 15 2 Leaver
2 Lucy 75 1 Active
3 Lucy 75 2 Active
4 Lucy 75 3 Active
What I'm trying to achieve is to know when John started with us. I could try a mask for each week of the year and an isin check for when each person first appeared, but I figured there must be a more Pythonic way to do this!
Desired output:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Leaver
2 Lucy 75 1 Starter
3 Lucy 75 2 Active
4 Lucy 75 3 Active
Any help is much appreciated.
Edit for clarity:
data = {'Name': ['John','John','John','John','Lucy','Lucy','Lucy','Lucy','Lucy'],
        'Payroll': [15,15,15,15,75,75,75,75,75],
        'Week': [1,2,3,4,1,2,3,4,5]}
df = pd.DataFrame(data)
desired output:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Active
2 John 15 3 Active
3 John 15 4 Leaver
4 Lucy 75 1 Starter
5 Lucy 75 2 Active
6 Lucy 75 3 Active
7 Lucy 75 4 Active
8 Lucy 75 5 Active
Things to note:
Max week is 5, so anyone not in week 5 is a leaver.
The first week a person appears in df makes them a starter.
All weeks in between are set to Active.
Use numpy.select with new conditions built from duplicated:
a = df.loc[df.Week == df.Week.max(), 'Payroll']  # Payroll values present in the final week
m1 = ~df['Payroll'].isin(a)                      # not in the final week
m2 = ~df['Payroll'].duplicated()                 # first occurrence of each Payroll
m3 = ~df['Payroll'].duplicated(keep='last')      # last occurrence of each Payroll
df['Status'] = np.select([m2, m1 & m3], ['Starter', 'Leaver'], 'Active')
print (df)
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Active
2 John 15 3 Active
3 John 15 4 Leaver
4 Lucy 75 1 Starter
5 Lucy 75 2 Active
6 Lucy 75 3 Active
7 Lucy 75 4 Active
8 Lucy 75 5 Active
The simplest way I have come across is using groupby and finding the minimal index for each name's group:
for _, dfg in df.groupby(df['Name']):
    gidx = min(dfg.index)
    df.loc[df.index == gidx, 'Status'] = 'Starter'
print(df)
And the df is then:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Leaver
2 Lucy 75 1 Starter
3 Lucy 75 2 Active
4 Lucy 75 3 Active
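The same idea without an explicit loop: groupby(...).head(1) returns the first row of each group, and its index can label every starter in one assignment (a sketch, assuming a Status column already exists, as in the question's np.where step):
df.loc[df.groupby('Name').head(1).index, 'Status'] = 'Starter'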

Python Grouping Transpose

I have my data in a pandas dataframe
out[1]:
NAME STORE AMOUNT
0 GARY GAP 20
1 GARY GAP 10
2 GARY KROGER 15
3 ASHLEY FOREVER21 30
4 ASHLEY KROGER 10
5 MARK GAP 10
6 ROGER KROGER 30
I'm trying to group by name, sum each person's total amount spent, and also generate a column for each unique store in the dataframe.
Desired:
out[1]:
NAME GAP KROGER FOREVER21
0 GARY 30 15 0
1 ASHLEY 0 10 30
2 MARK 10 0 0
3 ROGER 0 30 0
Thanks for your help!
You need pivot_table:
df1 = df.pivot_table(index='NAME',
                     columns='STORE',
                     values='AMOUNT',
                     aggfunc='sum',
                     fill_value=0)
print (df1)
STORE FOREVER21 GAP KROGER
NAME
ASHLEY 30 0 10
GARY 0 30 15
MARK 0 10 0
ROGER 0 0 30
Alternative solution with aggregating by groupby and sum:
df1 = df.groupby(['NAME','STORE'])['AMOUNT'].sum().unstack(fill_value=0)
print (df1)
STORE FOREVER21 GAP KROGER
NAME
ASHLEY 30 0 10
GARY 0 30 15
MARK 0 10 0
ROGER 0 0 30
Finally, if you need the index as a regular column and want to remove the column and index names:
print (df1.reset_index().rename_axis(None, axis=1).rename_axis(None))
NAME FOREVER21 GAP KROGER
0 ASHLEY 30 0 10
1 GARY 0 30 15
2 MARK 0 10 0
3 ROGER 0 0 30
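pd.crosstab is a close relative of pivot_table and produces the same shape in one call (a sketch; crosstab leaves NaN for missing combinations, hence the fillna):
df1 = pd.crosstab(df['NAME'], df['STORE'],
                  values=df['AMOUNT'], aggfunc='sum').fillna(0).astype(int)
print(df1)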
