I am attempting to drop duplicates where the value of a specific column of the duplicated row is zero.
Name Division Clients
0 Dave Sales 0
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
5 Dan HR 0
The output I'm hoping to achieve is seen below
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
Any assistance anyone could provide would be greatly appreciated.
You can check where Clients == 0, flag all duplicates based on Name and Division, combine the two conditions with &, invert the result, and use it as a boolean mask:
c = df['Clients'].eq(0)
df[~(df.duplicated(['Name','Division'],keep=False) & c)]
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
Thanks to Seabean, consider the following df:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used to add the extra row
df1 = pd.concat([df, pd.DataFrame([['Dave','HR',0]], columns=df.columns)], ignore_index=True)
print(df1)
Name Division Clients
0 Dave Sales 0
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
5 Dan HR 0
6 Dave HR 0
c = df1['Clients'].eq(0)
print(df1[~(df1.duplicated(['Name','Division'],keep=False) & c)])
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
6 Dave HR 0
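One edge case the mask above does not cover: if a Name/Division pair appears more than once and every one of its rows has Clients equal to 0, all of those rows are dropped. A minimal sketch of one way to handle that, assuming a zero row should only be removed when the same pair also has a non-zero row (the two "Eve" rows are hypothetical data added for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Dave", "Dave", "Karen", "Rachel", "Dan", "Dan", "Eve", "Eve"],
    "Division": ["Sales", "Sales", "Sales", "HR", "HR", "HR", "HR", "HR"],
    "Clients": [0, 15, 10, 20, 45, 0, 0, 0],  # the Eve rows are hypothetical
})

# True for every row whose (Name, Division) pair has at least one non-zero Clients value
has_nonzero = (
    df["Clients"].ne(0)
    .groupby([df["Name"], df["Division"]])
    .transform("any")
)

# Drop a zero row only when its pair also has a non-zero row
result = df[~(df["Clients"].eq(0) & has_nonzero)]
print(result)
```

With this variant the two all-zero Eve rows are kept; whether that is the desired behaviour depends on the data.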
It depends on how your data is organized. If you're reading it in from a CSV, you could do something like this:
# Get the data:
data = pd.read_csv("employees.csv")
# Sort by Clients so that, for each name, the zero row comes first:
data.sort_values("Clients", inplace=True)
# Drop duplicates based on Name, keeping the last row of each (the one with the most Clients).
# keep=False would remove every duplicated row, including the non-zero one.
data.drop_duplicates(subset="Name", keep="last", inplace=True)
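A minimal in-memory sketch of the sort-then-deduplicate idea, assuming the non-zero row of each pair should survive. After an ascending sort the zero row comes first, so keep="last" retains the row with the larger Clients value (keep=False would instead discard every duplicated row). Deduplicating on both Name and Division, rather than Name alone, is an assumption taken from the question:

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Dave", "Dave", "Karen", "Rachel", "Dan", "Dan"],
    "Division": ["Sales", "Sales", "Sales", "HR", "HR", "HR"],
    "Clients": [0, 15, 10, 20, 45, 0],
})

deduped = (
    data.sort_values("Clients")                                  # zeros sort first
        .drop_duplicates(subset=["Name", "Division"], keep="last")  # keep the largest Clients per pair
        .sort_index()                                            # restore the original row order
)
print(deduped)
```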
I have a pandas dataframe with information about students and test dates. I would like to create a variable that takes on a new value for each student, but also takes on a new value for the same student if 5 years have passed without a test attempt. The desired column is "group" below. How can I do this in python?
Student test_date group
Bob 1995 1
Bob 1997 1
Bob 2020 2
Bob 2020 2
Mary 2020 3
Mary 2021 3
Mary 2021 3
My initial, very clunky idea was to sort by name and then by date, calculate the difference between consecutive dates, set an indicator where the difference is greater than 5, and then somehow number the groups.
ds = pd.read_excel('../students.xlsx')
ds = ds.sort_values(by=['Student','test_date'])
ds['time'] = ds['test_date'].diff()
ds['break'] = 0
ds.loc[(ds['time'] > 5),'break'] = 1
Student test_date time break
Bob 1995 na na
Bob 1997 2 0
Bob 2020 23 1
Bob 2020 0 0
Mary 2020 na na
Mary 2021 1 0
Mary 2021 0 0
df = df.sort_values(["Student", "test_date"])
((df.Student != df.Student.shift()) | (df.test_date.diff().gt(5))).cumsum()
# 0 1
# 1 1
# 2 2
# 3 2
# 4 3
# 5 3
# 6 3
# dtype: int32
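A self-contained sketch of the same cumulative-sum idea, using the sample rows from the question. Taking the difference per student (via groupby) is a small variation that makes explicit that gaps are never measured across two different students. The column names follow the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Bob", "Bob", "Bob", "Bob", "Mary", "Mary", "Mary"],
    "test_date": [1995, 1997, 2020, 2020, 2020, 2021, 2021],
})

df = df.sort_values(["Student", "test_date"]).reset_index(drop=True)

# A new group starts when the student changes ...
new_student = df["Student"].ne(df["Student"].shift())
# ... or when more than 5 years have passed since that student's previous test
long_gap = df.groupby("Student")["test_date"].diff().gt(5)

# Each True marks a group boundary; the running total numbers the groups
df["group"] = (new_student | long_gap).cumsum()
print(df)
```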
I have a dataframe df:
Name  Place   Price
Bob   NY      15
Jack  London  27
John  Paris   5
Bill  Sydney  3
Bob   NY      39
Jack  London  9
Bob   NY      2
Dave  NY      7
I need to assign an incremental value (from 1 to N) for each row which has the same name and place (price can be different).
df_out:
Name  Place   Price  Value
Bob   NY      15     1
Jack  London  27     1
John  Paris   5      1
Bill  Sydney  3      1
Bob   NY      39     2
Jack  London  9      2
Bob   NY      2      3
Dave  NY      7      1
I could do this by sorting the dataframe (on Name and Place) and then iteratively checking if they match between two consecutive rows. Is there a smarter/faster pandas way to do this?
You can use a grouped (on Name, Place) cumulative count and add 1 as it starts from 0:
df['Value'] = df.groupby(['Name','Place']).cumcount().add(1)
prints:
Name Place Price Value
0 Bob NY 15 1
1 Jack London 27 1
2 John Paris 5 1
3 Bill Sydney 3 1
4 Bob NY 39 2
5 Jack London 9 2
6 Bob NY 2 3
7 Dave NY 7 1
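A self-contained sketch of the grouped cumulative count, built from the sample rows in the question. cumcount numbers the rows of each group in their existing row order, so no sorting is needed:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "Jack", "John", "Bill", "Bob", "Jack", "Bob", "Dave"],
    "Place": ["NY", "London", "Paris", "Sydney", "NY", "London", "NY", "NY"],
    "Price": [15, 27, 5, 3, 39, 9, 2, 7],
})

# cumcount() is zero-based within each (Name, Place) group, so add 1
df["Value"] = df.groupby(["Name", "Place"]).cumcount() + 1
print(df)
```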
Suppose I have two dataframes:
df1:
Person Number Type
0 Kyle 12 Male
1 Jacob 15 Male
2 Jacob 15 Male
df2:
A much larger dataset with similar format except there is a count column that needs to increment based on df1
Person Number Type Count
0 Kyle 12 Male 0
1 Jacob 15 Male 0
3 Sally 43 Female 0
4 Mary 15 Female 5
What I am looking to do is increase the count column based on the number of occurrences of the same person in df1
Expected output for this example:
Person Number Type Count
0 Kyle 12 Male 1
1 Jacob 15 Male 2
3 Sally 43 Female 0
4 Mary 15 Female 5
Increase the count to 1 for Kyle because there is one instance of Kyle in df1, and to 2 for Jacob because there are two instances. Leave the values for Sally and Mary unchanged.
How do I do this? I have tried using .loc but I can't figure out how to account for two instances of the same row. Meaning that I can only get count to increase by one for Jacob even though there are two Jacobs in df1.
I have tried
df2.loc[df2['Person'].values == df1['Person'].values, 'Count'] += 1
However this does not account for duplicates.
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df1 = df1.set_index(['Person','Number','Type'])
df2 = df2.set_index(['Person','Number','Type'])
df1.add(df2, fill_value=0).reset_index()
Or
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df2.merge(df1, on=['Person','Number','Type'], how='left').set_index(['Person','Number','Type']).sum(axis=1).to_frame('Count').reset_index()
value_counts + Index alignment.
u = df2.set_index("Person")
u.assign(Count=df1["Person"].value_counts().add(u["Count"], fill_value=0))
Number Type Count
Person
Kyle 12 Male 1.0
Jacob 15 Male 2.0
Sally 43 Female 0.0
Mary 15 Female 5.0
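The index-alignment result above comes back as floats because of the fill value. A minimal sketch of a variant that keeps Count as an integer column, assuming (as in the value_counts answer) that rows are matched on Person alone:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Person": ["Kyle", "Jacob", "Jacob"],
    "Number": [12, 15, 15],
    "Type": ["Male", "Male", "Male"],
})
df2 = pd.DataFrame({
    "Person": ["Kyle", "Jacob", "Sally", "Mary"],
    "Number": [12, 15, 43, 15],
    "Type": ["Male", "Male", "Female", "Female"],
    "Count": [0, 0, 0, 5],
})

# Number of occurrences of each person in df1
counts = df1["Person"].value_counts()

# Look up each person's count; people absent from df1 get an increment of 0
increment = df2["Person"].map(counts).fillna(0).astype(int)
df2["Count"] = df2["Count"] + increment
print(df2)
```

If two different people can share a name, matching on Person, Number and Type together (as in the groupby answers) would be safer.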
Consider the following df
data = {'Name' : ['John','John','Lucy','Lucy','Lucy'],
'Payroll' : [15,15,75,75,75],
'Week' : [1,2,1,2,3]}
df = pd.DataFrame(data)
Name Payroll Week
0 John 15 1
1 John 15 2
2 Lucy 75 1
3 Lucy 75 2
4 Lucy 75 3
What I'm attempting to do is apply a Boolean across a DataFrame very similar to this one, with 2m+ rows and 20+ columns, to find out when someone started.
To find if someone is active or not I pass a condition to another df:
df2 = df.loc[df.Week == df.Week.max()]
This gives me the final week. I then use an isin filter to find out whether the person is active or has left:
df['Status'] = np.where(df['Payroll'].isin(df2['Payroll']), 'Active','Leaver')
Using the above code I get the following, which tells me that since John is not in the latest week, he has left the company:
Name Payroll Week Status
0 John 15 1 Leaver
1 John 15 2 Leaver
2 Lucy 75 1 Active
3 Lucy 75 2 Active
4 Lucy 75 3 Active
What I'm trying to achieve is to know when John started with us. I could try a mask for each week of the year and an isin to check when each person first appeared, but I figured there must be a more Pythonic way to do this!
Desired output:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Leaver
2 Lucy 75 1 Starter
3 Lucy 75 2 Active
4 Lucy 75 3 Active
Any help is much appreciated.
Edit for Clarity :
data = {'Name' : ['John','John','John','John','Lucy','Lucy','Lucy','Lucy','Lucy'],
'Payroll' : [15,15,15,15,75,75,75,75,75],
'Week' : [1,2,3,4,1,2,3,4,5]}
df = pd.DataFrame(data)
desired output:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Active
2 John 15 3 Active
3 John 15 4 Leaver
4 Lucy 75 1 Starter
5 Lucy 75 2 Active
6 Lucy 75 3 Active
7 Lucy 75 4 Active
8 Lucy 75 5 Active
things to note:
Max week is 5 so anyone not in week 5 is a leaver
first week of person in df makes them a starter.
all weeks in between are set to Active.
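Use numpy.select with new conditions built from duplicated (first occurrence of a Payroll is the starter row, last occurrence is the leaver row if that Payroll is missing from the final week):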
a = df.loc[df.Week == df.Week.max(), 'Payroll']
m1 = ~df['Payroll'].isin(a)
m2 = ~df['Payroll'].duplicated()
m3 = ~df['Payroll'].duplicated(keep='last')
df['Status'] = np.select([m2, m1 & m3], ['Starter', 'Leaver'], 'Active')
print (df)
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Active
2 John 15 3 Active
3 John 15 4 Leaver
4 Lucy 75 1 Starter
5 Lucy 75 2 Active
6 Lucy 75 3 Active
7 Lucy 75 4 Active
8 Lucy 75 5 Active
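A self-contained sketch of a variant that compares each row's Week with the person's first and last week instead of using duplicated, so it does not rely on the rows being sorted by week. It assumes, as in the question, that Payroll identifies a person and that anyone whose last week is before the overall maximum week is a leaver:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["John"] * 4 + ["Lucy"] * 5,
    "Payroll": [15] * 4 + [75] * 5,
    "Week": [1, 2, 3, 4, 1, 2, 3, 4, 5],
})

weeks = df.groupby("Payroll")["Week"]
first_week = weeks.transform("min")  # each person's first week, repeated on every row
last_week = weeks.transform("max")   # each person's last week, repeated on every row

is_starter = df["Week"].eq(first_week)
# A leaver row is the person's last week, provided that week is before the overall final week
is_leaver = df["Week"].eq(last_week) & last_week.lt(df["Week"].max())

# Conditions are checked in order; rows matching neither are 'Active'
df["Status"] = np.select([is_starter, is_leaver], ["Starter", "Leaver"], default="Active")
print(df)
```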
The simplest way that I have come across is to use groupby and find the minimal index for each name in its group:
for _, dfg in df.groupby(df['Name']):
    # smallest index label in this person's group, i.e. their first row
    gidx = min(dfg.index)
    df.loc[df.index == gidx,'Status'] = 'Starter'
print(df)
And the df is then:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Leaver
2 Lucy 75 1 Starter
3 Lucy 75 2 Active
4 Lucy 75 3 Active
I have my data in a pandas dataframe
out[1]:
NAME STORE AMOUNT
0 GARY GAP 20
1 GARY GAP 10
2 GARY KROGER 15
3 ASHLEY FOREVER21 30
4 ASHLEY KROGER 10
5 MARK GAP 10
6 ROGER KROGER 30
I'm trying to group by name and sum the total amount each person spent, while also generating a column for each unique store in the dataframe.
Desired:
out[1]:
NAME GAP KROGER FOREVER21
0 GARY 30 15 0
1 ASHLEY 0 10 30
2 MARK 10 0 0
3 ROGER 0 30 0
Thanks for your help!
You need pivot_table:
df1 = df.pivot_table(index='NAME',
columns='STORE',
values='AMOUNT',
aggfunc='sum',
fill_value=0)
print (df1)
STORE FOREVER21 GAP KROGER
NAME
ASHLEY 30 0 10
GARY 0 30 15
MARK 0 10 0
ROGER 0 0 30
Alternative solution with aggregating by groupby and sum:
df1 = df.groupby(['NAME','STORE'])['AMOUNT'].sum().unstack(fill_value=0)
print (df1)
STORE FOREVER21 GAP KROGER
NAME
ASHLEY 30 0 10
GARY 0 30 15
MARK 0 10 0
ROGER 0 0 30
Finally, if you need the index values as a column and want to remove the column and index names:
print (df1.reset_index().rename_axis(None, axis=1).rename_axis(None))
NAME FOREVER21 GAP KROGER
0 ASHLEY 30 0 10
1 GARY 0 30 15
2 MARK 0 10 0
3 ROGER 0 0 30
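pivot_table sorts the names and stores alphabetically, whereas the desired output in the question lists them in order of first appearance. A minimal sketch that restores that order with reindex, assuming order of first appearance is what is wanted (the variable name wide is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "NAME": ["GARY", "GARY", "GARY", "ASHLEY", "ASHLEY", "MARK", "ROGER"],
    "STORE": ["GAP", "GAP", "KROGER", "FOREVER21", "KROGER", "GAP", "KROGER"],
    "AMOUNT": [20, 10, 15, 30, 10, 10, 30],
})

wide = df.pivot_table(index="NAME", columns="STORE", values="AMOUNT",
                      aggfunc="sum", fill_value=0)

# Reorder rows and columns by first appearance in the original frame
wide = wide.reindex(index=df["NAME"].unique(), columns=df["STORE"].unique())

# Turn NAME back into a regular column and drop the 'STORE' label on the columns
wide = wide.rename_axis("NAME").reset_index()
wide.columns.name = None
print(wide)
```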