Numbering occurrences within groups in python

I have a pandas dataframe with information about students and test dates. I would like to create a variable that takes on a new value for each student, but also takes on a new value for the same student if 5 years have passed without a test attempt. The desired column is "group" below. How can I do this in python?
Student  test_date  group
Bob      1995       1
Bob      1997       1
Bob      2020       2
Bob      2020       2
Mary     2020       3
Mary     2021       3
Mary     2021       3
The initial, very clunky idea I had was to sort by name and date, calculate the difference between consecutive dates, set an indicator where the difference exceeds 5, and then somehow number the groups:
ds = pd.read_excel('../students.xlsx')
ds = ds.sort_values(by=['student','test_date'])
ds['time'] = ds['test_date'].diff()
ds['break'] = 0
ds.loc[(ds['time'] > 5),'break'] = 1
Student  test_date  time  break
Bob      1995       NaN   NaN
Bob      1997       2     0
Bob      2020       23    1
Bob      2020       0     0
Mary     2020       NaN   NaN
Mary     2021       1     0
Mary     2021       0     0

Sort first, then flag a new group wherever the student changes or more than 5 years have passed since that student's previous test, and number the groups with a cumulative sum:
df = df.sort_values(["Student", "test_date"])
((df.Student != df.Student.shift()) | (df.test_date.diff().gt(5))).cumsum()
# 0 1
# 1 1
# 2 2
# 3 2
# 4 3
# 5 3
# 6 3
# dtype: int32
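To store the result as the desired column, the same expression can be assigned directly (a minimal sketch using the question's column names):
df["group"] = (
    (df.Student != df.Student.shift()) | (df.test_date.diff().gt(5))
).cumsum()
Each True marks the first row of a new group, and the cumulative sum turns those break points into the running group numbers 1, 2, 3.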

Related

Assign values (1 to N) for similar rows in a dataframe Pandas [duplicate]

This question already has answers here: Add a sequential counter column on groups to a pandas dataframe (4 answers).
I have a dataframe df:
Name  Place   Price
Bob   NY      15
Jack  London  27
John  Paris   5
Bill  Sydney  3
Bob   NY      39
Jack  London  9
Bob   NY      2
Dave  NY      7
I need to assign an incremental value (from 1 to N) for each row which has the same name and place (price can be different).
df_out:
Name  Place   Price  Value
Bob   NY      15     1
Jack  London  27     1
John  Paris   5      1
Bill  Sydney  3      1
Bob   NY      39     2
Jack  London  9      2
Bob   NY      2      3
Dave  NY      7      1
I could do this by sorting the dataframe (on Name and Place) and then iteratively checking if they match between two consecutive rows. Is there a smarter/faster pandas way to do this?
You can use a grouped (on Name, Place) cumulative count and add 1 as it starts from 0:
df['Value'] = df.groupby(['Name','Place']).cumcount().add(1)
prints:
Name Place Price Value
0 Bob NY 15 1
1 Jack London 27 1
2 John Paris 5 1
3 Bill Sydney 3 1
4 Bob NY 39 2
5 Jack London 9 2
6 Bob NY 2 3
7 Dave NY 7 1
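As an aside, if you wanted a single id per (Name, Place) group rather than a running count within each group, groupby(...).ngroup() gives that (a hedged variant, not what this question asks for):
df['group_id'] = df.groupby(['Name', 'Place']).ngroup().add(1)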

Drop duplicate where column value of duplicate row is zero

I am attempting to drop duplicates where the value of a specific column of the duplicated row is zero.
Name Division Clients
0 Dave Sales 0
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
5 Dan HR 0
The output I'm hoping to achieve is seen below
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
Any assistance anyone could provide would be greatly appreciated.
You can build a boolean mask: check where Clients equals 0, find all rows duplicated on Name and Division, combine the two with &, invert, and index with the result:
c = df['Clients'].eq(0)
df[~(df.duplicated(['Name','Division'],keep=False) & c)]
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
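Note that keep=False in duplicated flags every member of a duplicated set rather than only the repeats, which is why the mask catches Dave's zero row; a quick illustration:
s = pd.Series(['a', 'a', 'b'])
print(s.duplicated().tolist())            # [False, True, False] -- first 'a' not flagged
print(s.duplicated(keep=False).tolist())  # [True, True, False]  -- both 'a's flagged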
Thanks to Seabean, consider the following df (df.append has been removed from recent pandas, so pd.concat is used here):
df1 = pd.concat([df, pd.DataFrame([['Dave', 'HR', 0]], columns=df.columns)], ignore_index=True)
print(df1)
Name Division Clients
0 Dave Sales 0
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
5 Dan HR 0
6 Dave HR 0
c = df1['Clients'].eq(0)
print(df1[~(df1.duplicated(['Name','Division'],keep=False) & c)])
Name Division Clients
1 Dave Sales 15
2 Karen Sales 10
3 Rachel HR 20
4 Dan HR 45
6 Dave HR 0
It depends on how your data is organized. If you're reading in from a CSV you could do something like this (keep="last" after the ascending sort is what makes the zero rows, not the client rows, disappear):
# Get the data:
data = pd.read_csv("employees.csv")
# Sort by Clients so the zero rows come first and are the ones dropped:
data.sort_values("Clients", inplace=True)
# Keep the last (highest-Clients) row per Name/Division pair:
data.drop_duplicates(subset=["Name", "Division"], keep="last", inplace=True)
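An equivalent groupby-based sketch (assuming the same sample frame) keeps a zero row only when it is the sole row for its Name/Division pair, matching the behaviour of the accepted mask:
solo = df.groupby(['Name', 'Division'])['Clients'].transform('size').eq(1)
df[df['Clients'].ne(0) | solo]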

new dataframe using values in existing dataframe

exdf = pd.DataFrame({'Employee name': ['Alex', 'Mike'],
                     '2014.1': [5, 2], '2014.2': [3, 4], '2014.3': [3, 6],
                     '2014.4': [4, 3], '2015.1': [7, 5], '2015.2': [5, 4]})
exdf
Employee name 2014.1 2014.2 2014.3 2014.4 2015.1 2015.2
0 Alex 5 3 3 4 7 5
1 Mike 2 4 6 3 5 4
Suppose the above dataframe has several such rows and columns with output from each employee for each quarter.
I want to create a new dataframe with columns:
newdf = pd.DataFrame(columns=['Employee name', 'Year', 'Quarter', 'Output'])
So the new dataframe will have n*m rows, where n is the number of rows and m the number of quarter columns in the original dataframe.
What I have tried is filling in every row and column entry using a nested for loop.
But I'm sure there is a more efficient method.
for i in range(exdf.shape[0]):
    for j in range(exdf.shape[1]):
        newdf.iloc[?] = exdf.iloc[?]
Use DataFrame.melt with Series.str.split, then reorder the columns:
df = exdf.melt('Employee name', var_name='Year', value_name='Output')
df[['Year', 'Quarter']] = df['Year'].str.split('.', expand=True)
df = df[['Employee name','Year','Quarter','Output']]
print (df)
Employee name Year Quarter Output
0 Alex 2014 1 5
1 Mike 2014 1 2
2 Alex 2014 2 3
3 Mike 2014 2 4
4 Alex 2014 3 3
5 Mike 2014 3 6
6 Alex 2014 4 4
7 Mike 2014 4 3
8 Alex 2015 1 7
9 Mike 2015 1 5
10 Alex 2015 2 5
11 Mike 2015 2 4
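If numeric years and quarters are needed downstream, note that the split pieces are still strings; casting them is a small optional step:
df[['Year', 'Quarter']] = df[['Year', 'Quarter']].astype(int)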
Convert the columns to a MultiIndex using str.split, then stack the columns to get the final output:
# set Employee name as index
exdf = exdf.set_index('Employee name')
# convert columns to MultiIndex
exdf.columns = exdf.columns.str.split('.', expand=True)
exdf.columns = exdf.columns.set_names(['year', 'quarter'])
# stack data and give the column a name
(exdf
 .stack(['year', 'quarter'])
 .reset_index(name='output')
)
Employee name year quarter output
0 Alex 2014 1 5.0
1 Alex 2014 2 3.0
2 Alex 2014 3 3.0
3 Alex 2014 4 4.0
4 Alex 2015 1 7.0
5 Alex 2015 2 5.0
6 Mike 2014 1 2.0
7 Mike 2014 2 4.0
8 Mike 2014 3 6.0
9 Mike 2014 4 3.0
10 Mike 2015 1 5.0
11 Mike 2015 2 4.0
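The stacked values come back as floats here; if integers are wanted, the output column can be cast back (a minor optional tidy-up on the same frame):
out = exdf.stack(['year', 'quarter']).reset_index(name='output')
out['output'] = out['output'].astype(int)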
With pivot_longer the reshaping can be abstracted to a simpler form:
# pip install pyjanitor
import janitor
import pandas as pd
exdf.pivot_longer(
    index="Employee name",
    names_to=("Year", "Quarter"),
    names_sep=".",
    values_to="Output"
)
Employee name Year Quarter Output
0 Alex 2014 1 5
1 Mike 2014 1 2
2 Alex 2014 2 3
3 Mike 2014 2 4
4 Alex 2014 3 3
5 Mike 2014 3 6
6 Alex 2014 4 4
7 Mike 2014 4 3
8 Alex 2015 1 7
9 Mike 2015 1 5
10 Alex 2015 2 5
11 Mike 2015 2 4

Get order of subgroups in pandas dataframe

I have a pandas dataframe that looks something like this:
df = pd.DataFrame({'Name' : ['Kate', 'John', 'Peter','Kate', 'John', 'Peter'],'Distance' : [23,16,32,15,31,26], 'Time' : [3,5,2,7,9,4]})
df
Distance Name Time
0 23 Kate 3
1 16 John 5
2 32 Peter 2
3 15 Kate 7
4 31 John 9
5 26 Peter 4
I want to add a column that tells me, for each Name, what's the order of the times.
I want something like this:
Order Distance Name Time
0 16 John 5
1 31 John 9
0 23 Kate 3
1 15 Kate 7
0 32 Peter 2
1 26 Peter 4
I can do it using a for loop:
df2 = df[df['Name'] == 'aaa'].reset_index().reset_index() # I did this just to create an empty data frame with the columns I want
for name, row in df.groupby('Name').count().iterrows():
    table = df[df['Name'] == name].sort_values('Time').reset_index().reset_index()
    to_concat = [df2, table]
    df2 = pd.concat(to_concat)
df2.drop('index', axis = 1, inplace = True)
df2.columns = ['Order', 'Distance', 'Name', 'Time']
df2
This works, but the problem (apart from being very unpythonic) is that for large tables (my actual table has about 50 thousand rows) it takes about half an hour to run.
Can someone help me write this in a simpler way that runs faster?
I'm sorry if this has been answered somewhere, but I didn't really know how to search for it.
Use sort_values with cumcount:
df = df.sort_values(['Name','Time'])
df['Order'] = df.groupby('Name').cumcount()
print (df)
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
If need first column use insert:
df = df.sort_values(['Name','Time'])
df.insert(0, 'Order', df.groupby('Name').cumcount())
print (df)
Order Distance Name Time
1 0 16 John 5
4 1 31 John 9
0 0 23 Kate 3
3 1 15 Kate 7
2 0 32 Peter 2
5 1 26 Peter 4
In [67]: df = df.sort_values(['Name','Time']) \
             .assign(Order=df.groupby('Name').cumcount())
In [68]: df
Out[68]:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
PS I'm not sure this is the most elegant way to do this...
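If you would rather keep the frame in its original row order and only add the counter, a rank-based sketch (my variation, not from the answers above) also works:
df['Order'] = df.groupby('Name')['Time'].rank(method='first').astype(int).sub(1)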

Why am I not able to drop values within columns on pandas using python3?

I have a DataFrame (df) with various columns. In this assignment I have to find the difference between summer gold medals and winter gold medals, relative to total medals, for each country using stats about the olympics.
I must only include countries which have at least one gold medal. I am trying to use dropna() to not include those countries who do not at least have one medal. My current code:
def answer_three():
    df['medal_count'] = df['Gold'] - df['Gold.1']
    df['medal_count'].dropna()
    df['medal_dif'] = df['medal_count'] / df['Gold.2']
    df['medal_dif'].dropna()
    return df.head()

print(answer_three())
This results in the following output:
# Summer Gold Silver Bronze Total # Winter Gold.1 \
Afghanistan 13 0 0 2 2 0 0
Algeria 12 5 2 8 15 3 0
Argentina 23 18 24 28 70 18 0
Armenia 5 1 2 9 12 6 0
Australasia 2 3 4 5 12 0 0
Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 \
Afghanistan 0 0 0 13 0 0 2
Algeria 0 0 0 15 5 2 8
Argentina 0 0 0 41 18 24 28
Armenia 0 0 0 11 1 2 9
Australasia 0 0 0 2 3 4 5
Combined total ID medal_count medal_dif
Afghanistan 2 AFG 0 NaN
Algeria 15 ALG 5 1.0
Argentina 70 ARG 18 1.0
Armenia 12 ARM 1 1.0
Australasia 12 ANZ 3 1.0
I need to get rid of both the '0' values in "medal_count" and the NaN in "medal_dif".
I am also aware the maths/way I have written the code is probably incorrect to solve the question, but I think I need to start by dropping these values? Any help with any of the above is greatly appreciated.
The core problem is that dropna() returns a new object rather than modifying in place, so a bare df['medal_count'].dropna() is computed and then thrown away; you need to assign the result back. On a DataFrame you can also pass an axis: axis=0 (the default) drops rows with NaN, while axis=1 drops whole columns. Note as well that dropna() only removes NaN, so the 0 values in medal_count are untouched and need an explicit boolean filter.
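A hedged sketch of the filter-then-assign approach (column names taken from the printout above; whether the formula matches the grader is for the asker to verify):
def answer_three():
    # keep only countries with at least one gold medal (summer or winter)
    gold = df[(df['Gold'] > 0) | (df['Gold.1'] > 0)].copy()
    gold['medal_dif'] = (gold['Gold'] - gold['Gold.1']) / gold['Gold.2']
    # dropna() returns a new frame, so assign or return its result
    return gold.dropna(subset=['medal_dif'])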
