How to calculate conditional aggregate measure for each row in dataframe? - python

I have a table like this...
Date       PlayerId  Goals
June 1     A         1
June 14    A         1
June 15    B         2
June 28    A         1
July 6th   B         0
July 17th  A         1
I would like to calculate the amount of goals a player had scored in the 30 days prior (NOT 30 games). The final results should look like...
Date       PlayerId  Goals  Goals_Prev_30
June 1     A         1      0
June 14    A         1      1
June 15    B         2      0
June 28    A         1      2
July 6th   B         0      2
July 17th  A         1      1
I created a for loop that identifies a single row in the dataframe, filters the dataframe by the characteristics of that row, calculates the sum of goals in the filtered dataframe, and appends it to a list, which is finally assigned to the Goals_Prev_30 column. The code looks like...
goals_prev_30 = []
for i in range(len(df)):
    row = df.iloc[i]
    filtered_df = df[(df['Date'] < row['Date'])
                     & (df['Date'] >= row['Date'] - pd.to_timedelta(30, unit='d'))
                     & (df['PlayerId'] == row['PlayerId'])]
    total = filtered_df['Goals'].sum()
    goals_prev_30.append(total)
df['Goals_Prev_30'] = goals_prev_30
This solution works, but it's slow: it processes around 30 rows a second. That isn't viable, as I have multiple similar measures and over 1.2M rows, which means it will take around 11 hours per measure to complete.
How can this problem be solved in a more efficient manner?

I changed your solution to a custom function applied per group, with a mask created by broadcasting, summing the values of the Goals column where the mask matches:
#if necessary
#df['Date'] = pd.to_datetime(df['Date'], format='%B %d')
def f(x):
    d1 = x['Date']
    d2 = d1 - pd.to_timedelta(30, unit='d')
    a1 = d1.to_numpy()
    a2 = d2.to_numpy()
    m = (a1 < a1[:, None]) & (a1 >= a2[:, None])
    x['Goals_Prev_30'] = np.where(m, x['Goals'], 0).sum(axis=1)
    return x

df = df.groupby('PlayerId').apply(f)
print (df)
Date PlayerId Goals Goals_Prev_30
0 1900-06-01 A 1 0
1 1900-06-14 A 1 1
2 1900-06-15 B 2 0
3 1900-06-28 A 1 2
4 1900-07-06 B 0 2
5 1900-07-17 A 1 1
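An alternative sketch, not from the original answer, assuming Date is already a proper datetime column: a time-based rolling window with closed='left' sums the goals in the 30 days strictly before each row, per player, without building an n x n mask.

df = df.sort_values(['PlayerId', 'Date'])
rolled = (df.set_index('Date')
            .groupby('PlayerId')['Goals']
            .rolling('30D', closed='left')   # window [t - 30 days, t): excludes the current row
            .sum())
# rolled is ordered by (PlayerId, Date), which matches the sorted df row order
df['Goals_Prev_30'] = rolled.fillna(0).astype(int).to_numpy()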

Related

Transition count within a column from one value to another value in Pandas

I have the below dataframe.
df = pd.DataFrame({'Player': [1,1,1,1,2,2,2,3,3,3,4,5], "Team": ['X','X','X','Y','X','X','Y','X','X','Y','X','Y'],'Month': [1,1,1,2,1,1,2,2,2,3,4,5]})
Input:
Player Team Month
0 1 X 1
1 1 X 1
2 1 X 1
3 1 Y 2
4 2 X 1
5 2 X 1
6 2 Y 2
7 3 X 2
8 3 X 2
9 3 Y 3
10 4 X 4
11 5 Y 5
The data frame consists of Players, the team they belong to and the month. You can have multiple entries for the same player on a given month. Some players move from Team X to Team Y on a particular month, some don’t move at all and some directly join Team Y.
I am looking for the total count of people who moved from Team X to Team Y on a given month and the output should be like below. i.e the month of transition and total count of transitions. In this case, Players 1,2 moved on Month-2 and Player-3 moved on Month-3. Players 4 and 5 didn't move.
Expected Output:
Month Count
0 2 2
1 3 1
I am able to get this done in the below fashion.
###find all the people who moved from Team X to Y###
s1 = df.drop_duplicates(['Team','Player'])
s2 = s1.groupby('Player').size().reset_index(name='counts')
s2 = s2[s2['counts']>1]
####Tie them to the original df so that I can find the month in which they moved###
s3 = s1.groupby("Player").last().reset_index()
s4 = s3[s3['Player'].isin(s2['Player'])]
s5 = s4.groupby('Month').size().reset_index(name='Count')
I am pretty sure there is a better way than what I did here. Just looking for some help to make it more efficient.
First pick out the entries which (1) change team but (2) are not the first row of a player, then compute the size grouped by month.
mask = df["Team"].shift().ne(df["Team"]) & df["Player"].shift().eq(df["Player"])
out = df[mask].groupby("Month").size()
Output:
print(out) # a Series
Month
2 2
3 1
dtype: int64
# series to dataframe (optional)
out.to_frame(name="count").reset_index()
Month count
0 2 2
1 3 1
Edit: the first groupby in mask is redundant so removed.
An option is to self merge on Player, Month and check for the players that move:
s = df.drop_duplicates()
t = (s.merge(s.assign(Month=s.Month+1), on=['Player', 'Month'], how='right')
.assign(Count=lambda x: x.Team_x.eq('Y') & x.Team_y.eq('X'))
.groupby('Month', as_index=False)['Count'].sum()
)
print(t.loc[t['Count'] != 0])
Output:
Month Count
0 2 2
1 3 1
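A further sketch, not taken from the answers above and assuming a player only ever moves from X to Y (as in the question): keep the players seen on both teams and count the month of each such player's first 'Y' entry.

movers = df.groupby('Player')['Team'].nunique().gt(1)
movers = movers[movers].index                      # players seen on both teams
first_y = (df[df['Player'].isin(movers) & df['Team'].eq('Y')]
           .groupby('Player')['Month'].min())      # month of the move per player
out = (first_y.value_counts().sort_index()
       .rename_axis('Month').reset_index(name='Count'))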

A way to iterate through rows and columns (in a panda data frame), select rows and columns based on conditions to put into panda another data frame

I have a data frame with over 1500 rows
a sample of the table is like so
Site 2019 2020 2021 ....
ABC 0 1 2
DEF 1 1 2
GHI 2 0 1
JKL 0 0 0
MNO 2 1 1
I want to create a new dataframe which only selects sites and years if they have:
a value in 2019
if 2019 has a value greater than or equal to the value in the next years
if there is a greater value in the next year, then use the value of the previous year
if the next year has a value less than the previous year
so the output for the example would be
Site 2019 2020 2021 ....
DEF 1 1 1
GHI 2
MNO 2 1 1
DEF has got a 1 in 2021 because there is a 1 in 2020
I tried to use the following to find the rows with values in the 2019 column but
for i.j in df.iterrows():
if when j=2
if i >0
return value
but I get syntax errors
Without looping over the rows you can do:
df1 = df[(df[2019] > 0) & (df.loc[:, 2020:].min(axis=1) <= df.loc[:, 2019])]
cols = df1.columns.tolist()
for i in range(2, len(cols)):
df1[cols[i]] = df1.loc[:, cols[i - 1: i + 1]].min(axis=1)
df1
Output:
2019 2020 2021
DEF 1 1 1
GHI 2 0 0
MNO 2 1 1
This should work as long as you don't have too many columns. Add another comparison for each set of years that need to be compared. This will be a reference to the original df unless you use .copy() to make a deep copy.
new_df = df[(df['2019'] > 0) & (df['2019'] <= df['2020']) & (df['2020'] <= df['2021']) & (df['2021'] <= df['2022'])]
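A hypothetical generalization of the chained comparisons above, so you don't have to spell out one clause per pair of years (year_cols and the string column names are assumptions to adapt to your data):

year_cols = ['2019', '2020', '2021', '2022']     # adjust to your columns
non_decreasing = df[year_cols].diff(axis=1).iloc[:, 1:].ge(0).all(axis=1)
new_df = df[(df['2019'] > 0) & non_decreasing].copy()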

python: extract rows that a column value is more than 3

There is a dataframe like the following:
id year number
1 2016 3
1 2017 5
2 2016 1
2 2017 5
...
I want to extract, grouping by id, the rows where the value of the number column is at least 3 in both 2016 and 2017.
For example, for the first 4 rows above, the result is:
id year number
1 2016 3
1 2017 5
Thanks!
Compare by >= 3 and use GroupBy.transform('all') to get a Series the same size as the original, so you can filter by boolean indexing:
df1 = df[(df["number"] >= 3).groupby(df["id"]).transform('all')]
#alternative for reassign mask to column
#df = df[df.assign(number= df["number"] >= 3).groupby("id")['number'].transform('all')]
print (df1)
id year number
0 1 2016 3
1 1 2017 5
Or use GroupBy.filter, but it can be slow with a large DataFrame or many groups:
df1 = df.groupby("id").filter(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years = df.groupby("id").apply(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years
id
1 True
2 False
dtype: bool
>>> df.loc[lambda x: x["id"].map(great_in_both_years)]
id year number
0 1 2016 3
1 1 2017 5
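Another equivalent sketch: since the condition has to hold for every row of a group, it is enough that the group's minimum is at least 3.

df1 = df[df.groupby('id')['number'].transform('min') >= 3]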

Calculating Spatial Distance returns operand error

It is a follow-up question to my previous question:
I have a dataframe like this
Company_id year dummy_1 dummy_2 dummy_3 dummy_4 dummy_5
1 1990 1 0 1 1 1
1 1991 0 0 1 1 0
1 1992 0 0 1 1 0
1 1993 1 0 1 1 0
1 1994 0 1 1 1 0
1 1995 0 0 1 1 0
1 1996 0 0 1 1 1
I created a numpy array by:
df = df.assign(vector = df.iloc[:, -5:].values.tolist())
df['vector'] = df['vector'].apply(np.array)
I want to compare a company's distinctiveness in terms of its strategic practices relative to its rivals in the last 5 years. Here is the code that I use:
df.sort_values('year', ascending=False)
# These will be our lists of differences.
diffs = []
# Loop over all unique dates
for date in df.year.unique():
    # Only take dates earlier than the current date.
    compare_df = df.loc[df.year - date <= 5].copy()
    # Loop over each company for this date
    for row in df.loc[df.year == date].itertuples():
        # If no data available use nans.
        if compare_df.empty:
            diffs.append(float('nan'))
        # Calculate cosine and fill in otherwise
        else:
            compare_df['distinctivness'] = spatial.distance.cosine(np.array(compare_df.vector), np.array(row.vector))
            row_of_interest = compare_df.distinctivness.mean()
            diffs.append(row_of_interest.distinctivness.values[0])
However, I get
compare_df['distinctivness'] = spatial.distance.cosine(np.array(compare_df.vector) - np.array(row.vector))
ValueError: operands could not be broadcast together with shapes (29254,) (93,)
How could I get rid of this problem?
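No answer is recorded here, but as a hedged sketch of one possible direction: scipy.spatial.distance.cosine expects two 1-D vectors, so passing the whole vector column in one call produces the shape mismatch. scipy.spatial.distance.cdist can compute the cosine distance of every stored vector against the single row vector inside the loop:

import numpy as np
from scipy.spatial import distance

# Sketch assuming compare_df['vector'] holds equal-length 1-D arrays (the five
# dummies) and row.vector is one such array; this would replace the cosine() call above.
stacked = np.vstack(compare_df['vector'].to_list())               # shape (n, 5)
compare_df['distinctivness'] = distance.cdist(
    stacked, row.vector.reshape(1, -1), metric='cosine').ravel()  # one distance per row
diffs.append(compare_df['distinctivness'].mean())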

Count number of rows for each ID within 1 year

I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need between with boolean indexing to filter first, and then groupby with size for aggregation.
The outputs are concatenated, and reindex is added to fill missing rows with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
#convert to datetime (if first number is day, add parameter dayfirst)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()
print (now)
oneyarbeforenow = now - pd.offsets.DateOffset(years=1)
oneyarafternow = now + pd.offsets.DateOffset(years=1)
#first filter
a = df[df['Date'].between(oneyarbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyarafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
EDIT:
If you need to compare each date against the group's last date plus or minus a one-year offset, you need a custom function with a condition, summing the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(), \
(x < x.iat[-1] + offs).sum()], index=['last','next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of missing in-between years; see year 2015.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only year in columns
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
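A per-row sketch closer to what the question literally asks (counts within the last / next year relative to each row's own date, not relative to today); this is an assumption, not taken from the answers above, and it presumes Date has already been converted to datetime:

import numpy as np

def count_within_year(x):
    d = x['Date'].to_numpy()
    year = np.timedelta64(365, 'D')
    # broadcast comparison: element (i, j) asks whether row j falls in row i's window
    last = ((d < d[:, None]) & (d >= d[:, None] - year)).sum(axis=1)
    nxt  = ((d > d[:, None]) & (d <= d[:, None] + year)).sum(axis=1)
    return pd.DataFrame({'last_year': last, 'next_year': nxt}, index=x.index)

counts = df.groupby('ID', group_keys=False).apply(count_within_year)
df = df.join(counts)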
