Is there a faster method to do a Pandas groupby cumulative mean? - python

I am trying to create a lookup reference table in Python that calculates the cumulative mean of a Player's previous (by datetime) games scores, grouped by venue. However, for my specific need, a player should have previously played a minimum of 2 times at the relevant Venue for a 'Venue Preference' cumulative mean calculation.
df format looks like the following:
DateTime             Player  Venue      Score
2021-09-25 17:15:00  Tim     Stadium A  20
2021-09-27 10:00:00  Blake   Stadium B  30
My existing code, which works perfectly but is unfortunately very slow, is as follows:
import numpy as np
import pandas as pd
VenueSum = pd.DataFrame(df.groupby(['DateTime', 'Player', 'Venue'])['Score'].sum().reset_index(name = 'Sum'))
VenueSum['Cumulative Sum'] = VenueSum.sort_values('DateTime').groupby(['Player', 'Venue'])['Sum'].cumsum()
VenueCount = pd.DataFrame(df.groupby(['DateTime', 'Player', 'Venue'])['Score'].count().reset_index(name = 'Count'))
VenueCount['Cumulative Count'] = VenueCount.sort_values('DateTime').groupby(['Player', 'Venue'])['Count'].cumsum()
VenueLookup = VenueSum.merge(VenueCount, how = 'outer', on = ['DateTime', 'Player', 'Venue'])
VenueLookup['Venue Preference'] = np.where(VenueLookup['Cumulative Count'] >= 2, VenueLookup['Cumulative Sum'] / VenueLookup['Cumulative Count'], np.nan)
VenueLookup = VenueLookup.drop(['Sum', 'Cumulative Sum', 'Count', 'Cumulative Count'], axis = 1)
I am sure there is a way to calculate the cumulative mean in one step without first calculating the cumulative sum and cumulative count, but unfortunately I couldn't get that to work.

IIUC, you can remove the two extra groupby operations by aggregating sum and count first, and then taking the cumulative sum of both columns:
df1 = df.groupby(['DateTime', 'Player', 'Venue'])['Score'].agg(['sum','count'])
df1 = df1.groupby(['Player', 'Venue'])[['sum', 'count']].cumsum().reset_index()
df1['Venue Preference'] = np.where(df1['count'] >= 2, df1['sum'] / df1['count'], np.nan)
df1 = df1.drop(['sum', 'count'], axis=1)
print (df1)
              DateTime Player      Venue  Venue Preference
0  2021-09-25 17:15:00    Tim  Stadium A               NaN
1  2021-09-27 10:00:00  Blake  Stadium B               NaN
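As a further note, the cumulative mean can also be computed in a single step with expanding(). A minimal sketch, assuming the column names from the question and that scores are first summed per DateTime/Player/Venue:
# df is the DataFrame described in the question
per_game = (df.groupby(['DateTime', 'Player', 'Venue'], as_index=False)['Score']
              .sum()
              .sort_values('DateTime'))
# Expanding (cumulative) mean per Player/Venue; min_periods=2 leaves NaN until a
# player has at least 2 recorded games at that venue, mirroring the >= 2 rule
per_game['Venue Preference'] = (per_game.groupby(['Player', 'Venue'])['Score']
                                        .transform(lambda s: s.expanding(min_periods=2).mean()))
If a single game can contribute more than one Score row, the per-game counting here differs slightly from the cumulative row count in the original code, so adjust the pre-aggregation accordingly.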

Related

Concatenate arrays into a single table using pandas

I have a .csv file, and from this file I group the data by year so that it gives me the maximum, minimum, and average values as a result.
import pandas as pd
DF = pd.read_csv("PJME_hourly.csv")
for i in range(2002,2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW" : ['max','min','mean']})
    print(i , pd.concat([dateframe],axis=0,sort= False))
The output is as follows:
2002 PJME_MW
max 55934.000000
min 19247.000000
mean 31565.617106
2003 PJME_MW
max 53737.000000
min 19414.000000
mean 31698.758621
2004 PJME_MW
max 51962.000000
min 19543.000000
mean 32270.434867
I would like to know how I can combine it all into a single column (PJME_MW), with each group of operations (max, min, mean) identified by the year it corresponds to.
If you convert the dates with to_datetime(), you can group them using the dt.year accessor:
df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
Toy example:
df = pd.DataFrame({'Datetime': ['2019-01-01','2019-02-01','2020-01-01','2020-02-01','2021-01-01'], 'PJME_MV': [3,5,30,50,100]})
# Datetime PJME_MV
# 0 2019-01-01 3
# 1 2019-02-01 5
# 2 2020-01-01 30
# 3 2020-02-01 50
# 4 2021-01-01 100
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
# PJME_MV
# min max mean
# Datetime
# 2019 3 5 4
# 2020 30 50 40
# 2021 100 100 100
Your code could be optimized, but to keep it working the way it does now, change this part of your code:
for i in range(2002,2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW" : ['max','min','mean']})
    print(i , pd.concat([dateframe],axis=0,sort= False))
Use this instead:
aggs = ['max','min','mean']
df_group = df.groupby('Datetime')['PJME_MW'].agg(aggs).reset_index()
out_columns = ['agg_year', 'PJME_MW']
out = []
for agg in aggs:
    # build a fresh frame per aggregation so each append is independent
    aux = pd.DataFrame(columns=out_columns)
    aux['agg_year'] = agg + '_' + df_group['Datetime'].astype(str)
    aux['PJME_MW'] = df_group[agg]
    out.append(aux)
df_out = pd.concat(out)
Edit: Concatenation form has been changed
Final edit: I didn't understand the whole problem, sorry. You don't need the code after the groupby function.

How to get mean of last month in pandas

I have a data set where the first column is the Date, the second column is the Collaborator, and the third column is the price paid.
I want to get the mean price paid by every Collaborator for the previous month. I want to return a table that looks like this:
I tried solutions like rolling, but I could only get the past X days, not the past month.
Pandas has a built-in method .rolling
x = 3 # This is where you define the number of previous entries
df.rolling(x).mean() # Apply the mean
Hence:
df['LastMonthMean'] = df['Price'].rolling(x).mean()
I'm not sure exactly how you want to calculate your mean, but I hope this helps.
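As a side note, if the Date column is set as a sorted DatetimeIndex, rolling() also accepts a time-based window instead of a fixed number of rows. A minimal sketch, assuming columns named Date and Price and treating 'the previous month' as a trailing 30 days:
# Assumes 'Date' and 'Price' columns; a time-offset window needs a sorted DatetimeIndex
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').sort_index()
# Trailing 30-day mean (includes the current row; pass closed='left' to exclude it)
df['LastMonthMean'] = df['Price'].rolling('30D').mean()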
I would first add a month column and then use groupby to retrieve the mean per collaborator and month:
import pandas as pd
df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2, 2],
    'collaborator': [1, 2, 3, 1, 2, 3],
    'price': [100, 200, 300, 400, 500, 600]
})
df.groupby(['collaborator', 'month']).mean()
The rolling() method would have to be applied to the DataFrame grouped by Collaborator to obtain the mean sale price of every collaborator in the previous month.
Because the data would be grouped by and summarised, the number of data points would not match the original dataset, thus not allowing you to easily append the result to the original dataset.
If you use a DatetimeIndex in your DataFrame it will be considered a time series and then you can resample() the data more easily.
I have produced a replicable solution below, based on your initial question in which I resample the data and append the last month's mean to it. Thanks to #akilat90 for the function to generate random dates within a range.
import pandas as pd
import numpy as np

def random_dates(start, end, n=10):
    # Function copied from #akilat90
    # Available on https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
    start_u = pd.to_datetime(start).value//10**9
    end_u = pd.to_datetime(end).value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

size = 1000
index = random_dates(start='2021-01-01', end='2021-06-30', n=size).sort_values()
collaborators = np.random.randint(low=1, high=4, size=size)
prices = np.random.uniform(low=5., high=25., size=size)
data = pd.DataFrame({'Collaborator': collaborators,
                     'Price': prices}, index=index)

monthly_mean = data.groupby('Collaborator').resample('M')['Price'].mean()
data_final = pd.merge(data, monthly_mean, how='left', left_on=['Collaborator', data.index.month],
                      right_on=[monthly_mean.index.get_level_values('Collaborator'),
                                monthly_mean.index.get_level_values(1).month + 1])
data_final.index = data.index
data_final = data_final.drop('key_1', axis=1)
data_final.columns = ['Collaborator', 'Price', 'LastMonthMean']
This is the output:
Collaborator Price LastMonthMean
2021-01-31 04:26:16 2 21.838910 NaN
2021-01-31 05:33:04 2 19.164086 NaN
2021-01-31 12:32:44 2 24.949444 NaN
2021-01-31 12:58:02 2 8.907224 NaN
2021-01-31 14:43:07 1 7.446839 NaN
2021-01-31 18:38:11 3 6.565208 NaN
2021-02-01 00:08:25 2 24.520149 15.230642
2021-02-01 09:25:54 2 20.614261 15.230642
2021-02-01 09:59:48 2 10.879633 15.230642
2021-02-02 10:12:51 1 22.134549 14.180087
2021-02-02 17:22:18 2 24.469944 15.230642
As you can see, the records in January 2021, the first month in this time series, do not have a valid Last Month Mean, unlike the records in February.

Average time between timestamps per group not in order

I would like to get the mean time between timestamps per group. However, the groups are not ordered.
Code to create df:
d = {'ID': ['AI100', 'AI200', 'AI200', 'AI100','AI200','AI100'],
'Date': ['2019-01-10', '2018-06-01', '2018-06-11','2019-01-15','2018-06-21', '2019-01-22']}
data = pd.DataFrame(data=d)
data = data[['ID', 'Date']]
data['Date'] = pd.to_datetime(data['Date'])
data
ID Date
0 AI100 2019-01-10
1 AI200 2018-06-01
2 AI200 2018-06-11
3 AI100 2019-01-15
4 AI200 2018-06-21
5 AI100 2019-01-22
I tried the following:
data = data.sort_values(['ID','Date'],ascending=True).groupby('ID').head(3) #group the IDs
data['diffs'] = data['Date'].diff()
data['diffs'] = data['diffs'].apply(lambda x: x.days)
data = data.groupby(['ID'])[('diffs')].agg('mean')
However, this yields:
data.add_suffix('ID').reset_index()
ID diffs
0 AI100ID 6.000000
1 AI200ID -71.666667
The mean time for group AI100ID is correct, but not for group AI200ID.
What is going wrong?
I think the issue you're having here is that you aren't calculating your diffs within each group, so it's calculating the difference between the previous group's last value and the new group's first value.
Change your line to this and you should get the expected result:
data['diffs'] = data.groupby('ID')['Date'].diff()
Footnote:
Another tip, unrelated to the main problem, but just in case you were unaware:
data['diffs'] = data['diffs'].apply(lambda x: x.days)
Can be written to use faster vectorised operations using the .dt accessor:
data['diffs'] = data['diffs'].dt.days

getting percentage and count Python

Suppose df.bun (df is a Pandas DataFrame) has a MultiIndex (date and name), with the variable being category values written as strings:
date name values
20170331 A122630 stock-a
A123320 stock-a
A152500 stock-b
A167860 bond
A196030 stock-a
A196220 stock-a
A204420 stock-a
A204450 curncy-US
A204480 raw-material
A219900 stock-a
How can I represent the total counts within the same date, along with their percentages, to make a table like the one below for each date?
date variable counts Percentage
20170331 stock 7 70%
bond 1 10%
raw-material 1 10%
curncy 1 10%
I have tried print(df.groupby('bun').count()) as an attempt at this question, but it falls short.
cf) Before getting df.bun, I used the following code to import a nested dictionary into a Pandas DataFrame.
import numpy as np
import pandas as pd

result = pd.DataFrame()
origDict = np.load("Hannah Lee.npy")
for item in range(len(origDict)):
    newdict = {(k1, k2): v2 for k1, v1 in origDict[item].items() for k2, v2 in origDict[item][k1].items()}
    df = pd.DataFrame([newdict[i] for i in sorted(newdict)],
                      index=pd.MultiIndex.from_tuples([i for i in sorted(newdict.keys())]))
print(df.bun)
I believe you need SeriesGroupBy.value_counts:
g = df.groupby('date')['values']
df = pd.concat([g.value_counts(),
                g.value_counts(normalize=True).mul(100)], axis=1, keys=('counts','percentage'))
print (df)
counts percentage
date values
20170331 stock-a 6 60.0
bond 1 10.0
curncy-US 1 10.0
raw-material 1 10.0
stock-b 1 10.0
Another solution: use size for the counts and then divide by a new Series created by transform and sum:
df2 = df.reset_index().groupby(['date', 'values']).size().to_frame('count')
df2['percentage'] = df2['count'].div(df2.groupby('date')['count'].transform('sum')).mul(100)
print (df2)
count percentage
date values
20170331 bond 1 10.0
curncy-US 1 10.0
raw-material 1 10.0
stock-a 6 60.0
stock-b 1 10.0
The difference between the solutions is that the first sorts by counts within each group, while the second sorts the MultiIndex.
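If you want the first result in the same index order as the second, a small sketch (assuming the value_counts result from the first solution is stored in df):
# Sort the (date, values) MultiIndex lexicographically instead of by count
df = df.sort_index()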

Using Python's Pandas to find average values by bins

I just started using pandas to analyze groundwater well data over time.
My data in a text file looks like (site_no, date, well_level):
485438103132901 19800417 -7.1
485438103132901 19800506 -6.8
483622101085001 19790910 -6.7
485438103132901 19790731 -6.2
483845101112801 19801111 -5.37
484123101124601 19801111 -5.3
485438103132901 19770706 -4.98
I would like an output with average well levels binned by 5 year increments and with a count:
site_no avg 1960-end1964 count avg 1965-end1969 count avg 1970-end1974 count
I am reading in the data with:
names = ['site_no','date','wtr_lvl']
df = pd.read_csv('D:\info.txt', sep='\t',names=names)
I can find the overall average by site with:
avg = df.groupby(['site_no'])['wtr_lvl'].mean().reset_index()
My crude bin attempts use:
a1 = df[df.date > 19600000]
a2 = a1[a1.date < 19650000]
avga2 = a2.groupby(['site_no'])['wtr_lvl'].mean()
My question: how can I join the results to display as desired? I tried merge, join, and append, but they do not allow for empty data frames (which happens). Also, I am sure there is a simple way to bin the data by the dates. Thanks.
The most concise way is probably to convert this to time series data and then downsample to get the means:
In [75]:
print df
ID Level
1
1980-04-17 485438103132901 -7.10
1980-05-06 485438103132901 -6.80
1979-09-10 483622101085001 -6.70
1979-07-31 485438103132901 -6.20
1980-11-11 483845101112801 -5.37
1980-11-11 484123101124601 -5.30
1977-07-06 485438103132901 -4.98
In [76]:
df.Level.resample('60M', how='mean')
#also may consider different time alias: '5A', '5BA', '5AS', etc:
#see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Out[76]:
1
1977-07-31 -4.980
1982-07-31 -6.245
Freq: 60M, Name: Level, dtype: float64
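As a side note, the how= argument has since been removed from resample() in newer pandas versions; a sketch of the equivalent call on the same DataFrame:
# Modern pandas: chain the aggregation method instead of passing how=
# (very recent versions prefer the month-end alias 'ME', i.e. '60ME')
df.Level.resample('60M').mean()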
Alternatively, you may use groupby together with cut:
In [99]:
print df.groupby(pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)).mean()
ID Level
[1960, 1965] NaN NaN
(1965, 1970] NaN NaN
(1970, 1975] NaN NaN
(1975, 1980] 4.847632e+14 -6.064286
And by ID also:
In [100]:
print df.groupby(['ID',
pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)]).mean()
Level
ID
483622101085001 (1975, 1980] -6.70
483845101112801 (1975, 1980] -5.37
484123101124601 (1975, 1980] -5.30
485438103132901 (1975, 1980] -6.27
So what I like to do is create a separate column with the rounded bin number:
bin_width = 50000
mult = 1. / bin_width
# ser is the Series you want to bin (here presumably df['date'] in yyyymmdd integer form)
df['bin'] = np.floor(ser * mult + .5) / mult
Then, just group by the bins themselves:
df.groupby('bin').mean()
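Putting that together for this question, a minimal sketch (assuming df as read in the question, with integer yyyymmdd dates, so a bin width of 50000 is roughly a 5-year step):
import numpy as np

bin_width = 50000                                      # ~5 years in yyyymmdd integer form
mult = 1. / bin_width
df['bin'] = np.floor(df['date'] * mult + .5) / mult    # round each date to the nearest bin
# mean water level and observation count per site and per 5-year bin
result = df.groupby(['site_no', 'bin'])['wtr_lvl'].agg(['mean', 'count'])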
Another note: you can do multiple truth evaluations in one go:
df[(df.date > a) & (df.date < b)]
