Using Python's Pandas to find average values by bins

I just started using pandas to analyze groundwater well data over time.
My data in a text file looks like (site_no, date, well_level):
485438103132901 19800417 -7.1
485438103132901 19800506 -6.8
483622101085001 19790910 -6.7
485438103132901 19790731 -6.2
483845101112801 19801111 -5.37
484123101124601 19801111 -5.3
485438103132901 19770706 -4.98
I would like an output with average well levels binned in 5-year increments, along with a count:
site_no avg 1960-end1964 count avg 1965-end1969 count avg 1970-end1974 count
I am reading in the data with:
names = ['site_no','date','wtr_lvl']
df = pd.read_csv(r'D:\info.txt', sep='\t', names=names)
I can find the overall average by site with:
avg = df.groupby(['site_no'])['wtr_lvl'].mean().reset_index()
My crude bin attempts use:
a1 = df[df.date > 19600000]
a2 = a1[a1.date < 19650000]
avga2 = a2.groupby(['site_no'])['wtr_lvl'].mean()
My question: how can I join the results to display as desired? I tried merge, join, and append, but they do not allow for empty data frames (which happens). Also, I am sure there is a simple way to bin the data by the dates. Thanks.

The most concise way is probably to convert this to time series data and then downsample to get the means:
In [75]:
print df
ID Level
1
1980-04-17 485438103132901 -7.10
1980-05-06 485438103132901 -6.80
1979-09-10 483622101085001 -6.70
1979-07-31 485438103132901 -6.20
1980-11-11 483845101112801 -5.37
1980-11-11 484123101124601 -5.30
1977-07-06 485438103132901 -4.98
In [76]:
df.Level.resample('60M', how='mean')
# you may also consider different time aliases: '5A', '5BA', '5AS', etc.
# see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Out[76]:
1
1977-07-31 -4.980
1982-07-31 -6.245
Freq: 60M, Name: Level, dtype: float64
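Note that the how= argument to resample has since been removed from pandas; on current versions the equivalent of the call above (a sketch against the same datetime-indexed df) is:
df.Level.resample('60M').mean()
# or anchored to calendar 5-year starts:
df.Level.resample('5AS').mean()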
Alternatively, you may use groupby together with cut:
In [99]:
print df.groupby(pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)).mean()
ID Level
[1960, 1965] NaN NaN
(1965, 1970] NaN NaN
(1970, 1975] NaN NaN
(1975, 1980] 4.847632e+14 -6.064286
And by ID also:
In [100]:
print df.groupby(['ID',
pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)]).mean()
Level
ID
483622101085001 (1975, 1980] -6.70
483845101112801 (1975, 1980] -5.37
484123101124601 (1975, 1980] -5.30
485438103132901 (1975, 1980] -6.27
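To get closer to the wide layout asked for in the question (one row per site, with a mean and a count per 5-year bin), the same grouping can aggregate both statistics and then be unstacked. A sketch, assuming the datetime-indexed df above; selecting only the Level column also avoids averaging the numeric ID, and empty bins simply come out as NaN columns, so no merging of empty frames is needed:
bins = pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)
wide = df.groupby(['ID', bins])['Level'].agg(['mean', 'count']).unstack()
print(wide)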

So what I like to do is create a separate column with the rounded bin number:
bin_width = 50000
mult = 1. / bin_width
df['bin'] = np.floor(ser * mult + .5) / mult  # ser is the numeric column you want to bin; requires numpy imported as np
Then, just group by the bins themselves:
df.groupby('bin').mean()
Another note: you can combine multiple boolean conditions in one go:
df[(df.date > a) & (df.date < b)]
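Applied to the original question (a sketch only, assuming the df, column names, and integer YYYYMMDD dates shown there, so that a width of 50000 acts as roughly a 5-year step):
import numpy as np
bin_width = 50000                  # ~5 years when dates are YYYYMMDD integers
mult = 1. / bin_width
df['bin'] = np.floor(df['date'] * mult + .5) / mult   # round to the nearest 5-year edge
print(df.groupby(['site_no', 'bin'])['wtr_lvl'].agg(['mean', 'count']))
Note that this rounds each date to the nearest bin edge rather than producing strict 1960-1964 style intervals; for the latter, the pd.cut approach above is the better fit.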

Related

Is there a faster method to do a Pandas groupby cumulative mean?

I am trying to create a lookup reference table in Python that calculates the cumulative mean of a Player's previous (by datetime) game scores, grouped by venue. However, for my specific need, a player should have previously played a minimum of 2 times at the relevant Venue for a 'Venue Preference' cumulative mean calculation.
df format looks like the following:
DateTime             Player  Venue      Score
2021-09-25 17:15:00  Tim     Stadium A  20
2021-09-27 10:00:00  Blake   Stadium B  30
My existing code that works perfectly, but unfortunately is very slow, is as follows:
import numpy as np
import pandas as pd
VenueSum = pd.DataFrame(df.groupby(['DateTime', 'Player', 'Venue'])['Score'].sum().reset_index(name = 'Sum'))
VenueSum['Cumulative Sum'] = VenueSum.sort_values('DateTime').groupby(['Player', 'Venue'])['Sum'].cumsum()
VenueCount = pd.DataFrame(df.groupby(['DateTime', 'Player', 'Venue'])['Score'].count().reset_index(name = 'Count'))
VenueCount['Cumulative Count'] = VenueCount.sort_values('DateTime').groupby(['Player', 'Venue'])['Count'].cumsum()
VenueLookup = VenueSum.merge(VenueCount, how = 'outer', on = ['DateTime', 'Player', 'Venue'])
VenueLookup['Venue Preference'] = np.where(VenueLookup['Cumulative Count'] >= 2, VenueLookup['Cumulative Sum'] / VenueLookup['Cumulative Count'], np.nan)
VenueLookup = VenueLookup.drop(['Sum', 'Cumulative Sum', 'Count', 'Cumulative Count'], axis = 1)
I am sure there is a way to calculate the cumulative mean in one step without first calculating the cumulative sum and cumulative count, but unfortunately I couldn't get that to work.
IIUC, you can remove two of the groupby calls by aggregating with sum and count first, and then take the cumulative sum of both columns:
df1 = df.groupby(['DateTime', 'Player', 'Venue'])['Score'].agg(['sum','count'])
df1 = df1.groupby(['Player', 'Venue'])[['sum', 'count']].cumsum().reset_index()
df1['Venue Preference'] = np.where(df1['count'] >= 2, df1['sum'] / df1['count'], np.nan)
df1 = df1.drop(['sum', 'count'], axis=1)
print (df1)
             DateTime Player      Venue  Venue Preference
0 2021-09-25 17:15:00    Tim  Stadium A               NaN
1 2021-09-27 10:00:00  Blake  Stadium B               NaN
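As a hedged aside on the "one step" point: if each (DateTime, Player, Venue) row carries a single score, as in the sample data, the running mean can also be taken directly with expanding().mean(), with the two-game minimum re-applied via cumcount. A sketch against the same df:
agg = (df.groupby(['DateTime', 'Player', 'Venue'], as_index=False)['Score'].sum()
         .sort_values('DateTime'))
agg['cum_mean'] = (agg.groupby(['Player', 'Venue'])['Score']
                      .expanding().mean()
                      .reset_index(level=[0, 1], drop=True))   # align back to agg's index
agg['games'] = agg.groupby(['Player', 'Venue']).cumcount() + 1  # games so far at this venue
agg['Venue Preference'] = agg['cum_mean'].where(agg['games'] >= 2)
agg = agg.drop(columns=['Score', 'cum_mean', 'games'])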

Python How to combine two rows into one under multiple rules

I am trying to combine many pairs of rows each time the code runs. As my example shows, for two rows to be combined, the rules are:
values in the PT, DS, and SC columns must be the same;
the time stamps in FS must be the closest pair;
the combined ID column (string) looks like ID1,ID2;
the combined WT and CB columns (numbers) are the sum();
the combined FS is the latest of the two times.
My example is:
df0 = pd.DataFrame({'ID':['1001','1002','1003','1004','2001','2002','2003','2004','3001','3002','3003','3004','4001','4002','4003','4004','5001','5002','5003','5004','6001'],
'PT':['B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','D','D','D','D','F'],
'DS':['AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAA','AAA','AAA','AAB','AAB'],
'SC':['P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P2','P2'],
'FS':['2020-10-16 00:00:00','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:04','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:05','2020-10-16 00:00:07','2020-10-16 00:00:01','2020-10-16 00:00:10','2020-10-16 00:10:00','2020-10-16 00:10:40','2020-10-16 00:00:00','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[1,2,3,4,10,11,12,13,20,21,22,23,30,31,32,33,40,41,42,43,53],
'CB':[0.1,0.2,0.3,0.4,1,1.1,1.2,1.3,2,2.1,2.2,2.3,3,3.1,3.2,3.3,4,4.1,4.2,4.3,5.3]})
After running the code once, the new dataframe df1 is:
df1 = pd.DataFrame({'ID':['1001,1002','1003,1004','2001,2002','2003,2004','3001,3002','3003,3004','4001,4002','4003,4004','5001,5002','5003','5004','6001'],
'PT':['B','B','B','B','B','B','B','B','D','D','D','F'],
'DS':['AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAA','AAA','AAB','AAB'],
'SC':['P1','P1','P2','P2','P1','P1','P2','P2','P1','P1','P2','P2'],
'FS':['2020-10-16 00:00:02','2020-10-16 00:00:04','2020-10-16 00:00:01','2020-10-16 00:00:03','2020-10-16 00:00:01','2020-10-16 00:00:07','2020-10-16 00:00:10','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[3,7,21,25,41,45,61,65,81,42,43,53],
'CB':[0.3,0.7,2.1,2.5,4.1,4.5,6.1,6.5,8.1,4.2,4.3,5.3]})
After running the code again on df1, the new dataframe df2 is:
df2 = pd.DataFrame({'ID':['1001,1002,1003,1004','2001,2002,2003,2004','3001,3002,3003,3004','4001,4002,4003,4004','5001,5002,5003','5004','6001'],
'PT':['B','B','B','B','D','D','F'],
'DS':['AAA','AAA','AAB','AAB','AAA','AAB','AAB'],
'SC':['P1','P2','P1','P2','P1','P2','P2'],
'FS':['2020-10-16 00:00:04','2020-10-16 00:00:03','2020-10-16 00:00:07','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[10,46,86,126,123,43,53],
'CB':[1,4.6,8.6,12.6,12.3,4.3,5.3]})
At this point no more combining can be done on df2, because no pair of rows meets the rules.
The reason for all this is that I have a memory limit and have to decrease the size of the data without losing the information, so I try to bundle IDs which share the same features and happen close to each other in time. I plan to run the code multiple times until there is no more memory issue or no more possible combination.
This is a good place to use GroupBy operations.
My source was Wes McKinney's Python for Data Analysis.
df0['ID'] = df0.groupby([df0['PT'], df0['DS'], df0['SC']])['ID'].transform(lambda x: ','.join(x))
max_times = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index = False).max().drop(['WT', 'CB'], axis = 1)
sums_WT_CB = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index = False).sum()
df2 = pd.merge(max_times, sums_WT_CB, on=['ID', 'PT', 'DS', 'SC'])
This code just takes the most recent time for each unique grouping of the columns you specified. If there are other requirements for the FS column, you will have to modify this.
Code to concatenate the IDs came from:
Concatenate strings from several rows using Pandas groupby
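One small caveat (an assumption about how df0 is reused later): the transform above overwrites df0['ID'] in place, so if the original frame is still needed, run the pass on a copy first:
df_pass = df0.copy()
df_pass['ID'] = df_pass.groupby(['PT', 'DS', 'SC'])['ID'].transform(','.join)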
Perhaps there's something more straightforward (please comment if so :)
but the following seems to work:
def combine(data):
    return pd.DataFrame(
        {
            "ID": ",".join(map(str, data["ID"])),
            "PT": data["PT"].iloc[0],
            "DS": data["DS"].iloc[0],
            "SC": data["SC"].iloc[0],
            "WT": data["WT"].sum(),
            "CB": data["CB"].sum(),
            "FS": data["FS"].max(),
        },
        index=[0],
    ).reset_index(drop=True)
df_agg = (
    df.sort_values(["PT", "DS", "SC", "FS"])
    .groupby(["PT", "DS", "SC"])
    .apply(combine)
    .reset_index(drop=True)
)
returns
ID PT DS SC WT CB FS
0 1001,1002,1003,1004 B AAA P1 10 1.0 2020-10-16 00:00:04
1 2001,2002,2003,2004 B AAA P2 46 4.6 2020-10-16 00:00:03
2 3001,3002,3003,3004 B AAB P1 86 8.6 2020-10-16 00:00:07
3 4001,4002,4003,4004 B AAB P2 126 12.6 2020-10-16 00:10:40
4 5001,5003,5002 D AAA P1 123 12.3 2020-10-16 00:10:00
5 5004 D AAB P2 43 4.3 2020-10-16 00:00:10
6 6001 F AAB P2 53 5.3 2020-10-16 00:00:05

Getting percentage and count in Python

Suppose df.bun (df is a Pandas dataframe) is multi-indexed (date and name), with the variable being category values written as strings,
date name values
20170331 A122630 stock-a
A123320 stock-a
A152500 stock-b
A167860 bond
A196030 stock-a
A196220 stock-a
A204420 stock-a
A204450 curncy-US
A204480 raw-material
A219900 stock-a
How can I represent the total counts within each date, together with their percentages, to make a table like the one below for each date?
date variable counts Percentage
20170331 stock 7 70%
bond 1 10%
raw-material 1 10%
curncy 1 10%
I have tried print(df.groupby('bun').count()) as a first attempt at this question, but it falls short.
cf.) Before getting df.bun, I used the following code to import a nested dictionary into a Pandas dataframe:
import numpy as np
import pandas as pd
result = pd.DataFrame()
origDict = np.load("Hannah Lee.npy")
for item in range(len(origDict)):
    newdict = {(k1, k2): v2 for k1, v1 in origDict[item].items() for k2, v2 in origDict[item][k1].items()}
    df = pd.DataFrame([newdict[i] for i in sorted(newdict)],
                      index=pd.MultiIndex.from_tuples([i for i in sorted(newdict.keys())]))
print(df.bun)
I believe you need SeriesGroupBy.value_counts:
g = df.groupby('date')['values']
df = pd.concat([g.value_counts(),
g.value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))
print (df)
counts percentage
date values
20170331 stock-a 6 60.0
bond 1 10.0
curncy-US 1 10.0
raw-material 1 10.0
stock-b 1 10.0
Another solution: use size for the counts and then divide by a new Series created with transform and sum:
df2 = df.reset_index().groupby(['date', 'values']).size().to_frame('count')
df2['percentage'] = df2['count'].div(df2.groupby('date')['count'].transform('sum')).mul(100)
print (df2)
count percentage
date values
20170331 bond 1 10.0
curncy-US 1 10.0
raw-material 1 10.0
stock-a 6 60.0
stock-b 1 10.0
The difference between the solutions is that the first sorts by the counts within each group, while the second sorts by the MultiIndex.
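The desired table in the question also collapses the sub-categories (stock-a and stock-b into stock, curncy-US into curncy). If that is part of the goal, a hedged sketch (assuming the values column and the date index level used above) is to strip everything after the first '-' before counting:
base = df['values'].str.split('-').str[0]
g = base.groupby(level='date')
out = pd.concat([g.value_counts(),
                 g.value_counts(normalize=True).mul(100)],
                axis=1, keys=('counts', 'Percentage'))
print(out)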

Pandas: Group by date and the median of another variable

This is a demo example of my DataFrame. The full DataFrame has multiple additional variables and covers 6 months of data.
sentiment date
1 2015-05-26 18:58:44
0.9 2015-05-26 19:57:31
0.7 2015-05-26 18:58:24
0.4 2015-05-27 19:17:34
0.6 2015-05-27 18:46:12
0.5 2015-05-27 13:32:24
1 2015-05-28 19:27:31
0.7 2015-05-28 18:58:44
0.2 2015-05-28 19:47:34
I want to group the DataFrame by just the day of the date column, but at the same time aggregate the median of the sentiment column.
Everything I have tried with groupby, the dt accessor and timegrouper has failed.
I want to return a pandas DataFrame not a GroupBy object.
The date column is M8[ns], and the sentiment column is float64.
You fortunately have the tools you need listed in your question.
In [61]: df.groupby(df.date.dt.date)[['sentiment']].median()
Out[61]:
sentiment
2015-05-26 0.9
2015-05-27 0.5
2015-05-28 0.7
I would do this:
df['date'] = df['date'].apply(lambda x : x.date())
df = df.groupby('date').agg({'sentiment':np.median}).reset_index()
You first replace the datetime column with the date.
Then you perform the groupby+agg operation.
I would do this, because you can do multiple aggregations (like median, mean, min, max, etc.) on multiple columns at the same time:
df.groupby(df.date.dt.date).agg({'sentiment': ['median']})
You can get any number of metrics using one groupby and the .agg() function:
1) Create a new column extracting the date.
2) Use groupby and apply numpy.median, numpy.mean, etc.
import numpy as np
import pandas as pd
x = [[1,'2015-05-26 18:58:44'],
[0.9,'2015-05-26 19:57:31']]
t = pd.DataFrame(x,columns = ['a','b'])
t.b = pd.to_datetime(t['b'])
t['datex'] = t['b'].dt.date
t.groupby(['datex']).agg({
'a': np.median
})
Output -
datex
2015-05-26 0.95
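Another option that avoids a helper column is pd.Grouper with a daily frequency; a small sketch against the original df (note it emits a row for every calendar day in the range, with NaN where there is no data, unlike grouping on dt.date):
df.groupby(pd.Grouper(key='date', freq='D'))['sentiment'].median().reset_index()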

Apply Different Resampling Method to the Same Column (pandas)

I have a time series and I want to apply different functions to the same column.
The main column is weight. I want to create a df that shows both the mean for the weight in the resampled period plus the max. I know I can do:
df.resample('M', how = {'weight':np.max}, kind='YearEnd')
df1.resample('M', how = {'weight': np.mean}, kind='YearEnd')
This seems inefficient.
Optimal:
df.resample('M', how = {'weight': np.mean, 'weight':np.max}, kind='YearEnd')
Try this.
In [23]: df = DataFrame(np.random.randn(100,1),columns=['weight'],index=date_range('20000101',periods=100,freq='MS'))
In [24]: df.resample('A',how=['max','mean'])
Out[24]:
weight
max mean
2000-12-31 1.958570 -0.312230
2001-12-31 1.739518 0.035701
2002-12-31 2.503437 0.169365
2003-12-31 1.115315 0.149279
2004-12-31 2.190617 -0.087536
2005-12-31 1.286224 0.037669
2006-12-31 1.674017 0.147676
2007-12-31 2.107169 -0.064962
2008-12-31 -0.163863 -0.572363
[9 rows x 2 columns]
Supporting how as a dict I don't think is too hard; I will open an issue about this enhancement: https://github.com/pydata/pandas/issues/6515
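For reference, how= was later removed from resample; on current pandas the same table comes from chaining agg, e.g. (a sketch against the frame above):
df.resample('A')['weight'].agg(['max', 'mean'])
# a dict of lists also works now, per column:
# df.resample('A').agg({'weight': ['max', 'mean']})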
