Pandas groupby to get an average day - python

I have a dataframe which is the result of reading a csv. It contains a datetime column and data related to an event. I need to calculate an average day with statistical data per 20 minutes; in the code below I use 'mean' as an example.
Edit:
My data are observations. This means that not all bins have data in them, but these zero counts do have to be taken into account when calculating the mean value: mean = count / #days.
This code works, but is this the way to go? It looks too complicated to me and I wonder if I really need to use a binID and can't group by time of day.
import pandas as pd

# Create dataframe
data = {'date': pd.date_range('2017-01-01 00:30:00', freq='10min', periods=282),
        'i/o': ['in', 'out'] * 141}
df = pd.DataFrame(data)

# Add ones
df['move'] = 1

# I did try:
# 1)
# df['time'] = df['date'].dt.time
# df.groupby(['i/o', pd.Grouper(key='time', freq='20min')])
# This failed with groupby, so should I use my own bins then???
#
# 2)
# Create 20-minute bins
# df['binID'] = df['date'].dt.hour*3 + df['date'].dt.minute//20
# averageDay = df.groupby(['i/o', 'binID']).agg(['count', 'sum', 'mean'])
#
# Well, bins with zero moves aren't there,
# so 'mean' can't be used, nor can other functions that
# need the number of observations. Resample and reindex then???

# Resample
df2 = df.groupby(['i/o', pd.Grouper(key='date', freq='20min')]).agg('sum')

# Reindex and reset (for binID and groupby)
levels = [['in', 'out'],
          pd.date_range('2017-01-01 00:00:00', freq='20min', periods=144)]
newIndex = pd.MultiIndex.from_product(levels, names=['i/o', 'date'])
df2 = df2.reindex(newIndex, fill_value=0).reset_index()

# Create 20-minute bins
df2['binID'] = df2['date'].dt.hour*3 + df2['date'].dt.minute//20

# Average day
averageDay2 = df2.groupby(['i/o', 'binID']).agg(['count', 'sum', 'mean'])
print(averageDay2)

IIUC:
In [124]: df.groupby(['i/o', df.date.dt.hour*3 + df.date.dt.minute//20]) \
            .agg(['count', 'sum', 'mean'])
Out[124]:
          move
         count sum mean
i/o date
in  0        1   1    1
    1        2   2    1
    2        2   2    1
    3        2   2    1
    4        2   2    1
    5        2   2    1
    6        2   2    1
    7        2   2    1
    8        2   2    1
    9        2   2    1
...        ...  ..  ...
out 62       2   2    1
    63       2   2    1
    64       2   2    1
    65       2   2    1
    66       2   2    1
    67       2   2    1
    68       2   2    1
    69       2   2    1
    70       2   2    1
    71       1   1    1

[144 rows x 3 columns]
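
The one-liner above groups the raw rows, so 20-minute bins with no observations stay absent, which is the issue raised in the edit. A minimal sketch of the normalisation the question describes (mean = count / #days), assuming the df built above; bin_id, n_days, full_index and average_day are my own names:

n_days = df['date'].dt.normalize().nunique()                  # distinct days in the data
bin_id = (df['date'].dt.hour * 3 + df['date'].dt.minute // 20).rename('binID')

# Total moves per direction and 20-minute bin of the day.
per_bin = df.groupby(['i/o', bin_id])['move'].sum()

# Reindex against all 72 bins of a day so empty bins become 0, then divide by #days.
full_index = pd.MultiIndex.from_product([['in', 'out'], range(72)],
                                        names=['i/o', 'binID'])
average_day = per_bin.reindex(full_index, fill_value=0) / n_days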

Related

Get last value from pandas grouped dataframe, summed by another column

I have the following dataframe
x = pd.DataFrame(
    {
        'FirstGroupCriterium': [1, 1, 2, 2, 3],
        'SortingCriteria': [1, 1, 1, 2, 1],
        'Value': [10, 20, 30, 40, 50]
    }
)
x.sort_values('SortingCriteria').groupby('FirstGroupCriterium').agg(last_value=('Value', 'last'))
The latter outputs:
                     last_value
FirstGroupCriterium
1                            20
2                            40
3                            50
What I would like to have is to sum up the last values, based on the last SortingCriteria. So in this case:
                     last_value
FirstGroupCriterium
1                    10+20 = 30
2                            40
3                            50
My initial idea was to call a custom aggregator function that groups the data yet again, but that fails.
def last_value(group):
    return group.groupby('SortingCriteria')['Value'].sum().tail(1)
Do you have any idea how to get this to work? Thank you!
Sort by both columns first, then filter the last rows per FirstGroupCriterium with GroupBy.transform and aggregate the sum:
df = x.sort_values(['FirstGroupCriterium', 'SortingCriteria'])
df1 = df[df['SortingCriteria'].eq(df.groupby('FirstGroupCriterium')['SortingCriteria'].transform('last'))]
print (df1)
   FirstGroupCriterium  SortingCriteria  Value
0                    1                1     10
1                    1                1     20
3                    2                2     40
4                    3                1     50

df2 = df1.groupby(['FirstGroupCriterium'], as_index=False)['Value'].sum()
print (df2)
   FirstGroupCriterium  Value
0                    1     30
1                    2     40
2                    3     50
Another idea is to aggregate the sum by both columns and then remove duplicates, keeping the last row per group, with DataFrame.drop_duplicates:
df2 = (df.groupby(['FirstGroupCriterium', 'SortingCriteria'], as_index=False)['Value'].sum()
         .drop_duplicates(['FirstGroupCriterium'], keep='last'))
print (df2)
   FirstGroupCriterium  SortingCriteria  Value
0                    1                1     30
2                    2                2     40
3                    3                1     50
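
For what it is worth, the question's original idea of grouping again inside the aggregation can also be made to work by switching from agg to GroupBy.apply. A sketch (my own, assuming x as defined in the question) that relies on the inner groupby sorting its keys, so .iloc[-1] picks the sum for the largest SortingCriteria:

result = (x.groupby('FirstGroupCriterium')
           .apply(lambda g: g.groupby('SortingCriteria')['Value'].sum().iloc[-1])
           .rename('last_value'))

This reads close to the original intent, but the apply-free approaches above usually scale better.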

pandas - take last N rows from one subgroup

Let's suppose we have a dataframe that can be generated using this code:
import numpy as np
import pandas as pd

d = {'p1': np.random.rand(32),
     'a1': np.random.rand(32),
     'phase': [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3],
     'file_number': [1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 2,2,2,2, 2,2,2,2]}
df = pd.DataFrame(d)
For each file number I want to take only the last N rows of phase number 3, so that the result for N==2 looks like this:
Currently I'm doing it in this way:
def phase_3_last_n_observations(df, n):
    result = []
    for fn in df['file_number'].unique():
        file_df = df[df['file_number'] == fn]
        for phase in [0, 1, 2, 3]:
            phase_df = file_df[file_df['phase'] == phase]
            if phase == 3:
                phase_df = phase_df[-n:]
            result.append(phase_df)
    df = pd.concat(result, axis=0)
    return df

phase_3_last_n_observations(df, 2)
However, it is very slow and I have terabytes of data, so I need to worry about performance. Does anyone have any idea how to speed my solution up? Thanks!
Filter the rows where phase is 3, then groupby and use tail to select the last two rows per file_number; finally append to get the result:
m = df['phase'].eq(3)
df[~m].append(df[m].groupby('file_number').tail(2)).sort_index()
          p1        a1  phase  file_number
0   0.223906  0.164288      0            1
1   0.214081  0.748598      0            1
2   0.567702  0.226143      0            1
3   0.695458  0.567288      0            1
4   0.760710  0.127880      1            1
5   0.592913  0.397473      1            1
6   0.721191  0.572320      1            1
7   0.047981  0.153484      1            1
8   0.598202  0.203754      2            1
9   0.296797  0.614071      2            1
10  0.961616  0.105837      2            1
11  0.237614  0.640263      2            1
14  0.500415  0.220355      3            1
15  0.968630  0.351404      3            1
16  0.065283  0.595144      0            2
17  0.308802  0.164214      0            2
18  0.668811  0.826478      0            2
19  0.888497  0.186267      0            2
20  0.199129  0.241900      1            2
21  0.345185  0.220940      1            2
22  0.389895  0.761068      1            2
23  0.343100  0.582458      1            2
24  0.182792  0.245551      2            2
25  0.503181  0.894517      2            2
26  0.144294  0.351350      2            2
27  0.157116  0.847499      2            2
30  0.194274  0.143037      3            2
31  0.542183  0.060485      3            2
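
One caveat that is not part of the original answer: DataFrame.append was deprecated and removed in pandas 2.0, so on recent versions the same idea can be written with pd.concat:

m = df['phase'].eq(3)
out = pd.concat([df[~m], df[m].groupby('file_number').tail(2)]).sort_index()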
I use an idea from a deleted answer - get the indices of all but the last N rows matching phase 3 with GroupBy.cumcount and remove them with DataFrame.drop:
def phase_3_last_n_observations(df, N):
    df1 = df[df['phase'].eq(3)]
    idx = df1[df1.groupby('file_number').cumcount(ascending=False).ge(N)].index
    return df.drop(idx)

# The index is reset to the default first, because it is used for removing rows.
df = phase_3_last_n_observations(df.reset_index(drop=True), 2)
As an alternative solution to what already exists: you can calculate the last elements for all phase groups and afterwards just use .loc to get the needed group's result. I have written the code for N==2; if you want N==3, then use [-1, -2, -3].
result = df.groupby(['phase']).nth([-1, -2])
PHASE = 3
result.loc[PHASE]

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
   A  B   C
0  0  7  50
1  0  3  51
2  0  3  45
3  1  5  50
4  1  0  50
5  2  6  50
I would like to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group and the number of occurrences of that value:
A  most_freq  freq
0          3     2
1          5     1
2          6     1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve 'apply'? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by default, and then add DataFrame.drop_duplicates to keep the top value per group after Series.reset_index:
df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A', 'most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print (df)
   A  most_freq  freq
0  0          3     2
2  1          0     1
4  2          6     1
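
An equivalent apply-free sketch (my own variation, assuming df_test as above) that counts with GroupBy.size instead of value_counts and keeps the row with the highest count per group after sorting:

out = (df_test.groupby(['A', 'B']).size()      # occurrences of each (A, B) pair
              .rename('freq')
              .reset_index()
              .sort_values(['A', 'freq'], ascending=[True, False])
              .drop_duplicates('A')            # keep the most frequent B per A
              .rename(columns={'B': 'most_freq'}))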

Encode pandas column as categorical values

I have a dataframe as follows:
d = {'item': [1, 2, 3, 4, 5, 6],
     'time': [1297468800, 1297468809, 1297468801, 1297468890, 1297468820, 1297468805]}
df = pd.DataFrame(data=d)
The output of df is as follows:
   item        time
0     1  1297468800
1     2  1297468809
2     3  1297468801
3     4  1297468890
4     5  1297468820
5     6  1297468805
The time here is Unix (epoch) time. My goal is to recode the time column in the dataframe. Here
mintime = 1297468800
maxtime = 1297468890
and I want to split that range into 10 intervals (this should be a parameter, e.g. 20 intervals) and recode the time column in df accordingly, such as:
   item  time
0     1     1
1     2     1
2     3     1
3     4     9
4     5     3
5     6     1
What is the most efficient way to do this, since I have billions of records? Thanks
You can use pd.cut with np.linspace to specify the bins. This encodes your column categorically, from which you can then extract the codes in order:
import numpy as np

bins = np.linspace(df.time.min() - 1, df.time.max(), 10)
df['time'] = pd.cut(df.time, bins=bins, right=True).cat.codes + 1
df

   item  time
0     1     1
1     2     1
2     3     1
3     4     9
4     5     3
5     6     1
Alternatively, depending on how you treat the interval edges, you could also do
bins = np.linspace(df.time.min(), df.time.max() + 1, 10)
pd.cut(df.time, bins=bins, right=False).cat.codes + 1

0    1
1    1
2    1
3    9
4    2
5    1
dtype: int8
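
If building a Categorical for billions of rows turns out to be a bottleneck, the same binning can also be done with numpy alone. A sketch of the right-closed variant (my own adaptation, using the same edges as the first pd.cut call above):

bins = np.linspace(df.time.min() - 1, df.time.max(), 10)
df['time'] = np.digitize(df.time, bins, right=True)   # 1-based bin number, right-closed edges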

calculating differences within groups

I have a DataFrame whose rows provide the value of one feature at one time. Times are identified by the time column (there are about 1,000,000 distinct times). Features are identified by the feature column (there are a few dozen features). There is at most one row for any combination of feature and time. At each time, only some of the features are available; the only exception is feature 0, which is available at all times. I'd like to add to that DataFrame a column that shows the value of feature 0 at that time. Is there a reasonably fast way to do it?
For example, let's say I have
df = pd.DataFrame({
    'time': [1, 1, 2, 2, 2, 3, 3],
    'feature': [1, 0, 0, 2, 4, 3, 0],
    'value': [1, 2, 3, 4, 5, 6, 7],
})
I want to add a column that contains [2,2,3,3,3,7,7].
I tried to use groupby and boolean indexing but no luck.
I'd like to add to that DataFrame a column that shows the value of the feature 0 at that time. Is there a reasonably fast way to do it?
I think that a groupby (which is quite an expensive operation) is overkill for this. Try a merge with only the values of feature 0:
>>> pd.merge(
        df,
        df[df.feature == 0].drop('feature', axis=1).rename(columns={'value': 'value_0'}))
   feature  time  value  value_0
0        1     1      1        2
1        0     1      2        2
2        0     2      3        3
3        2     2      4        3
4        4     2      5        3
5        3     3      6        7
6        0     3      7        7
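
A minor variation of the same merge (my phrasing), making the join key explicit so the result does not depend on pandas picking the common columns:

value_0 = df.loc[df.feature == 0, ['time', 'value']].rename(columns={'value': 'value_0'})
df.merge(value_0, on='time')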
Edit
Per @jezrael's request, here is a timing test:
import pandas as pd

m = 10000
df = pd.DataFrame({
    'time': list(range(m // 2)) * 2,
    'feature': list(range(m // 2)) + [0] * (m // 2),
    'value': range(m),
})
On this input, @jezrael's solution takes 396 ms, whereas mine takes 4.03 ms.
If you'd like to drop the zero rows and add them as a separate column (slightly different than your original request), you could do the following:
# Create initial dataframe.
df = pd.DataFrame({
    'time': [1, 1, 2, 2, 2, 3, 3],
    'feature': [1, 0, 0, 2, 4, 3, 0],
    'value': [1, 2, 3, 4, 5, 6, 7],
})

# Set the index to 'time'
df = df.set_index('time')

# Join the zero-feature value to the non-zero feature rows.
>>> df.loc[df.feature > 0, :].join(df.loc[df.feature == 0, 'value'], rsuffix='_feature_0')
      feature  value  value_feature_0
time
1           1      1                2
2           2      4                3
2           4      5                3
3           3      6                7
You can set_index to column value and then groupby with transform idxmin.
This solution works if the value 0 in column feature is the minimum.
df = df.set_index('value')
df['diff'] = df.groupby('time')['feature'].transform('idxmin')
print(df.reset_index())

   value  feature  time  diff
0      1        1     1     2
1      2        0     1     2
2      3        0     2     3
3      4        2     2     3
4      5        4     2     3
5      6        3     3     7
6      7        0     3     7
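
Another pattern that avoids both merge and groupby (my own sketch, not from either answer, and assuming the original df with its default index): build a time -> value lookup from the feature-0 rows and map it onto the time column.

feature0 = df.loc[df['feature'] == 0].set_index('time')['value']   # value of feature 0 per time
df['value_0'] = df['time'].map(feature0)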
