Pandas groupby: 3 max per period among multiple columns - python

I have these data:
val1 val2 val3
dt
2017-12-15 00:00:00 81 90 79
2017-12-15 00:01:00 67 85 80
2017-12-15 00:02:00 4 41 37
2017-12-15 00:03:00 61 68 29
2017-12-15 00:04:00 49 6 56
2017-12-15 00:05:00 94 13 93
2017-12-15 00:06:00 91 3 75
2017-12-15 00:07:00 94 81 7
2017-12-15 00:08:00 55 59 33
2017-12-15 00:09:00 97 89 26
2017-12-15 00:10:00 17 75 88
2017-12-15 00:11:00 39 40 96
2017-12-15 00:12:00 61 20 70
2017-12-15 00:13:00 62 31 93
2017-12-15 00:14:00 7 26 29
I would like to find the 3 max values for each 5-minute period.
The max values can come from any column (val1, val2, val3) and must be searched among all 15 values available in each 5-minute window (5 rows × 3 columns).
At the moment I can only find the largest in a single column.
Is it possible to search for nlargest in multiple columns?
This is the code to generate the data and to search for the max for val1:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_ref = datetime(2017, 12, 15, 0,0,0)
dtime = pd.date_range(date_ref, freq='1min', periods=15)
np.random.seed(seed=1115)
data1 = np.random.randint(1, high=100, size=len(dtime))
data2 = np.random.randint(1, high=100, size=len(dtime))
data3 = np.random.randint(1, high=100, size=len(dtime))
df = pd.DataFrame({'dt': dtime, 'val1': data1, 'val2': data2, 'val3': data3})
df.set_index('dt', inplace=True)
print(df)
group = df.groupby(pd.Grouper(freq='5min'))
max_only_for_val1 = (pd.DataFrame(group["val1"].nlargest(3))
                     .reset_index(level=1, drop=True))
print(max_only_for_val1)
This is the output:
val1
dt
2017-12-15 00:00:00 81
2017-12-15 00:00:00 67
2017-12-15 00:00:00 61
2017-12-15 00:05:00 97
2017-12-15 00:05:00 94
2017-12-15 00:05:00 94
2017-12-15 00:10:00 62
2017-12-15 00:10:00 61
2017-12-15 00:10:00 39

Since it doesn't matter where your values come from, let's reshape your data a bit.
df = df.reset_index().melt('dt').drop(columns='variable')
df.head(10)
dt value
0 2017-12-15 00:00:00 81
1 2017-12-15 00:01:00 67
2 2017-12-15 00:02:00 4
3 2017-12-15 00:03:00 61
4 2017-12-15 00:04:00 49
5 2017-12-15 00:05:00 94
6 2017-12-15 00:06:00 91
7 2017-12-15 00:07:00 94
8 2017-12-15 00:08:00 55
9 2017-12-15 00:09:00 97
Now, call groupby + apply -
def get_max3(x):
    return x.sort_values(ascending=False).head(3)
df = df.groupby(pd.Grouper(key='dt', freq='5min'))['value']\
       .apply(get_max3)\
       .reset_index(0)\
       .reset_index(drop=True)
dt value
0 2017-12-15 00:00:00 90
1 2017-12-15 00:00:00 85
2 2017-12-15 00:00:00 81
3 2017-12-15 00:05:00 97
4 2017-12-15 00:05:00 94
5 2017-12-15 00:05:00 94
6 2017-12-15 00:10:00 96
7 2017-12-15 00:10:00 93
8 2017-12-15 00:10:00 88
An alternative definition for get_max3 using numpy.sort -
def get_max3(x):
    return np.sort(x.values)[:-4:-1]
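For completeness, here is a melt-free sketch using stack plus SeriesGroupBy.nlargest (this assumes the original df with its DatetimeIndex, before the melt above reassigned df):
# Stack val1..val3 into one Series indexed by (dt, column name),
# then take the 3 largest values per 5-minute bucket directly.
top3 = (df.stack()
          .groupby(pd.Grouper(level=0, freq='5min'))
          .nlargest(3)
          .reset_index(level=[1, 2], drop=True))
print(top3)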

Related

Creating Time slots for a list of timestamps using Pandas

I have a column of Pandas Datetime64 type elements:
df['time']
0 2019-10-04 12:03:53+00:00
1 2019-10-04 11:21:23+00:00
2 2019-10-04 12:23:11+00:00
3 2019-10-04 18:04:52+00:00
4 2019-10-04 12:22:21+00:00
...
2889974 2019-10-11 10:53:19+00:00
2889975 2019-10-11 10:58:38+00:00
2889976 2019-10-10 10:36:47+00:00
2889977 2019-10-10 10:36:47+00:00
2889978 2019-07-08 04:36:45+00:00
Name: time, Length: 2889979, dtype: datetime64[ns, UTC]
and a column of the corresponding times of day called df['time_full'], like so:
df['time_full']
0 12:03:53
1 11:21:23
2 12:23:11
3 18:04:52
4 12:22:21
...
2889974 10:53:19
2889975 10:58:38
2889976 10:36:47
2889977 10:36:47
2889978 04:36:45
Name: time_full, Length: 2889979, dtype: object
I want to create slots of 30 minutes throughout the day (basically 48 slots) and assign a slot to each of the values in the df['time'] column. Basically, I want to create a categorical variable from each timestamp. Something like this (just an example):
df['time'] df['slot']
0 2019-10-04 12:03:53+00:00 4
1 2019-10-04 11:21:23+00:00 2
2 2019-10-04 12:23:11+00:00 32
3 2019-10-04 18:04:52+00:00 40
4 2019-10-04 12:22:21+00:00 5
I tried binning the slots using Pandas' pd.cut() method like here, and ended up doing this:
pd.cut(df['time'].astype(np.int64)//10**9,
       bins=pd.date_range("00:00", "23:59", freq="30min"))
BUT got an output that looked like :
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
2889974 NaN
2889975 NaN
2889976 NaN
2889977 NaN
2889978 NaN
Name: time, Length: 2889979, dtype: category
Categories (47, interval[int64]): [(1575331200000000000, 1575333000000000000] < (1575333000000000000, 1575334800000000000] < (1575334800000000000, 1575336600000000000] < (1575336600000000000, 1575338400000000000] ... (1575408600000000000, 1575410400000000000] < (1575410400000000000, 1575412200000000000] < (1575412200000000000, 1575414000000000000] < (1575414000000000000, 1575415800000000000]]
I also tried using df['time_full'] as bins, but it threw an error since it is a list of strings. I think the issue is that df['time'] is not meant for binning when it has both date and time, but I'm not really sure. Any help would be appreciated.
If you want the slots to range from 0 to 47 you can use this:
df['slots'] = df['time'].apply(lambda x: x.hour*2 if x.minute <= 29 else x.hour*2+1)
df
time slots
0 2019-10-04 12:03:53+00:00 24
1 2019-10-04 11:21:23+00:00 22
2 2019-10-04 12:23:11+00:00 24
3 2019-10-04 18:04:52+00:00 36
4 2019-10-04 12:22:21+00:00 24
2889974 2019-10-11 10:53:19+00:00 21
2889975 2019-10-11 10:58:38+00:00 21
2889976 2019-10-10 10:36:47+00:00 21
2889977 2019-10-10 10:36:47+00:00 21
2889978 2019-07-08 04:36:45+00:00 9
Further testing:
date slots
0 2019-10-04 00:00:00 0
1 2019-10-04 00:30:00 1
2 2019-10-04 01:00:00 2
3 2019-10-04 01:30:00 3
4 2019-10-04 02:00:00 4
5 2019-10-04 02:30:00 5
6 2019-10-04 03:00:00 6
7 2019-10-04 03:30:00 7
8 2019-10-04 04:00:00 8
9 2019-10-04 04:30:00 9
10 2019-10-04 05:00:00 10
11 2019-10-04 05:30:00 11
12 2019-10-04 06:00:00 12
13 2019-10-04 06:30:00 13
14 2019-10-04 07:00:00 14
15 2019-10-04 07:30:00 15
16 2019-10-04 08:00:00 16
17 2019-10-04 08:30:00 17
18 2019-10-04 09:00:00 18
19 2019-10-04 09:30:00 19
20 2019-10-04 10:00:00 20
21 2019-10-04 10:30:00 21
22 2019-10-04 11:00:00 22
23 2019-10-04 11:30:00 23
24 2019-10-04 12:00:00 24
25 2019-10-04 12:30:00 25
26 2019-10-04 13:00:00 26
27 2019-10-04 13:30:00 27
28 2019-10-04 14:00:00 28
29 2019-10-04 14:30:00 29
30 2019-10-04 15:00:00 30
31 2019-10-04 15:30:00 31
32 2019-10-04 16:00:00 32
33 2019-10-04 16:30:00 33
34 2019-10-04 17:00:00 34
35 2019-10-04 17:30:00 35
36 2019-10-04 18:00:00 36
37 2019-10-04 18:30:00 37
38 2019-10-04 19:00:00 38
39 2019-10-04 19:30:00 39
40 2019-10-04 20:00:00 40
41 2019-10-04 20:30:00 41
42 2019-10-04 21:00:00 42
43 2019-10-04 21:30:00 43
44 2019-10-04 22:00:00 44
45 2019-10-04 22:30:00 45
46 2019-10-04 23:00:00 46
47 2019-10-04 23:30:00 47
If you want the slots to range from 1 to 48:
df['slots'] = df['time'].apply(lambda x: x.hour*2+1 if x.minute <= 29 else x.hour*2+2)
It depends on how you want the values numbered.
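A vectorized equivalent of the 0-47 version (a sketch, assuming df['time'] is a datetime64 column as shown in the question) avoids the row-wise apply, which matters at ~2.9 million rows:
# minute // 30 maps minutes 0-29 to 0 and 30-59 to 1,
# matching the `x.minute <= 29` branch above.
df['slots'] = df['time'].dt.hour * 2 + df['time'].dt.minute // 30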
Check "How to convert DatetimeIndexResampler to DataFrame?":
df = pd.DataFrame(pd.date_range('2000-01-02', freq='15min', periods=15), columns=['time'])
df.set_index(df['time'], inplace=True)
df = df.resample('30min').interpolate()
df
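Since the question originally attempted pd.cut, here is a sketch of a cut-based variant that works (assuming df['time'] is the question's datetime64 column): bin on minutes since midnight, so the date part drops out entirely:
import numpy as np
# Minutes since midnight; the date component is ignored.
minutes = df['time'].dt.hour * 60 + df['time'].dt.minute
# 49 edges give 48 half-open [low, high) bins; labels=False
# returns the integer bin number 0..47 instead of an Interval.
df['slot'] = pd.cut(minutes, bins=np.arange(0, 24 * 60 + 1, 30),
                    right=False, labels=False)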

Print rows from output1 based on output2 values

df1
slot Time Location User
56 2017-10-26 22:15:00 89 1
2 2017-10-27 00:30:00 54 1
20 2017-10-28 05:00:00 64 1
24 2017-10-29 06:00:00 2 1
91 2017-11-01 22:45:00 78 1
62 2017-11-02 15:30:00 99 1
91 2017-11-02 22:45:00 34 1
47 2017-10-26 20:15:00 465 2
1 2017-10-27 00:10:00 67 2
20 2017-10-28 05:00:00 5746 2
28 2017-10-29 07:00:00 36 2
91 2017-11-01 22:45:00 786 2
58 2017-11-02 14:30:00 477 2
95 2017-11-02 23:45:00 7322 2
df2
slot
2
91
62
58
I need the output df3 as:
slot Time Location User
2 2017-10-27 00:30:00 54 1
91 2017-11-01 22:45:00 78 1
91 2017-11-02 22:45:00 34 1
91 2017-11-01 22:45:00 786 2
62 2017-11-02 15:30:00 99 1
58 2017-11-02 14:30:00 477 2
If those were CSV files we could join them from the shell:
join file1 file2 > file3
But how can we do the same for these outputs in a Jupyter notebook?
Try isin:
df1[df1.slot.isin(df2.slot)]
Output:
slot Time Location User
1 2 2017-10-27 00:30:00 54 1
4 91 2017-11-01 22:45:00 78 1
5 62 2017-11-02 15:30:00 99 1
6 91 2017-11-02 22:45:00 34 1
11 91 2017-11-01 22:45:00 786 2
12 58 2017-11-02 14:30:00 477 2
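Since the question mentions the shell join command, the closest pandas analogue is merge; a sketch that gives the same rows as the isin filter (note that if df2.slot contained duplicates, merge would duplicate matches, unlike isin):
# Inner merge keeps only rows whose slot appears in df2,
# mirroring `join file1 file2` at the shell.
df3 = df1.merge(df2, on='slot', how='inner')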

How to fill missing time slots in Python?

I'm trying to fill the missing slots in a CSV file that has date and time as strings.
My input from the CSV file is:
A B C
56 2017-10-26 22:15:00 89
2 2017-10-27 00:30:00 54
20 2017-10-28 05:00:00 64
24 2017-10-29 06:00:00 2
91 2017-11-01 22:45:00 78
62 2017-11-02 15:30:00 99
91 2017-11-02 22:45:00 34
Output should be
A B C
0 2017-10-26 00:00:00 89
1 2017-10-26 00:15:00 89
.
.
.
.
.
56 2017-10-26 22:15:00 89
..
.
.
.
.
96 2017-10-26 23:45:00 89
0 2017-10-27 00:00:00 54
1 2017-10-27 00:15:00 54
2 2017-10-27 00:30:00 54
.
.
.
20 2017-10-28 05:00:00 64
21 2017-10-28 05:15:00 64
.
.
.
.
24 2017-10-29 06:00:00 2
.
91 2017-11-01 22:45:00 78
.
62 2017-11-02 15:30:00 99
.
91 2017-11-02 22:45:00 34
The output should cover 15-minute time slots for the days from 2017-10-26 to 2017-11-02; each day has 96 slots.
Using resample to get 15-minute intervals and bfill to fill in the missing values:
df = df.set_index(pd.to_datetime(df.pop('B')))
df.loc[df.index.min().normalize()] = None
df = df.resample('15min').max().bfill()
df['A'] = 4*df.index.hour + df.index.minute//15
print(df)
Output:
A C
B
2017-10-26 00:00:00 0 89.0
2017-10-26 00:15:00 1 89.0
2017-10-26 00:30:00 2 89.0
... .. ...
2017-11-02 22:15:00 89 34.0
2017-11-02 22:30:00 90 34.0
2017-11-02 22:45:00 91 34.0
You need to resample your data and fill the missing values by propagating the last known value forward. Pandas can do that directly. Assuming you loaded your CSV with pandas.read_csv into a dataframe df and set the date column as the index (df.set_index('B')), then:
df.resample(rule='15min').ffill()
The rule parameter defines the new frequency ('15min' means 15 minutes; '15M' would mean 15 months), and the call to .ffill() means "forward fill", i.e., replace missing data with the previous known values.
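Putting that together as a runnable sketch (the file name input.csv is a placeholder; the column names A, B, C follow the sample above):
import pandas as pd

df = pd.read_csv('input.csv')        # hypothetical file name
df['B'] = pd.to_datetime(df['B'])    # parse the date/time strings
df = df.set_index('B')

# Forward-fill onto the 15-minute grid, then recompute the
# per-day slot number A from the new index.
out = df.resample('15min').ffill()
out['A'] = 4 * out.index.hour + out.index.minute // 15
print(out)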

Apply different variables across different date ranges in Pandas

I have a dataframe with dates and values in columns A to H. Also, I have some fixed variables: X1=5, X2=6, Y1=7, Y2=8, Z1=9.
Date A B C D E F G H
0 2018-01-02 00:00:00 7161 7205 -44 54920 73 7 5 47073
1 2018-01-03 00:00:00 7101 7147 -46 54710 73 6 5 46570
2 2018-01-04 00:00:00 7146 7189 -43 54730 70 7 5 46933
3 2018-01-05 00:00:00 7079 7121 -43 54720 70 6 5 46404
4 2018-01-08 00:00:00 7080 7125 -45 54280 70 6 5 46355
5 2018-01-09 00:00:00 7060 7102 -43 54440 70 6 5 46319
6 2018-01-10 00:00:00 7113 7153 -40 54510 70 7 5 46837
7 2018-01-11 00:00:00 7103 7141 -38 54690 70 7 5 46728
8 2018-01-12 00:00:00 7074 7110 -36 54310 65 6 5 46357
9 2018-01-15 00:00:00 7181 7210 -29 54320 65 6 5 46792
10 2018-01-16 00:00:00 7036 7078 -42 54420 65 6 5 45709
11 2018-01-17 00:00:00 6994 7034 -40 53690 65 6 5 45416
12 2018-01-18 00:00:00 7032 7076 -44 53590 65 6 5 45705
13 2018-01-19 00:00:00 6999 7041 -42 53560 65 6 5 45331
14 2018-01-22 00:00:00 7025 7068 -43 53500 65 6 5 45455
15 2018-01-23 00:00:00 6883 6923 -41 53490 65 6 5 44470
16 2018-01-24 00:00:00 7111 7150 -39 52630 65 6 5 45866
17 2018-01-25 00:00:00 7101 7138 -37 53470 65 6 5 45663
18 2018-01-26 00:00:00 7043 7085 -43 53380 65 6 5 45087
19 2018-01-29 00:00:00 7041 7085 -44 53370 65 6 5 44958
20 2018-01-30 00:00:00 7010 7050 -41 53040 65 6 5 44790
21 2018-01-31 00:00:00 7079 7118 -39 52880 65 6 5 45248
What I want to do is add some simple column-wise calculations to this dataframe, using the values in columns A to H as well as those fixed variables.
The tricky part is that I need to apply different variables to different date ranges.
For example, from 2018-01-01 to 2018-01-10 I want to calculate a new column I whose value equals (A+B+C)*X1*Y1+Z1, while from 2018-01-11 to 2018-01-25 the calculation needs to be (A+B+C)*X2*Y1+Z1. Similarly, Y1 and Y2 each apply to their own date ranges.
I know this calculates/creates a new column I:
df['I'] = (df['A'] + df['B'] + df['C'])*X1*Y1 + Z1
but I'm not sure how to get the flexibility to apply different variables to different date ranges.
You can use np.select to define a value based on a condition:
cond = [df.Date.between('2018-01-01','2018-01-10'), df.Date.between('2018-01-11','2018-01-25')]
values = [(df['A']+df['B']+df['C'])*X1*Y1+Z1, (df['A']+df['B']+df['C'])*X2*Y2+Z1]
# select values depending on the condition
df['I'] = np.select(cond, values)
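One caveat: rows outside both date ranges get np.select's implicit default of 0. If NaN is preferable there, pass the default explicitly (a sketch):
# Rows matching neither condition become NaN instead of 0.
df['I'] = np.select(cond, values, default=np.nan)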

How can I do computations on dataframes or series that have different indexes in PANDAS?

I have two Series of the same length and datatype; both are float64. The only difference is the indexes: both are dates, but one is at the beginning of the month and the other at the end of the month. How can I do computations like correlation or covariance on Series or dataframes that have different indexes?
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
import Quandl
IPO=Quandl.get("RITTER/US_IPO_STATS", authtoken="api key")
ir=Quandl.get("FRBC/REALRT", authtoken="api key")
ipo_splice=IPO[264:662]
new_ipo=ipo_splice['Gross Number of IPOs'];
new_ipo=new_ipo.T
ir_splice=ir[0:398]
new_ir=ir_splice['RR 1 Month']
new_ir=new_ir.T
new_ipo.corr(new_ir)
reset_index(drop=True) on the things you want to correlate, then concat:
s1 = pd.DataFrame(np.random.rand(10), list('abcdefghij'), columns=['s1'])
s2 = pd.DataFrame(np.random.rand(10), list('ABCDEFGHIJ'), columns=['s2'])
print(pd.concat([s.reset_index(drop=True) for s in [s1, s2]], axis=1).corr())
s1 s2
s1 1.000000 -0.437945
s2 -0.437945 1.000000
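An alternative sketch that keeps the monthly pairing explicit (assuming both series carry a DatetimeIndex): convert both indices to monthly periods so that BoM and EoM dates in the same month compare equal, then let corr align them:
# A monthly PeriodIndex erases the day-of-month difference,
# so the two series align month by month.
new_ipo.index = new_ipo.index.to_period('M')
new_ir.index = new_ir.index.to_period('M')
print(new_ipo.corr(new_ir))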
You can use the resample() function to resample one of your indices (the goal is to have both indices at either BoM or EoM):
data:
In [63]: df_bom
Out[63]:
val
2015-01-01 76
2015-02-01 27
2015-03-01 65
2015-04-01 71
2015-05-01 9
2015-06-01 23
2015-07-01 52
2015-08-01 10
2015-09-01 62
2015-10-01 25
In [64]: df_eom
Out[64]:
val
2015-01-31 87
2015-02-28 16
2015-03-31 85
2015-04-30 4
2015-05-31 37
2015-06-30 63
2015-07-31 3
2015-08-31 73
2015-09-30 81
2015-10-31 69
Solution:
In [61]: df_eom.resample('MS') + df_bom
C:\envs\py35\Scripts\ipython:1: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
Out[61]:
val
2015-01-01 163
2015-02-01 43
2015-03-01 150
2015-04-01 75
2015-05-01 46
2015-06-01 86
2015-07-01 55
2015-08-01 83
2015-09-01 143
2015-10-01 94
In [62]: df_eom.resample('MS').join(df_bom, lsuffix='_lft')
C:\envs\py35\Scripts\ipython:1: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
Out[62]:
val_lft val
2015-01-01 87 76
2015-02-01 16 27
2015-03-01 85 65
2015-04-01 4 71
2015-05-01 37 9
2015-06-01 63 23
2015-07-01 3 52
2015-08-01 73 10
2015-09-01 81 62
2015-10-01 69 25
An alternative approach: merging the DFs by their year and month parts:
In [69]: %paste
(pd.merge(df_bom, df_eom,
          left_on=[df_bom.index.year, df_bom.index.month],
          right_on=[df_eom.index.year, df_eom.index.month],
          suffixes=('_bom', '_eom')))
## -- End pasted text --
Out[69]:
key_0 key_1 val_bom val_eom
0 2015 1 76 87
1 2015 2 27 16
2 2015 3 65 85
3 2015 4 71 4
4 2015 5 9 37
5 2015 6 23 63
6 2015 7 52 3
7 2015 8 10 73
8 2015 9 62 81
9 2015 10 25 69
Setup:
In [59]: df_bom = pd.DataFrame({'val':np.random.randint(0,100, 10)}, index=pd.date_range('2015-01-01', periods=10, freq='MS'))
In [60]: df_eom = pd.DataFrame({'val':np.random.randint(0,100, 10)}, index=pd.date_range('2015-01-01', periods=10, freq='M'))
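Note that on current pandas the deferred .resample('MS') arithmetic shown in the transcript above no longer works; an explicit aggregation is required. A sketch using first(), since each monthly bucket holds exactly one row:
# Each month-end row lands alone in its 'MS' bucket, so
# first() just relabels the index to the month start.
summed = df_eom.resample('MS').first() + df_bom
joined = df_eom.resample('MS').first().join(df_bom, lsuffix='_lft')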
