Pandas DataFrame find closest index in previous rows where condition is met - python

I have the following df1 dataframe:
t A
0 23:00 2
1 23:01 1
2 23:02 2
3 23:03 2
4 23:04 6
5 23:05 5
6 23:06 4
7 23:07 9
8 23:08 7
9 23:09 10
10 23:10 8
For each t (increments simplified here, not uniformly distributed in real life), I would like to find, if any, the most recent time tr within the previous 5 min where A(t)- A(tr) >= 4. I want to get:
t A tr
0 23:00 2
1 23:01 1
2 23:02 2
3 23:03 2
4 23:04 6 23:03
5 23:05 5 23:01
6 23:06 4
7 23:07 9 23:06
8 23:08 7
9 23:09 10 23:06
10 23:10 8 23:06
Currently, I can compare each row to the previous row with shift(1), like cond = df1['A'] >= df1['A'].shift(1) + 4.
How can I look further back in time?

Assuming your data is continuous by the minute, you can use the usual shift:
df1['t'] = pd.to_timedelta(df1['t'].add(':00'))
df = pd.DataFrame({i: df1.A - df1.A.shift(i) >= 4 for i in range(1, 5)})
df1['t'] - pd.to_timedelta('1min') * df.idxmax(axis=1).where(df.any(axis=1))
Output:
0 NaT
1 NaT
2 NaT
3 NaT
4 23:03:00
5 23:01:00
6 NaT
7 23:06:00
8 NaT
9 23:06:00
10 23:06:00
dtype: timedelta64[ns]
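The shift trick above assumes one row per minute. If the timestamps are not evenly spaced (as the question says is the real-life case), a less vectorized but more general sketch is to scan each row's window directly. `find_tr` is a hypothetical helper name; the window is taken as strictly less than 5 minutes back, which matches the expected output (23:08 gets no tr even though 23:03, exactly 5 minutes earlier, would qualify):

```python
import pandas as pd

df1 = pd.DataFrame({
    't': pd.to_timedelta(['23:00:00', '23:01:00', '23:02:00', '23:03:00',
                          '23:04:00', '23:05:00', '23:06:00', '23:07:00',
                          '23:08:00', '23:09:00', '23:10:00']),
    'A': [2, 1, 2, 2, 6, 5, 4, 9, 7, 10, 8],
})

def find_tr(row, df, window=pd.Timedelta('5min'), threshold=4):
    # Rows strictly earlier than t, but less than `window` back.
    earlier = df[(df['t'] < row['t']) & (df['t'] > row['t'] - window)]
    # Keep the candidates where A(t) - A(tr) >= threshold ...
    hits = earlier[row['A'] - earlier['A'] >= threshold]
    # ... and return the most recent one, if any (df is sorted by t).
    return hits['t'].iloc[-1] if len(hits) else pd.NaT

df1['tr'] = df1.apply(find_tr, axis=1, args=(df1,))
```

This is quadratic in the worst case, but it makes no assumption about the sampling grid.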

I added a datetime index and used rolling(), which supports time-based windows in addition to plain fixed-size index windows.
import pandas as pd
import numpy as np
import datetime
df1 = pd.DataFrame({'t' : [
datetime.datetime(2020, 5, 17, 23, 0, 0),
datetime.datetime(2020, 5, 17, 23, 0, 1),
datetime.datetime(2020, 5, 17, 23, 0, 2),
datetime.datetime(2020, 5, 17, 23, 0, 3),
datetime.datetime(2020, 5, 17, 23, 0, 4),
datetime.datetime(2020, 5, 17, 23, 0, 5),
datetime.datetime(2020, 5, 17, 23, 0, 6),
datetime.datetime(2020, 5, 17, 23, 0, 7),
datetime.datetime(2020, 5, 17, 23, 0, 8),
datetime.datetime(2020, 5, 17, 23, 0, 9),
datetime.datetime(2020, 5, 17, 23, 0, 10)
], 'A' : [2,1,2,2,6,5,4,9,7,10,8]}, columns=['t', 'A'])
df1.index = df1['t']
cond = df1['A'] >= df1.rolling('5s')['A'].apply(lambda x: x.iloc[0] + 4)
result = df1[cond]
Gives
t A
2020-05-17 23:00:04 6
2020-05-17 23:00:05 5
2020-05-17 23:00:07 9
2020-05-17 23:00:09 10
2020-05-17 23:00:10 8


Get the 25% quantile of a cumsum in pandas

Suppose I have the following DataFrame:
df = pd.DataFrame({'id': [2, 4, 10, 12, 13, 14, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31, 42, 50, 54],
'value': [37410.0, 18400.0, 200000.0, 392000.0, 108000.0, 423000.0, 80000.0, 307950.0,
50807.0, 201740.0, 182700.0, 131300.0, 282005.0, 428800.0, 56000.0, 412400.0, 1091595.0, 1237200.0,
927500.0]})
And I do the following:
df.sort_values(by='id').set_index('id').cumsum()
value
id
2 37410.0
4 55810.0
10 255810.0
12 647810.0
13 755810.0
14 1178810.0
19 1258810.0
20 1566760.0
21 1617567.0
22 1819307.0
24 2002007.0
25 2133307.0
27 2415312.0
29 2844112.0
30 2900112.0
31 3312512.0
42 4404107.0
50 5641307.0
54 6568807.0
I want to know the first element of id whose cumulative sum exceeds 25% of the total. In this example, 25% of the total cumsum would be 1,642,201.75, and the first element to exceed that is 22. I know it can be done with a for loop, but I think that would be pretty inefficient.
You could do:
percentile_25 = df['value'].sum() * 0.25
res = df[df['value'].cumsum() > percentile_25].head(1)
print(res)
Output
id value
9 22 201740.0
Or use searchsorted to do the search in O(log N):
percentile_25 = df['value'].sum() * 0.25
i = df['value'].cumsum().searchsorted(percentile_25)
res = df.iloc[i]
print(res)
Output
id 22.0
value 201740.0
Name: 9, dtype: float64
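One caveat worth noting: searchsorted is a binary search, so it is only valid here because the cumulative sum is monotonically increasing, which holds since every value is positive. A minimal check of that precondition, on an abridged version of the data:

```python
import pandas as pd

df = pd.DataFrame({'value': [37410.0, 18400.0, 200000.0, 392000.0]})
cs = df['value'].cumsum()          # monotonic because all values are positive
assert cs.is_monotonic_increasing  # precondition for a valid binary search
i = cs.searchsorted(df['value'].sum() * 0.25)  # first row whose cumsum passes 25%
```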

Can you explain the output: diff.sort_values(ascending=False).index.astype

Can anyone explain the following statement.
list(diff.sort_values(ascending=False).index.astype(int)[0:5])
Output: Int64Index([24, 26, 17, 2, 1], dtype='int64')
It sorts first, but what is the index doing and how do i get 24, 26, 17, 2 ,1 ??
diff is series
ipdb> diff
1 0.017647
2 0.311765
3 -0.060000
4 -0.120000
5 -0.040000
6 -0.120000
7 -0.190000
8 -0.200000
9 -0.100000
10 -0.011176
11 -0.130000
12 0.008824
13 -0.060000
14 -0.090000
15 -0.060000
16 0.008824
17 0.341765
18 -0.140000
19 -0.050000
20 -0.060000
21 -0.040000
22 -0.210000
23 0.008824
24 0.585882
25 -0.060000
26 0.555882
27 -0.031176
28 -0.060000
29 -0.170000
30 -0.220000
31 -0.170000
32 -0.040000
dtype: float64
Your code returns a list of the index values of the top 5 values of the Series, sorted in descending order.
The first 'column' printed for a pandas Series is its index, so after sorting, your code converts the index values to integers and takes the first five by slicing.
print (diff.sort_values(ascending=False))
24 0.585882
26 0.555882
17 0.341765
2 0.311765
1 0.017647
12 0.008824
23 0.008824
16 0.008824
10 -0.011176
27 -0.031176
32 -0.040000
21 -0.040000
5 -0.040000
19 -0.050000
15 -0.060000
3 -0.060000
13 -0.060000
25 -0.060000
28 -0.060000
20 -0.060000
14 -0.090000
9 -0.100000
6 -0.120000
4 -0.120000
11 -0.130000
18 -0.140000
31 -0.170000
29 -0.170000
7 -0.190000
8 -0.200000
22 -0.210000
30 -0.220000
Name: a, dtype: float64
print (diff.sort_values(ascending=False).index.astype(int))
Int64Index([24, 26, 17, 2, 1, 12, 23, 16, 10, 27, 32, 21, 5, 19, 15, 3, 13,
25, 28, 20, 14, 9, 6, 4, 11, 18, 31, 29, 7, 8, 22, 30],
dtype='int64')
print (diff.sort_values(ascending=False).index.astype(int)[0:5])
Int64Index([24, 26, 17, 2, 1], dtype='int64')
print (list(diff.sort_values(ascending=False).index.astype(int)[0:5]))
[24, 26, 17, 2, 1]
Here's what's happening:
diff.sort_values(ascending=False) - sorts the Series. By default ascending is True, but you've set it to False, so it returns the Series sorted in descending order.
pandas.Series.index returns the row labels of the Series (the numbers 1 - 32 in your case, now in sorted-by-value order).
.astype(int) typecasts the index labels as integers.
[0:5] picks the first five elements (positions 0 through 4).
Let me know if this helps!
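As a side note, the same top-5 list can be obtained more directly with nlargest, which avoids sorting the whole Series. The values below are an abridged subset of the diff shown above:

```python
import pandas as pd

diff = pd.Series([0.017647, 0.311765, -0.06, 0.341765, 0.585882, 0.555882],
                 index=[1, 2, 3, 17, 24, 26])

top5 = list(diff.nlargest(5).index)  # index labels of the five largest values
```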

Find the last value in a list in a dataframe

Please, I want to find the last gain value for a given customer in a dataframe. How can I do it?
Example :
df = pd.DataFrame({'date':
['2018-06-13', '2018-06-14', '2018-06-15', '2018-06-16'],
'gain': [[10, 12, 15], [14, 11, 15], [9, 10, 12], [6, 4, 2]],
'how': [['customer1', 'customer2', 'customer3'],
['customer4', 'customer5', 'customer6'],
['customer7', 'customer8', 'customer9'],
['customer5', 'customer6', 'customer10']]})
df :
date gain how
0 2018-06-13 [10, 12, 15] [customer1, customer2, customer3]
1 2018-06-14 [14, 11, 15] [customer4, customer5, customer6]
2 2018-06-15 [9, 10, 12] [customer7, customer8, customer9]
3 2018-06-16 [6, 4, 2] [customer5, customer6, customer10]
I want to do a function that returns the last gain in the dataframe.
example :
for the customer5 = 6
for the customer4 = 14
for the customer20 = 'not found'
thank you so much
Using an unnesting function, then drop_duplicates:
newdf=unnesting(df,['gain','how']).drop_duplicates('how',keep='last')
newdf
Out[25]:
gain how date
0 10 customer1 2018-06-13
0 12 customer2 2018-06-13
0 15 customer3 2018-06-13
1 14 customer4 2018-06-14
2 9 customer7 2018-06-15
2 10 customer8 2018-06-15
2 12 customer9 2018-06-15
3 6 customer5 2018-06-16
3 4 customer6 2018-06-16
3 2 customer10 2018-06-16
Then look up your search list with reindex:
l=['customer5','customer6','customer20']
newdf.loc[newdf.how.isin(l)].set_index('how').reindex(l,fill_value='not_find')
Out[34]:
gain date
how
customer5 6 2018-06-16
customer6 4 2018-06-16
customer20 not_find not_find
Interesting reading about the solution to this type of question:
How do I unnest a column in a pandas DataFrame?
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(columns=explode), how='left')
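For what it's worth, newer pandas versions (1.3+, where explode accepts a list of columns) can do the same unnesting without a helper. A sketch under that version assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-06-13', '2018-06-14', '2018-06-15', '2018-06-16'],
    'gain': [[10, 12, 15], [14, 11, 15], [9, 10, 12], [6, 4, 2]],
    'how': [['customer1', 'customer2', 'customer3'],
            ['customer4', 'customer5', 'customer6'],
            ['customer7', 'customer8', 'customer9'],
            ['customer5', 'customer6', 'customer10']],
})

# Explode both list columns in lockstep, keep each customer's last gain,
# then index by customer name for lookups.
newdf = (df.explode(['gain', 'how'])
           .drop_duplicates('how', keep='last')
           .set_index('how'))

l = ['customer5', 'customer6', 'customer20']
result = newdf.reindex(l, fill_value='not_find')
```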

Down-sampling specific period on dataframe using Pandas

I have a long time series that starts in 1963 and ends in 2013. However, from 1963 until 2007 it has an hourly sampling period, while after 2007 the sampling rate changes to 5 minutes. Is it possible to resample the data after 2007 so that the entire time series has hourly sampling? Data slice below.
yr, m, d, h, m, s, sl
2007, 11, 30, 19, 0, 0, 2180
2007, 11, 30, 20, 0, 0, 2310
2007, 11, 30, 21, 0, 0, 2400
2007, 11, 30, 22, 0, 0, 2400
2007, 11, 30, 23, 0, 0, 2270
2008, 1, 1, 0, 0, 0, 2210
2008, 1, 1, 0, 5, 0, 2210
2008, 1, 1, 0, 10, 0, 2210
2008, 1, 1, 0, 15, 0, 2200
2008, 1, 1, 0, 20, 0, 2200
2008, 1, 1, 0, 25, 0, 2200
2008, 1, 1, 0, 30, 0, 2200
2008, 1, 1, 0, 35, 0, 2200
2008, 1, 1, 0, 40, 0, 2200
2008, 1, 1, 0, 45, 0, 2200
2008, 1, 1, 0, 50, 0, 2200
2008, 1, 1, 0, 55, 0, 2200
2008, 1, 1, 1, 0, 0, 2190
2008, 1, 1, 1, 5, 0, 2190
Thanks!
Give your dataframe proper column names
df.columns = 'year month day hour minute second sl'.split()
Solution
df.groupby(['year', 'month', 'day', 'hour'], as_index=False).first()
year month day hour minute second sl
0 2007 11 30 19 0 0 2180
1 2007 11 30 20 0 0 2310
2 2007 11 30 21 0 0 2400
3 2007 11 30 22 0 0 2400
4 2007 11 30 23 0 0 2270
5 2008 1 1 0 0 0 2210
6 2008 1 1 1 0 0 2190
Option 2
Here is an option that builds off of the column renaming. We'll use pd.to_datetime to cleverly get at our dates, then use resample. However, you have time gaps and will have to address nulls and re-cast dtypes.
df.set_index(
    pd.to_datetime(df.drop(columns='sl'))
).resample('H').first().dropna().astype(df.dtypes)
year month day hour minute second sl
2007-11-30 19:00:00 2007 11 30 19 0 0 2180
2007-11-30 20:00:00 2007 11 30 20 0 0 2310
2007-11-30 21:00:00 2007 11 30 21 0 0 2400
2007-11-30 22:00:00 2007 11 30 22 0 0 2400
2007-11-30 23:00:00 2007 11 30 23 0 0 2270
2008-01-01 00:00:00 2008 1 1 0 0 0 2210
2008-01-01 01:00:00 2008 1 1 1 0 0 2190
Rename the minute column for convenience:
df.columns = ['yr', 'm', 'd', 'h', 'M', 's', 'sl']
Create a datetime column:
from datetime import datetime as dt
df['dt'] = df.apply(axis=1, func=lambda x: dt(x.yr, x.m, x.d, x.h, x.M, x.s))
Resample (note that resample by itself returns a Resampler object, so you need an aggregation such as first() or mean() to get a dataframe back):
For pandas < 0.19:
df = df.set_index('dt').resample('60T').first().reset_index()
For pandas >= 0.19:
df = df.resample('60T', on='dt').first()
You'd better first append a datetime column to your dataframe. pd.to_datetime can assemble it from the component columns, but it only recognizes names like 'year', 'month', 'day', so rename the columns first:
df.columns = ['year', 'month', 'day', 'hour', 'minute', 'second', 'sl']
df['datetime'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])
Then set the datetime column as the dataframe index:
df.set_index('datetime', inplace=True)
Now you can apply the resample method on your dataframe with the preferred sampling rate:
df.resample('60T').mean()
Here I used mean to aggregate. You can use another method based on your need.
See the pandas documentation as a reference.
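Putting the pieces together, here is a minimal end-to-end sketch on an abridged slice of the sample data, with column names chosen so that pd.to_datetime can assemble them:

```python
import io
import pandas as pd

raw = """2007, 11, 30, 19, 0, 0, 2180
2007, 11, 30, 20, 0, 0, 2310
2008, 1, 1, 0, 0, 0, 2210
2008, 1, 1, 0, 5, 0, 2210
2008, 1, 1, 1, 0, 0, 2190"""
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True,
                 names=['year', 'month', 'day', 'hour', 'minute', 'second', 'sl'])

# Assemble a datetime index from the component columns, then keep the
# first sample in each hour (a no-op for the already-hourly 1963-2007 part).
df.index = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])
hourly = df['sl'].resample('60T').first().dropna()
```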

Pandas dot product ValueError

I am trying to calculate the dot product of a data frame and a series, but I am getting ValueError: matrices are not aligned and I do not really understand why. I get
if (len(common) > len(self.columns) or len(common) > len(other.index)):
raise ValueError('matrices are not aligned')
with the error message, which I totally understand. But when I check my series, it has 25 values:
weights
Out[193]:
0 0.000002
1 0.000577
2 0.002480
3 0.004720
4 0.003640
5 0.001480
6 0.000054
7 0.000022
8 0.009060
9 0.000511
10 0.034900
11 0.140000
12 0.065600
13 0.325000
14 0.072900
15 0.031100
16 0.209000
17 0.003280
18 0.001390
19 0.002100
20 0.000847
21 0.009560
22 0.006320
23 0.014000
24 0.061900
Name: 3, dtype: float64
And when I check my data frame, it also has 25 columns:
In [195]: data
Out[195]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 0 to 130
Data columns (total 25 columns):
(etc)
So I don't understand why I get the error message. What am I missing here?
Some additional information:
I am using weightedave = data.dot(weights)
And I just figured out in the dot code that it does common = data.columns.union(weights.index) to get the common referred to in the error message. So I tested that, but in my case that becomes
In[220]: common
Out[220]: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, u'100_AVET', u'101_AVET', u'102_AVET', u'13_AVET', u'14_AVET', u'15_AVET', u'18_AVET', u'19_AVET', u'20_AVET', u'22_AVET', u'36_AVET', u'62_AVET', u'74_AVET', u'78_AVET', u'79_AVET', u'80_AVET', u'83_AVET', u'85_AVET', u'86_AVET', u'88_AVET', u'94_AVET', u'95_AVET', u'96_AVET', u'97_AVET', u'99_AVET'], dtype=object)
Which indeed is longer (50) than my number of columns/indices (25). Should I rename either my series or the columns in my data frame?
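The question has already diagnosed the problem: dot aligns on labels, and the integer labels 0-24 of the series don't match the *_AVET column names, so the union has 50 entries and the alignment check fails. The fix is to make the weights' index match the DataFrame's columns, or to bypass label alignment entirely. A sketch with made-up numbers (the column names are borrowed from the question):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.ones((3, 4)),
                    columns=['13_AVET', '14_AVET', '15_AVET', '18_AVET'])
weights = pd.Series([0.1, 0.2, 0.3, 0.4])  # default integer index 0..3

# Option 1: relabel the weights so the labels line up
# (this assumes the weights are already in column order).
weights.index = data.columns
weightedave = data.dot(weights)

# Option 2: skip label alignment and work with the raw arrays.
weightedave2 = pd.Series(data.values @ weights.values, index=data.index)
```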
