Conditional count per day from pandas dataframe - python

I have a dataset with a reading (Tank Level) every minute from a piece of equipment and want to create a new dataset (dataframe) with a count of the number of samples per day and the number of readings above a set value.
Noxious Tank Level.MIN Noxious Tank Level.MAX Date_Time
0 9.32 9.33 2019-12-31 05:01:00
1 9.32 9.34 2019-12-31 05:02:00
2 9.32 9.35 2019-12-31 05:03:00
3 9.31 9.35 2019-12-31 05:04:00
4 9.31 9.35 2019-12-31 05:05:00
... ... ... ...
528175 2.98 3.01 2020-12-31 23:56:00
528176 2.98 3.02 2020-12-31 23:57:00
528177 2.98 3.01 2020-12-31 23:58:00
528178 2.98 3.02 2020-12-31 23:59:00
528179 2.98 2.99 2021-01-01 00:00:00
Using a lamdba function I can see whether each value is an overflow (Tank Level > setpoint) - I have also indexed the dataframe by Date_Time:
df['Overflow'] = df.apply(lambda x: True if x['Noxious Tank Level.MIN'] > 89 else False , axis=1)
Noxious Tank Level.MIN Noxious Tank Level.MAX Overflow
Date_Time
2019-12-31 05:01:00 9.32 9.33 False
2019-12-31 05:02:00 9.32 9.34 False
2019-12-31 05:03:00 9.32 9.35 False
2019-12-31 05:04:00 9.31 9.35 False
2019-12-31 05:05:00 9.31 9.35 False
... ... ... ...
2020-12-31 23:56:00 2.98 3.01 False
2020-12-31 23:57:00 2.98 3.02 False
2020-12-31 23:58:00 2.98 3.01 False
2020-12-31 23:59:00 2.98 3.02 False
2021-01-01 00:00:00 2.98 2.99 False
Now I want to count the number of samples per day and the number of 'True' values in the Overflow column to work out what fraction per day is in Overflow
I get the feeling that resample or groupby will be the way to go but I can't figure out how to create a new dataset with just these counts and include the conditional count from the Overflow column

First use:
df['Overflow'] = df['Noxious Tank Level.MIN'] > 89
And then for count Trues use sum nad for count values use size per days/ dates:
df1 = df.resample('d')['Overflow'].agg(['sum','size'])
Or:
df1 = df.groupby(pd.Grouper(freq='D'))['Overflow'].agg(['sum','size'])
Or:
df2 = df.groupby(df.index.date)['Overflow'].agg(['sum','size'])

Related

Grouping data across midnight and performing an operation using pandas

I have the following data contained in a DataFrame which is part of a custom Class, and I want to compute stats on it for night-time periods.
LAeq,T LAFmax,T LA90,T
Start date & time
2021-08-18 22:00:00 71.5 90.4 49.5
2021-08-18 22:15:00 70.6 94.0 45.7
2021-08-18 22:30:00 69.3 82.2 48.3
2021-08-18 22:45:00 70.1 89.9 46.4
2021-08-18 23:00:00 68.9 82.4 46.0
... ... ...
2021-08-24 08:30:00 72.3 85.0 61.3
2021-08-24 08:45:00 72.9 84.6 62.2
2021-08-24 09:00:00 73.1 86.1 62.6
2021-08-24 09:15:00 72.8 86.4 61.6
2021-08-24 09:30:00 73.2 93.5 61.5
For example, I want to find the nth highest LAFmax, T for each given night-time period.
The night-time period typically spans 23:00 to 07:00, and I have managed to accomplish my goal using the resample() method as follows.
def compute_nth_lmax(self, n):
nth_lmax = self.df["LAFmax,T"].between_time(self._night_start, self._day_start,
include_start=True, include_end=False).resample(
rule=self._night_length, offset=pd.Timedelta(self._night_start)).apply(
lambda x: (np.sort(x))[-n] if x.size > 0 else np.nan).dropna()
return nth_lmax
The problem is that resample() assumes a regular resampling, and this works fine when the night-time period is 8 hours and therefore subdivides 24 equally (as in the default case of 23:00 to 07:00), but not for an irregular night-time period (say, if I extended it to 22:00 to 07:00).
I have tried to accomplish this using groupby(), but had no luck.
The only thing I can think of is adding another column to label each of the rows as "Night-time 1", "Night-time 2" etc., and grouping by these, but that feels rather messy.
I decided to go with what I consider a slightly inelegant approach and create a separate column which flags the night-time periods, before processing them. Still, I managed to achieve my goal in 2 lines of code.
self.df["Night-time indices"] = (self.df.index - pd.Timedelta(self._day_start)).date
nth_event = self.df.sort_values(by=[col], ascending=False).between_time(self._night_start, self._day_start)[
[col, period]].groupby(by=period).nth(n)
Out[43]:
Night-time indices
2021-08-18 100.0
2021-08-19 96.9
2021-08-20 97.7
2021-08-21 95.5
2021-08-22 101.7
2021-08-23 92.7
2021-08-24 85.8
Name: LAFmax,T, dtype: float64

Resampling by group ends up returning Type error

I have a dataframe like this :
Symbol Time Open High Low Close Volume LOD 10MA 20MA Sessions
0 AEHR 2021-08-20 09:31:00 5.52 5.52 5.52 5.52 3383 5.52 NaN NaN 1
1 AEHR 2021-08-20 09:32:00 5.57 5.57 5.57 5.57 1012 5.52 NaN NaN 1
2 AEHR 2021-08-20 09:35:00 5.56 5.56 5.56 5.56 4119 5.52 NaN NaN 1
60864 ZI 2021-09-10 15:58:00 63.07 63.12 63.07 63.10 16650 62.84 63.105 63.1420 165
60865 ZI 2021-09-10 15:59:00 63.09 63.12 63.06 63.11 25775 62.84 63.108 63.1270 165
60866 ZI 2021-09-10 16:00:00 63.11 63.17 63.11 63.17 28578 62.84 63.115 63.1200 165
I have already added a masking column called Sessions. I would like to resample current 1Min into 60Min for each unique Sessions
Don't care what happens to Volume LOD 10MA 20MA, will be dropping those. Also, this is the sample set, the final dataset will have 350million rows so efficiency is of concern.
df = df.groupby("Sessions").resample('60Min', label='right', on='Time')
Returns
AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'grouper'
My understanding is that .resample() returns this:
Time e.g 10:00:00 will include data from 9:30:00 - 10:00:00 (since market only opens at 9:30, all other time will be for the full hour). 11:00:00 will include data from 10:01:00 - 11:00:00 etc
Open should be the first value of the series for each Sessions
High should be the max() value for each Sessions
Low should be the min() value for each Sessions
Close should be the last value of the series for each Sessions

Adding a row into DataFrame with multiindex

I would like to create a new DataFrame and a bunch of stock data per each date.
Declaring a DataFrame with a multi-index - date and stock ticker.
Adding data for 2020-06-07
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
Adding data for 2020-06-08
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
2020-06-08 AAPL 32.50 34.20 31.1 32.30
2020-06-08 MSFT 58.50 59.20 52.1 53.30
What would be the best and most efficient solution?
Here's my current version that doesn't work as I expect.
df = pd.DataFrame()
for date in dates:
universe500 = get_universe(date) #returns stocks on a specific date
for security in universe500:
prices = data.get_prices(security, ['open','high','low','close'], 1, '1d') # returns pd.DataFrame
df.iloc[(date, security),:] = prices
If prices is a DataFrame formatted in the same manner as the original df, you can use concat:
In[0]:
#consttucting a fake entry
arrays = [['06-07-2020'], ['ABCD']]
multi = pd.MultiIndex.from_arrays(arrays, names=('date', 'stock'))
to_add = pd.DataFrame({'open':1, 'high':2, 'low':3, 'close':4},index=multi)
print(to_add)
Out[0]:
open high low close
date stock
2020-06-09 ABCD 1 2 3 4
In[1]:
#now adding to your data
df = pd.concat([df, to_add])
print(df)
Out[1]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
If the data (prices) were just an array of 4 numbers (the open, high, low, and close) values, then loc would work in the place you used iloc:
In[2]:
df.loc[('2020-06-10','WXYZ'),:] = [10,20,30,40]
Out[2]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
2020-06-10 WXYZ 10.0 20.0 30.0 40.0

30 Day distance between dates in datetime64[ns] column

I have data of the following form:
6460 2001-07-24 00:00:00 67.5 75.1 75.9 71.0 75.2 81.8
6490 2001-06-24 00:00:00 68.4 74.9 76.1 70.9 75.5 82.7
6520 2001-05-25 00:00:00 69.6 74.7 76.3 70.8 75.5 83.2
6550 2001-04-25 00:00:00 69.2 74.6 76.1 70.6 75.0 83.1
6580 2001-03-26 00:00:00 69.1 74.4 75.9 70.5 74.3 82.8
6610 2001-02-24 00:00:00 69.0 74.0 75.3 69.8 73.8 81.9
6640 2001-01-25 00:00:00 68.9 73.9 74.6 69.7 73.5 80.0
6670 2000-12-26 00:00:00 69.0 73.5 75.0 69.5 72.6 81.8
6700 2000-11-26 00:00:00 69.8 73.2 75.1 69.5 72.0 82.7
6730 2000-10-27 00:00:00 70.3 73.1 75.0 69.4 71.3 82.6
6760 2000-09-27 00:00:00 69.4 73.0 74.8 69.4 71.0 82.3
6790 2000-08-28 00:00:00 69.6 72.8 74.6 69.2 70.7 81.9
6820 2000-07-29 00:00:00 67.8 72.9 74.4 69.1 70.6 81.8
I want all the dates to have a 30 day difference between each other. I know how to add a specific day or month to a datetime object with something like
ndfd = ndf['Date'].astype('datetime64[ns]')
ndfd = ndfd.apply(lambda dt: dt.replace(day=15))
But this does not take into account the difference in days from month to month.
How can I ensure there is a consistent step in days from month to month in my data, given that I am able to change the day as long as it remains on the same month?
You could use date_range:
df['date'] = pd.date_range(start=df['date'][0], periods=len(df), freq='30D')
IIUC you could change your date column like this:
import datetime
a = df.iloc[0,0] # first date, assuming date col is first
df['date'] = [a + datetime.timedelta(days=30 * i) for i in range(len(df))]
I haven't tested this so not sure it work as smooth as I thought it will =).
You can transform your first day into ordinal, add 30*i to it and then transform it back.
first_day=df.iloc[0]['date_column'].toordinal()
df['date']=(first_day+30*i for i in range(len(df))).fromordinal

How to assign a values to dataframe's column by comparing values in another dataframe

I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the columns TIMESTAMP and DATE are datetime datatypes (dtype returns dtype('M8[ns]').
What I want to be able to do is add a column to the dataframe df and then put the rank of the row based on the TIMESTAMP and corresponding day's rank from ranked (so in a day all the 5 minute timesteps will have the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]
df['RANK'] = 0
for index, row in df.iterrows():
df.loc[index, 'RANK'] = ranked[ranked['DATE']==row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values
df = df.set_index(df.TIMESTAMP.dt.date)\
.assign(RANK=ranked.set_index('DATE').RANK)\
.set_index(df.index)

Categories