Pandas (Python) Splitting hour and time interval on Index and Column

I have temperature measurements and I want to create a heatmap from them. For this I first have to create a DataFrame with the hour on the index and the 15-minute intervals within that hour on the columns.
The source data is like this:
date                 temperature
2021-08-14 11:14:00  27.8
2021-08-14 11:15:00  27.9
2021-08-14 11:16:00  27.9
2021-08-14 11:17:00  27.9
2021-08-14 11:18:00  27.9
...                  ...
2021-08-14 11:31:00  28.10
2021-08-14 11:32:00  28.10
2021-08-14 11:33:00  28.10
2021-08-14 11:34:00  28.10
What I want to get is:
date   00    15    30    45
11:00  27.8  27.9  28.1  28.3
12:00  ..    ..    ..    ..
So I want the 15-minute intervals within each hour split across the columns, with the index containing the hours in which those intervals occur.
Is there an easy way to do this in pandas?
Thanks in advance!

Use resample and pivot_table to get the expected outcome:
out = df.set_index('date').resample('15T').mean()
out = pd.pivot_table(out, index=out.index.strftime('%H:00'),
                     columns=out.index.strftime('%M'),
                     values='temperature')
out = out.rename_axis(index=None, columns=None)

>>> out
         00    15    30
11:00  27.8  27.9  28.1

Let’s first separate hours and minutes (rounded to 15), put them back in the dataframe and use .pivot_table() to build your dataframe with interval means:
>>> h = df['date'].dt.strftime('%H:00').rename('hour')
>>> m = df['date'].dt.floor(freq='15T').dt.minute.rename('minutes')
>>> df.join([h, m]).pivot_table(index='hour', columns='minutes', values='temperature', aggfunc='mean')
minutes 0 15 30 45
hour
2021-08-14 11:00:00 28.709492 28.026066 27.991953 28.096947
2021-08-14 12:00:00 27.877558 28.022282 27.720347 28.201100
2021-08-14 13:00:00 27.739935 NaN NaN NaN
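For reference, a minimal end-to-end sketch of the first approach on synthetic data (the sample frame here is an assumption, not the asker's data; on recent pandas the 'T' alias is spelled 'min'):

import numpy as np
import pandas as pd

# Synthetic minute-level readings standing in for the measured data (assumption).
idx = pd.date_range('2021-08-14 11:00', '2021-08-14 13:59', freq='min')
df = pd.DataFrame({'date': idx,
                   'temperature': 28 + 0.5 * np.random.randn(len(idx))})

# 15-minute means, then pivot hour-of-day against minute-of-hour.
out = df.set_index('date').resample('15min')['temperature'].mean()
heat = pd.pivot_table(out.to_frame('temperature'),
                      index=out.index.strftime('%H:00'),
                      columns=out.index.strftime('%M'),
                      values='temperature')
print(heat)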

Related

Trying to organize a large dataset and then determine the mean days and standard deviation using Spyder (Python 3.9)

*Edit: posted the data as a dataframe rather than a sheets link.
I have a large data set consisting of ~6.5M rows and 6 columns. The rows are BrandIds (e.g., 01-00058) associated with unique items, and the three columns I need are BrandId, InventoryDate, and OnHand.
BrandID SalesPrice InventoryDate Size OnHand PurchasePrice
0 01-00058 9.28 2018-06-30 750mL 6 6.77
1 01-00058 9.28 2018-07-01 750mL 6 6.77
2 01-00058 9.28 2018-07-02 750mL 6 6.77
3 01-00058 9.28 2018-07-03 750mL 102 6.77
4 01-00058 9.28 2018-07-04 750mL 96 6.77
... ... ... ... ... ...
6531265 02-90631 12.74 2019-06-26 400mL 60 8.49
6531266 02-90631 12.74 2019-06-27 400mL 60 8.49
6531267 02-90631 12.74 2019-06-28 400mL 60 8.49
6531268 02-90631 12.74 2019-06-29 400mL 60 8.49
6531269 02-90631 12.74 2019-06-30 400mL 60 8.49
[6531270 rows x 6 columns]
I would like to determine how many days each particular BrandId has no inventory on hand. For example, BrandId 01-00058 has 27 unique days where OnHand = 0. I would like to summarize that information for all unique BrandIds.
I would then like to find the mean and standard deviation of these stock-out day counts across BrandIds.
Ideally, I would love this information to be viewed in the variable explorer as a table that reads:
BrandID Sum OnHand = 0
01-00058 27
01-00061 39
01-00062 14
IIUC, try with groupby:
>>> df[df["OnHand"].eq(0)].groupby("BrandID")["InventoryDate"].nunique()

Grouping data across midnight and performing an operation using pandas

I have the following data contained in a DataFrame (part of a custom class), and I want to compute stats on it for night-time periods.
LAeq,T LAFmax,T LA90,T
Start date & time
2021-08-18 22:00:00 71.5 90.4 49.5
2021-08-18 22:15:00 70.6 94.0 45.7
2021-08-18 22:30:00 69.3 82.2 48.3
2021-08-18 22:45:00 70.1 89.9 46.4
2021-08-18 23:00:00 68.9 82.4 46.0
... ... ...
2021-08-24 08:30:00 72.3 85.0 61.3
2021-08-24 08:45:00 72.9 84.6 62.2
2021-08-24 09:00:00 73.1 86.1 62.6
2021-08-24 09:15:00 72.8 86.4 61.6
2021-08-24 09:30:00 73.2 93.5 61.5
For example, I want to find the nth-highest LAFmax,T for each night-time period.
The night-time period typically spans 23:00 to 07:00, and I have managed to accomplish my goal using the resample() method as follows.
def compute_nth_lmax(self, n):
    nth_lmax = (self.df["LAFmax,T"]
                .between_time(self._night_start, self._day_start,
                              include_start=True, include_end=False)
                .resample(rule=self._night_length,
                          offset=pd.Timedelta(self._night_start))
                .apply(lambda x: np.sort(x)[-n] if x.size > 0 else np.nan)
                .dropna())
    return nth_lmax
The problem is that resample() assumes regular bins. This works fine when the night-time period is 8 hours and therefore divides 24 evenly (as in the default case of 23:00 to 07:00), but not for an irregular night-time period (say, if I extended it to 22:00 to 07:00).
I have tried to accomplish this using groupby(), but had no luck.
The only thing I can think of is adding another column to label each of the rows as "Night-time 1", "Night-time 2" etc., and grouping by these, but that feels rather messy.
I decided to go with what I consider a slightly inelegant approach and created a separate column flagging the night-time periods before processing them. Still, I managed to achieve my goal in 2 lines of code.
self.df["Night-time indices"] = (self.df.index - pd.Timedelta(self._day_start)).date
nth_event = self.df.sort_values(by=[col], ascending=False).between_time(self._night_start, self._day_start)[
[col, period]].groupby(by=period).nth(n)
Out[43]:
Night-time indices
2021-08-18 100.0
2021-08-19 96.9
2021-08-20 97.7
2021-08-21 95.5
2021-08-22 101.7
2021-08-23 92.7
2021-08-24 85.8
Name: LAFmax,T, dtype: float64
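The snippet above leans on names defined elsewhere in the class (col, period, self._night_start, self._day_start). A self-contained sketch of the same labelling idea, with those filled in by assumption:

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the class attributes (assumptions).
night_start, day_start = "22:00", "07:00"
col, n = "LAFmax,T", 1

# Shift each timestamp back by the day-start offset so that early-morning
# hours land on the previous calendar date, then label rows by that date.
df["night"] = (df.index - pd.Timedelta(day_start + ":00")).date

# Keep only night-time rows and take the nth-highest value per night.
nth_event = (df.between_time(night_start, day_start, inclusive="left")
             .groupby("night")[col]
             .apply(lambda x: np.sort(x)[-n] if x.size >= n else np.nan)
             .dropna())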

How do I get the mean of celsius based on the measured_at column?

I want to get the expected output below. How do I use groupby or resampling to get the mean celsius by hour while still keeping the minute values in the measured_at column?
My input:
                 measured_at  celsius
0  2020-05-19 01:13:40+00:00    15.00
1  2020-05-19 01:14:40+00:00    16.50
2  2020-05-20 02:13:26+00:00    30.00
3  2020-05-20 02:14:57+00:00    15.35
4  2020-05-20 02:15:19+00:00    14.00
5  2020-05-20 12:06:39+00:00    20.00
6  2020-05-21 03:13:07+00:00    15.50
7  2020-05-22 12:09:37+00:00    15.00
df['measured_at'] = pd.to_datetime(df.measured_at)
df1 = df.resample('60T', on='measured_at')['celsius'].mean().dropna().reset_index()
My output:
measured_at celsius
0 2020-05-19 01:00:00+00:00 15.750000
1 2020-05-20 02:00:00+00:00 19.783333
2 2020-05-20 12:00:00+00:00 20.000000
3 2020-05-21 03:00:00+00:00 15.500000
4 2020-05-22 12:00:00+00:00 15.000000
Expected output:
measured_at celsius
0 2020-05-19 01:13:00+00:00 15.750000
1 2020-05-20 02:13:00+00:00 19.783333
2 2020-05-20 12:06:00+00:00 20.000000
3 2020-05-21 03:13:00+00:00 15.500000
4 2020-05-22 12:09:00+00:00 15.000000
Here's the code for your use case.
I took out the minutes and seconds so they could be averaged, and added them back after the resampling.
(The +00:00 on the timestamps is the UTC offset; it is carried through unchanged by the operations below.)
import pandas as pd

# Convert to datetime objects
df['measured_at'] = pd.to_datetime(df['measured_at'])
# Extract minutes and seconds as total seconds past the hour
df['seconds'] = df['measured_at'].dt.minute * 60 + df['measured_at'].dt.second
# Resample to periods of one hour (this averages the seconds column too)
df = df.resample('60T', on='measured_at').mean().dropna().reset_index()
# Add back the average offset within each hour
df['measured_at'] = df['measured_at'] + pd.to_timedelta(df['seconds'].astype(int), unit='s')
# Remove the helper column
df = df.drop(columns='seconds')
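Note that the expected output actually shows the first timestamp of each hour truncated to the minute, not an averaged one. If that is the intent, a groupby sketch (assuming measured_at is already datetime):

# Group on the hour; keep the first timestamp floored to the minute and the
# mean temperature for that hour.
out = (df.groupby(df['measured_at'].dt.floor('h'))
         .agg(measured_at=('measured_at', lambda s: s.iloc[0].floor('min')),
              celsius=('celsius', 'mean'))
         .reset_index(drop=True))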

How to calculate daily averages from noon to noon with pandas?

I am fairly new to python and pandas, so I apologise for any future misunderstandings.
I have a pandas DataFrame with hourly values, looking something like this:
2014-04-01 09:00:00 52.9 41.1 36.3
2014-04-01 10:00:00 56.4 41.6 70.8
2014-04-01 11:00:00 53.3 41.2 49.6
2014-04-01 12:00:00 50.4 39.5 36.6
2014-04-01 13:00:00 51.1 39.2 33.3
2016-11-30 16:00:00 16.0 13.5 36.6
2016-11-30 17:00:00 19.6 17.4 44.3
Now I need to calculate 24-hour average values for each column, starting from 2014-04-01 12:00 and running to 2014-04-02 11:00, and so on.
So I want daily averages from noon to noon.
Unfortunately, I have no idea how to do that. I have read some suggestions to use groupby, but I don't really know how...
Thank you very much in advance! Any help is appreciated!!
For newer versions of pandas (>= 1.1.0) use the offset argument:
df.resample('24H', offset='12H').mean()
For older versions of pandas, use the base argument. A day is 24 hours, so base=12 starts each group at noon, giving noon-to-noon bins. Resample emits all days in between, so you can .dropna(how='all') if you don't need the complete range. (I assume you have a DatetimeIndex; if not, you can use the on argument of resample to specify your datetime column.)
df.resample('24H', base=12).mean()
#df.groupby(pd.Grouper(level=0, base=12, freq='24H')).mean() # Equivalent
1 2 3
0
2014-03-31 12:00:00 54.20 41.30 52.233333
2014-04-01 12:00:00 50.75 39.35 34.950000
2014-04-02 12:00:00 NaN NaN NaN
2014-04-03 12:00:00 NaN NaN NaN
2014-04-04 12:00:00 NaN NaN NaN
... ... ... ...
2016-11-26 12:00:00 NaN NaN NaN
2016-11-27 12:00:00 NaN NaN NaN
2016-11-28 12:00:00 NaN NaN NaN
2016-11-29 12:00:00 NaN NaN NaN
2016-11-30 12:00:00 17.80 15.45 40.450000
You could subtract your time and groupby:
df.groupby((df.index - pd.to_timedelta('12:00:00')).normalize()).mean()
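The grouping key is the calendar date after shifting back 12 hours, so pre-noon hours fall into the previous day's bucket, e.g.:

ts = pd.Timestamp('2014-04-01 09:00')
(ts - pd.to_timedelta('12:00:00')).normalize()  # Timestamp('2014-03-31 00:00:00')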
You can shift the timestamps back by 12 hours and resample at the day level.
from io import StringIO
import pandas as pd
data = """
2014-04-01 09:00:00,52.9,41.1,36.3
2014-04-01 10:00:00,56.4,41.6,70.8
2014-04-01 11:00:00,53.3,41.2,49.6
2014-04-01 12:00:00,50.4,39.5,36.6
2014-04-01 13:00:00,51.1,39.2,33.3
2016-11-30 16:00:00,16.0,13.5,36.6
2016-11-30 17:00:00,19.6,17.4,44.3
"""
df = pd.read_csv(StringIO(data), sep=',', header=None, index_col=0)
df.index = pd.to_datetime(df.index)
# shift by 12 hours
df.index = df.index - pd.Timedelta(hours=12)
# resample and drop na rows
df.resample('D').mean().dropna()

How to assign values to a dataframe column by comparing values in another dataframe

I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days:
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the TIMESTAMP and DATE columns are datetime datatypes (dtype returns dtype('<M8[ns]')).
What I want is to add a RANK column to df and fill it with the corresponding day's rank from ranked, based on the TIMESTAMP (so all the 5-minute timesteps within a day get the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['TIMESTAMP'].dt.normalize()
df['TIME'] = df['TIMESTAMP'].dt.time
df = df[['DATE', 'TIME', 'TEMP']]
df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE'] == row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values:
df = (df.set_index(df.TIMESTAMP.dt.date)
        .assign(RANK=ranked.set_index('DATE').RANK)
        .set_index(df.index))
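A merge-based alternative that avoids the index juggling (sketch; assumes DATE in ranked is normalized to midnight, matching TIMESTAMP.dt.normalize()):

# Map each timestamp to its day, then left-join that day's rank.
out = (df.assign(DATE=df['TIMESTAMP'].dt.normalize())
         .merge(ranked[['DATE', 'RANK']], on='DATE', how='left')
         .drop(columns='DATE'))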
