I have a pandas dataframe with a minute datetime index:
Index                  Col1
2022-12-25 09:01:00    5
2022-12-25 09:10:00    15
2022-12-25 11:12:00    10
2022-12-26 10:05:00    2
2022-12-26 12:29:00    2
2022-12-26 13:56:00    5
I want to remove the daily average from this data (the daily averages are 10 for the first day and 3 for the second), resulting in this dataframe:
Index                  Col1
2022-12-25 09:01:00    -5
2022-12-25 09:10:00    5
2022-12-25 11:12:00    0
2022-12-26 10:05:00    -1
2022-12-26 12:29:00    -1
2022-12-26 13:56:00    2
Assuming df is your dataframe, this should do the trick:
for day, df_group in df.groupby(by=df.index.date):  # group by calendar date, not day-of-month
    df.loc[df_group.index, "Col1"] -= df_group["Col1"].mean()
Note the grouping key: df.index.date groups by calendar date, whereas df.index.day is the day-of-month and would mix, say, December 25 with January 25.
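A loop-free alternative, as a minimal sketch (it rebuilds the question's example frame, so only df is assumed): groupby-transform broadcasts each day's mean back onto its rows, so the subtraction happens in one vectorized step.

import pandas as pd

# Reconstruction of the question's example frame.
df = pd.DataFrame(
    {"Col1": [5, 15, 10, 2, 2, 5]},
    index=pd.to_datetime([
        "2022-12-25 09:01:00", "2022-12-25 09:10:00", "2022-12-25 11:12:00",
        "2022-12-26 10:05:00", "2022-12-26 12:29:00", "2022-12-26 13:56:00",
    ]),
)

# transform("mean") returns a series aligned to the original index that
# holds each row's daily mean, so no explicit loop is needed.
df["Col1"] -= df.groupby(df.index.date)["Col1"].transform("mean")
print(df)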
I would like to analyze a dataframe with hourly data for several days, e.g. df:
DATE TIME Threshold Value
2022-11-04 02:00:00 10 9
2022-11-04 03:00:00 11 10
2022-11-04 04:00:00 10 11
2022-11-04 06:00:00 12 11
2022-11-04 05:00:00 12 12
2022-11-04 07:00:00 10 11
2022-11-04 08:00:00 11 10
2022-11-04 09:00:00 11 9
2022-11-04 10:00:00 12 9
2022-11-04 11:00:00 10 10
2022-11-04 12:00:00 10 10
...
2022-11-05 01:00:00 10 9
2022-11-05 02:00:00 11 10
...
Now I would like to examine the data based on threshold/value and time.
Let's say I am interested in the Value of time "08:00:00" if the threshold of the preceding time "04:00:00" was 10. To find possible patterns, I might also look at other combinations in the future.
My approach was:
1. Create a new dataframe df_2 with all slices of 04:00:00 where Threshold = 10
2. Create a new dataframe df_3 with all slices of 08:00:00
3. Merge df_2 and df_3 and select only the rows where a 04:00:00 entry of the same day precedes an 08:00:00 entry.
This seems to be a bit cumbersome and I was wondering if there was a more practical way to do this.
Maybe someone could suggest a more efficient way?
First, make a DatetimeIndex from the DATE and TIME columns:
date_idx = df.iloc[:, :2].astype('str').apply(lambda x: pd.to_datetime(' '.join(x)), axis=1)
Then set it as the index, sort, and add a new column holding the Threshold from 4 hours earlier, assigning the result to df1 (note the lambda inside assign: df1 does not exist yet at that point in the chain):
df1 = (df.set_index(date_idx)
         .drop(['DATE', 'TIME'], axis=1)
         .sort_index()
         .assign(new=lambda d: d.shift(freq='4H')['Threshold']))
Output (df1):
Threshold Value new
2022-11-04 02:00:00 10 9 NaN
2022-11-04 03:00:00 11 10 NaN
2022-11-04 04:00:00 10 11 NaN
2022-11-04 05:00:00 12 12 NaN
2022-11-04 06:00:00 12 11 10.0
2022-11-04 07:00:00 10 11 11.0
2022-11-04 08:00:00 11 10 10.0
2022-11-04 09:00:00 11 9 12.0
2022-11-04 10:00:00 12 9 12.0
2022-11-04 11:00:00 10 10 10.0
2022-11-04 12:00:00 10 10 11.0
Now filter the data at 08:00:00:
df1.at_time('08:00')
output:
Threshold Value new
2022-11-04 08:00:00 11 10 10.0
Finally, check or filter on the Value and new columns as needed, as sketched below.
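A short sketch of that final step (keeping only the 08:00 rows whose Threshold four hours earlier was 10):

res = df1.at_time('08:00')
print(res[res['new'].eq(10)])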
Here is one way to do it:
out=(df.loc[
(df['TIME'].isin(['04:00:00','08:00:00']) & # choose rows where time is 4:00 or 8:00
df['DATE'].isin( # and date where
df.loc[df['TIME'].eq('04:00:00') & # time is 04:00:00
df['Threshold'].eq(10)]['DATE']) # and Threshold is 10
)])
out
DATE TIME Threshold Value
2 2022-11-04 04:00:00 10 11
6 2022-11-04 08:00:00 11 10
Alternatively, the same as above but keeping only rows where the time equals 08:00:00:
out=(df.loc[
(df['TIME'].isin(['08:00:00']) &
df['DATE'].isin(
df.loc[df['TIME'].eq('04:00:00') &
df['Threshold'].eq(10)]['DATE'])
)])
out
DATE TIME Threshold Value
6 2022-11-04 08:00:00 11 10
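A pivot-based layout is another option; this is a sketch of my own, not from either answer, and it assumes each (DATE, TIME) pair occurs at most once. With one row per DATE and a (measure, TIME) column MultiIndex, the condition becomes a plain boolean mask:

# one row per DATE; remaining columns become a (measure, TIME) MultiIndex
wide = df.pivot(index='DATE', columns='TIME')
# dates whose 04:00:00 Threshold was 10
mask = wide[('Threshold', '04:00:00')].eq(10)
# the 08:00:00 Value on exactly those dates
print(wide.loc[mask, ('Value', '08:00:00')])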
I want to get the expected output below. How do I use groupby or resampling to get the mean celsius by hour but still keep the minute values in the measured_at column?
My input:
measured_at celsius
0 2020-05-19 01:13:40+00:00 15.00
1 2020-05-19 01:14:40+00:00 16.50
2 2020-05-20 02:13:26+00:00 30.00
3 2020-05-20 02:14:57+00:00 15.35
4 2020-05-20 02:15:19+00:00 14.00
5 2020-05-20 12:06:39+00:00 20.00
6 2020-05-21 03:13:07+00:00 15.50
7 2020-05-22 12:09:37+00:00 15.00
df['measured_at'] = pd.to_datetime(df.measured_at)
df1 = df.resample('60T', on='measured_at')['celsius'].mean().dropna().reset_index()
My output:
measured_at celsius
0 2020-05-19 01:00:00+00:00 15.750000
1 2020-05-20 02:00:00+00:00 19.783333
2 2020-05-20 12:00:00+00:00 20.000000
3 2020-05-21 03:00:00+00:00 15.500000
4 2020-05-22 12:00:00+00:00 15.000000
Expected output:
measured_at celsius
0 2020-05-19 01:13:00+00:00 15.750000
1 2020-05-20 02:13:00+00:00 19.783333
2 2020-05-20 12:06:00+00:00 20.000000
3 2020-05-21 03:13:00+00:00 15.500000
4 2020-05-22 12:09:00+00:00 15.000000
Here's the code for your use case.
I took out the minutes and seconds so that they could be averaged, and added them back after the resampling.
The +00:00 suffix is not extra precision; it is the UTC offset of your timezone-aware timestamps, and pd.to_datetime preserves it, so nothing special is needed for it.
import pandas as pd

# Convert to timezone-aware datetime objects
df['measured_at'] = pd.to_datetime(df['measured_at'])
# Extract minutes and seconds as total seconds past the hour
df['seconds'] = df['measured_at'].apply(lambda x: (x.minute * 60) + x.second)
# Resample to periods of one hour (celsius and seconds are both averaged)
df = df.resample('60T', on='measured_at').mean().dropna().reset_index()
# Add back the average minutes and seconds for each period
df['measured_at'] = df['measured_at'] + pd.to_timedelta(df['seconds'].astype(int), 's')
# Remove the helper column
df = df.drop(columns='seconds')
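One caveat: averaging the seconds yields timestamps like 01:14:10 for the first hour, while the expected output shows 01:13:00, which looks like the first timestamp of each hour truncated to the minute. If that reading is right (an assumption on my part), a groupby sketch along these lines reproduces the expected output exactly:

import pandas as pd

df['measured_at'] = pd.to_datetime(df['measured_at'])
out = (df.groupby(df['measured_at'].dt.floor('h'))
         .agg(measured_at=('measured_at', 'first'),  # first reading in each hour
              celsius=('celsius', 'mean'))           # hourly mean temperature
         .reset_index(drop=True))
out['measured_at'] = out['measured_at'].dt.floor('min')  # truncate to the minute
print(out)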
Currently, the week numbers for the period '2020-5-6' to '2020-5-19' are 20 and 21.
How do I customise it so that the week numbers are 1 and 2 instead, with the subsequent periods changing accordingly?
My code:
import pandas as pd
df = pd.DataFrame({'Date':pd.date_range('2020-5-6', '2020-5-19')})
df['Period'] = df['Date'].dt.to_period('W-TUE')
df['Week_Number'] = df['Period'].dt.week
df.head()
print(df)
My output:
Date Period Week_Number
0 2020-05-06 2020-05-06/2020-05-12 20
1 2020-05-07 2020-05-06/2020-05-12 20
2 2020-05-08 2020-05-06/2020-05-12 20
3 2020-05-09 2020-05-06/2020-05-12 20
...
11 2020-05-17 2020-05-13/2020-05-19 21
12 2020-05-18 2020-05-13/2020-05-19 21
13 2020-05-19 2020-05-13/2020-05-19 21
What I want:
Date Period Week_Number
0 2020-05-06 2020-05-06/2020-05-12 1
1 2020-05-07 2020-05-06/2020-05-12 1
2 2020-05-08 2020-05-06/2020-05-12 1
3 2020-05-09 2020-05-06/2020-05-12 1
...
11 2020-05-17 2020-05-13/2020-05-19 2
12 2020-05-18 2020-05-13/2020-05-19 2
13 2020-05-19 2020-05-13/2020-05-19 2
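A minimal sketch for this, assuming the dates are sorted as in the example: factorize() numbers distinct periods 0, 1, ... in order of first appearance, so the codes plus one give week numbers starting at 1.

import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2020-5-6', '2020-5-19')})
df['Period'] = df['Date'].dt.to_period('W-TUE')
# factorize() assigns 0, 1, ... to distinct periods in order of first
# appearance, which is chronological here because Date is sorted.
df['Week_Number'] = df['Period'].factorize()[0] + 1
print(df)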
I'm trying to use pandas to group subscribers by subscription type for a given day and get the average price of a subscription type on that day. The data I have resembles:
Sub_Date Sub_Type Price
2011-03-31 00:00:00 12 Month 331.00
2012-04-16 00:00:00 12 Month 334.70
2013-08-06 00:00:00 12 Month 344.34
2014-08-21 00:00:00 12 Month 362.53
2015-08-31 00:00:00 6 Month 289.47
2016-09-03 00:00:00 6 Month 245.57
2013-04-10 00:00:00 4 Month 148.79
2014-03-13 00:00:00 12 Month 348.46
2015-03-15 00:00:00 12 Month 316.86
2011-02-09 00:00:00 12 Month 333.25
2012-03-09 00:00:00 12 Month 333.88
...
2013-04-03 00:00:00 12 Month 318.34
2014-04-15 00:00:00 12 Month 350.73
2015-04-19 00:00:00 6 Month 291.63
2016-04-19 00:00:00 6 Month 247.35
2011-02-14 00:00:00 12 Month 333.25
2012-05-23 00:00:00 12 Month 317.77
2013-05-28 00:00:00 12 Month 328.16
2014-05-31 00:00:00 12 Month 360.02
2011-07-11 00:00:00 12 Month 335.00
...
I'm looking to get something that resembles:
Sub_Date Sub_type Quantity Price
2011-03-31 00:00:00 3 Month 2 125.00
4 Month 0 0.00 # Promo not available this month
6 Month 1 250.78
12 Month 2 334.70
2011-04-01 00:00:00 3 Month 2 125.00
4 Month 2 145.00
6 Month 0 250.78
12 Month 0 334.70
2013-04-02 00:00:00 3 Month 1 125.00
4 Month 3 145.00
6 Month 0 250.78
12 Month 1 334.70
...
2015-06-23 00:00:00 3 Month 4 135.12
4 Month 0 0.00 # Promo not available this month
6 Month 0 272.71
12 Month 3 354.12
...
I'm only able to get the total number of Sub_Types for a given date.
df.Sub_Date.groupby([df.Sub_Date.values.astype('datetime64[D]')]).size()
This is somewhat of a good start, but not exactly what is needed. I've had a look at the groupby documentation on the pandas site but I can't get the output I desire.
I think you need to aggregate by mean and size and then add the missing combinations with unstack followed by stack.
Also, if you need to change the order of the Sub_Type level, use an ordered categorical.
# generate all month categories ('1 Month', '2 Month', ..., '12 Month')
cat = [str(x) + ' Month' for x in range(1, 13)]
# CategoricalDtype replaces the removed astype('category', categories=...) form
df.Sub_Type = df.Sub_Type.astype(pd.CategoricalDtype(categories=cat, ordered=True))

df1 = (df.Price.groupby([df.Sub_Date.values.astype('datetime64[D]'), df.Sub_Type])
         .agg(['mean', 'size'])
         .rename(columns={'size': 'Quantity', 'mean': 'Price'})
         .unstack(fill_value=0)
         .stack())
print(df1)
Price Quantity
Sub_Type
2011-02-09 4 Month 0.00 0
6 Month 0.00 0
12 Month 333.25 1
2011-02-14 4 Month 0.00 0
6 Month 0.00 0
12 Month 333.25 1
2011-03-31 4 Month 0.00 0
6 Month 0.00 0
12 Month 331.00 1
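For comparison, a pivot_table sketch of my own (assuming Sub_Date is already a datetime column) gets the same shape in one call: aggfunc=['mean', 'size'] computes both statistics, and fill_value=0 fills the missing date/type combinations.

out = (df.pivot_table(index=df['Sub_Date'].dt.normalize(),  # strip the time part
                      columns='Sub_Type',
                      values='Price',
                      aggfunc=['mean', 'size'],
                      fill_value=0)
         .rename(columns={'mean': 'Price', 'size': 'Quantity'})
         .stack())
print(out)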
I have a pandas dataFrame like this:
content
date
2013-12-18 12:30:00 1
2013-12-19 10:50:00 1
2013-12-24 11:00:00 0
2014-01-02 11:30:00 1
2014-01-03 11:50:00 0
2013-12-17 16:40:00 10
2013-12-18 10:00:00 0
2013-12-11 10:00:00 0
2013-12-18 11:45:00 0
2013-12-11 14:40:00 4
2010-05-25 13:05:00 0
2013-11-18 14:10:00 0
2013-11-27 11:50:00 3
2013-11-13 10:40:00 0
2013-11-20 10:40:00 1
2008-11-04 14:49:00 1
2013-11-18 10:05:00 0
2013-08-27 11:00:00 0
2013-09-18 16:00:00 0
2013-09-27 11:40:00 0
date being the index.
I reduce the values to months using:
dataFrame = dataFrame.groupby([lambda x: x.year, lambda x: x.month]).agg([sum])
which outputs:
content
sum
2006 3 66
4 65
5 48
6 87
7 37
8 54
9 73
10 74
11 53
12 45
2007 1 28
2 40
3 95
4 63
5 56
6 66
7 50
8 49
9 18
10 28
Now when I plot this dataFrame, I want the x-axis to show every month/year as a tick. I have tried setting xticks but it doesn't seem to work. How could this be achieved? This is my current plot using dataFrame.plot():
You can use set_xticks() and set_xticklabels():
import numpy as np
import pandas as pd

idx = pd.date_range("2013-01-01", periods=1000)
val = np.random.rand(1000)
s = pd.Series(val, idx)

g = s.groupby([s.index.year, s.index.month]).mean()

ax = g.plot()
ax.set_xticks(range(len(g)))
ax.set_xticklabels(["%s-%02d" % item for item in g.index.tolist()], rotation=90)
Output: a line plot with one labeled year-month tick for each point.