I have a dataframe that contains hourly temperature data from 1990-2019 for 25 different locations. I want to count the number of hours that a value is above or below a certain threshold and then plot that count as a yearly total. I know I can use a bar chart or histogram for the plot, but I am unsure how to aggregate the data for this task.
Dataframe:
time Antwerp Rotterdam ...
1990-01-01 00:00:00 2 4 ...
1990-01-01 01:00:00 3 4 ...
1990-01-01 02:00:00 2 4 ...
...
Do I need to use the groupby function?
Sample data to demonstrate:
time Antwerp Rotterdam Los Angeles
0 1990-01-01 00:00:00 0 2 15
1 1990-01-01 01:00:00 1 4 14
2 1990-01-01 02:00:00 3 5 15
3 1990-01-01 03:00:00 2 6 16
Now I am looking for the number of hours during which each city is at or below 5 degrees in the year 1990. Expected output:
time Antwerp Rotterdam Los Angeles
1990 4 3 0
Ideally I would want to be able to select whatever temperature value I want.
I think you need a DatetimeIndex, then a comparison, e.g. DataFrame.gt for greater-than, and finally counting the True values by aggregating with sum:
df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')
N = 2
df = df.gt(N).groupby(df.index.year).sum()
print (df)
Antwerp Rotterdam
time
1990 0.0 1.0
1991 1.0 2.0
If you want less-than-or-equal, use DataFrame.le:
N = 3
df = df.le(N).groupby(df.index.year).sum()
print (df)
Antwerp Rotterdam
time
1990 1.0 0.0
1991 2.0 0.0
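To make the threshold and direction selectable (as the question asks), you can wrap this in a small helper; a minimal sketch, assuming df already has a DatetimeIndex:

import pandas as pd

def count_hours(df, threshold, above=True):
    # compare every value against the threshold, then count True values per year
    mask = df.gt(threshold) if above else df.le(threshold)
    return mask.groupby(df.index.year).sum()

# e.g. hours at or below 5 degrees per year:
# count_hours(df, 5, above=False)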
This is without using pandas functions: pull one city's column into a plain list and count the values on each side of the threshold.
def get_temp_counts(df, city, threshold):
    # the city's temperatures as a plain Python list
    temps = list(df[city])
    # hours strictly above the threshold, and hours at or below it
    up_cnt = sum(t > threshold for t in temps)
    low_cnt = sum(t <= threshold for t in temps)
    return up_cnt, low_cnt
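For example, with the four sample rows above and a threshold of 5 degrees:
up, low = get_temp_counts(df, 'Rotterdam', threshold=5)
print(up, low)  # 1 3 -> one hour above 5 degrees, three at or below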
Related
I'm creating a pandas DataFrame with random dates and random integer values, and I want to resample it by month and compute the average of the integers. This can be done with the following code:
import numpy as np
import pandas as pd

def random_dates(start='2018-01-01', end='2019-01-01', n=300):
    # to_datetime accepts both strings and Timestamps; .value is nanoseconds since epoch
    start_u = pd.to_datetime(start).value // 10**9
    end_u = pd.to_datetime(end).value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')
dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})
print(df.resample('M', on='Month').mean())
The thing is that the resampled months always start from day one, and I want all months to start from day 15. I'm using pandas 1.1.4 and I've tried origin='15/01/2018' and offset='15', and neither works with the 'M' resample rule (they do work with '30D', but that is of no use). I've also tried '2SM', but it doesn't work either.
So my question is: is there a way of changing the resample rule, or will I have to add an offset to my data?
Assume that the source DataFrame is:
Month Amount
0 2020-05-05 1
1 2020-05-14 1
2 2020-05-15 10
3 2020-05-20 10
4 2020-05-30 10
5 2020-06-15 20
6 2020-06-20 20
To compute your "shifted" resample, first shift the Month column so that the 15th day of each month becomes the 1st:
df.Month = df.Month - pd.Timedelta('14D')
and then resample:
res = df.resample('M', on='Month').mean()
The result is:
Amount
Month
2020-04-30 1
2020-05-31 10
2020-06-30 20
If you want, change dates in the index to month periods:
res.index = res.index.to_period('M')
Then the result will be:
Amount
Month
2020-04 1
2020-05 10
2020-06 20
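Consolidating the steps above into a single sketch (using assign so the original df is not mutated in place):

res = (df.assign(Month=df['Month'] - pd.Timedelta('14D'))
         .resample('M', on='Month').mean())
res.index = res.index.to_period('M')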
Edit: Not a working solution for OP's request. See short discussion in the comments.
Interesting problem. I suggest resampling with 'SMS' - semi-month start frequency (1st and 15th). Instead of keeping just the mean values, keep the count and sum values, and recalculate the weighted mean for each monthly period from its two sub-periods (for example, 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).
The advantage here is that, unlike with an (improper use of an) offset, we are certain each period always runs from the 15th of a month to the 14th of the next month.
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm
Integers
sum count
Month
2018-01-01 876 16
2018-01-15 864 16
2018-02-01 412 10
2018-02-15 626 12
...
2018-12-01 492 10
2018-12-15 638 16
Take a rolling sum and a rolling count over each pair of adjacent half-months, then compute the mean from them:
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_sm
            Integers       sum_rolling  count_rolling       mean
                 sum count
Month
2018-01-01 876 16 NaN NaN NaN
2018-01-15 864 16 1740.0 32.0 54.375000
2018-02-01 412 10 1276.0 26.0 49.076923
2018-02-15 626 12 1038.0 22.0 47.181818
...
2018-12-01 492 10 1556.0 27.0 57.629630
2018-12-15 638 16 1130.0 26.0 43.461538
Now, just filter the odd indices of df_sm:
df_sm.iloc[1::2]['mean']
Month
2018-01-15 54.375000
2018-02-15 47.181818
2018-03-15 51.000000
2018-04-15 44.897436
2018-05-15 52.450000
2018-06-15 33.722222
2018-07-15 41.277778
2018-08-15 46.391304
2018-09-15 45.631579
2018-10-15 54.107143
2018-11-15 58.058824
2018-12-15 43.461538
Freq: 2SMS-15, Name: mean, dtype: float64
The code:
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm.iloc[1::2]['mean']
Edit: Changed a name of one of the columns to make it clearer
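As an aside, the 15th-of-month rows can also be selected by label instead of by position; a sketch:

df_out = df_sm['mean'][df_sm.index.day == 15]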
I am trying to calculate a 1-year rolling mean in the pandas dataframe below. 'mean_1year' is calculated over the 1 year preceding a row's month and year.
For example, the month and year of the first row are '05' and '2016', so its 'mean_1year' is the average 'price' from '2016-04' back to '2015-04', i.e. (1300+1400+1500)/3 = 1400. While calculating this average, a filter also has to be applied on the "type" column: as the "type" of the first row is "A", the rows are filtered on type == "A" before the average over '2016-04' back to '2015-04' is computed.
type year month price mean_1year
A 2016 05 1200 1400
A 2016 04 1300
A 2016 01 1400
A 2015 12 1500
Any suggestions would be appreciated. Thanks !
First you need a datetime index in ascending order so you can apply a rolling time period calculation.
df['date'] = pd.to_datetime(df['year'].astype('str')+'-'+df['month'].astype('str'))
df = df.set_index('date')
df = df.sort_index()
Then group by type and apply the rolling mean.
df['mean_1year'] = df.groupby('type')['price'].rolling('365D').mean().reset_index(0,drop=True)
The result is:
type year month price mean_1year
date
2015-12-01 A 2015 12 1500 1500.0
2016-01-01 A 2016 1 1400 1450.0
2016-04-01 A 2016 4 1300 1400.0
2016-05-01 A 2016 5 1200 1350.0
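Note that this window includes the current row, so each value contributes to its own mean. If you want to exclude the current row (matching the expected output in the question), time-based rolling also accepts closed='left'; a sketch under that assumption:

df['mean_1year'] = (df.groupby('type')['price']
                      .rolling('365D', closed='left').mean()
                      .reset_index(0, drop=True))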
"Ordinary" rolling can't be applied, because it:
includes rows starting from the current row, whereas you want
to exclude it,
the range of the window expands into the future,
whereas you want to expand it back.
So I used different approach, based on loc with suitable
date slices.
As a test DataFrame I used:
type year month price
0 A 2016 5 1200
1 A 2016 4 1300
2 A 2016 1 1400
3 A 2015 12 1500
4 B 2016 5 1200
5 B 2016 4 1300
And the code is as follows:
Compute date offsets of 12 months and 1 day:
yearOffs = pd.offsets.DateOffset(months=12)
dayOffs = pd.offsets.DateOffset(days=1)
They will be needed in the loc slices later.
Set the index to a datetime, derived from year and
month columns:
df.set_index(pd.to_datetime(df.year.astype(str)
+ df.month.astype(str), format='%Y%m'), inplace=True)
Define the function to compute means within the current
group:
def myMeans(grp):
wrk = grp.sort_index()
return wrk.apply(lambda row: wrk.loc[row.name - yearOffs
: row.name - dayOffs, 'price'].mean(), axis=1)
Compute the means:
means = df.groupby('type').apply(myMeans).swaplevel()
So far the result is:
type
2015-12-01 A NaN
2016-01-01 A 1500.0
2016-04-01 A 1450.0
2016-05-01 A 1400.0
2016-04-01 B NaN
2016-05-01 B 1300.0
dtype: float64
but df has a single-level index with non-unique values. So to add the means to df and drop the now-unnecessary index, the last step is:
df = df.set_index('type', append=True).assign(mean_1year=means)\
.reset_index(level=1).reset_index(drop=True)
The final result is:
type year month price mean_1year
0 A 2016 5 1200 1400.0
1 A 2016 4 1300 1450.0
2 A 2016 1 1400 1500.0
3 A 2015 12 1500 NaN
4 B 2016 5 1200 1300.0
5 B 2016 4 1300 NaN
For the "earliest" rows in each group the result is NaN, as there are no earlier source rows to compute the means from (so there is apparently something wrong in the other solution).
I would like to calculate the mean of a timedelta series, excluding the 00:00:00 values.
This is my time series:
1 00:28:00
3 01:57:00
5 00:00:00
7 01:27:00
9 00:00:00
11 01:30:00
I tried to replace rows 5 and 9 with NaN and then apply .mean() to the series; mean() skips NaN values, so I would get the desired result.
How can I do that?
I am trying:
`df["time_column"].replace('0 days 00:00:00', np.NaN).mean()`
but no values are replaced.
One idea is to use a zero Timedelta object:
out = df["time_column"].replace(pd.Timedelta(0), np.NaN).mean()
print (out)
0 days 01:20:30
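Alternatively, filter the zeros out with a boolean mask instead of replacing them; a sketch:

out = df.loc[df["time_column"] != pd.Timedelta(0), "time_column"].mean()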
I have separate columns for start (timestamp) and end (timestamp), and I need to get the earliest start time and the latest end time for each date.
number start end test time
0 1 2020-02-01 06:27:38 2020-02-01 08:29:42 1 02:02:04
1 1 2020-02-01 08:41:03 2020-02-01 11:05:30 2 02:24:27
2 1 2020-02-01 11:20:22 2020-02-01 13:03:49 1 01:43:27
3 1 2020-02-01 13:38:18 2020-02-01 16:04:31 2 02:26:13
4 1 2020-02-01 16:26:46 2020-02-01 17:42:49 1 01:16:03
5 1 2020-02-02 10:11:00 2020-02-02 12:11:00 1 02:00:00
I want the output for each date as: Date, Min, Max.
I'm fairly new to Pandas, and most of the solutions I've come across find the min and max datetime from a single column, while what I want is the min and max datetime for each date, with the timestamps spread over two columns.
Expected output (please ignore the date and time formats):
date min max
1/2/2020 6:27 17:42
2/2/2020 10:11 12:11
I believe you need to start by creating a date column and then perform a groupby on it.
df['date'] = df['start'].dt.date
df['start_hm'] = df['start'].dt.strftime('%H:%M')
df['end_hm'] = df['end'].dt.strftime('%H:%M')
output = df.groupby('date').agg(min=pd.NamedAgg(column='start_hm', aggfunc='min'),
                                max=pd.NamedAgg(column='end_hm', aggfunc='max'))
Output:
min max
date
2020-02-01 06:27 17:42
2020-02-02 10:11 12:11
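If you prefer to aggregate on the raw timestamps and format only at the end, the same grouping works directly on the start/end columns; a sketch:

out = (df.groupby(df['start'].dt.date)
         .agg(min=('start', 'min'), max=('end', 'max')))
out['min'] = out['min'].dt.strftime('%H:%M')  # keep only hour:minute for display
out['max'] = out['max'].dt.strftime('%H:%M')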
The dataset is a daily time series of 9 variables. I have loaded it as follows:
Data = pd.read_csv('city10.csv', header = None)
Data['Date'] = pd.date_range(start='1/1/1951', periods=len(Data), freq='D')
Data.set_index('Date', inplace=True)
It looks like this
Date 0 1 2 3 ... 5 6 7 8
1951-01-01 28.361 0.0 131.24 405.39 ... 405.39 38.284 0.187010 -1.23550
1951-01-02 27.874 0.0 113.74 409.56 ... 409.56 49.834 0.066903 -1.44770
... ... ... ... ... ... ... ... ...
2005-12-16 27.921 0.0 104.99 429.78 ... 429.78 47.529 -1.814300 -5.47720
2005-12-17 27.918 0.0 112.11 425.32 ... 425.32 46.541 -3.314000 -4.02050
After this I computed the monthly mean over the entire dataset, i.e.
Data.groupby(Data.index.month).mean()
The result is
0 1 2 ... 6 7 8
1 29.619322 0.215978 108.621532 ... 45.868395 -0.234236 -1.865947
2 32.404500 0.290335 95.270385 ... 43.443624 0.554149 -2.360776
3 35.131266 0.364438 78.907920 ... 42.065113 1.458203 -2.636451
4 36.631282 0.998401 53.663939 ... 44.239469 3.146849 -2.193416
5 36.823308 2.113330 37.917831 ... 54.287356 5.241153 -0.694375
6 34.444513 2.195926 35.315554 ... 67.840239 6.393643 0.689087
7 32.951826 3.567160 32.466668 ... 82.347247 6.583195 1.183262
8 32.644236 4.053641 36.379228 ... 85.056697 5.102383 0.005426
9 32.205442 4.885259 50.595568 ... 80.335829 2.413891 -0.578568
10 30.448266 5.748111 79.575731 ... 67.582589 -0.769297 -0.614057
11 28.748315 4.350384 100.293532 ... 53.418955 -1.258580 -1.023143
12 28.155611 1.524177 109.510292 ... 51.317731 -0.936495 -1.549105
Now, how do I subtract the mean of each month from the respective values of that month in each year?
For example, suppose the January mean over the 1951-2005 time series is 20.25. This mean has to be subtracted from the daily values of every January.
How can I do this?
Original answer -- difference between data and this month's average
I would use pandas to complete this task, as it makes it easy to aggregate by date.
First, let's make an example data frame and add a month column.
In [44]: import datetime
In [45]: import pandas as pd
In [46]: import numpy as np
In [47]: start = datetime.datetime(2011, 1, 1)
In [48]: end = datetime.datetime(2012, 1, 1)
In [49]: df = pd.DataFrame({'date':pd.date_range(start, periods=1000, freq='D'), 'x':np.random.normal(5,1,1000)})
In [86]: df['month'] = df.date.dt.month
In [87]: df.head()
Out[87]:
date x month
0 2011-01-01 5.139113 1
1 2011-01-02 3.774586 1
2 2011-01-03 6.095986 1
3 2011-01-04 5.037072 1
4 2011-01-05 5.871760 1
Now we can create a new data frame that contains the monthly averages using resample and mean.
In [58]: monthly_mean = df.resample('M', on='date').mean()
In [59]: monthly_mean.head()
Out[59]:
x
date
2011-01-31 4.702853
2011-02-28 5.088545
2011-03-31 5.261777
2011-04-30 4.982984
2011-05-31 4.791729
Next, we need to join the two data frames together to line up the data with the monthly averages. To make this easier, I will create a year and month column in each data frame that will be used in the join/merge.
In [60]: df['month'] = df.date.dt.month
In [61]: monthly_mean['month'] = monthly_mean.index.month
In [62]: df['year'] = df.date.dt.year
In [63]: monthly_mean['year'] = monthly_mean.index.year
In [64]: df_joined = pd.merge(df, monthly_mean, how='left', on=('year', 'month'))
In [65]: df_joined.head()
Out[65]:
date x_x month year x_y
0 2011-01-01 5.388197 1 2011 4.702853
1 2011-01-02 6.442878 1 2011 4.702853
2 2011-01-03 5.979076 1 2011 4.702853
3 2011-01-04 2.846689 1 2011 4.702853
4 2011-01-05 5.103524 1 2011 4.702853
Finally, the new column can be constructed by subtracting columns.
In [66]: df_joined['month_diff'] = df_joined.x_x - df_joined.x_y
In [67]: df_joined.head()
Out[67]:
date x_x month year x_y month_diff
0 2011-01-01 5.388197 1 2011 4.702853 0.685344
1 2011-01-02 6.442878 1 2011 4.702853 1.740025
2 2011-01-03 5.979076 1 2011 4.702853 1.276223
3 2011-01-04 2.846689 1 2011 4.702853 -1.856164
4 2011-01-05 5.103524 1 2011 4.702853 0.400670
EDIT: If you want the difference from the historic monthly averages, make the following changes.
Group by the month column and aggregate to get the historic monthly averages.
In [88]: monthly_mean = df.groupby('month')[['x']].mean().reset_index()
Now the process proceeds as before: join, this time just on 'month', and compute the difference.
In [90]: df_joined = pd.merge(df, monthly_mean, how='left', on='month')
In [91]: df_joined.head()
Out[91]:
date x_x month x_y
0 2011-01-01 5.139113 1 4.972604
1 2011-01-02 3.774586 1 4.972604
2 2011-01-03 6.095986 1 4.972604
3 2011-01-04 5.037072 1 4.972604
4 2011-01-05 5.871760 1 4.972604
In [92]: df_joined['month_diff'] = df_joined.x_x - df_joined.x_y
In [93]: df_joined.head()
Out[93]:
date x_x month x_y month_diff
0 2011-01-01 5.139113 1 4.972604 0.166509
1 2011-01-02 3.774586 1 4.972604 -1.198018
2 2011-01-03 6.095986 1 4.972604 1.123382
3 2011-01-04 5.037072 1 4.972604 0.064468
4 2011-01-05 5.871760 1 4.972604 0.899156
Thank you, everyone. I was able to solve the problem. I hope it is right.
Anomaly_Values = Data.sub(Data.groupby([Data.index.month]).transform('mean'))
Let me know if there is any problem in the solution.
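A quick sanity check (a sketch): the anomalies computed this way should average to approximately zero within each calendar month over the full record.

check = Anomaly_Values.groupby(Anomaly_Values.index.month).mean()
print(check.abs().max())   # should be close to 0 for every column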