Count values greater than threshold and assign to appropriate year pandas - python

I have a dataframe that looks like this:
Date DFW
242 2000-05-01 00:00:00 75.92
243 2000-05-01 12:00:00 75.02
244 2000-05-02 00:00:00 71.96
245 2000-05-02 12:00:00 75.92
246 2000-05-03 00:00:00 71.96
... ... ...
14991 2020-07-09 12:00:00 93.90
14992 2020-07-10 00:00:00 91.00
14993 2020-07-10 12:00:00 93.00
14994 2020-07-11 00:00:00 89.10
14995 2020-07-11 12:00:00 97.00
The df contains the max value of temperature for a specific location every 12 hours from May - July 11 during 2000-2020. I want to count the number of times that the value is >90 and then store that value in a column where the row is the year. Should I use groupby to accomplish this?
Expected output:
Year count
2000 x
2001 y
... ...
2019 z
2020 a

You can do with groupby:
# extract the years from dates
years = df['Date'].dt.year
# compare `DFW` with `90`
# gt90 will be just True or False
gt90 = df['DFW'].gt(90)
# sum the `True` by years
output = gt90.groupby(years).sum()
# set the years as normal column:
output = output.reset_index()
All that in one line:
df['DFW'].gt(90).groupby().sum().reset_index()

One possible approach is to extract and create a new column for year (let's say "year") and then,
df[df['DFW'] > 90].groupby('year').count().reset_index()

Related

monthly resampling pandas with specific start day

I'm creating a pandas DataFrame with random dates and random integers values and I want to resample it by month and compute the average value of integers. This can be done with the following code:
def random_dates(start='2018-01-01', end='2019-01-01', n=300):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')
dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})
print(df.resample('M', on='Month').mean())
The thing is that the resampled months always starts from day one and I want all months to start from day 15. I'm using pandas 1.1.4 and I've tried using origin='15/01/2018' or offset='15' and none of them works with 'M' resample rule (they do work when I use 30D but it is of no use). I've also tried to use '2SM'but it also doesn't work.
So my question is if is there a way of changing the resample rule or I will have to add an offset in my data?
Assume that the source DataFrame is:
Month Amount
0 2020-05-05 1
1 2020-05-14 1
2 2020-05-15 10
3 2020-05-20 10
4 2020-05-30 10
5 2020-06-15 20
6 2020-06-20 20
To compute your "shifted" resample, first shift Month column so that
the 15-th day of month becomes the 1-st:
df.Month = df.Month - pd.Timedelta('14D')
and then resample:
res = df.resample('M', on='Month').mean()
The result is:
Amount
Month
2020-04-30 1
2020-05-31 10
2020-06-30 20
If you want, change dates in the index to month periods:
res.index = res.index.to_period('M')
Then the result will be:
Amount
Month
2020-04 1
2020-05 10
2020-06 20
Edit: Not a working solution for OP's request. See short discussion in the comments.
Interesting problem. I suggest to resample using 'SMS' - semi-month start frequency (1st and 15th). Instead of keeping just the mean values, keep the count and sum values and recalculate the weighted mean for each monthly period by its two sub-period (for example: 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).
The advantages here is that unlike with an (improper use of an) offset, we are certain we always start on the 15th of the month till the 14th of the next month.
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm
Integers
sum count
Month
2018-01-01 876 16
2018-01-15 864 16
2018-02-01 412 10
2018-02-15 626 12
...
2018-12-01 492 10
2018-12-15 638 16
Rolling sum and rolling count; Find the mean out of them:
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_sm
Integers count_sum count_rolling mean
sum count
Month
2018-01-01 876 16 NaN NaN NaN
2018-01-15 864 16 1740.0 32.0 54.375000
2018-02-01 412 10 1276.0 26.0 49.076923
2018-02-15 626 12 1038.0 22.0 47.181818
...
2018-12-01 492 10 1556.0 27.0 57.629630
2018-12-15 638 16 1130.0 26.0 43.461538
Now, just filter the odd indices of df_sm:
df_sm.iloc[1::2]['mean']
Month
2018-01-15 54.375000
2018-02-15 47.181818
2018-03-15 51.000000
2018-04-15 44.897436
2018-05-15 52.450000
2018-06-15 33.722222
2018-07-15 41.277778
2018-08-15 46.391304
2018-09-15 45.631579
2018-10-15 54.107143
2018-11-15 58.058824
2018-12-15 43.461538
Freq: 2SMS-15, Name: mean, dtype: float64
The code:
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm[1::2]['mean']
Edit: Changed a name of one of the columns to make it clearer

Rolling mean of 1 year after filtering of data in pandas dataframe

I am trying to calculate the rolling mean for 1 year in the below pandas dataframe. 'mean_1year' for the below dataframe is calcualted using the 1 year calculation based
on month and year.
For example, month and year of first row in the below dataframe is '05' and '2016'. Hence 'mean_1year' is calculated using average 'price' of '2016-04' back to '2015-04'.Hence it
would be (1300+1400+1500)/3 = 1400. Also, while calculating this average, a filter has to be made on the "type" column. As the "type" of first row is "A", while calculating "mean_1year",
the rows have to be filtered on type=="A" and the average is computed using '2016-04' back to '2015-04'.
type year month price mean_1year
A 2016 05 1200 1400
A 2016 04 1300
A 2016 01 1400
A 2015 12 1500
Any suggestions would be appreciated. Thanks !
First you need a datetime index in ascending order so you can apply a rolling time period calculation.
df['date'] = pd.to_datetime(df['year'].astype('str')+'-'+df['month'].astype('str'))
df = df.set_index('date')
df = df.sort_index()
Then you groupby type and apply the rolling mean.
df['mean_1year'] = df.groupby('type')['price'].rolling('365D').mean().reset_index(0,drop=True)
The result is:
type year month price mean_1year
date
2015-12-01 A 2015 12 1500 1500.0
2016-01-01 A 2016 1 1400 1450.0
2016-04-01 A 2016 4 1300 1400.0
2016-05-01 A 2016 5 1200 1350.0
"Ordinary" rolling can't be applied, because it:
includes rows starting from the current row, whereas you want
to exclude it,
the range of the window expands into the future,
whereas you want to expand it back.
So I used different approach, based on loc with suitable
date slices.
As a test DataFrame I used:
type year month price
0 A 2016 5 1200
1 A 2016 4 1300
2 A 2016 1 1400
3 A 2015 12 1500
4 B 2016 5 1200
5 B 2016 4 1300
And the code is as follows:
Compute date offsets of 12 months and 1 day:
yearOffs = pd.offsets.DateOffset(months=12)
dayOffs = pd.offsets.DateOffset(days=1)
Will be needed in loc later.
Set the index to a datetime, derived from year and
month columns:
df.set_index(pd.to_datetime(df.year.astype(str)
+ df.month.astype(str), format='%Y%m'), inplace=True)
Define the function to compute means within the current
group:
def myMeans(grp):
wrk = grp.sort_index()
return wrk.apply(lambda row: wrk.loc[row.name - yearOffs
: row.name - dayOffs, 'price'].mean(), axis=1)
Compute the means:
means = df.groupby('type').apply(myMeans).swaplevel()
So far the result is:
type
2015-12-01 A NaN
2016-01-01 A 1500.0
2016-04-01 A 1450.0
2016-05-01 A 1400.0
2016-04-01 B NaN
2016-05-01 B 1300.0
dtype: float64
but df has a single level index, with non-unique values.
So to add means to df and drop now unnecessary index,
the last step is:
df = df.set_index('type', append=True).assign(mean_1year=means)\
.reset_index(level=1).reset_index(drop=True)
The final result is:
type year month price mean_1year
0 A 2016 5 1200 1400.0
1 A 2016 4 1300 1450.0
2 A 2016 1 1400 1500.0
3 A 2015 12 1500 NaN
4 B 2016 5 1200 1300.0
5 B 2016 4 1300 NaN
For the "earliest" rows in each group the result is NaN,
as there are no source (earlier) rows to compute the means
for them (so there is apparently something wrong in the other solution).

How to loop through a pandas grouped time series?

I have a dataframe like this:
datetime type d13C ... dayofyear week dmy
1 2018-01-05 15:22:30 air -8.88 ... 5 1 5-1-2018
2 2018-01-05 15:23:30 air -9.08 ... 5 1 5-1-2018
3 2018-01-05 15:24:30 air -10.08 ... 5 1 5-1-2018
4 2018-01-05 15:25:30 air -9.51 ... 5 1 5-1-2018
5 2018-01-05 15:26:30 air -9.61 ... 5 1 5-1-2018
... ... ... ... ... ... ...
341543 2018-12-17 12:42:30 air -9.99 ... 351 51 17-12-2018
341544 2018-12-17 12:43:30 air -9.53 ... 351 51 17-12-2018
341545 2018-12-17 12:44:30 air -9.54 ... 351 51 17-12-2018
341546 2018-12-17 12:45:30 air -9.93 ... 351 51 17-12-2018
341547 2018-12-17 12:46:30 air -9.66 ... 351 51 17-12-2018
Full data here: https://drive.google.com/file/d/1KmOwnpvrG2Edz1AlLyD0CKZlBpaFervM/view?usp=sharing
I'm plotting d13C column on the Y-axis and inverse total_co2 on the X and then fitting a regression line for each day in the data. I then filter out and store the dates I want depending on if the r^2 value of the regression line is > 0.8 like this:
import pandas as pd
from numpy.polynomial.polynomial import polyfit
import numpy as np
from scipy import stats
df = pd.read_csv('dataset.txt', usecols = ['datetime', 'type', 'total_co2', 'd13C', 'day','month','year','dayofyear','week','hour'], dtype = {'total_co2':
np.float64, 'd13C':np.float64, 'day':str, 'month':str, 'year':str,'week':str, 'hour': str, 'dayofyear':str})
df['dmy'] = df['day'] +'-'+ df['month'] +'-'+ df['year'] # adding a full date column to make it easir to filter through
# the rows, ie. each day
# window18 = df[((df['year']=='2018'))] # selecting just the data from the year 2018
accepted_dates_list = [] # creating an empty list to store the dates that we're interested in
for d in df['dmy'].unique(): # this will pass through each day, the .unique() ensures that it doesnt go over the same days
acceptable_date = {} # creating a dictionary to store the valid dates
period = df[df.dmy==d] # defining each period from the dmy column
p = (period['total_co2'])**-1
q = period['d13C']
c,m = polyfit(p,q,1) # intercept and gradient calculation of the regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(p, q) # getting some statistical properties of the regression line
if r_value**2 >= 0.8:
acceptable_date['period'] = d # populating the dictionary with the accpeted dates and corresponding other values
acceptable_date['r-squared'] = r_value**2
acceptable_date['intercept'] = intercept
accepted_dates_list.append(acceptable_date) # sending the valid stuff in the dictionary to the list
else:
pass
accepted_dates18 = pd.DataFrame(accepted_dates_list) # converting the list to a df
print(accepted_dates18)
But now I want to do the same thing, just over three day periods which I'm trying to select from the day of year column (unsure if this is the best way or not). For example, I would want to fit the regression line using all the rows with dayofyear=5, dayofyear=6, dayofyear=7, then for the next three days until the end of the data. There are some days missing, but essentially I just need to do this for every 3 days in the data.
The output dataframe I am then trying to get would have the list of the three day intervals with the r^2 >0.8, so anything like this that will show the valid date range:
Accepted dates
0 23-08-2018 - 25-08-2018
1 26-08-2018 - 28-08-2018
2 31-08-2018 - 02-09-2018
3 15-09-2018 - 17-09-2018
4 24-09-2018 - 26-09-2018
I'm not too sure what to do to iterate over every three days. Any help would go a long way, thanks!
Your code loops through a list of unique dates and filters the dataframe on each iteration.
Pandas implemented this with df.groupby(). It can be used to loop and get each group or it can be combined with aggregations, function applications, and transformations. You can read more about it on the user guide. This function can return groups according to any the columns (or set of columns) in df, levels of the index, or any other exogenous list-like with the same length as df (we are grouping rows, but note it can also group columns). It even has implementations for the most common statistical aggregations like mean, stdev, and corr, among many others.
Now to your problem. You not only want the correlation but the equation, so you do need to loop. And to get three-day groups you can use that dayofyear column with a twist.
Take this data
import io
fo = io.StringIO(
'''datetime,d13C
2018-01-05 15:22:30,-8.88
2018-01-05 15:23:30,-9.08
2018-01-06 15:24:30,-10.0
2018-01-06 15:25:30,-9.51
2018-01-07 15:26:30,-9.61
2018-01-07 15:27:30,-9.61
2018-01-08 15:28:30,-9.61
2018-01-08 15:29:30,-9.61
2018-01-09 15:26:30,-9.61
2018-01-09 15:27:30,-9.61
''')
df = pd.read_csv(fo)
df.datetime = pd.to_datetime(df.datetime)
fo.close()
With the code for grouping and looping
first_day = 5
days_to_group = 3
for doy, gdf in df.groupby((df.datetime.dt.dayofyear.sub(first_day) // days_to_group)
* days_to_group + first_day):
print(gdf, '\n')
print(doy, '\n')
Output
datetime d13C
0 2018-01-05 15:22:30 -8.88
1 2018-01-05 15:23:30 -9.08
2 2018-01-06 15:24:30 -10.00
3 2018-01-06 15:25:30 -9.51
4 2018-01-07 15:26:30 -9.61
5 2018-01-07 15:27:30 -9.61
5
datetime d13C
6 2018-01-08 15:28:30 -9.61
7 2018-01-08 15:29:30 -9.61
8 2018-01-09 15:26:30 -9.61
9 2018-01-09 15:27:30 -9.61
8
Now you can plug your code into this loop and get what you need.
PS
You can also use df.datetime.dt.floor('3d') as the grouper but I am not aware of how to control the first_day, so use it with caution.
Here is one approach. As I understand it, the primary goal is to get from current observations (multiple per day) to a 3-day moving average. First, I created a smaller, simpler data set:
import pandas as pd
df = pd.DataFrame({'counter': [*range(100)],
'date': pd.date_range('2020-01-01', periods=100, freq='7H')})
df = df.set_index('date')
print(df.head())
counter
date
2020-01-01 00:00:00 0
2020-01-01 07:00:00 1
2020-01-01 14:00:00 2
2020-01-01 21:00:00 3
2020-01-02 04:00:00 4
Second, I re-sampled on a daily basis:
df2 = df['counter'].resample('1D').mean() # <-- called df2
print(df2.head())
date
2020-01-01 1.5
2020-01-02 5.0
2020-01-03 8.5
2020-01-04 12.0
2020-01-05 15.5
Freq: D, Name: counter, dtype: float64
Third, I computed mean value for a 3-day moving window:
print(df2.rolling(3).mean().head())
date
2020-01-01 NaN
2020-01-02 NaN
2020-01-03 5.0
2020-01-04 8.5
2020-01-05 12.0
Freq: D, Name: counter, dtype: float64
Seems like resample().mean() and rolling().mean() would be useful in this case.

Plot frequency of dataframe value per year

I have a dataframe that contains hourly temperature data from 1990-2019 for 25 different locations. I want to count the amount of hours that a value is above or below a certain threshold and then plot that amount as a sum of the hours for every year. I know I can use a bar chart or histogram to plot, but am unsure how to aggregate the data to perform this task.
Dataframe:
time Antwerp Rotterdam ...
1990-01-01 00:00:00 2 4 ...
1990-01-01 01:00:00 3 4 ...
1990-01-01 02:00:00 2 4 ...
...
Do I need to use the groupby function?
Sample data to demonstrate:
time Antwerp Rotterdam Los Angeles
0 1990-01-01 00:00:00 0 2 15
1 1990-01-01 01:00:00 1 4 14
2 1990-01-01 02:00:00 3 5 15
3 1990-01-01 03:00:00 2 6 16
Now I am looking for the amount of hours that one city is equal to or less than 5 degrees during the year 1990. Expected output:
time Antwerp Rotterdam Los Angeles
1990 4 3 0
Ideally I would want to be able to select whatever temperature value I want.
I think you need DatetimeIndex, compare, e.g. for greater by DataFrame.gt and then count Trues values by aggregate sum:
df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')
N = 2
df = df.gt(N).groupby(df.index.year).sum()
print (df)
Antwerp Rotterdam
time
1990 0.0 1.0
1991 1.0 2.0
If want low or equal use DataFrame.le:
N = 3
df = df.le(N).groupby(df.index.year).sum()
print (df)
Antwerp Rotterdam
time
1990 1.0 0.0
1991 2.0 0.0
This is without using pandas functions.
# get the time column as a list by timelist = list(df['time'])
def get_hour_ud(df, threshold):
# timelist = list(df['time'])
# df['time'] = ['1990-01-01 00:00:00', '1990-01-01 01:00:00', '1990-01-01 02:00:00'] # remove this line
timelist = list(df['time'])
hour_list = [int(a.split(' ')[1].split(':')[0]) for a in timelist]
up_cnt = sum(a>threshold for a in hour_list)
low_cnt = sum(a<threshold for a in hour_list)
print(up_cnt)
print(low_cnt)
return up_cnt, low_cnt

Finding average data per hour based on the day of the week to simulate data for missing days

I have a set of hourly data taken from 07-Feb-19 to 17-Feb-19:
t v_amm v_alc v_no2
0 2019-02-07 08:00:00+00:00 0.320000 0.344000 1.612000
1 2019-02-07 09:00:00+00:00 0.322889 0.391778 1.580889
2 2019-02-07 10:00:00+00:00 0.209375 0.325208 2.371250
...
251 2019-02-17 19:00:00+00:00 1.082041 0.652041 0.967143
252 2019-02-17 20:00:00+00:00 0.936923 0.598654 1.048077
253 2019-02-17 21:00:00+00:00 0.652553 0.499574 1.184894
and another similar set of hourly data taken from 01-Mar-19 to 11-Mar-19:
t v_amm v_alc v_no2
0 2019-03-01 00:00:00+00:00 0.428222 0.384444 1.288222
1 2019-03-01 01:00:00+00:00 0.398600 0.359600 1.325800
2 2019-03-01 02:00:00+00:00 0.365682 0.352273 1.360000
...
244 2019-03-11 04:00:00+00:00 0.444048 0.415238 1.265000
245 2019-03-11 05:00:00+00:00 0.590698 0.591395 1.156977
246 2019-03-11 06:00:00+00:00 0.497872 0.465319 1.228298
However, there is no data available between 17-Feb-19 and 01-Mar-19.
Hence, I'd like to find the hourly average data based on the day of the week to simulate the missing hourly data between 17-Feb-19 and 01-Mar-19.
In other words, using all the hourly data from the same day of the week and find the average for each hour for that day. Expected output for 17-Feb-19 to 01-Mar-19 is something like:
t v_amm v_alc v_no2
0 2019-02-17 22:00:00+00:00 1.082041 0.652041 0.967143
1 2019-02-17 23:00:00+00:00 0.936923 0.598654 1.048077
2 2019-02-18 00:00:00+00:00 0.652553 0.499574 1.184894
...
250 2019-02-29 21:00:00+00:00 0.428222 0.384444 1.288222
251 2019-02-29 22:00:00+00:00 0.398600 0.359600 1.325800
252 2019-02-29 23:00:00+00:00 0.365682 0.352273 1.360000
Does anyone know how to obtain this in pandas?
I'd solve this problem by adding a temporary column "day_of_week". You can generate this value easily with pandas using:
df['day_of_week'] = df.t.dt.dayofweek
(pandas.DatetimeIndex.dayofweek documentation)
Next you would need to generate the average value for each weekday:
daily_mean = df.groupby(by='day_of_week').mean()
pandas.DataFrame.groupby documentation
from here on the next steps depend on which values you need. The daily_means variable has all mean values you need.
The next step would likely be to create the missing rows by generating the date values, generating the corresponding weekday and inserting the generated mean values.

Categories