I have a list of account numbers and transaction dates. I would like to calculate the variance in transaction dates per account number. So if there are 10 transactions with dates on one account I would like to know the interval variance. For the amounts in the list I calculated several statistics via groupby:
df.groupby('AcctNr').agg({'Amount': [np.count_nonzero, np.sum, np.min, np.max, np.std, np.mean], 'Date': [np.min, np.max]})
I succeeded in the min and max date per account number but I can't calculate the variance in intervals.
I think you are looking for numpy.var.
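To get the variance of the intervals rather than of the dates themselves, one option is to take the per-account difference between consecutive dates first and then compute the variance. The following is only a sketch: the sample data and the IntervalDays column are made up for illustration, and only AcctNr and Date come from the question. Note that pandas' var uses ddof=1 by default, while numpy.var uses ddof=0.

```python
import pandas as pd

# Hypothetical sample data; only the column names AcctNr and Date come from the question.
df = pd.DataFrame({
    "AcctNr": [1, 1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime([
        "2020-01-01", "2020-01-03", "2020-01-07", "2020-01-08",
        "2020-02-01", "2020-02-11", "2020-02-21",
    ]),
})

# Sort so that diff() compares each transaction with the previous one of the same account.
df = df.sort_values(["AcctNr", "Date"])

# Gap in days to the previous transaction (NaN for the first transaction of each account).
df["IntervalDays"] = df.groupby("AcctNr")["Date"].diff().dt.total_seconds() / 86400

# Sample variance (ddof=1) of the gaps per account; NaN gaps are skipped.
interval_var = df.groupby("AcctNr")["IntervalDays"].var()
print(interval_var)
```

Here account 1 has gaps of 2, 4 and 1 days, and account 2 has two gaps of 10 days each, so its interval variance is zero.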
Related
I am new to Python and am using it to analyse climate data in NetCDF. I want to calculate the total precipitation for each season in each year and then average these seasonal totals across the time period (i.e. an average for DJF over all years in the file, an average for MAM, etc.).
Here is what I thought to do:
fn1 = 'cru_fixed.nc'
ds1 = xr.open_dataset(fn1)
ds1_season = ds1['pre'].groupby('time.season').mean('time')
#Then plot each season
ds1_season.plot(col='season')
plt.show()
The original file contains monthly totals of precipitation. This is calculating an average for each season and I need the sum of Dec, Jan and Feb and the sum of Mar, Apr, May etc. for each season in each year. How do I sum and then average over the years?
If I'm not mistaken, you first need to resample your data to get the sum of each season in a DataArray, and then average these sums over the years.
To resample:
sum_of_seasons = ds1['pre'].resample(time='Q').sum(dim="time")
resample is an operator for upsampling or downsampling time series; it uses pandas time offsets.
However, be careful to choose the right offset, because it defines which months are included in each season. Depending on your needs, you may want to use "Q", "QS" or an anchored offset like "QS-DEC".
To get the same split as "time.season", the offset should be "QS-DEC", I believe.
Then to group over multiple years, same as you did above:
result = sum_of_seasons.groupby('time.season').mean('time')
I am trying to calculate night-time averages of a dataframe except that what I need is a mix between daily average and hour range average.
More specifically, I have a dataframe storing day and night hours and I want to use it as a boolean key to calculate night-time averages of another dataframe.
I cannot use daily averages because each night spreads over two calendar days, and I cannot use by hour range either because hours change by season.
Thanks for your help!
Dariush.
Based on comments received here is what I am looking for - see spreadsheet below. I need to calculate the average of 'Value' during nighttime using the Nighttime flag, and then repeat the average value for all time stamps until the following night, at which time the average is updated and repeated until the next nighttime flag.
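One possible way to do this, as a sketch only: number each run of consecutive night-time rows, average Value within each run, and map that average back onto every row from the start of that night until the next night begins. The data, the NightAvg column and the helper names are hypothetical. I am also assuming that the average of a night may already be shown during that night (it uses all values of that night); rows before the first night get NaN.

```python
import pandas as pd

# Hypothetical hourly data; 'Value' and 'Nighttime' follow the wording of the question.
idx = pd.date_range("2021-01-01 18:00", periods=10, freq="h")
df = pd.DataFrame({
    "Value": [5.0, 6.0, 1.0, 2.0, 3.0, 7.0, 8.0, 10.0, 20.0, 9.0],
    "Nighttime": [False, False, True, True, True, False, False, True, True, False],
}, index=idx)

night = df["Nighttime"]

# A night starts where the flag is True and the previous row was not night.
night_start = night & ~night.shift(fill_value=False)

# Running count of nights seen so far: 0 before the first night, then 1, 2, ...
night_id = night_start.cumsum()

# Mean of 'Value' over the flagged rows of each night.
night_means = df.loc[night, "Value"].groupby(night_id[night]).mean()

# Repeat each night's mean until the next night starts (id 0 has no mean, so NaN).
df["NightAvg"] = night_id.map(night_means)
print(df)
```

Because the nights are identified from the flag itself, this does not depend on calendar days or fixed hour ranges.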
I have a sequence of timestamps (in Unix milliseconds timebase) stored in a pandas Series. Each timestamp belongs to a sensor measurement. To get the sampling frequency I can just subtract the first timestamp from the last one and then divide by the number of timestamps:
# assuming df is my Series
sf = (df.iloc[-1] - df.iloc[0]) / len(df)
But this does not provide me insights about the variation of the sampling frequency.
How can I calculate the standard deviation of the sampling frequency?
If you have the timestamps stored in numerical form, I'd propose simply checking the std of the interval between two timestamps.
In your example:
df.diff().std()
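A small worked example, with made-up millisecond timestamps, shows what this gives. The variable names here are only for illustration. Strictly speaking, diff() yields the sampling intervals; if you want the spread of the frequency itself, you can take the reciprocal of each interval first.

```python
import pandas as pd

# Hypothetical Unix timestamps in milliseconds, nominally 100 ms apart.
ts = pd.Series([0, 100, 200, 310, 400, 500])

# Gaps between consecutive timestamps in ms: 100, 100, 110, 90, 100.
intervals = ts.diff().dropna()

mean_interval = intervals.mean()   # average sampling interval in ms
interval_std = intervals.std()     # sample standard deviation (ddof=1) in ms

# Per-interval frequency in Hz and its spread.
mean_freq_hz = 1000.0 / mean_interval
freq_std_hz = (1000.0 / intervals).std()

print(mean_interval, interval_std, mean_freq_hz, freq_std_hz)
```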
I have a DataFrame series with day resolution. I want to transform the series into a series of monthly averages. Of course I can apply a rolling mean and select only every 30th mean, but that would not be precise. I want to get a series which contains the mean value of the previous month on every first day of a month. For example, on February 1 I want to have the daily average for January. How can I do this in a pythonic way?
data.resample('M').mean()
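To also place each monthly mean on the first day of the following month, as asked, one option is to resample with month-start labels and then move the labels forward by one month. This is a sketch with made-up data; the name monthly is only for illustration.

```python
import pandas as pd

# Hypothetical daily data for Jan-Mar 2021: every day carries its month number.
idx = pd.date_range("2021-01-01", "2021-03-31", freq="D")
data = pd.Series(idx.month.values.astype(float), index=idx)

# Monthly mean, labelled with the first day of the month it covers.
monthly = data.resample("MS").mean()

# Move each label to the first day of the following month,
# so the January mean sits on February 1, and so on.
monthly.index = monthly.index + pd.DateOffset(months=1)
print(monthly)
```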
How would I use pandas to calculate a cumulative deviation from a mean monthly rainfall value?
I am given daily rainfall data (e.g. s, below) which I can convert to a pd.Series and resample into monthly periods (sum; e.g. sm, below). But I then want to calculate the difference between each monthly value and the mean for the month. I have added a synthetic example:
rng = pd.period_range(20010101, 20131231, freq='D')
s = pd.Series(np.random.normal(2.5,2,size=len(rng)), index=rng)
sm = s.resample('M').sum()
For example, for January 2010 I would like to calculate the difference between the value for that month and the average monthly rainfall for January (over a long period). Then I want a cumulative sum of that difference.
I have tried to use the groupby function:
sm.groupby(lambda x: x.month).mean()
But not successfully. I want the average for all similar months to be subtracted from each monthly value in 'sm', and then a cumulative sum of that series to be created. This could be done in one step, I guess.
How could I achieve this efficiently?
Thanks
This is closely related to an example in the docs. This is untested code, but you want something like this:
monthly_rainfall = daily_rainfall.resample('M').sum()
To group all Januarys over all the years together (and so on for each month):
grouped = monthly_rainfall.groupby(lambda x: x.month)
Then
deviation = grouped.transform(lambda x: x - x.mean())
deviation.cumsum()
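Put together, the steps look roughly like this. It is a sketch on synthetic data similar to the question's example, using a DatetimeIndex with month-start labels rather than the PeriodIndex from the question. The names gen, days and cumulative_deviation are made up here.

```python
import numpy as np
import pandas as pd

# Synthetic daily rainfall for 2001-2013 (seeded so the run is repeatable).
gen = np.random.default_rng(0)
days = pd.date_range("2001-01-01", "2013-12-31", freq="D")
daily_rainfall = pd.Series(gen.normal(2.5, 2, size=len(days)), index=days)

# Monthly totals, labelled with the first day of each month.
monthly_rainfall = daily_rainfall.resample("MS").sum()

# Group all Januaries together, all Februaries together, and so on.
grouped = monthly_rainfall.groupby(monthly_rainfall.index.month)

# Subtract the long-term mean of the matching calendar month from each monthly total.
deviation = monthly_rainfall - grouped.transform("mean")

# Running sum of the deviations, in chronological order.
cumulative_deviation = deviation.cumsum()
print(cumulative_deviation.tail())
```

Since the deviations of each calendar month sum to zero over the full period, the cumulative series ends at (numerically) zero when every month occurs the same number of times.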