I ended up figuring it out while writing out this question so I'll just post anyway and answer my own question in case someone else needs a little help.
Problem
Suppose we have a DataFrame, df, containing this data.
import pandas as pd
from io import StringIO
data = StringIO(
"""\
date spendings category
2014-03-25 10 A
2014-04-05 20 A
2014-04-15 10 A
2014-04-25 10 B
2014-05-05 10 B
2014-05-15 10 A
2014-05-25 10 A
"""
)
df = pd.read_csv(data,sep="\s+",parse_dates=True,index_col="date")
Goal
For each row, sum the spendings over every row that is within one month of it, ideally using DataFrame.rolling as it's a very clean syntax.
What I have tried
df = df.rolling("M").sum()
But this throws an exception
ValueError: <MonthEnd> is a non-fixed frequency
version: pandas==0.19.2
Use the "D" offset rather than "M" and specifically use "30D" for 30 days or approximately one month.
df = df.rolling("30D").sum()
Initially, I intuitively jumped to using "M" as I figured it stands for one month, but now it's clear why that doesn't work.
To address why you cannot use things like "AS" or "Y": the "Y" offset is not "a year", it actually references YearEnd (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases), and therefore the rolling function does not get a fixed window (e.g. you would get a 365-day window if your index falls on Jan 1, but only a 1-day window on Dec 31).
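For what it's worth, you can inspect what an alias resolves to with pandas' offset helper (a quick sanity check, not required for the fix):
import pandas as pd

# "M" resolves to the non-fixed MonthEnd offset, which .rolling() rejects,
# while "30D" resolves to a fixed 30-day window.
print(pd.tseries.frequencies.to_offset("M"))    # <MonthEnd>
print(pd.tseries.frequencies.to_offset("30D"))  # <30 * Days>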
The proposed solution (a 30D offset) works if you do not need strict calendar months. Alternatively, you can iterate over your date index and slice with an offset to get more precise control over your sum.
If you have to do it in one line (separated for readability):
df['Sum'] = [
df.loc[
edt - pd.tseries.offsets.DateOffset(months=1):edt, 'spendings'
].sum() for edt in df.index
]
spendings category Sum
date
2014-03-25 10 A 10
2014-04-05 20 A 30
2014-04-15 10 A 40
2014-04-25 10 B 50
2014-05-05 10 B 50
2014-05-15 10 A 40
2014-05-25 10 A 40
Related
I'm trying to create a forecast which takes the previous day's 'Forecast' total and adds it to the current day's 'Appt'. Something which is straightforward in Excel, but which I'm struggling with in pandas. At the moment, all I can get in pandas using .loc is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02','2022-12-03','2022-12-04','2022-12-05'],
'Appt': [12,10,5,4,13],
'Forecast': [37,0,0,0,0]
})
What I'm looking for it to do is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02','2022-12-03','2022-12-04','2022-12-05'],
'Appt': [12,10,5,4,13],
'Forecast': [37,47,52,56,69]
})
E.g. the 'Forecast' total on the 1st of December is 37. On the 2nd of December the value in the 'Appt' column is 10. I want it to take 37 + 10 and put this in the 'Forecast' column for the 2nd of December, then iterate over the rest of the column.
I've tried using .loc() with the index, and experimented with .shift(), but neither seems to work for what I'd like. I also looked into .rolling(), but I think that's not appropriate either.
I'm sure there must be a simple way to do this?
Apologies, the original df has 'Date' as a datetime column.
You can use mask and cumsum:
df['Forecast'] = df['Forecast'].mask(df['Forecast'].eq(0), df['Appt']).cumsum()
# or
df['Forecast'] = np.where(df['Forecast'].eq(0), df['Appt'], df['Forecast']).cumsum()
Output:
Date Appt Forecast
0 2022-12-01 12 37
1 2022-12-02 10 47
2 2022-12-03 5 52
3 2022-12-04 4 56
4 2022-12-05 13 69
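For completeness, a self-contained sketch of the mask approach, using the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'Date': ['2022-12-01', '2022-12-02', '2022-12-03', '2022-12-04', '2022-12-05'],
                   'Appt': [12, 10, 5, 4, 13],
                   'Forecast': [37, 0, 0, 0, 0]})

# Replace the placeholder zeros with that day's Appt, then take a running total
df['Forecast'] = df['Forecast'].mask(df['Forecast'].eq(0), df['Appt']).cumsum()
print(df)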
You have to make sure that your column has a datetime/date dtype; then you can filter the df like this:
from datetime import datetime, timedelta  # plus previous code & imports

yesterday = datetime.now().date() - timedelta(days=1)
df[df["date"] == yesterday]["your_column"].sum()
I'm sure that this question on its own is not really helpful and could mean a lot of things, so I'll try to explain it with an example.
So my goal is to delete rows in a DataFrame like the following one if the row can't be part of a run of consecutive days that is at least as long as a given time period t. If t, for example, is 3, then the last row needs to be deleted, because there is a gap between it and the row before. If t were 4, then the first three rows would also have to be deleted, since 07.04.2012 (or 03.04.2012) is missing. Hopefully you can understand what I'm trying to explain here.
Date        Value
04.04.2012  24
05.04.2012  21
06.04.2012  20
08.04.2012  21
09.04.2012  23
10.04.2012  21
11.04.2012  26
13.04.2012  24
My attempt was to iterate over the values in the 'Date' column and check, for every element x, whether the date at position x minus the date at position x + t equals -t days; if not, the whole row of element x should be deleted. But while searching for how to iterate over a DataFrame, I read several times that this is not recommended because it needs a lot of computing time for big DataFrames. Unfortunately, I couldn't find any other method or function which could do this. Therefore, I would be really glad if someone could help me out here. Thank you! :)
With the dates as index you can expand the index of the dataframe to include the missing days. The new dates will create NaN values. Create a group for every gap with .isna().cumsum() and count the size of each group. Finally, select the rows with a count larger than or equal to the desired time period.
period = 3
df.set_index('Date', inplace=True)
df[df.groupby(df.reindex(pd.date_range(df.index.min(), df.index.max()))
                .Value.isna().cumsum())
     .transform('count').ge(period).Value].reset_index()
Output
Date Value
0 2012-04-04 24
1 2012-04-05 21
2 2012-04-06 20
3 2012-04-08 21
4 2012-04-09 23
5 2012-04-10 21
6 2012-04-11 26
To create the dataframe used in this solution
t = '''
Date Value
04.04.2012 24
05.04.2012 21
06.04.2012 20
08.04.2012 21
09.04.2012 23
10.04.2012 21
11.04.2012 26
13.04.2012 24
'''
import io
import pandas as pd
from datetime import datetime

df = pd.read_csv(io.StringIO(t), sep='\s+', parse_dates=['Date'],
                 date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y'))
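If the chained one-liner is hard to follow, here is a hedged step-by-step rewrite of the same idea, assuming df is the frame built just above (with 'Date' still a regular column):
period = 3
# Expand to a full daily index; days that were missing become NaN
expanded = df.set_index('Date').reindex(pd.date_range(df['Date'].min(), df['Date'].max()))
# Every NaN starts a new group, so each group is one run of consecutive days
groups = expanded['Value'].isna().cumsum()
# Count the real (non-NaN) rows in each run and keep only runs that are long enough
run_length = expanded.groupby(groups)['Value'].transform('count')
result = (expanded[(run_length >= period) & expanded['Value'].notna()]
          .rename_axis('Date').reset_index())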
First post: I apologize in advance for sloppy wording (and possibly poor searching if this question has been answered ad nauseum elsewhere - maybe I don't know the right search terms yet).
I have data in 10-minute chunks and I want to perform calculations on a column ('input') grouped by minute (i.e. 10 separate 60-second blocks - not a rolling 60 second period) and then store all ten calculations in a single list called output.
The 'seconds' column records the second from 1 to 600 in the 10-minute period. If no data was entered for a given second, there is no row for that number of seconds. So, some minutes have 60 rows of data, some have as few as one or two.
Note: the calculation (my_function) is not basic so I can't use groupby and np.sum(), np.mean(), etc. - or at least I can't figure out how to use groupby.
I have code that gets the job done but it looks ugly to me so I am sure there is a better way (probably several).
output = []
seconds_slicer = 0
for i in np.linspace(1, 10, 10):
    seconds_slicer += 60
    minute_slice = df[(df['seconds'] > (seconds_slicer - 60)) &
                      (df['seconds'] <= seconds_slicer)]
    calc = my_function(minute_slice['input'])
    output.append(calc)
If there is a cleaner way to do this, please let me know. Thanks!
Edit: Adding sample data and function details:
seconds input
1 1 0.000054
2 2 -0.000012
3 3 0.000000
4 4 0.000000
5 5 0.000045
def realized_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))
For this answer, we're going to repurpose the approach from "Bin pandas dataframe by every X rows".
We'll create a dataframe with missing data in the 'seconds' column, matching my understanding of your data from the description given:
import numpy as np
import pandas as pd

secs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 17, 19]
data = [np.random.randint(-25, 54) / 100000 for _ in range(15)]
df = pd.DataFrame(data=zip(secs, data), columns=['seconds', 'input'])
df
seconds input
0 1 0.00017
1 2 -0.00020
2 3 0.00033
3 4 0.00052
4 5 0.00040
5 6 -0.00015
6 7 0.00001
7 8 -0.00010
8 9 0.00037
9 11 0.00050
10 12 0.00000
11 14 -0.00009
12 15 -0.00024
13 17 0.00047
14 19 -0.00002
I didn't create 600 rows, but that's okay, we'll say we want to bin every 5 seconds instead of every 60. Now, because we're just trying to use equal time measures for grouping, we can just use floor division to see which bin each time interval would end up in. (In your case, you'd divide by 60 instead)
# We drop the extra 'seconds' column because we don't care about the
# root sum of squares of the seconds in the df
grouped = df.groupby(df['seconds'] // 5).apply(realized_volatility).drop('seconds', axis=1)
grouped
input
seconds
0 0.000441
1 0.000372
2 0.000711
3 0.000505
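A slightly tidier variant of the same grouping is to apply the function to the 'input' column only, so there is nothing to drop afterwards:
grouped = df.groupby(df['seconds'] // 5)['input'].apply(realized_volatility)
# In the original 600-second case you would group by df['seconds'] // 60
# (or (df['seconds'] - 1) // 60 if second 60 should still count towards the first minute).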
I have the following df:
Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10
I would like to group every 3 weeks and sum up sales. I want to start with the bottom 3 weeks. If there are fewer than 3 weeks left at the top, as in this example, these weeks should be ignored. The desired output is this:
Week Sales
5-3 50
8-6 35
I tried this on my original df:
df.reset_index(drop=True).groupby(by=lambda x: x/N, axis=0).sum()
but this solution is not starting from the bottom rows.
Can anyone point me in the right direction here? Thanks!
You can try reversing the data with .iloc[::-1]:
N=3
(df.iloc[::-1].groupby(np.arange(len(df))//N)
.agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
'Sales': 'sum'
})
)
Output:
Week Sales
0 8-6 35
1 5-3 50
2 2-1 25
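If the leftover weeks at the top really should be ignored, as the question asks (the '2-1' row above), one option is to drop incomplete groups before aggregating; a sketch with the same sample data and N:
import pandas as pd

df = pd.DataFrame({'Week': range(1, 9), 'Sales': [10, 15, 10, 20, 20, 10, 15, 10]})
N = 3

rev = df.iloc[::-1].reset_index(drop=True)
# Keep only complete groups of N weeks, then aggregate as above
complete = rev[rev.groupby(rev.index // N)['Week'].transform('size') == N]
out = (complete.groupby(complete.index // N)
               .agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
                     'Sales': 'sum'}))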
When dealing with period aggregation, I usually use .resample as it is flexible for binning data over different time periods
import io
from datetime import timedelta
import pandas as pd
dataf = pd.read_csv(io.StringIO("""Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10"""), sep='\s+',).astype(int)
# reverse data and transform int weeks to actual date time
dataf = dataf.iloc[::-1]
dataf['Week'] = dataf['Week'].map(lambda x: timedelta(weeks=x))
# set date object to index for resampling
dataf = dataf.set_index('Week')
# now we resample
dataf.resample('21d').sum() # 21days
Note: the resulting labels are misleading, and setting kind='period' raises an error.
I have a large df with many entries per month. I would like to see the average number of entries per month, to see, for example, if there are any months that normally have more entries. (Ideally I'd like to plot this with a line for the overall mean to compare against, but that is maybe a later question.)
My df is something like this:
ufo=pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo['Time']=pd.to_datetime(ufo.Time)
Where the head looks like this:
So if I'd like to see whether there are more ufo sightings in the summer, as an example, how would I go about it?
I have tried:
ufo.groupby(ufo.Time.month).mean()
But that only works if I am calculating a numerical value. If I use count() instead, I get the sum of all entries for all months.
EDIT: To clarify, I would like to have the mean of entries - ufo-sightings - per month.
You could do something like this:
# number of years covered by the records for this calendar month
def total_month(x):
    return x.max().year - x.min().year + 1

new_df = ufo.groupby(ufo.Time.dt.month).Time.agg(['size', total_month])
new_df['mean_count'] = new_df['size'] / new_df['total_month']
Output:
size total_month mean_count
Time
1 862 57 15.122807
2 817 70 11.671429
3 1096 55 19.927273
4 1045 68 15.367647
5 1168 53 22.037736
6 3059 71 43.084507
7 2345 65 36.076923
8 1948 64 30.437500
9 1635 67 24.402985
10 1723 65 26.507692
11 1509 50 30.180000
12 1034 56 18.464286
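And since the question mentioned plotting this against the overall mean, a rough sketch (assuming matplotlib is available) could look like:
import matplotlib.pyplot as plt

ax = new_df['mean_count'].plot(kind='bar')
# Horizontal reference line at the overall mean of the monthly means
ax.axhline(new_df['mean_count'].mean(), color='red', linestyle='--', label='overall mean')
ax.set_xlabel('month')
ax.set_ylabel('mean sightings per month')
ax.legend()
plt.show()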
I think this is what you are looking for; still, please ask for clarification if I didn't get what you are looking for.
# Add a new column instance, this adds a value to each instance of ufo sighting
ufo['instance'] = 1
# set index to time, this makes df a time series df and then you can apply pandas time series functions.
ufo.set_index(ufo['Time'], drop=True, inplace=True)
# create another df by resampling the original df and counting the instance column by Month ('M' is resample by month)
ufo2 = pd.DataFrame(ufo['instance'].resample('M').count())
# just to find month of resampled observation
ufo2['Time'] = pd.to_datetime(ufo2.index.values)
ufo2['month'] = ufo2['Time'].apply(lambda x: x.month)
and finally you can groupby month :)
ufo2.groupby(by='month').mean()
and the output looks like this:
month mean_instance
1 12.314286
2 11.671429
3 15.657143
4 14.928571
5 16.685714
6 43.084507
7 33.028169
8 27.436620
9 23.028169
10 24.267606
11 21.253521
12 14.563380
Do you mean you want to group your data by month? I think we can do this
ufo['month'] = ufo['Time'].apply(lambda t: t.month)
ufo['year'] = ufo['Time'].apply(lambda t: t.year)
In this way, you will have 'year' and 'month' to group your data.
ufo_2 = ufo.groupby(['year', 'month'])['place_holder'].mean()
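Building on that, one way to get an actual mean number of sightings per calendar month (rather than the mean of a placeholder column) might be to count per (year, month) and then average those counts across years:
counts = ufo.groupby(['year', 'month']).size()
# Average the yearly counts for each calendar month
avg_per_month = counts.groupby(level='month').mean()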