I have a lot of data in a Pandas dataframe:
Timestamp Value
2015-07-15 07:16:39.034 49.960
2015-07-15 07:16:39.036 49.940
......
2015-08-12 23:16:39.235 42.958
I have about 50 000 entries per day, and I would like to perform different operations on this data, day by day.
For example, to compute a rolling mean, I would enter this:
df['rm5000'] = df['Value'].rolling(window=5000).mean()
But that computes the rolling mean across date boundaries: the first rolling-mean datapoint on August 12th would include 4999 datapoints from August 11th. Instead, I would like to start over each day, so that the first 4999 datapoints of each day do not get a 5000-point rolling mean, since there can be a large jump between the last value of one day and the first value of the next.
Do I have to slice the data into a separate dataframe for each date, or can Pandas perform operations on each date separately?
If you set the timestamps as the index, you can group by a pd.Grouper with a daily frequency code to partition the data by day, as below:
In [2]: df = pd.DataFrame({'Timestamp': pd.date_range('2015-07-15', '2015-07-18', freq='10min'),
'Value': np.linspace(49, 51, 433)})
In [3]: df = df.set_index('Timestamp')
In [4]: df.groupby(pd.Grouper(freq='D'))['Value'].transform(lambda x: x.rolling(window=15).mean())
Out[4]:
Timestamp
2015-07-15 00:00:00 NaN
2015-07-15 00:10:00 NaN
.....
2015-07-15 23:30:00 49.620370
2015-07-15 23:40:00 49.625000
2015-07-15 23:50:00 49.629630
2015-07-16 00:00:00 NaN
2015-07-16 00:10:00 NaN
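Applied to the original frame (a sketch, assuming the Timestamp column has been set as the index), restarting the 5000-point window at each new date would look something like:
# Restart the rolling mean at every new calendar date.
df['rm5000'] = df.groupby(df.index.date)['Value'].transform(lambda x: x.rolling(window=5000).mean())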
I have multiple dataframes for macroeconomic time series. In each of these dataframes I want to add a column showing the year-over-year percentage change. Ideally I would do this with a for loop so I don't have to repeat the process multiple times. However, the series do not have the same frequency: for example, GDP is quarterly, PCE is monthly, and S&P returns are daily, so I cannot specify a fixed number of periods. Since my dataframes already have a Datetime index, I would like to specify that the percentage change should be calculated based on the dates. Is that possible?
Please see examples of my Dataframes below:
print(gdp):
Date GDP
1947-01-01 2.034450e+12
1947-04-01 2.029024e+12
1947-07-01 2.024834e+12
1947-10-01 2.056508e+12
1948-01-01 2.087442e+12
...
2021-04-01 1.936831e+13
2021-07-01 1.947889e+13
2021-10-01 1.980629e+13
2022-01-01 1.972792e+13
2022-04-01 1.969946e+13
[302 rows x 1 columns]
print(pce):
Date PCE
1960-01-01 1.695549
1960-02-01 1.706421
1960-03-01 1.692806
1960-04-01 1.863354
1960-05-01 1.911975
...
2022-02-01 6.274030
2022-03-01 6.638595
2022-04-01 6.269216
2022-05-01 6.324989
2022-06-01 6.758935
[750 rows x 1 columns]
print(spx):
Date SPX
1928-01-03 17.76
1928-01-04 17.72
1928-01-05 17.55
1928-01-06 17.66
1928-01-09 17.59
...
2022-08-19 4228.48
2022-08-22 4137.99
2022-08-23 4128.73
2022-08-24 4140.77
2022-08-25 4199.12
[24240 rows x 1 columns]
Instead of doing this:
gdp['GDP'] = gdp['GDP'].pct_change(4)
pce['PCE'] = pce['PCE'].pct_change(12)
spx['SPX'] = spx['SPX'].pct_change(252)
I would like a for loop that does this for all dataframes without specifying the number of periods, only that I want the year-over-year percentage change.
Given:
d = {'Date': [ '2021-02-01',
'2021-03-01',
'2021-04-01',
'2021-05-01',
'2021-06-01',
'2022-02-01',
'2022-03-01',
'2022-04-01',
'2022-05-01',
'2022-06-01'],
'PCE': [ 1.695549, 1.706421, 1.692806, 1.863354, 1.911975,
6.274030, 6.638595, 6.269216, 6.324989, 6.758935]}
pce = pd.DataFrame(d)
pce = pce.set_index('Date')
pce.index = pd.to_datetime(pce.index)
You could create a new dataframe with a copy of the datetime index as a new column, resample it with annual frequency ('A'), and count the rows that fall into each year.
pce_annual_rows = pce.index.to_frame()
resampled_annual = pce_annual_rows.resample('A').count()
Next you can take the second-last count and use that as the periods value in the pct_change method.
The second-last, because if there is an incomplete year at the end, the last count would give a wrong periods value. This assumes that every dataframe contains more than one year of data; otherwise you'll get an IndexError.
periods_per_year = resampled_annual['Date'].iloc[-2]
pce['ROC'] = pce['PCE'].pct_change(periods_per_year)
This produces the following output:
PCE ROC
Date
2021-02-01 1.695549 NaN
2021-03-01 1.706421 NaN
2021-04-01 1.692806 NaN
2021-05-01 1.863354 NaN
2021-06-01 1.911975 NaN
2022-02-01 6.274030 2.700294
2022-03-01 6.638595 2.890362
2022-04-01 6.269216 2.703446
2022-05-01 6.324989 2.394411
2022-06-01 6.758935 2.535054
This solution isn't very elegant; maybe someone will come up with a less complicated idea.
To build your for loop over every dataframe, you would probably be better off using the same column name everywhere for the column you want to apply the pct_change method to, or pairing each dataframe with its column name as in the sketch below.
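For example, a minimal sketch of such a loop, wrapping the counting logic above in a helper (the function name yoy_pct_change and the _YoY column suffix are hypothetical):
def yoy_pct_change(df, column):
    # Count rows per calendar year; take the second-last count,
    # since the final year may be incomplete.
    counts = df.index.to_frame().resample('A').count()
    periods_per_year = counts.iloc[-2, 0]
    return df[column].pct_change(periods_per_year)

for frame, col in [(gdp, 'GDP'), (pce, 'PCE'), (spx, 'SPX')]:
    frame[col + '_YoY'] = yoy_pct_change(frame, col)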
At the moment I am working on a time series project.
I have daily data points over a 5-year timespan. In between there are some days with 0 values and some days are missing.
For example:
2015-01-10 343
2015-03-10 128
Day 2 of October is missing.
In order to build a good time series model I want to resample the data to monthly:
df.individuals.resample("M").sum()
but I am getting the following output:
2015-01-31 343.000000
2015-02-28 NaN
2015-03-31 64.500000
Somehow the months are completely wrong.
The expected output would look like this:
2015-31-10 Sum of all days
2015-30-11 Sum of all days
2015-31-12 Sum of all days
Pandas is interpreting your date as %Y-%m-%d.
You should explicitly specify your date format before doing the resample.
Try this:
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
>>> df.resample("M").sum()
2015-10-31 471
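Putting it together on the sample data (a minimal sketch; the column name individuals is taken from the resample call in the question):
import pandas as pd

# The index is written as %Y-%d-%m (day before month),
# e.g. 2015-01-10 is 1 October 2015.
df = pd.DataFrame({'individuals': [343, 128]}, index=['2015-01-10', '2015-03-10'])
df.index = pd.to_datetime(df.index, format='%Y-%d-%m')
print(df['individuals'].resample('M').sum())
# 2015-10-31    471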
I have a dataframe like this:
ds y
2018-07-25 22:00:00 1
2018-07-25 23:00:00 2
2018-07-26 00:00:00 3
2018-07-26 01:00:00 4
2018-07-26 02:00:00 5
What I want to get is a new dataframe which looks like this:
ds y
2018-07-25 3
2018-07-26 12
I want to get a new dataframe df1 where all the entries of one day are summed up in y, keeping only one row per day without a timestamp.
What I did so far is this:
df1 = df.groupby(df.index.date).transform(lambda x: x[:24].sum())
24 because I have 24 entries every day (one for every hour). I get the correct sum for every day, but I also get 24 rows per day together with the existing timestamps. How can I achieve what I want?
If you need to sum all values per day, then filtering the first 24 rows is not necessary:
df1 = df.groupby(df.index.date)['y'].sum().reset_index()
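For instance, a quick check against the sample frame from the question (a sketch, assuming ds is the DatetimeIndex):
import pandas as pd

df = pd.DataFrame({'y': [1, 2, 3, 4, 5]},
                  index=pd.to_datetime(['2018-07-25 22:00:00', '2018-07-25 23:00:00',
                                        '2018-07-26 00:00:00', '2018-07-26 01:00:00',
                                        '2018-07-26 02:00:00']))
df1 = df.groupby(df.index.date)['y'].sum().rename_axis('ds').reset_index()
print(df1)
#            ds   y
# 0  2018-07-25   3
# 1  2018-07-26  12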
Try out:
df.groupby([df.index.year, df.index.month, df.index.day])['y'].sum()
I have a pandas dataframe which contains the temperature for every hour. I have already grouped it to the mean temperature per day with:
weather = weather.groupby(pd.Grouper(key='date', freq='D')).mean()
which gives:
temp
date
2007-01-01 11.457143
2007-01-02 9.229167
2007-01-03 9.085106
2007-01-04 11.234043
2007-01-05 11.239130
... ...
2016-12-27 8.437500
2016-12-28 5.145833
2016-12-29 3.739130
2016-12-30 7.020833
2016-12-31 3.729167
[3653 rows x 1 columns]
How can I get the mean temperature of the same date over the years?
For example, the mean temperature from 2007-01-01 / 2008-01-01 / 2009-01-01 and so on?
My data needs to look something like this, with 01-01 being the mean temperature for the first of January over the years:
01-01 12
01-02 15
01-03 13
Thank you in advance!
You can group by month and day:
weather = weather.groupby([weather.index.month, weather.index.day])[['temp']].mean()
You obtain a dataframe indexed with pairs (month, day). You can go one step further if you want the index to be strings 'month-day':
weather.index = pd.Series(weather.index.values).apply(lambda x: '{0:02d}-{1:02d}'.format(*x))
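A minimal end-to-end sketch of this approach (the random temperatures are a hypothetical stand-in for the daily means shown in the question):
import pandas as pd
import numpy as np

idx = pd.date_range('2007-01-01', '2016-12-31', freq='D')
weather = pd.DataFrame({'temp': np.random.uniform(-5, 25, len(idx))}, index=idx)

daily = weather.groupby([weather.index.month, weather.index.day])[['temp']].mean()
daily.index = pd.Series(daily.index.values).apply(lambda x: '{0:02d}-{1:02d}'.format(*x))
print(daily.head(3))  # rows labelled 01-01, 01-02, 01-03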
Create a dataframe:
rng = pd.date_range('2015-01-01', periods=1000, freq='D')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randint(low=12, high=100, size=len(rng))})
Add a month-day column:
df['month_day'] = df['Date'].map(lambda x: x.strftime('%m-%d'))
Group by month_day, selecting the numeric column (averaging the Date column would raise a TypeError in recent pandas):
df.groupby('month_day')['Val'].mean()
I have a column in a dataframe which contains non-continuous dates. I need to group these dates with a frequency of 2 days. Data sample (after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried the following, but as the dates are not continuous I am not getting the desired result:
df.groupby(pd.Grouper(key = 'l_date', freq='2D'))
Is there a way to achieve the desired grouping using pandas, or should I write separate logic?
Once you have the dataframe sorted by l_date, you can create a continuous dummy date column (dum_date) and group by a 2-day frequency on it:
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.Timestamp.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key='dum_date', freq='2D'))
OR
If you are fine with groupings other than dates, then a generalized way to group every n consecutive rows could be:
n = 2 # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [i // n + 1 for i in range(df.shape[0])]
df.groupby('grouping')
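A quick demonstration on the sample dates (a sketch showing which rows the i // n bucketing puts together):
import pandas as pd

df = pd.DataFrame({'l_date': pd.to_datetime(['2015-04-18', '2015-04-20', '2015-04-20',
                                             '2015-04-21', '2015-04-27', '2015-04-30',
                                             '2015-05-07', '2015-05-08'])})
n = 2
df = df.sort_values(by='l_date')
df['grouping'] = [i // n + 1 for i in range(df.shape[0])]

for key, group in df.groupby('grouping'):
    print(key, group['l_date'].dt.strftime('%Y-%m-%d').tolist())
# 1 ['2015-04-18', '2015-04-20']
# 2 ['2015-04-20', '2015-04-21']
# 3 ['2015-04-27', '2015-04-30']
# 4 ['2015-05-07', '2015-05-08']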