I have a lot of data in a Pandas dataframe:
Timestamp Value
2015-07-15 07:16:39.034 49.960
2015-07-15 07:16:39.036 49.940
......
2015-08-12 23:16:39.235 42.958
I have about 50 000 entries per day, and I would like to perform different operations on this data, day by day.
For example, to compute a rolling mean, I would enter this:
df['rm5000'] = df['Value'].rolling(window=5000).mean()
But that computes the rolling mean across date boundaries: the first rolling-mean datapoint on August 12th would include 4999 datapoints from August 11th. Instead, I would like to start over each day, so that the first 4999 datapoints of each day do not get a 5000-point rolling mean, since there can be a large jump between the last value of one day and the first value of the next.
Do I have to slice the data into a separate dataframe for each date, or can Pandas perform operations on each date separately?
If you set the timestamps as the index, you can group by a pd.Grouper with a daily frequency code to partition the data by day, as below:
In [2]: df = pd.DataFrame({'Timestamp': pd.date_range('2015-07-15', '2015-07-18', freq='10min'),
'Value': np.linspace(49, 51, 433)})
In [3]: df = df.set_index('Timestamp')
In [4]: df.groupby(pd.Grouper(freq='D'))['Value'].transform(lambda x: x.rolling(window=15).mean())
Out[4]:
Timestamp
2015-07-15 00:00:00 NaN
2015-07-15 00:10:00 NaN
.....
2015-07-15 23:30:00 49.620370
2015-07-15 23:40:00 49.625000
2015-07-15 23:50:00 49.629630
2015-07-16 00:00:00 NaN
2015-07-16 00:10:00 NaN
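Applied to the original frame (a sketch, assuming the Timestamp column has been set as the index), restarting the 5000-point window at each new date would look something like:
# Restart the rolling mean at every new calendar date.
df['rm5000'] = df.groupby(df.index.date)['Value'].transform(lambda x: x.rolling(window=5000).mean())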
I have multiple dataframes for macroeconomic time series. In each of these dataframes I want to add a column showing the year-over-year percentage change. Ideally I would do this with a for loop so I don't have to repeat the process multiple times. However, the series do not have the same frequency: for example, GDP is quarterly, PCE is monthly, and S&P returns are daily, so I cannot specify a fixed number of periods. Since my dataframes already have a Datetime index, I would like to specify that the percentage change should be calculated based on the dates. Is that possible?
Please see examples of my Dataframes below:
print(gdp):
Date GDP
1947-01-01 2.034450e+12
1947-04-01 2.029024e+12
1947-07-01 2.024834e+12
1947-10-01 2.056508e+12
1948-01-01 2.087442e+12
...
2021-04-01 1.936831e+13
2021-07-01 1.947889e+13
2021-10-01 1.980629e+13
2022-01-01 1.972792e+13
2022-04-01 1.969946e+13
[302 rows x 1 columns]
print(pce):
Date PCE
1960-01-01 1.695549
1960-02-01 1.706421
1960-03-01 1.692806
1960-04-01 1.863354
1960-05-01 1.911975
...
2022-02-01 6.274030
2022-03-01 6.638595
2022-04-01 6.269216
2022-05-01 6.324989
2022-06-01 6.758935
[750 rows x 1 columns]
print(spx):
Date SPX
1928-01-03 17.76
1928-01-04 17.72
1928-01-05 17.55
1928-01-06 17.66
1928-01-09 17.59
...
2022-08-19 4228.48
2022-08-22 4137.99
2022-08-23 4128.73
2022-08-24 4140.77
2022-08-25 4199.12
[24240 rows x 1 columns]
Instead of doing this:
gdp['GDP'] = gdp['GDP'].pct_change(4)
pce['PCE'] = pce['PCE'].pct_change(12)
spx['SPX'] = spx['SPX'].pct_change(252)
I would like a for loop that does this for all dataframes without specifying the number of periods, only that I want the year-over-year percentage change.
Given:
d = {'Date': [ '2021-02-01',
'2021-03-01',
'2021-04-01',
'2021-05-01',
'2021-06-01',
'2022-02-01',
'2022-03-01',
'2022-04-01',
'2022-05-01',
'2022-06-01'],
'PCE': [ 1.695549, 1.706421, 1.692806, 1.863354, 1.911975,
6.274030, 6.638595, 6.269216, 6.324989, 6.758935]}
pce = pd.DataFrame(d)
pce = pce.set_index('Date')
pce.index = pd.to_datetime(pce.index)
You could create a new dataframe with a copy of the datetime index as a new column, resample it with annual frequency ('A'), and count the rows that fall into each year.
pce_annual_rows = pce.index.to_frame()
resampled_annual = pce_annual_rows.resample('A').count()
Next you can take the second-last count and use that as the periods value in the pct_change method.
The second-last, because if there is an incomplete year at the end, the last count would give a wrong periods value. This assumes that every dataframe contains more than one year of data; otherwise you'll get an IndexError.
periods_per_year = resampled_annual['Date'].iloc[-2]
pce['ROC'] = pce['PCE'].pct_change(periods_per_year)
This produces the following output:
PCE ROC
Date
2021-02-01 1.695549 NaN
2021-03-01 1.706421 NaN
2021-04-01 1.692806 NaN
2021-05-01 1.863354 NaN
2021-06-01 1.911975 NaN
2022-02-01 6.274030 2.700294
2022-03-01 6.638595 2.890362
2022-04-01 6.269216 2.703446
2022-05-01 6.324989 2.394411
2022-06-01 6.758935 2.535054
This solution isn't very elegant; maybe someone will come up with a less complicated idea.
To build your for loop over every dataframe, you would probably be better off using the same column name everywhere for the column you want to apply the pct_change method to, or pairing each dataframe with its column name as in the sketch below.
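For example, a minimal sketch of such a loop, wrapping the counting logic above in a helper (the function name yoy_pct_change and the _YoY column suffix are hypothetical):
def yoy_pct_change(df, column):
    # Count rows per calendar year; take the second-last count,
    # since the final year may be incomplete.
    counts = df.index.to_frame().resample('A').count()
    periods_per_year = counts.iloc[-2, 0]
    return df[column].pct_change(periods_per_year)

for frame, col in [(gdp, 'GDP'), (pce, 'PCE'), (spx, 'SPX')]:
    frame[col + '_YoY'] = yoy_pct_change(frame, col)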
At the moment I am working on a time series project.
I have daily data points over a 5-year timespan. In between there are some days with 0 values and some days are missing.
For example:
2015-01-10 343
2015-03-10 128
Day 2 of October is missing.
In order to build a good time series model I want to resample the data to monthly:
df.individuals.resample("M").sum()
but I am getting the following output:
2015-01-31 343.000000
2015-02-28 NaN
2015-03-31 64.500000
Somehow the months are completely wrong.
The expected output would look like this:
2015-31-10 Sum of all days
2015-30-11 Sum of all days
2015-31-12 Sum of all days
Pandas is interpreting your date as %Y-%m-%d.
You should explicitly specify your date format before doing the resample.
Try this:
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
>>> df.resample("M").sum()
2015-10-31 471
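Putting it together on the sample data (a minimal sketch; the column name individuals is taken from the resample call in the question):
import pandas as pd

# The index is written as %Y-%d-%m (day before month),
# e.g. 2015-01-10 is 1 October 2015.
df = pd.DataFrame({'individuals': [343, 128]}, index=['2015-01-10', '2015-03-10'])
df.index = pd.to_datetime(df.index, format='%Y-%d-%m')
print(df['individuals'].resample('M').sum())
# 2015-10-31    471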
I have a dataframe like this:
ds y
2018-07-25 22:00:00 1
2018-07-25 23:00:00 2
2018-07-26 00:00:00 3
2018-07-26 01:00:00 4
2018-07-26 02:00:00 5
What I want to get is a new dataframe which looks like this:
ds y
2018-07-25 3
2018-07-26 12
I want to get a new dataframe df1 where all the entries of one day are summed up in y, keeping only one row per day without a timestamp.
What I did so far is this:
df1 = df.groupby(df.index.date).transform(lambda x: x[:24].sum())
24 because I have 24 entries every day (one for every hour). I get the correct sum for every day, but I also get 24 rows per day together with the existing timestamps. How can I achieve what I want?
If you need to sum all values per day, then filtering the first 24 rows is not necessary:
df1 = df.groupby(df.index.date)['y'].sum().reset_index()
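For instance, a quick check against the sample frame from the question (a sketch, assuming ds is the DatetimeIndex):
import pandas as pd

df = pd.DataFrame({'y': [1, 2, 3, 4, 5]},
                  index=pd.to_datetime(['2018-07-25 22:00:00', '2018-07-25 23:00:00',
                                        '2018-07-26 00:00:00', '2018-07-26 01:00:00',
                                        '2018-07-26 02:00:00']))
df1 = df.groupby(df.index.date)['y'].sum().rename_axis('ds').reset_index()
print(df1)
#            ds   y
# 0  2018-07-25   3
# 1  2018-07-26  12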
Try out:
df.groupby([df.index.year, df.index.month, df.index.day])['y'].sum()
I have a pandas dataframe which contains the temperature for every hour. I have already grouped it to the mean temperature per day with:
weather = weather.groupby(pd.Grouper(key='date', freq='D')).mean()
which gives:
temp
date
2007-01-01 11.457143
2007-01-02 9.229167
2007-01-03 9.085106
2007-01-04 11.234043
2007-01-05 11.239130
... ...
2016-12-27 8.437500
2016-12-28 5.145833
2016-12-29 3.739130
2016-12-30 7.020833
2016-12-31 3.729167
[3653 rows x 1 columns]
How can I get the mean temperature of the same date over the years?
For example, the mean temperature from 2007-01-01 / 2008-01-01 / 2009-01-01 and so on?
My data needs to look something like this, with 01-01 being the mean temperature for the first of January over the years:
01-01 12
01-02 15
01-03 13
Thank you in advance!
You can group by month and day:
weather = weather.groupby([weather.index.month, weather.index.day])[['temp']].mean()
You obtain a dataframe indexed with pairs (month, day). You can go one step further if you want the index to be strings 'month-day':
weather.index = pd.Series(weather.index.values).apply(lambda x: '{0:02d}-{1:02d}'.format(*x))
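A minimal end-to-end sketch of this approach (the random temperatures are a hypothetical stand-in for the daily means shown in the question):
import pandas as pd
import numpy as np

idx = pd.date_range('2007-01-01', '2016-12-31', freq='D')
weather = pd.DataFrame({'temp': np.random.uniform(-5, 25, len(idx))}, index=idx)

daily = weather.groupby([weather.index.month, weather.index.day])[['temp']].mean()
daily.index = pd.Series(daily.index.values).apply(lambda x: '{0:02d}-{1:02d}'.format(*x))
print(daily.head(3))  # rows labelled 01-01, 01-02, 01-03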
Create a dataframe:
rng = pd.date_range('2015-01-01', periods=1000, freq='D')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randint(low=12, high=100, size=len(rng))})
Add a month-day column:
df['month_day'] = df['Date'].map(lambda x: x.strftime('%m-%d'))
Group by month_day, selecting the numeric column (averaging the Date column would raise a TypeError in recent pandas):
df.groupby('month_day')['Val'].mean()
I have a column in a dataframe which contains non-continuous dates. I need to group these dates with a frequency of 2 days. Data sample (after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried the following, but as the dates are not continuous I am not getting the desired result:
df.groupby(pd.Grouper(key = 'l_date', freq='2D'))
Is there a way to achieve the desired grouping using pandas, or should I write separate logic?
Once you have the dataframe sorted by l_date, you can create a continuous dummy date column (dum_date) and group by a 2-day frequency on it:
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.Timestamp.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key='dum_date', freq='2D'))
OR
If you are fine with groupings other than dates, then a generalized way to group every n consecutive rows could be:
n = 2 # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [i // n + 1 for i in range(df.shape[0])]
df.groupby('grouping')
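A quick demonstration on the sample dates (a sketch showing which rows the i // n bucketing puts together):
import pandas as pd

df = pd.DataFrame({'l_date': pd.to_datetime(['2015-04-18', '2015-04-20', '2015-04-20',
                                             '2015-04-21', '2015-04-27', '2015-04-30',
                                             '2015-05-07', '2015-05-08'])})
n = 2
df = df.sort_values(by='l_date')
df['grouping'] = [i // n + 1 for i in range(df.shape[0])]

for key, group in df.groupby('grouping'):
    print(key, group['l_date'].dt.strftime('%Y-%m-%d').tolist())
# 1 ['2015-04-18', '2015-04-20']
# 2 ['2015-04-20', '2015-04-21']
# 3 ['2015-04-27', '2015-04-30']
# 4 ['2015-05-07', '2015-05-08']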