Rolling mean without sorting descending time series - python

Is there a way to calculate a rolling mean on a descending time series without sorting it into an ascending one?
Original time series, with the same timestamp order as in the CSV file:
pd.read_csv(data_dir+items+extension, parse_dates=True, index_col='timestamp').sort_index(ascending=False)
timestamp open
2021-05-06 90.000
2021-05-05 93.600
2021-05-04 90.840
2021-05-03 91.700
2021-04-30 91.355
Rolling mean
stock_dict[items]["SMA100"]=pd.Series(stock_dict[items]["close"]).rolling(window=100).mean()
ascending = False
open high low close volume SMA100
timestamp
2021-05-06 90.000 93.5200 89.64 93.03 8024053 NaN
2021-05-05 93.600 94.7700 90.00 90.08 13079308 NaN
2021-05-04 90.840 90.9700 87.44 88.69 15147509 NaN
2021-05-03 91.700 92.0200 90.79 91.15 6641764 NaN
2021-04-30 91.355 91.9868 90.89 91.19 6614347 NaN
... ... ... ... ... ... ...
1999-11-05 14.560 15.5000 14.50 15.38 1308267 14.9245
1999-11-04 14.690 14.7500 14.25 14.62 207033 14.9395
1999-11-03 14.310 14.5000 14.12 14.50 61600 14.9526
1999-11-02 14.250 15.0000 14.16 14.25 128817 14.9639
1999-11-01 14.190 14.3800 13.94 14.06 173233 14.9682
ascending = True
open high low close volume SMA100
timestamp
1999-11-01 14.190 14.3800 13.94 14.06 173233 NaN
1999-11-02 14.250 15.0000 14.16 14.25 128817 NaN
1999-11-03 14.310 14.5000 14.12 14.50 61600 NaN
1999-11-04 14.690 14.7500 14.25 14.62 207033 NaN
1999-11-05 14.560 15.5000 14.50 15.38 1308267 NaN
... ... ... ... ... ... ...
2021-04-30 91.355 91.9868 90.89 91.19 6614347 93.1148
2021-05-03 91.700 92.0200 90.79 91.15 6641764 93.2036
2021-05-04 90.840 90.9700 87.44 88.69 15147509 93.2542
2021-05-05 93.600 94.7700 90.00 90.08 13079308 93.3292
2021-05-06 90.000 93.5200 89.64 93.03 8024053 93.4284
As the time series runs from 1999 to 2021, the rolling mean is only correct in the ascending = True case.
So either I have to change the sort order of the data, which I would like to avoid, or I somehow have to tell the rolling mean function to start with the last entry and calculate backwards.
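One possible workaround (a sketch, not from the original post): compute the rolling mean on a reversed, ascending view of the column and flip the result back. Because assignment aligns on the index, the stored DataFrame can stay in descending order:
close = stock_dict[items]["close"]
# rolling mean on the ascending view, reversed back to match the descending index
stock_dict[items]["SMA100"] = close[::-1].rolling(window=100).mean()[::-1]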


Actual interpolation based on date

This question is a follow-up to a question that came up in the comments of Resampling on a multi index.
We start with the following data:
data=pd.DataFrame({'dates':['2004','2008','2012'],'values':[k*(1+4*365) for k in range(3)]})
data['dates']=pd.to_datetime(data['dates'])
data=data.set_index('dates')
This is what it produces:
values
dates
2004-01-01 0
2008-01-01 1461
2012-01-01 2922
Now, when I resample and interpolate by
data.resample('A').mean().interpolate()
I obtain the following:
values
dates
2004-12-31 0.00
2005-12-31 365.25
2006-12-31 730.50
2007-12-31 1095.75
2008-12-31 1461.00
2009-12-31 1826.25
2010-12-31 2191.50
2011-12-31 2556.75
2012-12-31 2922.00
But what I want (and the problem is already in the resampling, not in the interpolation step) is:
2004-12-31 365
2005-12-31 730
2006-12-31 1095
2007-12-31 1460
2008-12-31 1826
2009-12-31 2191
2010-12-31 2556
2011-12-31 2921
2012-12-31 3287
So I want an actual linear interpolation on the given data.
To make it even clearer, I wrote a function which does the job. However, I'm still looking for a built-in solution (my own function is badly written and has a very ugly runtime):
def fillResampleCorrectly(data, resample):
    for i in range(len(resample)):
        currentDate = resample.index[i]
        for j in range(len(data)):
            if currentDate >= data.index[j]:
                if j < len(data) - 1:
                    continue
            valueBefore = data[data.columns[0]].iloc[j-1]
            valueAfter = data[data.columns[0]].iloc[j]
            dateBefore = data.index[j-1]
            dateAfter = data.index[j]
            currentValue = valueBefore + (valueAfter - valueBefore) * ((currentDate - dateBefore) / (dateAfter - dateBefore))
            resample[data.columns[0]].iloc[i] = currentValue
            break
I don't find a direct way to get your exact output. The issue is the resampling between Jan 1 and Dec 31 of the first year.
You can however mimic the result with:
out = data.resample('A', label='right').mean().interpolate(method='time') + 365
Or:
s = data.resample('A', label='right').mean().interpolate(method='time')
out = s + (s.index[0] - data.index[0]).days
Output:
values
dates
2004-12-31 365.0
2005-12-31 730.0
2006-12-31 1095.0
2007-12-31 1460.0
2008-12-31 1826.0
2009-12-31 2191.0
2010-12-31 2556.0
2011-12-31 2921.0
2012-12-31 3287.0
What is “actual” interpolation? You are considering leap years, which makes this a non-linear relationship.
Generating a df that starts with the end of the year (and accounts for 2004 as a leap year):
data = pd.DataFrame({'dates': ['2004-12-31', '2008-12-31', '2012-12-31'],
                     'values': [366 + k * (1 + 4 * 365) for k in range(3)]})
data['dates'] = pd.to_datetime(data['dates'])
data = data.set_index('dates')
values
dates
2004-12-31 366
2008-12-31 1827
2012-12-31 3288
Resample and interpolate as before (data = data.resample('A').mean().interpolate()). By the way, 'A' in resample means year end, and 'AS' means year start.
If we look at the difference between each step (data - data.shift(1)), we get:
values
dates
2004-12-31 NaN
2005-12-31 365.25
2006-12-31 365.25
2007-12-31 365.25
2008-12-31 365.25
2009-12-31 365.25
2010-12-31 365.25
2011-12-31 365.25
2012-12-31 365.25
As we would expect from a linear interpolation.
The desired result can be achieved by applying np.floor to the results:
data.resample('A').mean().interpolate().apply(np.floor)
values
dates
2004-12-31 366.0
2005-12-31 731.0
2006-12-31 1096.0
2007-12-31 1461.0
2008-12-31 1827.0
2009-12-31 2192.0
2010-12-31 2557.0
2011-12-31 2922.0
2012-12-31 3288.0
And the difference data - data.shift(1):
values
dates
2004-12-31 NaN
2005-12-31 365.0
2006-12-31 365.0
2007-12-31 365.0
2008-12-31 366.0
2009-12-31 365.0
2010-12-31 365.0
2011-12-31 365.0
2012-12-31 366.0
A non-linear relationship caused by the leap year.
I just came up with an idea and it works:
dailyData=data.asfreq('D').interpolate()
dailyData.groupby(dailyData.index.year).tail(1)
Only for the last year is the wrong date chosen, but that is completely fine for me. The important thing is that the days match the values.

For loops in pandas dataframe, how to filter by days of hourly range and associate a value?

I have a dataframe named df_sub like this:
date open high low close volume
405 2022-01-03 08:00:00 4293.5 4295.5 4291.5
406 2022-01-03 08:01:00 4294.0 4295.5 4294.0
407 2022-01-03 08:02:00 4295.5 4297.5 4295.5
408 2022-01-03 08:03:00 4297.0 4298.0 4296.0
409 2022-01-03 08:04:00 4296.5 4296.5 4295.0
... ... ... ... ... ... ... ... ...
5460 2022-01-07 08:55:00 4311.0 4312.0 4310.5
5461 2022-01-07 08:56:00 4311.5 4311.5 4311.0
5462 2022-01-07 08:57:00 4311.0 4312.0 4310.0
I need to create a loop of this type:
for row in df_sub:
    # take a single day (so, in this case 2022-01-03, 04, ..., 07) and create a column
    # holding that day's df_sub["high"].max() value, so every row of the same day gets
    # the day's maximum high; naturally, on another day the maximum will differ from
    # the previous one because the highs are different.
You can use resample:
df_sub=df_sub.set_index('date')
df_new=df_sub.resample('d')['high'].max()
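If the daily maximum should instead appear as a new column on every row of that day (as the pseudocode in the question suggests), a groupby/transform sketch may be closer; this assumes date is a datetime64 column as in the sample, and the column name day_high_max is only illustrative:
# broadcast the per-day maximum of 'high' back onto every row of that day
df_sub['day_high_max'] = df_sub.groupby(df_sub['date'].dt.normalize())['high'].transform('max')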

Grouping by into 15mins

I am trying to group my dataset into sets of 15 minutes instead of hours. I currently have my yearly data grouped by day hour 0 to 23, but in order to have more data points I would like to do this for every 15 minutes, so that I end up with 00:00, 00:15, ..., 23:45.
This is the first part of my initial dataframe merged:
Price Afnemen Invoeden ... Temperature Precipation NWSE
StartTime ...
2018-06-13 00:00:00 42.30 34.02 34.02 ... 13.60 0.0 N
2018-06-13 00:15:00 42.30 42.57 42.57 ... 13.60 0.0 N
2018-06-13 00:30:00 42.30 42.02 42.02 ... 13.60 0.0 N
2018-06-13 00:45:00 42.30 46.09 46.09 ... 13.60 0.0 N
With the line merged = merged.groupby(merged.index.hour).mean()
I get the hourly means:
StartTime Price Afnemen ... Windspeed Temperature Precipation
0 47.163836 47.910985 ... 3.508562 9.591096 0.045890
1 44.473082 46.274221 ... 3.500000 9.265582 0.041438
2 42.862123 43.309392 ... 3.445205 8.974658 0.060959
However, I would like to get something like:
StartTime Price Afnemen ... Windspeed Temperature Precipation
00:00 (Some value here)
00:15
...
23:45
I thought about using merged.groupby([merged.index.hour, merged.index.minute]).mean(), but that way I would get two index columns. This is not desirable, as the final goal is to plot the data points.
I hope this question is clear, and thanks in advance!
Assuming your index is a DateTimeIndex you can use Grouper:
agg_15m = df.groupby(pd.Grouper(freq='15Min')).mean()
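Note that pd.Grouper(freq='15Min') produces one row per calendar 15-minute interval in the data. If the goal is instead 96 rows, one per time of day (00:00, 00:15, ..., 23:45) averaged over the whole year, a sketch (assuming a DatetimeIndex on a 15-minute grid) is to group on the index's time component:
by_tod = merged.groupby(merged.index.time).mean()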

How to apply a function which gets the dataframe sliced with sliding window as a parameter in pandas?

I've a time series data stored in pandas dataframe, which looks like this:
Date Open High Low Close Volume
0 2016-01-19 22.86 22.92 22.36 22.60 838024
1 2016-01-20 22.19 22.98 21.87 22.77 796745
2 2016-01-21 22.75 23.10 22.62 22.76 573068
3 2016-01-22 23.13 23.35 22.96 23.33 586967
4 2016-01-25 23.22 23.42 23.01 23.26 645551
5 2016-01-26 23.28 23.85 23.22 23.74 592658
6 2016-01-27 23.68 23.78 18.76 20.09 5351850
7 2016-01-28 20.05 20.69 19.11 19.37 2255635
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
9 2016-02-01 19.77 19.80 19.13 19.14 1203375
I want to create an applicable function which gets a slice of the original dataset that it can aggregate with any custom-defined aggregation operator.
Let's say the function is applied like this:
aggregated_df = data.apply(calculateMySpecificAggregation, axis=1)
where calculateMySpecificAggregation gets a 3-row slice of the original dataframe for each row of the original dataframe.
For each row, the dataframe passed as a parameter contains the previous and the next rows of the original dataframe.
# pseudocode example
def calculateMySpecificAggregation(df_slice):
    # I want to know which row this function was applied on
    # (an index I would like to have available here)
    ri = ???  # index of the row this function was applied on
    # df_slice contains 3 rows and all columns
    return float(df_slice["Close"][ri-1] + \
                 ((df_slice["High"][ri] + df_slice["Low"][ri]) / 2) + \
                 df_slice["Open"][ri+1])
    # this line will fail at the borders, but don't worry, I will handle that later...
I want the sliding window size to be parametrized, to have access to the other columns of the row, and to know the row index of the original line the function was applied on.
That means, in case of slidingWindow = 3, I want to have parameter dataframes:
#parameter dataframe when the function is applied on row[0]:
Date Open High Low Close Volume
0 2016-01-19 22.86 22.92 22.36 22.60 838024
1 2016-01-20 22.19 22.98 21.87 22.77 796745
#parameter dataframe when the function is applied on row[1]:
Date Open High Low Close Volume
0 2016-01-19 22.86 22.92 22.36 22.60 838024
1 2016-01-20 22.19 22.98 21.87 22.77 796745
2 2016-01-21 22.75 23.10 22.62 22.76 573068
#parameter dataframe when the function is applied on row[2]:
Date Open High Low Close Volume
1 2016-01-20 22.19 22.98 21.87 22.77 796745
2 2016-01-21 22.75 23.10 22.62 22.76 573068
3 2016-01-22 23.13 23.35 22.96 23.33 586967
#parameter dataframe when the function is applied on row[3]:
Date Open High Low Close Volume
2 2016-01-21 22.75 23.10 22.62 22.76 573068
3 2016-01-22 23.13 23.35 22.96 23.33 586967
4 2016-01-25 23.22 23.42 23.01 23.26 645551
...
#parameter dataframe when the function is applied on row[7]:
Date Open High Low Close Volume
6 2016-01-27 23.68 23.78 18.76 20.09 5351850
7 2016-01-28 20.05 20.69 19.11 19.37 2255635
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
#parameter dataframe when the function is applied on row[8]:
Date Open High Low Close Volume
7 2016-01-28 20.05 20.69 19.11 19.37 2255635
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
9 2016-02-01 19.77 19.80 19.13 19.14 1203375
#parameter dataframe when the function is applied on row[9]:
Date Open High Low Close Volume
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
9 2016-02-01 19.77 19.80 19.13 19.14 1203375
If possible, I don't want to use an explicit loop combined with iloc indexing.
I've experimented with pandas.DataFrame.rolling and pandas.rolling_apply with no success.
Does anyone know how to solve this problem?
OK, after much suffering I've solved the problem.
I couldn't avoid iloc (which is not a big problem in this case), but at least no explicit loop is used here.
contextSizeLeft = 2
contextSizeRight = 3

def aggregateWithContext(df, row, func, contextSizeLeft, contextSizeRight):
    leftBorder = max(0, row.name - contextSizeLeft)
    rightBorder = min(len(df), row.name + contextSizeRight) + 1
    '''
    print("pos: ", row.name, \
          "\t", (row.name-contextSizeLeft, row.name+contextSizeRight), \
          "\t", (leftBorder, rightBorder), \
          "\t", len(df.loc[:][leftBorder : rightBorder]))
    '''
    return func(df.iloc[:][leftBorder : rightBorder], row.name)

def aggregate(df, center):
    print()
    print("center", center)
    print(df["Date"])
    return len(df)

df.apply(lambda x: aggregateWithContext(df, x, aggregate, contextSizeLeft, contextSizeRight), axis=1)
And the same for dates, if anyone needs it:
from datetime import timedelta

def aggregateWithContext(df, row, func, timedeltaLeft, timedeltaRight):
    dateInRecord = row["Date"]
    leftBorder = pd.to_datetime(dateInRecord - timedeltaLeft)
    rightBorder = pd.to_datetime(dateInRecord + timedeltaRight)
    dfs = df[(df['Date'] >= leftBorder) & (df['Date'] <= rightBorder)]
    # print(dateInRecord, ":\t", leftBorder, "\t", rightBorder, "\t", len(dfs))
    return func(dfs, row.name)

def aggregate(df, center):
    # print()
    # print("center", center)
    # print(df["Date"])
    return len(df)

timedeltaLeft = timedelta(days=2)
timedeltaRight = timedelta(days=2)
df.apply(lambda x: aggregateWithContext(df, x, aggregate, timedeltaLeft, timedeltaRight), axis=1)
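As a side note (a sketch, not part of the approach above): the specific aggregation from the question, previous Close plus the mean of the current High/Low plus the next Open, can also be written without any explicit window by shifting the neighbouring rows into place:
# previous row's Close + mean of current High/Low + next row's Open; NaN at the borders
result = data["Close"].shift(1) + (data["High"] + data["Low"]) / 2 + data["Open"].shift(-1)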

How do I stack two DataFrames next to each other in Pandas?

I have two sets of stock data in DataFrames:
> GOOG.head()
Open High Low
Date
2011-01-03 21.01 21.05 20.78
2011-01-04 21.12 21.20 21.05
2011-01-05 21.19 21.21 20.90
2011-01-06 20.67 20.82 20.55
2011-01-07 20.71 20.77 20.27
AAPL.head()
Open High Low
Date
2011-01-03 596.48 605.59 596.48
2011-01-04 605.62 606.18 600.12
2011-01-05 600.07 610.33 600.05
2011-01-06 610.68 618.43 610.05
2011-01-07 615.91 618.25 610.13
and I would like to stack them next to each other in a single DataFrame so I can access and compare columns (e.g. High) across stocks (GOOG vs. AAPL). What is the best way to do this in pandas, and how do I then access the resulting columns (e.g. GOOG's High column and AAPL's High column)? Thanks!
pd.concat is also an option
In [17]: pd.concat([GOOG, AAPL], keys=['GOOG', 'AAPL'], axis=1)
Out[17]:
GOOG AAPL
Open High Low Open High Low
Date
2011-01-03 21.01 21.05 20.78 596.48 605.59 596.48
2011-01-04 21.12 21.20 21.05 605.62 606.18 600.12
2011-01-05 21.19 21.21 20.90 600.07 610.33 600.05
2011-01-06 20.67 20.82 20.55 610.68 618.43 610.05
2011-01-07 20.71 20.77 20.27 615.91 618.25 610.13
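With the column MultiIndex that concat produces, an individual series can be selected by tuple, and one field can be compared across both stocks with xs (the variable name prices is only illustrative):
prices = pd.concat([GOOG, AAPL], keys=['GOOG', 'AAPL'], axis=1)
goog_high = prices[('GOOG', 'High')]        # a single stock's column
highs = prices.xs('High', axis=1, level=1)  # the High column of both GOOG and AAPL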
Have a look at the join method of dataframes, and use the lsuffix and rsuffix arguments to create new names for the joined columns. It works like this:
>>> x
A B C
0 0.838119 -1.116730 0.167998
1 -1.143761 0.051970 0.216113
2 -0.614441 0.208978 -0.630988
3 0.114902 -0.248791 -0.503172
4 0.836523 -0.802074 1.478333
>>> y
A B C
0 -0.455859 -0.488645 -1.618088
1 -2.295255 0.524681 1.021320
2 -0.484612 1.101463 -0.081476
3 -0.475076 0.915797 -0.998777
4 -0.847538 0.057044 1.053533
>>> x.join(y, lsuffix="_x", rsuffix="_y")
A_x B_x C_x A_y B_y C_y
0 0.838119 -1.116730 0.167998 -0.455859 -0.488645 -1.618088
1 -1.143761 0.051970 0.216113 -2.295255 0.524681 1.021320
2 -0.614441 0.208978 -0.630988 -0.484612 1.101463 -0.081476
3 0.114902 -0.248791 -0.503172 -0.475076 0.915797 -0.998777
4 0.836523 -0.802074 1.478333 -0.847538 0.057044 1.053533
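Applied to the stock frames from the question (a sketch, assuming both share the same Date index), the suffixes end up on the overlapping column names:
combined = GOOG.join(AAPL, lsuffix='_GOOG', rsuffix='_AAPL')
# combined['High_GOOG'] and combined['High_AAPL'] hold the two High columns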
