I have time series data similar to this:
val
2015-10-15 7.85
2015-10-16 8
2015-10-19 8.18
2015-10-20 5.39
2015-10-21 2.38
2015-10-22 1.98
2015-10-23 9.25
2015-10-26 14.29
2015-10-27 15.52
2015-10-28 15.93
2015-10-29 15.79
2015-10-30 13.83
How can I find the slope between adjacent rows (e.g. 8 and 7.85) of the val variable and print it in a different column, in R or Python?
I know the formula for a slope, slope = (y2 - y1) / (x2 - x1),
but the problem is how to take the difference of the x values (which are dates) in time series data
(Here x is the date and y is val)
If by slope you mean (value(y) - value(x)) / (y - x) for consecutive dates x and y, then you will have one slope fewer than the number of rows in your data frame, so it will be difficult to show it in the same data frame without padding.
In R, this would be my answer:
slope <- numeric(length = nrow(df))
for(i in 2:nrow(df)){
  slope[i-1] <- (df[i-1, "val"] - df[i, "val"]) / as.numeric(df[i-1, 1] - df[i, 1])
}
slope[nrow(df)]<-NA
df$slope<-slope
Edit (answering your edit)
In R, Date is a class of data (like integer, numeric, or character).
For example I can define a vector of dates:
x<-as.Date(c("2015-10-15","2015-10-16"))
print( x )
[1] "2015-10-15" "2015-10-16"
And the difference of 2 dates returns:
x[2]-x[1]
Time difference of 1 days
As you mentioned, you cannot divide by a date:
2/(x[2]-x[1])
Error in `/.difftime`(2, (x[2] - x[1])) :
second argument cannot be "difftime"
That is why I used as.numeric, which forces the vector to be a numeric value (in days):
2/as.numeric(x[2]-x[1])
[1] 2
To prove that it works:
as.numeric(as.Date("2016-10-15")-as.Date("2015-10-16"))
[1] 365
(2016 being a leap year, 365 days is correct here: a full year from 2015-10-16 to 2016-10-16 would be 366 days, and 2016-10-15 is one day short of that.)
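The question also asks about Python, so here is a minimal pandas sketch of the same calculation (a sketch, not part of the original answer; the column names are assumed from the sample data):
import pandas as pd

# Sketch: the same adjacent-row slope in pandas, assuming a "date" column
# holding datetimes and a "val" column holding the values.
df = pd.DataFrame(
    {"date": pd.to_datetime(["2015-10-15", "2015-10-16", "2015-10-19"]),
     "val": [7.85, 8.0, 8.18]}
)

# diff() gives row-to-row differences; .dt.days converts the date gap
# to a plain number, playing the same role as as.numeric() in R.
df["slope"] = df["val"].diff() / df["date"].diff().dt.days
print(df)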
My SQL table:
SDATETIME FE014BPV FE011BPV
0 2022-05-28 5.770000 13.735000
1 2022-05-30 16.469999 42.263000
2 2022-05-31 56.480000 133.871994
3 2022-06-01 49.779999 133.561996
4 2022-06-02 45.450001 132.679001
.. ... ... ...
93 2022-09-08 0.000000 0.050000
94 2022-09-09 0.000000 0.058000
95 2022-09-10 0.000000 0.051000
96 2022-09-11 0.000000 0.050000
97 2022-09-12 0.000000 0.038000
My code:
import pandas as pd
import pyodbc
monthSQL = pd.read_sql_query('SELECT SDATETIME,max(FE014BPV) as flow,max(FE011BPV) as steam FROM [SCADA].[dbo].[TOTALIZER] GROUP BY SDATETIME ORDER BY SDATETIME ASC', conn)
monthdata = monthSQL.groupby(monthSQL['SDATETIME'].dt.strftime("%b-%Y"), sort=True).sum()
print(monthdata)
This produces the following incorrect output:
flow steam
SDATETIME
Aug-2022 1800.970001 2580.276996
Jul-2022 1994.300014 2710.619986
Jun-2022 3682.329998 7633.660018
May-2022 1215.950003 3098.273025
Sep-2022 0.000000 1.705000
I want output something like below:
SDATETIME flow steam
May-2022 1215.950003 3098.273025
Jun-2022 3682.329998 7633.660018
Jul-2022 1994.300014 2710.619986
Aug-2022 1800.970001 2580.276996
Sep-2022 0.000000 1.705000
Also, I need a sum of the last 12 months of data.
The output is correct, just not in the order you expect. Try this:
# This keeps SDATETIME as datetime, not string
monthdata = monthSQL.groupby(pd.Grouper(key="SDATETIME", freq="MS")).sum()
# Rolling sum of the last 12 months
monthdata = pd.concat(
    [
        monthdata,
        monthdata.add_suffix("_LAST12").rolling("366D").sum(),
    ],
    axis=1,
)
# Keep SDATETIME as datetime for as long as you need to manipulate the
# dataframe in Python. When you need to export it, convert it to
# string
monthdata.index = monthdata.index.strftime("%b-%Y")
About the rolling(...) operation: it's easy to think that rolling(12) gives you the rolling sum of the last 12 months, given that each row represents a month. In fact, it returns the rolling sum of the last 12 rows. This matters because, if there are gaps in your data, 12 rows may cover more than 12 months. rolling("366D") makes sure that it only counts rows within the last 366 days, which is the maximum length of any 12-month period.
We can't use rolling("12M") because months do not have a fixed duration: a month has between 28 and 31 days.
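To illustrate the difference, here is a small sketch with a made-up monthly index that is missing two months (the data is hypothetical, not from the question):
import pandas as pd

# Hypothetical monthly index that is missing Jun-2021 and Jul-2021.
idx = pd.date_range("2021-01-01", "2022-03-01", freq="MS")
idx = idx.drop([pd.Timestamp("2021-06-01"), pd.Timestamp("2021-07-01")])
s = pd.Series(1.0, index=idx)

# rolling(12) sums the last 12 rows, which here span more than 12 months.
print(s.rolling(12).sum().iloc[-1])      # 12.0
# rolling("366D") only sums rows that fall within the last 366 days.
print(s.rolling("366D").sum().iloc[-1])  # 11.0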
The month names are being sorted alphabetically rather than chronologically - you need to tell pandas how to order them. You can see this from the starting letters of the month names:
SDATETIME
Aug-2022 # A goes before J, M, S in the alphabet
Jul-2022 # J goes after A, but before M and S in the alphabet
Jun-2022 # J goes after A, but before M and S in the alphabet
May-2022 # M goes after A, J but before S in the alphabet
Sep-2022 # S goes after A, J, M in the alphabet
To sort by actual month order, one option is to build a dictionary mapping each month name to its number and sort with it via the .apply() method:
month_dict = {'Jan-2022':1,'Feb-2022':2,'Mar-2022':3, 'Apr-2022':4, 'May-2022':5, 'Jun-2022':6, 'Jul-2022':7, 'Aug-2022':8, 'Sep-2022':9, 'Oct-2022':10, 'Nov-2022':11, 'Dec-2022':12}
df = df.sort_values('SDATETIME', key=lambda x: x.apply(lambda m: month_dict[m]))
print(df)
SDATETIME flow steam
May-2022 1215.950003 3098.273025
Jun-2022 3682.329998 7633.660018
Jul-2022 1994.300014 2710.619986
Aug-2022 1800.970001 2580.276996
Sep-2022 0.000000 1.705000
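As an alternative sketch (not part of the answer above): instead of hard-coding a month dictionary, you can parse the "Mon-YYYY" strings back to datetimes inside the sort key. The sample values below are made up for illustration.
import pandas as pd

# Sketch: sort the month strings chronologically by parsing them back
# to datetimes inside the sort key.
df = pd.DataFrame({
    "SDATETIME": ["Aug-2022", "Jul-2022", "Jun-2022", "May-2022", "Sep-2022"],
    "flow": [1800.97, 1994.30, 3682.33, 1215.95, 0.0],
})
df = df.sort_values("SDATETIME", key=lambda s: pd.to_datetime(s, format="%b-%Y"))
print(df)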
I have a pandas dataset with date being the index and a number in the value column. There is one year's worth of data.
How can I find the area (integral) below and above each date's value for the next two months using scipy.integrate?
E.g. If 2009-01-01 has 5 as the value, I am trying to find the integral below and above 5 for the next two months, depending on the points for the next two months.
EDIT: I guess I don't know what to use as the function since the function is unknown and I only have points to use to integrate. I am thinking I may have to integrate for each day and sum up for the two months?
Below is a sample of my dataset:
DATE Y
2008-01-01 4
2008-01-02 10.4
2008-01-03 2
2008-01-04 9
2008-01-05 4.3
2008-01-06 7
2008-01-07 8.2
2008-01-08 5
2008-01-09 6.5
2008-01-10 2.3
...
2008-02-28 6.6
2008-03-01 7
2008-03-02 5.4
My objective is to start from 2008-01-01 with a value of 4, use that as the reference point, and find the integral below and above 4 (i.e. between 4 and each day's y value) for the next two months. So it is not a rolling integral but a forward-looking one.
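One possible sketch (not the asker's code; the helper function below is hypothetical): take the value on the reference date, slice the next two months, and integrate the positive and negative parts of the difference with scipy.integrate.trapezoid.
import numpy as np
import pandas as pd
from scipy.integrate import trapezoid

# Sketch: area above/below the reference date's value over the next two months.
def area_above_below(series, ref_date, months=2):
    ref = series.loc[ref_date]
    end = pd.Timestamp(ref_date) + pd.DateOffset(months=months)
    window = series.loc[ref_date:end]
    x = (window.index - window.index[0]).days   # x in days from the reference date
    y = window.to_numpy() - ref                 # y relative to the reference value
    above = trapezoid(np.clip(y, 0, None), x)   # area where the curve is above ref
    below = trapezoid(np.clip(y, None, 0), x)   # (negative) area where it is below ref
    return above, below

# Example with made-up daily data:
idx = pd.date_range("2008-01-01", "2008-03-31", freq="D")
s = pd.Series(np.random.default_rng(0).uniform(2, 10, len(idx)), index=idx)
print(area_above_below(s, "2008-01-01"))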
Maximum Drawdown is a common risk metric used in quantitative finance to assess the largest negative return that has been experienced.
Recently, I became impatient with the time to calculate max drawdown using my looped approach.
def max_dd_loop(returns):
    """returns is assumed to be a pandas series"""
    max_so_far = None
    start, end = None, None
    r = returns.add(1).cumprod()
    for r_start in r.index:
        for r_end in r.index:
            if r_start < r_end:
                current = r.loc[r_end] / r.loc[r_start] - 1
                if (max_so_far is None) or (current < max_so_far):
                    max_so_far = current
                    start, end = r_start, r_end
    return max_so_far, start, end
I'm familiar with the common perception that a vectorized solution would be better.
The questions are:
can I vectorize this problem?
What does this solution look like?
How beneficial is it?
Edit
I modified Alexander's answer into the following function:
def max_dd(returns):
    """Assumes returns is a pandas Series"""
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
df_returns is assumed to be a dataframe of returns, where each column is a separate strategy/manager/security, and each row is a new date (e.g. monthly or daily).
cum_returns = (1 + df_returns).cumprod()
drawdown = 1 - cum_returns.div(cum_returns.cummax())
I had first suggested using an .expanding() window, but that's obviously not necessary with the .cumprod() and .cummax() built-ins to calculate max drawdown up to any given point:
import numpy as np
import pandas as pd
from datetime import date

df = pd.DataFrame(data={'returns': np.random.normal(0.001, 0.05, 1000)},
                  index=pd.date_range(start=date(2016, 1, 1), periods=1000, freq='D'))
df['cumulative_return'] = df.returns.add(1).cumprod().subtract(1)
df['max_drawdown'] = df.cumulative_return.add(1).div(df.cumulative_return.cummax().add(1)).subtract(1)
returns cumulative_return max_drawdown
2016-01-01 -0.014522 -0.014522 0.000000
2016-01-02 -0.022769 -0.036960 -0.022769
2016-01-03 0.026735 -0.011214 0.000000
2016-01-04 0.054129 0.042308 0.000000
2016-01-05 -0.017562 0.024004 -0.017562
2016-01-06 0.055254 0.080584 0.000000
2016-01-07 0.023135 0.105583 0.000000
2016-01-08 -0.072624 0.025291 -0.072624
2016-01-09 -0.055799 -0.031919 -0.124371
2016-01-10 0.129059 0.093020 -0.011363
2016-01-11 0.056123 0.154364 0.000000
2016-01-12 0.028213 0.186932 0.000000
2016-01-13 0.026914 0.218878 0.000000
2016-01-14 -0.009160 0.207713 -0.009160
2016-01-15 -0.017245 0.186886 -0.026247
2016-01-16 0.003357 0.190869 -0.022979
2016-01-17 -0.009284 0.179813 -0.032050
2016-01-18 -0.027361 0.147533 -0.058533
2016-01-19 -0.058118 0.080841 -0.113250
2016-01-20 -0.049893 0.026914 -0.157492
2016-01-21 -0.013382 0.013173 -0.168766
2016-01-22 -0.020350 -0.007445 -0.185681
2016-01-23 -0.085842 -0.092648 -0.255584
2016-01-24 0.022406 -0.072318 -0.238905
2016-01-25 0.044079 -0.031426 -0.205356
2016-01-26 0.045782 0.012917 -0.168976
2016-01-27 -0.018443 -0.005764 -0.184302
2016-01-28 0.021461 0.015573 -0.166797
2016-01-29 -0.062436 -0.047836 -0.218819
2016-01-30 -0.013274 -0.060475 -0.229189
... ... ... ...
2018-08-28 0.002124 0.559122 -0.478738
2018-08-29 -0.080303 0.433921 -0.520597
2018-08-30 -0.009798 0.419871 -0.525294
2018-08-31 -0.050365 0.348359 -0.549203
2018-09-01 0.080299 0.456631 -0.513004
2018-09-02 0.013601 0.476443 -0.506381
2018-09-03 -0.009678 0.462153 -0.511158
2018-09-04 -0.026805 0.422960 -0.524262
2018-09-05 0.040832 0.481062 -0.504836
2018-09-06 -0.035492 0.428496 -0.522411
2018-09-07 -0.011206 0.412489 -0.527762
2018-09-08 0.069765 0.511031 -0.494817
2018-09-09 0.049546 0.585896 -0.469787
2018-09-10 -0.060201 0.490423 -0.501707
2018-09-11 -0.018913 0.462235 -0.511131
2018-09-12 -0.094803 0.323611 -0.557477
2018-09-13 0.025736 0.357675 -0.546088
2018-09-14 -0.049468 0.290514 -0.568542
2018-09-15 0.018146 0.313932 -0.560713
2018-09-16 -0.034118 0.269104 -0.575700
2018-09-17 0.012191 0.284576 -0.570527
2018-09-18 -0.014888 0.265451 -0.576921
2018-09-19 0.041180 0.317562 -0.559499
2018-09-20 0.001988 0.320182 -0.558623
2018-09-21 -0.092268 0.198372 -0.599348
2018-09-22 -0.015386 0.179933 -0.605513
2018-09-23 -0.021231 0.154883 -0.613888
2018-09-24 -0.023536 0.127701 -0.622976
2018-09-25 0.030160 0.161712 -0.611605
2018-09-26 0.025528 0.191368 -0.601690
Given a time series of returns, we need to evaluate the aggregate return for every combination of starting point to ending point.
The first trick is to convert a time series of returns into a series of return indices. Given a series of return indices, I can calculate the return over any sub-period with the return index at the beginning ri_0 and at the end ri_1. The calculation is: ri_1 / ri_0 - 1.
The second trick is to produce a second series of inverses of return indices. If r is my series of return indices then 1 / r is my series of inverses.
The third trick is to take the matrix product of r * (1 / r).Transpose.
r is an n x 1 matrix. (1 / r).Transpose is a 1 x n matrix. The resulting product contains every combination of ri_j / ri_k. Just subtract 1 and I've actually got returns.
The fourth trick is to ensure that I'm constraining my denominator to represent periods prior to those being represented by the numerator.
Below is my vectorized function.
import numpy as np
import pandas as pd
def max_dd(returns):
    # make into a DataFrame so that it is a 2-dimensional
    # matrix such that I can perform an nx1 by 1xn matrix
    # multiplication and end up with an nxn matrix
    r = pd.DataFrame(returns).add(1).cumprod()
    # I copy r.T to ensure r's index is not the same
    # object as 1 / r.T's columns object
    x = r.dot(1 / r.T.copy()) - 1
    x.columns.name, x.index.name = 'start', 'end'
    # let's make sure we only calculate a return when start
    # is less than end.
    y = x.stack().reset_index()
    y = y[y.start < y.end]
    # my choice is to return the periods and the actual max
    # draw down
    z = y.set_index(['start', 'end']).iloc[:, 0]
    start, end = z.idxmin()
    return z.min(), start, end
How does this perform?
For the vectorized solution, I ran 10 iterations over time series of lengths [10, 50, 100, 150, 200]. The times are below:
10: 0.032 seconds
50: 0.044 seconds
100: 0.055 seconds
150: 0.082 seconds
200: 0.047 seconds
The same test for the looped solution is below:
10: 0.153 seconds
50: 3.169 seconds
100: 12.355 seconds
150: 27.756 seconds
200: 49.726 seconds
Edit
Alexander's answer provides superior results. The same test using his modified code:
10: 0.000 seconds
50: 0.000 seconds
100: 0.004 seconds
150: 0.007 seconds
200: 0.008 seconds
I modified his code into the following function:
def max_dd(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end
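For illustration, here is a small usage sketch of this function on synthetic data (the returns below are made up):
import numpy as np
import pandas as pd

# Usage sketch with synthetic daily returns.
rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(0.001, 0.05, 500),
                    index=pd.date_range("2016-01-01", periods=500, freq="D"))
mdd, start, end = max_dd(returns)
print(f"max drawdown of {mdd:.2%} from {start.date()} to {end.date()}")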
I recently had a similar issue, but instead of a global MDD, I was required to find the MDD for the interval after each peak. Also, in my case, I was supposed to take the MDD of each strategy alone and thus wasn't required to apply the cumprod. My vectorized implementation is also based on Investopedia.
def calc_MDD(networth):
    df = pd.Series(networth, name="nw").to_frame()
    max_peaks_idx = df.nw.expanding(min_periods=1).apply(lambda x: x.argmax()).fillna(0).astype(int)
    df['max_peaks_idx'] = pd.Series(max_peaks_idx).to_frame()
    nw_peaks = pd.Series(df.nw.iloc[max_peaks_idx.values].values, index=df.nw.index)
    df['dd'] = ((df.nw - nw_peaks) / nw_peaks)
    df['mdd'] = df.groupby('max_peaks_idx').dd.apply(lambda x: x.expanding(min_periods=1).apply(lambda y: y.min())).fillna(0)
    return df
Here is a sample after running this code:
nw max_peaks_idx dd mdd
0 10000.000 0 0.000000 0.000000
1 9696.948 0 -0.030305 -0.030305
2 9538.576 0 -0.046142 -0.046142
3 9303.953 0 -0.069605 -0.069605
4 9247.259 0 -0.075274 -0.075274
5 9421.519 0 -0.057848 -0.075274
6 9315.938 0 -0.068406 -0.075274
7 9235.775 0 -0.076423 -0.076423
8 9091.121 0 -0.090888 -0.090888
9 9033.532 0 -0.096647 -0.096647
10 8947.504 0 -0.105250 -0.105250
11 8841.551 0 -0.115845 -0.115845
And here is an image of the result applied to the complete dataset.
Although vectorized, this code is probably slower than the others, because a time series typically has many peaks and each one requires its own calculation, giving roughly O(n_peaks * n_intervals).
PS: I could have eliminated the zero values in the dd and mdd columns, but I find it useful that these values help indicate when a new peak was observed in the time-series.
I have a pandas dataframe with dates and strings similar to this:
Start End Note Item
2016-10-22 2016-11-05 Z A
2017-02-11 2017-02-25 W B
I need to expand/transform it to the below, filling in weeks (W-SAT) in between the Start and End columns and forward filling the data in Note and Items:
Start Note Item
2016-10-22 Z A
2016-10-29 Z A
2016-11-05 Z A
2017-02-11 W B
2017-02-18 W B
2017-02-25 W B
What's the best way to do this with pandas? Some sort of multi-index apply?
You can iterate over each row, create a new dataframe for it, and then concatenate them together:
pd.concat([pd.DataFrame({'Start': pd.date_range(row.Start, row.End, freq='W-SAT'),
                         'Note': row.Note,
                         'Item': row.Item}, columns=['Start', 'Note', 'Item'])
           for i, row in df.iterrows()], ignore_index=True)
Start Note Item
0 2016-10-22 Z A
1 2016-10-29 Z A
2 2016-11-05 Z A
3 2017-02-11 W B
4 2017-02-18 W B
5 2017-02-25 W B
You don't need iteration at all.
df_start_end = df.melt(id_vars=['Note', 'Item'], value_name='date')
df = (df_start_end.groupby('Note')
      .apply(lambda x: x.set_index('date').resample('W-SAT').ffill())
      .drop(columns=['Note', 'variable'])
      .reset_index())
If the number of unique values of df['End'] - df['Start'] is not too large, but the number of rows in your dataset is large, then the following function will be much faster than looping over your dataset:
import numpy as np
import pandas as pd


def date_expander(dataframe: pd.DataFrame,
                  start_dt_colname: str,
                  end_dt_colname: str,
                  time_unit: str,
                  new_colname: str,
                  end_inclusive: bool) -> pd.DataFrame:
    td = pd.Timedelta(1, time_unit)

    # add a timediff column:
    dataframe['_dt_diff'] = dataframe[end_dt_colname] - dataframe[start_dt_colname]

    # get the maximum timediff:
    max_diff = int((dataframe['_dt_diff'] / td).max())

    # for each possible timediff, get the intermediate time-differences:
    df_diffs = pd.concat([pd.DataFrame({'_to_add': np.arange(0, dt_diff + end_inclusive) * td})
                          .assign(_dt_diff=dt_diff * td)
                          for dt_diff in range(max_diff + 1)])

    # join to the original dataframe
    data_expanded = dataframe.merge(df_diffs, on='_dt_diff')

    # the new dt column is just start plus the intermediate diffs:
    data_expanded[new_colname] = data_expanded[start_dt_colname] + data_expanded['_to_add']

    # remove start-end cols, as well as temp cols used for calculations:
    to_drop = [start_dt_colname, end_dt_colname, '_to_add', '_dt_diff']
    if new_colname in to_drop:
        to_drop.remove(new_colname)
    data_expanded = data_expanded.drop(columns=to_drop)

    # don't modify dataframe in place:
    del dataframe['_dt_diff']

    return data_expanded
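For example, here is a usage sketch on the sample frame from the earlier question (the argument values are my assumptions, not part of the original answer):
import pandas as pd

# Usage sketch on the question's sample data; time_unit="W" steps in 7-day
# increments, which lands on Saturdays because the Start dates are Saturdays.
df = pd.DataFrame({
    "Start": pd.to_datetime(["2016-10-22", "2017-02-11"]),
    "End": pd.to_datetime(["2016-11-05", "2017-02-25"]),
    "Note": ["Z", "W"],
    "Item": ["A", "B"],
})
expanded = date_expander(df, start_dt_colname="Start", end_dt_colname="End",
                         time_unit="W", new_colname="Start", end_inclusive=True)
print(expanded.sort_values("Start"))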
So I recently spent a bit of time trying to figure out an efficient pandas-based approach to this issue (which is very trivial with data.table in R) and wanted to share the approach I came up with here:
df.set_index("Note").apply(
lambda row: pd.date_range(row["Start"], row["End"], freq="W-SAT").values, axis=1
).explode()
Note: using .values makes a big difference in performance!
There are quite a few solutions here already and I wanted to compare the speed for different numbers of rows and periods - see results (in seconds) below:
n_rows is the number of initial rows and n_periods is the number of periods per row, i.e. the window size: the combinations below always result in 1 million rows when expanded
the other columns are named after the posters of the solutions
note I made a slight tweak to Gen's approach whereby, after pd.melt(), I do df.set_index("date").groupby("Note").resample("W-SAT").ffill() - I labelled this Gen2 and it seems to perform slightly better and gives the same result
each n_rows, n_periods combination was run 10 times and the results were then averaged
Anyway, jwdink's solution looks like a winner when there are many rows and few periods, whereas my solution seems to do better on the other end of the spectrum, though it is only marginally ahead of the others as the number of rows decreases:
n_rows  n_periods  jwdink  TedPetrou    Gen   Gen2  robbie
   250       4000    6.63       0.33   0.64   0.45    0.28
   500       2000    3.21       0.65   1.18   0.81    0.34
  1000       1000    1.57       1.28   2.30   1.60    0.48
  2000        500    0.83       2.57   4.68   3.24    0.71
  5000        200    0.40       6.10  13.26   9.59    1.43
If you want to run your own tests on this, my code is available in my GitHub repo - note I created a DateExpander class object that wraps all the functions to make it easier to scale the simulation.
Also, for reference, I used a 2-core STANDARD_DS11_V2 Azure VM - only for about 10 minutes, so this is literally me giving my 2 cents on the issue!
I'm trying to plot RINEX (GPS) data and am very new to pandas and NumPy. Here is a snippet of my code:
#Plotting of the data
pr1 = sat['P1']
pr2 = sat['P2']
calc_pr = pr1 - (((COEFF_3)**2) * pr2)
plt.plot(calc_pr,label='calc_pr')
where "sat" is a Dataframe as follows:
sat:
Panel: <class 'pandas.core.panel.Panel'>
Dimensions: 32 (items) x 2880 (major_axis) x 7 (minor_axis)
Items axis: G01 to G32
Major_axis axis: 0.0 to 23.9916666667
Minor_axis axis: L1 to S2
where each Item (G01, G02, etc) corresponds to:
(G01)
DataFrame: L1 L2 P1 P2 C1 \
0.000000 669486.833 530073.330 24568752.516 24568762.572 24568751.442
0.008333 786184.519 621006.551 24590960.634 24590970.218 24590958.374
0.016667 902916.181 711966.252 24613174.234 24613180.219 24613173.065
0.025000 1019689.006 802958.016 24635396.428 24635402.410 24635395.627
The first column (I assume this is the major axis, which I manipulated with epoch_time = int((hour * 3600) + (minute * 60) + second)) gives the time. These are 30-second intervals over 24 h; they were originally epochs (0 to 2880). The first epochs of "calc_pr" are shown below:
Series: 0.000000 26529.507524
0.008333 31432.322196
0.016667 36336.563310
0.025000 41242.536096
0.033333 46149.208022
0.041667 51057.059006
0.050000 55965.873639
0.058333 60875.510720
0.066667 65785.965112
0.075000 70697.114838
However, when plotting these with plt.plot(calc_pr, label='calc_pr'), the x-axis is displayed in epochs instead of time in hours. I've tried different ways of manipulating "calc_pr" so that the times are displayed rather than epoch numbers, but so far to no avail. Could someone indicate where/how I can change this?
Looks like I solved this myself in the end. I realised I just need to plot the "index", like so:
plt.plot(sat.index, calc_pr, label='calc_pr')
I love it when I solve my own problems. It means I'm getting even more awesome.