Python - Find percent change from the previous 7-day period's average

I have time-series data in a dataframe. Is there any way to calculate for each day the percent change of that day's value from the average of the previous 7 days?
I have tried
df['Change'] = df['Column'].pct_change(periods=7)
However, this simply finds the difference between t and t-7 days. I need something like:
For each value of Ti, find the average of the previous 7 days, and subtract from Ti

Sure, you can for example use:
s = df['Column']
n = 7
mean = s.rolling(n, closed='left').mean()
df['Change'] = (s - mean) / mean
Note on closed='left'
There was a bug prior to pandas 1.2.0 that caused incorrect handling of closed for fixed windows. Make sure you have pandas >= 1.2.0; for example, pandas 1.1.3 will not give the result below.
As described in the docs:
closed: Make the interval closed on the ‘right’, ‘left’, ‘both’ or ‘neither’ endpoints. Defaults to ‘right’.
A simple way to understand is to try with some very simple data and a small window:
a = pd.DataFrame(range(5), index=pd.date_range('2020', periods=5))
b = a.assign(
    sum_left=a.rolling(2, closed='left').sum(),
    sum_right=a.rolling(2, closed='right').sum(),
    sum_both=a.rolling(2, closed='both').sum(),
    sum_neither=a.rolling(2, closed='neither').sum(),
)
>>> b
0 sum_left sum_right sum_both sum_neither
2020-01-01 0 NaN NaN NaN NaN
2020-01-02 1 NaN 1.0 1.0 NaN
2020-01-03 2 1.0 3.0 3.0 NaN
2020-01-04 3 3.0 5.0 6.0 NaN
2020-01-05 4 5.0 7.0 9.0 NaN
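If you are stuck on a pandas version before 1.2.0, shifting the series by one before rolling achieves the same left-closed window without using closed at all. A minimal sketch on the same toy data:

```python
import pandas as pd

s = pd.Series(range(5), index=pd.date_range('2020', periods=5))
n = 2

# shift(1) drops the current day's value out of the window, so rolling(n)
# over the shifted series equals rolling(n, closed='left') on the original
mean_left = s.shift(1).rolling(n).mean()
change = (s - mean_left) / mean_left
```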

Related

how to detect a braking process in python dataframe

I have some trips, and each trip contains different steps; the data frame looks like the following:
tripId duration (s) distance (m) speed Km/h
1819714 NaN NaN NaN
1819714 6.0 8.511452 5.106871
1819714 10.0 6.908963 2.487227
1819714 5.0 15.960625 11.491650
1819714 6.0 26.481649 15.888989
... ... ... ... ...
1865507 6.0 16.280313 9.768188
1865507 5.0 17.347482 12.490187
1865507 5.0 14.266625 10.271970
1865507 6.0 22.884008 13.730405
1865507 5.0 21.565655 15.527271
I want to know if, on a trip X, the cyclist has braked (speed has decreased by at least 30%).
The problem is that the duration between every two steps differs each time.
For example, in 6 seconds the speed of a person X decreased from 28 km/h to 15 km/h; here we can say he has braked, but if the duration were longer, we would not be able to say that.
My question is whether there is a way to detect a braking process across the whole data frame in a way that makes sense.
The measure of braking is the "change in speed" relative to the "change in time". From your data, I created a column 'acceleration', which is the change in speed (km/h) divided by the duration (seconds). The final column then flags braking when the value is less than -1 (km/h/s).
Note that you need to decide whether a reduction of 1 km/h per second is enough to be considered braking.
df['speedChange'] = df['speedKm/h'].diff()
df['acceleration'] = df['speedChange'] / df['duration(s)']
df['braking'] = df['acceleration'].apply(lambda x: 'yes' if x<-1 else 'no')
print(df)
Output:
tripId duration(s) distance(m) speedKm/h speedChange acceleration braking
0 1819714.0 6.0 8.511452 5.106871 NaN NaN no
1 1819714.0 10.0 6.908963 2.487227 -2.619644 -0.261964 no
2 1819714.0 5.0 15.960625 11.491650 9.004423 1.800885 no
3 1819714.0 6.0 26.481649 15.888989 4.397339 0.732890 no
4 1865507.0 6.0 16.280313 9.768188 -6.120801 -1.020134 yes
5 1865507.0 5.0 17.347482 12.490187 2.721999 0.544400 no
6 1865507.0 5.0 14.266625 10.271970 -2.218217 -0.443643 no
7 1865507.0 6.0 22.884008 13.730405 3.458435 0.576406 no
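One side note: a plain diff() also compares across trip boundaries (in the output above, the first row of trip 1865507 is diffed against the last row of trip 1819714). A sketch of a per-trip variant, grouping on tripId before diffing, using a cut-down version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'tripId': [1819714] * 4 + [1865507] * 4,
    'duration(s)': [6.0, 10.0, 5.0, 6.0, 6.0, 5.0, 5.0, 6.0],
    'speedKm/h': [5.106871, 2.487227, 11.491650, 15.888989,
                  9.768188, 12.490187, 10.271970, 13.730405],
})

# diff within each trip: the first step of every trip gets NaN instead of
# being compared against the previous trip's last speed
df['speedChange'] = df.groupby('tripId')['speedKm/h'].diff()
df['acceleration'] = df['speedChange'] / df['duration(s)']
df['braking'] = df['acceleration'] < -1
```

With this grouping, the -1.02 km/h/s "braking" event at the trip boundary in the output above disappears, because it was an artifact of comparing two different trips.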

Pandas indexed Series Subset (of a DataFrame) not changing values

I have the following table:
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'code': ['A121','A121','A121','H812','H812','H812','Z198','Z198','Z198','S222','S222','S222'],
                   'mode': ['stk','sup','cons','stk','sup','cons','stk','sup','cons','stk','sup','cons'],
                   datetime.date(year=2021,month=5,day=1): [4,2,np.nan,2,2,np.nan,6,np.nan,np.nan,np.nan,2,np.nan],
                   datetime.date(year=2021,month=5,day=2): [1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan],
                   datetime.date(year=2021,month=5,day=3): [12,5,np.nan,13,5,np.nan,12,np.nan,np.nan,np.nan,5,np.nan],
                   datetime.date(year=2021,month=5,day=4): [np.nan,1,np.nan,np.nan,4,np.nan,np.nan,np.nan,np.nan,np.nan,7,np.nan]})
df = df.set_index('mode')
I want to achieve the following: I want the rows wherever the mode is cons to be set according to an arithmetic calculation.
cons for the corresponding date and code needs to be set to: prev_date stk - current_date stk + sup
I have tried the code below:
dates = list(df.columns)
dates.remove('code')
for date in dates:
    prev_date = date - datetime.timedelta(days=1)
    if df.loc["stk"].get(prev_date, None) is not None:
        opn_stk = df.loc["stk", prev_date].reset_index(drop=True)
        cls_stk = df.loc["stk", date].reset_index(drop=True)
        sup = df.loc["sup", date].fillna(0).reset_index(drop=True)
        cons = opn_stk - cls_stk + sup
        df.loc["cons", date] = cons
I do not receive any error; however, the cons values do not change at all.
I suspect this is because df.loc["cons", date] is an indexed Series while the calculation opn_stk - cls_stk + sup produces a Series with a different (reset) index.
Any idea how to fix this?
P.S Also I am using loops to calculate this, is there any other vectorized way that would be more efficient
Expected Output
Let's try a groupby apply instead:
def calc_cons(g):
    # Transpose (excluding the 'code' column)
    t = g[g.columns[g.columns != 'code']].T
    # Update cons
    g.loc[g.index == 'cons', g.columns != 'code'] = (-t['stk'].diff() +
                                                     t['sup'].fillna(0)).to_numpy()
    return g
df = df.groupby('code', as_index=False, sort=False).apply(calc_cons)
# print(df[df.index == 'cons'])
print(df)
code 2021-05-01 2021-05-02 2021-05-03 2021-05-04
mode
stk A121 4.0 1.0 12.0 NaN
sup A121 2.0 NaN 5.0 1.0
cons A121 NaN 3.0 -6.0 NaN
stk H812 2.0 3.0 13.0 NaN
sup H812 2.0 NaN 5.0 4.0
cons H812 NaN -1.0 -5.0 NaN
stk Z198 6.0 2.0 12.0 NaN
sup Z198 NaN NaN NaN NaN
cons Z198 NaN 4.0 -10.0 NaN
stk S222 NaN NaN NaN NaN
sup S222 2.0 NaN 5.0 7.0
cons S222 NaN NaN NaN NaN
*Assumes columns are in sorted order by date in 1 day intervals.
Although @Henry Ecker's answer is very elegant, it is very slow compared to mine (over 10x slower), so I would like to go ahead with my implementation, fixed as per Henry Ecker's suggestion: df.loc["cons", date] = cons.to_numpy()
dates = list(df.columns)
dates.remove('code')
for date in dates:
    prev_date = date - datetime.timedelta(days=1)
    if df.loc["stk"].get(prev_date, None) is not None:
        opn_stk = df.loc["stk", prev_date].reset_index(drop=True)   # stock of previous date
        cls_stk = df.loc["stk", date].reset_index(drop=True)        # stock of current date
        sup = df.loc["sup", date].fillna(0).reset_index(drop=True)  # supply of current date
        cons = opn_stk - cls_stk + sup
        df.loc["cons", date] = cons.to_numpy()
Just as a sidenote:
My implementation runs on the full data (not this; I created this as a toy example) in 0:00:00.053309 seconds, while Henry Ecker's implementation ran in 0:00:00.568888 seconds, so more than 10x slower.
This is probably because he iterates over the codes whereas I iterate over dates. At any given point in time I will have at most 30 dates, but there can be more than 500 codes.
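The alignment suspicion in the question is correct: .loc assignment with a Series aligns on index labels, and when the labels do not overlap, the assigned values all become NaN. A minimal sketch with hypothetical data, mirroring the duplicated 'cons' labels:

```python
import pandas as pd

# toy frame mirroring the question: duplicate 'cons' labels in the index
df = pd.DataFrame({'v': [1.0, 2.0]}, index=['cons', 'cons'])
rhs = pd.Series([9.0, 8.0])            # RangeIndex 0, 1 (e.g. after reset_index(drop=True))

df.loc['cons', 'v'] = rhs              # aligns on labels; 0/1 never match 'cons' -> all NaN

df2 = pd.DataFrame({'v': [1.0, 2.0]}, index=['cons', 'cons'])
df2.loc['cons', 'v'] = rhs.to_numpy()  # positional assignment: values land as intended
```

Converting the right-hand side to a NumPy array (as suggested) sidesteps alignment entirely, which is why .to_numpy() fixes the loop.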

a dynamic dataframe range

I am trying to loop through a dataframe, creating dynamic ranges that are limited to the last 6 months of every row index.
Because I am looking back 6 months, I start from the first row whose date is at least 6 months after the first date (row index 0) of the dataframe. The condition I have managed to create is shown below:
for i in df.index:
    if datetime.strptime(df['date'][i], '%Y-%m-%d %H:%M:%S') >= (datetime.strptime(df['date'].iloc[0], '%Y-%m-%d %H:%M:%S') + dateutil.relativedelta.relativedelta(months=6)):
However, this merely creates ranges that grow in size, incorporating all data indexed after the first row that satisfies the condition.
How can I limit the condition statement to only the last 6 months of each row index?
I'm not sure what exactly you want to do once you have your "dynamic ranges".
You can obtain a list of intervals (t - 6mo, t) for each t in your DatetimeIndex:
intervals = [(t - pd.DateOffset(months=6), t) for t in df.index]
But doing selection operations in a big for-loop might be slow.
Instead, you might be interested in pandas's rolling operations. Rolling can even use a date offset (as long as it is a fixed frequency) instead of a fixed-size integer window width. However, "6 months" is not a fixed frequency, and as such the regular rolling won't accept it.
Still, if you are ok with an approximation, say "182 days", then the following might work well.
# setup
n = 10
df = pd.DataFrame(
    {'a': np.arange(n), 'b': np.ones(n)},
    index=pd.date_range('2019-01-01', freq='M', periods=n))
# example: sum
df.rolling('182D', min_periods=0).sum()
# out:
a b
2019-01-31 0.0 1.0
2019-02-28 1.0 2.0
2019-03-31 3.0 3.0
2019-04-30 6.0 4.0
2019-05-31 10.0 5.0
2019-06-30 15.0 6.0
2019-07-31 21.0 7.0
2019-08-31 27.0 6.0
2019-09-30 33.0 6.0
2019-10-31 39.0 6.0
If you want to be strict on the 6 months windows, you can implement your own pandas.api.indexers.BaseIndexer and use that as arg of rolling.
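A sketch of such a custom indexer (the class name and the window convention, (t - 6 months, t] inclusive of the current row, are my own choices, not from pandas):

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class SixMonthIndexer(BaseIndexer):
    """Variable window: all rows within the 6 calendar months before each row."""

    def __init__(self, index):
        super().__init__()
        self.index = index  # the (sorted) DatetimeIndex being rolled over

    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        ends = np.arange(1, num_values + 1, dtype=np.int64)  # include current row
        starts = np.array([
            self.index.searchsorted(self.index[i] - pd.DateOffset(months=6),
                                    side='right')
            for i in range(num_values)
        ], dtype=np.int64)
        return starts, ends

# usage on the monthly toy frame from above
n = 10
df = pd.DataFrame({'a': np.arange(n), 'b': np.ones(n)},
                  index=pd.date_range('2019-01-01', freq='M', periods=n))
result = df.rolling(SixMonthIndexer(index=df.index), min_periods=1).sum()
```

Unlike the '182D' approximation, this respects calendar months exactly (e.g. a window ending on August 31 starts just after the last day of February).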

Python Pandas Dataframe: length of index does not match - df['column'] = ndarray

I have a pandas Dataframe containing EOD financial data (OHLC) for analysis.
I'm using the https://github.com/cirla/tulipy library to generate technical indicator values that take a time period as an option. For example, ADX with timeperiod=5 shows the ADX for the last 5 days.
Because of this time period, the generated array of indicator values is always shorter than the DataFrame, since the prices of the first 5 days are used to generate the ADX for day 6.
pdi14, mdi14 = ti.di(
high=highData, low=lowData, close=closeData, period=14)
df['mdi_14'] = mdi14
df['pdi_14'] = pdi14
>> ValueError: Length of values does not match length of index
Unfortunately, unlike TA-LIB for example, this tulip library does not provide NaN-values for these first couple of empty days...
Is there an easy way to prepend these NaN to the ndarray?
Or insert into df at a certain index & have it create NaN for the rows before it automatically?
Thanks in advance, I've been researching for days!
Maybe make the shift yourself in the code?
period = 14
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=period
)
df['mdi_14'] = np.nan
df.loc[df.index[period - 1:], 'mdi_14'] = mdi14
I hope they will fill the first values with NAN in the lib in the future. It's dangerous to leave time series data like this without any label.
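The same shift can be wrapped in a small generic helper (pad_left is a name I made up, not part of tulipy), so any indicator output can be prepended with NaNs to match the frame's length:

```python
import numpy as np

def pad_left(values, total_len):
    """Prepend NaNs so a shorter indicator array lines up with the full index."""
    out = np.full(total_len, np.nan)
    out[total_len - len(values):] = values
    return out

padded = pad_left(np.array([1.0, 2.0]), 5)  # three leading NaNs, then 1.0, 2.0
```

With this, the assignment becomes df['mdi_14'] = pad_left(mdi14, len(df)).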
Full MCVE
df = pd.DataFrame(1, range(10), list('ABC'))
a = np.full((len(df) - 6, df.shape[1]), 2)
b = np.full((6, df.shape[1]), np.nan)
c = np.vstack([b, a])
d = pd.DataFrame(c, df.index, df.columns)
d
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 2.0 2.0 2.0
7 2.0 2.0 2.0
8 2.0 2.0 2.0
9 2.0 2.0 2.0
The C version of the tulip library includes a start function for each indicator (reference: https://tulipindicators.org/usage) that can be used to determine the output length of an indicator given a set of input options. Unfortunately, it does not appear that the python bindings library, tulipy, includes this functionality. Instead you have to resort to dynamically reassigning your index values to align the output with the original DataFrame.
Here is an example that uses the price series from the tulipy docs:
import numpy as np
import pandas as pd
import tulipy as ti

#Create the dataframe with close prices (a list, not a set: a set literal would not preserve the price order)
prices = pd.DataFrame(data=[81.06, 81.59, 82.87, 83, 83.61, 83.15, 82.84, 83.99, 84.55,
                            84.36, 85.53, 86.54, 86.89, 87.77, 87.29], columns=['close'])
#Compute the technical indicator using tulipy and save the result in a DataFrame
bbands = pd.DataFrame(data=np.transpose(ti.bbands(real = prices['close'].to_numpy(), period = 5, stddev = 2)))
#Dynamically realign the index; note from the tulip library documentation that the price/volume data is expected be ordered "oldest to newest (index 0 is oldest)"
bbands.index += prices.index.max() - bbands.index.max()
#Put the indicator values with the original DataFrame
prices[['BBANDS_5_2_low', 'BBANDS_5_2_mid', 'BBANDS_5_2_up']] = bbands
prices.head(15)
close BBANDS_5_2_low BBANDS_5_2_mid BBANDS_5_2_up
0 81.06 NaN NaN NaN
1 81.59 NaN NaN NaN
2 82.87 NaN NaN NaN
3 83.00 NaN NaN NaN
4 83.61 80.530042 82.426 84.321958
5 83.15 81.494061 82.844 84.193939
6 82.84 82.533343 83.094 83.654657
7 83.99 82.471983 83.318 84.164017
8 84.55 82.417750 83.628 84.838250
9 84.36 82.435203 83.778 85.120797
10 85.53 82.511331 84.254 85.996669
11 86.54 83.142618 84.994 86.845382
12 86.89 83.536488 85.574 87.611512
13 87.77 83.870324 86.218 88.565676
14 87.29 85.288871 86.804 88.319129

Pandas apply based on conditional from another column

I'm looking to adjust values of one column based on a conditional in another column.
I'm using np.busday_count, but I don't want the weekend values to behave like a Monday (Sat to Tues is given 1 working day, I'd like that to be 2)
dispdf = df[(df.dispatched_at.isnull()==False) & (df.sold_at.isnull()==False)]
dispdf["dispatch_working_days"] = np.busday_count(dispdf.sold_at.tolist(), dispdf.dispatched_at.tolist())
for i in range(len(dispdf)):
    if dispdf.dayofweek.iloc[i] == 5 or dispdf.dayofweek.iloc[i] == 6:
        dispdf.dispatch_working_days.iloc[i] += 1
Sample:
dayofweek dispatch_working_days
43159 1.0 3
48144 3.0 3
45251 6.0 1
49193 3.0 0
42470 3.0 1
47874 6.0 1
44500 3.0 1
43031 6.0 3
43193 0.0 4
43591 6.0 3
Expected Results:
dayofweek dispatch_working_days
43159 1.0 3
48144 3.0 3
45251 6.0 2
49193 3.0 0
42470 3.0 1
47874 6.0 2
44500 3.0 1
43031 6.0 2
43193 0.0 4
43591 6.0 4
At the moment I'm using this for loop to add a working day to Saturday and Sunday values. It's slow!
Can I use vectorization instead to speed this up? I tried using .apply but to no avail.
Pretty sure this works, but there are more optimized implementations:
def adjust_dispatch(df_line):
    if df_line['dayofweek'] >= 5:
        return df_line['dispatch_working_days'] + 1
    else:
        return df_line['dispatch_working_days']

df['dispatch_working_days'] = df.apply(adjust_dispatch, axis=1)
The for loop in your code could be replaced by this line (note >= 5, so that both Saturday (5) and Sunday (6) are caught):
dispdf.loc[dispdf.dayofweek >= 5, 'dispatch_working_days'] += 1
or you could use numpy.where
https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
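For completeness, the numpy.where version looks like this (the toy data is made up for illustration, mirroring the sample above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dayofweek': [1.0, 6.0, 3.0, 5.0],
                   'dispatch_working_days': [3, 1, 0, 2]})

# add one working day wherever dispatch fell on Saturday (5) or Sunday (6)
df['dispatch_working_days'] = np.where(df['dayofweek'] >= 5,
                                       df['dispatch_working_days'] + 1,
                                       df['dispatch_working_days'])
```

Both the .loc mask and np.where are fully vectorized, so either will be far faster than the row-by-row loop.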
