Biggest rise in a time window in a time series - python

I'm wondering if there is a fast way to get the biggest rise in a time series within a given window.
The intended code is:
import datetime
import numpy as np
import pandas as pd
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
data = np.random.randint(low=1, high=10, size=len(date_list))
df = pd.DataFrame({'date': date_list, 'value': data})
def biggest_rise(df, windowsize=10):
    '''Gets the biggest rise within the specified window size.'''
    # Some magic code here
    return df.rolling(window=windowsize).max()  # placeholder, not the actual logic

I don't really get what you mean by 'biggest rise', but using rolling may be helpful. For example, this gets the difference between the maximum and minimum value within a 10-day window:
ts = df.sort_values('date').set_index('date')
ts.rolling('10d').max() - ts.rolling('10d').min()

I think I found the answer... as per the code below. I upped the high to 10K to really see the changes:
import datetime
import numpy as np
import pandas as pd
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
data = np.random.randint(low=1, high=10000, size=len(date_list))
df = pd.DataFrame({'date': date_list, 'value': data})
window = 10
dfs = [df.iloc[i: i + window] for i in range(len(df) - window + 1)]  # include the final full window
biggest_rise = max(d.value.max() - d.value.min() for d in dfs)
Takes 112 ms for 365 data points. Anything better is welcome.
One caveat: the biggest_rise could actually be the biggest fall in the window, since max - min ignores which comes first. I don't know how to differentiate.
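One way to differentiate, assuming a 'rise' means the low must come before the high: compare each value to the rolling minimum of the trailing window, so the reference minimum always sits at or before the current point. A vectorized sketch (variable names are mine):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.integers(low=1, high=10000, size=365))

window = 10
# Rise from the lowest value in the trailing window to the current value;
# the trailing min is at or before the current point, so this never
# measures a fall.
rises = s - s.rolling(window, min_periods=1).min()
biggest_rise = rises.max()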

Here is a better answer to get the maximum rise, using @TywinLannister88's suggestion:
import datetime
import numpy as np
import pandas as pd
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
data = np.random.randint(low=1, high=10000, size=len(date_list))
df = pd.DataFrame({'date': date_list, 'value': data})
# 10-day rolling window: spread between the max and min in the window
s = df.sort_values('date').set_index('date')['value']
df1 = s.rolling('10d').max() - s.rolling('10d').min()
# percent change vs. the value 10 rows earlier, to see if there is a rise or fall
df2 = s.pct_change(periods=10)
# filter out the rises (pctchange > 0) and find the maximum rise
df3 = df.sort_values('date').set_index('date').assign(delta=df1, pctchange=df2)
biggest_rise = df3[df3.pctchange > 0].pctchange.max()
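For reference, pct_change(periods=n) is row-based, not calendar-based: it compares each row to the row n positions earlier (with daily data the two coincide). A minimal illustration:
import pandas as pd

s = pd.Series([100, 110, 121])
# (s[t] - s[t-1]) / s[t-1] for each row; the first row has no
# predecessor, so it is NaN
print(s.pct_change(periods=1))  # NaN, 0.10, 0.10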

Related

Compute the rolling mean in pandas

I ran the following code:
import numpy as np
import pandas as pd
#make this example reproducible
np.random.seed(0)
#create dataset
period = np.arange(1, 101, 1)
leads = np.random.uniform(1, 20, 100)
sales = 60 + 2*period + np.random.normal(loc=0, scale=.5*period, size=100)
df = pd.DataFrame({'period': period, 'leads': leads, 'sales': sales})
#view first 10 rows
df.head(10)
df['rolling_sales_5'] = df['sales'].rolling(5, center=True, min_periods=1).mean()
df.head(10)
But I do not understand how the first two and last two observations of the rolling_sales_5 variable are generated. Any idea?
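What is likely happening (my reading of the rolling parameters, not something stated in the thread): with center=True the window for row i spans rows i-2 through i+2, and min_periods=1 lets the mean be taken over however many of those rows actually exist, so the windows simply shrink at both edges. A small demo:
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
r = s.rolling(5, center=True, min_periods=1).mean()
# Row 0 averages rows 0..2: (10 + 20 + 30) / 3 = 20.0
# Row 1 averages rows 0..3: (10 + 20 + 30 + 40) / 4 = 25.0
# Rows in the middle use the full 5-row window.
print(r)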

How can I calculate a weighted moving average using yfinance and pandas

I want to compare the 50 day moving average and 50 day weighted moving average of a company.
import yfinance as yf
import datetime as dt
start = '2021-05-01'  # format: YYYY-MM-DD
end = dt.datetime.now()  # today
stock = 'AMD'
df = yf.download(stock, start, end, interval='1h')
This is just to set up the data frame.
The code below adds a column to the data frame with the moving average, but I have been unsuccessful trying to do the same for a weighted moving average.
df['50MA'] = df.iloc[:, 4].rolling(window=50).mean()
This is what I have, which is incorrect:
for i in range(len(df.index)):
    df['W50MA'] = (df.iloc[i, 4]) * (df.iloc[i, 5] / sum(df.iloc[:, 5]))
You could try something like this:
import numpy as np

weights = np.arange(1, 51) / 100
sum_weights = np.sum(weights)

def weighted_ma(value):
    return np.sum(weights * value) / sum_weights

df['50WMA'] = df.iloc[:, 4].rolling(window=50).apply(weighted_ma)
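For comparison, a sketch of the same idea using np.average, which divides by the weight sum internally (the linear 1..50 weights are the same assumption as above):
import numpy as np

weights = np.arange(1, 51)  # the constant /100 divisor cancels out, so it can be dropped
df['50WMA'] = df.iloc[:, 4].rolling(window=50).apply(
    lambda window: np.average(window, weights=weights))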

pandas calculate delta time

Here's some code that will generate some random data and chart it, plus lines representing the 30th and 90th percentiles.
import pandas as pd
import numpy as np
from numpy.random import randint
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(10) # added for reproducibility
rng = pd.date_range('10/9/2018 00:00', periods=10, freq='1H')
df = pd.DataFrame({'Random_Number':randint(1, 100, 10)}, index=rng)
df.plot()
plt.axhline(df.quantile(0.3)[0], linestyle="--", color="g")
plt.axhline(df.quantile(0.90)[0], linestyle="--", color="r")
plt.show()
Output: (minus the highlighted part of the chart)
I'm trying to figure out if it's possible to calculate the time it takes in the data to go from the green line to the red line (highlighted yellow).
I can manually enter in the data:
minStart = df.loc[df['Random_Number'] < 18].index[0]
maxStart = df.loc[df['Random_Number'] > 90].index[0]
hours = maxStart - minStart
hours
Which will output:
Timedelta('0 days 05:00:00')
But if I attempt to use:
minStart = df.loc[df['Random_Number'] < df.quantile(0.3)].index[0]
maxStart = df.loc[df['Random_Number'] > df.quantile(0.90)].index[0]
hours = maxStart - minStart
hours
This throws a ValueError: Can only compare identically-labeled Series objects.
Is there a better method to this madness? Ideally it would be nice to create some sort of algorithm that can calculate the delta time it takes to go from the 30th to the 90th percentile, and then back from the 90th to the 30th. But I may have to put some thought towards how that could be accomplished.
minStart = df.loc[df['Random_Number'] < df.quantile(0.3)[0]].index[0]
maxStart = df.loc[df['Random_Number'] > df.quantile(0.90)[0]].index[0]
hours = maxStart - minStart
hours
df.quantile doesn't return a scalar here; it returns a Series, so you need to take its first entry before comparing.
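As for the round-trip timing, here is a rough sketch under my own assumptions (pair each dip below the 30th percentile with the next breach of the 90th; the names below/above are mine):
low, high = df['Random_Number'].quantile([0.3, 0.9])

# Timestamps below the 30th percentile and above the 90th
below = df.index[df['Random_Number'] < low]
above = df.index[df['Random_Number'] > high]

# For each dip, find the next breach of the 90th percentile;
# searchsorted gives the insertion position of each dip time in `above`
pos = above.searchsorted(below)
deltas = [above[p] - t for t, p in zip(below, pos) if p < len(above)]
print(deltas)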

How to apply euclidean distance to a dataframe, calculated for each row

Please help me, I have a problem. It's been about 2 weeks and I still don't get it.
So, I want to use apply on a DataFrame, which I got from the Alpha Vantage API.
I want to apply the Euclidean distance to each row of the DataFrame.
import math
import numpy as np
import pandas as pd
from scipy.spatial import distance
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.neighbors import KNeighborsRegressor
from alpha_vantage.timeseries import TimeSeries
from services.KEY import getApiKey
ts = TimeSeries(key=getApiKey(), output_format='pandas')
And in my picture I got this:
My chart (sorry, I can't post the image because of my reputation)
In my code:
stock, meta_data = ts.get_daily_adjusted(symbol, outputsize='full')
stock = stock.sort_values('date')
open = stock['1. open'].values
low = stock['3. low'].values
high = stock['2. high'].values
close = stock['4. close'].values
sorted_date = stock.index.get_level_values(level='date')
stock_numpy_format = np.stack((sorted_date, open, low, high, close), axis=1)
df = pd.DataFrame(stock_numpy_format, columns=['date', 'open', 'low', 'high', 'close'])
df = df[df['open']>0]
df = df[(df['date'] >= "2016-01-01") & (df['date'] <= "2018-12-31")]
df = df.reset_index(drop=True)
df['close_next'] = df['close'].shift(-1)
df['daily_return'] = df['close'].pct_change(1)
df['daily_return'].fillna(0, inplace=True)
stock_numeric_close_dailyreturn = df[['close', 'daily_return']]  # double brackets to select two columns
stock_normalized = (stock_numeric_close_dailyreturn - stock_numeric_close_dailyreturn.mean()) / stock_numeric_close_dailyreturn.std()
date_normalized = stock_normalized[df['date'] == "2016-06-29"]  # reference row, as in the target code below
euclidean_distances = stock_normalized.apply(lambda row: distance.euclidean(row, date_normalized), axis=1)
distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx":euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_date = df.loc[int(second_smallest)]["date"]
And I want my chart to look like this:
The chart that I want
And here is the code from that picture:
distance_columns = ['Close', 'DailyReturn']
stock_numeric = stock[distance_columns]
stock_normalized = (stock_numeric - stock_numeric.mean()) / stock_numeric.std()
stock_normalized.fillna(0, inplace = True)
date_normalized = stock_normalized[stock["Date"] == "2016-06-29"]
euclidean_distances = stock_normalized.apply(lambda row: distance.euclidean(row, date_normalized), axis = 1)
distance_frame = pandas.DataFrame(data = {"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_date = stock.loc[int(second_smallest)]["Date"]
I tried to figure it out: df.apply behaves differently on the DataFrame in pandas format than on the one from the pandas CSV reader.
Is there any alternative that gives the same output for both formats (pandas and csv)?
Thank you!
NB: sorry if my English is bad.
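One likely culprit, independent of the API-versus-CSV question: selecting two columns needs double brackets, and date_normalized must be defined before the apply call, as noted in the comments above. A minimal illustration of the selection issue:
import pandas as pd

df = pd.DataFrame({'close': [1.0, 2.0], 'daily_return': [0.0, 1.0]})
# df['close', 'daily_return']   # KeyError: looks for one column named ('close', 'daily_return')
df[['close', 'daily_return']]   # selects both columns as a DataFrame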

How to more efficiently calculate a rolling ratio

My data is over 3000 rows long.
Below is the code for computing a 20-day value (the volume ratio in the stock market).
It took more than 2 minutes.
Is there a good way to reduce the running time?
import pandas as pd
import numpy as np
from pandas_datareader import DataReader  # pandas.io.data has been removed from pandas
import matplotlib.pylab as plt
data = DataReader('047040.KS','yahoo',start='2010')
data['vr']=0
data['Volume Ratio']=0
data['acend']=0
data['vr'] = np.sign(data['Close']-data['Open'])
data['vr'] = np.where(data['vr']==0,0.5,data['vr'])
data['vr'] = np.where(data['vr']<0,0,data['vr'])
data['acend'] = np.multiply(data['Volume'],data['vr'])
for i in range(len(data['Open'])):
    if i < 19:
        data['Volume Ratio'][i] = 0
    else:
        data['Volume Ratio'][i] = (sum(data['acend'][i-19:i]) /
                                   (sum(data['Volume'][i-19:i]) - sum(data['acend'][i-19:i]))) * 100
Consider using conditional row selection and rolling.sum():
rolling_acend = data['acend'].rolling(window=20).sum()
rolling_volume = data['Volume'].rolling(window=20).sum()
data.loc[data.index[:20], 'Volume Ratio'] = 0
data.loc[data.index[20:], 'Volume Ratio'] = rolling_acend / (rolling_volume - rolling_acend) * 100
or, simplified: .rolling(window=20).sum() produces NaN for the first 19 rows, so just use .fillna(0):
data['new_col'] = (data['acend'].rolling(window=20).sum()
                   .div(data['Volume'].rolling(window=20).sum()
                        .sub(data['acend'].rolling(window=20).sum()))
                   .mul(100)
                   .fillna(0))
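One caveat, based on my reading of the loop rather than anything stated here: sum(data['acend'][i-19:i]) covers the 19 rows before row i and excludes row i itself, while rolling(window=20) covers 20 rows including row i. A sketch that matches the loop's alignment exactly (the new column name is mine):
# 19-row trailing sums that exclude the current row, matching the
# original loop's slices data[...][i-19:i]
acend_19 = data['acend'].shift(1).rolling(window=19).sum()
volume_19 = data['Volume'].shift(1).rolling(window=19).sum()
data['vr_loop_aligned'] = (acend_19 / (volume_19 - acend_19) * 100).fillna(0)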
