This is related to another question I asked earlier. I want to run Newton's method on a large dataset. Below is the code I created using a loop. I need to run it on ~50 million lines and the loop is quite unwieldy. Is there a more efficient way to run it using Pandas/NumPy/etc.? Thanks in advance.
In:
from pandas import *
from pylab import *
import pandas as pd
import pylab as plt
import numpy as np
from scipy import *
import scipy.optimize
df = DataFrame(list([100,2,34.1556,9,105,-100]))
df = DataFrame.transpose(df)
df = df.rename(columns={0:'Face',1:'Freq',2:'N',3:'C',4:'Mkt_Price',5:'Yield'})
df2= df
df = concat([df, df2])
df = df.reset_index(drop=True)
df
Out:
Face Freq N C Mkt_Price Yield
0 100 2 34.1556 9 105 -100
1 100 2 34.1556 9 105 -100
In:
def Px(Rate):
    return Mkt_Price - (Face * (1 + Rate / Freq) ** (-N) + (C / Rate) * (1 - (1 + Rate / Freq) ** -N))
for count, row in df.iterrows():
    Face = row['Face']
    Freq = row['Freq']
    N = row['N']
    C = row['C']
    Mkt_Price = row['Mkt_Price']
    row['Yield'] = scipy.optimize.newton(Px, .1, tol=.0001, maxiter=100)
df
Out:
Face Freq N C Mkt_Price Yield
0 100 2 34.1556 9 105 0.084419
1 100 2 34.1556 9 105 0.084419
One possibility that comes to mind is to do it vectorized. However, you must then throw away all conditional code and just run a fixed number of iterations.
The basic step in Newton-Raphson is always the same, so you do not need any conditional code, and your function Px looks as if it could be vectorized without any extra effort.
The steps are roughly:
def Px(Rate, Mkt_Price, Face, Freq, N, C):
    return Mkt_Price - (Face * (1 + Rate / Freq) ** (-N) + (C / Rate) * (1 - (1 + Rate / Freq) ** -N))
# initialize the iteration vector with the starting guess
y = 0.1 * np.ones(num_rows)
# just a guess for the differentiation step, might be smaller
h = 1e-6
# then iterate for a suitable number of iterations
for i in range(100):
    f = Px(y, Mkt_Price, Face, Freq, N, C)
    fp = Px(y + h, Mkt_Price, Face, Freq, N, C)
    y -= h * f / (fp - f)
After this you have the iteration results in y. I have assumed Mkt_Price, Face, etc. are 50-million-row vectors.
There will be billions of calculations, so this will still take maybe a dozen seconds. Also, there is no error checking, so if something goes wildly oscillating, there is nothing to warn you about it.
One way to make this better is to calculate the first derivative analytically, as it can be done here. The practical improvement may be small, though. You will have to experiment to find the best number of iterations; if the function converges fast (as I suppose), 20 iterations will be plenty.
The code is completely untested, but it should illustrate the idea.
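To make the idea concrete, here is a minimal end-to-end sketch of how that vectorized iteration could be wired to the DataFrame from the question (untested, and the fixed count of 100 iterations is just a guess):

import numpy as np

def Px(Rate, Mkt_Price, Face, Freq, N, C):
    return Mkt_Price - (Face * (1 + Rate / Freq) ** (-N) + (C / Rate) * (1 - (1 + Rate / Freq) ** -N))

# pull the columns out as flat numpy arrays once, instead of iterating row by row
Face = df['Face'].values
Freq = df['Freq'].values
N = df['N'].values
C = df['C'].values
Mkt_Price = df['Mkt_Price'].values

y = 0.1 * np.ones(len(df))  # starting guess, one value per row
h = 1e-6                    # finite-difference step

for i in range(100):        # fixed number of Newton steps, no convergence check
    f = Px(y, Mkt_Price, Face, Freq, N, C)
    fp = Px(y + h, Mkt_Price, Face, Freq, N, C)
    y -= h * f / (fp - f)

df['Yield'] = y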
Related
I was trying to obtain the Mathieu characteristic values for a specific problem. I do not have any problem obtaining them, and I have read the documentation from Scipy regarding these functions. The problem is that I know for a fact that the points I am obtaining are not right. My script to obtain the characteristic values I need is below:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import mathieu_a, mathieu_b, mathieu_cem, mathieu_sem
M = 1.0
g = 1.0
l = 1.0
h = 0.06
U0 = M * g * l
q = 4 * M * l**2 * U0 / h**2
def energy(n, q):
    if n % 2 == 0:
        return (h**2 / (8 * M * l**2)) * mathieu_a(n, q) + U0
    else:
        return (h**2 / (8 * M * l**2)) * mathieu_b(n + 1, q) + U0
n_list = np.arange(0, 80, 1)
e_n = [energy(i, q) for i in n_list]
plt.plot(n_list, e_n, '.')
The resulting plot of these values is this one. There is a zone where there appears to be "noise" or a numerical error, and I know that those jumps must not occur. In reality, from around x = 40 onwards the points should behave like a staircase of two consecutive points, similar to what can be seen for 70 < x < 80. The values that x can take in this case are only positive integers.
I saw that the implementation of the Mathieu functions has some problems, see here. But that was six years ago! The answer to that question uses the NAG Library for Python, but it is not exactly open-source.
Is there a way I can still use these functions from Scipy without having this problem? Or is it related to the precision I am using to obtain the Mathieu characteristic value?
I have a coupled system of differential equations that I've already solved with Euler in Excel. Now I want to make it more precise with an ODE-solver in python.
However, there must be a mistake in my code because the curves look different than in Excel. I don't expect the curves to reach 1 and 0 in the end.
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
# define reactor
def reactor(x, z):
    n_a = x[0]
    n_b = x[1]
    n_c = x[2]
    dn_adz = A * (-1) * B * (n_a / (n_a + n_b + n_c)) / (1 + C * (n_c / (n_a + n_b + n_c)))
    dn_bdz = A * (1) * B * (n_a / (n_a + n_b + n_c)) / (1 + C * (n_c / (n_a + n_b + n_c)))
    dn_cdz = A * (1) * B * (n_a / (n_a + n_b + n_c)) / (1 + C * (n_c / (n_a + n_b + n_c)))
    dxdz = [dn_adz, dn_bdz, dn_cdz]
    return dxdz
# initial conditions
n_a0 = 0.5775
n_b0 = 0.0
n_c0 = 0.0
x0 = [n_a0, n_b0, n_c0]
# parameters
A = 0.12
B = 3.1e-9
C = 4.02e15
# number of steps
n = 100
# z step interval (m)
z = np.linspace(0,0.0274,n)
# solve ODEs
x = odeint(reactor,x0,z)
# Plot the results
plt.plot(z,x[:,0],'b-')
plt.plot(z,x[:,1],'r--')
plt.plot(z,x[:,2],'k:')
plt.show()
Is it a problem that the initial condition stays constant and does not change from step to step?
Should it be like in Excel with Euler, where the next step uses the conditions/values of the previous step?
From the structure of the right-hand sides you get conserved combinations of the state variables, n_a + n_b = n_a0 + n_b0 and n_a + n_c = n_a0 + n_c0. This means that the dynamic reduces to the one-dimensional dynamic of n_a.
By the first equation, the derivative of n_a is negative for positive n_a, so that the solution is falling towards n_a=0. By the constants of the dynamics, n_b converges to n_a0+n_b0 and n_c converges to n_a0+n_c0.
It is unclear how you get convergence towards 1 in some components, as that is not supported by the initial conditions. Apart from that, the described odeint result fits this qualitative behavior.
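If you want to check this numerically, a small sketch along these lines (reusing the z and x arrays produced by the odeint call above) verifies that the two combinations stay constant along the computed solution:

import numpy as np

# the peak-to-peak range of n_a + n_b and n_a + n_c over the whole
# integration should be (numerically) zero if the combinations are conserved
print(np.ptp(x[:, 0] + x[:, 1]))
print(np.ptp(x[:, 0] + x[:, 2]))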
I have data that has 800,000+ rows. I want to take an Exponential Moving Average (EMA) of one of the columns. The times are not evenly sampled and I want to decay the EMA on each update (row). The code I have is this:
window = 5
for i in range(1, len(series)):
    dt = series['datetime'][i] - series['datetime'][i - 1]
    decay = 1 - numpy.exp(-dt / window)
    result[i] = (1 - decay) * result[i - 1] + decay * series['midpoint'].iloc[i]
return pandas.Series(result, index=series.index)
The problem is, for 800,000 rows, this is very slow. Is there any way to optimize this using some other features of numpy? I can't vectorize it because result[i] depends on result[i - 1].
sample data here:
Timestamp Midpoint
1559655000001096130 2769.125
1559655000001162260 2769.127
1559655000001171688 2769.154
1559655000001408734 2769.138
1559655000001424200 2769.123
1559655000001433128 2769.110
1559655000001541560 2769.125
1559655000001640406 2769.125
1559655000001658436 2769.127
1559655000001755924 2769.129
1559655000001793266 2769.125
1559655000001878688 2769.143
1559655000002061024 2769.125
How about something like the following which takes me 0.34 seconds to run on a series of irregularly spaced data with 900k rows? I am assuming the window of 5 implies a 5 day span.
First, let's create some sample data.
# Create sample data for a price stream of 2.6m price observations sampled 1 second apart.
seconds_per_day = 60 * 60 * 24 # 60 seconds / minute * 60 minutes / hour * 24 hours / day
starting_value = 100
annualized_vol = .3
sampling_percentage = .35 # 35%
start_date = '2018-12-01'
end_date = '2018-12-31'
np.random.seed(0)
idx = pd.date_range(start=start_date, end=end_date, freq='s') # One second intervals.
periodic_vol = annualized_vol * (1/ 252 / seconds_per_day) ** 0.5
daily_returns = np.random.randn(len(idx)) * periodic_vol
cumulative_indexed_return = (1 + daily_returns).cumprod() * starting_value
index_level = pd.Series(cumulative_indexed_return, index=idx)
# Sample 35% of the simulated prices to create a time series of 907k rows with irregular time intervals.
s = index_level.sample(frac=sampling_percentage).sort_index()
Now let's create a generator function to yield the latest value of the exponentially weighted time series. This can run about 4x faster by installing numba, importing it, and then adding the single decorator line @jit(nopython=True) above the function definition.
from numba import jit  # Optional, see below.

@jit(nopython=True)  # Optional, see below.
def ewma_generator(vals, decay_vals):
    result = vals[0]
    yield result
    for val, decay in zip(vals[1:], decay_vals[1:]):
        result = result * (1 - decay) + val * decay
        yield result
Now let's run this generator on the irregularly spaced series s. For this sample with 900k rows, it takes me 1.2 seconds to run the following code. I can further cut the execution time to 0.34 seconds by optionally using the just-in-time compiler from numba; you first need to install that package, e.g. conda install numba. Note that I use a list comprehension to populate the ewma values from the generator, and then assign these values back to the original series after first converting it to a dataframe.
# Assumes time series data is now named `s`.
window = 5 # Span of 5 days?
dt = pd.Series(s.index).diff().dt.total_seconds().div(seconds_per_day) # Measured in days.
decay = (1 - (dt / -window).apply(np.exp))
g = ewma_generator(s.values, decay.values)
result = s.to_frame('midpoint').assign(
    ewma=pd.Series([next(g) for _ in range(len(s))], index=s.index))
>>> result.tail()
midpoint ewma
2018-12-30 23:59:45 103.894471 105.546004
2018-12-30 23:59:49 103.914077 105.545929
2018-12-30 23:59:50 103.901910 105.545910
2018-12-30 23:59:53 103.913476 105.545853
2018-12-31 00:00:00 103.910422 105.545720
>>> result.shape
(907200, 2)
To make sure the numbers follow our intuition, let's visualize the result taking hourly samples. This looks good to me.
obs_per_day = 24 # 24 hourly observations per day.
step = int(seconds_per_day / obs_per_day)
>>> result.iloc[::step, :].plot()
A slight improvement may be obtained by iterating on the underlying numpy arrays instead of on pandas DataFrames and Series:
result = np.empty(len(series))
result[0] = series['midpoint'].iloc[0]  # seed the recursion
window = 5
serdt = series['datetime'].values
sermp = series['midpoint'].values
for i in range(1, len(series)):
    dt = serdt[i] - serdt[i - 1]
    decay = 1 - numpy.exp(-dt / window)
    result[i] = (1 - decay) * result[i - 1] + decay * sermp[i]
return pandas.Series(result, index=series.index)
With your sample data it is about 6 times faster than the original method.
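If that is still not fast enough, the same loop can also be compiled with numba; this is only a sketch (untested on your data) and it assumes the datetime and midpoint columns have already been converted to plain float64 arrays, e.g. timestamps expressed in seconds:

import numpy as np
from numba import njit

@njit
def ewma_irregular(serdt, sermp, window):
    # identical recursion, but compiled to machine code by numba
    result = np.empty(len(sermp))
    result[0] = sermp[0]
    for i in range(1, len(sermp)):
        dt = serdt[i] - serdt[i - 1]
        decay = 1 - np.exp(-dt / window)
        result[i] = (1 - decay) * result[i - 1] + decay * sermp[i]
    return result

The returned array can then be wrapped back into a pandas.Series with the original index, as before.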
I've got a square signal with a frequency f, and I'm interested in the time at which each square pulse starts.
def time_builder(f, t0=0, tf=300):
    """
    Function building the time line in ms between t0 and tf with a frequency f.
    f: Hz
    t0 and tf: ms
    """
    time = [t0]  # /!\ time in ms
    i = 1
    while time[len(time) - 1] < tf:
        if t0 + (i / f) * 1000 < tf:
            time.append(t0 + (i / f) * 1000)
        else:
            break
        i += 1
    return time
So this function loops between t0 and tf to create a list containing the times at which a square pulse starts. I'm quite sure it's not the best way to do it, and I'd like to know how to improve it.
Thanks.
If I am interpreting this correctly, you are looking for a list of the wavefront start times between t0 and tf.
def time_builder(f, t0=0, tf=300):
    """
    Function building the time line in ms between t0 and tf with a frequency f.
    f: Hz
    t0 and tf: ms
    """
    T = 1000 / f                  # period [ms]
    n = int((tf - t0) / T + 0.5)  # integer number of wavefronts, +0.5 added for rounding consistency
    return [t0 + i * T for i in range(n)]
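For example (my numbers, just to illustrate the formula): with f = 5 Hz the period is 200 ms, so between 0 and 300 ms there are two wavefronts:
>>> time_builder(5)
[0.0, 200.0]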
Using standard-library Python for this might not be the best approach, particularly considering that you might want to do other things with the signal later on.
An alternative is to use numpy, which lets you do the following:
import numpy as np
from scipy import signal

t = np.linspace(0, 1, 500, endpoint=False)
s = signal.square(2 * np.pi * 5 * t)  # create a square signal using scipy
d = np.diff(s)  # obtain the differences; this tells you where there is a step
                # (in this particular case, 2 means step up, -2 step down)
starts = t[np.where(d == 2)]  # take the times array t filtered by which
                              # elements in the differences array d equal 2
I'm having trouble getting my code to work. I'm coding Python in a backtesting environment called "Quantopian". Regardless, the .apply(), Series, pd or whatever terminology is beyond my skill level (assuming I'm even on the right track, lol).
What I'm trying to accomplish:
Taking a couple stocks and constantly calculating the MACD. Then when the indicator meets a certain condition, the algo purchases or sells that specific stock.
What the MACD is simplistically:
A momentum indicator that looks at historical data, using 12, 26 and 9 day Exponential Moving Averages and comparing them with each other.
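For reference, the standard calculation can be done directly with pandas exponential moving averages; this is just a minimal sketch (it assumes close is a plain pandas Series of closing prices, not the Quantopian data object):

import pandas as pd

def macd_histogram(close, fast=12, slow=26, signal=9):
    # MACD line = fast EMA - slow EMA; signal line = EMA of the MACD line
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return macd_line - signal_line  # histogram: MACD line minus signal line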
I've designed my own function, that's not my problem...
Help:
I'm trying to apply it to the pool of stocks in my universe to constantly calculate the MACD every minute.
Where I'm specifically confused:
I defined a MACD function but don't know how to get it to calculate every minute for whatever stocks are in my pool.
CODE:
import numpy as np
import math
import talib as ta
import pandas as pd
def initialize(context):
    set_commission(commission.PerTrade(cost=10))
    context.stocks = symbols('AAPL', 'GOOG_L')

def handle_data(context, data):
    for stock in context.stocks:
        prices_fast = data.history(context.stocks, "close", 390, "1m").resample("30min").dropna()
        prices_slow = data.history(context.stocks, "close", 390, "1m").resample("30min").dropna()
        prices_signal = data.history(context.stocks, "close", 390, "1m").resample("30min").dropna()
        curr_price = data.history(context.stocks, "price", 30, "1m").resample("30min")[-1:].dropna()
        series = pd.Series([stock]).dropna()
        macd = series.apply(MACD)
        macd_func = stock.apply(MACD)
        if macd_func[stock] > 0:
            order(stock, 1)
        print macd_func
        record(macd=macd_func[stock])
def MACD(prices_fast, prices_slow, prices_signal, curr_price):
    # Setting MACD Conditions:
    slow = 26
    fast = 12
    signal = 9
    # Calculating Averages:
    avg_fast = pd.rolling_sum(prices_fast[:fast], fast)[-1:] / fast
    avg_slow = pd.rolling_sum(prices_slow[:slow], slow)[-1:] / slow
    avg_signal = pd.rolling_sum(prices_signal[:signal], signal)[-1:] / signal
    # Calculating the Weighting Multipliers:
    A = 2.0 / (fast + 1)
    B = 2.0 / (slow + 1)
    C = 2.0 / (signal + 1)
    # Calculating the Exponential Moving Averages:
    EMA_fast = (curr_price * A) + (avg_fast * (1 - A))
    EMA_slow = (curr_price * B) + (avg_slow * (1 - B))
    EMA_signal = (curr_price * C) + (avg_signal * (1 - C))
    # Calculating MACD Histogram:
    macd = EMA_fast - EMA_slow - EMA_signal
    return macd
If someone could give me a hand, I would GREATLY appreciate it!
Thank you very VERY much,
Mike