I am given the following bond:
and need to fit the Vasicek model to this data.
My attempt is the following:
# ... imports
years = np.array([1, 2, 3, 4, 7, 10])
pric = np.array([0, .93, .85, .78, .65, .55, .42])
X = sympy.symbols("a b sigma")
a, b, s = X
rt1_rt = np.diff(pric)
ab_rt = np.array([a*(b-r) for r in pric[1:] ])
term = rt1_rt - ab_rt
def normpdf(x, mean, sd):
    var = sd**2
    denom = (2*sym.pi*var)**.5
    num = sym.E**(-(x-mean)**2/(2*var))
    return num/denom
pdfs = np.array([sym.log(normpdf(x, 0, s)) for x in term])
func = 0
for el in pdfs:
    func += el
func = func.factor()
lmd = sym.lambdify(X, func)
def target_fun(params):
    return lmd(*params)
result = scipy.optimize.least_squares(target_fun, [10, 10, 10])
I don't think that it outputs the correct solution.
Your code is almost correct.
You want to maximize your function, so you need to place a minus sign in front of lmd in your function.
def target_fun(params):
    return -lmd(*params)
Additionally, the initial values are usually set to less than 1. Picking 10 is not the best choice as the algorithm might converge to a saddle point.
Consider [0.01, 0.01, 0.01].
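For comparison, here is a minimal numeric sketch of the same idea: maximize the Gaussian log-likelihood of the residuals r[t+1] - r[t] - a*(b - r[t]) by minimizing its negative. It swaps in scipy.optimize.minimize (which minimizes a scalar, whereas least_squares expects a vector of residuals) and assumes a unit time step. The rates array is made up for illustration and is not the bond data from the question; sigma is optimized on a log scale so that it stays positive.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up short-rate observations, for illustration only (not the question's data).
rates = np.array([0.030, 0.034, 0.031, 0.036, 0.039, 0.035, 0.038, 0.041])

def neg_log_likelihood(params, r):
    """Negative Gaussian log-likelihood of r[t+1] - r[t] - a*(b - r[t]), unit time step."""
    a, b, log_sigma = params
    sigma = np.exp(log_sigma)  # optimizing log(sigma) keeps sigma > 0
    resid = np.diff(r) - a * (b - r[:-1])
    n = resid.size
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

# Small starting values, as suggested above.
x0 = [0.01, 0.01, np.log(0.01)]
result = minimize(neg_log_likelihood, x0=x0, args=(rates,), method="Nelder-Mead")
a_hat, b_hat = result.x[0], result.x[1]
sigma_hat = np.exp(result.x[2])
print(a_hat, b_hat, sigma_hat)
```

Whether this discretization is appropriate for bond prices rather than observed short rates is a separate modelling question that the sketch does not address.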
The objective is to find the point of intersection of two linear equations. These two linear equations are derived using the NumPy polyfit function.
Given two time series (xLeft, yLeft) and (xRight, yRight), the linear least squares fit to each of them was calculated using polyfit as shown below:
xLeft = [
6168, 6169, 6170, 6171, 6172, 6173, 6174, 6175, 6176, 6177,
6178, 6179, 6180, 6181, 6182, 6183, 6184, 6185, 6186, 6187
]
yLeft = [
0.98288751, 1.3639959, 1.7550986, 2.1539073, 2.5580614,
2.9651523, 3.3727503, 3.7784295, 4.1797948, 4.5745049,
4.9602985, 5.3350167, 5.6966233, 6.0432272, 6.3730989,
6.6846867, 6.9766307, 7.2477727, 7.4971657, 7.7240791
]
xRight = [
6210, 6211, 6212, 6213, 6214, 6215, 6216, 6217, 6218, 6219,
6220, 6221, 6222, 6223, 6224, 6225, 6226, 6227, 6228, 6229,
6230, 6231, 6232, 6233, 6234, 6235, 6236, 6237, 6238, 6239,
6240, 6241, 6242, 6243, 6244, 6245, 6246, 6247, 6248, 6249,
6250, 6251, 6252, 6253, 6254, 6255, 6256, 6257, 6258, 6259,
6260, 6261, 6262, 6263, 6264, 6265, 6266, 6267, 6268, 6269,
6270, 6271, 6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279,
6280, 6281, 6282, 6283, 6284, 6285, 6286, 6287, 6288]
yRight=[
7.8625913, 7.7713094, 7.6833806, 7.5997391, 7.5211883,
7.4483986, 7.3819046, 7.3221073, 7.2692747, 7.223547,
7.1849418, 7.1533613, 7.1286001, 7.1103559, 7.0982385,
7.0917811, 7.0904517, 7.0936642, 7.100791, 7.1111741,
7.124136, 7.1389918, 7.1550579, 7.1716633, 7.1881566,
7.2039142, 7.218349, 7.2309117, 7.2410989, 7.248455,
7.2525721, 7.2530937, 7.249711, 7.2421637, 7.2302341,
7.213747, 7.1925621, 7.1665707, 7.1356878, 7.0998487,
7.0590014, 7.0131001, 6.9621005, 6.9059525, 6.8445964,
6.7779589, 6.7059474, 6.6284504, 6.5453324, 6.4564347,
6.3615761, 6.2605534, 6.1531439, 6.0391097, 5.9182019,
5.7901659, 5.6547484, 5.5117044, 5.360805, 5.2018456,
5.034656, 4.8591075, 4.6751242, 4.4826899, 4.281858,
4.0727611, 3.8556159, 3.6307325, 3.3985188, 3.1594861,
2.9142516, 2.6635408, 2.4081881, 2.1491354, 1.8874279,
1.6242117,1.3607255,1.0982931,0.83831298
]
left_line = np.polyfit(xLeft, yLeft, 1)
right_line = np.polyfit(xRight, yRight, 1)
In this case, polyfit outputs the coefficients m and b for y = mx + b, in that order.
The intersection of the two linear equations then can be calculated as follows:
x0 = -(left_line[1] - right_line[1]) / (left_line[0] - right_line[0])
y0 = x0 * left_line[0] + left_line[1]
However, I wonder whether there is a NumPy built-in approach to calculate the last two steps?
Not exactly a built-in approach, but you can simplify the problem. Say I have lines given by y = m1 * x + b1 and y = m2 * x + b2. You can trivially find an equation for the difference, which is also a line:
y = (m1 - m2) * x + (b1 - b2)
Notice that this line will have a root at the intersection of the two original lines, if they intersect. You can use the numpy.polynomial.Polynomial class to perform these operations:
>>> (np.polynomial.Polynomial(left_line[::-1]) - np.polynomial.Polynomial(right_line[::-1])).roots()
array([6192.0710885])
Notice that I had to swap the order of the coefficients, since Polynomial expects smallest to largest, while np.polyfit returns the opposite. In fact, np.polyfit is not recommended. Instead, you can get Polynomial objects directly using the np.polynomial.Polynomial.fit class method. Your code would then look like:
left_line = np.polynomial.Polynomial.fit(xLeft, yLeft, 1, domain=[-1, 1])
right_line = np.polynomial.Polynomial.fit(xRight, yRight, 1, domain=[-1, 1])
x0 = (left_line - right_line).roots()
y0 = left_line(x0)
The domain is mapped to the window [-1, 1]. If you do not specify a domain, the peak-to-peak of the x-values will be used instead. You do not want this, since it will result in a mapping of the input values. Instead, we explicitly specify that the domain [-1, 1] maps to the same window. An alternative would be to use the default domain and set e.g. window=[xLeft.min(), xLeft.max()]. The problem with this approach is that it would then create different domains for the polynomials, preventing the operation left_line - right_line.
See https://numpy.org/doc/stable/reference/routines.polynomials.classes.html for more information.
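As a small check of the Polynomial.fit approach, the sketch below fits two exactly linear data sets and finds the root of their difference. The lines y = 2x + 1 and y = -x + 7 and the names line_a and line_b are chosen here purely for illustration; they should meet at (2, 5).

```python
import numpy as np
from numpy.polynomial import Polynomial

# Illustrative data lying exactly on y = 2x + 1 and y = -x + 7.
x = np.array([0.0, 1.0, 2.0, 3.0])
line_a = Polynomial.fit(x, 2.0 * x + 1.0, 1, domain=[-1, 1])
line_b = Polynomial.fit(x, -1.0 * x + 7.0, 1, domain=[-1, 1])

# The difference of the two fits is itself a line; its root is the intersection.
x0 = (line_a - line_b).roots()[0]
y0 = line_a(x0)
print(x0, y0)
```

Because both fits share the same domain and window, the subtraction is allowed and the coefficients can be read directly as intercept and slope.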
You can model it as a linear system and use simple linear algebra:
def get_intersection(m1,b1,m2,b2):
    A = np.array([[-m1, 1], [-m2, 1]])
    b = np.array([[b1], [b2]])
    # you have to solve linear System AX = b where X = [x y]'
    X = np.linalg.pinv(A) @ b
    x, y = np.round(np.squeeze(X), 4)
    return x, y # returns point of intersection (x,y) with 4 decimal precision
# left_line and right_line are the np.polyfit coefficient arrays [m, b]
m1,b1,m2,b2 = left_line[0], left_line[1], right_line[0], right_line[1]
print(get_intersection(m1,b1,m2,b2))
As an example, for lines y - x = 1, and y + x = 1, we expect the intersection as (0,1):
m1,b1,m2,b2 = 1, 1, -1, 1
print(get_intersection(m1,b1,m2,b2))
Output: (0.0, 1.0) as expected.
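When the two lines are known not to be parallel, the 2x2 system is square and invertible, so np.linalg.solve can be used in place of the pseudo-inverse. The following is only a sketch of that variant; the function name intersect_lines is made up here.

```python
import numpy as np

def intersect_lines(m1, b1, m2, b2):
    """Intersection of y = m1*x + b1 and y = m2*x + b2 (lines must not be parallel)."""
    # Rewrite each line as -m*x + y = b and solve for [x, y].
    A = np.array([[-m1, 1.0], [-m2, 1.0]])
    rhs = np.array([b1, b2], dtype=float)
    x, y = np.linalg.solve(A, rhs)
    return x, y

x, y = intersect_lines(1, 1, -1, 1)  # lines y - x = 1 and y + x = 1
print(x, y)
```

For parallel lines the matrix is singular and np.linalg.solve raises numpy.linalg.LinAlgError, which may be preferable to the silent least-squares answer that pinv would return.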
I have a Python list containing continuous values (from 0 to 1020) that I'd like to discretize into ordinal values from 0 to 5 using the K-Means strategy.
I have used the new class sklearn.preprocessing.KBinsDiscretizer to do that:
def descritise_kmeans(python_arr, num_bins):
    X = np.array(python_arr).reshape(-1, 1)
    est = KBinsDiscretizer(n_bins=num_bins, encode='ordinal', strategy='kmeans')
    est.fit(X)
    Xt = est.transform(X)
    return Xt
When running this method, I got this error:
/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/_discretization.py in transform(self, X)
262 atol = 1.e-8
263 eps = atol + rtol * np.abs(Xt[:, jj])
--> 264 Xt[:, jj] = np.digitize(Xt[:, jj] + eps, bin_edges[jj][1:])
265 np.clip(Xt, 0, self.n_bins_ - 1, out=Xt)
266
ValueError: bins must be monotonically increasing or decreasing
Looking closely at this, it seems that numpy.digitize is the function that throws the error. This seems to be a bug in the scikit-learn library.
When the number of bins n_bins is 6, the error is thrown. However, when n_bins is 5, it works.
I faced a similar problem and found that my mistake was in setting the values for the bins. My code is simple:
bins = np.array([0.0, .33, 66, 1])
data = [0.1, .2, .4, .5, .7, 8]
inds = np.digitize(data, bins, right=False)
I missed a dot before .66 and my bins were not monotonic. While it may not be the source of the problem in this question, I hope it helps someone.
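A quick way to catch this kind of typo is to check the bin edges before calling np.digitize. The sketch below uses the corrected bins from this answer; the final data value is written as .8 here on the assumption that the 8 above is also a typo.

```python
import numpy as np

bad_bins = np.array([0.0, .33, 66, 1])     # the typo from above: 66 instead of .66
good_bins = np.array([0.0, .33, .66, 1])
data = [0.1, .2, .4, .5, .7, .8]           # last value assumed to be .8

def is_strictly_increasing(bins):
    """True when every edge is larger than the one before it."""
    return bool(np.all(np.diff(bins) > 0))

print(is_strictly_increasing(bad_bins))    # False
print(is_strictly_increasing(good_bins))   # True
inds = np.digitize(data, good_bins, right=False)
print(inds)                                # [1 1 2 2 3 3]
```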
Makeshift solution:
Edit sklearn's source code, replacing the transform function in sklearn/preprocessing/_discretization.py with the version below.
It is at line 237 as of version '0.20.2'
def transform(self, X):
    """Discretizes the data.
    Parameters
    ----------
    X : numeric array-like, shape (n_samples, n_features)
        Data to be discretized.
    Returns
    -------
    Xt : numeric array-like or sparse matrix
        Data in the binned space.
    """
    check_is_fitted(self, ["bin_edges_"])
    Xt = check_array(X, copy=True, dtype=FLOAT_DTYPES)
    n_features = self.n_bins_.shape[0]
    if Xt.shape[1] != n_features:
        raise ValueError("Incorrect number of features. Expecting {}, "
                         "received {}.".format(n_features, Xt.shape[1]))
    def ensure_monotic_increase(array):
        """
        add small noise to the bin_edges[i]
        when bin_edges[i] !> bin_edges[i-1]
        """
        noise_overlay = np.zeros(array.shape)
        for i in range(1,len(array)):
            bigger = array[i]>array[i-1]
            if bigger:
                pass
            else:
                noise_overlay[i] = abs(array[i-1] * 0.0001)
        return(array+noise_overlay)
    bin_edges = self.bin_edges_
    for jj in range(Xt.shape[1]):
        # Values which are close to a bin edge are susceptible to numeric
        # instability. Add eps to X so these values are binned correctly
        # with respect to their decimal truncation. See documentation of
        # numpy.isclose for an explanation of ``rtol`` and ``atol``.
        rtol = 1.e-5
        atol = 1.e-8
        eps = atol + rtol * np.abs(Xt[:, jj])
        old_bin_edges = bin_edges[jj][1:]
        try:
            Xt[:, jj] = np.digitize(Xt[:, jj] + eps, old_bin_edges)
        except ValueError:
            new_bin_edges = ensure_monotic_increase(old_bin_edges)
            #print(old_bin_edges)
            #print(new_bin_edges)
            try:
                Xt[:, jj] = np.digitize(Xt[:, jj] + eps, new_bin_edges)
            except:
                raise
    np.clip(Xt, 0, self.n_bins_ - 1, out=Xt)
    if self.encode == 'ordinal':
        return Xt
    return self._encoder.transform(Xt)
The issue (that I encountered)
The bin edges were too close to each other. Possibly, through some kind of floating point error, the prior bin edge ends up larger than the next bin edge.
When printing the edges (uncomment the print statements in the above function), the first 2 bin edges appeared equal to each other. The printed bin_edges were:
[-0.1025641 -0.1025641 0.82793522] # ValueError
[-0.1025641 -0.10255385 0.82793522] # After fix
[0.2075 0.2075 0.88798077] # ValueError
[0.2075 0.20752075 0.88798077] # After fix
[ 0.7899066 0.7899066 24.31967669] # ValueError
[ 0.7899066 0.78998559 24.31967669] # After fix
[5.47545572e-18 5.47545572e-18 2.36842105e-01] # ValueError
[5.47545572e-18 5.47600326e-18 2.36842105e-01] # After fix
[5.47545572e-18 5.47545572e-18 2.82894737e-01] # ValueError
[5.47545572e-18 5.47600326e-18 2.82894737e-01] # After fix
[-0.46762302 -0.46762302 -0.00969465] # ValueError
[-0.46762302 -0.46757626 -0.00969465] # After fix
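The failure mode guessed at above can be reproduced outside scikit-learn. This sketch builds edges where the second one is smaller than the first by a single floating point step (which prints as equal), shows that np.digitize rejects them, and then repairs them with a running maximum. Using np.maximum.accumulate is an alternative chosen here for illustration; it is not what the patched transform above does, and the test values are made up.

```python
import numpy as np

# Second edge is one floating point step below the first, so both print as 0.2075.
edges = np.array([0.2075, np.nextafter(0.2075, 0.0), 0.88798077])
values = np.array([0.1, 0.5, 0.9])  # made-up values to bin

try:
    np.digitize(values, edges)
    raised = False
except ValueError:
    raised = True  # "bins must be monotonically increasing or decreasing"

# A running maximum makes the edges non-decreasing without adding noise.
fixed_edges = np.maximum.accumulate(edges)
binned = np.digitize(values, fixed_edges)
print(raised, binned)
```

Unlike adding noise, the running maximum leaves duplicate edges in place, which np.digitize accepts; the bin between two equal edges is simply empty.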
So I want to implement a matrix standardisation method.
To do that, I've been told to
subtract the mean and divide by the standard deviation for each dimension
And to verify:
after this processing, each dimension has zero mean and unit variance.
That sounds simple enough ...
import numpy as np
def standardize(X : np.ndarray,inplace=True,verbose=False,check=False):
    ret = X
    if not inplace:
        ret = X.copy()
    ndim = np.ndim(X)
    for d in range(ndim):
        m = np.mean(ret,axis=d)
        s = np.std(ret,axis=d)
        if verbose:
            print(f"m{d} =",m)
            print(f"s{d} =",s)
        # TODO: handle zero s
        # TODO: subtract m along the correct axis
        # TODO: divide by s along the correct axis
    if check:
        means = [np.mean(X,axis=d) for d in range(ndim)]
        stds = [np.std(X,axis=d) for d in range(ndim)]
        if verbose:
            print("means=\n",means)
            print("stds=\n",stds)
        assert all(all(m < 1e-15 for m in mm) for mm in means)
        assert all(all(s == 1.0 for s in ss) for ss in stds)
    return ret
e.g. for ndim == 2, we could get something like
A=
[[ 0.40923704 0.91397416 0.62257397]
[ 0.15614258 0.56720836 0.80624135]]
m0 = [ 0.28268981 0.74059126 0.71440766] # can broadcast with ret -= m0
s0 = [ 0.12654723 0.1733829 0.09183369] # can broadcast with ret /= s0
m1 = [ 0.33333333 -0.33333333] # ???
s1 = [ 0.94280904 0.94280904] # ???
How do I do that?
Judging by the question "Broadcast an operation along specific axis in python", I thought I might be looking for a way to create
m[None, None, None, .., None, : , None, None, .., None]
Where there is exactly one : at index d.
But even if I knew how to do that, I'm not sure it'd work.
You can swap your axes such that the first axis is the one you want to normalize. This should also work in place, since swapaxes just returns a view on your data.
Using the numpy command swapaxes:
for d in range(ndim):
    m = np.mean(ret,axis=d)
    s = np.std(ret,axis=d)
    ret = np.swapaxes(ret, 0, d)
    # Perform Normalisation of Axis
    ret -= m
    ret /= s
    ret = np.swapaxes(ret, 0, d)
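Another way to get the m[None, ..., :, ..., None] shape asked about in the question is keepdims=True, which keeps the reduced axis with length 1 so that the result broadcasts against the original array. The sketch below standardizes along one chosen axis; the function name standardize_along_axis and the handling of zero standard deviation (dividing by 1 instead) are assumptions made here.

```python
import numpy as np

def standardize_along_axis(X, axis):
    """Return a copy of X with zero mean and unit variance along `axis`."""
    m = X.mean(axis=axis, keepdims=True)   # reduced axis kept with length 1
    s = X.std(axis=axis, keepdims=True)
    s = np.where(s == 0, 1.0, s)           # assumption: leave constant slices unscaled
    return (X - m) / s

A = np.array([[0.40923704, 0.91397416, 0.62257397],
              [0.15614258, 0.56720836, 0.80624135]])
Z0 = standardize_along_axis(A, axis=0)
Z1 = standardize_along_axis(A, axis=1)
print(Z0.mean(axis=0), Z0.std(axis=0))
print(Z1.mean(axis=1), Z1.std(axis=1))
```

Note that standardizing along one axis generally disturbs the statistics along the others, so zero mean and unit variance along every axis at once should not be expected from applying this axis by axis.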
In Python there is the distance_transform_edt function in the scipy.ndimage.morphology module. I applied it to a simple case, to compute the distance from a single cell in a masked numpy array.
However, the function removes the mask of the array and computes, as expected, the Euclidean distance of each cell with a non-null value from the reference cell, which has the null value.
Below is an example I gave in my blog post:
%pylab
from scipy.ndimage.morphology import distance_transform_edt
l = 100
x, y = np.indices((l, l))
center1 = (50, 20)
center2 = (28, 24)
center3 = (30, 50)
center4 = (60,48)
radius1, radius2, radius3, radius4 = 15, 12, 19, 12
circle1 = (x - center1[0])**2 + (y - center1[1])**2 < radius1**2
circle2 = (x - center2[0])**2 + (y - center2[1])**2 < radius2**2
circle3 = (x - center3[0])**2 + (y - center3[1])**2 < radius3**2
circle4 = (x - center4[0])**2 + (y - center4[1])**2 < radius4**2
# 4 circles
img = circle1 + circle2 + circle3 + circle4
mask = ~img.astype(bool)
img = img.astype(float)
m = ones_like(img)
m[center1] = 0
#imshow(distance_transform_edt(m), interpolation='nearest')
m = ma.masked_array(distance_transform_edt(m), mask)
imshow(m, interpolation='nearest')
However, I want to compute the geodesic distance transform, which takes into account the masked elements of the array. I do not want the Euclidean distance along a straight line that goes through masked elements.
I used Dijkstra's algorithm to obtain the result I want. Below is the implementation I propose:
def geodesic_distance_transform(m):
    mask = m.mask
    visit_mask = mask.copy() # mask visited cells
    m = m.filled(numpy.inf)
    m[m!=0] = numpy.inf
    distance_increments = numpy.asarray([sqrt(2), 1., sqrt(2), 1., 1., sqrt(2), 1., sqrt(2)])
    connectivity = [(i,j) for i in [-1, 0, 1] for j in [-1, 0, 1] if (not (i == j == 0))]
    cc = unravel_index(m.argmin(), m.shape) # current_cell
    while (~visit_mask).sum() > 0:
        neighbors = [tuple(e) for e in asarray(cc) - connectivity
                     if not visit_mask[tuple(e)]]
        tentative_distance = [distance_increments[i] for i,e in enumerate(asarray(cc) - connectivity)
                              if not visit_mask[tuple(e)]]
        for i,e in enumerate(neighbors):
            d = tentative_distance[i] + m[cc]
            if d < m[e]:
                m[e] = d
        visit_mask[cc] = True
        m_mask = ma.masked_array(m, visit_mask)
        cc = unravel_index(m_mask.argmin(), m.shape)
    return m
gdt = geodesic_distance_transform(m)
imshow(gdt, interpolation='nearest')
colorbar()
The function implemented above works well but is too slow for the application I developed, which needs to compute the geodesic distance transform several times.
Below is a timing benchmark of the Euclidean distance transform and the geodesic distance transform:
%timeit distance_transform_edt(m)
1000 loops, best of 3: 1.07 ms per loop
%timeit geodesic_distance_transform(m)
1 loops, best of 3: 702 ms per loop
How can I obtain a faster geodesic distance transform?
First of all, thumbs up for a very clear and well-written question.
There is a very good and fast implementation of a Fast Marching method called scikit-fmm to solve this kind of problem. You can find the details here:
http://pythonhosted.org//scikit-fmm/
Installing it might be the hardest part, but on Windows with Conda it's easy, since there is a 64-bit Conda package for Py27:
https://binstar.org/jmargeta/scikit-fmm
From there on, just pass your masked array to it, as you do with your own function. Like:
distance = skfmm.distance(m)
The results look similar, and I think even slightly better. Your approach (apparently) searches in eight distinct directions, resulting in a somewhat 'octagonal-shaped' distance.
On my machine the scikit-fmm implementation is over 200x faster than your function.
64-bit Windows binaries for scikit-fmm are now available from Christoph Gohlke.
http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-fmm
A slightly faster (about 10x) implementation that achieves the same result as your geodesic_distance_transform:
import numpy
from functools import reduce  # reduce is not a builtin in Python 3
from scipy.ndimage import distance_transform_edt
def getMissingMask(slab):
    nan_mask=numpy.where(numpy.isnan(slab),1,0)
    if not hasattr(slab,'mask'):
        mask_mask=numpy.zeros(slab.shape)
    else:
        if slab.mask.size==1 and slab.mask==False:
            mask_mask=numpy.zeros(slab.shape)
        else:
            mask_mask=numpy.where(slab.mask,1,0)
    mask=numpy.where(mask_mask+nan_mask>0,1,0)
    return mask
def geodesic(img,seed):
    seedy,seedx=seed
    mask=getMissingMask(img)
    #----Call distance_transform_edt if no missing----
    if mask.sum()==0:
        slab=numpy.ones(img.shape)
        slab[seedy,seedx]=0
        return distance_transform_edt(slab)
    target=(1-mask).sum()
    dist=numpy.ones(img.shape)*numpy.inf
    dist[seedy,seedx]=0
    def expandDir(img,direction):
        # Shift by one cell, then restore the edge row/column so that
        # numpy.roll does not wrap values around to the opposite side.
        if direction=='n':
            l1=img[0,:]
            img=numpy.roll(img,1,axis=0)
            img[0,:]=l1
        elif direction=='s':
            l1=img[-1,:]
            img=numpy.roll(img,-1,axis=0)
            img[-1,:]=l1
        elif direction=='e':
            l1=img[:,0]
            img=numpy.roll(img,1,axis=1)
            img[:,0]=l1
        elif direction=='w':
            l1=img[:,-1]
            img=numpy.roll(img,-1,axis=1)
            img[:,-1]=l1
        elif direction=='ne':
            img=expandDir(img,'n')
            img=expandDir(img,'e')
        elif direction=='nw':
            img=expandDir(img,'n')
            img=expandDir(img,'w')
        elif direction=='sw':
            img=expandDir(img,'s')
            img=expandDir(img,'w')
        elif direction=='se':
            img=expandDir(img,'s')
            img=expandDir(img,'e')
        return img
    def expandIter(img):
        sqrt2=numpy.sqrt(2)
        tmps=[]
        for dirii,dd in zip(['n','s','e','w','ne','nw','sw','se'],\
                [1,]*4+[sqrt2,]*4):
            tmpii=expandDir(img,dirii)+dd
            tmpii=numpy.minimum(tmpii,img)
            tmps.append(tmpii)
        img=reduce(lambda x,y:numpy.minimum(x,y),tmps)
        return img
    #----------------Iteratively expand----------------
    dist_old=dist
    while True:
        expand=expandIter(dist)
        dist=numpy.where(mask,dist,expand)
        nc=dist.size-len(numpy.where(dist==numpy.inf)[0])
        if nc>=target or numpy.all(dist_old==dist):
            break
        dist_old=dist
    return dist
Also note that if the mask forms more than one connected region (e.g. after adding another circle not touching the others), your function will fall into an endless loop.
UPDATE:
I found a Cython implementation of the Fast Sweeping method in this notebook, which can be used to achieve the same result as scikit-fmm with probably comparable speed. One just needs to feed a binary flag matrix (with 1s as viable points, inf otherwise) as the cost to the GDT() function.
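If adding a dependency is not an option, the Dijkstra approach from the question can also be written with a priority queue from the standard library, which avoids rescanning the whole array with argmin on every step. This is only a sketch: the function name geodesic_distance_heap and the boolean passable array (True where a cell may be crossed) are conventions made up here, and diagonal moves are allowed to cut corners. No timing claim is made for it.

```python
import heapq
import numpy as np

def geodesic_distance_heap(passable, seed):
    """Dijkstra on an 8-connected grid; unreachable cells keep distance inf."""
    rows, cols = passable.shape
    dist = np.full(passable.shape, np.inf)
    dist[seed] = 0.0
    # Eight neighbour offsets with their step lengths (1 or sqrt(2)).
    steps = [(di, dj, float(np.hypot(di, dj)))
             for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    heap = [(0.0, seed)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if d > dist[i, j]:
            continue  # stale queue entry, a shorter path was already found
        for di, dj, w in steps:
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols and passable[ni, nj]:
                nd = d + w
                if nd < dist[ni, nj]:
                    dist[ni, nj] = nd
                    heapq.heappush(heap, (nd, (ni, nj)))
    return dist

# Tiny example: a 3x3 grid with a wall in the top two cells of the middle column.
passable = np.ones((3, 3), dtype=bool)
passable[0, 1] = passable[1, 1] = False
dist = geodesic_distance_heap(passable, (0, 0))
print(dist)
```

Because the loop stops when the queue is empty rather than when every unmasked cell has been visited, a second region that cannot be reached from the seed simply stays at inf instead of causing an endless loop.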