Why does scipy's splrep show an error on the input data?

While using scipy's splrep function to fit a cubic B-spline to the data points given below, the output comes out as an array of zeros and the return message says "Error on input data". I have checked the conditions written in the docs and the input seems sane according to them.
import scipy.interpolate as intp

knot = [70.0]
X = [65., 67.5, 70., 72.5]
Y = [70.9277775, 50.40025663, 42.45372799, 57.39316434]
Weight = [0.13514246, 0.33885943, 0.87606185, 0.31531958]
SplineOutput = intp.splrep(X, Y, task=-1, t=knot, full_output=1, w=Weight)
SplineOutput
>>>((array([65. , 65. , 65. , 65. , 70. , 72.5, 72.5, 72.5, 72.5]), array([0., 0., 0., 0., 0., 0., 0., 0., 0.]), 3), 0.0, 10, 'Error on input data')
Any help about the source of this error and its cure would be appreciated. Thanks in advance!

From the documentation, under Notes
If provided, knots t must satisfy the Schoenberg-Whitney conditions, i.e., there must be a subset of data points x[j] such that t[j] < x[j] < t[j+k+1], for j=0, 1,...,n-k-2.
This effectively means that if k is 3 (the default), you need at least 5 data points here: with the single interior knot the full knot vector has 9 entries and defines len(t) - k - 1 = 5 B-spline coefficients, and the Schoenberg-Whitney conditions require a suitable data point for each of them. You only have 4 data points, hence the error. Either provide an additional entry to x, y and w, or decrease k (a sketch of the latter follows below). If you opt for the latter, keep the following in mind:
k : int, optional
The degree of the spline fit. It is recommended to use cubic splines. Even values of k should be avoided especially with small s values. 1 <= k <= 5
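For example, a minimal sketch of the second option (assuming intp is scipy.interpolate as in the question): with k=2 the same interior knot defines only four coefficients, which the four data points can satisfy.

import scipy.interpolate as intp

X = [65., 67.5, 70., 72.5]
Y = [70.9277775, 50.40025663, 42.45372799, 57.39316434]
Weight = [0.13514246, 0.33885943, 0.87606185, 0.31531958]
knot = [70.0]

# k=2 (quadratic) needs one data point per coefficient, i.e. four here
tck, fp, ier, msg = intp.splrep(X, Y, w=Weight, k=2, task=-1, t=knot,
                                full_output=1)
print(ier, msg)   # ier == 0 should indicate success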


Numpy largest singular value larger than greatest eigenvalue

Let
import numpy as np
M = np.array([[1., -0.5301332, 0.80512845],
              [0., 0., 0.],
              [0., 0., 0.]])
M has rank one, and its only non-zero eigenvalue is 1 (its trace). However, np.linalg.norm(M, ord=2) returns 1.39, which is strictly greater than 1. Why?
The eigenvalues of M, returned by np.linalg.eigvals, are 1, 0, 0, but the singular values of M are 1.39, 0, 0, which is a surprise to me. What did I miss?
In this particular case the 2-norm of M coincides with the Frobenius norm, which is given by the formula (np.sum(np.abs(M**2)))**(1/2), therefore we can see that:
import numpy as np
M = np.array([[1., -0.5301332, 0.80512845],
              [0., 0., 0.],
              [0., 0., 0.]])
np.sqrt(np.sum(np.abs(M**2)))
1.388982732341062
np.sqrt(np.sum(np.abs(M**2))) == np.linalg.norm(M,ord=2) == np.linalg.norm(M, ord='fro')
True
In particular, one can prove that the 2-norm is the square root of the largest eigenvalue of M.T @ M, i.e.
np.sqrt(np.linalg.eigvals(M.T @ M)[0])
1.388982732341062
And this is its relation to the eigenvalues of a matrix. Now recall that the singular values are the square roots of the eigenvalues of M.T @ M, and the mystery is unpacked.
Using a characterisation of the Frobenius norm (the square root of the trace of M.T @ M):
np.sqrt(np.sum(np.diag(M.T @ M)))
1.388982732341062
Comparing the results:
np.sqrt(np.linalg.eigvals(M.T @ M)[0]) == np.sqrt(np.sum(np.diag(M.T @ M))) == np.linalg.svd(M)[1][0]
True
Here the 2-norm of the matrix equals the square root of the sum of all its elements squared (the two coincide because M has rank one):
norm(M, ord=2) = (1.**2 + 0.5301332**2 + 0.80512845**2)**0.5 = 1.39
To get the relation between the eigenvalues and the singular values, you need to calculate the eigenvalues of M^H.M and take their square roots:
eigV = np.linalg.eigvals(M.T.dot(M))
array([1.92927303, 0. , 0. ])
eigV**0.5
array([1.38898273, 0. , 0. ])
This is perfectly normal. In the general case, the singular values are not equal to the eigenvalues; they coincide only for positive semi-definite Hermitian matrices (a quick check of this follows the snippet below).
For square matrices, you have the following relationship:
import numpy as np

M = np.matrix([[1., -0.5301332, 0.80512845],
               [0., 0., 0.],
               [0., 0., 0.]])
u, v = np.linalg.eig(M.H @ M)   # M.H @ M is Hermitian
print(np.sqrt(u))               # [1.38898273 0.         0.        ]
u, s, v = np.linalg.svd(M)
print(s)                        # [1.38898273 0.         0.        ]
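As a quick sanity check of that claim, here is a small sketch with a symmetric positive definite matrix, where the singular values and eigenvalues do coincide:

import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])                      # symmetric positive definite

print(np.sort(np.linalg.eigvals(A))[::-1])    # [3. 1.]
print(np.linalg.svd(A, compute_uv=False))     # [3. 1.] -- identical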

How to do element-wise rounding of NumPy array to first non-zero digit?

I would like to "round" (not exact a mathematical rounding) the elements of a numpy array in the following way:
Given an NxN or NxM 2D NumPy array with values between 0.00001 and 9.99999, like
a = np.array([[1.232, 1.872, 2.732, 0.123],
              [0.0019, 0.025, 1.854, 0.00017],
              [1.457, 0.0021, 2.34, 9.99],
              [1.527, 3.3, 0.012, 0.005]])
I would basically like to "round" this array by keeping only the first non-zero digit of each element (regardless of the digits that follow it),
giving the output:
output = np.array([[1.0, 1.0, 2.0, 0.1],
                   [0.001, 0.02, 1.0, 0.0001],
                   [1.0, 0.002, 2, 9.0],
                   [1, 3, 0.01, 0.005]])
thanks for any help!
You could use np.logspace and np.searchsorted to determine the order of magnitude of each element, then floor-divide by it and multiply back:
po10 = np.logspace(-10,10,21)
oom = po10[po10.searchsorted(a)-1]
a//oom*oom
# array([[1.e+00, 1.e+00, 2.e+00, 1.e-01],
# [1.e-03, 2.e-02, 1.e+00, 1.e-04],
# [1.e+00, 2.e-03, 2.e+00, 9.e+00],
# [1.e+00, 3.e+00, 1.e-02, 5.e-03]])
What you would want to do is to keep a fixed number of significant figures.
This functionality is not integrated into NumPy.
To get only 1 significant figure, you could look into the answers from either @PaulPanzer or @darcamo (assuming that you only have positive values).
If you want something that works with a specified number of significant figures, you could use something like:
def significant_figures(arr, num=1):
    # : compute the order of magnitude
    order = np.zeros_like(arr)
    mask = arr != 0
    order[mask] = np.floor(np.log10(np.abs(arr[mask])))
    del mask  # free unused memory
    # : compute the corresponding precision
    prec = num - order - 1
    return np.round(arr * 10.0 ** prec) / 10.0 ** prec
print(significant_figures(a, 1))
# [[1.e+00 2.e+00 3.e+00 1.e-01]
# [2.e-03 2.e-02 2.e+00 2.e-04]
# [1.e+00 2.e-03 2.e+00 1.e+01]
# [2.e+00 3.e+00 1.e-02 5.e-03]]
print(significant_figures(a, 2))
# [[1.2e+00 1.9e+00 2.7e+00 1.2e-01]
# [1.9e-03 2.5e-02 1.9e+00 1.7e-04]
# [1.5e+00 2.1e-03 2.3e+00 1.0e+01]
# [1.5e+00 3.3e+00 1.2e-02 5.0e-03]]
EDIT
For truncated output use np.floor() instead of np.round() just before the return.
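For instance, a hypothetical significant_figures_trunc variant (the same body with np.floor swapped in) should reproduce the output requested in the question:

import numpy as np

def significant_figures_trunc(arr, num=1):
    # identical to significant_figures above, but truncating instead of rounding
    order = np.zeros_like(arr)
    mask = arr != 0
    order[mask] = np.floor(np.log10(np.abs(arr[mask])))
    prec = num - order - 1
    return np.floor(arr * 10.0 ** prec) / 10.0 ** prec

a = np.array([[1.232, 1.872, 2.732, 0.123],
              [0.0019, 0.025, 1.854, 0.00017],
              [1.457, 0.0021, 2.34, 9.99],
              [1.527, 3.3, 0.012, 0.005]])
print(significant_figures_trunc(a, 1))
# e.g. 1.872 -> 1.0, 0.123 -> 0.1, 0.025 -> 0.02, matching the question's expected output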
First get the powers of 10 for each number in the array with
powers = np.floor(np.log10(a))
In your example this gives us
array([[ 0.,  0.,  0., -1.],
       [-3., -2.,  0., -4.],
       [ 0., -3.,  0.,  0.],
       [ 0.,  0., -2., -3.]])
Now, if we divide the i-th element of the array by 10**power_i, we essentially shift each element so that its first non-zero digit sits in the ones position. We can then simply take the floor to remove the remaining digits and multiply the result by 10**power_i to get back to the original scale.
The complete solution is then only the code below
powers = np.floor(np.log10(a))
10**powers * np.floor(a/10**powers)
What about numbers greater than or equal to 10?
For this you can simply take np.floor of the original value in the array. We can do this easily with a mask. You can modify the answer as below
powers = np.floor(np.log10(a))
result = 10**powers * np.floor(a/10**powers)
mask = a >= 10
result[mask] = np.floor(a[mask])
You can also use a mask to avoid computing the powers and logarithm for numbers that will just be replaced later.
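A possible sketch of that masked variant (one reading of the suggestion above, not code from the original answer):

import numpy as np

a = np.array([[1.232, 1.872, 2.732, 0.123],
              [0.0019, 0.025, 1.854, 0.00017],
              [1.457, 0.0021, 2.34, 9.99],
              [1.527, 3.3, 0.012, 0.005]])

result = np.empty_like(a)
small = a < 10                                  # only these need the log/power treatment
powers = np.floor(np.log10(a[small]))
result[small] = 10**powers * np.floor(a[small] / 10**powers)
result[~small] = np.floor(a[~small])            # values >= 10 are simply floored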

Smooth line with spline + datetime objects doesn't work

I have been trying to make a plot smoother, as is done here, but my X values are datetime objects, which are not compatible with linspace.
I convert the Xs to matplotlib dates:
Xnew = matplotlib.dates.date2num(X)
X_smooth = np.linspace(Xnew.min(), Xnew.max(), 10)
Y_smooth = spline(Xnew, Y, X_smooth)
But then I get an empty plot, as my Y_smooth is
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
for some unknown reason.
How can I make this work?
EDIT
Here's what I get when I print the variables; I see nothing abnormal:
X : [datetime.date(2016, 7, 31), datetime.date(2016, 7, 30), datetime.date(2016, 7, 29)]
X new: [ 736176. 736175. 736174.]
X new max: 736176.0
X new min: 736174.0
XSMOOTH [ 736174. 736174.22222222 736174.44444444 736174.66666667
736174.88888889 736175.11111111 736175.33333333 736175.55555556
736175.77777778 736176. ]
Y [711.74, 730.0, 698.0]
YSMOOTH [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Your X values are reversed: scipy.interpolate.spline requires the independent variable to be monotonically increasing. This method is also deprecated; use interp1d instead (see below).
>>> from scipy.interpolate import spline
>>> import numpy as np
>>> X = [736176.0, 736175.0, 736174.0] # <-- your original X is decreasing
>>> Y = [711.74, 730.0, 698.0]
>>> Xsmooth = np.linspace(736174.0, 736176.0, 10)
>>> spline(X, Y, Xsmooth)
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
reverse X and Y first and it works
>>> spline(
... list(reversed(X)), # <-- reverse order of X so also
... list(reversed(Y)), # <-- reverse order of Y to match
... Xsmooth
... )
array([ 698. , 262.18297973, 159.33767533, 293.62017489,
569.18656683, 890.19293934, 1160.79538066, 1285.149979 ,
1167.41282274, 711.74 ])
Note that many spline interpolation methods require X to be monotonically increasing:
UnivariateSpline
x : (N,) array_like - 1-D array of independent input data. Must be increasing.
InterpolatedUnivariateSpline
x : (N,) array_like - Input dimension of data points – must be increasing
The default order of scipy.interpolate.spline is cubic. Because there are only 3 data points, there are large differences between a cubic spline (order=3) and a quadratic spline (order=2). A plot of splines of different orders (not reproduced here) shows the difference; note that 100 points were used to smooth the fitted curve more.
The documentation for scipy.interpolate.spline is vague and suggests it may not be supported. For example, it is not listed on the scipy.interpolate main page or in the interpolation tutorial. The source for spline shows that it actually calls spleval and splmake, which are listed under Additional Tools as:
Functions existing for backward compatibility (should not be used in new code).
I would follow cricket_007's suggestion and use interp1d. It is the currently suggested method, it is very well documented with detailed examples in both the tutorial and API, and it allows the independent variable to be unsorted (any order) by default (see assume_sorted argument in API).
>>> from scipy.interpolate import interp1d
>>> f = interp1d(X, Y, kind='quadratic')
>>> f(Xsmooth)
array([ 711.74 , 720.14123457, 726.06049383, 729.49777778,
730.45308642, 728.92641975, 724.91777778, 718.4271605 ,
709.4545679 , 698. ])
Also, it will raise an error if there are too few data points for the requested spline order.
>>> f = interp1d(X, Y, kind='cubic')
ValueError: x and y arrays must have at least 4 entries
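To put the smoothed curve back on a date axis, a small plotting sketch (assuming X was produced by matplotlib.dates.date2num as in the question, and reusing Xsmooth and the quadratic f from above, so num2date simply undoes the earlier conversion):

import matplotlib.dates
import matplotlib.pyplot as plt

plt.plot(matplotlib.dates.num2date(X), Y, 'o', label='data')
plt.plot(matplotlib.dates.num2date(Xsmooth), f(Xsmooth), '-', label='quadratic interp1d')
plt.legend()
plt.gcf().autofmt_xdate()   # tilt the date labels so they don't overlap
plt.show()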

Sorting an array based on two goals

I have a list of vectors (each vector contains only 0s and 1s):
In [3]: allLabelPredict
Out[3]: array([[ 0.,  0.,  0., ...,  0.,  0.,  1.],
               [ 0.,  0.,  0., ...,  0.,  0.,  1.],
               [ 0.,  0.,  0., ...,  0.,  0.,  1.],
               ...,
               [ 0.,  0.,  0., ...,  0.,  0.,  0.],
               [ 0.,  0.,  0., ...,  0.,  0.,  1.],
               [ 0.,  0.,  0., ...,  0.,  0.,  1.]])
In [4]: allLabelPredict.shape
Out[4]: (5000, 190)
As you can see, I have 190 different vectors, each the output of one classifier. Now I want to select some of these outputs based on the proximity of each vector to my original label:
In [7]: myLabel
Out[7]: array([ 0., 0., 0., ..., 1., 1., 1.])
In [8]: myLabel.shape
Out[8]: (5000,)
For this purpose I've defined two different criteria for each vector: Zero Hamming Distance and One Hamming Distance.
"One Hamming Distance": the Hamming distance between the sub-array of myLabel whose entries are equal to 1 and the corresponding sub-array of each vector (the sub-array of each vector is built by selecting its values at the indices where myLabel is 1).
"Zero Hamming Distance": the Hamming distance between the sub-array of myLabel whose entries are equal to 0 and the corresponding sub-array of each vector (the sub-array of each vector is built by selecting its values at the indices where myLabel is 0).
To make it more clear will give you a small example:
MyLabel [1,1,1,1,0,0,0,0]
V1 [1,1,0,1,0,0,1,1]
sub-array1 [1,1,0,1]
sub-array0 [0,0,1,1]
"zero Hamming Distance": hamming(sub-array0, MyLabel[4:])
"one Hamming Distance": hamming(sub-array1, MyLabel[:4])
Now I want to select some vectors from allLabelPredict based on the One Hamming Distance and the Zero Hamming Distance.
I want to select the vectors which have the minimum One Hamming Distance and the minimum Zero Hamming Distance (by minimum I mean that both criteria are the lowest among all vectors).
If the above is not possible, how can I instead sort so that the vectors are ordered first by One Hamming Distance and then, within that, by minimizing Zero Hamming Distance?
OK, so first I'd split up the entire allLabelPredict into two subarrays based on the values in myLabel:
import numpy as np
allLabelPredict = np.random.randint(0, 2, (5000, 190))
myLabel = np.random.randint(0, 2, 5000)
sub0 = allLabelPredict[myLabel==0]
sub1 = allLabelPredict[myLabel==1]
ham0 = np.abs(sub0 - 0).mean(0)
ham1 = np.abs(sub1 - 1).mean(0)
hamtot = np.abs(allLabelPredict - myLabel[:, None]).mean(0) # if they're not split
This is the same as scipy.spatial.distance.hamming, but that can only be applied to one vector at a time:
>>> np.allclose(scipy.spatial.distance.hamming(allLabelPredict[:,0], myLabel),
... np.abs(allLabelPredict[:,0] - myLabel).mean(0))
True
Now, the indices in either ham array will be the indices in the second axis of the allLabelPredict array. If you want to sort your vectors by hamming distance:
sortby0 = allLabelPredict[:, ham0.argsort()]
sortby1 = allLabelPredict[:, ham1.argsort()]
Or if you want the lowest zero (or one) hamming, you would look at
best0 = allLabelPredict[:, ham0.argmin()]
best1 = allLabelPredict[:, ham1.argmin()]
Or if you want the lowest one hamming with zero hamming near 0.1, you could say something like
hamscore = (ham0 - 0.1)**2 + ham1**2
best = allLabelPredict[:, hamscore.argmin()]
The crux of the answer should include this: use sorted(allLabelPredict, key=<criteria>)
It lets you sort the list based on the criteria you define as a function and pass to the key argument.
To do this, first let's convert your 190 vectors into pair of (0-H Dist, 1-H Dist). Then you'll have something like this:
(0.10, 0.15)
(0.12, 0.09)
(0.25, 0.03)
(0.14, 0.16)
(0.14, 0.11)
...
Next, we need to clarify what you meant by "both criteria for this vector be the lowest amongst others". In the above case, should we choose (0.25, 0.03)? Or is it (0.10, 0.15)? How about (0.14, 0.11)? Fortunately you already said that in this case, we need to prioritize 1-H Dist first. So we will choose (0.25, 0.03), is this correct? From your comments in @askewchan's answer it seems that you want the sort criteria to be flexible.
If that's so, then your first criterion that "both criteria for this vector be the lowest amongst others" is actually part of your second criterion, which is "sort based on One Hamming Distance, then by Zero Hamming Distance", since after the sorting the vector with lowest distance on both scores will be at the top anyway.
Hence we just need to sort based on 1-H Dist and then by 0-H Dist when the 1-H Dist score is the same. This sort criterion can be changed flexibly, as long as you already have the pair of scores.
Here is a sample code:
import numpy as np
from scipy.spatial.distance import hamming

def sort_criteria(pair_of_scores):
    score0, score1 = pair_of_scores
    return (score1, score0)  # Sort by 1-H, then by 0-H
    # The following would sort by Euclidean distance instead:
    # return score0**2 + score1**2
    # The following would select the vectors with score0 == 0.5, then sort by score1:
    # return score1 if np.abs(score0 - 0.5) < 1e-7 else 1 + score1

def main():
    allLabelPredict = np.asarray(np.random.randint(0, 2, (5, 10)), dtype=np.float64)
    myLabel = np.asarray(np.random.randint(0, 2, 10), dtype=np.float64)
    print(allLabelPredict)
    print(myLabel)

    allSub0 = allLabelPredict[:, myLabel == 0]
    allSub1 = allLabelPredict[:, myLabel == 1]
    # compare each sub-array against an all-zeros / all-ones reference vector
    all_scores = [(hamming(sub0, np.zeros_like(sub0)), hamming(sub1, np.ones_like(sub1)))
                  for sub0, sub1 in zip(allSub0, allSub1)]
    print(all_scores)  # The (0-H, 1-H) score pairs

    all_scores = sorted(all_scores, key=sort_criteria)  # The sorting
    # all_scores = np.array([pair for pair in all_scores if pair[0] == 0.5])  # For filtering
    print(all_scores)

if __name__ == '__main__':
    main()
Result:
[[ 1. 0. 0. 0. 0. 1. 1. 0. 1. 1.]
[ 1. 0. 0. 0. 1. 0. 1. 0. 0. 1.]
[ 0. 1. 1. 0. 1. 1. 1. 1. 1. 0.]
[ 0. 0. 1. 1. 1. 1. 1. 0. 1. 1.]
[ 1. 1. 1. 1. 1. 1. 0. 0. 0. 0.]]
[ 1. 1. 1. 1. 1. 0. 1. 1. 0. 1.]
[(1.0, 0.625), (0.0, 0.5), (1.0, 0.375), (1.0, 0.375), (0.5, 0.375)]
[(0.5, 0.375), (1.0, 0.375), (1.0, 0.375), (0.0, 0.5), (1.0, 0.625)]
You just need to change the sort_criteria function to change your criteria.
If you sort first by one criterion, then by another, the first entry in that sort will be the only one that could simultaneously minimize both criteria.
You can do that operation with numpy using argsort on a structured array with named fields. I will assume that you have arrays called zeroHamming and oneHamming.
# make an array of the distances with named fields
# these must be input as pairs (tuples), not as columns
hammingDistances = np.array([(one, zero) for one, zero in zip(oneHamming, zeroHamming)],
                            dtype=[("one", "float"), ("zero", "float")])
# to see how the fields work, try:
print(hammingDistances['zero'])
# do a sort by oneHamming, then by zeroHamming
sortedIndsOneFirst = np.argsort(hammingDistances, order=['one', 'zero'])
# do a sort by zeroHamming, then by oneHamming
sortedIndsZeroFirst = np.argsort(hammingDistances, order=['zero', 'one'])
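A brief usage sketch (assuming oneHamming and zeroHamming were computed per column, so these indices refer to the 190 vectors, i.e. the columns of allLabelPredict):

# reorder the candidate vectors (columns) from best to worst:
# lowest one-Hamming first, ties broken by zero-Hamming
ranked = allLabelPredict[:, sortedIndsOneFirst]
best_vector = ranked[:, 0]   # single best vector under this ordering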
It's easier to work with as1 = allLabelPredict.T, because then as1[0] will be your first vector, as1[1] your second, and so on. Then your Hamming distance function is simply:
def ham(a1, b1): return sum(map(abs, a1-b1))
So, if you want the vectors that match your criterion, you can use a list comprehension:
vects = numpy.array([a for a in as1 if ham(a, myLabel) < 2])
where myLabel is the vector you want to compare with.

How to determine scaling factor so that covariance matrix has a first element of 1?

I have data which I need to center and scale so that it is centered around the origin. Then the data needs to be rotated so that the direction of maximum variance is on the x-axis. The mean and the covariance of the data are then calculated. I need the first element of the covariance matrix to be 1. I think this is done by adjusting the scaling factor, but I can't figure out what the scaling factor should be.
To center the data I take away the mean, and to rotate I use SVD, but the scaling is still my problem.
signature = numpy.loadtxt(name, comments = '%', usecols = (0,cols-1))
signature = numpy.transpose(signature)
#SVD to get D so that data can be scaled by 1/(highest singular value in D)
U, D, Vt = numpy.linalg.svd( signature , full_matrices=0)
cs = utils.centerscale(signature, scale=False)
signature = cs[0]
#plt.scatter(cs[0][0],cs[0][1],color='r')
#SVD so that data can be rotated so that direction of most variance is on x-axis
U, D, Vt = numpy.linalg.svd( signature , full_matrices=0)
cs = utils.centerscale(signature, center=False, scalefactor=D[0])
U, D, Vt = numpy.linalg.svd( cs[0] , full_matrices=0)
D = numpy.diag(D)
norm = numpy.dot(D,Vt)
The following are examples of results of the mean and cov of norm (the test cases use res).
**********************************************************************
Failed example:
print numpy.mean(res, axis=1)
Expected:
[ 7.52074907e-18 -6.59917722e-18]
Got:
[ -1.22008884e-17 2.41126563e-17]
**********************************************************************
Failed example:
print numpy.cov(res, bias=1)
Expected:
[[ 1.00000000e+00 9.02112676e-18]
[ 9.02112676e-18 1.40592827e-01]]
Got:
[[ 4.16666667e-03 -1.57698124e-19]
[ -1.57698124e-19 5.85803446e-04]]
**********************************************************************
1 items had failures:
2 of 4 in __main__.processfile
***Test Failed*** 2 failures.
All values are irrelevant except for the first element of the covariance matrix, that needs to be one.
I have tried looking everywhere and can't find an answer. Any help would be appreciated.
I don't know what utils.centerscale is or does, but if you want to scale a matrix by a constant factor so that the upper left term of its covariance matrix is 1, you can simply divide the matrix by the square root of the unscaled covariance term:
>>> import numpy
>>> numpy.random.seed(17)
>>> m = numpy.random.rand(5,4)
>>> m
array([[ 0.294665 , 0.53058676, 0.19152079, 0.06790036],
[ 0.78698546, 0.65633352, 0.6375209 , 0.57560289],
[ 0.03906292, 0.3578136 , 0.94568319, 0.06004468],
[ 0.8640421 , 0.87729053, 0.05119367, 0.65241862],
[ 0.55175137, 0.59751325, 0.48352862, 0.28298816]])
>>> c = numpy.cov(m,bias=1)
>>> c
array([[ 0.0288779 , 0.00524455, 0.00155373, 0.02779861, 0.01798404],
[ 0.00524455, 0.00592484, -0.00711072, 0.01006019, 0.00631144],
[ 0.00155373, -0.00711072, 0.13391344, -0.10551922, 0.00945934],
[ 0.02779861, 0.01006019, -0.10551922, 0.11250984, 0.00982862],
[ 0.01798404, 0.00631144, 0.00945934, 0.00982862, 0.01444482]])
>>> numpy.cov(m/c[0][0]**0.5, bias=1)
array([[ 1. , 0.18161135, 0.05380354, 0.96262562, 0.62276138],
[ 0.18161135, 0.20516847, -0.24623392, 0.3483699 , 0.21855613],
[ 0.05380354, -0.24623392, 4.63722877, -3.65397781, 0.32756326],
[ 0.96262562, 0.3483699 , -3.65397781, 3.89605297, 0.34035085],
[ 0.62276138, 0.21855613, 0.32756326, 0.34035085, 0.5002033 ]])
But this has the same effect as simply dividing the covariance matrix by the upper left member:
>>> (numpy.cov(m,bias=1)/numpy.cov(m,bias=1)[0][0])/(numpy.cov(m/c[0][0]**0.5, bias=1))
array([[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.]])
Depending on what you're doing, you might also be interested in numpy.corrcoef, which gives the correlation coefficient matrix instead.
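For instance (a quick check reusing m from above), corrcoef normalises every variable, so the whole diagonal is 1 rather than just the upper left element:
>>> cc = numpy.corrcoef(m)
>>> numpy.allclose(numpy.diag(cc), 1.0)
True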
