Sorting an array based on two goals - python

I have a list of vectors (each vector only contains 0 or 1):
In [3]: allLabelPredict
Out[3]: array([[ 0., 0., 0., ..., 0., 0., 1.],
[ 0., 0., 0., ..., 0., 0., 1.],
[ 0., 0., 0., ..., 0., 0., 1.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 1.],
[ 0., 0., 0., ..., 0., 0., 1.]])
In [4]: allLabelPredict.shape
Out[4]: (5000, 190)
As you can see, I have 190 different vectors; each vector is the output of one classifier. Now I want to select some of these outputs based on the proximity of each vector to my original label:
In [7]: myLabel
Out[7]: array([ 0., 0., 0., ..., 1., 1., 1.])
In [8]: myLabel.shape
Out[8]: (5000,)
For this purpose I've defined two different criteria for each vector: "Zero Hamming Distance" and "One Hamming Distance".
"One Hamming Distance": the Hamming distance between the sub-array of myLabel whose entries equal "1" and the corresponding sub-array of each vector (I build the sub-array of each vector by selecting its values at the indices where myLabel is '1').
"Zero Hamming Distance": the Hamming distance between the sub-array of myLabel whose entries equal "0" and the corresponding sub-array of each vector (I build the sub-array of each vector by selecting its values at the indices where myLabel is '0').
To make it clearer, here is a small example:
MyLabel [1,1,1,1,0,0,0,0]
V1 [1,1,0,1,0,0,1,1]
sub-array1 [1,1,0,1]
sub-array0 [0,0,1,1]
"zero Hamming Distance": hamming(sub-array0, MyLabel[4:])
"one Hamming Distance": hamming(sub-array1, MyLabel[:4])
Now I want to select some vectors from allLabelPredict based on "One Hamming Distance" and "Zero Hamming Distance".
I want to select those vectors which have the minimum "One Hamming Distance" and "Zero Hamming Distance" (by minimum I mean that both criteria for this vector are the lowest amongst all vectors).
If that is not possible, how can I do something like a sort that always sorts first based on "One Hamming Distance" and then tries to minimize "Zero Hamming Distance"?

OK, so first I'd split up the entire allLabelPredict into two subarrays based on the values in myLabel:
import numpy as np
allLabelPredict = np.random.randint(0, 2, (5000, 190))
myLabel = np.random.randint(0, 2, 5000)
sub0 = allLabelPredict[myLabel==0]  # rows where myLabel is 0
sub1 = allLabelPredict[myLabel==1]  # rows where myLabel is 1
ham0 = np.abs(sub0 - 0).mean(0)     # "zero Hamming distance" of each of the 190 classifiers
ham1 = np.abs(sub1 - 1).mean(0)     # "one Hamming distance" of each of the 190 classifiers
hamtot = np.abs(allLabelPredict - myLabel[:, None]).mean(0)  # overall distance, if they're not split
This is the same as scipy.spatial.distance.hamming, but that can only be applied to one vector at a time:
>>> import scipy.spatial.distance
>>> np.allclose(scipy.spatial.distance.hamming(allLabelPredict[:,0], myLabel),
...             np.abs(allLabelPredict[:,0] - myLabel).mean(0))
True
Now, the indices in either ham array will be the indices in the second axis of the allLabelPredict array. If you want to sort your vectors by hamming distance:
sortby0 = allLabelPredict[:, ham0.argsort()]
sortby1 = allLabelPredict[:, ham1.argsort()]
Or if you want the lowest zero (or one) hamming, you would look at
best0 = allLabelPredict[:, ham0.argmin()]
best1 = allLabelPredict[:, ham1.argmin()]
Or if you want the lowest one hamming with zero hamming near 0.1, you could say something like
hamscore = (ham0 - 0.1)**2 + ham1**2
best = allLabelPredict[:, hamscore.argmin()]
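If you specifically want the two-level ordering the question asks for (minimize "One Hamming Distance" first, break ties with "Zero Hamming Distance"), a minimal sketch with np.lexsort, reusing ham0 and ham1 from above, could look like this:
# np.lexsort uses the last key as the primary sort key, so ham1 comes last
order = np.lexsort((ham0, ham1))
sortedVectors = allLabelPredict[:, order]   # columns ordered by one-Hamming, ties broken by zero-Hamming
bestVector = allLabelPredict[:, order[0]]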

The crux of the answer should include this: use sorted(allLabelPredict, key=<criteria>)
It lets you sort the list based on the criteria you define as a function and pass to the key argument.
To do this, first let's convert your 190 vectors into pairs of (0-H Dist, 1-H Dist). Then you'll have something like this:
(0.10, 0.15)
(0.12, 0.09)
(0.25, 0.03)
(0.14, 0.16)
(0.14, 0.11)
...
Next, we need to clarify what you meant by "both criteria for this vector be the lowest amongst others". In the above case, should we choose (0.25, 0.03)? Or is it (0.10, 0.15)? How about (0.14, 0.11)? Fortunately you already said that in this case we need to prioritize 1-H Dist first, so we will choose (0.25, 0.03), is this correct? From your comments on @askewchan's answer it seems that you want the sort criteria to be flexible.
If that's so, then your first criterion that "both criteria for this vector be the lowest amongst others" is actually part of your second criterion, which is "sort based on One Hamming Distance, then by Zero Hamming Distance", since after the sorting the vector with the lowest distance on both scores will be at the top anyway.
Hence we just need to sort based on 1-H Dist and then by 0-H Dist when the 1-H Dist score is the same. This sort criterion can be changed flexibly, as long as you already have the pair of scores.
Here is some sample code:
import numpy as np
from scipy.spatial.distance import hamming

def sort_criteria(pair_of_scores):
    score0, score1 = pair_of_scores
    return (score1, score0)  # Sort by 1-H, then by 0-H
    # The following would sort by Euclidean distance instead:
    #return score0**2 + score1**2
    # The following would select the vectors with score0 == 0.5, then sort by score1:
    #return score1 if np.abs(score0 - 0.5) < 1e-7 else 1 + score1

def main():
    allLabelPredict = np.asarray(np.random.randint(0, 2, (5, 10)), dtype=np.float64)
    myLabel = np.asarray(np.random.randint(0, 2, 10), dtype=np.float64)
    print(allLabelPredict)
    print(myLabel)
    allSub0 = allLabelPredict[:, myLabel == 0]
    allSub1 = allLabelPredict[:, myLabel == 1]
    all_scores = [(hamming(sub0, np.zeros_like(sub0)), hamming(sub1, np.ones_like(sub1)))
                  for sub0, sub1 in zip(allSub0, allSub1)]
    print(all_scores)  # The (0-H, 1-H) score pairs
    all_scores = sorted(all_scores, key=sort_criteria)  # The sorting
    #all_scores = np.array([pair for pair in all_scores if pair[0]==0.5])  # For filtering
    print(all_scores)

if __name__ == '__main__':
    main()
Result:
[[ 1. 0. 0. 0. 0. 1. 1. 0. 1. 1.]
[ 1. 0. 0. 0. 1. 0. 1. 0. 0. 1.]
[ 0. 1. 1. 0. 1. 1. 1. 1. 1. 0.]
[ 0. 0. 1. 1. 1. 1. 1. 0. 1. 1.]
[ 1. 1. 1. 1. 1. 1. 0. 0. 0. 0.]]
[ 1. 1. 1. 1. 1. 0. 1. 1. 0. 1.]
[(1.0, 0.625), (0.0, 0.5), (1.0, 0.375), (1.0, 0.375), (0.5, 0.375)]
[(0.5, 0.375), (1.0, 0.375), (1.0, 0.375), (0.0, 0.5), (1.0, 0.625)]
You just need to change the sort_criteria function to change your criteria.
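One thing to be aware of: sorting the bare score pairs loses track of which classifier each pair came from. A small sketch (reusing all_scores and sort_criteria from above, before the sort) that keeps the original column index alongside the scores:
ranked = sorted(enumerate(all_scores), key=lambda item: sort_criteria(item[1]))
best_index, best_scores = ranked[0]  # index of the best vector in allLabelPredict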

If you sort first based on one criterion, then the other, the first entry in that sort will be the only one that could simultaneously minimize both criteria.
You can do that operation with numpy using argsort. This requires you to make a structured numpy array with named fields as keys. I will assume that you have arrays called zeroHamming and oneHamming.
# make a structured array of the distances with named fields
# these must be input as pairs (tuples), not as columns
hammingDistances = np.array([(one, zero) for one, zero in zip(oneHamming, zeroHamming)],
                            dtype=[("one", "float"), ("zero", "float")])
# to see how the fields work, try:
print(hammingDistances['zero'])
# sort by oneHamming, then by zeroHamming
sortedIndsOneFirst = np.argsort(hammingDistances, order=['one', 'zero'])
# sort by zeroHamming, then by oneHamming
sortedIndsZeroFirst = np.argsort(hammingDistances, order=['zero', 'one'])
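To actually reorder the vectors with these index arrays (assuming, as in the question, that each vector is a column of allLabelPredict), something along these lines should work:
# columns reordered so the best vector by (one, then zero) Hamming comes first
sortedVectorsOneFirst = allLabelPredict[:, sortedIndsOneFirst]
bestVector = allLabelPredict[:, sortedIndsOneFirst[0]]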

It's easier to work with as1 = allLabelPredict.T, because then as1[0] will be your first vector, as1[1] your second, and so on. Then your Hamming distance function (here counting the number of differing entries) is simply:
def ham(a1, b1): return abs(a1 - b1).sum()
So, if you want the vectors that match your criterion, you can use a list comprehension:
vects = numpy.array([a for a in as1 if ham(a, myLabel) < 2])
where myLabel is the vector you want to compare with.

ValueError: all the input arrays must have same number of dims, but the arr at index 0 has 1 dimension(s) and the arr at index 11 has 2 dimension(s)

I have an array as follows:
samples_data = [array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)
array([ 0. , 0. , 0. , ..., -0.00020519,
-0.00019427, -0.00107348], dtype=float32)
array([ 0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ...,
-8.9004419e-07, 7.3998461e-07, -6.9706215e-07], dtype=float32)
array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)]
And I have a function like this
def generate_segmented_data_1(
    samples_data: np.ndarray, sampling_rate: int = 16000
) -> np.ndarray:
    new_data = []
    for data in samples_data:
        segments = segment_audio(data, sampling_rate=sampling_rate)
        new_data.append(segments)
    new_data = np.array(new_data)
    return np.concatenate(new_data)
It shows an error like this:
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 11 has 2 dimension(s)
The array at index 0 looks like this:
[array([ 0. , 0. , 0. , ..., -0.00022057,
0.00013752, -0.00114789], dtype=float32)
array([-4.3174211e-04, -5.4488028e-04, -1.1238289e-03, ...,
8.4724619e-05, 3.0450989e-05, -3.9514929e-05], dtype=float32)]
and the array at index 11 looks like this:
[[3.0856067e-05 3.0295929e-05 3.0955063e-05 ... 8.5010566e-03
1.3315652e-02 1.5698154e-02]]
What should I do so that all of the segments I produce are concatenated into one array of segments?
I'm not quite sure I understand what you are trying to do.
b = np.array([[2]])
b.shape
# (1,1)
b = np.array([2])
b.shape
# (1,)
For the segment part of the question, it is unclear what your data structure is, but the code example is broken, as you are appending to a list that hasn't been created.
How can I get the shape of the array below to be 1D instead of 2D?
b = np.array([[2]])
b_shape = b.shape
This results in (1, 1), but I want it to result in (1,) without flattening it.
I suspect the confusion stems from the fact that you chose an example which can also be seen as a scalar, so I'll instead use a different example:
b = np.array([[1,2]])
Now b.shape is (1, 2). Removing the leading "one" dimension in any way (be it b.flatten(), b.squeeze(), or b[0]) results in the same thing:
assert (b.flatten() == b.squeeze()).all()
assert (b.flatten() == b[0]).all()
Now, for the real problem: it appears you're trying to concatenate "rows" from "segments", but the "segments" (which, I believe from your sample data, are lists of np.arrays?) are inconsistently formed.
Your sample data is quite chaotic: segments 0-10 seem to be lists of 1D arrays, while segments 11, 18 and 19 are either 2D arrays or lists of lists of floats. This, plus the error message, suggests you have an issue in the data processing of the segments.
Now, to actually concatenate both types of data:
new_data = []
for data in samples_data:
    segments = function_a(data)      # it appears this doesn't return consistent data
    segments = np.asarray(segments)  # force it to always be an array...
    if segments.ndim > 1:            # ...and append each row
        for row in segments:
            new_data.append(row)
    elif segments.ndim == 1:         # if just one row, append it directly
        new_data.append(segments)
    else:
        # function_a returned an empty list, do nothing
        pass
Given the shown data and code, this should work (but it's neither efficient, nor tested).
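If, after that loop, all the collected rows happen to have the same length, you can turn the list into a single 2D array; this is only a sketch and assumes equal-length segments:
new_data = np.stack(new_data)  # shape: (number_of_segments, segment_length)
# if the segment lengths differ, keep new_data as a plain Python list instead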

Numpy largest singular value larger than greatest eigenvalue

Let
import numpy as np
M = np.array([[ 1., -0.5301332 , 0.80512845],
[ 0., 0., 0.],
[ 0., 0., 0.]])
M is rank one; its only non-zero eigenvalue is 1 (its trace). However, np.linalg.norm(M, ord=2) returns 1.39, which is strictly greater than 1. Why?
The eigenvalues of M, returned by np.linalg.eigvals, are 1, 0, 0, but the singular values of M are 1.39, 0, 0, which is a surprise to me. What did I miss?
In this particular case the 2-norm of M coincides with the Frobenius norm, which is given by the formula (np.sum(np.abs(M**2)))**(1/2), therefore we can see that:
import numpy as np
M = np.array([[ 1., -0.5301332 , 0.80512845],
[ 0., 0., 0.],
[ 0., 0., 0.]])
np.sqrt(np.sum(np.abs(M**2)))
1.388982732341062
np.sqrt(np.sum(np.abs(M**2))) == np.linalg.norm(M,ord=2) == np.linalg.norm(M, ord='fro')
True
In particular one can prove that the 2-norm is the square root of the largest eigenvalue of M.T@M, i.e.
np.sqrt(np.linalg.eigvals(M.T@M)[0])
1.388982732341062
And this is its relation with the eigenvalues of a matrix. Now recall that the singular values are the square roots of the eigenvalues of M.T@M, and we unpack the mystery.
Using a characterization of the Frobenius norm (the square root of the trace of M.T@M, i.e. the sum of its diagonal):
np.sqrt(np.sum(np.diag(M.T@M)))
1.388982732341062
Confronting the results:
np.sqrt(np.linalg.eigvals(M.T@M)[0]) == np.sqrt(np.sum(np.diag(M.T@M))) == np.linalg.svd(M)[1][0]
True
For this particular matrix, the 2-norm equals the square root of the sum of all its elements squared (the Frobenius norm):
norm(M, ord=2) = (1.**2 + 0.5301332**2 + 0.80512845**2)**0.5 = 1.39
To get the relation between the eigenvalues and the singular values, you need to compute the eigenvalues of M^H.M and take their square roots:
eigV = np.linalg.eigvals(M.T.dot(M))
array([1.92927303, 0. , 0. ])
eigV**0.5
array([1.38898273, 0. , 0. ])
This is perfectly normal. In the general case, the singular values are not equal to the eigenvalues; they coincide only for positive semi-definite Hermitian matrices.
For square matrices, you have the following relationship:
M = np.matrix([[ 1., -0.5301332 , 0.80512845],
               [ 0., 0., 0.],
               [ 0., 0., 0.]])
u, v = np.linalg.eig(M.H @ M)  # M.H @ M is Hermitian
print(np.sqrt(u))              # [1.38898273 0. 0. ]
u, s, v = np.linalg.svd(M)
print(s)                       # [1.38898273 0. 0. ]
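As a quick sanity check of that statement, here is a small sketch with an arbitrarily chosen symmetric positive definite matrix, for which the singular values and the eigenvalues do coincide:
import numpy as np
A = np.array([[2., 1.],
              [1., 2.]])                    # symmetric positive definite
print(np.linalg.eigvalsh(A))                # [1. 3.]
print(np.linalg.svd(A, compute_uv=False))   # [3. 1.] -- same values, in descending order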

How to efficiently apply functions to values in an array based on condition?

I have an array arorg like this:
import numpy as np
arorg = np.array([[-1., 2., -4.], [0.5, -1.5, 3]])
and another array values that looks as follows:
values = np.array([1., 0., 2.])
values has the same number of entries as arorg has columns.
Now I want to apply functions to the entries of arorg depending on whether they are positive or negative:
def neg_fun(val1, val2):
    return val1 / (val1 + abs(val2))

def pos_fun(val1, val2):
    return 1. / ((val1 / val2) + 1.)
Here, val2 is the (absolute) value in arorg and val1 (this is the tricky part) comes from values: if I apply pos_fun or neg_fun to column i of arorg, val1 should be values[i].
I currently implement that as follows:
ar = arorg.copy()
for (x, y) in zip(*np.where(ar > 0)):
    ar.itemset((x, y), pos_fun(values[y], ar.item(x, y)))
for (x, y) in zip(*np.where(ar < 0)):
    ar.itemset((x, y), neg_fun(values[y], ar.item(x, y)))
which gives me the desired output:
array([[ 0.5 , 1. , 0.33333333],
[ 0.33333333, 0. , 0.6 ]])
As I have to do these calculations very often, I am wondering whether there is a more efficient way of doing this. Something like
np.where(arorg > 0, pos_fun(xxxx), arorg)
would be great but I don't know how to pass the arguments correctly (the xxx). Any suggestions?
As hinted in the question, here's one approach using np.where.
First off, we are using a direct translation of the function implementation to generate values/arrays for both positive and negative cases. Then, with a mask of positive values, we will choose between those two arrays using np.where.
Thus, the implementation would look something along these lines -
# Get positive and negative values for all elements
val1 = values
val2 = arorg
neg_vals = val1 / (val1 + np.abs(val2))
pos_vals = 1. / ((val1 / val2) + 1.)
# Get a positive mask and choose between positive and negative values
pos_mask = arorg > 0
out = np.where(pos_mask, pos_vals, neg_vals)
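One caveat: np.where evaluates both branch arrays for every element, so if arorg can contain zeros (or values + |arorg| can be zero) the branch you do not select may still emit divide-by-zero warnings. A possible way to compute the same result while silencing those warnings, reusing val1 and val2 from above:
with np.errstate(divide='ignore', invalid='ignore'):
    neg_vals = val1 / (val1 + np.abs(val2))
    pos_vals = 1. / ((val1 / val2) + 1.)
out = np.where(arorg > 0, pos_vals, neg_vals)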
You don't need to apply the function to zipped elements of the arrays; you can accomplish the same thing through simple array operations and boolean indexing.
First, compute the positive and negative results and save them as arrays. Then create a return array of zeros (just as a default value), and populate it using boolean masks on pos and neg:
import numpy as np
arorg = np.array([[-1., 2., -4.], [0.5, -1.5, 3]])
values = np.array([1., 0., 2.])
pos = 1. / ((values / arorg) + 1)
neg = values / (values + np.abs(arorg))
ret = np.zeros_like(arorg)
ret[arorg>0] = pos[arorg>0]
ret[arorg<=0] = neg[arorg<=0]
ret
# returns:
array([[ 0.5 , 1. , 0.33333333],
[ 0.33333333, 0. , 0.6 ]])
import numpy as np
arorg = np.array([[-1., 2., -4.], [0.5, -1.5, 3]])
values = np.array([1., 0., 2.])
p = 1.0/(values/arorg+1)
n = values/(values+abs(arorg))
# use np.place to put the negative-case values into p where arorg < 0
np.place(p,arorg<0,n[arorg<0])
print(p)
[[ 0.5 1. 0.33333333]
[ 0.33333333 0. 0.6 ]]

reshape list of numpy arrays and then reshape back

I have a list which consists of several numpy arrays with different shapes.
I want to reshape this list of arrays into a single numpy vector, then change elements in the vector, and then reshape it back to the original list of arrays.
For example:
Input:
[numpy.zeros((2,2)), numpy.ones((3,3))]
First, convert it to a vector:
[0,0,0,0,1,1,1,1,1,1,1,1,1]
Second, change only one element at a time; for example, change the element at index 1 from 0 to 2:
[0,2,0,0,1,1,1,1,1,1,1,1,1]
Last, convert it back to:
[array([[0,2],[0,0]]),array([[1,1,1],[1,1,1],[1,1,1]])]
Is there any fast implementation? Thanks very much.
It seems like converting to a list and back will be inefficient. Instead, why not figure out which array to index (and where) and then just update that index? e.g.
def change_element(arr1, arr2, ix, value):
    which = ix >= arr1.size
    arr = [arr1, arr2][which]
    ix = ix - arr1.size if which else ix
    arr.ravel()[ix] = value
And here's some example usage:
>>> arr1 = np.zeros((2, 2))
>>> arr2 = np.ones((3, 3))
>>> change_element(arr1, arr2, 1, 2)
>>> change_element(arr1, arr2, 6, 3.14)
>>> arr1
array([[ 0., 2.],
[ 0., 0.]])
>>> arr2
array([[ 1. , 1. , 3.14],
[ 1. , 1. , 1. ],
[ 1. , 1. , 1. ]])
>>> change_element(arr1, arr2, 7, 3.14)
>>> arr1
array([[ 0., 2.],
[ 0., 0.]])
>>> arr2
array([[ 1. , 1. , 3.14],
[ 3.14, 1. , 1. ],
[ 1. , 1. , 1. ]])
A few notes -- This updates the arrays in place. It doesn't create new arrays. If you really need to create new arrays, I suppose you could np.copy them and return. Also, this relies on the arrays sharing memory before and after the ravel. I don't remember the exact circumstances where ravel would return a new array rather than a view into the original array...
Generalizing to more arrays is actually quite easy. We just need to walk down the list of arrays and see if ix is less than the array size. If it is, we've found our array. If it isn't, we need to subtract the array's size from ix to represent the number of elements we've traversed thus far:
def change_element(arrays, ix, value):
    for arr in arrays:
        if ix < arr.size:
            arr.ravel()[ix] = value
            return
        ix -= arr.size
And you can call this similar to before:
change_element([arr1, arr2], 6, 3.14159)
@mgilson probably has the best answer for you, but if you absolutely have to convert to a flat list first and then go back again (perhaps because you need to do something else with the flat list as well), then you can do this with list comprehensions:
import numpy as np
lst = [np.zeros((2,2)), np.ones((3,3))]
tlist = [e for a in lst for e in a.ravel()]
tlist[1] = 2
i = 0
lst2 = []
dims = [a.shape for a in lst]
for n, m in dims:
    lst2.append(np.array(tlist[i:i+n*m]).reshape(n,m))
    i += n*m
lst2
[array([[ 0., 2.],
[ 0., 0.]]), array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])]
Of course, you lose the information about your array sizes when you flatten, so you need to store them somewhere (here, in dims).
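An alternative sketch, if the flat representation is really needed, is to let NumPy do the bookkeeping with np.concatenate and np.split (using the same kind of input list as above):
import numpy as np

lst = [np.zeros((2, 2)), np.ones((3, 3))]
shapes = [a.shape for a in lst]
sizes = [a.size for a in lst]

flat = np.concatenate([a.ravel() for a in lst])   # the flat vector
flat[1] = 2                                       # change one element

# split back at the cumulative sizes and restore the original shapes
parts = np.split(flat, np.cumsum(sizes)[:-1])
lst2 = [p.reshape(s) for p, s in zip(parts, shapes)]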

How to determine scaling factor so that covariance matrix has a first element of 1?

I have data which I need to center and scale so that it is centered around the origin. Then the data needs to be rotated so that the direction of maximum variance is on the x-axis. The mean of the data and the covariance are then calculated. I need the first element of the covariance matrix to be 1. I think this is done by adjusting the scaling factor, but I can't figure out what the scaling factor should be.
To center the data I take away the mean, and to rotate I use SVD, but the scaling is still my problem.
signature = numpy.loadtxt(name, comments = '%', usecols = (0,cols-1))
signature = numpy.transpose(signature)
#SVD to get D so that data can be scaled by 1/(highest singular value in D)
U, D, Vt = numpy.linalg.svd( signature , full_matrices=0)
cs = utils.centerscale(signature, scale=False)
signature = cs[0]
#plt.scatter(cs[0][0],cs[0][1],color='r')
#SVD so that data can be rotated so that direction of most variance is on x-axis
U, D, Vt = numpy.linalg.svd( signature , full_matrices=0)
cs = utils.centerscale(signature, center=False, scalefactor=D[0])
U, D, Vt = numpy.linalg.svd( cs[0] , full_matrices=0)
D = numpy.diag(D)
norm = numpy.dot(D,Vt)
The following are examples of results of the mean and cov of norm (the test cases use res).
**********************************************************************
Failed example:
print numpy.mean(res, axis=1)
Expected:
[ 7.52074907e-18 -6.59917722e-18]
Got:
[ -1.22008884e-17 2.41126563e-17]
**********************************************************************
Failed example:
print numpy.cov(res, bias=1)
Expected:
[[ 1.00000000e+00 9.02112676e-18]
[ 9.02112676e-18 1.40592827e-01]]
Got:
[[ 4.16666667e-03 -1.57698124e-19]
[ -1.57698124e-19 5.85803446e-04]]
**********************************************************************
1 items had failures:
2 of 4 in __main__.processfile
***Test Failed*** 2 failures.
All values are irrelevant except for the first element of the covariance matrix, which needs to be one.
I have tried looking everywhere and can't find an answer. Any help would be appreciated.
I don't know what utils.centerscale is or does, but if you want to scale a matrix by a constant factor so that the upper left term of its covariance matrix is 1, you can simply divide the matrix by the square root of the unscaled covariance term:
>>> import numpy
>>> numpy.random.seed(17)
>>> m = numpy.random.rand(5,4)
>>> m
array([[ 0.294665 , 0.53058676, 0.19152079, 0.06790036],
[ 0.78698546, 0.65633352, 0.6375209 , 0.57560289],
[ 0.03906292, 0.3578136 , 0.94568319, 0.06004468],
[ 0.8640421 , 0.87729053, 0.05119367, 0.65241862],
[ 0.55175137, 0.59751325, 0.48352862, 0.28298816]])
>>> c = numpy.cov(m,bias=1)
>>> c
array([[ 0.0288779 , 0.00524455, 0.00155373, 0.02779861, 0.01798404],
[ 0.00524455, 0.00592484, -0.00711072, 0.01006019, 0.00631144],
[ 0.00155373, -0.00711072, 0.13391344, -0.10551922, 0.00945934],
[ 0.02779861, 0.01006019, -0.10551922, 0.11250984, 0.00982862],
[ 0.01798404, 0.00631144, 0.00945934, 0.00982862, 0.01444482]])
>>> numpy.cov(m/c[0][0]**0.5, bias=1)
array([[ 1. , 0.18161135, 0.05380354, 0.96262562, 0.62276138],
[ 0.18161135, 0.20516847, -0.24623392, 0.3483699 , 0.21855613],
[ 0.05380354, -0.24623392, 4.63722877, -3.65397781, 0.32756326],
[ 0.96262562, 0.3483699 , -3.65397781, 3.89605297, 0.34035085],
[ 0.62276138, 0.21855613, 0.32756326, 0.34035085, 0.5002033 ]])
But this has the same effect as simply dividing the covariance matrix by the upper left member:
>>> (numpy.cov(m,bias=1)/numpy.cov(m,bias=1)[0][0])/(numpy.cov(m/c[0][0]**0.5, bias=1))
array([[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.]])
Depending on what you're doing, you might also be interested in numpy.corrcoef, which gives the correlation coefficient matrix instead.
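For reference, a quick illustration of that last point using the same m as above: numpy.corrcoef normalizes every variance to 1, not just the upper left one.
>>> numpy.allclose(numpy.diag(numpy.corrcoef(m)), 1.0)
True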
