Although similar questions have been asked a couple of times, I still cannot write a Python function that behaves like MATLAB's ismember. In particular, I want to use this function in a loop and, in each iteration, compare a whole matrix to one element of another matrix. Where the value occurs I want a 1, and 0 everywhere else.
Let's say that I have the following matrices
d = np.reshape(np.array([ 2.25, 1.25, 1.5 , 1. , 0. , 1.25, 1.75, 0. , 1.5 , 0. ]),(1,10))
d_unique = np.unique(d)
then I have
d_unique
array([ 0. , 1. , 1.25, 1.5 , 1.75, 2.25])
Now I want to iterate like
J = np.zeros(np.size(d_unique))
for i in range(len(d_unique)):
    J[i] = np.sum(ismember(d,d_unique[i]))
so as to take as an output:
J = [3,1,2,2,1,1]
Does anybody have any idea? Many thanks in advance.
In contrast to other answers, numpy has the built-in numpy.in1d for doing that.
Usage in your case:
bool_array = numpy.in1d(array1, array2)
Note: It also accepts lists as inputs.
EDIT (2021):
NumPy now recommends using np.isin instead of np.in1d. np.isin preserves the shape of the input array, while np.in1d returns a flattened output.
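As a minimal sketch of how np.isin could be dropped into the loop from the question (the variable names d, d_unique and J are taken from the question; using np.unique with return_counts=True as a cross-check is my own addition, not part of the original answer):

```python
import numpy as np

# Data from the question: a 1 x 10 matrix and its sorted unique values.
d = np.reshape(np.array([2.25, 1.25, 1.5, 1., 0., 1.25, 1.75, 0., 1.5, 0.]), (1, 10))
d_unique = np.unique(d)

# np.isin(d, v) is a boolean array with the same shape as d;
# summing it counts how often v occurs in d.
J = np.array([np.isin(d, v).sum() for v in d_unique])

# Cross-check: np.unique can return the counts directly.
values, counts = np.unique(d, return_counts=True)

print(J)       # [3 1 2 2 1 1]
print(counts)  # [3 1 2 2 1 1]
```

If only the counts are needed, the return_counts route avoids the Python-level loop entirely.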
To answer your question, I guess you could define an ismember along these lines:
def ismember(d, k):
    return [1 if (i == k) else 0 for i in d]
But I am not familiar with numpy, so a little adjustment may be in order.
I guess you could also use Counter from collections:
>>> from collections import Counter
>>> a = [2.25, 1.25, 1.5, 1., 0., 1.25, 1.75, 0., 1.5, 0. ]
>>> Counter(a)
Counter({0.0: 3, 1.25: 2, 1.5: 2, 2.25: 1, 1.0: 1, 1.75: 1})
>>> Counter(a).keys()
[2.25, 1.25, 0.0, 1.0, 1.5, 1.75]
>>> c = Counter(a)
>>> [c[i] for i in sorted(c.keys())]
[3, 1, 2, 2, 1, 1]
Once again, not numpy, you will probably have to do some list(d) somewhere.
Try the following function:
def ismember(A, B):
    return [ np.sum(a == B) for a in A ]
This should very much behave like the corresponding MATLAB function.
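A short usage sketch for the function above, assuming it is called with the unique values first and the full matrix second (that argument order is my reading of the question, not something the answer states). The broadcast one-liner at the end is an alternative I am adding for comparison:

```python
import numpy as np

def ismember(A, B):
    # For each element a of A, count how many entries of B equal it.
    return [np.sum(a == B) for a in A]

d = np.reshape(np.array([2.25, 1.25, 1.5, 1., 0., 1.25, 1.75, 0., 1.5, 0.]), (1, 10))
d_unique = np.unique(d)

# Counts of each unique value inside d.
J = ismember(d_unique, d)

# Same result without a Python loop: compare every entry of d with
# every unique value via broadcasting, then sum over the entries of d.
J_vec = (d.ravel()[:, None] == d_unique).sum(axis=0)

print([int(v) for v in J])  # [3, 1, 2, 2, 1, 1]
```

The broadcast version builds a (10, 6) boolean array, so for very large inputs the memory cost is worth keeping in mind.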
Try the ismember library from pypi.
pip install ismember
Example:
# Import libraries
import numpy as np
from ismember import ismember
# data (NumPy arrays, so that the indexing below works)
d = np.array([ 2.25, 1.25, 1.5 , 1. , 0. , 1.25, 1.75, 0. , 1.5 , 0. ])
d_unique = np.array([ 0. , 1. , 1.25, 1.5 , 1.75, 2.25])
# Lookup
Iloc,idx = ismember(d, d_unique)
# Iloc is boolean defining existence of d in d_unique
print(Iloc)
# [[True True True True True True True True True True]]
# indexes of d_unique that exists in d
print(idx)
# array([5, 2, 3, 1, 0, 2, 4, 0, 3, 0], dtype=int64)
print(d_unique[idx])
# array([2.25, 1.25, 1.5 , 1. , 0. , 1.25, 1.75, 0. , 1.5 , 0. ])
print(d[Iloc])
# array([2.25, 1.25, 1.5 , 1. , 0. , 1.25, 1.75, 0. , 1.5 , 0. ])
# These vectors will match
d[Iloc]==d_unique[idx]
I have vectors of this form:
test=np.linspace(0,1,10)
I want to stack them horizontally to make a matrix.
The problem is that I define them in a loop, so the first stack is between an empty matrix and the first column vector, which gives the following error:
ValueError: all the input arrays must have same number of dimensions
Bottom line: I have a for loop that creates a vector p1 on every iteration, and I want to add it to a final matrix of the form
[p1 p2 p3 p4], on which I could then do matrix operations such as multiplying by its transpose.
If you've got a list of 1D arrays that you want horizontally stacked, you could convert them all to columns first, but it's probably easier to just vertically stack them and then transpose:
In [6]: vector_list = [np.linspace(0, 1, 10) for _ in range(3)]
In [7]: np.vstack(vector_list).T
Out[7]:
array([[0. , 0. , 0. ],
[0.11111111, 0.11111111, 0.11111111],
[0.22222222, 0.22222222, 0.22222222],
[0.33333333, 0.33333333, 0.33333333],
[0.44444444, 0.44444444, 0.44444444],
[0.55555556, 0.55555556, 0.55555556],
[0.66666667, 0.66666667, 0.66666667],
[0.77777778, 0.77777778, 0.77777778],
[0.88888889, 0.88888889, 0.88888889],
[1. , 1. , 1. ]])
How did you get this dimension error? What does empty array have to do with it?
A list of arrays of the same length:
In [610]: alist = [np.linspace(0,1,6), np.linspace(10,11,6)]
In [611]: alist
Out[611]:
[array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]),
array([10. , 10.2, 10.4, 10.6, 10.8, 11. ])]
Several ways of making an array from them:
In [612]: np.array(alist)
Out[612]:
array([[ 0. , 0.2, 0.4, 0.6, 0.8, 1. ],
[10. , 10.2, 10.4, 10.6, 10.8, 11. ]])
In [614]: np.stack(alist)
Out[614]:
array([[ 0. , 0.2, 0.4, 0.6, 0.8, 1. ],
[10. , 10.2, 10.4, 10.6, 10.8, 11. ]])
If you want to join them in columns, you can transpose one of the above, or use:
In [615]: np.stack(alist, axis=1)
Out[615]:
array([[ 0. , 10. ],
[ 0.2, 10.2],
[ 0.4, 10.4],
[ 0.6, 10.6],
[ 0.8, 10.8],
[ 1. , 11. ]])
np.column_stack is also handy.
In newer numpy versions you can do:
In [617]: np.linspace((0,10),(1,11),6)
Out[617]:
array([[ 0. , 10. ],
[ 0.2, 10.2],
[ 0.4, 10.4],
[ 0.6, 10.6],
[ 0.8, 10.8],
[ 1. , 11. ]])
You don't specify how you create the 'empty array' and how you attempt to stack. I can't exactly recreate the error message (full traceback would have helped). But given that message did you check the number of dimensions of the inputs? Did they match?
Array stacking in a loop is tricky. You have to pay close attention to the shapes, especially of the initial 'empty' array. There isn't a close analog to the empty list []. np.array([]) is 1d with shape (0,). np.empty((0,6)) is 2d with shape (0,6). Also, all the stacking functions create a new array with each call (none operate in place), so they are inefficient compared to list append.
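A minimal sketch of the pattern that sidesteps the 'empty array' problem: collect the vectors in a plain list inside the loop and build the matrix once at the end. The vector computed in each iteration (a scaled linspace) and the names columns, matrix and gram are placeholders of mine, standing in for whatever p1, p2, ... really are:

```python
import numpy as np

columns = []                      # plain Python list, cheap to append to
for k in range(4):
    # Placeholder for the vector produced in each iteration.
    p = np.linspace(0, 1, 10) * (k + 1)
    columns.append(p)

# One stacking call at the end: each 1D vector becomes a column.
matrix = np.column_stack(columns)

# The result is an ordinary 2D array, ready for matrix operations.
gram = matrix.T @ matrix

print(matrix.shape)          # (10, 4)
print(gram.shape)            # (4, 4)
print(np.array([]).shape)    # (0,)
```

Because the list only holds references, appending is cheap, and the data is copied a single time when np.column_stack runs.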
I have a function that creates a 2-dim array, a Vandermonde matrix and is called as:
vandermonde(generator, rank)
Where generator is a n-sized array for example
generator = np.array([-1/2, 1/2, 3/2, 5/2, 7/2, 9/2])
and rank=4
Then I need to create 4 Vandermonde matrices (because rank=4), each skewed by h in my space (h is arbitrary here, let's say h=1).
Therefore I came up with the following deterministic code:
V = np.array([
vandermonde(generator-0*h, rank),
vandermonde(generator-1*h, rank),
vandermonde(generator-2*h, rank),
vandermonde(generator-3*h, rank)
])
Then, instead of making multiple manual calls to vandermonde, I used a for loop:
V=[]
for i in range(rank):
    V.append(vandermonde(generator - h*i, rank))
V = np.array(V)
This approach works fine, but seems too "patchy". I tried an np.append approach as below:
M = np.array([])
for i in range(rank):
    M = np.append(M,[vandermonde(generator - h*i, rank)])
But it didn't work as I expected: np.append seems to extend the flat array instead of adding a new element.
My questions are:
How can I avoid standard Python lists and use NumPy directly, given that np.append just grows the array instead of adding a new array element?
Are there any more direct NumPy approaches to this?
My vandermonde function is:
def vandermonde(generator, rank=None):
    """Returns a vandermonde matrix
    If rank not passed returns a square vandermonde matrix
    """
    if rank is None:
        rank = len(generator)
    return np.tile(generator,(rank,1)) ** np.array(range(rank)).reshape((rank,1))
The expected answer is a 3-dimensional array of shape (rank, rank, len(generator)), where each element is one of the skewed Vandermonde matrices. For the constants above (generator, rank, h) we have:
V= array([[[ 1. , 1. , 1. , 1. , 1. , 1. ],
[ -0.5 , 0.5 , 1.5 , 2.5 , 3.5 , 4.5 ],
[ 0.25, 0.25, 2.25, 6.25, 12.25, 20.25],
[ -0.12, 0.12, 3.38, 15.62, 42.88, 91.12]],
[[ 1. , 1. , 1. , 1. , 1. , 1. ],
[ -1.5 , -0.5 , 0.5 , 1.5 , 2.5 , 3.5 ],
[ 2.25, 0.25, 0.25, 2.25, 6.25, 12.25],
[ -3.38, -0.12, 0.12, 3.38, 15.62, 42.88]],
[[ 1. , 1. , 1. , 1. , 1. , 1. ],
[ -2.5 , -1.5 , -0.5 , 0.5 , 1.5 , 2.5 ],
[ 6.25, 2.25, 0.25, 0.25, 2.25, 6.25],
[-15.62, -3.38, -0.12, 0.12, 3.38, 15.62]],
[[ 1. , 1. , 1. , 1. , 1. , 1. ],
[ -3.5 , -2.5 , -1.5 , -0.5 , 0.5 , 1.5 ],
[ 12.25, 6.25, 2.25, 0.25, 0.25, 2.25],
[-42.88, -15.62, -3.38, -0.12, 0.12, 3.38]]])
Some related ideas can be found in this discussion on: efficient-way-to-compute-the-vandermonde-matrix
Use broadcasting to get the final 3D array in a vectorized manner -
r = np.arange(rank)
V_out = (generator - h*r[:,None,None]) ** r[:,None]
We can also use cumprod to achieve the exponential values for another solution -
gr = np.repeat(generator - h*r[:,None,None], rank, axis=1)
gr[:,0] = 1
out = gr.cumprod(1)
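As a quick sanity check, the sketch below compares both vectorized versions against the list-based loop from the question, using the question's own vandermonde function and constants. The names V_loop and V_cumprod are mine:

```python
import numpy as np

def vandermonde(generator, rank=None):
    """Vandermonde matrix as defined in the question."""
    if rank is None:
        rank = len(generator)
    return np.tile(generator, (rank, 1)) ** np.arange(rank).reshape((rank, 1))

generator = np.array([-1/2, 1/2, 3/2, 5/2, 7/2, 9/2])
rank, h = 4, 1

# Reference: the list-based loop from the question.
V_loop = np.array([vandermonde(generator - h*i, rank) for i in range(rank)])

# Broadcasting: shifts vary along axis 0, exponents along axis 1.
r = np.arange(rank)
V_out = (generator - h*r[:, None, None]) ** r[:, None]

# cumprod variant: repeat each shifted generator, set the first row to 1,
# then take running products along axis 1 to build the powers.
gr = np.repeat(generator - h*r[:, None, None], rank, axis=1)
gr[:, 0] = 1
V_cumprod = gr.cumprod(1)

print(V_out.shape)                    # (4, 4, 6)
print(np.allclose(V_out, V_loop))     # True
print(np.allclose(V_cumprod, V_loop)) # True
```

Note that the same r is used for both the shifts and the exponents here only because the question asks for rank shifted matrices of rank rows each; with a different number of shifts, two separate ranges would be needed.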
I was reading and came across this formula:
The formula is for cosine similarity. I thought this looked interesting and I created a numpy array that has user_id as row and item_id as column. For instance, let M be this matrix:
M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]]
Here the entries of the matrix are the ratings that user u has given to item i, with u indexing the row and i the column. I want to calculate this cosine similarity between the items (columns) of this matrix. This should yield a 5 x 5 matrix, I believe. I tried to do
df = pd.DataFrame(M)
item_mean_subtracted = df.sub(df.mean(axis=0), axis=1)
similarity_matrix = item_mean_subtracted.fillna(0).corr(method="pearson").values
However, this does not seem right.
Here's a possible implementation of the adjusted cosine similarity:
import numpy as np
from scipy.spatial.distance import pdist, squareform
M = np.asarray([[2, 3, 4, 1, 0],
[0, 0, 0, 0, 5],
[5, 4, 3, 0, 0],
[1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
Remarks:
I'm taking advantage of NumPy broadcasting to subtract the mean.
If M is a sparse matrix, you could do something like this: M.toarray().
From the docs:
Y = pdist(X, 'cosine')
Computes the cosine distance between vectors u and v,
1 - (u·v) / (||u||_2 ||v||_2)
where ||*||_2 is the 2-norm of its argument *, and u·v is the dot product of u and v.
Array transposition is performed through the T attribute.
Demo:
In [277]: M_u
Out[277]: array([ 2. , 1. , 2.4, 1. ])
In [278]: item_mean_subtracted
Out[278]:
array([[ 0. , 1. , 2. , -1. , -2. ],
[-1. , -1. , -1. , -1. , 4. ],
[ 2.6, 1.6, 0.6, -2.4, -2.4],
[ 0. , 0. , 0. , 0. , 0. ]])
In [279]: np.set_printoptions(precision=2)
In [280]: similarity_matrix
Out[280]:
array([[ 1. , 0.87, 0.4 , -0.68, -0.72],
[ 0.87, 1. , 0.8 , -0.65, -0.91],
[ 0.4 , 0.8 , 1. , -0.38, -0.8 ],
[-0.68, -0.65, -0.38, 1. , 0.27],
[-0.72, -0.91, -0.8 , 0.27, 1. ]])
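For readers who want to see the formula spelled out, here is a sketch of the same computation in plain NumPy, without pdist. It assumes every centered item column has a nonzero norm (true for this M); the names centered, norms and sim are mine:

```python
import numpy as np

M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]])

# Subtract each user's mean rating (row mean) from that user's ratings.
centered = M - M.mean(axis=1, keepdims=True)

# Cosine similarity between item columns: dot products divided by
# the product of the column norms.
norms = np.linalg.norm(centered, axis=0)
sim = (centered.T @ centered) / np.outer(norms, norms)

print(sim.shape)  # (5, 5)
print(np.round(sim[0, 1], 2))  # 0.87
```

A column whose centered entries are all zero would lead to a division by zero here, so real data may need a guard for that case.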
I'm hoping to delete columns in my arrays that have repeated entries in row 1, as shown below (row 1 has repeats of the values 1 and 2.5, so one of each of those values has been deleted, together with the column each deleted value lies in).
initial_array =
row 0 [[ 1, 1, 1, 1, 1, 1, 1, 1,]
row 1 [0.5, 1, 2.5, 4, 2.5, 2, 1, 3.5,]
row 2 [ 1, 1.5, 3, 4.5, 3, 2.5, 1.5, 4,]
row 3 [228, 314, 173, 452, 168, 351, 300, 396]]
final_array =
row 0 [[ 1, 1, 1, 1, 1, 1,]
row 1 [0.5, 1, 2.5, 4, 2, 3.5,]
row 2 [ 1, 1.5, 3, 4.5, 2.5, 4,]
row 3 [228, 314, 173, 452, 351, 396]]
Ways I was thinking of included using some function that checks for repeats, giving a True response for the second (or later) time a value turns up in the dataset, then using that response to delete the column. That, or possibly using the return_index option of numpy.unique. I just can't quite find a way through it or find the right function though.
If I could find a way to store in row 3 the mean of the retained repeat and the deleted one, that would be even better (see below).
final_array_averaged =
row 0 [[ 1, 1, 1, 1, 1, 1,]
row 1 [0.5, 1, 2.5, 4, 2, 3.5,]
row 2 [ 1, 1.5, 3, 4.5, 2.5, 4,]
row 3 [228, 307, 170.5, 452, 351, 396]]
Thanks in advance for any help you can give to a beginner who is stumped!
You can use the optional arguments that come with np.unique and then use np.bincount to use the last row as weights to get the final averaged output, like so -
_,unqID,tag,C = np.unique(arr[1],return_index=1,return_inverse=1,return_counts=1)
out = arr[:,unqID]
out[-1] = np.bincount(tag,arr[3])/C
Sample run -
In [212]: arr
Out[212]:
array([[ 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. ],
[ 0.5, 1. , 2.5, 4. , 2.5, 2. , 1. , 3.5],
[ 1. , 1.5, 3. , 4.5, 3. , 2.5, 1.5, 4. ],
[ 228. , 314. , 173. , 452. , 168. , 351. , 300. , 396. ]])
In [213]: out
Out[213]:
array([[ 1. , 1. , 1. , 1. , 1. , 1. ],
[ 0.5, 1. , 2. , 2.5, 3.5, 4. ],
[ 1. , 1.5, 2.5, 3. , 4. , 4.5],
[ 228. , 307. , 351. , 170.5, 396. , 452. ]])
As can be seen, the columns of the output are now ordered so that the second row is sorted. If you want to keep the original order, use np.argsort on unqID, like so -
In [221]: out[:,unqID.argsort()]
Out[221]:
array([[ 1. , 1. , 1. , 1. , 1. , 1. ],
[ 0.5, 1. , 2.5, 4. , 2. , 3.5],
[ 1. , 1.5, 3. , 4.5, 2.5, 4. ],
[ 228. , 307. , 170.5, 452. , 351. , 396. ]])
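Put together as one self-contained script, the steps above might look like the sketch below. It uses the question's data; the descriptive names first_idx, inverse, counts and result are mine, and the array is created as float so the averaged row can hold 170.5:

```python
import numpy as np

arr = np.array([[1,   1,   1,   1,   1,   1,   1,   1],
                [0.5, 1,   2.5, 4,   2.5, 2,   1,   3.5],
                [1,   1.5, 3,   4.5, 3,   2.5, 1.5, 4],
                [228, 314, 173, 452, 168, 351, 300, 396]], dtype=float)

# first_idx: column of the first occurrence of each unique row-1 value
# inverse:   for every column, the index of its unique value
# counts:    how many columns share each unique value
_, first_idx, inverse, counts = np.unique(
    arr[1], return_index=True, return_inverse=True, return_counts=True)

out = arr[:, first_idx]            # fancy indexing returns a copy
# Sum row 3 per group, then divide by the group size to get the mean.
out[-1] = np.bincount(inverse, weights=arr[3]) / counts

# Restore the original left-to-right order of first occurrences.
result = out[:, first_idx.argsort()]

print(result[1])   # [0.5 1.  2.5 4.  2.  3.5]
print(result[3])   # [228.  307.  170.5 452.  351.  396. ]
```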
You can find the indices of wanted columns using unique:
>>> indices = np.sort(np.unique(A[1], return_index=True)[1])
Then use simple indexing to get the desired columns:
>>> A[:,indices]
array([[ 1. , 1. , 1. , 1. , 1. , 1. ],
[ 0.5, 1. , 2.5, 4. , 2. , 3.5],
[ 1. , 1.5, 3. , 4.5, 2.5, 4. ],
[ 228. , 314. , 173. , 452. , 351. , 396. ]])
This is a typical grouping problem, which can be solved elegantly and efficiently using the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
unique, final_array = npi.group_by(initial_array[1]).mean(initial_array, axis=1)
Note that there are many other reductions than mean; if you want the original behavior you described, you could replace 'mean' with 'first', for instance.
I'm setting a numpy array with a power-law equation. The problem is that part of my domain tries to do numpy.power(x, n) when x is negative and n is not an integer. In this part of the domain I want the value to be 0.0. Below is code that has the correct behavior, but is there a more Pythonic way to do this?
# note mesh.x is a numpy array of length nx
myValues = npy.zeros((nx))
para = [5.8780046, 0.714285714, 2.819250868]
for j in range(nx):
    if mesh.x[j] > para[1]:
        myValues[j] = para[0]*npy.power(mesh.x[j]-para[1],para[2])
    else:
        myValues[j] = 0.0
Is "numpythonic" a word? It should be a word. The following is really neither pythonic nor unpythonic, but it is much more efficient than using a for loop, and close(r) to the way Travis would probably do it:
import numpy
mesh_x = numpy.array([0.5,1.0,1.5])
myValues = numpy.zeros_like( mesh_x )
para = [5.8780046, 0.714285714, 2.819250868]
mask = mesh_x > para[1]
myValues[mask] = para[0] * numpy.power(mesh_x[mask] - para[1], para[2])
print(myValues)
For very large problems you would probably want to avoid creating temporary arrays:
mask = mesh.x > para[1]
myValues[mask] = mesh.x[mask]
myValues[mask] -= para[1]
myValues[mask] **= para[2]
myValues[mask] *= para[0]
Here's one approach with np.where to choose values between the power calculations and 0 -
import numpy as np
np.where(mesh.x>para[1],para[0]*np.power(mesh.x-para[1],para[2]),0)
Explanation :
np.where(mask,A,B) chooses elements from A or B depending on the mask elements. In our case the mask is mesh.x>para[1], a vectorized comparison over all mesh.x elements in one go.
para[0]*np.power(mesh.x-para[1],para[2]) gives us the elements that are to be chosen in case a mask element is True. Else, we choose 0, which is the third argument to np.where.
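One thing to be aware of: np.where receives both candidate arrays already evaluated, so np.power is still applied to the negative entries, which produces nan values and a RuntimeWarning before they are discarded. A possible workaround, sketched below as my own variation rather than part of the answer, is to clamp the shifted values at zero first, since 0 raised to a positive power is 0. The sample x values stand in for mesh.x:

```python
import numpy as np

x = np.array([0.5, 1.0, 1.5])      # stand-in for mesh.x
para = [5.8780046, 0.714285714, 2.819250868]

# Clamp negative bases to 0 so np.power never sees a negative input.
# This relies on the exponent para[2] being positive.
shifted = np.maximum(x - para[1], 0.0)
values = para[0] * np.power(shifted, para[2])

# Reference result using the boolean-mask approach.
expected = np.zeros_like(x)
mask = x > para[1]
expected[mask] = para[0] * np.power(x[mask] - para[1], para[2])

print(np.allclose(values, expected))  # True
print(values[0])                      # 0.0
```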
This is more an explanation, with simple examples, of the answers given by @jez and @Divakar than an answer in itself. They both rely on some form of boolean indexing.
>>>
>>> a
array([[-4.5, -3.5, -2.5],
[-1.5, -0.5, 0.5],
[ 1.5, 2.5, 3.5]])
>>> n = 2.2
>>> a ** n
array([[ nan, nan, nan],
[ nan, nan, 0.21763764],
[ 2.44006149, 7.50702771, 15.73800567]])
np.where is made for this: it selects one of two values based on a boolean array.
>>> np.where(np.isnan(a**n), 0, a**n)
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 0.21763764],
[ 2.44006149, 7.50702771, 15.73800567]])
>>>
>>> b = np.where(a < 0, 0, a)
>>> b
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 0.5],
[ 1.5, 2.5, 3.5]])
>>> b **n
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 0.21763764],
[ 2.44006149, 7.50702771, 15.73800567]])
Use of boolean indexing on the left-hand-side and the right-hand-side. This is similar to np.where
>>>
>>> a[a >= 0] = a[a >= 0] ** n
>>> a
array([[ -4.5 , -3.5 , -2.5 ],
[ -1.5 , -0.5 , 0.21763764],
[ 2.44006149, 7.50702771, 15.73800567]])
>>> a[a < 0] = 0
>>> a
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 0.21763764],
[ 2.44006149, 7.50702771, 15.73800567]])
>>>