I'm hoping somebody can help me with the following.
I have two lists of arrays which should be linked to each other. Each list stands for a certain object; arr1 and arr2 are the attributes of that object.
For example:
import numpy as np
arr1 = [np.array([1, 2, 3]), np.array([1, 2]), np.array([2, 3])]
arr2 = [np.array([20, 50, 30]), np.array([50, 50]), np.array([75, 25])]
The arrays are linked to each other positionally: the 1 in the first array of arr1 belongs to the 20 in the first array of arr2, and so on. The result I'm looking for in this example would be a numpy array of shape (3, 4). The 'columns' stand for 0, 1, 2, 3 (the values occurring in arr1, plus 0) and the rows are filled with the corresponding values of arr2. Where there is no corresponding value, the cell should be 0.
Example:
array([[ 0, 20, 50, 30],
       [ 0, 50, 50,  0],
       [ 0,  0, 75, 25]])
How would I link these two lists of arrays and reshape them into the desired format shown in the above example?
Many thanks!
Here's an almost* vectorized approach -
lens = np.array([len(i) for i in arr1])   # entries per object
N = len(arr1)                             # number of rows
row_idx = np.repeat(np.arange(N), lens)   # row index for every value
col_idx = np.concatenate(arr1)            # column index for every value
M = col_idx.max() + 1                     # columns 0 .. max index
out = np.zeros((N, M), dtype=int)
out[row_idx, col_idx] = np.concatenate(arr2)
*: Almost, because of the list comprehension at the start, but that should be computationally negligible, as it only collects lengths and does no real numerical work.
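A quick check with the sample data from the question:
print(out)
# [[ 0 20 50 30]
#  [ 0 50 50  0]
#  [ 0  0 75 25]]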
Here is a solution with for-loops, showing each step in detail.
import numpy as np
arr1 = [np.array([1, 2, 3]), np.array([1, 2]), np.array([2, 3])]
arr2 = [np.array([20, 50, 30]), np.array([50, 50]), np.array([75, 25])]
maxi = []
for i in range(len(arr1)):
    maxi.append(np.max(arr1[i]))
maxi = np.max(maxi)

# the columns run from 0 to maxi, so we need maxi + 1 of them
output = np.zeros((len(arr2), maxi + 1), dtype=int)
for i in range(len(arr1)):
    for k in range(len(arr1[i])):
        # arr1 gives the column index, arr2 the value
        output[i][arr1[i][k]] = arr2[i][k]
This is a straightforward approach, with only one level of iteration:
In [261]: res=np.zeros((3,4),int)
In [262]: for i,(idx,vals) in enumerate(zip(arr1, arr2)):
     ...:     res[i,idx]=vals
     ...:
In [263]: res
Out[263]:
array([[ 0, 20, 50, 30],
       [ 0, 50, 50,  0],
       [ 0,  0, 75, 25]])
I suspect it is faster than @Divakar's approach for this example, and it should remain competitive as long as the number of columns is quite a bit larger than the number of rows.
Related
I am trying to solve the following system of linear equations:
10x1 + 40x2 + 70x3 = 300
20x1 + 50x2 + 80x3 = 360
30x1 + 60x2 + 80x3 = 390
by using Cramer's method, implementing a function from scratch:
def cramer(mat, constant):  # takes the matrix and the constants
    D = np.linalg.det(mat)  # calculating the determinant of the original matrix
    # substituting the constants into a column and creating a new matrix
    mat1 = np.array([constant, mat[1], mat[2]])
    mat2 = np.array([mat[0], constant, mat[2]])
    mat3 = np.array([mat[0], mat[1], constant])
    # calculating the determinant of each matrix
    D1 = np.linalg.det(mat1)
    D2 = np.linalg.det(mat2)
    D3 = np.linalg.det(mat3)
    # finding X1, X2, X3
    X1 = D1/D
    X2 = D2/D
    X3 = D3/D
    print(X1, X2, X3)
By using the above function on the system
a = np.array([[10, 40, 70],
              [20, 50, 80],
              [30, 60, 80]])
b = np.array([300, 360, 390])
I get the following result:
cramer(a,b)
-22.99999999999996 21.999999999999964 2.999999999999993
I have solved the system using the numpy function np.linalg.solve and I get another result:
x = np.linalg.solve(a, b)
[1. 2. 3.]
I cannot spot the formula error in the function I have written. What should I adjust in the function in order to make it work properly?
The main problem is how you select the columns of a: you are actually selecting the rows of a rather than the columns. You can fix it by changing the matrix initializations to this:
mat1 = np.array([constant, mat[:,1], mat[:,2]])
mat2 = np.array([mat[:,0], constant, mat[:,2]])
mat3 = np.array([mat[:,0], mat[:,1], constant])
Basically, mat[:,1] is saying something like mat[all rows, column 1]. (Note that np.array([constant, mat[:,1], mat[:,2]]) stacks these vectors as rows, i.e. it actually builds the transpose of the substituted matrix; since det(A^T) = det(A), the determinant, and hence the result, is unchanged.)
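Equivalently, and perhaps more explicitly, you could copy mat and overwrite one column at a time (a sketch, not the code from the question):
mat1 = mat.copy()
mat1[:, 0] = constant   # column 0 replaced by the constants
mat2 = mat.copy()
mat2[:, 1] = constant
mat3 = mat.copy()
mat3[:, 2] = constant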
TL;DR Optimal solution at the bottom.
To fix your current solution, you need to index along the second dimension (the columns). Also, np.linalg.det accepts a stack of matrices, so the three determinants can be computed in a single call:
def cramer(mat, constant):
    D = np.linalg.det(mat)
    mat1 = np.array([constant, mat[:, 1], mat[:, 2]])
    mat2 = np.array([mat[:, 0], constant, mat[:, 2]])
    mat3 = np.array([mat[:, 0], mat[:, 1], constant])
    Dx = np.linalg.det([mat1, mat2, mat3])
    X = Dx / D
    print(X)
However, you don't need to create all these matrices one by one either. Instead, use a series of numpy manipulations described below.
First, create the mask so you can then use it to replace values in a by values from b:
>>> mask = np.broadcast_to(np.diag([1, 1, 1]), [3, 3, 3]).swapaxes(0, 1)
>>> mask
array([[[1, 0, 0],
        [1, 0, 0],
        [1, 0, 0]],

       [[0, 1, 0],
        [0, 1, 0],
        [0, 1, 0]],

       [[0, 0, 1],
        [0, 0, 1],
        [0, 0, 1]]])
Then use np.where to get three matrices, each with one column replaced by b:
>>> Ms = np.where(mask, np.repeat(b, 3).reshape(3, 3), a)
>>> Ms
array([[[300,  40,  70],
        [360,  50,  80],
        [390,  60,  80]],

       [[ 10, 300,  70],
        [ 20, 360,  80],
        [ 30, 390,  80]],

       [[ 10,  40, 300],
        [ 20,  50, 360],
        [ 30,  60, 390]]])
Then, compute the three determinants and divide by the determinant of a itself:
>>> np.linalg.det(Ms) / np.linalg.det(a)
array([1., 2., 3.])
Putting it all together:
def cramer(a, b):
    mask = np.broadcast_to(np.diag([1, 1, 1]), [3, 3, 3]).swapaxes(0, 1)
    Ms = np.where(mask, np.repeat(b, 3).reshape(3, 3), a)
    return np.linalg.det(Ms) / np.linalg.det(a)
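Applied to the system from the question:
>>> a = np.array([[10, 40, 70],
...               [20, 50, 80],
...               [30, 60, 80]])
>>> b = np.array([300, 360, 390])
>>> cramer(a, b)
array([1., 2., 3.])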
I'm working in python using numpy (could be a pandas series too) and am trying to make the following calculation:
Let's say I have an array corresponding to points on the x axis:
2, 9, 5, 6, 55, 8
For each element in this array I would like to get the distance to the closest element so the output would look like the following:
3, 1, 1, 1, 46, 1
I am trying to find a solution that can scale to 2D (distance to nearest XY point) and ideally would avoid a for loop. Is that possible?
There seems to be a theme with O(N^2) solutions here. For 1D, it's quite simple to get O(N log N):
x = np.array([2, 9, 5, 6, 55, 8])
i = np.argsort(x)
dist = np.diff(x[i])                  # gaps between consecutive sorted values
min_dist = np.r_[dist[0], np.minimum(dist[1:], dist[:-1]), dist[-1]]
min_dist = min_dist[np.argsort(i)]    # undo the sort
This clearly won't scale well to multiple dimensions, so use scipy.spatial.KDTree instead. Assuming your data is N-dimensional and has shape (M, N), you can do
from scipy.spatial import KDTree

k = KDTree(data)
dist = k.query(data, k=2)[0][:, -1]   # k=2 because each point's nearest hit is itself
Scipy has a Cython implementation of KDTree, cKDTree. Sklearn has a sklearn.neighbors.KDTree with a similar interface as well.
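A minimal runnable sketch on the question's sample data (the reshape to (M, 1) turns the 1-D values into N-dimensional points with N=1):
import numpy as np
from scipy.spatial import KDTree

data = np.array([2, 9, 5, 6, 55, 8], dtype=float).reshape(-1, 1)
tree = KDTree(data)
dist, _ = tree.query(data, k=2)   # each point's nearest hit is itself
print(dist[:, 1])                 # [ 3.  1.  1.  1. 46.  1.]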
Approach 1
You can use broadcasting in order to get matrix of distances:
>>> data = np.array([2,9,5,6,55,8])
>>> dst_matrix = data - data[:, None]
>>> dst_matrix
array([[  0,   7,   3,   4,  53,   6],
       [ -7,   0,  -4,  -3,  46,  -1],
       [ -3,   4,   0,   1,  50,   3],
       [ -4,   3,  -1,   0,  49,   2],
       [-53, -46, -50, -49,   0, -47],
       [ -6,   1,  -3,  -2,  47,   0]])
Then we can eliminate the diagonal, as proposed in this post:
dst_matrix = dst_matrix[~np.eye(dst_matrix.shape[0],dtype=bool)].reshape(dst_matrix.shape[0],-1)
>>> dst_matrix
array([[  7,   3,   4,  53,   6],
       [ -7,  -4,  -3,  46,  -1],
       [ -3,   4,   1,  50,   3],
       [ -4,   3,  -1,  49,   2],
       [-53, -46, -50, -49, -47],
       [ -6,   1,  -3,  -2,  47]])
Finally, the minimum items can be found:
>>> np.min(np.abs(dst_matrix), axis=1)
array([ 3, 1, 1, 1, 46, 1])
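An alternative to masking out the diagonal is to overwrite it in place before taking the minimum (a small sketch using the same data):
dst_matrix = np.abs(data - data[:, None])
np.fill_diagonal(dst_matrix, dst_matrix.max() + 1)   # exclude self-distances
print(dst_matrix.min(axis=1))                        # [ 3  1  1  1 46  1]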
Approach 2
If you're looking for a time- and memory-efficient solution, the best option is scipy.spatial.cKDTree, which packs points (of any dimension) into a data structure optimized for querying closest points. It extends directly to 2D or 3D.
import scipy.spatial
data = np.array([2,9,5,6,55,8])
ckdtree = scipy.spatial.cKDTree(data[:,None])
distances, idx = ckdtree.query(data[:,None], k=2)
output = distances[:,1]   # distances to the nearest non-coincident points
Querying the first two closest points for each point is required here because the closest one is expected to be the point itself (at distance zero). This is the only solution I found among all the proposed answers that doesn't take ages (the average performance is about 4 seconds for 1M points). Warning: you need to filter out duplicated points before applying this method.
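For instance, a 2-D sketch with made-up points (the coordinates here are just an illustration):
import numpy as np
import scipy.spatial

pts = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [1.0, 1.0]])
tree = scipy.spatial.cKDTree(pts)
d, _ = tree.query(pts, k=2)
print(d[:, 1])   # distance from each point to its nearest (other) point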
There are many ways of achieving it. Some readable and generalizable ways are:
Approach 1:
dist = np.abs(a[:,None]-a)
np.min(dist, where=~np.eye(len(a),dtype=bool), initial=dist.max(), axis=1)
#[ 3 1 1 1 46 1]
Approach 2:
dist = np.abs(np.subtract.outer(a,a))
np.min(dist, where=~np.eye(len(a),dtype=bool), initial=dist.max(), axis=1)
For a 2-D case, approach 1 (this assumes Euclidean distance; any other metric is also possible):
from scipy.spatial.distance import cdist
dist = cdist(a,a)
np.min(dist, where=~np.eye(len(a),dtype=bool), initial=dist.max(), axis=1)
For a 2-D case approach 2 with numpy only:
dist=np.sqrt(((a[:,None]-a)**2).sum(-1))
np.min(dist, where=~np.eye(len(a),dtype=bool), initial=dist.max(), axis=1)
You can achieve a faster distance calculation by using np.dot, via the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y.
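A sketch of that trick (assuming a is a 2-D array of shape (M, N)):
sq = (a ** 2).sum(axis=1)                          # squared norms, shape (M,)
dist_sq = sq[:, None] + sq[None, :] - 2 * a @ a.T  # squared pairwise distances
dist = np.sqrt(np.maximum(dist_sq, 0))             # clip tiny negative rounding errors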
You can use a list comprehension inside apply on a pandas Series:
s = pd.Series([2,9,5,6,55,8])
s.apply(lambda x: min([abs(x - s[y]) for y in s.index if s[y] != x]))
Out[1]:
0     3
1     1
2     1
3     1
4    46
5     1
Then you can just add .to_list() or .to_numpy() to the end to get rid of the series index:
s.apply(lambda x: min([abs(x - s[y]) for y in s.index if s[y] != x])).to_numpy()
array([ 3, 1, 1, 1, 46, 1], dtype=int64)
Say I have the numpy array arr_1 = np.arange(10) returning:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
How do I change multiple elements to a certain value using slicing?
For example: changing the zeroth, first and second element that occur every five elements, starting from the first element, to 100. I want this:
array([0, 100, 100, 100, 4, 5, 100, 100, 100, 9])
I tried arr_1[1::[5, 6, 7]] = 100 but that doesn't work.
Here is another solution, based on what you tried:
arr_1 = np.arange(10)
arr_1[1::5] = 100
arr_1[2::5] = 100
arr_1[3::5] = 100
and it returns:
array([ 0, 100, 100, 100, 4, 5, 100, 100, 100, 9])
If your repeat offset divides the array length:
a.reshape((-1, 5))[:, 1:4] = 100
General case requires two lines:
a[: len(a) // 5 * 5].reshape((-1, 5))[:, 1:4] = 100
a[len(a) // 5 * 5 :][1:4] = 100
How it works: Reshaping in the described way stacks consecutive stretches of the array in such a way that the target substretches are aligned and can therefore be addressed in one go using standard 2d indexing:
>>> a = np.arange(15)
>>> a.reshape((-1, 5))
array([[ 0,  1x,  2x,  3x,  4],
       [ 5,  6x,  7x,  8x,  9],
       [10, 11x, 12x, 13x, 14]])
(the elements marked with x are the ones addressed by [:, 1:4])
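For example, with a length that is not a multiple of 5, the two-line general case gives (a quick sketch):
a = np.arange(12)
a[: len(a) // 5 * 5].reshape((-1, 5))[:, 1:4] = 100
a[len(a) // 5 * 5 :][1:4] = 100
print(a)   # [  0 100 100 100   4   5 100 100 100   9  10 100]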
Here's one approach with masking -
a = np.arange(10) # Input array
idx = np.array([0,1,2]) # Indices to be set
offset = 1 # Offset
a[np.in1d(np.mod(np.arange(a.size),5) , idx+offset)] = 100
Sample run with original sample -
In [849]: a = np.arange(10) # Input array
...: idx = np.array([0,1,2]) # Indices to be set
...: offset = 1 # Offset
...:
...: a[np.in1d(np.mod(np.arange(a.size),5) , idx+offset)] = 100
...:
In [850]: a
Out[850]: array([ 0, 100, 100, 100, 4, 5, 100, 100, 100, 9])
Sample run with non-sequential indices -
In [851]: a = np.arange(11) # Input array
...: idx = np.array([0,2,3]) # Indices to be set
...: offset = 1 # Offset
...:
In [852]: a[np.in1d(np.mod(np.arange(a.size),5) , idx+offset)] = 100
In [853]: a
Out[853]: array([ 0, 100, 2, 100, 100, 5, 100, 7, 100, 100, 10])
You just need to wrap your list of indexes in np.array(list). You were very close to being correct:
In [2]: arr_1 = np.arange(10)
In [3]: arr_1[np.array([0,1,2,5,6,7])] = 100
In [4]: arr_1
Out[4]: array([100, 100, 100, 3, 4, 100, 100, 100, 8, 9])
I used hand-coded values for the indexes, per your requirements (note that reproducing the exact output in the question would take np.array([1, 2, 3, 6, 7, 8])). You can get the indexes in an automated way using whatever technique you like, such as the one shown by Divakar.
Let's say I have a (sparse) matrix M of size (N*N, N*N). I want to select elements from this matrix where the outer product of grid (an (n, m) boolean array, where n*m = N) is True; na = grid.sum() is the number of True entries. This can be done as follows:
result = M[np.outer(grid.flatten(), grid.flatten())].reshape((na, na))
result is an (na, na) sparse array (and na < N). The previous line is what I want to achieve: get the elements of M that correspond to True entries in the outer product of grid, and squeeze the ones that don't out of the array.
As n and m (and hence N) grow, and M and result are sparse matrices, I am not able to do this efficiently in terms of memory or speed. The closest I have tried is:
result = sp.lil_matrix((1, N*N), dtype=np.float32)
# Calculate outer product
A = np.einsum("i,j", grid.flatten(), grid.flatten())
cntr = 0
it = np.nditer(A, flags=['multi_index'])
while not it.finished:
    if it[0]:
        result[0, cntr] = M[it.multi_index[0], it.multi_index[1]]
        cntr += 1
# reshape result to be an N*N sparse matrix
The last reshape could be done by this approach, but I haven't got there yet, as the while loop is taking forever.
I have also tried selecting the nonzero elements of A and looping over them, but this eats up all of the memory:
A = np.einsum("i,j", grid.flatten(), grid.flatten())
nzero = A.nonzero()   # This eats lots of memory
cntr = 0
for (i, j) in zip(*nzero):
    temp_mat[0, cntr] = M[i, j]
    cntr += 1
'n' and 'm' in the example above are around 300.
I don't know if it was a typo or a code error, but your example is missing an iternext:
R = []
it = np.nditer(A, flags=['multi_index'])
while not it.finished:
    if it[0]:
        R.append(M[it.multi_index])
    it.iternext()
I think appending to a list is simpler and faster than R[cntr] = .... It's competitive if R is a regular array, and sparse indexing is slower (even in the fastest lil format).
ndindex wraps this use of nditer as:
R = []
for index in np.ndindex(A.shape):
    if A[index]:
        R.append(M[index])
ndenumerate also works:
R = []
for index, a in np.ndenumerate(A):
    if a:
        R.append(M[index])
But I wonder if you really want to advance cntr at every iteration step, not just for the True cases; otherwise reshaping result to (N, N) doesn't make much sense. But in that case, isn't your problem just
M[:N, :N].multiply(A)
or if M was a dense array:
M[:N, :N]*A
In fact if both M and A are sparse, then the .data attribute of that multiply will be the same as the R list.
In [76]: N=4
In [77]: M=np.arange(N*N*N*N).reshape(N*N,N*N)
In [80]: a=np.array([0,1,0,1])
In [81]: A=np.einsum('i,j',a,a)
In [82]: A
Out[82]:
array([[0, 0, 0, 0],
       [0, 1, 0, 1],
       [0, 0, 0, 0],
       [0, 1, 0, 1]])
In [83]: M[:N, :N]*A
Out[83]:
array([[ 0,  0,  0,  0],
       [ 0, 17,  0, 19],
       [ 0,  0,  0,  0],
       [ 0, 49,  0, 51]])
In [84]: c=sparse.csr_matrix(M)[:N,:N].multiply(sparse.csr_matrix(A))
In [85]: c.data
Out[85]: array([17, 19, 49, 51], dtype=int32)
In [89]: [M[index] for index, a in np.ndenumerate(A) if a]
Out[89]: [17, 19, 49, 51]
Consider a list of n scipy.sparse arrays with entries of type float. I am using the Compressed Sparse Row (CSR) format.
my_list = [sparse_array_1, sparse_array_2, ... , sparse_array_n]
Each sparse_array_i has the same length.
What I want to generate is a list of the maximum values per position, taken across all the arrays. So this example
[array[0,    array[4,     array[88,
       3,          2,           287,
       99,         1234,        0,
       3],         0],          77]
would result in
[88, 287, 1234, 77]
Is this possible in a pythonic way?
I'm not familiar with scipy sparse arrays, but if they behave like other python iterables then a combination of map and zip will achieve what you want:
>>> arr
[[0, 3, 99, 3], [4, 2, 1234, 0], [88, 287, 0, 77]]
>>> list(zip(*arr))
[(0, 4, 88), (3, 2, 287), (99, 1234, 0), (3, 0, 77)]
>>> list(map(max, zip(*arr)))
[88, 287, 1234, 77]
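If the elements really are scipy sparse matrices rather than plain lists, a sketch of the same idea using scipy.sparse directly (assuming each sparse_array_i is a 1 x L CSR row; note that the sparse max also considers the implicit zeros):
import numpy as np
from scipy import sparse

rows = [sparse.csr_matrix([[0, 3, 99, 3]]),
        sparse.csr_matrix([[4, 2, 1234, 0]]),
        sparse.csr_matrix([[88, 287, 0, 77]])]

stacked = sparse.vstack(rows)             # n x L sparse matrix
col_max = stacked.max(axis=0).toarray()   # maximum per position
print(col_max)                            # [[  88  287 1234   77]]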
Here's the answer for two sparse matrices: just repeat this n-1 times.
import numpy as np

def spmax(X, Y):
    # X, Y: two csr sparse matrices
    sX = X.copy(); sX.data[:] = 1      # sparsity pattern of X
    sY = Y.copy(); sY.data[:] = 1      # sparsity pattern of Y
    sXY = sX + sY; sXY.data[:] = 1     # union of the two patterns
    X = X + sXY; X.data = X.data - 1   # X values with explicit zeros on the union
    Y = Y + sXY; Y.data = Y.data - 1   # Y values with explicit zeros on the union
    maxXY = X.copy()
    maxXY.data = np.amax(np.c_[X.data, Y.data], axis=1)   # element-wise max of aligned data
    return maxXY
This is pretty slow, though. Hopefully they'll implement this in scipy.sparse at some point, as it's a pretty basic operation.
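A quick check with two small CSR matrices (made-up values). Note that recent scipy versions already provide an element-wise maximum method on sparse matrices, which should be preferred where available:
from scipy import sparse
X = sparse.csr_matrix([[0, 3, 99, 3]], dtype=float)
Y = sparse.csr_matrix([[4, 2, 1234, 0]], dtype=float)
print(spmax(X, Y).toarray())    # [[   4.    3. 1234.    3.]]
print(X.maximum(Y).toarray())   # same result via scipy's built-in method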