I have an array n×m, where n = 217000 and m = 3 (some data from telescope).
I need to calculate the distances between 2 points in 3D (according to my x, y, z coordinates in columns).
When I try to use sklearn tools the result is:
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
What tool can I use in this situation and what max possible size for this tools?
What tool can I use in this situation...?
You could implement the euclidean distance function on your own using the approach suggested by #Saksow. Assuming that a and b are one-dimensional NumPy arrays, you could also use any of the methods proposed in this thread:
import numpy as np
np.linalg.norm(a-b)
np.sqrt(np.sum((a-b)**2))
np.sqrt(np.dot(a-b, a-b))
If you wish to compute in one go the pairwise distance (not necessarily the euclidean distance) between all the points in your array, the module scipy.spatial.distance is your friend.
Demo:
In [79]: from scipy.spatial.distance import squareform, pdist
In [80]: arr = np.asarray([[0, 0, 0],
...: [1, 0, 0],
...: [0, 2, 0],
...: [0, 0, 3]], dtype='float')
...:
In [81]: squareform(pdist(arr, 'euclidean'))
Out[81]:
array([[ 0. , 1. , 2. , 3. ],
[ 1. , 0. , 2.23606798, 3.16227766],
[ 2. , 2.23606798, 0. , 3.60555128],
[ 3. , 3.16227766, 3.60555128, 0. ]])
In [82]: squareform(pdist(arr, 'cityblock'))
Out[82]:
array([[ 0., 1., 2., 3.],
[ 1., 0., 3., 4.],
[ 2., 3., 0., 5.],
[ 3., 4., 5., 0.]])
Notice that the number of points in the mock data array used in this toy example is and the resulting pairwise distance array has elements.
...and what max possible size for this tools?
If you try to apply the approach above using your data () you get an error:
In [105]: data = np.random.random(size=(217000, 3))
In [106]: squareform(pdist(data, 'euclidean'))
Traceback (most recent call last):
File "<ipython-input-106-fd273331a6fe>", line 1, in <module>
squareform(pdist(data, 'euclidean'))
File "C:\Users\CPU 2353\Anaconda2\lib\site-packages\scipy\spatial\distance.py", line 1220, in pdist
dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError
The issue is you are running out of RAM. To perform such computation you would need more than 350TB! The required amount of memory result from multiplying the number of elements of the distance matrix (2170002) by the number of bytes of each element of that matrix (8), and dividing this product by the apropriate factor (10243) to express the result in gigabytes:
In [107]: round(data.shape[0]**2 * data.dtype.itemsize / 1024.**3)
Out[107]: 350.8
So the maximum allowed size for your data is determined by the amount of available RAM (take a look at this thread for further details).
Using only Python and Euclidean distance formula for 3 dimensions:
import math
distance = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)
Related
I am trying to assign decimal values into a NumPy matrix as shown below but when I do...
matrix = np.array([[0,0,0],[0,0,0]])
rn[1,] = [13.735,34.2,3]
It seems to ignore the decimal and just assigns the integer values into the matrix?
array([[ 0, 0, 0],
[13, 34, 3]])
I couldn't find the reason why so far after some research
As per the documentation, the desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence. So since you did not specify the type when creating the array it inferred it to be an integer and not a float as all the values in the array are integers.
In [31]: import numpy as np
In [32]: matrix = np.array([[0,0,0],[0,0,0]], dtype=np.float32)
In [33]: matrix
Out[33]:
array([[0., 0., 0.],
[0., 0., 0.]], dtype=float32)
In [34]: matrix[1,] = [13.735,34.2,3]
In [35]: matrix
Out[35]:
array([[ 0. , 0. , 0. ],
[13.735, 34.2 , 3. ]], dtype=float32)
I am calculating running total of distances for all points covered as given in a day's schedule. Currently this is being carried out through looping over a numpy matrix, as per:
locSequence= sequence of points to be covered
h1 and h2 = any two consecutive points
locations = a simple matrix that contains distance between any two points
for i in range(len(locSequence)-1):
h1=locSequence[i]
h2=locSequence[i+1]
score += numpyMatrix[locations.loc[h1,'idx'],locations.loc[h2,'idx']+1]
I was wondering if there is a pythonic way of writing this code, i.e., without looping. Thanks!
I'm going to assume that locSequence is a list of integers ranging in values from 0 to len(locations), so that the distance between any two locations loc1, loc2 is equivalent to locations[loc1, loc2] = locations[loc2, loc1].
Example data:
In [ ]: locSequence = [0, 1, 2, 0] #Travel from point 0 to point 1 to point 2 to point 0.
...: # locations must be a symmetric matrix with all zero diagonal values;
...: # the distance between point x to point x is zero
...: locations = np.zeros((3,3))
...: locations[np.triu_indices(3, 1)] = np.arange(1, 4)
...: locations += locations.T
...: locations
Out[ ]:
array([[ 0., 1., 2.],
[ 1., 0., 3.],
[ 2., 3., 0.]])
Then, we simply use broadcasting to get all of the distances:
In [ ]: distances = locations[locSequence[:-1], locSequence[1:]]
...: distances
Out[ ]: array([ 1., 3., 2.])
In [ ]: distances.sum()
Out[ ]: 6.0
I'm trying to solve a problem using multidimensional arrays, rather than resorting to for loops, in order to gain a performance boost, but am having trouble with the indexing.
I've tried various permutations using np.newaxis, but can't seem to achieve the following functionality.
Problem:
Part 1) Take an M x N x N array called a, and for each of the M square matrices, set the upper triangular matrix elements as their negative values.
Part 2) Sum all elements in each of the M matrices (of shape N X N), returning a 1D array with M elements. Let's call this array b.
Attempted Solution
Here is my MWP / attempt using loops (which does work, but I'd rather find a fully array/matrix-based approach
a = np.array(
[[[ 0, 1],
[ 5, 0]],
[[ 0, 3],
[ 2, 0]]])
Part 1):
triangular_upper_idx = np.triu_indices_from(a[0])
for i in range(len(a)):
a[i][triangular_upper_idx] *= -1
a
result:
array([[[ 0, -1],
[ 5, 0]],
[[ 0, -3],
[ 2, 0]]])
Part 2):
b = np.zeros(len(a))
for i in range(len(a)):
b[i] = np.sum(a[i])
b
result:
array([ 4., -1.])
Note:
I have seen a similar question on this topic (Triangular indices for multidimensional arrays in numpy) but the solution there was nested for loops... I feel like numpy may offer a more efficient, clever array-based solution?
Any guidance would be much appreciated.
Thanks
yes numpy has the tools
r = 2
neg_uppr = np.triu(-np.ones((r,r)),1) + np.tril(np.ones((r,r)))
can't tell from your numerical example if you want the diagonal to be inverted too? Then use np.triu(-np.ones((r,r))) + np.tril(np.ones((r,r)),-1)
neg_uppr
Out[23]:
array([[ 1., -1.],
[ 1., 1.]])
a = np.array(
[[[ 0, 1],
[ 5, 0]],
[[ 0, 3],
[ 2, 0]]])
its fast to use the builtin element-wise arithmetic
a = a * neg_uppr
a
Out[26]:
array([[[ 0., -1.],
[ 5., 0.]],
[[ 0., -3.],
[ 2., 0.]]])
you can specify axes to sum over:
np.sum(a, (1,2))
Out[27]: array([ 4., -1.])
Hi I want to join multiple arrays in python, using numpy to form multidimensional arrays, it's inside of a for loop, this is a pseudocode
import numpy as np
h = np.zeros(4)
for x in range(3):
x1 = some array of length of 4 returned from a previous function (3,5,6,7)
h = np.concatenate((h,x1), axis =0)
The first iteration goes fine, but during the second iteration on the for loop I get the following error,
ValueError: all the input arrays must have same number of dimensions
The output array should look something like this
[[0,0,0,0],[3,5,6,7],[6,3,6,7]]
etc
So how can I join the arrays?
Thanks
You need to use vstack. It allows you to stack arrays. You take a sequence of arrays and stack them vertically to make a single array
import numpy as np
h = np.zeros(4)
for x in range(3):
x1 = [3,5,6,7]
h = np.vstack((h,x1))
# not h = np.concatenate((h,x1), axis =0)
print h
Output:
[[ 0. 0. 0. 0.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]]
more edits later.
If you do want to use cocatenate only, you can do the following way as well:
import numpy as np
h1 = np.zeros(4)
for x in range(3):
x1 = np.array([3,5,6,7])
h1= np.concatenate([h1,x1.T], axis =0)
print h1.shape
print h1.reshape(4,4)
Output:
(16,)
[[ 0. 0. 0. 0.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]
[ 3. 5. 6. 7.]]
Both have different applications. You can choose according to your need.
There are multiple ways of doing this. I'll list a few examples:
First, we import numpy and define a function that generates those arrays of length 4.
import numpy as np
def previous_function_returning_array_of_length_4(x):
return np.array(range(4)) + x
The first way involves creating a list of arrays, then calling numpy.array() to convert the list to a 2D array.
h0 = np.zeros(4)
arrays = [h0]
for x in range(3):
x1 = previous_function_returning_array_of_length_4(x)
arrays.append(x1)
h = np.array(arrays)
You can do the same with np.vstack():
h0 = np.zeros(4)
arrays = [h0]
for x in range(3):
x1 = previous_function_returning_array_of_length_4(x)
arrays.append(x1)
h = np.vstack(arrays)
Alternatively, if you know how many arrays you are going to create, you can create the 2D array first and fill in the values:
h = np.zeros((4, 4))
for ii in range(3):
x1 = previous_function_returning_array_of_length_4(ii)
h[ii + 1, ...] = x1
There are more ways, but hopefully, this will give you an idea of what to do.
It is best to collect values in a list, and perform the concatenate or array creation once, at the end.
h = [np.zeros(4)]
for x in range(3):
x1 = some array of length of 4 returned from a previous function (3,5,6,7)
h = h.append(x1)
h = np.array(h)
# or h = np.vstack(h)
All the concatenate/stack/array functions takes a list of multiple items. It is faster to append to a list than to do a concatenate of 2 items.
======================
Let's try your approach step by step:
In [189]: h=np.zeros(4)
In [190]: h
Out[190]: array([ 0., 0., 0., 0.]) # 1d array (4,) shape
In [191]: x1=np.array([3,5,6,7]) # another 1d
In [192]: h1=np.concatenate((h,x1),axis=0)
In [193]: h1
Out[193]: array([ 0., 0., 0., 0., 3., 5., 6., 7.])
In [194]: h1.shape
Out[194]: (8,) # also a 1d array, but with 8 items
In [195]: x1=np.array([6,3,6,7])
In [196]: h1=np.concatenate((h1,x1),axis=0)
In [197]: h1
Out[197]: array([ 0., 0., 0., 0., 3., 5., 6., 7., 6., 3., 6., 7.])
In this case I'm adding (4,) arrays one after the other, still getting a 1d array.
If I go back an create x1 as 2d (1,4):
In [198]: h=np.zeros(4)
In [199]: x1=np.array([[6,3,6,7]])
In [200]: h1=np.concatenate((h,x1),axis=0)
...
ValueError: all the input arrays must have same number of dimensions
I get this dimension error right away.
The fact that you get the error on the 2nd iteration suggests that the 1st x1 is (4,), but the 2nd is 2d.
When you have dimensions errors like this, check the shapes.
vstack adds dimensions to the inputs, as needed, so you can build 2d arrays:
In [207]: h=np.zeros(4)
In [208]: x1=np.array([3,5,6,7])
In [209]: h=np.vstack((h,x1))
In [210]: h
Out[210]:
array([[ 0., 0., 0., 0.],
[ 3., 5., 6., 7.]])
In [211]: x1=np.array([6,3,6,7])
In [212]: h=np.vstack((h,x1))
In [213]: h
Out[213]:
array([[ 0., 0., 0., 0.],
[ 3., 5., 6., 7.],
[ 6., 3., 6., 7.]])
I have a large scipy sparse symmetric matrix which I need to condense by taking the sum of blocks to make a new smaller matrix.
For example, for a 4x4 sparse matrix A I will like to make a 2x2 matrix B in which B[i,j] = sum(A[i:i+2,j:j+2]).
Currently, I just go block by block to recreate the condensed matrix but this is slow. Any ideas on how to optimize this?
Update: Here is an example code that works fine, but is slow for a sparse matrix of 50.000x50.000 that I want to condense in a 10.000x10.000:
>>> A = (rand(4,4)<0.3)*rand(4,4)
>>> A = scipy.sparse.lil_matrix(A + A.T) # make the matrix symmetric
>>> B = scipy.sparse.lil_matrix((2,2))
>>> for i in range(B.shape[0]):
... for j in range(B.shape[0]):
... B[i,j] = A[i:i+2,j:j+2].sum()
First of all, lil matrix for the one your are summing up is probably really bad, I would try COO or maybe CSR/CSS (I don't know which will be better, but lil is probably inherently slower for many of these operations, even the slicing might be much slower, though I did not test). (Unless you know that for example dia fits perfectly)
Based on COO I could imagine doing some tricking around. Since COO has row and col arrays to give the exact positions:
matrix = A.tocoo()
new_row = matrix.row // 5
new_col = matrix.col // 5
bin = (matrix.shape[0] // 5) * new_col + new_row
# Now do a little dance because this is sparse,
# and most of the possible bin should not be in new_row/new_col
# also need to group the bins:
unique, bin = np.unique(bin, return_inverse=True)
sum = np.bincount(bin, weights=matrix.data)
new_col = unique // (matrix.shape[0] // 5)
new_row = unique - new_col * (matrix.shape[0] // 5)
result = scipy.sparse.coo_matrix((sum, (new_row, new_col)))
(I won't guarantee that I didn't confuse row and column somewhere and this only works for square matrices...)
Given a square matrix of size N and a split size of d (so matrix will be partitioned into N/d * N/d sub-matrices of size d), could you use numpy.split a couple times to build a collection of those sub-matrices, sum each of them, and put them back together?
This should be treated more as pseudocode than an efficient implementation, but it expresses my idea:
def chunk(matrix, size):
row_wise = []
for hchunk in np.split(matrix, size):
row_wise.append(np.split(hchunk, size, 1))
return row_wise
def sum_chunks(chunks):
sum_rows = []
for row in chunks:
sum_rows.append([np.sum(col) for col in row])
return np.array(sum_rows)
Or more compactly as
def sum_in_place(matrix, size):
return np.array([[np.sum(vchunk) for vchunk in np.split(hchunk, size, 1)]
for hchunk in np.split(matrix, size)])
This gives you something like the following:
In [16]: a
Out[16]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [17]: chunk.sum_in_place(a, 2)
Out[17]:
array([[10, 18],
[42, 50]])
For a 4x4 example you can do the following:
In [43]: a = np.arange(16.).reshape((4, 4))
In [44]: a
Out[44]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[ 12., 13., 14., 15.]])
In [45]: u = np.array([a[:2, :2], a[:2, 2:], a[2:,:2], a[2:, 2:]])
In [46]: u
Out[46]:
array([[[ 0., 1.],
[ 4., 5.]],
[[ 2., 3.],
[ 6., 7.]],
[[ 8., 9.],
[ 12., 13.]],
[[ 10., 11.],
[ 14., 15.]]])
In [47]: u.sum(1).sum(1).reshape(2, 2)
Out[47]:
array([[ 10., 18.],
[ 42., 50.]])
Using something like itertools it should be possible to automate and generalise an expression for u.