Avoid memory errors when integrating large array with Numpy - python

I have a 601x350x200x146 numpy float64 array which, according to my calculations, takes about 22.3 GB of memory. The output of free -m tells me I have about 100 GB of free memory, so it fits fine. However, when integrating with
result = np.trapz(large_arr, axis=3)
I get a memory error. I understand that this is because of the intermediate arrays that numpy.trapz has to create to perform the integration. But I'm looking to see if there's a way around it, or at least a way to minimize the extra use of memory.
I have read about memory errors and I know of things to avoid this: one is placing a gc.collect() call before the integration. I tried this and it didn't work.
The other one is using the *= operators such as writing arr*=a instead of arr=arr*a, which I can't really do here. So I don't know what else to try.
Does anyone know of a way to do this operation without raising a memory error?
You can reproduce the error with:
arr = np.ones((601,350,200,146), dtype=np.float64)
arr=np.trapz(arr, axis=3)
although you'll have to scale down the size to match your memory size.

numpy.trapz provides some convenience, but the actual calculation is very simple. To avoid large temporary arrays, just implement it yourself:
In [37]: x.shape
Out[37]: (2, 4, 4, 10)
Here's the result of numpy.trapz(x, axis=3):
In [38]: np.trapz(x, axis=3)
Out[38]:
array([[[ 43. , 48.5, 46.5, 67. ],
[ 35.5, 39.5, 52.5, 35. ],
[ 44.5, 47.5, 34.5, 39.5],
[ 54. , 40. , 46.5, 50.5]],
[[ 42. , 60. , 55.5, 51. ],
[ 51.5, 40. , 52. , 42.5],
[ 48.5, 43. , 32. , 36.5],
[ 42.5, 38. , 38. , 45. ]]])
Here's the calculation written to use no large intermediate arrays. (The slice x[:,:,:,1:-1] does not copy the data associated with the array.)
In [48]: 0.5*(x[:,:,:,0] + 2*x[:,:,:,1:-1].sum(axis=3) + x[:,:,:,-1])
Out[48]:
array([[[ 43. , 48.5, 46.5, 67. ],
[ 35.5, 39.5, 52.5, 35. ],
[ 44.5, 47.5, 34.5, 39.5],
[ 54. , 40. , 46.5, 50.5]],
[[ 42. , 60. , 55.5, 51. ],
[ 51.5, 40. , 52. , 42.5],
[ 48.5, 43. , 32. , 36.5],
[ 42.5, 38. , 38. , 45. ]]])
If x has shape (m, n, p, q), the few temporary arrays that are generated in that expression all have shape (m, n, p).
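The expression above can be wrapped in a small helper. This is a minimal sketch, assuming uniformly spaced samples (as with the default dx=1.0 of numpy.trapz); the name trapz_last_axis and the dx parameter are illustrative, not part of NumPy.

```python
import numpy as np

def trapz_last_axis(y, dx=1.0):
    """Trapezoidal rule along the last axis for uniformly spaced samples."""
    # y[..., 1:-1] is a view, so the only new arrays have the reduced shape.
    interior = y[..., 1:-1].sum(axis=-1)
    return dx * (0.5 * (y[..., 0] + y[..., -1]) + interior)

# Small stand-in for the large array from the question.
rng = np.random.default_rng(0)
x = rng.integers(0, 10, size=(2, 4, 4, 10)).astype(np.float64)
result = trapz_last_axis(x)
print(result.shape)
```

For the 601x350x200x146 case, the temporaries would each have shape (601, 350, 200), i.e. 1/146 of the input size.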

Related

Combine array of indices with array of values

I have an array in the following form where the first two columns are supposed to be indices of a 2-dimensional array and the following columns are arbitrary values.
data = np.array([[ 0. , 1. , 48. , 4. ],
[ 1. , 2. , 44. , 4.4],
[ 1. , 1. , 34. , 2.3],
[ 0. , 2. , 55. , 2.2],
[ 0. , 0. , 42. , 2. ],
[ 1. , 0. , 22. , 1. ]])
How do I combine the indices data[:,:2] with their values data[:,2:] such that the resulting array is accessible by the indices in the first two columns.
In my example that would be:
result = np.array([[[42. , 2. ], [48. , 4. ], [55. , 2.2]],
[[22. , 1. ], [34. , 2.3], [44. , 4.4]]])
I know that there is a trivial solution using python loops. But performance is a concern since I'm dealing with a huge amount of data. Specifically it's output of another program that I need to process.
Maybe there is a relatively trivial numpy solution as well. But I'm kind of stuck.
If it helps the following can be safely assumed:
All numbers in the first two columns are whole numbers (although the array consists of floats).
Every possible index (or rather combinations of indices) in the original array is used exactly once. I.e. there is guaranteed to be exactly one entry of the form [i, j, ...].
The indices start at 0 and I know the highest indices beforehand.
Edit:
Hmm. I see now how my example is misleading. The truth is that some of my input arrays are sorted, but that's unreliable. So I shouldn't assume anything about the order. I reordered some rows in my example to make it clearer. In case anyone wants to make sense of the answer and comment below: In my original question the array appeared to be sorted by the first two columns.
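Find the number of rows, columns, and the depth from your data array, then fill the output as below. Note that the reshape relies on the rows already being sorted by the first two columns: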
import numpy as np
data = np.array([[ 0. , 0. , 42. , 2. ],
[ 0. , 1. , 48. , 4. ],
[ 0. , 2. , 55. , 2.2],
[ 1. , 0. , 22. , 1. ],
[ 1. , 1. , 34. , 2.3],
[ 1. , 2. , 44. , 4.4]])
row = int(max(data[:,0]))+1
col = int(max(data[:,1]))+1
depth = len(data[0, 2:])
out = np.zeros([row, col, depth])
out = data[:, 2:].reshape(row,col,depth)
print(out)
Output:
[[[42. 2. ]
[48. 4. ]
[55. 2.2]]
[[22. 1. ]
[34. 2.3]
[44. 4.4]]]
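If the rows are not sorted, as the question's edit states, the index columns can be used directly with NumPy integer-array indexing instead of a reshape. The following is a minimal sketch of that alternative (it is not the answer's method), using the unsorted example data from the question; the variable names rows and cols are illustrative.

```python
import numpy as np

data = np.array([[0., 1., 48., 4. ],
                 [1., 2., 44., 4.4],
                 [1., 1., 34., 2.3],
                 [0., 2., 55., 2.2],
                 [0., 0., 42., 2. ],
                 [1., 0., 22., 1. ]])

# The first two columns hold whole numbers stored as floats.
rows = data[:, 0].astype(int)
cols = data[:, 1].astype(int)

# Every (row, col) pair occurs exactly once, so np.empty is fully overwritten.
out = np.empty((rows.max() + 1, cols.max() + 1, data.shape[1] - 2))
out[rows, cols] = data[:, 2:]
print(out)
```

Because the assignment scatters each row to its own cell, the order of the input rows does not matter.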
You can use numba in no-python parallel mode with explicit loops (numba is designed to accelerate Python loops). As szczesny mentioned in the comments, this is among the most efficient methods in terms of performance, and it does not require sorting. The code below is written for exactly two value columns; it can be modified to handle a variable number:
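import numba as nb
import numpy as np

# without signature --> @nb.njit(parallel=True)
@nb.njit("float64[:, :, ::1](float64[:, ::1])", parallel=True)
def numba_(data):
    # integer copy of the two index columns (int64 avoids overflow for indices above 127)
    data_ = data[:, :2].astype(np.int64)
    res = np.empty((data_[:, 0].max() + 1, data_[:, 1].max() + 1, 2))
    # every (i, j) pair occurs once, so parallel iterations never write to the same cell
    for i in nb.prange(data_.shape[0]):
        res[data_[i, 0], data_[i, 1], 0] = data[i, 2]
        res[data_[i, 0], data_[i, 1], 1] = data[i, 3]
    return res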
[Benchmark plot omitted: comparison with the proposed NumPy code, without the sorting step and with that code corrected; horizontal axis: data.shape[0].]
A more general version that handles more than two value columns:
@nb.njit("float64[:, :, ::1](float64[:, ::1])", parallel=True)
def numba_(data):
    # integer copy of the two index columns (int64 avoids overflow for indices above 127)
    data_ = data[:, :2].astype(np.int64)
    assert data_.shape[0] == data.shape[0]
    # number of value columns after the two index columns
    depth = data[:, 2:].shape[1]
    res = np.empty((data_[:, 0].max() + 1, data_[:, 1].max() + 1, depth))
    for i in nb.prange(data_.shape[0]):
        for j in range(depth):
            res[data_[i, 0], data_[i, 1], j] = data[i, j + 2]
    return res

How to set bounds when minimizing using scipy

I have some data in a numpy array.
I would like to scale the data using a linear function according to the following rules:
The mean is as close to 65 as possible
The smallest value is at least 50
For my first attempt I made a scoring function:
import numpy as np
from scipy.optimize import minimize
def score(x):
    return abs(np.mean(x[0]*data+x[1]) - 65) + abs(x[0]*np.min(data)+x[1] - 50)
I have added on abs(x[0]*np.min(data)+x[1] - 50) as a vain attempt to get it to satisfy rule 2.
I then tried:
x0 = [0.85,0]
res = minimize(score,x0)
np.set_printoptions(suppress=True)
print(res)
This gives:
fun: 4.8516444911893615
hess_inv: array([[ 0.0047, -0.1532],
[-0.1532, 5.2375]])
jac: array([-50.9628, -2. ])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 580
nit: 2
njev: 142
status: 2
success: False
x: array([0.7408, 1.4407])
In other words the optimization failed.
I would also like to set bounds for the coefficients, e.g. bounds = [(0.7,1.3),(-5,5)].
My question is, what is the correct way to run the optimization with the boundary condition that the scaled smallest value is at least 50? Also, how can I make it so that the optimization runs without failure?
Consider the following:
import numpy as np
from scipy.optimize import minimize
data = np.array([ 59. , 59.5, 61. , 61.5, 62.5, 63. , 63. , 65.5, 66.5,
67. , 68. , 69. , 69.5, 70.5, 70.5, 70.5, 71. , 72. ,
72. , 73.5, 73.5, 74. , 75. , 75.5, 78. , 79. , 79. ,
79. , 79.5, 80.5, 80.5, 80.5, 80.5, 80.5, 82.5, 82.5,
82.5, 83. , 83. , 83. , 83. , 83. , 83.5, 83.5, 84. ,
84.5, 84.5, 84.5, 86. , 86. , 86. , 86.5, 86.5, 87.5,
88. , 88. , 88.5, 89. , 90. , 90.5, 90.5, 90.5, 91. ,
91.5, 91.5, 92. , 92. , 93. , 93. , 93. , 93.5, 93.5,
94. , 94. , 94. , 94. , 94. , 94. , 94.5, 94.5, 94.5,
94.5, 95.5, 95.5, 95.5, 95.5, 95.5, 95.5, 96. , 96. ,
96. , 96.5, 96.5, 96.5, 98. , 98. , 98. , 98. , 98. ,
98. , 98. , 98. , 98.5, 98.5, 98.5, 98.5, 98.5, 100. ,
100. , 100. , 100. ])
def scale(data, coeffs):
    m, b = coeffs
    return (m * data) + b
def score(coeffs):
    scaled = scale(data, coeffs)
    # Penalty components
    p_1 = abs(np.mean(scaled) - 65)
    p_2 = max(0, (50 - np.min(scaled)))
    return p_1 + p_2
res = minimize(score, (0.85, 0.0), method = 'Powell')
#np.set_printoptions(suppress=True)
print(res)
post = scale(data, res.x)
print(np.mean(post))
print(np.min(post))
print(score(res.x))
Outputs:
direc: array([[ -3.05475495e-02, 2.62047576e+00],
[ 7.54828106e-07, -6.47892698e-05]])
fun: 1.4210854715202004e-14
message: 'Optimization terminated successfully.'
nfev: 360
nit: 8
status: 0
success: True
x: array([ 0.55914442, 17.02691959])
print(np.mean(post)) # 65.0
print(np.min(post)) # 50.0164406291
print(score(res.x)) # 1.42108547152e-14
A few things:
I added a scale helper function to clean up the code a bit, since I use it in the score function as well as at the end to show the scaled data.
The score function was fixed and broken out into two separate penalties (one for each requirement) for clarity. It computes the scaled vector once (and calls it scaled), then computes the penalty components.
Note: This score function has an odd non-smooth area around min(data) = 50 because of the max call. This may cause issues with some optimization methods.
I used the Powell algorithm because I had used it before and it worked in a similar problem with using a min/max operator. Wikipedia says:
The method is useful for calculating the local minimum of a continuous but complex function, especially one without an underlying mathematical definition, because it is not necessary to take derivatives
Someone more familiar with the optimization methods may be able to suggest a better alternative.
(Edit) Lastly, with respect to your question about boundary conditions. Usually, when we talk about boundary conditions we're talking about the boundary of the independent variable, the vector we're optimizing (here, elements of coeffs or x) -- for example, "x[0] must be less than 0", or "x[1] must be between 0 and 1" -- not what you seem to be looking for.
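For completeness, scipy.optimize.minimize does accept both coefficient bounds (the bounds argument) and constraints on derived quantities (the constraints argument). The following is a minimal sketch, not the approach used above: it swaps the Powell method for SLSQP, replaces the abs penalty with a smooth squared term, and uses a short hypothetical data array. The bounds are the ones from the question.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sample data standing in for the full array above.
data = np.array([59.0, 65.0, 72.0, 80.0, 88.0, 95.0, 100.0])

def objective(coeffs):
    m, b = coeffs
    # Smooth stand-in for abs(mean - 65): squared distance of the scaled mean from 65.
    return (np.mean(m * data + b) - 65.0) ** 2

# "ineq" constraints must evaluate to >= 0. Since m is bounded below by 0.7 > 0,
# the smallest scaled value is m * min(data) + b.
min_constraint = {"type": "ineq",
                  "fun": lambda c: c[0] * data.min() + c[1] - 50.0}

bounds = [(0.7, 1.3), (-5.0, 5.0)]  # bounds on m and b, taken from the question
res = minimize(objective, x0=(0.85, 0.0), method="SLSQP",
               bounds=bounds, constraints=[min_constraint])

scaled = res.x[0] * data + res.x[1]
print(res.x, scaled.mean(), scaled.min())
```

With bounds this tight, the mean may not reach exactly 65; the solver then returns the feasible point whose mean is closest to it while keeping the minimum at or above 50.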
Sorry if I'm misunderstanding you, but just scaling the data according to those 2 rules is straightforward linear algebra:
e = np.mean(data)
m = e - np.min(data)
data * (65-50)/m + (65 - e*(65-50)/m)
# i.e. (data-e) * (65-50)/m + 65
This has exactly mean 65 and minimum 50.
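The closed-form scaling can be checked numerically. This is a small sketch with a hypothetical helper name (rescale) and a short made-up data array; it assumes the data are not all equal, since the formula divides by the distance between the mean and the minimum.

```python
import numpy as np

def rescale(data, target_mean=65.0, target_min=50.0):
    """Linear map sending the mean to target_mean and the minimum to target_min."""
    e = data.mean()
    spread = e - data.min()  # nonzero unless all values are equal
    return (data - e) * (target_mean - target_min) / spread + target_mean

# Hypothetical sample data.
data = np.array([59.0, 65.0, 72.0, 80.0, 88.0, 95.0, 100.0])
post = rescale(data)
print(post.mean(), post.min())
```

The slope (65 - 50) / (mean - min) is positive, so the ordering of the values is preserved and the original minimum is the one mapped to 50.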

Building histogram from a dict without having to iterate over the keys

I have a dict containing numpy arrays of varying length:
MyDict= {0:array([[ 15. , 3.89678216],
[ 36. , 9.49245167],
[ 53. , 3.82997799],
[ 83. , 5.25727272],
[ 86. , 8.76663208]]),
1:array([[ 4. , 4.1171155 ],
[ 16. , 12.68122196],
[ 31. , 8.64805222],
[ 37. , 6.07202959]]),
2:array([]),...,
90:array([[ 1. , 1. ],
[ 24. , 8.14221573],
[ 27. , 7.36309862]])}
I would like to obtain a histogram of all the values in the dict. The solution I have now is to iterate over the keys in the dict and fill a numpy array with a histogram of fixed length:
for KeysElements in MyDict.keys():
    hist,bins = numpy.histogram(np.asarray(MyDict[KeysElements])[:,1],50)
    numpy_hist[KeysElements,:] = hist
I then sum all the histograms over the first dimension of the numpy array to obtain the histogram of all the keys of the initial dict:
Total_hist = numpy.sum(numpy_hist,axis=0)
The problem with this solution is that I do not know how to handle the bins, which change on each iteration. So my question is: is there any way to achieve this without having to build histograms in a loop?
Thanks for any advice or links.
Greg
You don't seem to use the MyDict index values or the 0th values in the 2nd axis of your np arrays. If this is the case then you could add all the numpy arrays together and do the histogram on that
import numpy as np
MyDict = {0:np.array([[ 15. , 3.89678216],
[ 36. , 9.49245167],
[ 53. , 3.82997799],
[ 83. , 5.25727272],
[ 86. , 8.76663208]]),
1:np.array([[ 4. , 4.1171155 ],
[ 16. , 12.68122196],
[ 31. , 8.64805222],
[ 37. , 6.07202959]]),
2:np.array([]),
90:np.array([[ 1. , 1. ],
[ 24. , 8.14221573],
[ 27. , 7.36309862]])}
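np_array = np.array([]).reshape(0,2)
for i in MyDict:
    a = MyDict[i]
    # skip empty or malformed entries
    if len(a.shape) == 2 and a.shape[1] == 2:
        np_array = np.append(np_array, MyDict[i], axis=0)
# histogram only the value column (index 1), as in the question
print(np.histogram(np_array[:, 1], 50))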
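A variation on the same idea, sketched here under the assumption that only the second column is of interest: collect the value columns with a single np.concatenate call rather than repeated np.append, and reuse one set of bin edges if per-key histograms are still needed. The dict below is a shortened, hypothetical version of the one above, and the names values and per_key are illustrative.

```python
import numpy as np

my_dict = {
    0: np.array([[15., 3.89678216], [36., 9.49245167], [53., 3.82997799]]),
    1: np.array([[4., 4.1171155], [16., 12.68122196]]),
    2: np.array([]),
    90: np.array([[1., 1.], [24., 8.14221573]]),
}

# Keep only well-formed (k, 2) arrays; empty entries are skipped.
valid = [a for a in my_dict.values() if a.ndim == 2 and a.shape[1] == 2]

# Stack the value columns once and histogram them in a single call.
values = np.concatenate([a[:, 1] for a in valid])
total_hist, bin_edges = np.histogram(values, bins=50)

# Per-key histograms can share the same edges, so their sum is well defined.
per_key = np.array([np.histogram(a[:, 1], bins=bin_edges)[0] for a in valid])
print(total_hist.sum(), per_key.shape)
```

Passing the same bin_edges array to every call addresses the problem of bins changing from one iteration to the next.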

"m x n" dimensional gradient-style array in Python

I checked out "gradient descent using python and numpy", but it didn't solve my problem.
I'm trying to get familiar with image-processing and I want to generate a few test arrays to mess around with in Python.
Is there a method (like np.arange) to create a m x n array where the inner entries form some type of gradient?
I did an example of a naive method for generating the desired output.
Excuse my loose use of the term gradient; I'm using it in its simple meaning of a smooth transition in color.
#!/usr/bin/python
import numpy as np
import matplotlib.pyplot as plt
#Set up parameters
m = 15
n = 10
A_placeholder = np.zeros((m,n))
V_m = np.arange(0,m).astype(np.float32)
V_n = np.arange(0,n).astype(np.float32)
#Iterate through combinations
for i in range(m):
    m_i = V_m[i]
    for j in range(n):
        n_j = V_n[j]
        A_placeholder[i,j] = m_i * n_j #Some combination
#Relabel
A_gradient = A_placeholder
A_placeholder = None
#Print data
print(A_gradient)
#[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18.]
[ 0. 3. 6. 9. 12. 15. 18. 21. 24. 27.]
[ 0. 4. 8. 12. 16. 20. 24. 28. 32. 36.]
[ 0. 5. 10. 15. 20. 25. 30. 35. 40. 45.]
[ 0. 6. 12. 18. 24. 30. 36. 42. 48. 54.]
[ 0. 7. 14. 21. 28. 35. 42. 49. 56. 63.]
[ 0. 8. 16. 24. 32. 40. 48. 56. 64. 72.]
[ 0. 9. 18. 27. 36. 45. 54. 63. 72. 81.]
[ 0. 10. 20. 30. 40. 50. 60. 70. 80. 90.]
[ 0. 11. 22. 33. 44. 55. 66. 77. 88. 99.]
[ 0. 12. 24. 36. 48. 60. 72. 84. 96. 108.]
[ 0. 13. 26. 39. 52. 65. 78. 91. 104. 117.]
[ 0. 14. 28. 42. 56. 70. 84. 98. 112. 126.]]
#Show Image
plt.imshow(A_gradient)
plt.show()
I've tried np.gradient but it didn't give me the desired output.
#print np.gradient(np.array([V_m,V_n]))
#Traceback (most recent call last):
# File "Untitled.py", line 19, in <module>
# print np.gradient(np.array([V_m,V_n]))
# File "/Users/Mu/anaconda/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1458, in gradient
# out[slice1] = (y[slice2] - y[slice3])
#ValueError: operands could not be broadcast together with shapes (10,) (15,)
A_placeholder[i,j] = m_i * n_j
Any operation like that can be expressed in numpy using broadcasting
A = np.arange(m)[:, None] * np.arange(n)[None, :]
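To spell out how the broadcasting works: a column of shape (m, 1) combined with a row of shape (1, n) yields an (m, n) result, with each entry computed from the matching row and column values. A short sketch using the m and n from the question; the names rows, cols, A_product and A_sum are illustrative.

```python
import numpy as np

m, n = 15, 10
rows = np.arange(m, dtype=np.float32)[:, None]  # shape (m, 1)
cols = np.arange(n, dtype=np.float32)[None, :]  # shape (1, n)

# Any elementwise combination broadcasts to shape (m, n).
A_product = rows * cols   # same values as the nested loop in the question
A_sum = rows + cols       # a diagonal gradient instead

# For the product specifically, np.outer gives the same values.
A_outer = np.outer(np.arange(m), np.arange(n))
print(A_product.shape, A_sum.shape)
```

Replacing the multiplication with another elementwise operation changes the kind of gradient without any explicit loop.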

Getting all points of a given connected component rapidly

Scikit-Image has quite a few methods available for blob detection:
Laplacian of Gaussian (LoG)
Difference of Gaussian (DoG)
Determinant of Hessian (DoH)
All three return an array that contains a single point within the bounds of the found components:
>>> from skimage import data, feature
>>> img = data.coins()
>>> feature.blob_doh(img)
array([[ 121. , 271. , 30. ],
[ 123. , 44. , 23.55555556],
[ 123. , 205. , 20.33333333],
[ 124. , 336. , 20.33333333],
[ 126. , 101. , 20.33333333],
[ 126. , 153. , 20.33333333],
[ 156. , 302. , 30. ],
[ 185. , 348. , 30. ],
[ 192. , 212. , 23.55555556],
[ 193. , 275. , 23.55555556],
[ 195. , 100. , 23.55555556],
[ 197. , 44. , 20.33333333],
[ 197. , 153. , 20.33333333],
[ 260. , 173. , 30. ],
[ 262. , 243. , 23.55555556],
[ 265. , 113. , 23.55555556],
[ 270. , 363. , 30. ]])
I'd like to use that information to produce lists that contain the coordinates of all the points in a given component.
I could just iterate through the whole image myself, starting with the seeds, and collect all the points in a dict with the key being the point provided by blob detection, but I imagine it would be rather slow unless I'm using Cython (more than willing to be wrong about this, as I'm fairly new to Python). More truthfully, I simply think there is probably a better way than just doing it myself.
