Slicing without views (or: shuffling multiple arrays) - python

I have two different numpy arrays and I would like to shuffle them in a synchronized way.
The current solution is taken from https://www.tensorflow.org/versions/r0.8/tutorials/mnist/pros/index.html and proceeds as follows:
perm = np.arange(self.no_images_train)
np.random.shuffle(perm)
self.images_train = self.images_train[perm]
self.labels_train = self.labels_train[perm]
The problem is that it doubles memory each time I do it. Somehow the old arrays are not getting deleted, probably because the slicing operator creates views, I guess. Out of pure desperation, I tried the following change:
perm = np.arange(self.no_images_train)
np.random.shuffle(perm)
n_images_train = self.images_train[perm]
n_labels_train = self.labels_train[perm]
del self.images_train
del self.labels_train
gc.collect()
self.images_train = n_images_train
self.labels_train = n_labels_train
Still the same: memory leaks, and I run out of memory after a couple of operations.
By the way, the two arrays have shapes (100000, 224, 244, 1) and (100000, 1).
I know that this has been dealt with here (Better way to shuffle two numpy arrays in unison), but the answer didn't help me, as the provided solution needs slicing again.
Thanks for any help.

One way to permute two large arrays in-place in a synchronized way is to save the state of the random number generator and then shuffle the first array. Then restore the state and shuffle the second array.
For example, here are my two arrays:
In [48]: a
Out[48]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
In [49]: b
Out[49]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
Save the current internal state of the random number generator:
In [50]: state = np.random.get_state()
Shuffle a in-place:
In [51]: np.random.shuffle(a)
Restore the internal state of the random number generator:
In [52]: np.random.set_state(state)
Shuffle b in-place:
In [53]: np.random.shuffle(b)
Check that the permutations are the same:
In [54]: a
Out[54]: array([13, 12, 11, 15, 10, 5, 1, 6, 14, 3, 9, 7, 0, 8, 4, 2])
In [55]: b
Out[55]: array([13, 12, 11, 15, 10, 5, 1, 6, 14, 3, 9, 7, 0, 8, 4, 2])
For your code, this would look like:
state = np.random.get_state()
np.random.shuffle(self.images_train)
np.random.set_state(state)
np.random.shuffle(self.labels_train)
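If you shuffle like this in several places, the state handling can be wrapped in a small helper. This is just a sketch (the name shuffle_in_unison is mine, not from numpy), and it assumes all arrays have the same length along axis 0, which is what makes the permutations match:
import numpy as np
def shuffle_in_unison(*arrays):
    # Restoring the same RNG state before each shuffle yields the same
    # permutation, provided all arrays have equal length along axis 0.
    state = np.random.get_state()
    for arr in arrays:
        np.random.set_state(state)
        np.random.shuffle(arr)
shuffle_in_unison(self.images_train, self.labels_train)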

Actually, I don't think there is any issue with numpy or Python. Numpy uses the system malloc/free to allocate arrays, and this can lead to memory fragmentation (see Memory Fragmentation on SO).
So I suspect that your memory profile may keep increasing and then suddenly drop when the system manages to reduce fragmentation, if it can.
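If repeated large allocations are indeed the problem, one way to sidestep them is to reuse a single preallocated buffer. A sketch, assuming images_train keeps a fixed shape across shuffles (np.take accepts an out parameter):
buf = np.empty_like(images_train)
perm = np.random.permutation(len(images_train))
np.take(images_train, perm, axis=0, out=buf)  # permute rows into the buffer
images_train, buf = buf, images_train  # swap references; reuse buf next time
This keeps peak usage at two buffers instead of growing with every shuffle.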

Related

Converting an array of time increments to an array of instants

If I have an array of time increments, for example:
intervals = np.random.normal(loc=1,scale=0.1,size=100)
one possible way to create the corresponding array of time instants is to create a list and manually make the sum:
Sum = 0.
instants = []
for k in range(len(intervals)):
    Sum += intervals[k]
    instants.append(Sum)
instants = np.array(instants)
So I have just switched from an array of dt(i) to an array of t(i).
But Python usually offers elegant alternatives to for loops. Is there a better way to do it?
What you here describe is the cumulative sum. You can calculate this with .cumsum() [numpy-doc]:
intervals.cumsum()
For example:
>>> intervals
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> intervals.cumsum()
array([ 0, 1, 3, 6, 10, 15, 21, 28, 36, 45])
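Applied to the original example, the whole loop collapses to a single call. A quick sketch using np.cumsum, the function form of the same method:
intervals = np.random.normal(loc=1, scale=0.1, size=100)
instants = np.cumsum(intervals)  # instants[k] == intervals[:k+1].sum()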

How can I find the final cumulative sum across numpy axis? [duplicate]

This question already has answers here:
How to calculate the sum of all columns of a 2D numpy array (efficiently)
(6 answers)
Closed 4 years ago.
I have a numpy array
np.array(data).shape
(50,50)
Now, I want to find the cumulative sums across axis=1. The problem is cumsum creates an array of cumulative sums, but I just care about the final value of every row.
This is incorrect of course:
np.cumsum(data, axis=1)[-1]
Is there a succinct way of doing this without looping through the array?
You are almost there, but as you have it now, you are selecting just the final row. What you need is to select all rows from the last column, so your indexing at the end should be: [:,-1].
Example:
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
>>> a.cumsum(axis=1)[:,-1]
array([ 10, 35, 60, 85, 110])
Note, I'm leaving this up as I think it explains what was going wrong with your attempt, but admittedly, there are more effective ways of doing this in the other answers!
The final cumulative sum of every row is in fact simply the sum of every row, or the row-wise sum, so we can implement this as:
>>> x.sum(axis=1)
array([ 10, 35, 60, 85, 110])
So for every row, we calculate the sum over all the columns. We thus never need to materialize the intermediate sums: numpy will still use an accumulator internally, but the intermediate values are never emitted into an array.
You can use numpy.ufunc.reduce if you don't need the intermediary accumulated results of any ufunc.
>>> a = np.arange(9).reshape(3,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> np.add.reduce(a, axis=1)
array([ 3, 12, 21])
However, in the case of sum, Willem's answer is clearly superior and to be preferred. Just keep in mind that in the general case, there's ufunc.reduce.
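For instance, the same mechanism gives a row-wise maximum (a quick illustration with the array a from above):
>>> np.maximum.reduce(a, axis=1)
array([2, 5, 8])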

Function reads np.array - produces the mean for k nn to number p in np.array

I need to define a function which reads a numpy array and produces the mean of the k nearest points to a number p in the array.
Example:
array = np.array([1, 2, 3, 4, 5, 6, 7, 50, 24, 32, 9, 11, 12, 10])
p = 15  # note: p is not necessarily in the array; I need the number closest to p, or p itself
k = 3
In this case, I would need to generate the mean of [11, 12, 10], as they are closest to p = 15.
With the above numbers, I need the mean of the k points closest to p, where p may or may not itself appear in the array.
I am new and very confused at this point and feel I have exhausted my resources. I feel this question has been asked before but the answers are much too complex for what I need.
Thanks in advance.
Given a (1d) array arr and scalar input p, here's how you could find the mean of the n nearest values:
def neighbor_mean(arr, p, n=3):
    idx = np.abs(arr - p).argsort()[:n]
    return arr[idx].mean()
arr = np.array([1, 2, 3, 4, 5, 6, 7, 50, 24, 32, 9, 11, 12, 10])
neighbor_mean(arr, p=15)
# 11.0
In the above, first you take the absolute differences:
np.abs(arr - 15)
# array([14, 13, 12, 11, 10, 9, 8, 35, 9, 17, 6, 4, 3, 5])
Then argsort() returns the indices that would sort an array. We're interested in the indices of the n smallest absolute differences, which is what you're really after, rather than the sorted differences themselves.
np.abs(arr - p).argsort()[:3]
# array([12, 11, 13])
Lastly you want to index your input array arr and take the mean of this:
arr[[12, 11, 13]]
# array([12, 11, 10]) # mean: 11.0
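If the array is large and you only ever need the n nearest values, a partial sort is cheaper than the full argsort. A sketch (the name neighbor_mean_fast is hypothetical):
def neighbor_mean_fast(arr, p, n=3):
    # np.argpartition only guarantees the n smallest differences end up
    # in the first n positions, which avoids sorting the whole array
    idx = np.argpartition(np.abs(arr - p), n)[:n]
    return arr[idx].mean()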

Python with numpy: How to delete an element from each row of a 2-D array according to a specific index

Say I have a 2-D numpy array A of size 20 x 10.
I also have an array of length 20, del_ind.
I want to delete an element from each row of A according to del_ind, to get a resultant array of size 20 x 9.
How can I do this?
I looked into np.delete with a specified axis = 1, but this only deletes element from the same position for each row.
Thanks for the help
You will probably have to build a new array.
Fortunately you can avoid python loops for this task, using fancy indexing:
h, w = 20, 10
A = np.arange(h*w).reshape(h, w)
del_ind = np.random.randint(0, w, size=h)
mask = np.ones((h,w), dtype=bool)
mask[range(h), del_ind] = False
A_ = A[mask].reshape(h, w-1)
Demo with a smaller dataset:
>>> h, w = 5, 4
>>> %paste
A = np.arange(h*w).reshape(h, w)
del_ind = np.random.randint(0, w, size=h)
mask = np.ones((h,w), dtype=bool)
mask[range(h), del_ind] = False
A_ = A[mask].reshape(h, w-1)
## -- End pasted text --
>>> A
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
>>> del_ind
array([2, 2, 1, 1, 0])
>>> A_
array([[ 0,  1,  3],
       [ 4,  5,  7],
       [ 8, 10, 11],
       [12, 14, 15],
       [17, 18, 19]])
Numpy isn't designed for in-place edits; it's mainly intended for statically sized arrays. For that reason, I'd recommend doing this by copying the intended elements to a new array.
Assuming that it's sufficient to delete one column from every row:
def remove_indices(arr, indices):
    # preserve the input dtype; np.empty would otherwise default to float64
    result = np.empty((arr.shape[0], arr.shape[1] - 1), dtype=arr.dtype)
    for i, (delete_index, row) in enumerate(zip(indices, arr)):
        result[i] = np.delete(row, delete_index)
    return result
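A quick usage check, assuming the dtype-preserving version above:
>>> A = np.arange(12).reshape(3, 4)
>>> remove_indices(A, [1, 0, 3])
array([[ 0,  2,  3],
       [ 5,  6,  7],
       [ 8,  9, 10]])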

calculate histogram peaks in python

In Python, how do I calculate the peaks of a histogram?
I tried this:
import numpy as np
from scipy.signal import argrelextrema
data = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3, 4,
5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9,
12,
15, 16, 17, 18, 19, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24,]
h = np.histogram(data, bins=[0, 5, 10, 15, 20, 25])
hData = h[0]
peaks = argrelextrema(hData, np.greater)
But the result was:
(array([3]),)
I'd expect it to find the peaks in bin 0 and bin 3.
Note that the peaks span more than one bin. I don't want peaks that span several bins to be counted as additional peaks.
I'm open to another way to get the peaks.
Note:
>>> h[0]
array([19, 15, 1, 10, 5])
>>>
In computational topology, the formalism of persistent homology provides a definition of "peak" that seems to address your need. In the 1-dimensional case, the peaks correspond to the blue bars in the figure shown in the blog article linked below.
A description of the algorithm is given in this Stack Overflow answer to a peak detection question.
The nice thing is that this method not only identifies the peaks but also quantifies their "significance" in a natural way.
A simple and efficient implementation (as fast as sorting numbers) and the source material for the above answer are given in this blog article:
https://www.sthu.org/blog/13-perstopology-peakdetection/index.html
Try the findpeaks library.
pip install findpeaks
# Your input data:
data = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 12, 15, 16, 17, 18, 19, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,]
# import library
from findpeaks import findpeaks
# Find some peaks using the smoothing parameter.
fp = findpeaks(lookahead=1, interpolate=10)
# fit
results = fp.fit(data)
# Make plot
fp.plot()
# Results with respect to original input data.
results['df']
# Results based on interpolated smoothed data.
results['df_interp']
I wrote an easy function:
def find_peaks(a):
    x = np.array(a)
    max_val = np.max(x)  # renamed to avoid shadowing the builtin max
    length = len(a)
    ret = []
    for i in range(length):
        ispeak = True
        if i - 1 >= 0:  # the original `i - 1 > 0` skipped the left neighbor at i == 1
            ispeak &= (x[i] > 1.8 * x[i - 1])
        if i + 1 < length:
            ispeak &= (x[i] > 1.8 * x[i + 1])
        ispeak &= (x[i] > 0.05 * max_val)
        if ispeak:
            ret.append(i)
    return ret
I defined a peak as a value more than 1.8 times its neighbors and bigger than 5% of the maximum value. Of course you can adapt the thresholds to find the best setup for your problem.
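For completeness: argrelextrema (like scipy.signal.find_peaks) never reports the first or last sample as an extremum, which is exactly why bin 0 was missed in the question. Padding the histogram with zeros works around that. A sketch using the bin counts from the question:
import numpy as np
from scipy.signal import find_peaks
hData = np.array([19, 15, 1, 10, 5])
padded = np.concatenate(([0], hData, [0]))  # let boundary bins qualify as peaks
peaks, _ = find_peaks(padded)
print(peaks - 1)  # undo the padding offset -> [0 3]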
