I have many arrays of different lengths and I want to bring them all to a fixed length, say 100 samples. The arrays contain time series, and I do not want to lose the shape of those series while reducing the size of the array. What I think I need is a downsampling algorithm. Is there an easy way to reduce the number of samples in an array, for example by averaging groups of values?
Thanks
Here's a little script to do it without numpy. It maintains the shape even if the required length is larger than the length of the array.
from math import floor

def sample(input, count):
    output = []
    sample_size = float(len(input)) / count
    for i in range(count):
        output.append(input[int(floor(i * sample_size))])
    return output
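For example, a quick sanity check (this call is illustrative and not part of the original answer):

data = [1, 2, 4, 8, 16, 32, 64]
print(sample(data, 3))  # [1, 4, 16] -- the elements at indices 0, 2 and 4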
You can index with randomly generated indices, keeping your original array (or only its shape, to reduce memory usage):
import numpy as np

input_data = somearray
shape = input_data.shape
n_samples = 100
inds = np.random.randint(0, shape[0], size=n_samples)
sub_samples = input_data[inds]
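Note that randint samples with replacement and returns indices in arbitrary order; if you want to keep the temporal order of the series, you can sort distinct indices instead (a variation on the snippet above, not part of the original answer):

# assumes input_data has at least n_samples elements
inds = np.sort(np.random.choice(shape[0], size=n_samples, replace=False))
sub_samples = input_data[inds]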
Here's a shorter version of Nick Fellingham's answer.
from math import floor

def sample(input, count):
    ss = float(len(input)) / count
    return [input[int(floor(i * ss))] for i in range(count)]
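If the arrays are already NumPy arrays, a similar effect can be achieved with evenly spaced indices (my own variation, not from either answer above):

import numpy as np

def sample_np(arr, count):
    # evenly spaced indices from the first element to the last
    idx = np.linspace(0, len(arr) - 1, count).astype(int)
    return arr[idx]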
I would like to generate samples that follow a normal distribution from M source values, each with its own standard deviation, drawing N samples per source value. Can this be done efficiently with numpy arrays?
My desired output is an MxN array. I expected this pseudocode to work, but it fails with an error:
import numpy as np
# initial data
M = 100
x = np.arange(M)
y = x**2
y_err = y * 0.1
# sample the data N times per datapoint
N = 1000
N_samples = np.random.normal(loc=y, scale=y_err, size=N)
Running this yields a broadcasting error since N and M are not the same:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
I can imagine solutions that use loops, but is there a better/faster method? Many numpy functions are vectorized, so I would expect there to be a numpy approach that is faster or at least avoids explicit loops.
I was able to create two methods: one that uses loops, and one that uses numpy functions. However, the numpy method is slower for large arrays, so I am curious as to why this is and whether there is an alternative method.
Method one: loop through each of the M source values and sample N points from that value, and proceed through the whole dataset so that the numpy sampler is used M times:
# initialize the sample array
y_sampled = np.zeros([M, N])
for i in range(M):
    y_sampled[i] = np.random.normal(loc=y[i], scale=y_err[i], size=N)
Method two: use numpy's vectorized methods on an adjusted dataset, wherein the source data is duplicated to be an MxN array, on which the numpy sampler is applied once
# duplicate the source data and error arrays horizontally N times
y_dup = np.repeat(np.vstack(y), N, axis=1)
y_err_dup = np.repeat(np.vstack(y_err), N, axis=1)
# apply the numpy sampler once on the entire 2D array
y_sampled = np.random.normal(loc=y_dup, scale=y_err_dup, size=(M, N))
I expected the second method to be faster since the sampler is applied only once, albeit on a 2D array. The wall time is similar for small arrays (M = 100) but differs by a factor of ~2 for larger arrays (M = 1e5). Timing:
M = 100, N = 1000
Time used by loop method: 0.0156 seconds
Time used by numpy resize/duplicating method: 0.0199 seconds

M = 100000, N = 1000
Time used by loop method: 3.9298 seconds
Time used by numpy resize/duplicating method: 7.3371 seconds
I would expect there to be a built-in method to sample N times, instead of duplicating the dataset N times, but these methods work.
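There is in fact a built-in way: np.random.normal broadcasts loc and scale against the requested size, so you can draw the whole (N, M) block in one call and transpose it. A minimal sketch based on the setup above:

# loc and scale of shape (M,) broadcast against size=(N, M); no duplication needed
y_sampled = np.random.normal(loc=y, scale=y_err, size=(N, M)).T  # shape (M, N)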
I want to take a chunk of an array (say 500 values), perform some operation on it, such as taking the sum of squares of those 500 values, and then proceed with the next 500 values of the same array.
How should I implement this? Would a Blackman window be useful here, or is another approach more suitable?
It depends on several criteria:
Is the number of elements per operation an integer divisor of your array length?
Is the number of elements a significant fraction of your array length?
If 1. is True then you can reshape your array and use reduce functions like .sum(axis=axis), which is probably the most performant way. See @P. Camilleri's answer for this case.
However if 1. is False then the second question becomes important. If you answer "yes" to 2. then you can just use a for-loop over the array because the Python loop overhead is not quite as significant for loops with few iterations:
width = 500
for i in range(0, arr.size, width):
    print(arr[i:i+width])  # do your operation here!
However if your answer is "No" to 1. and 2. you probably should use a convolution filter (see scipy.ndimage.filters) and then only pick the interesting elements:
width = 10
result = some_filter(arr)
# keep only every width-th element, starting near the centre of the first window
result = result[(width - 1) // 2::width]
For example the sum of squares:
import numpy as np

arr = np.random.randint(0, 10, (25))
arr_squared = arr ** 2
width = 10
for i in range(0, arr_squared.size, width):
    print(arr_squared[i:i+width].sum())
# 267, 329, 170
or using a convolution:
from scipy.ndimage import convolve
convolve(arr_squared, np.ones(width), mode='constant')[4::10]
# array([267, 329, 170])
Assuming your array a is 1D and its length is a multiple of 500, a simple np.sum(a.reshape((-1, 500)) ** 2, axis=1) would suffice. If you want a more complicated operation, please edit your question accordingly.
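As a concrete illustration of that reshape trick (my own example, using a length that is a multiple of 500):

import numpy as np

a = np.random.rand(1500)                        # 3 blocks of 500 values
block_sums = np.sum(a.reshape((-1, 500)) ** 2, axis=1)
print(block_sums.shape)                         # (3,)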
I have the following code to create a random subset (of size examples) of a large set:
def sampling(input_set):
    tmp = random.sample(input_set, examples)
    return tmp
The problem is that my input is a large matrix, so input_set.shape = (n,m). However, sampling(input_set) is a list, while I want it to be a submatrix of size = (examples, m), not a list of length examples of vectors of size m.
I modified my code to do this:
def sampling(input_set):
    tmp = random.sample(input_set, examples)
    sample = input_set[0:examples]
    for i in range(examples):
        sample[i] = tmp[i]
    return sample
This works, but is there a more elegant/better way to accomplish what I am trying to do?
Use numpy as follows to create an n x m matrix (assuming input_set is a list):
import numpy as np

input_matrix = np.array(input_set).reshape(n, m)
OK, if I understand the question correctly, you just want to drop the last (n - k) rows, so:
sample = input_matrix[:k - n]
should do the job for you.
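As an aside (not from either answer here), NumPy can also sample rows directly by drawing row indices, which avoids the copy-back loop entirely. A minimal sketch, assuming input_set is already a NumPy array and passing examples as a parameter:

import numpy as np

def sampling(input_set, examples):
    # draw `examples` distinct row indices, then index the matrix with them
    rows = np.random.choice(input_set.shape[0], size=examples, replace=False)
    return input_set[rows]  # shape (examples, m)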
I don't know if you are still interested, but maybe you can do something like this:
# create a random 6x6 matrix with entries between -10 and 10
import numpy as np

mat = np.random.randint(-10, 10, (6, 6))
print(mat)

# select a random int between 0 and 4
startIdx = np.random.randint(0, 5)
print(startIdx)

# extract a 3x3 submatrix (it will be smaller than 3x3 if the index runs out of bounds)
print(mat[startIdx:startIdx + 3, startIdx:startIdx + 3])
I have an array of values and would like to create a matrix from that, where each row is my starting point vector multiplied by a sample from a (normal) distribution.
The number of rows of this matrix will then depend on the number of samples I want.
%pylab
my_vec = array([1,2,3])
my_rand_vec = my_vec*randn(100)
The last command does not work because the array shapes do not match.
I could use a for loop, but I am trying to leverage array operations.
Try this
my_rand_vec = my_vec[None,:]*randn(100)[:,None]
For small numbers I get for example
import numpy as np
my_vec = np.array([1,2,3])
my_rand_vec = my_vec[None,:]*np.random.randn(5)[:,None]
my_rand_vec
# array([[ 0.45422416, 0.90844831, 1.36267247],
# [-0.80639766, -1.61279531, -2.41919297],
# [ 0.34203295, 0.6840659 , 1.02609885],
# [-0.55246431, -1.10492863, -1.65739294],
# [-0.83023829, -1.66047658, -2.49071486]])
Your solution my_vec*randn(100) does not work because * performs element-wise multiplication, which only works if both arrays have compatible shapes.
What you have to do is add an extra dimension using [None,:] and [:,None] so that numpy's broadcasting applies.
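To see the broadcasting at work, it helps to print the shapes involved (a quick illustration of the point above):

print(my_vec[None, :].shape)              # (1, 3)
print(np.random.randn(5)[:, None].shape)  # (5, 1)
# (1, 3) * (5, 1) broadcasts to (5, 3): one row per random sample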
As a side note, I would recommend not using pylab. Instead, use import ... as ... to include modules, as pointed out here.
It is the outer product of vectors:
my_rand_vec = numpy.outer(randn(100), my_vec)
You can pass the dimensions of the array you require to numpy.random.randn:
my_rand_vec = my_vec*np.random.randn(100,3)
To multiply each vector by the same random number, you need to add an extra axis:
my_rand_vec = my_vec*np.random.randn(100)[:,np.newaxis]
I'd like to sample from the indices of a 2D Numpy array, where each index is weighted by the number stored at that position in the array. The way I know is numpy.random.choice; however, that returns the value, not the index. Is there an efficient way of doing this?
Here is my code:
import numpy as np

A = np.arange(1, 10).reshape(3, 3)
A_flat = A.flatten()
d = np.random.choice(A_flat, size=10, p=A_flat / float(np.sum(A_flat)))
print(d)
You could do something like:
import numpy as np

def wc(weights):
    cs = np.cumsum(weights)
    idx = cs.searchsorted(np.random.random() * cs[-1], 'right')
    return np.unravel_index(idx, weights.shape)
Notice that the cumsum is the slowest part of this, so if you need to do this repeatedly for the same array I'd suggest computing the cumsum ahead of time and reusing it.
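For instance, with the cumulative sum precomputed you can draw many weighted indices in one vectorized call (a sketch building on the function above, using the A from the question; not part of the original answer):

cs = np.cumsum(A)                               # flattened cumulative weights, computed once
draws = cs.searchsorted(np.random.random(10) * cs[-1], 'right')
rows, cols = np.unravel_index(draws, A.shape)   # ten weighted (row, col) index pairs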
To expand on my comment: adapting the weighted-choice method presented here: https://stackoverflow.com/a/10803136/553404
def weighted_choice_indices(weights):
    cs = np.cumsum(weights.flatten()) / np.sum(weights)
    idx = np.sum(cs < np.random.rand())
    return np.unravel_index(idx, weights.shape)
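For completeness, np.random.choice can also sample the flat indices themselves if you pass it the number of elements instead of the values; the indices can then be unraveled back into 2D (my own variation, not from the answers above, reusing A and A_flat from the question):

flat = np.random.choice(A.size, size=10, p=A_flat / float(A_flat.sum()))
rows, cols = np.unravel_index(flat, A.shape)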