I want to create a numpy array where each element is the number of 1s in another numpy array of size x created with np.random.randint.
>>> x = 10
>>> np.random.randint(2, size=x)
array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
>>> sum(np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]))
5
Using it like this results in the same sum being reused for every element, instead of a new random array being generated each time:
>>> np.full((5,), sum(np.random.randint(2, size=10)), dtype="int")
array([5, 5, 5, 5, 5])
How can I do this, or is there a better way? I also tried the following:
>>> a = np.random.rand(10)
>>> len(a[a < 0.5])
7
>>> np.full((5,), len(np.random.rand(10)[np.random.rand(10) < 0.5]), dtype="int")
array([7, 7, 7, 7, 7])
But as you can see, that also resulted in the same numbers. The problem is that I don't want to use for loops; I'd rather find a way to do it quickly using numpy.
You could just generate a matrix of N rows, each an array of size x made of random ints, and then sum over each row:
import numpy as np
x = 10
N = 5
# N rows of x random 0/1 values; summing along axis 1 counts the 1s in each row
a = np.sum(np.random.randint(2, size=[N, x]), axis=1)
I'm fairly sure np.full is not what you want here as this is for array initialisation to a single value.
Alternatively, you can draw the counts directly from a binomial distribution:
In [13]: np.random.binomial(10, 0.5, 5)
Out[13]: array([7, 4, 6, 7, 4])
This assumes that there are 10 distinct left/right decisions, each having 0.5 probability.
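For what it's worth, here is a minimal sketch (my own, not part of either answer) showing that the binomial draw and an explicit row sum of 0/1 values produce the same kind of counts; the rng variable and the use of default_rng are assumptions of the sketch:

import numpy as np

# Both lines draw 5 counts of "heads" out of 10 fair coin flips.
rng = np.random.default_rng()
counts_binomial = rng.binomial(n=10, p=0.5, size=5)
counts_explicit = rng.integers(0, 2, size=(5, 10)).sum(axis=1)
print(counts_binomial)
print(counts_explicit)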
The thing I'm looking for is a function that, given the array "a" below, will return "b":
a = numpy.array([1, 1, 1, 1, 5, 5, 5, 5, 5, 6, 5, 2, 2, 2, 2])
Here 1 appears 4 times in a row, then 5 appears 5 times, then 6 once, then 5 once more, and finally 2 appears 4 times,
so the function should return an array like this:
b = numpy.array([4, 5, 1, 1, 4])
The function I'm looking for should treat 5 this way: even though 5 appears in "a" 6 times in total, it should be counted separately per consecutive run.
It is very specific. I wrote a function like this myself, but I want to know whether numpy has a built-in function like this for fast performance.
Thanks in advance.
This can be done with bincount on cumsum of nonzero diff:
out = np.bincount((np.diff(a)!=0).cumsum())
out[0] += 1
Output:
array([4, 5, 1, 1, 4])
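Putting that together into a small self-contained function (a sketch; the name run_lengths is mine, and it assumes a is non-empty):

import numpy as np

def run_lengths(a):
    # length of each run of consecutive equal values
    a = np.asarray(a)
    out = np.bincount((np.diff(a) != 0).cumsum())
    out[0] += 1  # np.diff is one element shorter than a, so the first run is undercounted by one
    return out

a = np.array([1, 1, 1, 1, 5, 5, 5, 5, 5, 6, 5, 2, 2, 2, 2])
print(run_lengths(a))  # [4 5 1 1 4]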
You can also use the prepend and append arguments of np.diff to create an array of differences with artificial nonzero entries added at both ends:
>>> np.diff(a,prepend=a[0]-1,append=a[-1]+1)
array([ 1, 0, 0, 0, 4, 0, 0, 0, 0, 1, -1, -3, 0, 0, 0, 1])
Now this is ready for a combination of np.nonzero and np.diff:
x = np.diff(a, prepend=a[0]-1, append=a[-1]+1)
np.diff(np.nonzero(x))
Output:
array([[4, 5, 1, 1, 4]], dtype=int32)
But this is a little slower: about 3x slower for a small array a, and about 25% slower for a large array a = np.random.randint(3, size=10000000).
I want to generate N random integers, where the first integer is uniformly chosen from 0..N, the second is uniformly chosen from 0..(N-1), the third from 0..(N-2), and so on. Is there a way to do this quickly in numpy, without incurring the cost of performing a separate numpy call N times?
You can pass arrays as the arguments to the integers method of numpy's random generator class. The arguments will broadcast, and generate the appropriate values.
For example,
In [17]: import numpy as np
In [18]: rng = np.random.default_rng()
In [19]: N = 16
In [20]: rng.integers(0, np.arange(N, 0, -1))
Out[20]: array([13, 10, 11, 11, 9, 8, 3, 0, 2, 5, 3, 1, 0, 2, 0, 0])
Note that the upper value given to the integers method is excluded, so if the ranges that you stated are inclusive, you'll have to adjust the arange arguments appropriately:
In [24]: rng.integers(0, np.arange(N+1, 1, -1))
Out[24]: array([ 6, 9, 11, 11, 7, 2, 5, 5, 8, 7, 5, 5, 4, 0, 1, 0])
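As a quick sanity check that each draw respects its own shrinking range (a sketch; the seed and variable names are my own):

import numpy as np

rng = np.random.default_rng(0)
N = 16
highs = np.arange(N + 1, 1, -1)   # exclusive upper bounds N+1, N, ..., 2
draws = rng.integers(0, highs)
assert np.all((draws >= 0) & (draws < highs))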
We can sample random numbers uniformly in [0, 1), scale each by its own range, then convert to int:
N = 10
np.random.seed(10)
randoms = np.random.rand(N)
(randoms * np.arange(1,N+1)).astype(int)
Output:
array([0, 0, 1, 2, 2, 1, 1, 6, 1, 0])
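As a quick empirical check of the idea (a sketch, not part of the answer above): floor(U * k) with U uniform on [0, 1) is uniform on 0..k-1.

import numpy as np

rng = np.random.default_rng(0)
k = 4
samples = (rng.random(100_000) * k).astype(int)
print(np.bincount(samples) / len(samples))  # each of the 4 buckets should be close to 0.25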
Let's say we have a 1d numpy array filled with some int values. And let's say that some of them are 0.
Is there any way, using numpy array's power, to fill all the 0 values with the last non-zero values found?
for example:
arr = np.array([1, 0, 0, 2, 0, 4, 6, 8, 0, 0, 0, 0, 2])
fill_zeros_with_last(arr)
print(arr)
[1 1 1 2 2 4 6 8 8 8 8 8 2]
A way to do it would be with this function:
def fill_zeros_with_last(arr):
    last_val = None  # I don't really care about the initial value
    for i in range(arr.size):
        if arr[i]:
            last_val = arr[i]
        elif last_val is not None:
            arr[i] = last_val
However, this uses a raw Python for loop instead of taking advantage of the power of numpy and scipy.
If we knew that a reasonably small number of consecutive zeros are possible, we could use something based on numpy.roll. The problem is that the number of consecutive zeros is potentially large...
Any ideas? or should we go straight to Cython?
Disclaimer:
I seem to recall that long ago I found a question on Stack Overflow asking something like this, or very similar, but I wasn't able to find it again. :-(
Maybe I missed the right search terms (sorry for the duplicate, if so), or maybe it was just my imagination...
Here's a solution using np.maximum.accumulate:
import numpy as np

def fill_zeros_with_last(arr):
    # prev[i] holds the index of the most recent non-zero entry (or 0 if there is none yet)
    prev = np.arange(len(arr))
    prev[arr == 0] = 0
    prev = np.maximum.accumulate(prev)
    return arr[prev]
We construct an array prev with the same length as arr, such that prev[i] is the index of the last non-zero entry at or before the i-th entry of arr. For example, if:
>>> arr = np.array([1, 0, 0, 2, 0, 4, 6, 8, 0, 0, 0, 0, 2])
Then prev looks like:
array([ 0, 0, 0, 3, 3, 5, 6, 7, 7, 7, 7, 7, 12])
Then we just index into arr with prev and we obtain our result. A test:
>>> arr = np.array([1, 0, 0, 2, 0, 4, 6, 8, 0, 0, 0, 0, 2])
>>> fill_zeros_with_last(arr)
array([1, 1, 1, 2, 2, 4, 6, 8, 8, 8, 8, 8, 2])
Note: Be careful to understand what this does when the first entry of your array is zero:
>>> fill_zeros_with_last(np.array([0,0,1,0,0]))
array([0, 0, 1, 1, 1])
Inspired by jme's answer here and by Bas Swinckels' (in the linked question) I came up with a different combination of numpy functions:
def fill_zeros_with_last(arr, initial=0):
    ind = np.nonzero(arr)[0]                    # indices of the non-zero entries
    cnt = np.cumsum(np.array(arr, dtype=bool))  # how many non-zeros have been seen so far
    return np.where(cnt, arr[ind[cnt - 1]], initial)
I think it's succinct and also works, so I'm posting it here for the record. Still, jme's is also succinct and easy to follow and seems to be faster, so I'm accepting it :-)
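For example, the initial argument covers arrays that start with zeros (output worked out by hand from the function above):
>>> fill_zeros_with_last(np.array([0, 0, 1, 0, 0]), initial=-1)
array([-1, -1,  1,  1,  1])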
If the 0s only come in runs of length 1, this use of nonzero might work:
In [266]: arr=np.array([1,0,2,3,0,4,0,5])
In [267]: I=np.nonzero(arr==0)[0]
In [268]: arr[I] = arr[I-1]
In [269]: arr
Out[269]: array([1, 1, 2, 3, 3, 4, 4, 5])
I can handle your arr by applying this repeatedly until I is empty.
In [286]: arr = np.array([1, 0, 0, 2, 0, 4, 6, 8, 0, 0, 0, 0, 2])
In [287]: while True:
.....: I=np.nonzero(arr==0)[0]
.....: if len(I)==0: break
.....: arr[I] = arr[I-1]
.....:
In [288]: arr
Out[288]: array([1, 1, 1, 2, 2, 4, 6, 8, 8, 8, 8, 8, 2])
If the runs of 0s are long, it might be better to find those runs and handle each one as a block. But if most runs are short, this repeated application may be the fastest route.
That perhaps wasn't the best description in the title, but I can hopefully describe my problem below. There are really two parts to it.
The ultimate thing I'm trying to do is group certain times together within an astropy table. As the values are not the same for each time that will go into a particular group, I don't believe I can just give the column name to the group_by() method.
So, what I'm trying to do is produce an array describing which group each time will be associated with, so that I can pass that to group_by(). I can get the bin edges by performing, for example (the 10 is arbitrary),
>>> np.where(np.diff(table['Times']) > 10)[0]
array([ 2, 8, 9, 12])
Let's say the table has length 15. What I want to know is how it might be possible to use that array above to create the following array without having to use loops
array([0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4])
such that when I place that array in the group_by() method it groups the table according to those bin edges.
Alternatively, is there a better way of grouping an astropy table according to time ranges?
It sounds like np.digitize should do what you want. Using arr in place of your table, try
import numpy as np

arr = np.array([1, 2, 3, 15, 16, 17, 17, 18, 19, 30, 41, 42, 43, 55, 56])
bin_edges = arr[np.where(np.diff(arr) > 10)[0]]
indices = np.digitize(arr, bin_edges, right=True)
print(indices)  # [0 0 0 1 1 1 1 1 1 2 3 3 3 4 4]
One approach with np.repeat -
def repeat_based(bin_edges, n):
    # run lengths between consecutive bin edges (with sentinels -1 and n-1 at the ends)
    reps = np.diff(np.hstack((-1, bin_edges, n - 1)))
    return np.repeat(np.arange(bin_edges.size + 1), reps)
Another approach with np.cumsum -
def cumsum_based(bin_edges, n):
    # mark the start of each new group with a 1, then cumsum to get group ids
    id_arr = np.zeros(n, dtype=int)
    id_arr[bin_edges + 1] = 1
    return id_arr.cumsum()
Sample run -
In [400]: bin_edges = np.array([ 2, 8, 9, 12])
In [401]: repeat_based(bin_edges, n = 15)
Out[401]: array([0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4])
In [402]: cumsum_based(bin_edges, n = 15)
Out[402]: array([0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4])
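As a hypothetical usage sketch back in the astropy context (this assumes that Table.group_by accepts a plain numpy key array of the same length as the table, reuses cumsum_based from above, and borrows the sample times from the other answer):

import numpy as np
from astropy.table import Table

table = Table({'Times': [1, 2, 3, 15, 16, 17, 17, 18, 19, 30, 41, 42, 43, 55, 56]})
bin_edges = np.where(np.diff(table['Times']) > 10)[0]
grouped = table.group_by(cumsum_based(bin_edges, n=len(table)))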
I'm looking to compactly represent duplicates in a Python list / 1D numpy array. For instance, say we have
x = np.array([1, 0, 0, 3, 3, 0])
this array has several duplicate elements that can be represented with a
group_id = np.array([0, 1, 1, 2, 2, 1])
so that all duplicates in a given cluster are found with x[group_id==<some_id>].
The list of duplicate pairs can be efficiently computed with sorting,
s_idx = np.argsort(x)
diff_idx = np.nonzero(x[s_idx[:-1]] == x[s_idx[1:]])[0]
where the pair s_idx[diff_idx] <-> s_idx[diff_idx+1] correspond to the indices in the original array that are duplicates.
(here array([1, 2, 3]) <-> array([2, 5, 4])).
However, I'm not sure how to efficiently calculate cluster_id from this linkage information for large array sizes (N > 10⁶).
Edit: as suggested by @Chris_Rands, this can indeed be done with itertools.groupby,
import numpy as np
import itertools
def get_group_id(x):
    group_id = np.zeros(x.shape, dtype='int')
    for i, j in itertools.groupby(x):
        j_el = next(j)
        group_id[x == j_el] = i
    return group_id
However, the scaling appears to be O(n²), and this would not scale to my use case (N > 10⁶):
for N in [50000, 100000, 200000]:
%time _ = get_group_id(np.random.randint(0, N, size=N))
CPU times: total: 1.53 s
CPU times: total: 5.83 s
CPU times: total: 23.9 s
I believe using the duplicate linkage information would be more efficient, as computing the duplicate pairs for N=200000 takes just 6.44 µs in comparison.
You could use numpy.unique:
In [13]: x = np.array([1, 0, 0, 3, 3, 0])
In [14]: values, cluster_id = np.unique(x, return_inverse=True)
In [15]: values
Out[15]: array([0, 1, 3])
In [16]: cluster_id
Out[16]: array([1, 0, 0, 2, 2, 0])
(The cluster IDs are assigned in the order of the sorted unique values, not in the order of a value's first appearance in the input.)
Locations of the items in cluster 0:
In [22]: cid = 0
In [23]: values[cid]
Out[23]: 0
In [24]: (cluster_id == cid).nonzero()[0]
Out[24]: array([1, 2, 5])
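If you want the member indices for every cluster at once, rather than one boolean mask per id, one way (a sketch, not part of the answer above) is to sort by cluster_id and split:

import numpy as np

x = np.array([1, 0, 0, 3, 3, 0])
values, cluster_id = np.unique(x, return_inverse=True)

order = np.argsort(cluster_id, kind='stable')         # indices of x sorted by cluster id
boundaries = np.cumsum(np.bincount(cluster_id))[:-1]  # where each new cluster starts
groups = np.split(order, boundaries)
print(groups[0])  # indices of the value 0, i.e. [1 2 5]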
Here's an approach using np.unique to keep the order according to the first appearance of a number -
unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)
out = first_idx.argsort().argsort()[ID]
Sample run -
In [173]: x
Out[173]: array([1, 0, 0, 3, 3, 0, 9, 0, 2, 6, 0, 0, 4, 8])
In [174]: unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)
In [175]: first_idx.argsort().argsort()[ID]
Out[175]: array([0, 1, 1, 2, 2, 1, 3, 1, 4, 5, 1, 1, 6, 7])
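Roughly how the double argsort works (a commented sketch restating the snippet above; same variable names, nothing new added):

import numpy as np

x = np.array([1, 0, 0, 3, 3, 0, 9, 0, 2, 6, 0, 0, 4, 8])

# unq is sorted; first_idx[i] is where unq[i] first appears in x; ID maps each x back into unq
unq, first_idx, ID = np.unique(x, return_index=True, return_inverse=True)

# the double argsort turns first_idx into ranks: rank[i] is the position of unq[i]
# when the unique values are ordered by first appearance instead of by value
rank = first_idx.argsort().argsort()
print(rank[ID])  # [0 1 1 2 2 1 3 1 4 5 1 1 6 7]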