Generate a random 3 element Numpy array of integers summing to 3 - python

I need to fill a numpy array of three elements with random integers such that the sum total of the array is three (e.g. [0,1,2]).
By my reckoning there are 10 possible arrays:
111,
012,
021,
102,
120,
201,
210,
300,
030,
003
My idea is to randomly generate an integer between 1 and 10 using randint, and then use a look-up table to fill the array from the above list of combinations.
Does anyone know of a better approach?

Here is how I did it:
>>> import numpy as np
>>> a=np.array([[1,1,1],[0,1,2],[0,2,1],[1,0,2],[1,2,0],[2,0,1],[2,1,0],[3,0,0],[0,3,0],[0,0,3]])
>>> a[np.random.randint(0,10)]
array([1, 2, 0])
>>> a[np.random.randint(0,10)]
array([0, 1, 2])
>>> a[np.random.randint(0,10)]
array([1, 0, 2])
>>> a[np.random.randint(0,10)]
array([3, 0, 0])

Here’s a naive programmatic way to do this for arbitrary array sizes/sums:
import numpy as np

def n_ints_summing_to_v(n, v):
    # v times, pick one of the n positions at random and add 1 to it
    elements = [np.arange(n) == np.random.randint(0, n) for i in range(v)]
    return np.sum(elements, axis=0)
This will, of course, slow down proportionally to the desired sum, but would be ok for small values.
Alternatively, we can phrase this in terms of drawing samples from the Multinomial distribution, for which there is a function available in NumPy (see here), as follows:
def n_ints_summing_to_v(n, v):
    # one multinomial draw: v trials over n equally likely outcomes
    return np.random.multinomial(v, np.ones(n) / n)
This is a lot quicker!
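For reference, here is what a direct call looks like (a small sketch I'm adding, assuming numpy is imported as np):

import numpy as np

# Three equally likely bins, three "balls": every draw is a length-3 integer
# array whose entries sum to 3.
sample = np.random.multinomial(3, [1/3, 1/3, 1/3])
print(sample, sample.sum())  # e.g. [0 2 1] 3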

This problem can be solved in the generic case, where the number of elements and their sum are both configurable. One advantage of the solution below is that it does not require generating a list of all the possibilities. The idea is to pick random numbers sequentially, each of which is less than the required sum. The required sum is reduced every time you pick a number:
import numpy

def gen(numel=3, total=3):
    arr = numpy.zeros((numel,), dtype=int)
    for i in range(len(arr) - 1):  # last element must be free to fill in the sum
        arr[i] = numpy.random.randint(0, total + 1)
        total -= arr[i]
        if total == 0:
            break  # nothing left to do
    arr[-1] = total  # ensure that everything adds up
    return arr

print(gen())
This solution does not guarantee that the possibilities will all occur with the same frequency. Among the ten possibilities you list, four start with 0, three with 1, two with 2 and one with 3. This is clearly not the uniform distribution that numpy.random.randint() provides for the first digit.
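A quick tally (my own check, using the gen() defined above) makes the bias visible: under a uniform choice over the ten outcomes the first element would be 0, 1, 2, 3 with probabilities 0.4, 0.3, 0.2, 0.1, whereas gen() produces each of them about 25% of the time.

from collections import Counter

# Empirical distribution of the first element of gen() over many runs.
counts = Counter(gen()[0] for _ in range(100000))
print({k: v / 100000 for k, v in sorted(counts.items())})
# roughly {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}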

Related

Allocate an integer randomly across k bins

I'm looking for an efficient Python function that randomly allocates an integer across k bins.
That is, some function allocate(n, k) will produce a k-sized array of integers summing to n.
For example, allocate(4, 3) could produce [4, 0, 0], [0, 2, 2], [1, 2, 1], etc.
It should be randomly distributed per item, assigning each of the n items randomly to each of the k bins.
This should be faster than your brute-force version when n >> k:
import numpy as np

def allocate(n, k):
    result = np.zeros(k, dtype=int)
    sum_so_far = 0
    for ind in range(k - 1):
        draw = np.random.randint(n - sum_so_far + 1)
        sum_so_far += draw
        result[ind] = draw
    result[k - 1] = n - sum_so_far
    return result
The idea is to draw a random number up to some maximum m (which starts out equal to n), and then we subtract that number from the maximum for the next draw, and so on, thus guaranteeing that we will never exceed n. This way we fill up the first k-1 entries; the final one is filled with whatever is missing to get a sum of exactly n.
Note: I am not sure whether this results in a "fair" random distribution of values or if it is somehow biased towards putting larger values into earlier indices or something like that.
If you are looking for a uniform distribution across all possible allocations (which is different from randomly distributing each item individually):
Using the "stars and bars" approach, we can transform this into a question of picking k-1 positions for possible dividers from a list of n+k-1 possible positions. (Wikipedia proof)
from random import sample

def allocate(n, k):
    dividers = sample(range(1, n + k), k - 1)
    dividers = sorted(dividers)
    dividers.insert(0, 0)
    dividers.append(n + k)
    return [dividers[i + 1] - dividers[i] - 1 for i in range(k)]

print(allocate(4, 3))
There are ((n+k-1) choose (k-1)) possible distributions, and this is equally likely to result in each one of them.
(This is a modification of Wave Man's solution: that one is not uniform across all possible solutions: note that the only way to get [0,0,4] is to roll (0,0), but there are two ways to get [1,2,1]; rolling (1,3) or (3,1). Choosing from n+k-1 slots and counting dividers as taking a slot corrects for this. In this solution, the random sample (1,2) corresponds to [0,0,4], and the equally likely random sample (2,5) corresponds to [1,2,1])
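As a sanity check (my own sketch, not part of the original answer), counting how often two specific allocations come up shows they are indeed equally likely; with 4 items and 3 bins there are C(6, 2) = 15 allocations, so each should appear roughly 1/15 of the time.

from collections import Counter
from random import sample

def allocate(n, k):
    dividers = sorted(sample(range(1, n + k), k - 1))
    dividers = [0] + dividers + [n + k]
    # return a tuple so the allocations can be counted with Counter
    return tuple(dividers[i + 1] - dividers[i] - 1 for i in range(k))

counts = Counter(allocate(4, 3) for _ in range(150000))
print(counts[(0, 0, 4)], counts[(1, 2, 1)])  # both around 10000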
Here's a brute-force approach:
import numpy as np
def allocate(n, k):
res = np.zeros(k)
for i in range(n):
res[np.random.randint(k)] += 1
return res
Example:
for i in range(3):
    print(allocate(4, 3))
[0. 3. 1.]
[2. 1. 1.]
[2. 0. 2.]
Adapting Michael Szczesny's comment based on numpy's new paradigm:
def allocate(n, k):
    return np.random.default_rng().multinomial(n, [1 / k] * k)
This notebook verifies that it returns the same distribution as my brute-force approach.
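As an alternative quick check (my own sketch, under the assumption that counting full allocation tuples is good enough to compare distributions):

from collections import Counter
import numpy as np

rng = np.random.default_rng(0)

def brute(n, k):
    res = np.zeros(k, dtype=int)
    for _ in range(n):
        res[rng.integers(k)] += 1
    return tuple(res)

def multi(n, k):
    return tuple(rng.multinomial(n, [1 / k] * k))

trials = 100000
print(Counter(brute(4, 3) for _ in range(trials)).most_common(3))
print(Counter(multi(4, 3) for _ in range(trials)).most_common(3))
# the two sets of frequencies should agree up to sampling noise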
Here's my solution. I think it will make all possible allocations equally likely, but I don't have a proof of that.
from random import randint

def allocate(n, k):
    dividers = [randint(0, n) for i in range(k + 1)]
    dividers[0] = 0
    dividers[k] = n
    dividers = sorted(dividers)
    return [dividers[i + 1] - dividers[i] for i in range(k)]

print(allocate(10000, 100))

finding the occurrence of vector v (1,k) inside a matrix M (m,k)

I want to find the number of occurrences of vector v in matrix M.
What I have is a matrix the size (60K, 10)
and I initialised a test vector v (1,10):
tester = np.zeros((1, 10))
Now I want to check how many times that vector repeats itself entirely in the matrix rows.
I did it iteratively and it works, but since the matrix is very large it hurts performance, and I'm trying to find a more elegant and faster way.
would appreciate some help
Thanks.
you can do the following:
temp = np.where((prediction == tester).all(axis=1))
len(temp[0])
When np.where() is given only a condition (no x and y values), it returns the indices where the condition is True; here that is the indices of the rows that match tester entirely.
Using this will surely lower your running time, and for me it is much more elegant than looping through the matrix.
you can check np.where api:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
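Putting it together on a tiny example (my own sketch; prediction and tester here are small stand-ins for the real (60000, 10) matrix and the query row):

import numpy as np

prediction = np.random.randint(0, 2, size=(8, 3))  # stand-in for the (60000, 10) matrix
tester = np.zeros((1, 3))                          # stand-in for the (1, 10) query row

temp = np.where((prediction == tester).all(axis=1))
print(len(temp[0]))  # number of rows that are entirely equal to tester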
Just compare and use all, so each row will result in a True value only if all its elements compare equal to the reference array. Then, you can simply sum the result, since int(True) == 1.
Example:
np.random.seed(0)
data = np.random.randint(0, 2, size=(50, 3))
to_match = np.random.randint(0, 2, size=(1, 3))
print(to_match)
print((data == to_match).all(axis=1).sum())
Output:
[[0 0 0]]
4
...which means that there are 4 instances of [0, 0, 0] in data.
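Equivalently (a small variant I'm adding, not from the answer), np.count_nonzero over the same boolean mask gives the same count:

import numpy as np

np.random.seed(0)
data = np.random.randint(0, 2, size=(50, 3))
to_match = np.random.randint(0, 2, size=(1, 3))
print(np.count_nonzero((data == to_match).all(axis=1)))  # same count as above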

Sliding window on list of lists in Python

I'm trying to use numpy/pandas to construct a sliding window style comparator. I've got a list of lists, each of which is a different length. I want to compare each list to another list as depicted below:
lists = [[10,15,5],[5,10],[5]]
window_diff(l[1],l[0]) = 25
The window diff for lists[0] and lists[1] would give 25 using the following window-sliding technique. Because lists[1] is the shorter list, we shift it once to the right, resulting in 2 windows of comparison. Summing the absolute element-wise differences across both windows gives the total difference between the two lists; in this case a total of 25. Note that we are taking the absolute difference.
The function should aggregate the total window_diff between each list and the other lists, so in this case
tot = total_diffs(lists)
tot>>[40, 30, 20]
# where tot[0] represents the sum of lists[0] window_diff with all other lists.
I wanted to know if there is a quick route to doing this in pandas or numpy. Currently I am using a very long-winded process of looping through each of the lists and then comparing element-wise by shifting the shorter list along the longer one.
My approach works fine for short lists, but my dataset is 10,000 lists long and some of these lists contain 60 or so data points, so speed is a criterion here. I was wondering whether numpy or pandas have some advice on this? Thanks
Sample problem data
import random
lists = [[random.randint(0, 1000) for r in range(random.randint(0, 60))] for x in range(100000)]
Steps:
For each pair of lists from the input list of lists, create sliding windows over the bigger array and take the absolute difference against the smaller one in that pair. We can use NumPy strides to get those sliding windows.
Sum those absolute differences and store the total as the pair-wise difference.
Finally, sum along each row and column of the 2D array from the previous step; their sum is the final output.
Thus, the implementation would look something like this -
import itertools
import numpy as np

def strided_app(a, L, S=1):  # Window len = L, Stride len/stepsize = S
    a = np.asarray(a)
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))

N = len(lists)
pair_diff_sums = np.zeros((N, N), dtype=type(lists[0][0]))
for i, j in itertools.combinations(range(N), 2):
    A, B = lists[i], lists[j]
    if len(A) > len(B):
        pair_diff_sums[i, j] = np.abs(strided_app(A, L=len(B)) - B).sum()
    else:
        pair_diff_sums[i, j] = np.abs(strided_app(B, L=len(A)) - A).sum()
out = pair_diff_sums.sum(1) + pair_diff_sums.sum(0)
For really heavy datasets, here's one method using one more level of looping -
N = len(lists)
out = np.zeros((N), dtype=type(lists[0][0]))
for k, i in enumerate(lists):
    for j in lists:
        if len(i) > len(j):
            out[k] += np.abs(strided_app(i, L=len(j)) - j).sum()
        else:
            out[k] += np.abs(strided_app(j, L=len(i)) - i).sum()
strided_app is inspired from here.
Sample input, output -
In [77]: lists
Out[77]: [[10, 15, 5], [5, 10], [5]]
In [78]: pair_diff_sums
Out[78]:
array([[ 0, 25, 15],
[25, 0, 5],
[15, 5, 0]])
In [79]: out
Out[79]: array([40, 30, 20])
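On NumPy 1.20+, np.lib.stride_tricks.sliding_window_view gives the same windows without the manual stride arithmetic (my own note, not part of the original answer):

import numpy as np

def strided_app_swv(a, L):
    # Same result as strided_app(a, L, S=1), via the newer, bounds-checked API
    return np.lib.stride_tricks.sliding_window_view(np.asarray(a), L)

print(strided_app_swv([10, 15, 5], 2))
# [[10 15]
#  [15  5]]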
Just for completeness of #Divakar's great answer and for its application to very large datasets:
import itertools

N = len(lists)
out = np.zeros(N, dtype=type(lists[0][0]))
for i, j in itertools.combinations(range(N), 2):
    A, B = lists[i], lists[j]
    if len(A) > len(B):
        diff = np.abs(strided_app(A, L=len(B)) - B).sum()
    else:
        diff = np.abs(strided_app(B, L=len(A)) - A).sum()
    out[i] += diff
    out[j] += diff
It does not create unnecessary large datasets and updates a single vector while iterating only over the upper triangular array.
It will still take a while to compute, as there is a trade-off between computational complexity and larger-than-RAM datasets. Solutions for larger-than-RAM datasets often rely on iteration, and Python is not great at that: iterating over a large dataset in pure Python is very slow.
Translating the code above to Cython could speed things up a bit.

Numpy argmax - random tie breaking

In the numpy.argmax function, ties between multiple max elements are broken by returning the first occurrence.
Is there a functionality for randomizing tie breaking so that all maximum numbers have equal chance of being selected?
Below is an example directly from numpy.argmax documentation.
>>> b = np.arange(6)
>>> b[1] = 5
>>> b
array([0, 5, 2, 3, 4, 5])
>>> np.argmax(b) # Only the first occurrence is returned.
1
I am looking for a way so that the elements at indices 1 and 5 in the list are returned with equal probability.
Thank you!
Use np.random.choice -
np.random.choice(np.flatnonzero(b == b.max()))
Let's verify for an array with three max candidates -
In [298]: b
Out[298]: array([0, 5, 2, 5, 4, 5])
In [299]: c=[np.random.choice(np.flatnonzero(b == b.max())) for i in range(100000)]
In [300]: np.bincount(c)
Out[300]: array([ 0, 33180, 0, 33611, 0, 33209])
In the case of a multi-dimensional array, choice won't work.
An alternative is
def randargmax(b, **kw):
    """a random tie-breaking argmax"""
    return np.argmax(np.random.random(b.shape) * (b == b.max()), **kw)
If for some reason generating random floats is slower than some other method, random.random can be replaced with that other method.
Easiest way is
np.random.choice(np.where(b == b.max())[0])
Since the accepted answer may not be obvious, here is how it works:
b == b.max() will return an array of booleans, with True where an item equals the max and False elsewhere
flatnonzero() will do two things: ignore the False values (the nonzero part), then return the indices of the True values. In other words, you get an array with the indices of the items matching the max value
Finally, you pick a random index from that array
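Walking through those steps on the question's array (my own illustration):

import numpy as np

b = np.array([0, 5, 2, 3, 4, 5])
mask = b == b.max()           # [False  True False False False  True]
idx = np.flatnonzero(mask)    # [1 5]
print(np.random.choice(idx))  # 1 or 5, each with probability 0.5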
Additional to #Manux's answer: changing b.max() to np.amax(b, **kw, keepdims=True) will let you do it along axes.
def randargmax(b, **kw):
    """a random tie-breaking argmax"""
    return np.argmax(np.random.random(b.shape) * (b == np.amax(b, **kw, keepdims=True)), **kw)

randargmax(b, axis=None)
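For example, applied row-wise to a small 2-D array (my own usage sketch, using the randargmax above):

import numpy as np

b2 = np.array([[1, 3, 3],
               [2, 2, 0]])
# Ties are broken at random independently per row, so the result varies
# between runs, e.g. [1 0], [2 1], ...
print(randargmax(b2, axis=1))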
Here is a comparison between the two main solutions by #divakar and #shyam-padia :
method (1) - using np.where
np.random.choice(np.where(b == b.max())[0])
method (2) - using np.flatnonzero
np.random.choice(np.flatnonzero(b == b.max()))
Code
Here is the code I wrote for the comparison:
import time
import numpy as np

def method1(b, bmax):
    return np.random.choice(np.where(b == bmax)[0])

def method2(b, bmax):
    return np.random.choice(np.flatnonzero(b == bmax))

def time_it(n):
    b = np.array([1.0, 2.0, 5.0, 5.0, 0.4, 0.1, 5.0, 0.3, 0.1])
    bmax = b.max()
    start = time.perf_counter()
    for i in range(n):
        method1(b, bmax)
    elapsed1 = time.perf_counter() - start
    start = time.perf_counter()
    for i in range(n):
        method2(b, bmax)
    elapsed2 = time.perf_counter() - start
    print(f'method1 time: {elapsed1} - method2 time: {elapsed2}')
    return elapsed1, elapsed2
Results
The comparison plots the computation time for running each method for [100, 1000, 10000, 100000, 1000000] iterations, with the number of iterations on the x-axis (logarithmic scale) and the time in seconds on the y-axis. It shows that np.where performs better than np.flatnonzero as the number of iterations increases.
Re-plotting the same results with a logarithmic y-axis as well shows how the two methods compare at lower iteration counts: np.where stays consistently better than np.flatnonzero.

Fastest way to access middle four elements in Numpy array?

Suppose I have a Numpy array, such as
rand = np.random.randn(6, 6)
I need the central four values in the array, since it has axes of even length. If it had been odd, such as 5 by 5, then there would only be one central value. What is the simplest/fastest/easiest way of retrieving these four entries? I can obtain them very crudely with indices, but I'm looking for a faster way than calling a bunch of functions and performing a bunch of calculations.
For example, consider the following:
array([[ 0.25659355, -0.75456113, 0.39467396, 0.50805361],
[-0.77218172, 1.00016061, -0.70389486, 1.67632146],
[-0.41106158, -0.63757421, 1.70390504, -0.79073362],
[-0.2016959 , 0.55316318, -1.55280823, 0.45740193]])
I want the following:
array([[1.00016061, -0.70389486],
[-0.63757421, 1.70390504]])
But not just for a 4 by 4 array - if it is even by even, I want the central four elements, as above.
Is something like this too complicated?
def get_middle(arr):
    n = arr.shape[0]
    mid = n // 2
    if n % 2 == 1:
        # odd-sized axis: a single central value
        return arr[mid, mid]
    else:
        # even-sized axis: the central 2x2 block
        return arr[mid - 1:mid + 1, mid - 1:mid + 1]
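For instance (my own usage example, using the get_middle above):

import numpy as np

a = np.arange(36).reshape(6, 6)
print(get_middle(a))  # central 2x2 block: [[14 15] [20 21]]

b = np.arange(25).reshape(5, 5)
print(get_middle(b))  # single central value: 12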
You can do this with a single slicing operation:
rand = np.random.randn(n, n)
# assuming n is even
center = rand[n//2 - 1:n//2 + 1, n//2 - 1:n//2 + 1]
I'm abusing order of operations by leaving out the parens, just to make it a little less messy.
Given array a:
import numpy as np
a = np.array([[ 0.25659355, -0.75456113, 0.39467396, 0.50805361],
[-0.77218172, 1.00016061, -0.70389486, 1.67632146],
[-0.41106158, -0.63757421, 1.70390504, -0.79073362],
[-0.2016959 , 0.55316318, -1.55280823, 0.45740193]])
The easiest way to get the central 4 values is:
ax, ay = a.shape
a[ax//2 - 1:ax//2 + 1, ay//2 - 1:ay//2 + 1]
This works if the dimensions of the array are even numbers. In the case of odd dimensions, there is no central block of four values.
Could you just use indexing? Like:
A = np.array([[ 0.25659355, -0.75456113,  0.39467396,  0.50805361],
              [-0.77218172,  1.00016061, -0.70389486,  1.67632146],
              [-0.41106158, -0.63757421,  1.70390504, -0.79073362],
              [-0.2016959 ,  0.55316318, -1.55280823,  0.45740193]])
A[1:3,1:3]
Or if matrix A had odd dimensions, say 5x5 then:
A[2,2]
