randomly split data in n groups? - python

I am currently trying to write code for splitting a given data into a number of groups.
The groups should be created randomly and they should encompass together the entire data.
So let's suppose there's an array A of eg. shape = (3, 3, 3) that has 27 root elements e:
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],
[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]],
[[18, 19, 20],
[21, 22, 23],
[24, 25, 26]]])
I want to create n groups such that g1 & g2 & ... & gn will "add up" to the original array A.
I shuffled A as follows:
def shuffle(array):
    shuf = array.ravel()
    np.random.shuffle(shuf)
    return np.reshape(shuf, array.shape)
But how do I create n groups (n < e) randomly?
Thanks!
Leo

Though not so elegant, the following code spreads the array's elements over n groups, ensuring that each group gets at least one element and distributing the rest randomly.
import numpy as np
def shuffle_and_group(array, n):
    shuf = array.ravel()
    np.random.shuffle(shuf)
    shuf = list(shuf)
    groups = []
    for i in range(n): # ensuring no empty group
        groups.append([shuf.pop()])
    for num in shuf: # spread the remaining
        groups[np.random.randint(n)].append(num)
    return groups
array = np.arange(15)
print(shuffle_and_group(array, 9))
In case you worry about the time, the code will have time complexity of O(e) where e is the number of elements.
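As an alternative sketch (not the method above), the shuffled data can be cut at n - 1 distinct random positions, which also guarantees that no group is empty. The function name `split_into_random_groups` and the `rng` parameter are made up for this illustration, and the group sizes follow a different distribution than in the answer above.

```python
import numpy as np

def split_into_random_groups(array, n, rng=None):
    """Split all elements of `array` into n non-empty random groups."""
    rng = np.random.default_rng() if rng is None else rng
    flat = rng.permutation(array.ravel())  # shuffled copy; the input is left untouched
    if not 0 < n <= flat.size:
        raise ValueError("n must be between 1 and the number of elements")
    # n - 1 distinct cut positions strictly inside the data give n non-empty pieces
    cuts = np.sort(rng.choice(np.arange(1, flat.size), size=n - 1, replace=False))
    return np.split(flat, cuts)

groups = split_into_random_groups(np.arange(27).reshape(3, 3, 3), 4)
print([g.tolist() for g in groups])
```

Concatenating the returned groups gives back every element of A exactly once, just in shuffled order.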

Related

Clean way to generate random numbers from 0 to 50 of size 1000 in python, with no similar number of occurrences

What would be the cleanest way to generate random numbers from 0 to 50, of size 1000, with the condition that no number should have the same number of occurrence as any other number using python and numpy.
Example for size 10: [0, 0, 0, 1, 1, 3, 3, 3, 3, 2] --> no number occurs same number of times
Drawing the counts from rng.dirichlet and rejecting samples that violate the constraints is guaranteed to satisfy the requirements, but the number of unique elements has low entropy. You have to set the range for the number of unique elements yourself via np.ones(rng.integers(min,max)). If max approaches the maximum number of unique elements (here 50), rejection may take a long time, or there may be no solution at all, causing an infinite loop. The code below produces an array of size 100.
import numpy as np
times = np.array([])
rng = np.random.default_rng()
# rejection sampling
while times.sum() != 100 or len(times) != len(np.unique(times)):
    times = np.around(rng.dirichlet(np.ones(rng.integers(5,10)))*100)
nr = rng.permutation(np.arange(51))[:len(times)]
np.repeat(nr, times.astype(int))
Random output
array([ 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 25, 5, 5, 5])
Here's a recursive and possibly very slow implementation that produces the output desired.
import numpy as np
def get_sequence_lengths(values, total):
    if total == 0:
        return [[]], True
    if total < 0:
        return [], False
    if len(values) == 0:
        return [], False
    sequences = []
    result = False
    for i in range(len(values)):
        ls, suc = get_sequence_lengths(values[:i] + values[i + 1:], total - values[i])
        result |= suc
        if suc:
            sequences.extend([[values[i]] + s for s in ls])
    return sequences, result
def gen_numbers(rand_min, rand_max, count):
    values = list(range(rand_min, rand_max + 1))
    sequences, success = get_sequence_lengths(list(range(1, count+1)), count)
    sequences = list(filter(lambda x: len(x) <= 1 + rand_max - rand_min, sequences))
    if not success or not len(sequences):
        raise ValueError('Cannot generate with given parameters.')
    sequence = sequences[np.random.randint(len(sequences))]
    values = np.random.choice(values, len(sequence), replace=False)
    result = []
    for v, s in zip(values, sequence):
        result.extend([v] * s)
    return result
get_sequence_lengths generates all permutations of unique positive integers that sum to the given total. The sequences are then filtered by the number of available values. Finally, pairing randomly chosen values with the counts from one sequence produces the output.
As mentioned above, get_sequence_lengths is recursive and is going to be quite slow for larger input values.
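Since the values are assigned to the counts at random afterwards, the order of the counts does not matter, so one possible variation is to enumerate each set of distinct counts once instead of every permutation of it. The generator below is a sketch under that assumption; the name `distinct_partitions` is made up for this example, and the number of results still grows quickly with the total.

```python
def distinct_partitions(total, smallest=1):
    """Yield increasing tuples of distinct positive integers that sum to `total`."""
    if total == 0:
        yield ()
        return
    for first in range(smallest, total + 1):
        # every later part must be strictly larger than `first`
        for rest in distinct_partitions(total - first, first + 1):
            yield (first, *rest)

print(list(distinct_partitions(6)))
# [(1, 2, 3), (1, 5), (2, 4), (6,)]
```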
To avoid the variability of generating random combinations in a potentially long trial-and-error loop, you could use a function that directly produces a random partition of a number in which all parts are distinct (and increasing). From that, you simply need to map shuffled numbers over the chunks provided by the partition function:
import random
def randPart(N,size=0): # O(√N)
    if not size:
        maxSize = int((N*2+0.25)**0.5-0.5) # ∑1..maxSize <= N
        size = random.randrange(1,maxSize) # select random size
    if size == 1: return (N,) # one part --> all of N
    s = size*(size-1)//2 # min sum of deltas for rest
    a = random.randrange(1,(N-s)//size) # base value
    p = randPart(N-a*size,size-1) # deltas on other parts
    return (a,*(n+a for n in p)) # combine to distinct parts
usage:
size = 30
n = 10
chunks = randPart(size)
numbers = random.sample(range(n),len(chunks))
result = [n for count,n in zip(chunks,numbers) for _ in range(count)]
print(result)
[9, 9, 9, 0, 0, 0, 0, 7, 7, 7, 7, 7, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6]
# resulting frequency counts
from collections import Counter
print(sorted(Counter(result).values()))
[3, 4, 5, 6, 12]
Note that if your range of random numbers is smaller than the maximum number of distinct parts (for example, fewer than 44 numbers for an output of 1000 values), you would need to modify the randPart function to take that limit into account in its calculation of maxSize:
def randPart(N,sizeLimit=0,size=0):
    if not size:
        maxSize = int((N*2+0.25)**0.5-0.5) # ∑1..maxSize <= N
        maxSize = min(maxSize,sizeLimit or maxSize)
    ...
You could also change it to force a minimum number of partitions.
This solves your problem in the way @MYousefi suggested.
import random
seq = list(range(50))
random.shuffle(seq)
values = []
for n,v in enumerate(seq):
    values.extend( [v]*(n+1) )
    if len(values) > 1000:
        break
print(values)
Note that you can't get exactly 1,000 numbers. At first, I generated the entire sequence and then took the first 1,000, but that means whichever sequence gets truncated will be the same length as one of the earlier ones. You end up with 1,035.
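If exactly 1,000 values are required, one possible workaround (a sketch, not part of the answer above) is to use the counts 1, 2, ..., k and add whatever is left over to the largest count, which keeps all counts distinct. The function name `distinct_count_sample` is made up for this illustration. Note that the count profile itself is then deterministic; only the choice of values and the final order are random.

```python
import random
from collections import Counter

def distinct_count_sample(values, total):
    """Return `total` items drawn from `values` such that no two values occur equally often."""
    values = list(values)
    # largest k with 1 + 2 + ... + k <= total, capped by the number of available values
    k = 0
    while k < len(values) and (k + 1) * (k + 2) // 2 <= total:
        k += 1
    if k == 0:
        raise ValueError("need at least one value and total >= 1")
    counts = list(range(1, k + 1))
    counts[-1] += total - sum(counts)  # leftover goes to the largest count
    chosen = random.sample(values, k)
    result = [v for v, c in zip(chosen, counts) for _ in range(c)]
    random.shuffle(result)
    return result

sample = distinct_count_sample(range(51), 1000)
print(len(sample), sorted(Counter(sample).values())[-3:])
```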

I need to generate x amount of unique lists inside another list

I am able to generate the desired output, but I need 10 of them and each list has to be unique. The best solution I could think of was to create a second function, generate an empty list, and populate each element with a list from the first function. The output I have so far is x lists, but they are not unique, and Python gives me an error when I try to call the first function inside the second one.
import random
numbers = list(range(1, 35))
out = []
final = []
print(numbers) # to see all numbers
# returns 7 unique pop'd numbers from list 'numbers' and appends them to list 'out'
def do():
    for x in range(7):
        out.append(numbers.pop(random.randrange(len(numbers))))
    print(sorted(out))
# In other words I want to print the output from function do() 10 times, but each item in a list has to be unique, not the lists themselves
def do_ten():
    for x in range(10):
        final.append(out)
        # do() python doesn't like this call
    print(sorted(final))
do_ten()
This generates a given number of lists inside a list, each containing random numbers from 1 to 34. You can use l and n to control the number of lists and the number of values per list, respectively.
import random
l, n = 3, 5 # The amount of lists and numbers respectively.
lis = [[i for i in random.sample(range(1, 35), n)] for group in range(l)]
print(lis)
Random Output:
[[16, 11, 17, 13, 9], [26, 6, 16, 29, 24], [24, 2, 4, 1, 20]]
You are popping 7 numbers 10 times from a list containing only 34 elements (from 1 to 34). This is not possible. You need at least 70 elements in your list numbers (for example, from 0 to 69).
This is a solution that should work, based on the code you've already written:
import random
numbers = list(range(0, 70))
final = []
print(numbers) # to see all numbers
# returns a list of 7 unique popped numbers from list 'numbers'
def do():
    out = []
    for x in range(7):
        l = len(numbers)
        r = random.randrange(l)
        t = numbers.pop(r)
        out.append(t)
    return out
# Call 10 times do() and add the returned list to 'final'
def do_ten():
    for x in range(10):
        out = do() # Get result from do()
        final.append(out) # Add it to 'final'
do_ten()
print(final)
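The same result can probably be reached without popping: draw all the needed numbers in one call to random.sample and slice them into chunks. This is only a sketch, and the helper name `disjoint_lists` and its parameters are made up for the example.

```python
import random

def disjoint_lists(num_lists, list_len, pool):
    """Return `num_lists` lists of `list_len` items each, with no item shared between lists."""
    pool = list(pool)
    needed = num_lists * list_len
    if needed > len(pool):
        raise ValueError("pool is too small for that many disjoint lists")
    picked = random.sample(pool, needed)  # distinct items in random order
    return [picked[i * list_len:(i + 1) * list_len] for i in range(num_lists)]

final = disjoint_lists(10, 7, range(70))
print(final)
```

With a pool of only 34 numbers, the call raises ValueError, which makes the size problem described above explicit.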
Does this help?
import numpy as np
num_lists = 10
len_list = 10
[list(np.random.randint(1,11,len_list)) for _ in range(num_lists)]
As some people may have a different definition of "uniqueness", you may also try:
source_list = range(0, num_lists*len_list,1)
[list(np.random.choice(source_list, len_list, replace=False)) for _ in range(num_lists)]
Pulling 7 of the 34 numbers from your range without repeats can be done with random.sample. To ensure you do not get duplicate lists, keep a set of tuples of the lists chosen so far and append a new list to final only if its tuple is not yet in the set:
import random
numbers = range(1, 35) # 1...34
final = []
chosen = set()
while len(final) < 10:
    # replace popping numbers with random.sample
    one_list = random.sample(numbers, k=7) # 7 numbers out of 34 - no repeats
    # create a tuple of this list and only add to final if not yet added
    t = tuple(one_list)
    if t not in chosen:
        chosen.add(t)
        final.append(one_list)
print(final)
Output:
[[1, 5, 10, 26, 14, 33, 6],
[3, 11, 1, 30, 7, 21, 18],
[24, 23, 28, 2, 13, 18, 1],
[4, 25, 32, 15, 22, 8, 27],
[32, 9, 10, 16, 17, 26, 12],
[34, 32, 10, 26, 16, 21, 20],
[6, 34, 22, 11, 26, 12, 5],
[29, 17, 25, 15, 3, 6, 5],
[24, 8, 31, 28, 17, 12, 15],
[6, 19, 11, 22, 30, 33, 15]]
If you don't need the resulting lists to be unique, you can simplify this to a one-liner, though it may then produce duplicate lists:
final = [random.sample(range(1,35),k=7) for _ in range(10)]

From a 2D array, create 2nd 2D array of Unique(non-repeated) random selected values from 1st array (values not shared among rows) without using a loop

This is a follow up on this question.
From a 2d array, create another 2d array composed of randomly selected values from original array (values not shared among rows) without using a loop
I am looking for a way to create a 2D array whose rows are randomly selected unique values (non-repeating) from another row, without using a loop.
Here is a way to do it with a loop.
pool = np.random.randint(0, 30, size=[4,5])
seln = np.empty([4,3], int)
for i in range(0, pool.shape[0]):
    seln[i] = np.random.choice(pool[i], 3, replace=False)
print('pool = ', pool)
print('seln = ', seln)
>pool = [[ 1 11 29 4 13]
[29 1 2 3 24]
[ 0 25 17 2 14]
[20 22 18 9 29]]
seln = [[ 8 12 0]
[ 4 19 13]
[ 8 15 24]
[12 12 19]]
Here is a method that does not use a loop; however, it can select the same value multiple times in each row.
pool = np.random.randint(0, 30, size=[4,5])
print(pool)
array([[ 4, 18, 0, 15, 9],
[ 0, 9, 21, 26, 9],
[16, 28, 11, 19, 24],
[20, 6, 13, 2, 27]])
# New array shape
new_shape = (pool.shape[0],3)
# Indices where to randomly choose from
ix = np.random.choice(pool.shape[1], new_shape)
array([[0, 3, 3],
[1, 1, 4],
[2, 4, 4],
[1, 2, 1]])
ixs = (ix.T + range(0,np.prod(pool.shape),pool.shape[1])).T
array([[ 0, 3, 3],
[ 6, 6, 9],
[12, 14, 14],
[16, 17, 16]])
pool.flatten()[ixs].reshape(new_shape)
array([[ 4, 15, 15],
[ 9, 9, 9],
[11, 24, 24],
[ 6, 13, 6]])
I am looking for a method that does not use a loop and in which a value from a row, once selected, cannot be selected again.
Here is a way without explicit looping. However, it requires generating an array of random numbers the size of the original array. That said, the generation is done in compiled code, so it should be pretty fast. Generating two identical random numbers in a row is essentially impossible, and even then argsort still returns a valid permutation of each row's indices, so the selection stays valid.
import numpy as np
m,n = 4,5
pool = np.random.randint(0, 30, size=[m,n])
new_width = 3
mask = np.argsort(np.random.rand(m,n))<new_width
pool[mask].reshape(m,new_width)
How it works:
We generate a random array of floats and argsort it. By default, argsort on a 2D array works along the last axis (axis 1), so each row of the result is a random permutation of 0,...,n-1: the i,j entry is the index within row i of the element that would land in position j if the i-th row were sorted.
We then mark the entries of this array whose values are less than new_width. Each row contains the numbers 0,...,n-1 in a random order, so exactly new_width of them will be less than new_width. This means each row of mask will have exactly new_width entries which are True, and the rest will be False (when you use a comparison operator between an ndarray and a scalar, it is applied element-wise).
Finally, the boolean mask is applied to the original data to grab new_width many entries from each row.
You could also use np.vectorize for your loop solution, although that is just shorthand for a loop.
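A related sketch, using the same argsort trick but with np.take_along_axis instead of a boolean mask, returns the picked values of each row in random order rather than in their original column order. The function name `sample_rows_without_replacement` and the `rng` parameter are made up for this illustration.

```python
import numpy as np

def sample_rows_without_replacement(pool, k, rng=None):
    """Pick k distinct positions from every row of a 2D array, without a Python loop."""
    rng = np.random.default_rng() if rng is None else rng
    # argsort of random floats gives an independent random permutation of column indices per row
    order = np.argsort(rng.random(pool.shape), axis=1)
    # keep the first k column indices of each row and gather the matching values
    return np.take_along_axis(pool, order[:, :k], axis=1)

pool = np.arange(20).reshape(4, 5)
print(sample_rows_without_replacement(pool, 3))
```

If the pool itself contains repeated values in a row, the same value can still show up twice; what is guaranteed is that no position in a row is picked more than once.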

Python evenly distribute value to a list in a dictionary

I have the following code which assigns numbers at random to employees:
from random import choice
emp_numbers = {}
employees = ['empA', 'empB', 'empC', 'empD', 'empE', 'empF']
numbers = 26
for x in employees:
    emp_numbers[x] = []
emp = list(emp_numbers.keys())
for number in range(1, numbers+1):
    emp_name = choice(emp)
    emp_numbers[emp_name].append(number)
print(emp_numbers)
Output:
{'empA': [4, 25], 'empB': [2, 10, 11, 15, 18, 20, 22, 23], 'empC': [5, 13, 21, 24], 'empD': [3, 6, 7, 8, 12, 16, 19, 26], 'empE': [14], 'empF': [1, 9, 17]}
It works great. However, I don't know how to get it to distribute the numbers as evenly as possible. Some employees are getting 2 numbers, some have 8. Any advice on how to get it to do that?
Thanks!
Since you want to assign all numbers, you can randomise the order of numbers instead of employees:
numbers = list(range(1, 27))
random.shuffle(numbers)
You can then use slicing to get an even count of numbers for every employee:
for idx, employee in enumerate(employees):
    emp_numbers[employee] = numbers[idx::len(employees)]
The [start::n] syntax selects every n'th item beginning at start. The first employee gets item 0, 6, 12, ..., the second gets item 1, 7, 13, ..., and so on.
import random
# initial setup
employees = ['empA', 'empB', 'empC', 'empD', 'empE', 'empF']
numbers = list(range(1, 26+1))
# dict to hold assignment and randomly shuffled numbers
employee_numbers = {}
random.shuffle(numbers)
# assign shuffled numbers to employees
for idx, employee in enumerate(employees):
    employee_numbers[employee] = numbers[idx::len(employees)]
# print result
print(employee_numbers)
Assuming the count of numbers is evenly divisible by the number of employees, you could go about it this way:
import random
import collections
employees = ["empA", "empB", "empC", "empD", "empE", "empF"] # employee names
numbers = list(range(1, 27)) # numbers from 1..26
emp_numbers = collections.defaultdict(list) # collects the employee numbers
random.shuffle(numbers) # shuffle the numbers to distribute
for i, number in enumerate(numbers): # get the index of the number and the number
    employee = employees[i % len(employees)] # round-robin over the employees...
    emp_numbers[employee].append(number) # ... and associate a number with a name.
print(emp_numbers)
outputs e.g.
{'empF': [25, 4, 9, 21], 'empD': [2, 10, 3, 11], 'empE': [18, 5, 17, 15], 'empB': [7, 24, 26, 6, 8], 'empC': [1, 14, 13, 12], 'empA': [16, 23, 20, 19, 22]}
If the numbers aren't evenly divisible, some folks will get more numbers than others.
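With a fixed employee order, the extra numbers always go to the employees at the front of the list. If that matters, one option is to shuffle the employee order as well. The sketch below assumes this is acceptable; the function name `distribute_evenly` is made up for the example.

```python
import random

def distribute_evenly(items, names):
    """Randomly assign every item to a name so that group sizes differ by at most one."""
    items = list(items)
    random.shuffle(items)
    order = list(names)
    random.shuffle(order)  # randomise who receives the leftover items
    assignment = {name: [] for name in names}  # keep the original key order
    for pos, name in enumerate(order):
        assignment[name] = items[pos::len(order)]
    return assignment

employees = ['empA', 'empB', 'empC', 'empD', 'empE', 'empF']
print(distribute_evenly(range(1, 27), employees))
```

With 26 numbers and 6 employees, two randomly chosen employees get 5 numbers and the other four get 4.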

How to extract columns from an indexed matrix?

I have the following matrix:
M = np.matrix([[1,2,3,4,5,6,7,8,9,10],
[11,12,13,14,15,16,17,18,19,20],
[21,22,23,24,25,26,27,28,29,30]])
And I receive a vector indexing the columns of the matrix:
index = np.array([1,1,2,2,2,2,3,4,4,4])
This vector has 4 different values, so my objective is to create a list containing four new matrices so that the first matrix is made by the first two columns of M, the second matrix is made by columns 3 to 6 and so on:
M1 = np.matrix([[1,2],[11,12],[21,22]])
M2 = np.matrix([[3,4,5,6],[13,14,15,16],[23,24,25,26]])
M3 = np.matrix([[7],[17],[27]])
M4 = np.matrix([[8,9,10],[18,19,20],[28,29,30]])
l = [M1, M2, M3, M4]
I need to do this in an automated way, since the number of rows and columns of M as well as the indexing scheme are not fixed. How can I do this?
There are 3 points to note:
For a variable number of variables, as in this case, the recommended solution is to use a dictionary.
You can use simple numpy indexing for the individual case.
Unless you have a very specific reason, use numpy.array instead of numpy.matrix.
Combining these points, you can use a dictionary comprehension:
d = {k: np.array(M[:, np.where(index==k)[0]]) for k in np.unique(index)}
Result:
{1: array([[ 1, 2],
[11, 12],
[21, 22]]),
2: array([[ 3, 4, 5, 6],
[13, 14, 15, 16],
[23, 24, 25, 26]]),
3: array([[ 7],
[17],
[27]]),
4: array([[ 8, 9, 10],
[18, 19, 20],
[28, 29, 30]])}
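If a list is preferred over a dictionary, as in the question, a possible sketch is to split M at the column positions where the label changes. This assumes that equal labels are always contiguous in index, as they are in the example; the variable names `boundaries` and `blocks` are made up here.

```python
import numpy as np

M = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
              [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]])
index = np.array([1, 1, 2, 2, 2, 2, 3, 4, 4, 4])

# positions where the label differs from the previous column: [2, 6, 7]
boundaries = np.flatnonzero(np.diff(index)) + 1
blocks = np.split(M, boundaries, axis=1)  # list of sub-arrays, one per label run

for block in blocks:
    print(block.shape)
```

For the example this prints the shapes (3, 2), (3, 4), (3, 1) and (3, 3), matching M1 to M4.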
import numpy as np
M = np.matrix([[1,2,3,4,5,6,7,8,9,10],
[11,12,13,14,15,16,17,18,19,20],
[21,22,23,24,25,26,27,28,29,30]])
index = np.array([1,1,2,2,2,2,3,4,4,4])
m = [[],[],[],[]]
for i,c in enumerate(index):
    m[c-1].append(i) # column i belongs to group c
for idx in m:
    print(M[:,idx])
This is a little hard-coded: I assumed you will always want 4 matrices. You can change it to make it more general.
