Random sampling a large Cartesian product of iterables - python

I have multiple iterables and I need to create the Cartesian product of those iterables and then randomly sample from the resulting pool of tuples. The problem is that the total number of combinations of these iterables is somewhere around 1e19, so I can't possibly load all of this into memory.
What I thought was using itertools.product in combination with a random number generator to skip random number of items, then once I get to the randomly selected item, I perform my calculations and continue until I run out of the generator. The plan was to do something like:
from itertools import product
from random import randint
iterables = () # tuple of 18 iterables
versions = product(iterables)
def do_stuff():
    # do stuff
STEP_SIZE = int(1e6)
# start both counts from 0.
# First value to be taken is start + step
# after that increment start to be equal to count and repeat
start = 0
count = 0
while True:
    try:
        step = randint(1, 100) * STEP_SIZE
        for v in versions:
            # if the count is less than required skip values while incrementing count
            if count < start + step:
                versions.next()
                count += 1
            else:
                do_stuff(*v)
                start = count
    except StopIteration:
        break
Unfortunately, itertools.product objects don't have the next() method, so this doesn't work. What other way is there to go through this large number of tuples and either take a random sample or directly run calculations on the values?

Don't try to generate the Cartesian product. Sample from one iterable at a time to generate your result using random.choice(). The number of elements across all iterables is small, so you can store all the elements in memory directly.
Here's an example using 18 iterables with 10 elements each (as specified in the comment):
import random
iterables = [list(range(i, i + 10)) for i in range(0, 180, 10)]
result = [random.choice(iterable) for iterable in iterables]
print(result)
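The per-iterable choice above can be wrapped in a small helper. This is only a minimal sketch: the name random_product_tuple and the pools argument are hypothetical, and it assumes every iterable fits in memory as a list. Because each component is drawn independently and uniformly, each tuple of the product is equally likely.

```python
import random

def random_product_tuple(pools, rng=random):
    """Return one uniformly random tuple from the Cartesian product of pools."""
    # One independent uniform choice per pool; the product is never built.
    return tuple(rng.choice(pool) for pool in pools)

# 18 pools of 10 elements each, matching the example above
pools = [list(range(i, i + 10)) for i in range(0, 180, 10)]
sample = [random_product_tuple(pools) for _ in range(3)]
for t in sample:
    print(t)
```

Repeated calls may return the same tuple twice; with roughly 1e18 possibilities that is very unlikely, but it is sampling with replacement.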

Which version of Python are you using? Somewhere along the way, .next() methods were deprecated in favor of a new next() built-in function, which works with all iterators. Here, for example, under the currently released 3.10.1:
>>> import itertools
>>> itp = itertools.product(range(5), repeat=6)
>>> next(itp)
(0, 0, 0, 0, 0, 0)
>>> next(itp)
(0, 0, 0, 0, 0, 1)
>>> next(itp)
(0, 0, 0, 0, 0, 2)
>>> next(itp)
(0, 0, 0, 0, 0, 3)
>>> for ignore in range(50):
...     ignore = next(itp)
...
>>> next(itp)
(0, 0, 0, 2, 0, 4)
Beyond that, you didn't show us the most important part of your code: how you build your product.
Without seeing that, I can only guess that it would be far more efficient to make a random choice from the first sequence passed to product(), then another from the second, and so on. Build a random element of the product from one component at a time.
Picking a random product tuple efficiently
Perhaps overkill, but this class shows an especially efficient way to do this. The .index() method maps an integer i to the i'th tuple (0-based) in the product. Then picking a random tuple from the product is simply applying .index() to a random integer in range(total number of elements in the product).
from math import prod
from random import randrange
class RanProduct:
    def __init__(self, iterables):
        self.its = list(map(list, iterables))
        self.n = prod(map(len, self.its))
    def index(self, i):
        if i not in range(self.n):
            raise ValueError(f"index {i} not in range({self.n})")
        result = []
        for it in reversed(self.its):
            i, r = divmod(i, len(it))
            result.append(it[r])
        return tuple(reversed(result))
    def pickran(self):
        return self.index(randrange(self.n))
and then
>>> r = RanProduct(["abc", range(2)])
>>> for i in range(6):
...     print(i, '->', r.index(i))
...
0 -> ('a', 0)
1 -> ('a', 1)
2 -> ('b', 0)
3 -> ('b', 1)
4 -> ('c', 0)
5 -> ('c', 1)
>>> r = RanProduct([range(10)] * 19)
>>> r.pickran()
(3, 5, 8, 8, 3, 6, 7, 6, 8, 6, 2, 0, 5, 6, 1, 0, 0, 8, 2)
>>> r.pickran()
(4, 5, 0, 5, 7, 1, 7, 2, 7, 4, 8, 4, 2, 0, 2, 9, 3, 6, 2)
>>> r.pickran()
(8, 7, 4, 1, 3, 0, 4, 6, 4, 3, 9, 8, 5, 8, 9, 9, 7, 1, 8)
>>> r.pickran()
(8, 6, 6, 0, 6, 7, 1, 3, 9, 5, 1, 4, 5, 8, 6, 8, 4, 9, 9)
>>> r.pickran()
(4, 9, 4, 7, 1, 5, 5, 1, 6, 7, 1, 8, 9, 0, 7, 9, 1, 7, 0)
>>> r.pickran()
(3, 0, 3, 9, 8, 6, 3, 0, 3, 0, 9, 9, 3, 5, 2, 3, 7, 8, 8)
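Independent random picks can, in principle, return the same tuple twice. If a sample of distinct tuples is needed, one option is to draw distinct integer indices first and decode each one with the same divmod scheme. The following is a sketch under that assumption; sample_distinct and decode are hypothetical names, and it expects k to be small relative to the size of the product so that the retry loop finishes quickly.

```python
from math import prod
from random import randrange

def decode(pools, i):
    """Map integer i to the i-th tuple (0-based) of the product of pools."""
    result = []
    for pool in reversed(pools):
        i, r = divmod(i, len(pool))
        result.append(pool[r])
    return tuple(reversed(result))

def sample_distinct(iterables, k):
    """Return k distinct tuples from the product, in arbitrary order."""
    pools = [list(it) for it in iterables]
    total = prod(len(pool) for pool in pools)
    if k > total:
        raise ValueError("sample larger than the product")
    chosen = set()
    while len(chosen) < k:
        # duplicates are simply absorbed by the set and drawn again
        chosen.add(randrange(total))
    return [decode(pools, i) for i in chosen]

picked = sample_distinct([range(10)] * 19, 5)
for t in picked:
    print(t)
```

Only the k chosen indices are stored, so memory use does not depend on the size of the product.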

Related

Fast python algorithm to find all possible partitions from a list of numbers that has subset sums equal to given ratios

Say I have a list of 20 random integers from 0 to 9. I want to divide the list into N subsets so that the ratio of the subset sums equals given values, and I want to find all possible partitions. I wrote the following code and got it to work for the N = 2 case.
import random
import itertools
#lst = [random.randrange(10) for _ in range(20)]
lst = [2, 0, 1, 7, 2, 4, 9, 7, 6, 0, 5, 4, 7, 4, 5, 0, 4, 5, 2, 3]
def partition_sum_with_ratio(numbers, ratios):
    target1 = round(int(sum(numbers) * ratios[0] / (ratios[0] + ratios[1])))
    target2 = sum(numbers) - target1
    p1 = [seq for i in range(len(numbers), 0, -1) for seq in
          itertools.combinations(numbers, i) if sum(seq) == target1
          and sum([s for s in numbers if s not in seq]) == target2]
    p2 = [tuple(n for n in numbers if n not in seq) for seq in p1]
    return list(zip(p1, p2))
partitions = partition_sum_with_ratio(lst, ratios=[4, 3])
print(partitions[0])
Output:
((2, 0, 1, 2, 4, 6, 0, 5, 4, 4, 5, 0, 4, 5, 2), (7, 9, 7, 7, 3))
If you calculate the sum of each subset, you will find the ratio is 44 : 33 = 4 : 3, which are exactly the input values. However, I want the function to work for any number of subsets. For example, I expect
partition_sum_with_ratio(lst, ratios=[4, 3, 3])
to return something like
((2, 0, 1, 2, 4, 6, 0, 5, 4, 4, 3), (5, 0, 4, 5, 2, 7), (9, 7, 7))
I have been thinking about this problem for a month and have found it extremely hard. My conclusion is that this problem can only be solved by recursion. I would like to know if there is any relatively fast algorithm for this. Any suggestions?
Yes, recursion is called for. The basic logic is to do a bipartition into one part and the rest and then recursively split the rest in all possible ways. I've followed your lead in assuming that everything is distinguishable, which creates a lot of possibilities, possibly too many to enumerate. Nevertheless:
import itertools
def totals_from_ratios(sum_numbers, ratios):
    sum_ratios = sum(ratios)
    totals = [(sum_numbers * ratio) // sum_ratios for ratio in ratios]
    residues = [(sum_numbers * ratio) % sum_ratios for ratio in ratios]
    for i in sorted(
        range(len(ratios)), key=lambda i: residues[i] * ratios[i], reverse=True
    )[: sum_numbers - sum(totals)]:
        totals[i] += 1
    return totals
def bipartitions(numbers, total):
    n = len(numbers)
    for k in range(n + 1):
        for combo in itertools.combinations(range(n), k):
            if sum(numbers[i] for i in combo) == total:
                set_combo = set(combo)
                yield sorted(numbers[i] for i in combo), sorted(
                    numbers[i] for i in range(n) if i not in set_combo
                )
def partitions_into_totals(numbers, totals):
    assert totals
    if len(totals) == 1:
        yield [numbers]
    else:
        for first, remaining_numbers in bipartitions(numbers, totals[0]):
            for rest in partitions_into_totals(remaining_numbers, totals[1:]):
                yield [first] + rest
def partitions_into_ratios(numbers, ratios):
    totals = totals_from_ratios(sum(numbers), ratios)
    yield from partitions_into_totals(numbers, totals)
lst = [2, 0, 1, 7, 2, 4, 9, 7, 6, 0, 5, 4, 7, 4, 5, 0, 4, 5, 2, 3]
for part in partitions_into_ratios(lst, [4, 3, 3]):
    print(part)
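The first step, turning ratios into integer target sums, is worth seeing on its own, because 77 does not divide evenly in the ratio 4 : 3 : 3. The sketch below uses a plain largest-remainder rule, which is a simplified stand-in for the tie-breaking key used in totals_from_ratios above; integer_targets is a hypothetical name.

```python
def integer_targets(total, ratios):
    """Split total into integer targets roughly proportional to ratios."""
    ratio_sum = sum(ratios)
    # Floor of each exact share, plus what the flooring threw away.
    floors = [total * r // ratio_sum for r in ratios]
    remainders = [total * r % ratio_sum for r in ratios]
    leftover = total - sum(floors)
    # Hand the leftover units to the shares with the largest remainders.
    order = sorted(range(len(ratios)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        floors[i] += 1
    return floors

print(integer_targets(77, [4, 3, 3]))  # [31, 23, 23]
print(integer_targets(77, [4, 3]))     # [44, 33]
```

For 77 and 4 : 3 : 3 the exact shares are 30.8, 23.1 and 23.1, so the floors sum to 76 and the single leftover unit goes to the first share.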

How to do an infinite loop from k to a given number and back to zero?

I want something like (if starting from 2):
2, 3, ..., n-1, n, n-1, ..., 3, 2, 1, 0, 1, 2, 3, ... (forever)
Is there something simpler than this?:
def print_numbers_forth_and_back_forever(first_number, upper_limit):
    for i in range(first_number, upper_limit):
        print(i)
    while True:
        # back from n to 0
        for i in reversed(range(0, upper_limit+1)):
            print(i)
        # 1 to n-1
        for i in range(1, upper_limit):
            print(i)
print_numbers_forth_and_back_forever(4, 10)
The following using itertools will provide a generator that should fulfil the requirements - simply alter the arguments of range(a, b) to alter the output:
from itertools import cycle, chain
r = range(10)
r_reversed = reversed(r[1:-1])
gen = cycle(chain(r, r_reversed))
Output:
>>> from itertools import islice
>>> list(islice(gen, 20))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1]
To start at an arbitrary integer, use the start argument of islice on a freshly created gen (the call above has already consumed its first 20 items):
>>> list(islice(gen, 4, 20))
[4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1]
The same as you did, but with a generator:
# you can add start argument to replace 0
def gen_infinite_bounce(first, end):
    yield from range(first, end)
    while True:
        yield from range(end, 0, -1)
        yield from range(0, end)
first = 2 # first Value
end = 5 # end bounce
# you can create a variable start to replace 0
for i in gen_infinite_bounce(first, end):
    print(i)
You can also do it with itertools:
first = 2 # first Value
end = 5 # end bounce
# you can create a variable start to replace 0
from itertools import chain, cycle
for i in chain(
    range(first, end), cycle(
        chain(
            range(end, 0, -1),
            range(0, end)
        )
    )
):
    print(i)
You can use itertools with range() objects to efficiently create the output you want:
from itertools import cycle, chain, islice
def cycle_fwd_back(first, last):
    yield from range(first, last)
    yield from cycle(chain(reversed(range(last + 1)), range(1, last)))
# First 30
print(list(islice(cycle_fwd_back(4, 10), 30)))
Outputs:
[4, 5, 6, 7, 8, 9, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 8, 7]
You can use the code below:
def print_numbers_forth_and_back_forever(first_number, upper_limit):
    inc = 1
    i = first_number
    while True:
        if i == upper_limit:
            inc = -1
        elif i == 0:
            inc = 1
        print(i, end=' ')
        i += inc
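The same direction-flipping idea can be written as a generator, so the caller decides how many values to consume instead of printing forever. This is a sketch, not a drop-in replacement: bounce is a hypothetical name, and it assumes 0 <= first <= upper and upper >= 1.

```python
from itertools import islice

def bounce(first, upper):
    """Yield first, ..., upper, upper-1, ..., 0, 1, ... forever."""
    i = first
    step = 1
    while True:
        yield i
        # Flip direction at either end of the range.
        if i >= upper:
            step = -1
        elif i <= 0:
            step = 1
        i += step

first_values = list(islice(bounce(2, 5), 12))
print(first_values)  # [2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3]
```

Using islice keeps the example finite; a plain for loop over bounce(2, 5) would run indefinitely.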

Consecutive numbers list where each number repeats

How can I create a list of consecutive numbers where each number repeats N times, for example:
list = [0,0,0,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5]
Another idea, without any need for other packages or sums:
[x//N for x in range((M+1)*N)]
Where N is your number of repeats and M is the maximum value to repeat. E.g.
N = 3
M = 5
[x//N for x in range((M+1)*N)]
yields
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]
My first instinct is to get some functional help from the funcy package. If N is the number of times to repeat each value, and M is the maximum value to repeat, then you can do
import funcy as fp
fp.flatten(fp.repeat(i, N) for i in range(M + 1))
This will return a generator, so to get the array you can just call list() around it
sum([[i]*n for i in range(0,x)], [])
The following piece of code is the simplest version I can think of.
It’s a bit dirty and long, but it gets the job done.
In my opinion, it’s easier to comprehend.
def mklist(s, n):
    l = [] # An empty list that will contain the list of elements
           # and their duplicates.
    for i in range(s): # We iterate from 0 to s
        for j in range(n): # and append each element (i) to l n times.
            l.append(i)
    return l # Finally we return the list.
If you run the code …:
print(mklist(10, 2))
[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9]
print(mklist(5, 3))
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
Another version, a little neater, that repeats the whole range.
But uhmmm… We have to sort it though.
def mklist2(s, n):
    return sorted(list(range(s)) * n)
Running that version will give the following results.
print(mklist2(5, 3))
Raw : [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
Sorted: [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
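For comparison, the standard library alone can produce the same list without sorting or nested loops. A minimal sketch, where repeat_each is a hypothetical name, m is the maximum value and n is the number of repeats:

```python
from itertools import chain, repeat

def repeat_each(m, n):
    """Return [0]*n + [1]*n + ... + [m]*n as one flat list."""
    # repeat(i, n) yields i exactly n times; chain.from_iterable flattens lazily.
    return list(chain.from_iterable(repeat(i, n) for i in range(m + 1)))

print(repeat_each(5, 3))
# [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]
```

Unlike sum(..., []), which copies the growing list on every addition, this builds the result in a single pass.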

Efficient way to look in list of lists?

I am continuously creating a randomly generated list, New_X of size 10, based on 500 columns.
Each time I create a new list, it must be unique, and my function NewList only returns New_X once it hasn't already been created and appended to a List_Of_Xs
def NewList(Old_List):
    end = True
    while end == True:
        """ Here is code that generates my new sorted list, it is a combination of elements
        from Old_List and the other remaining columns,
        but the details aren't necessary for this question. """
        end = (List_Of_Xs == np.array([New_X])).all(axis=1).any()
    List_Of_Xs.append(New_X)
    return New_X
My question is, is the line end = (List_Of_Xs == np.array([New_X])).all(axis=1).any() an efficient way of looking in List_Of_Xs?
My List_Of_Xs can grow to a size of over 100,000 lists long, so I am unsure if this is inefficient or not.
Any help would be appreciated!
As I observed in a comment, the array comparison is potentially quite slow, especially as the list gets large. It has to create arrays each time, which consumes time.
Here's a set implementation
Function to create a 10 element list:
def foo(N=10):
    return np.random.randint(0,10,N).tolist()
Function to generate lists, and print the unique ones
def foo1(m=10):
    Set_of_Xs = set()
    while len(Set_of_Xs)<m:
        NewX = foo(10)
        tx = tuple(NewX)
        if tx not in Set_of_Xs:
            print(NewX)
            Set_of_Xs.add(tx)
    return Set_of_Xs
Sample run. As written it doesn't show if there are duplicates.
In [214]: foo1(5)
[9, 4, 3, 0, 9, 4, 9, 5, 6, 3]
[1, 8, 0, 3, 0, 0, 4, 0, 0, 5]
[6, 7, 2, 0, 6, 9, 0, 7, 0, 8]
[9, 5, 6, 3, 3, 5, 6, 9, 6, 9]
[9, 2, 6, 0, 2, 7, 2, 0, 0, 4]
Out[214]:
{(1, 8, 0, 3, 0, 0, 4, 0, 0, 5),
(6, 7, 2, 0, 6, 9, 0, 7, 0, 8),
(9, 2, 6, 0, 2, 7, 2, 0, 0, 4),
(9, 4, 3, 0, 9, 4, 9, 5, 6, 3),
(9, 5, 6, 3, 3, 5, 6, 9, 6, 9)}
So let me get this straight since the code doesn't appear complete:
1. You have an old list that is constantly growing with each iteration
2. You calculate a list
3. You compare it against each of the lists in the old list to see if you should break the loop?
One option is to store the lists in a set instead of a List of List.
Comparing a candidate against every element of a list is an O(n) operation on each iteration. With a set, the membership test should be O(1) on average, although you may be getting O(n) every iteration until the last.
Other thoughts would be to calculate the md5 of each element and compare those so you're not comparing the full lists.
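To make the set suggestion concrete, here is a sketch of a retry loop shaped like the asker's NewList. The names new_unique_list and make_candidate are hypothetical, and make_candidate is only a stand-in for the real generation code, which the question does not show. Lists are converted to tuples because lists are not hashable.

```python
import random

def new_unique_list(seen, make_candidate):
    """Keep generating candidates until one has not been seen before."""
    while True:
        candidate = make_candidate()
        key = tuple(candidate)      # hashable form for the set lookup
        if key not in seen:         # O(1) on average
            seen.add(key)
            return candidate

def make_candidate():
    # Stand-in assumption: a sorted pick of 10 column indices out of 500.
    return sorted(random.sample(range(500), 10))

seen = set()
lists = [new_unique_list(seen, make_candidate) for _ in range(1000)]
print(len(lists), len(seen))
```

The set replaces List_Of_Xs for the membership test; if the order of creation matters, a separate list can be kept alongside it.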

How to split an integer into a list of digits?

Suppose I have an input integer 12345. How can I split it into a list like [1, 2, 3, 4, 5]?
Convert the number to a string so you can iterate over it, then convert each digit (character) back to an int inside a list-comprehension:
>>> [int(i) for i in str(12345)]
[1, 2, 3, 4, 5]
Return the digits as a list of strings:
>>> list(str(12345))
['1', '2', '3', '4', '5']
Return the digits as a list of integers:
>>> list(map(int, str(12345)))
[1, 2, 3, 4, 5]
I'd rather not turn an integer into a string, so here's the function I use for this:
def digitize(n, base=10):
    if n == 0:
        yield 0
    while n:
        n, d = divmod(n, base)
        yield d
Examples:
tuple(digitize(123456789)) == (9, 8, 7, 6, 5, 4, 3, 2, 1)
tuple(digitize(0b1101110, 2)) == (0, 1, 1, 1, 0, 1, 1)
tuple(digitize(0x123456789ABCDEF, 16)) == (15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
As you can see, this will yield digits from right to left. If you'd like the digits from left to right, you'll need to create a sequence out of it, then reverse it:
reversed(tuple(digitize(x)))
You can also use this function for base conversion as you split the integer. The following example splits a hexadecimal number into binary nibbles as tuples:
import itertools as it
tuple(it.zip_longest(*[digitize(0x123456789ABCDEF, 2)]*4, fillvalue=0)) == ((1, 1, 1, 1), (0, 1, 1, 1), (1, 0, 1, 1), (0, 0, 1, 1), (1, 1, 0, 1), (0, 1, 0, 1), (1, 0, 0, 1), (0, 0, 0, 1), (1, 1, 1, 0), (0, 1, 1, 0), (1, 0, 1, 0), (0, 0, 1, 0), (1, 1, 0, 0), (0, 1, 0, 0), (1, 0, 0, 0))
Note that this method doesn't handle decimals, but could be adapted to.
[int(i) for i in str(number)]
or, if you do not want to use a list comprehension or you want to use a base different from 10
from __future__ import division # for compatibility of // between Python 2 and 3
def digits(number, base=10):
    assert number >= 0
    if number == 0:
        return [0]
    l = []
    while number > 0:
        l.append(number % base)
        number = number // base
    return l
While list(map(int, str(x))) is the Pythonic approach, you can formulate logic to derive digits without any type conversion:
from math import log10
def digitize(x):
    n = int(log10(x))
    for i in range(n, -1, -1):
        factor = 10**i
        k = x // factor
        yield k
        x -= k * factor
res = list(digitize(5243))
[5, 2, 4, 3]
One benefit of a generator is you can feed seamlessly to set, tuple, next, etc, without any additional logic.
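An integer-only variant is sketched below for comparison. It counts digits with divmod instead of log10, so it also accepts 0 (where log10 raises ValueError) and does not rely on floating-point rounding, which may misjudge the digit count for very large integers. The name digits_left_to_right is hypothetical, and non-negative input is assumed.

```python
def digits_left_to_right(x, base=10):
    """Return the digits of a non-negative integer, most significant first."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return [0]
    out = []
    while x:
        # Peel off the least significant digit each round.
        x, d = divmod(x, base)
        out.append(d)
    out.reverse()
    return out

print(digits_left_to_right(5243))  # [5, 2, 4, 3]
print(digits_left_to_right(0))     # [0]
```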
Like #nd says, but using the built-in int() function to convert from a different base:
>>> [ int(i,16) for i in '0123456789ABCDEF' ]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
>>> [int(i,2) for i in "100 010 110 111".split()]
[4, 2, 6, 7]
Another solution that does not involve converting to/from strings:
from math import log10
def decompose(n):
    if n == 0:
        return [0]
    b = int(log10(n)) + 1
    return [(n // (10 ** i)) % 10 for i in reversed(range(b))]
Using join and split methods of strings:
>>> a=12345
>>> list(map(int,' '.join(str(a)).split()))
[1, 2, 3, 4, 5]
>>> [int(i) for i in ' '.join(str(a)).split()]
[1, 2, 3, 4, 5]
>>>
Here we also use map or a list comprehension to get a list.
Strings are just as iterable as arrays, so just convert it to string:
str(12345)
Simply turn it into a string, iterate over its characters, and turn each one back into an integer in an array:
import numpy as np
nums = []
c = 12345
for i in str(c):
    nums.append(int(i))
np.array(nums)
