Constraining frequencies when building a list of lists of integers - python

I am trying to write a function which will return a list of lists of integers corresponding to the pools that I will put chemicals in. I want to keep the number of chemicals in each pool as uniform as possible. Each chemical is replicated some number of times across pools (in this example, 3 times across 9 pools). For the example below, I have 31 chemicals. Thus, each pool should have 10.333 chemicals in it (or, more precisely, each pool should have floor(93/9) = 10 chemicals, with 93%9 = 3 pools having 11). My function for doing so is below. Currently, I'm trying to get the function to loop until there is one set of integers left to assign (i.e. 3 pools with 9 chemicals), so that I can code the function to recognize which pools are allowed one more chemical and finalize the list of lists that tells me which pools to put each chemical in.
However, as written, the function will not always give my desired distribution of 11,11,11,10,10,10,9,9,9 for the frequencies of the integers appearing in the list of lists. I've written the following to attempt to constrain the distribution:
1) Randomly select, without replacement, a list of bits (pool numbers). If any of the bits in the selected list has frequency >= 10 in the output list and I already have 3 pools with frequency 11, discard this list of bits.
2) If any of the bits in the selected list has frequency >= 9 in the output list and there are already 6 pools with frequency >= 10, discard this list of bits.
3) If any of the bits in the selected list has frequency >= 11 in the output list, discard this list of bits.
This bit of code doesn't seem to be working properly. I'm thinking it's related to me improperly coding these three conditions: it appears that some lists of bits are accidentally discarded while others are improperly added to the output list. Alternatively, there could be a scenario in which two pools go from 9 to 10 chemicals in the same step, resulting in 4 pools of 10 instead of 3. Am I thinking about this problem wrong? Is there an obvious place where my code isn't working?
The function for generating normalized pools:
(overlapping_kbits returns a list of lists of bits, each of length replicates, with each bit being an integer in the range [1, pools], filtered such that no two lists share more than overlaps bits.)
import numpy as np
import pandas as pd
import itertools
import re
import math
from collections import Counter
def normalized_pool(pools, replicates, overlaps, ligands):
    solvent_bits = [list(bits) for bits in itertools.combinations(range(pools), replicates)]
    print(len(solvent_bits))
    total_items = ligands*replicates
    norm_freq = math.floor(total_items/pools)
    num_extra = total_items%pools
    num_norm = pools-3
    normed_bits = []
    count_extra = 0
    count_norm = 0
    while len(normed_bits) < ligands-1 and len(solvent_bits) > 0:
        rand = np.random.randint(0, len(solvent_bits))
        bits = solvent_bits.pop(rand)  #Sample without replacement
        print(bits)
        bin_freqs = Counter(itertools.chain.from_iterable(normed_bits))
        print(bin_freqs)
        previous = len(normed_bits)
        #Constrain the frequency distribution
        count_extra = len([bin_freqs[bit] for bit in bin_freqs.keys() if bin_freqs[bit] >= norm_freq+1])
        count_norm = len([bin_freqs[bit] for bit in bin_freqs.keys() if bin_freqs[bit] >= norm_freq])
        if any(bin_freqs[bit] >= norm_freq for bit in bits) and count_extra == num_extra:
            print('rejected')
            continue  #i.e. only allow num_extra number of bits to have a frequency higher than norm_freq
        elif any(bin_freqs[bit] >= norm_freq+1 for bit in bits):
            print('rejected')
            continue  #i.e. never allow any bit to be greater than norm_freq+1
        elif (any(bin_freqs[bit] >= norm_freq-1 for bit in bits) and count_norm >= num_norm):
            if count_extra == num_extra:
                print('rejected')
                continue  #only num_norm bins can have norm_freq
        normed_bits.append(bits)
        bin_freqs = Counter(itertools.chain.from_iterable(normed_bits))
    return normed_bits
test_bits = normalized_pool(9,3,2,31)
test_freqs = Counter(itertools.chain.from_iterable(test_bits))
print(test_freqs)
print(len(test_bits))
I can get anything from 11,11,11,10,10,10,9,9,9 (my desired output) to 11,11,11,10,10,10,10,10,7. For a minimal example, try:
test_bits = normalized_pool(7,3,2,10)
test_freqs = Counter(itertools.chain.from_iterable(test_bits))
print(test_freqs)
Which should return 5,5,4,4,3,3,3 as the elements of the test_freqs Counter.
EDIT: Modified the function so it can run from being copied and pasted. Merged the function call into the larger block of code since it was being overlooked.


Justification of constants used in random.sample

I'm looking into the source code for the function sample in random.py (python standard library).
The idea is simple:
If a small sample (k) is needed from a large population (n): Just pick k random indices, since it is unlikely you'll pick the same number twice as the population is so large. And if you do, just pick again.
If a relatively large sample (k) is needed, compared to the total population (n): It is better to keep track of what you have picked.
My Question
There are a few constants involved, setsize = 21 and setsize += 4 ** _ceil(_log(3*k, 4)). The critical ratio is roughly k : 21+3k. The comments say # size of a small set minus size of an empty list and # table size for big sets.
Where have these specific numbers come from? What is their justification?
The comments shed some light, however I find they bring as many questions as they answer.
I can kind of understand "size of a small set", but I find the "minus size of an empty list" part confusing. Can someone shed any light on this?
What is meant specifically by "table" size, as opposed to, say, "set" size?
Looking at the GitHub repository, it looks like a very old version simply used k : 6*k as the critical ratio, but I find that equally mysterious.
The code
def sample(self, population, k):
    """Chooses k unique random elements from a population sequence or set.

    Returns a new list containing elements from the population while
    leaving the original population unchanged. The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples. This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique. If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use range as an argument.
    This is especially fast and space efficient for sampling from a
    large population: sample(range(10000000), 60)
    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection. For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    if isinstance(population, _Set):
        population = tuple(population)
    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
    randbelow = self._randbelow
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("Sample larger than population or is negative")
    result = [None] * k
    setsize = 21  # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4))  # table size for big sets
    if n <= setsize:
        # An n-length list is smaller than a k-length set
        pool = list(population)
        for i in range(k):  # invariant: non-selected at [0,n-i)
            j = randbelow(n-i)
            result[i] = pool[j]
            pool[j] = pool[n-i-1]  # move non-selected item into vacancy
    else:
        selected = set()
        selected_add = selected.add
        for i in range(k):
            j = randbelow(n)
            while j in selected:
                j = randbelow(n)
            selected_add(j)
            result[i] = population[j]
    return result
(I apologise if this question would be better placed on math.stackexchange. I couldn't think of any probability/statistics-y reason for this particular ratio, and the comments sounded as though it was maybe something to do with the amount of space that sets and lists use - but I couldn't find any details anywhere.)
This code is attempting to determine whether using a list or a set would take more space (instead of trying to estimate the time cost, for some reason).
It looks like 21 was the difference between the size of an empty list and a small set on the Python build this constant was determined on, expressed in multiples of the size of a pointer. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes:
>>> sys.getsizeof(set()) - sys.getsizeof([])
160
and comparing the 3.6.3 list and set struct definitions to the list and set definitions from the change that introduced this code, 21 seems plausible.
I said "the difference between the size of an empty list and a small set" because both now and at the time, small sets used a hash table contained inside the set struct itself instead of externally allocated:
setentry smalltable[PySet_MINSIZE];
The
if k > 5:
    setsize += 4 ** _ceil(_log(k * 3, 4))  # table size for big sets
check adds the size of the external table allocated for sets larger than 5 elements, with size again expressed in number of pointers. This computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is exact.
Finally,
if n <= setsize:
compares the base overhead of a set plus any space used by an external hash table to the n pointers required by a list of the input elements. (It doesn't seem to account for the overallocation performed by list(population), so it may be underestimating the cost of the list.)
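As a small illustration of the comparison being made (a sketch; the exact numbers depend on the interpreter build), you can express the set-versus-list overhead in pointer-sized units yourself and recompute the threshold:
import sys
from math import ceil, log

POINTER_SIZE = 8  # bytes on a 64-bit build (assumed for this sketch)

# Base overhead of a small set relative to an empty list, in pointer units;
# this is what the setsize = 21 constant approximates.
print((sys.getsizeof(set()) - sys.getsizeof([])) // POINTER_SIZE)  # 20 on CPython 3.6.3

def estimated_setsize(k):
    # Re-creation of the estimate from sample(): base overhead plus the
    # external hash table used once a set grows past 5 elements.
    setsize = 21
    if k > 5:
        setsize += 4 ** ceil(log(k * 3, 4))
    return setsize

print(estimated_setsize(60))  # sample() compares this against n to pick list vs set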

Fastest way to sort string to match second string - only adjacent swaps allowed

I want to get the minimum number of letter-swaps needed to convert one string to match a second string. Only adjacent swaps are allowed.
Inputs are: length of strings, string_1, string_2
Some examples:
Length | String 1 | String 2 | Output
-------+----------+----------+-------
3 | ABC | BCA | 2
7 | AABCDDD | DDDBCAA | 16
7 | ZZZAAAA | ZAAZAAZ | 6
Here's my code:
def letters(number, word_1, word_2):
    result = 0
    while word_1 != word_2:
        index_of_letter = word_1.find(word_2[0])
        result += index_of_letter
        word_1 = word_1.replace(word_2[0], '', 1)
        word_2 = word_2[1:]
    return result
It gives the correct results, but the calculation should stay under 20 seconds.
Here are two sets of input data (1 000 000 characters long strings): https://ufile.io/8hp46 and https://ufile.io/athxu.
On my setup the first one is executed in around 40 seconds and the second in 4 minutes.
How to calculate the result in less than 20 seconds?
@KennyOstrom's answer is 90% there. The inversion count is indeed the right angle to look at this problem.
The only bit that is missing is that we need a "relative" inversion count, meaning the number of inversions not to get to normal sort order but to the other word's order. We therefore need to compute the permutation that stably maps word1 to word2 (or the other way round), and then compute the inversion count of that. Stability is important here, because obviously there will be lots of nonunique letters.
Here is a numpy implementation that takes only a second or two for the two large examples you posted. I did not test it extensively, but it does agree with @trincot's solution on all test cases. For the two large pairs it finds 1819136406 and 480769230766.
import numpy as np

_, word1, word2 = open("lit10b.in").read().split()

word1 = np.frombuffer(word1.encode('utf8')
                      + (((1<<len(word1).bit_length()) - len(word1))*b'Z'),
                      dtype=np.uint8)
word2 = np.frombuffer(word2.encode('utf8')
                      + (((1<<len(word2).bit_length()) - len(word2))*b'Z'),
                      dtype=np.uint8)
n = len(word1)

o1 = np.argsort(word1, kind='mergesort')
o2 = np.argsort(word2, kind='mergesort')
o1inv = np.empty_like(o1)
o1inv[o1] = np.arange(n)
order = o2[o1inv]

sum_ = 0
for i in range(1, len(word1).bit_length()):
    order = np.reshape(order, (-1, 1<<i))
    oo = np.argsort(order, axis=-1, kind='mergesort')
    ioo = np.empty_like(oo)
    ioo[np.arange(order.shape[0])[:, None], oo] = np.arange(1<<i)
    order[...] = order[np.arange(order.shape[0])[:, None], oo]
    hw = 1<<(i-1)
    sum_ += ioo[:, :hw].sum() - order.shape[0] * (hw-1)*hw // 2

print(sum_)
Your algorithm runs in O(n²) time:
The find() call will take O(n) time
The replace() call will create a complete new string which takes O(n) time
The outer loop executes O(n) times
As others have stated, this can be solved by counting inversions using merge sort, but in this answer I try to stay close to your algorithm, keeping the outer loop and result += index_of_letter, but changing the way index_of_letter is calculated.
The improvement can be done as follows:
preprocess the word_1 string and note the first position of each distinct letter in word_1 in a dict keyed by these letters. Link each letter with its next occurrence. I think it is most efficient to create one list for this, having the size of word_1, where at each index you store the index of the next occurrence of the same letter. This way you have a linked list for each distinct letter. This preprocessing can be done in O(n) time, and with it you can replace the find call with an O(1) lookup. Every time you do this, you remove the matched letter from the linked list, i.e. the index in the dict moves to the index of the next occurrence.
The previous change will give the absolute index, not taking into account the removals of letters that you have in your algorithm, so this will give wrong results. To solve that, you can build a binary tree (also during preprocessing), where each node represents an index in word_1, and which gives the actual number of non-deleted letters preceding a given index (including itself as well if not deleted yet). The nodes in the binary tree never get deleted (that might be an idea for a variant solution), but the counts get adjusted to reflect a deletion of a character. At most O(log n) nodes need to get a decremented value upon such a deletion. But apart from that no string would be rebuilt like with replace. This binary tree could be represented as a list, corresponding to nodes in in-order sequence. The values in the list would be the numbers of non-deleted letters preceding that node (including itself).
The initial binary tree could be depicted as follows:
The numbers in the nodes reflect the number of nodes at their left side, including themselves. They are stored in the numLeft list. Another list parent precalculates at which indexes the parents are located.
The actual code could look like this:
def letters(word_1, word_2):
    size = len(word_1)  # No need to pass size as argument
    # Create a binary tree for word_1, organised as a list
    # in in-order sequence, and with the values equal to the number of
    # non-matched letters in the range up to and including the current index:
    treesize = (1<<size.bit_length()) - 1
    numLeft = [(i >> 1 ^ ((i + 1) >> 1)) + 1 for i in range(0, treesize)]
    # Keep track of parents in this tree (could probably be simpler, I welcome comments).
    parent = [(i & ~((i^(i+1)) + 1)) | (((i ^ (i+1))+1) >> 1) for i in range(0, treesize)]
    # Create a linked list for each distinct character
    next = [-1] * size
    head = {}
    for i in range(len(word_1)-1, -1, -1):  # go backwards
        c = word_1[i]
        # Add index at front of the linked list for this character
        if c in head:
            next[i] = head[c]
        head[c] = i
    # Main loop counting number of swaps needed for each letter
    result = 0
    for i, c in enumerate(word_2):
        # Extract next occurrence of this letter from linked list
        j = head[c]
        head[c] = next[j]
        # Get number of preceding characters with a binary tree lookup
        p = j
        index_of_letter = 0
        while p < treesize:
            if p >= j:  # On or at right?
                numLeft[p] -= 1  # Register that a letter has been removed at left side
            if p <= j:  # On or at left?
                index_of_letter += numLeft[p]  # Add the number of left-side letters
            p = parent[p]  # Walk up the tree
        result += index_of_letter
    return result
This runs in O(n log n), where the log n factor is provided by the upwards walk in the binary tree.
I tested on thousands of random inputs, and the above code produces the same results as your code in all cases. But... it runs a lot faster on the larger inputs.
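As a quick check, running it on the examples from the question's table should reproduce the expected outputs:
print(letters('ABC', 'BCA'))          # 2
print(letters('AABCDDD', 'DDDBCAA'))  # 16
print(letters('ZZZAAAA', 'ZAAZAAZ'))  # 6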
I am going by the assumption that you just want to find the number of swaps, quickly, without needing to know what exactly to swap.
Google how to count inversions. It is often taught with merge sort. Several of the results are on Stack Overflow, like "Merge sort to count split inversions in Python".
Inversions are the number of adjacent swaps to get to a sorted string.
Count the inversions in string 1.
Count the inversions in string 2.
Error edited out here; see the correction in the correct answer. I would normally just delete a wrong answer, but this answer is referenced in the correct answer.
It makes sense, and it happens to work for all three of your small test cases, so I'm going to just assume this is the answer you want.
Using some code that I happen to have lying around from retaking some algorithms classes on free online classes (for fun):
print (week1.count_inversions('ABC'), week1.count_inversions('BCA'))
print (week1.count_inversions('AABCDDD'), week1.count_inversions('DDDBCAA'))
print (week1.count_inversions('ZZZAAAA'), week1.count_inversions('ZAAZAAZ'))
0 2
4 20
21 15
That lines up with the values you gave above: 2, 16, and 6.
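For reference, since the week1.count_inversions helper isn't shown, a minimal merge-sort based inversion counter looks roughly like this (a generic sketch, not the class code used above, so its absolute counts may differ from the numbers printed there):
def count_inversions(s):
    # Count pairs (i, j) with i < j and s[i] > s[j] using merge sort.
    def sort_count(seq):
        if len(seq) <= 1:
            return list(seq), 0
        mid = len(seq) // 2
        left, a = sort_count(seq[:mid])
        right, b = sort_count(seq[mid:])
        merged, i, j, cross = [], 0, 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
                cross += len(left) - i  # everything remaining in left is inverted with right[j]
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, a + b + cross
    return sort_count(list(s))[1]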

How to extend the iteration process into large size in Python 3

After tossing a coin n times (n=10), there are 2^10 = 1024 possible outcomes. I used
lst = [list(i) for i in itertools.product([0, 1], repeat=n)]
to obtain all possible outcomes for n=10, and then I want to find out the number of groups (m), which is defined as the number of maximal consecutive runs of the same side of the coin. For example, for HHHTTHTTHT the number of groups is 6: (HHH)(TT)(H)(TT)(H)(T). I employed
group=[len([len(list(grp)) for k, grp in groupby(x)]) for x in lst]
to find out the number of groups (m) for each combination of 10 coin tosses. Finally, I obtain the number of possible combinations whose number of groups is greater than m (here 6), such as
Group6=len(list(filter(lambda x:x>m,group)))
However, when the number of tosses increases, for instance (n=200, m=110) or (n=500, m=260), the same code becomes very time-consuming and I think it exceeds Python's memory limits. Could someone please help me figure out how to address this issue when n and m are quite large? Thanks
Iterating over all possible tosses is definitely not going to work for large n (and here n already counts as large from about 20 onwards, I guess). One can only calculate this efficiently using some combinatorics.
Let's start with a simple example: four tosses and a minimum of two groups, i.e. we want to calculate the number of outcomes that result in two or more groups. There are a few possibilities (we number the groups 1, 2, etc.):
Two groups:
1112
1122
1222
Three groups:
1123
1223
1233
Four groups:
1234
We know that the value of group G1 will be different from the one for G2, and the one for G2 different from the one for G3, and so on. So it holds that:
G1 != G2 != G3 != ... != Gn
Since there are only two values (heads and tails), this means that if we determine the value of the first group, we have determined the values of all groups. If G1 is heads, all even groups are tails and all odd groups are heads.
So this means that for every combination above, there are exactly two configurations. This means that for the case n=4, m=1, there are 2×7 = 14 configurations (which is exactly what we get when we work it out with the program in your question).
Now the only problem we still have to face is how we are going to count these super-configurations. We can do this by introducing what I would call upgrade notation:
You notate an increase of the group with 1 and the same group with 0.
So now 1223 becomes 0101: we upgrade the group at the second and the fourth index. And 1234 is 0111. How does this help? Well, for k groups we only have to count the number of combinations with k-1 upgrades, which is C(n-1, k-1) ("n-1 choose k-1"), with n the length of the string (the total number of tosses) and k the number of groups. Now C(n-1, k-1) is defined as (n-1)!/((k-1)!(n-k)!), with ! the factorial. I borrowed a way to calculate nCr from here:
import operator as op
from functools import reduce

def ncr(n, r):
    r = min(r, n-r)
    if r == 0:
        return 1
    return reduce(op.mul, range(n-r+1, n+1)) // reduce(op.mul, range(1, r+1))
And now the last step is simply to calculate the number of combinations (times two) for every k higher than m, up to n. So:
def group_num(n, m):
    total = 0
    for i in range(m+1, n+1):
        total += ncr(n-1, i-1)
    return 2*total
or putting it into a one-liner:
def group_num(n, m):
    return 2*sum(ncr(n-1, i-1) for i in range(m+1, n+1))
Now we can test our code:
>>> group_num(4,1)
14
>>> group_num(10,6)
260
>>> group_num(200,110)
125409844583698900384745448848066934911164089598228258053888
>>> group_num(500,260)
606609270097725645141493459934317664675833205307408583743573981944035656294544972476175251556943050783856517254296239386495518290119646417132819099088
Which are the expected numbers (for the first two). As you can see, the total amount blows up enormously, so even the fastest algorithm that counts the outcomes one at a time will easily be outperformed by this approach (the results were calculated in less than a second).
The ncr function runs in O(n) time (and this can be improved further, for instance by memoizing the factorials); the group_num function is thus computed (without that optimization) in O((n-m)×n), which completely avoids the exponential behavior.
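As a sketch of the memoization idea mentioned above (my own illustration, not part of the original answer): caching the factorials lets every ncr call after the first reuse earlier work, and the cached version can be dropped into group_num in place of ncr.
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def fact(n):
    # Each distinct factorial is computed once and then served from the cache.
    return factorial(n)

def ncr_cached(n, r):
    # Drop-in replacement for ncr above.
    if r < 0 or r > n:
        return 0
    return fact(n) // (fact(r) * fact(n - r))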

Singpath Python Error. "Your code took too long to return."

I was playing around with the Singpath Python practice questions and came across a simple question which asks the following:
Given an input of a list of numbers and a high number,
return the number of multiples
of each of those numbers that are less than the maximum number.
For this case the list will contain a maximum of 3 numbers
that are all relatively prime to each other.
I wrote this simple program, and it ran perfectly fine:
"""
Given an input of a list of numbers and a high number,
return the number of multiples
of each of those numbers that are less than the maximum number.
For this case the list will contain a maximum of 3 numbers
that are all relatively prime to each other.
>>> countMultiples([3],30)
9
>>> countMultiples([3,5],100)
46
>>> countMultiples([3,5,7],30)
16
"""
def countMultiples(l, max):
j = []
for num in l:
i = 1
count = 0
while num * i < max:
if num * i not in j:
j.append(num * i)
i += 1
return len(j)
print countMultiples([3],30)
print countMultiples([3,5],100)
print countMultiples([3, 5, 7],30)
But when I tried to run the same code on Singpath, it gave me this error:
Your code took too long to return.
Your solution may be stuck in an infinite loop. Please try again.
Has anyone experienced the same issues with Singpath?
I suspect the error you're getting means exactly what it says. For some input that the test program gives your function, it takes too long to return. I don't know anything about Singpath myself, so I don't know exactly how long that might be. But I'd guess that they give you enough time to solve the problem if you use the best algorithm.
You can see for yourself that your code is slow if you pass in a very large max value. Try passing 10000 as max and you may end up waiting for a minute or two to get a result.
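If you want to measure this yourself, a rough timing sketch (assuming Python 3 and the list-based countMultiples from the question) could look like this:
import time

start = time.time()
print(countMultiples([3, 5, 7], 10000))
print(time.time() - start, "seconds")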
There are a couple of reasons your code is slow in these situations. The first is that you have a list of every multiple that you've found so far, and you are searching the list to see if the latest value has already been seen. Each search takes time proportional to the length of the list, so for the whole run of the function, it takes quadratic time (relative to the result value).
You could improve on this quite a lot by using a set instead of a list. You can test if an object is in a set in (amortized) constant time. But if j is a set, you don't actually need to test if a value is already in it before adding, since sets ignore duplicated values anyway. This means you can just add a value to the set without any care about whether it was there already.
def countMultiples(l, max):
    j = set()  # use a set object, rather than a list
    for num in l:
        i = 1
        count = 0
        while num * i < max:
            j.add(num*i)  # add items to the set unconditionally
            i += 1
    return len(j)  # duplicate values are ignored, and won't be counted
This runs a fair amount faster than the original code, and max values of a million or more will return in a not too unreasonable time. But if you try values larger still (say, 100 million or a billion), you'll eventually still run into trouble. That's because your code uses a loop to find all the multiples, which takes linear time (relative to the result value). Fortunately, there is a better algorithm.
(If you want to figure out the better approach on your own, you might want to stop reading here.)
The better way is to use division to find how many times you can multiply each value to get a value less than max. The number of multiples of num that are strictly less than max is (max-1) // num (the -1 is because we don't want to count max itself). Integer division is much faster than doing a loop!
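As a quick sanity check of the division idea, using the first docstring example from the question:
# Multiples of 3 strictly less than 30 are 3, 6, ..., 27.
max_value = 30
print((max_value - 1) // 3)  # 9, matching countMultiples([3], 30)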
There is an added complexity though. If you divide to find the number of multiples, you don't actually have the multiples themselves to put in a set like we were doing above. This means that any integer that is a multiple of more than one of our input numbers will be counted more than once.
Fortunately, there's a good way to fix this. We just need to count how many integers were over counted, and subtract that from our total. When we have two input values, we'll have double counted every integer that is a multiple of their least common multiple (which, since we're guaranteed that they're relatively prime, means their product).
If we have three values, We can do the same subtraction for each pair of numbers. But that won't be exactly right either. The integers that are multiples of all three of our input numbers will be counted three times, then subtracted back out three times as well (since they're multiples of the LCM of each pair of values). So we need to add a final value to make sure those multiples of all three values are included in the final sum exactly once.
import itertools

def countMultiples(numbers, max):
    count = 0
    for i, num in enumerate(numbers):
        count += (max-1) // num  # count multiples of num that are less than max
    for a, b in itertools.combinations(numbers, 2):
        count -= (max-1) // (a*b)  # remove double counted numbers
    if len(numbers) == 3:
        a, b, c = numbers
        count += (max-1) // (a*b*c)  # add the vals that were removed too many times
    return count
This should run in something like constant time for any value of max.
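As a quick check (assuming the three-value countMultiples above has been defined), the docstring examples from the question should come out the same:
print(countMultiples([3], 30))        # expected 9
print(countMultiples([3, 5], 100))    # expected 46
print(countMultiples([3, 5, 7], 30))  # expected 16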
Now, that's probably as efficient as you need to be for the problem you're given (which will always have no more than three values). But if you wanted a solution that would work for more input values, you can write a general version. It uses the same algorithm as the previous version, and uses itertools.combinations a lot more to get different numbers of input values at a time. The counts for LCMs of an odd number of values get added to the total, while the counts for LCMs of an even number of values get subtracted.
import itertools
from functools import reduce
from operator import mul

def lcm(nums):
    return reduce(mul, nums)  # this is only correct if nums are all relatively prime

def countMultiples(numbers, max):
    count = 0
    for n in range(len(numbers)):
        for nums in itertools.combinations(numbers, n+1):
            count += (-1)**n * (max-1) // lcm(nums)
    return count
Here's an example output of this version, which was computed very quickly:
>>> countMultiples([2,3,5,7,11,13,17], 100000000000000)
81947464300342

Python: Number ranges that are extremely large?

val = long(raw_input("Please enter the maximum value of the range:")) + 1
start_time = time.time()
numbers = range(0, val)
shuffle(numbers)
I cannot find a simple way to make this work with extremely large inputs - can anyone help?
I saw a question like this - but I could not implement the range function they described in a way that works with shuffle. Thanks.
To get a random permutation of the range [0, n) in a memory efficient manner; you could use numpy.random.permutation():
import numpy as np
numbers = np.random.permutation(n)
If you need only a small fraction of the values from the range, e.g., to get k random values from the [0, n) range:
import random
from functools import partial
def sample(n, k):
    # assume n is much larger than k
    randbelow = partial(random.randrange, n)
    # from random.py
    result = [None] * k
    selected = set()
    selected_add = selected.add
    for i in range(k):
        j = randbelow()
        while j in selected:
            j = randbelow()
        selected_add(j)
        result[i] = j
    return result
print(sample(10**100, 10))
If you don't need the full list of numbers (and if you are getting billions, it's hard to imagine why you would need them all), you might be better off taking a random.sample of your number range, rather than shuffling them all. In Python 3, random.sample can work on a range object too, so your memory use can be quite modest.
For example, here's code that will sample ten thousand random numbers from a range up to whatever maximum value you specify. It should require only a relatively small amount of memory beyond the 10000 result values, even if your maximum is 100 billion (or whatever enormous number you want):
import random
def get10kRandomNumbers(maximum):
    pop = range(1, maximum+1)  # this is memory efficient in Python 3
    sample = random.sample(pop, 10000)
    return sample
Alas, this doesn't work as nicely in Python 2, since xrange objects don't allow maximum values greater than the system's integer type can hold.
An important point to note is that it will be impossible for a computer to have the list of numbers in memory if it is larger than a few billion elements: its memory footprint becomes larger than the typical RAM size (as it takes about 4 GB for 1 billion 32-bit numbers).
In the question, val is a long integer, which seems to indicate that you are indeed using more than a billion integers, so this cannot be done conveniently in memory (i.e., shuffling will be slow, as the operating system will swap).
That said, if the number of elements is small enough (let's say smaller than 0.5 billion), then a list of elements can fit in memory thanks to the compact representation offered by the array module, and be shuffled. This can be done with the standard module array:
import array, random
numbers = array.array('I', xrange(10**8)) # or 'L', if the number of bytes per item (numbers.itemsize) is too small with 'I'
random.shuffle(numbers)
