In some implementations of the game of Tetris, there is an algorithm called Random Generator which generates an infinite sequence of permutations of the set of one-sided tetrominoes based on the following algorithm:
Random Generator generates a sequence of all seven one-sided
tetrominoes (I, J, L, O, S, T, Z) permuted randomly, as if they were
drawn from a bag. Then it deals all seven tetrominoes to the piece
sequence before generating another bag.
Elements of this infinite sequence are only generated when necessary, i.e. a random permutation of the 7 one-sided tetrominoes is appended to a queue of tetrominoes whenever more pieces are required than the queue can provide.
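For concreteness, here is a minimal sketch of that refill-on-demand behaviour (the queue and the helper name ensure_pieces are my own illustration, not part of any particular Tetris implementation):
import collections, random

BAG = "IJLOSTZ"
queue = collections.deque()

def ensure_pieces(n):
    # Append freshly shuffled 7-piece bags until at least n pieces are queued.
    while len(queue) < n:
        pieces = list(BAG)
        random.shuffle(pieces)
        queue.extend(pieces)

ensure_pieces(5)              # guarantee a 5-piece preview
preview = list(queue)[:5]     # peek without consuming
next_piece = queue.popleft()  # deal one piece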
I believe there are two primary methods of doing this in Python.
The first method uses itertools.permutations and random.choice.
import itertools, random, collections
bag = "IJLOSTZ"
bigbag = list(itertools.permutations(bag))
sequence = collections.deque(random.choice(bigbag))
sequence.extend(random.choice(bigbag))
sequence.extend(random.choice(bigbag))
# . . . Extend as necessary
The second method uses only random.shuffle.
import random, collections
bag = ['I', 'J', 'L', 'O', 'S', 'T', 'Z']
random.shuffle(bag)
sequence = collections.deque(bag)
random.shuffle(bag)
sequence.extend(bag)
random.shuffle(bag)
sequence.extend(bag)
# . . . Extend as necessary
What are the advantages/disadvantages of either method, assuming that the player of Tetris is skilled and the Random Generator must produce a large sequence of one-sided tetrominoes?
I'd say that the time to shuffle a tiny list is simply trivial, so don't worry about it. Either method should be "equally random", so there's no basis for deciding there.
But rather than muck with both lists and deques, I'd use a tile generator instead:
def get_tile():
from random import shuffle
tiles = list("IJLOSTZ")
while True:
shuffle(tiles)
for tile in tiles:
yield tile
Short, sweet, and obvious.
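For example, if you just want the next few pieces, you can drive that generator with itertools.islice (a small usage sketch of my own):
from itertools import islice

tiles = get_tile()
preview = list(islice(tiles, 7))   # the first full bag, in random order
print(preview)
next_piece = next(tiles)           # first piece of the second bag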
Making that peekable
Since I'm old, when I hear "peekable queue" I think "circular buffer". Allocate the memory for a fixed-size buffer once, and keep track of "the next item" with an index variable that wraps around. Of course this pays off a lot more in C than in Python, but for concreteness:
class PeekableQueue:
def __init__(self, item_getter, maxpeek=50):
self.getter = item_getter
self.maxpeek = maxpeek
self.b = [next(item_getter) for _ in range(maxpeek)]
self.i = 0
def pop(self):
result = self.b[self.i]
self.b[self.i] = next(self.getter)
self.i += 1
if self.i >= self.maxpeek:
self.i = 0
return result
def peek(self, n):
if not 0 <= n <= self.maxpeek:
raise ValueError("bad peek argument %r" % n)
nthruend = self.maxpeek - self.i
if n <= nthruend:
result = self.b[self.i : self.i + n]
else:
result = self.b[self.i:] + self.b[:n - nthruend]
return result
q = PeekableQueue(get_tile())
So you consume the next tile via q.pop(), and at any time you can get a list of the next n tiles that will be popped via q.peek(n). And there's no organic Tetris player in the universe fast enough for the speed of this code to make any difference at all ;-)
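For instance, a quick usage sketch with the q defined above:
upcoming = q.peek(5)   # the next five tiles, oldest first
piece = q.pop()        # consume one tile; the buffer refills itself from the generator
assert piece == upcoming[0]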
There are 7! = 5040 permutations of a sequence of 7 distinct objects. Thus, generating all permutations is very costly in terms of both time complexity (O(n!*n)) and space complexity (O(n!*n)). However choosing a random permutation from the sequence of permutations is easy. Let's look at the code for choice from random.py.
def choice(self, seq):
"""Choose a random element from a non-empty sequence."""
return seq[int(self.random() * len(seq))] # raises IndexError if seq is empty
As you can see the calculation of the index is O(1) since len(seq) is O(1) for any sequence and self.random() is also O(1). Fetching an element from the list type in Python is also O(1) so the entire function is O(1).
On the other hand, using random.shuffle will swap the elements of your bag in-place. Thus it will use O(1) space complexity. However in terms of time complexity it is not so efficient. Let's look at the code for shuffle from random.py
def shuffle(self, x, random=None, int=int):
"""x, random=random.random -> shuffle list x in place; return None.
Optional arg random is a 0-argument function returning a random
float in [0.0, 1.0); by default, the standard random.random.
"""
if random is None:
random = self.random
for i in reversed(xrange(1, len(x))):
# pick an element in x[:i+1] with which to exchange x[i]
j = int(random() * (i+1))
x[i], x[j] = x[j], x[i]
random.shuffle implements the Fisher-Yates shuffle, which "is similar to randomly picking numbered tickets out of a hat, or cards from a deck, one after another until there are no more left." However, the number of computations is clearly greater than in the first method, since len(x)-1 calls to random() must be made, along with len(x)-1 swap operations. Each swap requires two fetches from the list and the construction of a 2-tuple for unpacking and assignment.
Based on all of this, I would guess that the first method uses a lot of memory to store the permutations and pays an O(n!*n) up-front cost, but in the long run it is probably more efficient than the second method and is more likely to keep the framerate stable in an actual Tetris implementation, since there is less computation to do inside the game loop. The permutations can be generated before the display is even initialized, which helps give the illusion that your game does not perform many computations.
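If you want to put rough numbers on that trade-off, a sketch along these lines (my own, figures will vary by machine) compares the one-time cost of the permutation table with the per-bag cost of choice versus shuffle:
import itertools, random, sys, timeit

bag = "IJLOSTZ"
bigbag = list(itertools.permutations(bag))

# One-time space cost of the outer list (the 5040 tuples add more on top of this).
print(sys.getsizeof(bigbag), "bytes for the list of", len(bigbag), "permutations")

# Per-bag cost of each method, 100000 bags each.
print(timeit.timeit(lambda: random.choice(bigbag), number=100000))
print(timeit.timeit(lambda: random.shuffle(list(bag)), number=100000))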
Here I post finalized code using Tim Peters' suggestion of a generator and a circular buffer. Since the size of the circular buffer would be known before the buffer's creation, and it would never change, I did not implement all of the features that circular buffers usually have (you can find that on the Wikipedia article). In any case, it works perfectly for the Random Generator algorithm.
def random_generator():
import itertools, random
bag = "IJLOSTZ"
bigbag = list(itertools.permutations(bag))
while True:
for ost in random.choice(bigbag):
yield ost
def popleft_append(buff, start_idx, it):
""" This function emulates popleft and append from
collections.deque all in one step for a circular buffer
of size n which is always full,
    The argument to parameter "it" must be an infinite
    generator iterable, otherwise next(it) may raise
    an exception at some point """
    left_popped = buff[start_idx]
    buff[start_idx] = next(it)
return (start_idx + 1) % len(buff), left_popped
def circular_peek(seq, start_idx):
return seq[start_idx:len(seq)] + seq[:start_idx]
# Example usage for peek queue of size 5
rg = random_generator()
pqsize = 5
# Initialize buffer using pqsize elements from generator
buff = [next(rg) for _ in range(pqsize)]
start_idx = 0
# Game loop
while True:
# Popping one OST (one-sided tetromino) from queue and
# adding a new one, also updating start_idx
start_idx, left_popped = popleft_append(buff, start_idx, rg)
# To show the peek queue currently
    print(circular_peek(buff, start_idx))
Related
I am seeking to sample n random permutations of a list in Python.
This is my code:
obj = [ 5 8 9 ... 45718 45719 45720]
#type(obj) = numpy.ndarray
pairs = random.sample(list(permutations(obj,2)),k= 150)
Although the code does what I want it to, it causes memory issues. I sometimes receive the error Memory error when running on CPU, and when running on GPU, my virtual machine crashes.
How can I make the code work in a more memory-efficient manner?
This avoids using permutations at all:
count = len(obj)
def index2perm(i,obj):
i1, i2 = divmod(i,len(obj)-1)
if i1 <= i2:
i2 += 1
return (obj[i1],obj[i2])
pairs = [index2perm(i,obj) for i in random.sample(range(count*(count-1)),k=3)]
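To see why the divmod trick works, here is a small check of my own (assuming index2perm as defined above); on a toy list it enumerates every index and confirms that each ordered pair appears exactly once and that the diagonal (x, x) pairs are skipped:
import random

obj = ['a', 'b', 'c', 'd']
count = len(obj)

all_pairs = [index2perm(i, obj) for i in range(count * (count - 1))]
assert len(set(all_pairs)) == count * (count - 1)   # every ordered pair exactly once
assert all(x != y for x, y in all_pairs)            # no (x, x) pairs

# Drawing a memory-friendly random sample of pairs, as in the answer above:
pairs = [index2perm(i, obj) for i in random.sample(range(count * (count - 1)), k=5)]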
Building on Pablo Ruiz's excellent answer, I suggest wrapping his sampling solution into a generator function that yields unique permutations by keeping track of what it has already yielded:
import numpy as np
def unique_permutations(sequence, r, n):
"""Yield n unique permutations of r elements from sequence"""
seen = set()
while len(seen) < n:
# This line of code adapted from Pablo Ruiz's answer:
candidate_permutation = tuple(np.random.choice(sequence, r, replace=False))
if candidate_permutation not in seen:
seen.add(candidate_permutation)
yield candidate_permutation
obj = list(range(10))
for permutation in unique_permutations(obj, 2, 15):
    print(permutation)  # do something with the permutation
# Or, to save the result as a list:
pairs = list(unique_permutations(obj, 2, 15))
My assumption is that you are sampling a small subset of the very large number of possible permutations, in which case collisions will be rare enough that keeping a seen set will not be expensive.
Warnings: this function is an infinite loop if you ask for more permutations than are possible given the inputs. It will also get increasingly slow as n gets close to the number of possible permutations, since collisions will get increasingly frequent.
If I were to put this function in my code base, I would put a shield at the top that calculated the number of possible permutations and raised a ValueError exception if n exceeded that number, and maybe output a warning if n exceeded one tenth that number, or something like that.
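A minimal sketch of such a shield (my own wrapper; it uses math.perm, available in Python 3.8+, and the one-tenth threshold is just the arbitrary cut-off suggested above):
import math
import warnings

def checked_unique_permutations(sequence, r, n):
    # Guard unique_permutations against impossible or collision-heavy requests.
    possible = math.perm(len(sequence), r)   # number of r-length permutations
    if n > possible:
        raise ValueError("requested %d permutations but only %d exist" % (n, possible))
    if n > possible // 10:
        warnings.warn("n is a large fraction of all permutations; expect many collisions")
    return unique_permutations(sequence, r, n)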
You can avoid materializing the permutations iterator, which could be massive in memory. You can generate random permutations by sampling the list with replace=False.
import numpy as np
obj = np.array([5,8,123,13541,42])
k = 15
permutations = [tuple(np.random.choice(obj, 2, replace=False)) for _ in range(k)]
print(permutations)
This problem becomes much harder if, for example, you require that none of the random permutations is repeated.
Edit: no-repetitions code
I think this is the best possible approach for the non-repetition case.
We index the n**2-n possible ordered pairs as positions in an n-by-n matrix whose diagonal must be avoided. We sample those indexes without repetition and without materializing them, map each sampled index past the diagonal to its matrix coordinates, and then read the pair of elements off those coordinates.
import random
import numpy as np
obj = np.array([1,2,3,10,43,19,323,142,334,33,312,31,12])
k = 150
obj_len = len(obj)
indexes = random.sample(range(obj_len**2-obj_len), k)
def mapm(m):
return m + m //(obj_len) +1
permutations = [(obj[mapm(i)//obj_len], obj[mapm(i)%obj_len]) for i in indexes]
This approach is not based on any assumption, does not load the permutations and also the performance is not based on a while loop failing to insert duplicates, as no duplicates are ever generated.
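To sanity-check that index mapping, here is a small sketch of my own that redefines obj as a tiny array, maps every flat index, and verifies that the diagonal is never hit and every off-diagonal cell is hit exactly once:
import numpy as np

obj = np.array([10, 20, 30, 40])
obj_len = len(obj)

def mapm(m):
    return m + m // obj_len + 1

cells = {(mapm(i) // obj_len, mapm(i) % obj_len) for i in range(obj_len**2 - obj_len)}
assert all(r != c for r, c in cells)            # the diagonal is always skipped
assert len(cells) == obj_len**2 - obj_len       # every off-diagonal cell exactly once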
Suppose I have a Python list of arbitrary length k. Now, suppose I would like a random sample of n (where n <= k!) distinct permutations of that list. I was tempted to try:
import random
import itertools
k = 6
n = 10
mylist = list(range(0, k))
j = random.sample(list(itertools.permutations(mylist)), n)
for i in j:
print(i)
But, naturally, this code becomes unusably slow when k gets too large. Given that the number of permutations that I may be looking for n is going to be relatively small compared to the total number of permutations, computing all of the permutations is unnecessary. Yet it's important that none of the permutations in the final list are duplicates.
How would you achieve this more efficiently? Remember, mylist could be a list of anything, I just used list(range(0, k)) for simplicity.
You can generate permutations, and keep track of the ones you have already generated. To make it more versatile, I made a generator function:
import random
k = 6
n = 10
mylist = list(range(0, k))
def perm_generator(seq):
seen = set()
length = len(seq)
while True:
perm = tuple(random.sample(seq, length))
if perm not in seen:
seen.add(perm)
yield perm
rand_perms = perm_generator(mylist)
j = [next(rand_perms) for _ in range(n)]
for i in j:
print(i)
Naïve implementation
Below is the naïve implementation I did (the same idea is well implemented by @Tomothy32 above, using the pure Python standard library and a generator):
import numpy as np
mylist = np.array(mylist)
perms = set()
for i in range(n): # (1) Draw N samples from permutations Universe U (#U = k!)
while True: # (2) Endless loop
        perm = np.random.permutation(k) # (3) Generate a random permutation from U
        key = tuple(perm)
        if key not in perms: # (4) Check if permutation already has been drawn (hash table)
            perms.add(key) # (5) Insert the tuple into the set (add, not update, which would add the ints)
break # (6) Break the endless loop
print(i, mylist[perm])
It relies on numpy.random.permutation, which randomly permutes a sequence.
The key idea is:
to generate a new random permutation (the indexes randomly permuted);
to check whether the permutation already exists and store it (as a tuple of ints, because it must be hashable) to prevent duplicates;
then to permute the original list using the index permutation.
This naïve version does not directly suffer from the factorial complexity O(k!) of the itertools.permutations function, which generates all k! permutations before sampling from them.
About Complexity
There is something interesting about the algorithm design and complexity...
If we want to be sure that the loop can end, we must enforce N <= k!, but this is not checked anywhere. Furthermore, assessing the complexity requires knowing how many times the endless loop will actually run before a new random tuple is found and breaks it.
Limitation
Let's encapsulate the function written by @Tomothy32:
import math
def get_perms(seq, N=10):
    rand_perms = perm_generator(seq)  # use the argument, not the global mylist
return [next(rand_perms) for _ in range(N)]
For instance, this call works for very small k < 7:
get_perms(list(range(k)), math.factorial(k))
But it will run into O(k!) complexity (in both time and memory) as k grows, because finding the last few permutations boils down to randomly hitting the one missing key after all the other k!-1 keys have already been found.
Always look on the bright side...
On the other hand, the method can generate a reasonable number of permuted tuples in a reasonable amount of time when N <<< k!. For example, it is possible to draw more than N=5000 tuples of length k, for 10 < k < 1000, in less than one second.
When k and N are kept small and N <<< k!, the algorithm appears to be:
roughly constant with respect to k;
linear with respect to N.
This is somewhat valuable.
Yesterday I wrote two possible reverse functions for lists to demonstrate different ways of doing list inversion. But then I noticed that the function using branching recursion (rev2) is actually faster than the function using linear recursion (rev1), even though the branching function needs more calls to finish and makes only one fewer non-trivial call (calls that are actually more computation intensive) than the linearly recursive function. Nowhere am I explicitly triggering parallelism, so where does the performance difference come from that makes a function with more, and more involved, calls take less time?
from sys import argv
from time import time
nontrivial_rev1_call = 0 # counts number of calls involving concatenation, indexing and slicing
nontrivial_rev2_call = 0 # counts number of calls involving concatenation, len() call, division and slicing
length = int(argv[1])
def rev1(l):
global nontrivial_rev1_call
if l == []:
return []
nontrivial_rev1_call += 1
return rev1(l[1:])+[l[0]]
def rev2(l):
global nontrivial_rev2_call
if l == []:
return []
elif len(l) == 1:
return l
nontrivial_rev2_call += 1
return rev2(l[len(l)//2:]) + rev2(l[:len(l)//2])
lrev1 = rev1(list(range(length)))
print ('Calls to rev1 for a list of length {}: {}'.format(length, nontrivial_rev1_call))
lrev2 = rev2(list(range(length)))
print ('Calls to rev2 for a list of length {}: {}'.format(length, nontrivial_rev2_call))
print()
l = list(range(length))
start = time()
for i in range(1000):
lrev1 = rev1(l)
end = time()
print ("Average time taken for 1000 passes on a list of length {} with rev1: {} ms".format(length, (end-start)/1000*1000))
start = time()
for i in range(1000):
lrev2 = rev2(l)
end = time()
print ("Average time taken for 1000 passes on a list of length {} with rev2: {} ms".format(length, (end-start)/1000*1000))
Example call:
$ python reverse.py 996
calls to rev1 for a list of length 996: 996
calls to rev2 for a list of length 996: 995
Average time taken for 1000 passes on a list of length 996 with rev1: 7.90629506111145 ms
Average time taken for 1000 passes on a list of length 996 with rev2: 1.3290061950683594 ms
Short answer: it is not so much the number of calls here as the amount of list copying. As a result, the linear recursion has time complexity O(n²) whereas the branching recursion has time complexity O(n log n).
The recursive call here does not operate in constant time: it operates in the length of the list it copies. Indeed, if you copy a list of n elements, it will require O(n) time.
Now if we perform the linear recursion, it means we will perform O(n) calls (the maximum call depth is O(n)). Each time, we will copy the list entirely, except for one item. So the time complexity is:
sum(k, k = 1..n) = 1 + 2 + ... + n = n * (n+1) / 2
So, given that the calls themselves are done in O(1), the time complexity of the algorithm is O(n²).
In case we perform branching recursion, we make two copies of the list, each with a length that is approximately half. So every level of recursion will take O(n) time (since these halves result in copies of the list as well, and if we sum them up, we make an entire copy at every recursive level). But the number of levels scales logarithmically:
sum(n, k = 1..log n) = n + n + ... + n (log n terms) = n log n
So the time complexity here is O(n log n) (the log is base 2, but that does not matter in terms of big O).
Using views
Instead of copying lists, we can use views: here we keep a reference to the same list, but use two integers that specify the span of the list. For example:
def rev1(l, frm, to):
global nontrivial_rev1_call
if frm >= to:
return []
nontrivial_rev1_call += 1
result = rev1(l, frm+1, to)
result.append(l[frm])
return result
def rev2(l, frm, to):
global nontrivial_rev2_call
if frm >= to:
return []
elif to-frm == 1:
        return [l[frm]]  # wrap in a list so the concatenation below works
nontrivial_rev2_call += 1
mid = (frm+to)//2
return rev2(l, mid, to) + rev2(l, frm, mid)
If we now run the timeit module, we obtain:
>>> timeit.timeit(partial(rev1, list(range(966)), 0, 966), number=10000)
2.176353386021219
>>> timeit.timeit(partial(rev2, list(range(966)), 0, 966), number=10000)
3.7402000919682905
This is because we no longer copy the lists, so for the linear recursion the append(..) call works in O(1) amortized cost, whereas the branching recursion still concatenates two lists, which costs O(k) with k the sum of their lengths. So now we are comparing O(n) (linear recursion) with O(n log n) (branching recursion).
I need to deterministically generate a randomized list containing the numbers from 0 to 2^32-1.
This would be the naive (and totally nonfunctional) way of doing it, just so it's clear what I'm wanting.
import random
numbers = range(2**32)
random.seed(0)
random.shuffle(numbers)
I've tried making the list with numpy.arange() and using pycrypto's random.shuffle() to shuffle it. Making the list ate up about 8gb of ram, then shuffling raised that to around 25gb. I only have 32gb to give. But that doesn't matter because...
I've tried cutting the list into 1024 slices and trying the above, but even one of these slices takes way too long. I cut one of these slices into 128 yet smaller slices, and that took about 620ms each. If it grew linearly, then that means the whole thing would take about 22 and a half hours to complete. That sounds alright, but it doesn't grow linearly.
Another thing I've tried is generating random numbers for every entry and using those as indices for their new location. I then go down the list and attempt to place the number at the new index. If that index is already in use, the index is incremented until it finds a free one. This works in theory, and it can do about half of it, but near the end it keeps having to search for new spots, wrapping around the list several times.
Is there any way to pull this off? Is this a feasible goal at all?
Computing all the values up front seems impossible, since Crypto computes a random integer in about a millisecond, so the whole job would take days.
Here is a Knuth algorithm implementation as a generator:
from Crypto.Random.random import randint
import numpy as np
def onthefly(n):
numbers=np.arange(n,dtype=np.uint32)
for i in range(n):
j=randint(i,n-1)
numbers[i],numbers[j]=numbers[j],numbers[i]
yield numbers[i]
For n=10 :
gen=onthefly(10)
print([next(gen) for i in range(9)])
print(next(gen))
#[9, 0, 2, 6, 4, 8, 7, 3, 1]
#5
For n=2**32, the generator takes a minute to initialize, but each call is O(1).
If you have a continuous range of numbers, you don't need to store them at all. It is easy to devise a bidirectional mapping between the value in a shuffled list and its position in that list. The idea is to use a pseudo-random permutation and this is exactly what block ciphers provide.
The trick is to find a block cipher that matches exactly your requirement of 32-bit integers. There are very few such block ciphers, but the Simon and Speck ciphers (released by the NSA) are parameterisable and support a block size of 32-bit (usually block sizes are much larger).
This library seems to provide an implementation of that. We can devise the following functions:
def get_value_from_index(key, i):
cipher = SpeckCipher(key, mode='ECB', key_size=64, block_size=32)
return cipher.encrypt(i)
def get_index_from_value(key, val):
cipher = SpeckCipher(key, mode='ECB', key_size=64, block_size=32)
return cipher.decrypt(val)
The library works with Python's big integers, so you might not even need to encode them.
A 64-bit key (for example 0x123456789ABCDEF0) is not much. You could use a construction similar to the one that increased the key size of DES for Triple DES. Keep in mind that keys should be chosen randomly, and they have to stay constant if you want determinism.
If you don't want to use an algorithm by the NSA for that, I would understand. There are others, but I can't find them now. The Hasty Pudding cipher is even more flexible, but I don't know if there is an implementation of that for Python.
The class I created uses a bitarray to keep track of which numbers have already been used. With the comments, I think the code is pretty self-explanatory.
import bitarray
import random
class UniqueRandom:
def __init__(self):
""" Init boolean array of used numbers and set all to False
"""
self.used = bitarray.bitarray(2**32)
self.used.setall(False)
def draw(self):
""" Draw a previously unused number
Return False if no free numbers are left
"""
# Check if there are numbers left to use; return False if none are left
if self._free() == 0:
return False
# Draw a random index
i = random.randint(0, 2**32-1)
# Skip ahead from the random index to a undrawn number
while self.used[i]:
i = (i+1) % 2**32
# Update used array
self.used[i] = True
# return the selected number
return i
def _free(self):
""" Check how many places are unused
"""
return self.used.count(False)
def main():
r = UniqueRandom()
for _ in range(20):
        print(r.draw())
if __name__ == '__main__':
main()
Design considerations
While Garrigan Stafford's answer is perfectly fine, the memory footprint of this solution is much smaller (a 2^32-bit array is a bit more than 512 MB). Another difference between our answers is that Garrigan's algorithm takes more time to generate a random number as the number of generated numbers increases (because he keeps iterating until an unused number is found). This algorithm just looks up the next unused number if a certain number is already used. This makes the time it takes to draw a number practically the same every time, regardless of how far the pool of free numbers is exhausted.
Here is a permutation RNG which I wrote, which uses the fact that squaring a number mod a prime (plus some intricacies) gives a pseudorandom permutation.
https://github.com/pytorch/pytorch/blob/09b4f4f2ff88088306ecedf1bbe23d8aac2d3f75/torch/utils/data/_utils/index_utils.py
Short version:
from math import floor, sqrt

def _is_prime(n):
if n == 2:
return True
if n == 1 or n % 2 == 0:
return False
for d in range(3, floor(sqrt(n)) + 1, 2): # can use isqrt in Python 3.8
if n % d == 0:
return False
return True
class Permutation:  # the Range base class from the linked file is omitted here so this short version stands alone
"""
Generates a random permutation of integers from 0 up to size.
Inspired by https://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/
"""
size: int
prime: int
seed: int
def __init__(self, size: int, seed: int):
self.size = size
self.prime = self._get_prime(size)
self.seed = seed % self.prime
def __getitem__(self, index):
x = self._map(index)
while x >= self.size:
# If we map to a number greater than size, then the cycle of successive mappings must eventually result
# in a number less than size. Proof: The cycle of successive mappings traces a path
# that either always stays in the set n>=size or it enters and leaves it,
# else the 1:1 mapping would be violated (two numbers would map to the same number).
# Moreover, `set(range(size)) - set(map(n) for n in range(size) if map(n) < size)`
# equals the `set(map(n) for n in range(size, prime) if map(n) < size)`
# because the total mapping is exhaustive.
# Which means we'll arrive at a number that wasn't mapped to by any other valid index.
# This will take at most `prime-size` steps, and `prime-size` is on the order of log(size), so fast.
# But usually we just need to remap once.
x = self._map(x)
return x
    @staticmethod
def _get_prime(size):
"""
Returns the prime number >= size which has the form (4n-1)
"""
n = size + (3 - size % 4)
while not _is_prime(n):
# We expect to find a prime after O(log(size)) iterations
# Using a brute-force primehood test, total complexity is O(log(size)*sqrt(size)), which is pretty good.
n = n + 4
return n
def _map(self, index):
a = self._permute_qpr(index)
b = (a + self.seed) % self.prime
c = self._permute_qpr(b)
return c
def _permute_qpr(self, x):
residue = pow(x, 2, self.prime)
if x * 2 < self.prime:
return residue
else:
return self.prime - residue
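A small usage sketch of my own, assuming the class is used standalone as above: it draws the whole permutation for a small size and checks that every value from 0 to size-1 appears exactly once.
p = Permutation(size=10, seed=42)
values = [p[i] for i in range(10)]
print(values)
assert sorted(values) == list(range(10))   # it really is a permutation of 0..size-1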
So one way is to keep track of which numbers you have already given out and keep handing out new random numbers one at a time. Consider:
import random
random.seed(0)
class RandomDeck:
def __init__(self):
self.usedNumbers = set()
    def draw(self):
        number = random.randint(0, 2**32 - 1)
        while number in self.usedNumbers:
            number = random.randint(0, 2**32 - 1)
        self.usedNumbers.add(number)  # sets use add(), not append()
        return number
def shuffle(self):
self.usedNumbers = set()
As you can see we essentially have a deck of random numbers between 0 and 2^32 but we only store the numbers we have given out to ensure we don't have repeats. Then you can re-shuffle the deck by forgetting all the numbers you have already given out.
This should be efficient in most use cases as long as you don't draw ~1 million numbers without a reshuffle.
I have a list a_tot with 1500 elements and I would like to divide this list into two lists in a random way. List a_1 would have 1300 and list a_2 would have 200 elements. My question is about the best way to randomize the original list with 1500 elements. When I have randomized the list, I could take one slice with 1300 and another slice with 200.
One way is to use the random.shuffle, another way is to use the random.sample. Any differences in the quality of the randomization between the two methods? The data in list 1 should be a random sample as well as the data in list2.
Any recommendations?
using shuffle:
random.shuffle(a_tot) #get a randomized list
a_1 = a_tot[0:1300] #pick the first 1300
a_2 = a_tot[1300:] #pick the last 200
using sample
new_t = random.sample(a_tot,len(a_tot)) #get a randomized list
a_1 = new_t[0:1300] #pick the first 1300
a_2 = new_t[1300:] #pick the last 200
The source for shuffle:
def shuffle(self, x, random=None, int=int):
"""x, random=random.random -> shuffle list x in place; return None.
Optional arg random is a 0-argument function returning a random
float in [0.0, 1.0); by default, the standard random.random.
"""
if random is None:
random = self.random
for i in reversed(xrange(1, len(x))):
# pick an element in x[:i+1] with which to exchange x[i]
j = int(random() * (i+1))
x[i], x[j] = x[j], x[i]
The source for sample:
def sample(self, population, k):
"""Chooses k unique random elements from a population sequence.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
To choose a sample in a range of integers, use xrange as an argument.
This is especially fast and space efficient for sampling from a
large population: sample(xrange(10000000), 60)
"""
# XXX Although the documentation says `population` is "a sequence",
# XXX attempts are made to cater to any iterable with a __len__
# XXX method. This has had mixed success. Examples from both
# XXX sides: sets work fine, and should become officially supported;
# XXX dicts are much harder, and have failed in various subtle
# XXX ways across attempts. Support for mapping types should probably
# XXX be dropped (and users should pass mapping.keys() or .values()
# XXX explicitly).
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
n = len(population)
if not 0 <= k <= n:
raise ValueError, "sample larger than population"
random = self.random
_int = int
result = [None] * k
setsize = 21 # size of a small set minus size of an empty list
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
if n <= setsize or hasattr(population, "keys"):
# An n-length list is smaller than a k-length set, or this is a
# mapping type so the other algorithm wouldn't work.
pool = list(population)
for i in xrange(k): # invariant: non-selected at [0,n-i)
j = _int(random() * (n-i))
result[i] = pool[j]
pool[j] = pool[n-i-1] # move non-selected item into vacancy
else:
try:
selected = set()
selected_add = selected.add
for i in xrange(k):
j = _int(random() * n)
while j in selected:
j = _int(random() * n)
selected_add(j)
result[i] = population[j]
except (TypeError, KeyError): # handle (at least) sets
if isinstance(population, list):
raise
return self.sample(tuple(population), k)
return result
As you can see, in both cases, the randomization is essentially done by the line int(random() * n). So, the underlying algorithm is essentially the same.
There are two major differences between shuffle() and sample():
1) Shuffle will alter data in-place, so its input must be a mutable sequence. In contrast, sample produces a new list and its input can be much more varied (tuple, string, xrange, bytearray, set, etc).
2) Sample lets you potentially do less work (i.e. a partial shuffle).
It is interesting to show the conceptual relationship between the two by demonstrating that it would have been possible to implement shuffle() in terms of sample():
def shuffle(p):
p[:] = sample(p, len(p))
Or vice-versa, implementing sample() in terms of shuffle():
def sample(p, k):
p = list(p)
shuffle(p)
return p[:k]
Neither of these is as efficient as the real implementations of shuffle() and sample(), but it does show their conceptual relationship.
The randomization should be just as good with both option. I'd say go with shuffle, because it's more immediately clear to the reader what it does.
random.shuffle() shuffles the given list in-place. Its length stays the same.
random.sample() picks n items out of the given sequence without replacement (which also might be a tuple or whatever, as long as it has a __len__()) and returns them in randomized order.
I think they are much the same, except that one updates the original list in place while the other only reads it. No difference in quality.
from random import shuffle
from random import sample
x = [[i] for i in range(10)]
shuffle(x)
sample(x,10)
shuffle updates the list in place, while sample returns a new, reordered list. sample also lets you pass the number of elements to pick, whereas shuffle always gives back a list of the same length as its input.
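A tiny illustration of that difference (my own example):
import random

x = list(range(10))
result = random.shuffle(x)    # shuffles x in place
print(result)                 # None: shuffle has no return value
print(x)                      # x itself is now reordered

y = list(range(10))
picked = random.sample(y, 4)  # a new 4-element list; y is untouched
print(picked, y)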