Pandas create random samples without duplicates - python

I have a pandas dataframe containing ~200,000 rows and I would like to create 5 random samples of 1000 rows each. However, I do not want any of these samples to contain the same row twice.
To create a random sample I have been using:
import numpy as np
rows = np.random.choice(df.index.values, 1000)
sampled_df = df.loc[rows]
However, just doing this several times runs the risk of duplicates across samples. Would the best way to handle this be to keep track of which rows have been sampled each time?

You can use df.sample.
A dataframe with 100 rows and 5 columns:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 5), columns=list("abcde"))
Sample 5 rows:
df.sample(5)
Out[8]:
           a         b         c         d         e
84  0.012201 -0.053014 -0.952495  0.680935  0.006724
45 -1.347292  1.358781 -0.838931 -0.280550 -0.037584
10 -0.487169  0.999899  0.524546 -1.289632 -0.370625
64  1.542704 -0.971672 -1.150900  0.554445 -1.328722
99  0.012143 -2.450915 -0.718519 -1.192069 -1.268863
This ensures those 5 rows are distinct. If you want to repeat the process, I'd suggest sampling number_of_rows * number_of_samples rows at once. For example, if each sample is going to contain 5 rows and you need 10 samples, sample 50 rows: the first 5 are the first sample, the next 5 are the second, and so on.
all_samples = df.sample(50)
samples = [all_samples.iloc[5*i:5*i+5] for i in range(10)]
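Applied to the numbers in the question (a sketch, assuming df is the ~200,000-row frame), the same idea yields 5 disjoint samples of 1000 rows each:
all_samples = df.sample(5 * 1000)
samples = [all_samples.iloc[1000 * i : 1000 * (i + 1)] for i in range(5)]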

You can set replace to False in np.random.choice
rows = np.random.choice(df.index.values, 1000, replace=False)
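If you need all 5 disjoint samples rather than one, you can also draw once and split up front (a sketch, assuming df has at least 5000 rows):
rows = np.random.choice(df.index.values, 5 * 1000, replace=False).reshape(5, 1000)
samples = [df.loc[r] for r in rows]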

Take a look at the numpy.random docs.
For your solution:
import numpy as np
rows = np.random.choice(df.index.values, 1000, replace=False)
sampled_df = df.loc[rows]
This will make random choices without replacement.
If you want to generate multiple samples such that no two samples have any elements in common, you will need to remove the chosen elements from the pool after each iteration. You can use numpy.setdiff1d for that.
import numpy as np

allRows = df.index.values
numOfSamples = 5
samples = list()
for i in range(numOfSamples):
    choices = np.random.choice(allRows, 1000, replace=False)
    samples.append(choices)
    allRows = np.setdiff1d(allRows, choices)
Here is a working example with the numbers 0 to 99:
In [58]: import numpy as np
In [59]: allRows = np.arange(100)
In [60]: numOfSamples = 5
In [61]: samples = list()
In [62]: for i in range(numOfSamples):
....: choices = np.random.choice(allRows, 5, replace=False)
....: samples.append(choices)
....: allRows = np.setdiff1d(allRows, choices)
....:
In [63]: samples
Out[63]:
[array([66, 24, 47, 31, 22]),
array([ 8, 28, 15, 62, 52]),
array([18, 65, 71, 54, 48]),
array([59, 88, 43, 7, 85]),
array([97, 36, 55, 56, 14])]
In [64]: allRows
Out[64]:
array([ 0, 1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 16, 17, 19, 20, 21,
23, 25, 26, 27, 29, 30, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 44,
45, 46, 49, 50, 51, 53, 57, 58, 60, 61, 63, 64, 67, 68, 69, 70, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 86, 87, 89, 90, 91,
92, 93, 94, 95, 96, 98, 99])
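If you want to verify that the five samples really are disjoint, a quick check along these lines works (the True is guaranteed by the setdiff1d construction):
In [65]: flat = np.concatenate(samples)
In [66]: len(flat) == len(np.unique(flat))
Out[66]: True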

Related

Modifying alternate indices of 3d numpy array

I have a numpy array with shape (140, 23, 2): 140 frames, 23 objects, and (x, y) locations. The data was generated by a GAN, and when I animate the movement it's very jittery. I want to smooth it by replacing each object's coordinates at every odd index with the midpoint of the even indices on either side of it, e.g.
x[1] = (x[0] + x[2]) / 2
x[3] = (x[2] + x[4]) / 2
Below is my code:
def smooth_coordinates(df):
    # df shape is (140, 23, 2)
    # iterate through each object (23)
    for j in range(len(df[0])):
        # iterate through 140 frames
        for i in range(len(df)):
            # if it's an odd index with at least 1 index after it
            if (i % 2 != 0) and (i < (len(df[0]) - 2)):
                df[i][j][0] = (df[i-1][j][0] + df[i+1][j][0]) / 2
                df[i][j][1] = (df[i-1][j][1] + df[i+1][j][1]) / 2
    return df
Aside from it being very inefficient, my input df and output df are identical. Any suggestions for how to achieve this more efficiently?
import numpy as np
a = np.random.randint(100, size= [140, 23, 2]) # input array
b = a.copy()
i = np.ogrid[1: a.shape[0]-1: 2] # odd indices
i
>>> [ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25,
27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51,
53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77,
79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103,
105, 107, 109, 111, 113, 115, 117, 119, 121, 123, 125, 127, 129,
131, 133, 135, 137]
(a == b).all() # testing for equality
>>> True
a[i] = (a[i-1] + a[i+1]) / 2 # averaging positions across frames
(a == b).all() # testing for equality again
>>> False
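Wrapped into a function matching the question's interface (a sketch; the float cast avoids the silent integer truncation that the in-place demo above performs):
import numpy as np

def smooth_coordinates(a):
    # a has shape (frames, objects, 2)
    out = a.astype(float)                   # copy, and avoid integer truncation
    i = np.arange(1, a.shape[0] - 1, 2)     # odd frames with a neighbour on each side
    out[i] = (out[i - 1] + out[i + 1]) / 2  # midpoint across all objects at once
    return out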

Faster/lazier way to evenly and randomly split m*n elements into n groups (each with m elements) in python

I want to split m*n elements (e.g., 1, 2, ..., m*n) into n groups randomly and evenly, such that each group has m random elements. Each group will process k (k >= 1) elements at a time from its own group, at the same speed as the other groups (via some synchronization mechanism), until all groups have processed all of their own elements. Actually each group runs in an independent process/thread.
I use numpy.random.choice(m*n, m*n, replace=False) to generate the permutation first, and then index the permuted result from each group.
The problem is that when m*n is very large (e.g., >=1e8), the speed is very slow (tens of seconds or minutes).
Is there any faster/lazier way to do this? I think it could be done lazily: instead of generating the full permuted result up front, build a generator that yields k elements at a time in each group, with the same overall effect as the method I currently use. But I don't know how to achieve this lazy way, or whether it can actually be implemented.
You can make a generator that will progressively shuffle (a copy of) the list and lazily yield distinct groups:
import random

def rndGroups(A, size):
    A = A.copy()                      # work on a copy (if needed)
    p = len(A)                        # boundary of the not-yet-chosen prefix
    for _ in range(0, len(A), size):  # work in chunks of group size
        for _ in range(size):         # create one group
            i = random.randrange(p)   # random index among remaining items
            p -= 1                    # shrink the remaining prefix
            A[i], A[p] = A[p], A[i]   # swap chosen item to the boundary
        yield A[p:p+size]             # return shuffled sub-range
Output:
A = list(range(100))
iG = iter(rndGroups(A, 10))  # 10 groups of 10 items
s = set()                    # set to validate uniqueness
for _ in range(10):          # 10 groups
    g = next(iG)             # get the next group from the generator
    s.update(g)              # to check that all items are distinct
    print(g)
print(len(s))                # must get 100 distinct values from the groups
[87, 19, 85, 90, 35, 55, 86, 58, 96, 68]
[38, 92, 93, 78, 39, 62, 43, 20, 66, 44]
[34, 75, 72, 50, 42, 52, 60, 81, 80, 41]
[13, 14, 83, 28, 53, 5, 94, 67, 79, 95]
[9, 33, 0, 76, 4, 23, 2, 3, 32, 65]
[61, 24, 31, 77, 36, 40, 47, 49, 7, 97]
[63, 15, 29, 25, 11, 82, 71, 89, 91, 30]
[12, 22, 99, 37, 73, 69, 45, 1, 88, 51]
[74, 70, 98, 26, 59, 6, 64, 46, 27, 21]
[48, 17, 18, 8, 54, 10, 57, 84, 16, 56]
100
This will take just as long as pre-shuffling the list (if not longer), but it lets you start feeding threads as you go, increasing parallelism.
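For very large m*n, a possible numpy port of the same idea (a sketch using numpy.random.Generator; it assumes size divides n, and the swaps still run in a Python loop, so the main win is avoiding a full upfront permutation when only a few groups are consumed):
import numpy as np

def rnd_groups_np(n, size, seed=None):
    rng = np.random.default_rng(seed)
    a = np.arange(n)
    p = n                                       # boundary of the not-yet-chosen prefix
    for _ in range(0, n, size):
        # the t-th draw for this group is uniform over the p - t unchosen items
        idx = rng.integers(0, p - np.arange(size))
        for i in idx:                           # the swaps themselves are sequential
            p -= 1
            a[i], a[p] = a[p], a[i]
        yield a[p:p + size]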

Numpy random array from 0 to 99, including both

I'm trying to create a np array of size (80, 10) where each row has random values in the range 0 to 99.
I've done that with:
np.random.randint(100, size=(80, 10))
But I would like to always include both 0 and 99 as values in each row.
So two values in each row are already defined and the other 8 will be random.
How would I accomplish this? Is there a way to generate an array size (80,8) and just concatenate [0,99] to every row to make it (80,10) at the end?
As suggested in the comments by Tim, you can generate a matrix with random values that exclude 0 and 99, then replace two random indices along the second axis with the values 0 and 99.
rand_arr = np.random.randint(low=1, high=99, size=(80, 10))
rand_indices = np.random.rand(80,10).argsort(axis=1)[:,:2]
np.put_along_axis(rand_arr, rand_indices, [0,99], axis=1)
The motivation for using argsort is that we want random indices along the second axis without replacement. Just generating a random integer matrix of indices 0-9 with size=(80, 2) would not guarantee this.
In your scenario, you could use np.argpartition with kth=2 instead of np.argsort, which should be more efficient.
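For example (a sketch of that suggestion; argpartition only guarantees that the two smallest positions land in the first two columns, which is all we need here):
rand_indices = np.random.rand(80, 10).argpartition(2, axis=1)[:, :2]
np.put_along_axis(rand_arr, rand_indices, [0, 99], axis=1)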
I've tried a few things and this is what I came up with:
def generate_matrix(low, high, shape):
    x, y = shape
    values = np.random.randint(low + 1, high, size=(x, y - 2))  # values strictly between low and high
    predefined = np.tile([low, high], (x, 1))
    values = np.hstack([values, predefined])
    for row in values:
        np.random.shuffle(row)
    return values
Example usage
>>> generate_matrix(0, 99, (5, 10))
array([[94, 0, 45, 99, 18, 31, 78, 80, 32, 17],
[28, 99, 72, 3, 0, 14, 26, 37, 41, 80],
[18, 78, 71, 40, 99, 0, 85, 91, 8, 59],
[65, 99, 0, 45, 93, 94, 16, 33, 52, 53],
[22, 76, 99, 15, 27, 64, 91, 32, 0, 82]])
The way I approached it:
Generate an array of size (80, 8) in the range [1, 98] and then concatenate 0 and 99 to each row. But you probably need the 0/99 to occur at different indices for each row, so you have to shuffle. Unfortunately, np.random.shuffle() only shuffles the rows among themselves, and neither np.random.shuffle(arr.T).T nor random.Generator.permutation shuffles the columns independently. I haven't found a vectorised way to shuffle the rows independently other than a Python loop.
Another way:
Generate an array of size (80, 10) in the range [1, 98] and then substitute the values 0 and 99 at random indices in each row. Again, I couldn't find a way to generate unique indices per row (so that 0 doesn't overwrite 99, for example) without a Python loop. Since I couldn't avoid Python loops either way, I opted for the first way, which seemed more straightforward.
If you don't care about duplicates, create an integer array of zeros, fill columns 1-8 with random values in [1, 98], and set the last column to 99 (the first column stays 0):
final = np.zeros(shape=(80, 10), dtype=int)
final[:, 1:9] = np.random.randint(1, 99, size=(80, 8))
final[:, 9] = 99
Creating an 80x10 matrix with random values from 0 to 99, no duplicates within a row, and with 0 and 99 included in every row:
import random

row99 = [n for n in range(1, 99)]
perm = [n for n in range(0, 10)]
m = []
for i in range(80):
    random.shuffle(row99)
    random.shuffle(perm)
    r = row99[:10]
    r[perm[0]] = 0
    r[perm[1]] = 99
    m.append(r)
print(m)
Partial output:
[
... other elements here ...
[70, 58, 0, 25, 41, 10, 90, 5, 42, 18],
[0, 57, 90, 71, 39, 65, 52, 24, 28, 77],
[55, 42, 7, 9, 32, 69, 90, 0, 64, 2],
[0, 59, 17, 35, 56, 34, 33, 37, 90, 71]]
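A vectorised take on the same idea (a sketch: argsort of a random matrix gives each row an independent random permutation, which avoids the Python loop):
import numpy as np

vals = np.random.rand(80, 98).argsort(axis=1)[:, :10] + 1  # 10 distinct values in 1..98 per row
pos = np.random.rand(80, 10).argsort(axis=1)[:, :2]        # 2 distinct positions per row
np.put_along_axis(vals, pos, [0, 99], axis=1)              # place 0 and 99 without collisions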

Choosing randomly all the elements in the list just once

How is it possible to randomly choose a number from a list of n elements, n times, without picking the same element of the list twice? I wrote code that chooses the sequence numbers of elements in the list, but it is slow:
>>>redshift=np.array([0.92,0.17,0.51,1.33,....,0.41,0.82])
>>>redshift.shape
(1225,)
exclude = []
k = 0
ng = 1225
while (k < ng):
    flag1 = 0
    sq = random.randint(0, ng)
    while (flag1 < 1):
        if sq in exclude:
            flag1 = 1
            sq = random.randint(0, ng)
        else:
            print(sq)
            exclude.append(sq)
            flag1 = 0
    z = redshift[sq]
    k += 1
And it doesn't end up choosing every sequence number in the list.
Since you are already using a numpy array, you may as well use the tools in that package.
You can use numpy.random.choice with replace=False. That will only use each element once:
>>> redshift=np.array([0.92,0.17,0.51,1.33,0.41,0.82])
>>> np.random.choice(redshift, redshift.size, replace=False)
array([ 0.41, 0.82, 0.17, 1.33, 0.92, 0.51])
Since each element is used only once, trying to take more elements than the array holds raises a ValueError:
>>> np.random.choice(redshift, redshift.size+1, replace=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mtrand.pyx", line 1051, in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:8075)
ValueError: Cannot take a larger sample than population when 'replace=False'
You can also use np.random.shuffle if you just want to shuffle the array (or a copy) in place:
>>> redshift
array([ 0.92, 0.17, 0.51, 1.33, 0.41, 0.82])
>>> np.random.shuffle(redshift)
>>> redshift
array([ 0.41, 0.82, 1.33, 0.51, 0.92, 0.17])
Please note that it is significantly faster to keep a numpy array in numpy rather than converting back to a Python data structure for the sampling:
>>> from timeit import timeit
>>> import random
>>> source=range(1000000)
>>> a=np.array(source)
>>> timeit('np.random.choice(a, a.size, replace=False)', setup='from __main__ import np, a', number=10)
2.971310766064562
>>> timeit('random.sample(list(a), a.size)', setup='from __main__ import random, a', number=10)
14.129850425058976
In this test case, that is more than 4x faster.
If you want to keep the original list in its original order (random.sample does not modify the population) and don't want to create and then shuffle a copy, you can use random.sample(lst, n) for any n <= len(lst):
>>> import random
>>> n = 10
>>> random.sample(range(n), n)
[4, 6, 5, 2, 3, 7, 9, 0, 1, 8]
Why not just shuffle the list and iterate through the elements:
from random import shuffle
a = list(range(100))
shuffle(a)
>>> print(a)
[5, 82, 96, 66, 47, 62, 49, 86, 55, 4, 21, 94, 34, 46, 10, 32, 83, 13, 25, 24, 58, 74, 14, 43, 18, 42, 56, 23, 52, 36, 15, 60, 79, 29, 0, 72, 38, 88, 41, 85, 57, 69, 30, 45, 70, 31, 84, 63, 92, 48, 68, 22, 40, 59, 95, 11, 39, 78, 89, 64, 6, 20, 91, 37, 61, 28, 71, 12, 8, 19, 1, 98, 50, 97, 26, 53, 73, 17, 16, 87, 33, 9, 99, 90, 93, 81, 7, 44, 65, 80, 54, 51, 67, 27, 3, 2, 76, 77, 75, 35]
One possibility is to use np.random.permutation, if you do not have any space constraints
import numpy as np
rng = np.random.RandomState(42)
redshift=np.array([0.92,0.17,0.51,1.33,0.41,0.82]) # A subset of your array
perm = rng.permutation(len(redshift))
redshift_perm = redshift[perm]
print(redshift)
print(perm)
print(redshift_perm)
# yields
# [ 0.92 0.17 0.51 1.33 0.41 0.82]
# [0 1 5 2 4 3]
# [ 0.92 0.17 0.82 0.51 0.41 1.33]
You can use random.sample; from the docs:
Return a k length list of unique elements chosen from the population
sequence. Used for random sampling without replacement.
Example:
from random import sample

my_list = list(range(10))
for value in sample(my_list, len(my_list)):
    print(value)

python seed() not keeping same sequence

I'm using random.seed() to try to keep random.sample() returning the same values as I sample more values from a list, but at some point the numbers change. I thought the whole purpose of seed() was to keep the numbers the same.
Here's a test I did showing that the numbers don't stay the same:
import random
a = range(0, 100)
random.seed(1)
a = random.sample(a, 10)
print(a)
Then raise the sample size and the sequence changes (at least for me it always does):
a = random.sample(a, 40)
print(a)
I'm sort of a newb, so maybe this is an easy fix, but I would appreciate any help on this.
Thanks!
If you were to draw independent samples from the generator, what would happen would be exactly what you're expecting:
In [1]: import random
In [2]: random.seed(1)
In [3]: [random.randint(0, 99) for _ in range(10)]
Out[3]: [13, 84, 76, 25, 49, 44, 65, 78, 9, 2]
In [4]: random.seed(1)
In [5]: [random.randint(0, 99) for _ in range(40)]
Out[5]: [13, 84, 76, 25, 49, 44, 65, 78, 9, 2, 83, 43 ...]
As you can see, the first ten numbers are indeed the same.
It is the fact that random.sample() draws without replacement that gets in the way. CPython's sample switches between two algorithms depending on the ratio of sample size to population size (a partial Fisher-Yates shuffle of a pool, or set-based rejection sampling), so the two calls consume the underlying random numbers differently, and a larger sample need not begin with the same elements as a smaller one drawn from the same seed.
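A quick way to see this for yourself (the exact output depends on your Python version):
import random

random.seed(1)
print(random.sample(range(20), 5))
random.seed(1)
print(random.sample(range(20), 10))  # the first 5 values may differ from the 5-sample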
One alternative is to shuffle a list once and then take the first 10 or 40 elements:
In [1]: import random
In [2]: a = list(range(100))
In [3]: random.shuffle(a)
In [4]: a[:10]
Out[4]: [48, 27, 28, 4, 67, 76, 98, 68, 35, 80]
In [5]: a[:40]
Out[5]: [48, 27, 28, 4, 67, 76, 98, 68, 35, 80, ...]
random.sample is deterministic only if both the seed and the sample size are held constant. Even if you reset the seed, generating a sample of a different length is not "the same" random operation and may yield a different initial subsequence. The same random numbers are generated internally, but sample uses them to build the result differently depending on how large a sample you ask for.
You are assuming an implementation of random.sample something like this:
def samples(lst, k):
    n = len(lst)
    indices = []
    while len(indices) < k:
        index = random.randrange(n)
        if index not in indices:
            indices.append(index)
    return [lst[i] for i in indices]
Which gives:
>>> random.seed(1)
>>> samples(list(range(20)), 5)
[4, 18, 2, 8, 3]
>>> random.seed(1)
>>> samples(list(range(20)), 10)
[4, 18, 2, 8, 3, 15, 14, 12, 6, 0]
However, that isn't how random.sample is actually implemented; seed does work the way you think, it's sample that doesn't!
You simply need to re-seed it:
a = list(range(100))
random.seed(1) # seed first time
random.sample(a, 10)
>> [17, 72, 97, 8, 32, 15, 63, 57, 60, 83]
random.seed(1) # seed second time with same value
random.sample(a, 40)
>> [17, 72, 97, 8, 32, 15, 63, 57, 60, 83, 48, 26, 12, 62, 3, 49, 55, 77, 0, 92, 34, 29, 75, 13, 40, 85, 2, 74, 69, 1, 89, 27, 54, 98, 28, 56, 93, 35, 14, 22]
But in your case you rebind a to the sample itself, so after the first call a shrinks from 100 elements to 10 and you lose the elements you did not sample; the second sample(a, 40) would then draw from the wrong (and too small) population. So keep the full list intact and seed before every sampling.
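A minimal version of the pattern this answer recommends (a sketch; whether the 40-sample starts with the same 10 values is an implementation detail of CPython's sample, as the earlier answer notes, so don't rely on it across versions):
import random

population = list(range(100))  # keep the population intact
random.seed(1)                 # seed first time
s10 = random.sample(population, 10)
random.seed(1)                 # seed again with the same value
s40 = random.sample(population, 40)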
