I'm trying to create a NumPy array of shape (80, 10) where each row holds random values in the range 0 to 99.
I've done that by
np.random.randint(100, size=(80, 10))  # high is exclusive, so values are 0-99
But I would like to always include both 0 and 99 as values in each row.
So two values in each row are already defined and the other 8 will be random.
How would I accomplish this? Is there a way to generate an array of shape (80, 8) and just concatenate [0, 99] onto every row to make it (80, 10) at the end?
As suggested in the comments by Tim, you can generate a matrix with random values not including 0 and 99. Then replace two random indices along the second axis with the values 0 and 99.
rand_arr = np.random.randint(low=1, high=99, size=(80, 10))   # values in [1, 98], excluding 0 and 99
rand_indices = np.random.rand(80, 10).argsort(axis=1)[:, :2]  # two distinct column indices per row
np.put_along_axis(rand_arr, rand_indices, [0, 99], axis=1)
The motivation for using argsort is that we want two random indices along the second axis without replacement. Simply generating a random integer matrix over the indices 0-9 with size=(80, 2) would not guarantee that the two indices in a row are distinct.
In your scenario, you could use np.argpartition with kth=2 instead of np.argsort; this should be more efficient since a full sort is not needed.
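A minimal sketch of that argpartition variant (same shapes as above; argpartition only guarantees that the two smallest keys land in the first two columns, which is all we need here):
import numpy as np

rand_arr = np.random.randint(low=1, high=99, size=(80, 10))  # values in [1, 98]
keys = np.random.rand(80, 10)                                # random sort keys per row
rand_indices = np.argpartition(keys, kth=2, axis=1)[:, :2]   # two distinct indices, unordered
np.put_along_axis(rand_arr, rand_indices, [0, 99], axis=1)

# sanity check: every row now contains both 0 and 99
assert (rand_arr == 0).any(axis=1).all() and (rand_arr == 99).any(axis=1).all()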
I've tried a few things and this is what I came up with
def generate_matrix(low, high, shape):
    x, y = shape
    # random values strictly between low and high (randint's upper bound is exclusive)
    values = np.random.randint(low + 1, high, size=(x, y - 2))
    # append the two fixed endpoints to every row
    predefined = np.tile([low, high], (x, 1))
    values = np.hstack([values, predefined])
    # shuffle each row in place so the endpoints land at random positions
    for row in values:
        np.random.shuffle(row)
    return values
Example usage
>>> generate_matrix(0, 99, (5, 10))
array([[94, 0, 45, 99, 18, 31, 78, 80, 32, 17],
[28, 99, 72, 3, 0, 14, 26, 37, 41, 80],
[18, 78, 71, 40, 99, 0, 85, 91, 8, 59],
[65, 99, 0, 45, 93, 94, 16, 33, 52, 53],
[22, 76, 99, 15, 27, 64, 91, 32, 0, 82]])
The way I approached it:
Generate an array of shape (80, 8) in the range [1, 98] and then append 0 and 99 to each row. But you probably need the 0/99 to occur at different indices in each row, so you have to shuffle them. Unfortunately, np.random.shuffle() only shuffles the rows among themselves, and shuffling the transpose with np.random.shuffle(arr.T), or using random.Generator.permutation, only reorders whole columns; neither shuffles each row independently. I haven't found a vectorised way to shuffle the rows independently other than using a Python loop.
Another way:
You can generate an array of shape (80, 10) in the range [1, 98] and then substitute the values 0 and 99 at random indices in each row. Again, I couldn't find a way to generate unique indices per row (so that the 0 doesn't overwrite the 99, for example) without a Python loop. Since I couldn't avoid a Python loop either way, I opted for the first approach, which seemed more straightforward.
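As an aside, and assuming NumPy 1.20 or newer is available: random.Generator.permuted (unlike permutation) does shuffle along an axis independently, which would vectorise the per-row shuffle. A sketch:
import numpy as np

rng = np.random.default_rng()
values = rng.integers(1, 99, size=(80, 8))               # values in [1, 98]
values = np.hstack([values, np.tile([0, 99], (80, 1))])  # append the endpoints
values = rng.permuted(values, axis=1)                    # each row shuffled independently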
If you don't care about duplicates or about where the 0 and 99 land, create an integer array of zeros, fill the middle eight columns with random values in [1, 98], and set the last column to 99; the first column stays 0.
final = np.zeros(shape=(80, 10), dtype=int)
final[:, 1:9] = np.random.randint(98, size=(80, 8)) + 1  # values in [1, 98]
final[:, 9] = 99
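If the fixed positions of the endpoints matter, each row of final can then be shuffled in place with the same Python loop as in the earlier answer:
for row in final:
    np.random.shuffle(row)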
Creating an 80x10 matrix of random values from 0 to 99, with no duplicates within a row and with 0 and 99 included in every row:
import random

row99 = list(range(1, 99))  # candidate values 1..98; 0 and 99 are placed separately
perm = list(range(10))      # the ten column positions
m = []
for i in range(80):
    random.shuffle(row99)   # the first 10 entries are now 10 distinct values
    random.shuffle(perm)    # pick two distinct positions for 0 and 99
    r = row99[:10]
    r[perm[0]] = 0
    r[perm[1]] = 99
    m.append(r)
print(m)
Partial output:
[
... other elements here ...
[70, 58, 0, 25, 41, 10, 99, 5, 42, 18],
[0, 57, 99, 71, 39, 65, 52, 24, 28, 77],
[55, 42, 7, 9, 32, 69, 99, 0, 64, 2],
[0, 59, 17, 35, 56, 34, 33, 37, 99, 71]]
Related
I have two lists of marks for the same set of students. For example:
A = [22, 2, 88, 3, 93, 84]
B = [66, 0, 6, 33, 99, 45]
If I accept only students above a threshold according to list A then I can look up their marks in list B. For example, if I only accept students with at least a mark of 80 from list A then their marks in list B are [6, 99, 45].
I would like to compute the smallest threshold for A such that at least 90% of the students in the derived set for B get at least 50. In this example the threshold has to be 93, which gives the list [99] for B.
Another example:
A = [3, 36, 66, 88, 99, 52, 55, 42, 10, 70]
B = [5, 30, 60, 80, 80, 60, 45, 45, 15, 60]
In this case we have to set the threshold to 66 which then gives 100% of [60, 80, 80, 60] getting at least 50.
This is an O(n log n) approach (due to the sort), where n is the number of students:
from operator import itemgetter
from itertools import accumulate

def find_threshold(lst_a, lst_b):
    # order the indices of lst_a by mark, ascending
    indices, values = zip(*sorted(enumerate(lst_a), key=itemgetter(1)))
    # cumulative count of lst_b marks of at least 50, in that order
    cumulative = list(accumulate(int(lst_b[j] >= 50) for j in indices))
    for i, value in enumerate(values):
        # passes among the students whose lst_a mark is at least `value`
        passes = cumulative[-1] - (cumulative[i - 1] if i else 0)
        if passes >= 0.9 * (len(values) - i):
            return value
    return None
print(find_threshold([22, 2, 88, 3, 93, 84], [66, 0, 6, 33, 99, 45]))
print(find_threshold([3, 36, 66, 88, 99, 52, 55, 42, 10, 70], [5, 30, 60, 80, 80, 60, 45, 45, 15, 60]))
Output
93
66
First, define a function that tells you whether at least 90% of the students in a set scored at least 50:
def setb_90pc_pass(b):
    return sum(score >= 50 for score in b) >= len(b) * 0.9
Next, loop over the scores in A in ascending order, setting each of them as the threshold. Filter your lists according to that threshold and check whether they fulfil your condition:
for threshold in sorted(A):
    filtered_a, filtered_b = [], []
    for ai, bi in zip(A, B):
        if ai >= threshold:
            filtered_a.append(ai)
            filtered_b.append(bi)
    if setb_90pc_pass(filtered_b):
        break
print(threshold)
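Wrapped as a function for reuse (a sketch; the name find_threshold_bruteforce is mine, and it returns None when no threshold satisfies the condition):
def find_threshold_bruteforce(A, B):
    for threshold in sorted(A):
        filtered_b = [bi for ai, bi in zip(A, B) if ai >= threshold]
        if setb_90pc_pass(filtered_b):
            return threshold
    return None

print(find_threshold_bruteforce([22, 2, 88, 3, 93, 84], [66, 0, 6, 33, 99, 45]))  # 93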
I want to split m*n elements (e.g., 1, 2, ..., m*n) into n groups randomly and evenly, so that each group has m random elements. Each group will process k (k >= 1) elements at a time from its own group, at the same speed as the other groups (via some synchronization mechanism), until every group has processed all of its own elements. Each group actually runs in an independent process/thread.
I use numpy.random.choice(m*n, m*n, replace=False) to generate the permutation first, and then have each group index into the permuted result.
The problem is that when m*n is very large (e.g., >= 1e8), this is very slow (tens of seconds to minutes).
Is there a faster/lazier way to do this? I think it could be done lazily: rather than generating the whole permuted result up front, build a generator, and within each group generate k elements at a time, with the same overall effect as the method I currently use. But I don't know how to achieve this lazy approach, or whether it can actually be implemented.
You can make a generator that will progressively shuffle (a copy of) the list and lazily yield distinct groups:
import random
def rndGroups(A, size):
    A = A.copy()                      # work on a copy (if needed)
    p = len(A)                        # boundary of the not-yet-drawn items
    for _ in range(0, len(A), size):  # work in chunks of the group size
        for _ in range(size):         # create one group
            i = random.randrange(p)   # random index among the remaining items
            p -= 1                    # shrink the remaining range
            A[i], A[p] = A[p], A[i]   # swap the drawn item to the end
        yield A[p:p + size]           # yield the freshly shuffled sub-range
Example usage and output:
A = list(range(100))
iG = rndGroups(A, 10)   # 10 groups of 10 items
s = set()               # set to validate uniqueness across groups
for _ in range(10):     # 10 groups
    g = next(iG)        # get the next group from the generator
    s.update(g)         # collect items to check that all are distinct
    print(g)
print(len(s))           # must be 100: all values distinct across groups
[87, 19, 85, 90, 35, 55, 86, 58, 96, 68]
[38, 92, 93, 78, 39, 62, 43, 20, 66, 44]
[34, 75, 72, 50, 42, 52, 60, 81, 80, 41]
[13, 14, 83, 28, 53, 5, 94, 67, 79, 95]
[9, 33, 0, 76, 4, 23, 2, 3, 32, 65]
[61, 24, 31, 77, 36, 40, 47, 49, 7, 97]
[63, 15, 29, 25, 11, 82, 71, 89, 91, 30]
[12, 22, 99, 37, 73, 69, 45, 1, 88, 51]
[74, 70, 98, 26, 59, 6, 64, 46, 27, 21]
[48, 17, 18, 8, 54, 10, 57, 84, 16, 56]
100
This will take about as long as pre-shuffling the whole list (if not longer), but it lets you start feeding threads as you go, increasing the parallelism.
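A minimal sketch of feeding the groups to worker threads as they are drawn (the work function and the pool size are placeholders, not part of the original question):
from concurrent.futures import ThreadPoolExecutor

def work(group):
    return sum(group)  # stand-in for the per-group processing

A = list(range(1000))
with ThreadPoolExecutor(max_workers=4) as ex:
    # tasks are submitted as each group is drawn, so workers can
    # start on early groups before the later ones exist
    results = list(ex.map(work, rndGroups(A, 250)))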
I have a list of integers and I want to efficiently perform operations like addition, multiplication, or floor division on every element of a list slice (sub-array), i.e. at the indices given by some range(start, end, jump). The number being added to or multiplied with each element of the slice is a constant (say 'k').
For example:
nums = [23, 44, 65, 78, 87, 11, 33, 44, 3]
for i in range(2, 7, 2):
    nums[i] //= 2  # here 2 is the constant 'k'
print(nums)
>>> [23, 44, 32, 78, 43, 11, 16, 44, 3]
I have to perform these operations several times on different slices/ranges, and the constant 'k' varies between them. The obvious way is to run a for loop and modify the elements, but that isn't fast enough. A numpy array would do this efficiently with bulk assignment/modification, but I am looking for a way to do it in pure Python.
One way to avoid the for loop is the following:
>>> nums = [23, 44, 65, 78, 87, 11, 33, 44, 3]
>>> nums[2:7:2] = [x//2 for x in nums[2:7:2]]
>>> nums
[23, 44, 32, 78, 43, 11, 16, 44, 3]
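This generalises to any slice and any constant; a small sketch (the helper name floordiv_slice is mine):
def floordiv_slice(nums, start, end, jump, k):
    # one bulk slice assignment instead of an index loop
    nums[start:end:jump] = [x // k for x in nums[start:end:jump]]

nums = [23, 44, 65, 78, 87, 11, 33, 44, 3]
floordiv_slice(nums, 2, 7, 2, 2)
print(nums)  # [23, 44, 32, 78, 43, 11, 16, 44, 3]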
I have a pandas dataframe containing ~200,000 rows and I would like to create 5 random samples of 1000 rows each; however, I do not want any of these samples to contain the same row twice.
To create a random sample I have been using:
import numpy as np
rows = np.random.choice(df.index.values, 1000)
sampled_df = df.ix[rows]
However just doing this several times would run the risk of having duplicates. Would the best way to handle this be keeping track of which rows are sampled each time?
You can use df.sample.
A dataframe with 100 rows and 5 columns:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 5), columns=list("abcde"))
Sample 5 rows:
df.sample(5)
Out[8]:
a b c d e
84 0.012201 -0.053014 -0.952495 0.680935 0.006724
45 -1.347292 1.358781 -0.838931 -0.280550 -0.037584
10 -0.487169 0.999899 0.524546 -1.289632 -0.370625
64 1.542704 -0.971672 -1.150900 0.554445 -1.328722
99 0.012143 -2.450915 -0.718519 -1.192069 -1.268863
This ensures those 5 rows are all different. If you want to repeat the process, I'd suggest sampling number_of_rows * number_of_samples rows in one go. For example, if each sample is to contain 5 rows and you need 10 samples, sample 50 rows: the first 5 form the first sample, the second 5 the second, and so on.
all_samples = df.sample(50)
samples = [all_samples.iloc[5*i:5*i+5] for i in range(10)]
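Equivalently, assuming the oversample divides evenly, np.array_split can cut it into the individual samples:
samples = np.array_split(df.sample(50), 10)  # list of 10 disjoint 5-row DataFrames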
You can set replace to False in np.random.choice
rows = np.random.choice(df.index.values, 1000, replace=False)
Take a look at the numpy.random docs.
For your solution:
import numpy as np

rows = np.random.choice(df.index.values, 1000, replace=False)
sampled_df = df.loc[rows]  # df.ix has since been removed from pandas; df.loc does the same here
This will make the random choices without replacement.
If you want to generate multiple samples such that no two share any elements, remove each sample's elements from the pool after every iteration. You can use numpy.setdiff1d for that.
import numpy as np

allRows = df.index.values
numOfSamples = 5
samples = list()
for i in range(numOfSamples):
    choices = np.random.choice(allRows, 1000, replace=False)
    samples.append(choices)
    allRows = np.setdiff1d(allRows, choices)  # drop the chosen rows from the pool
Here is a working example with a range of numbers between 0 and 100:
In [58]: import numpy as np
In [59]: allRows = np.arange(100)
In [60]: numOfSamples = 5
In [61]: samples = list()
In [62]: for i in range(numOfSamples):
....: choices = np.random.choice(allRows, 5, replace=False)
....: samples.append(choices)
....: allRows = np.setdiff1d(allRows, choices)
....:
In [63]: samples
Out[63]:
[array([66, 24, 47, 31, 22]),
array([ 8, 28, 15, 62, 52]),
array([18, 65, 71, 54, 48]),
array([59, 88, 43, 7, 85]),
array([97, 36, 55, 56, 14])]
In [64]: allRows
Out[64]:
array([ 0, 1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 16, 17, 19, 20, 21,
23, 25, 26, 27, 29, 30, 32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 44,
45, 46, 49, 50, 51, 53, 57, 58, 60, 61, 63, 64, 67, 68, 69, 70, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 86, 87, 89, 90, 91,
92, 93, 94, 95, 96, 98, 99])
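A simpler sketch that achieves the same disjointness with a single shuffle: permute the index once and slice it into consecutive chunks (default_rng assumes NumPy 1.17+):
import numpy as np

rng = np.random.default_rng()
shuffled = rng.permutation(df.index.values)
samples = [shuffled[i * 1000:(i + 1) * 1000] for i in range(5)]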
I'm using random.seed() to try to keep random.sample() reproducible as I sample more values from a list, but at some point the numbers change. I thought the whole purpose of seed() was to keep the numbers the same.
Here's a test I did showing that the numbers do not stay the same:
import random

a = range(0, 100)
random.seed(1)
a = random.sample(a, 10)
print(a)
Then change the sample size to something much higher and the sequence changes (at least for me it always does):
a = random.sample(a, 40)
print(a)
I'm sort of a newbie, so maybe this is an easy fix, but I would appreciate any help on this.
Thanks!
If you were to draw independent samples from the generator, what would happen would be exactly what you're expecting:
In [1]: import random
In [2]: random.seed(1)
In [3]: [random.randint(0, 99) for _ in range(10)]
Out[3]: [13, 84, 76, 25, 49, 44, 65, 78, 9, 2]
In [4]: random.seed(1)
In [5]: [random.randint(0, 99) for _ in range(40)]
Out[5]: [13, 84, 76, 25, 49, 44, 65, 78, 9, 2, 83, 43 ...]
As you can see, the first ten numbers are indeed the same.
It is the fact that random.sample() draws samples without replacement that gets in the way. To understand how such algorithms can work, see Reservoir Sampling. In essence, later draws can push earlier ones out of the result set.
One alternative might be to shuffle the list once and then take the first 10 or the first 40 elements:
In [1]: import random
In [2]: a = list(range(100))
In [3]: random.shuffle(a)
In [4]: a[:10]
Out[4]: [48, 27, 28, 4, 67, 76, 98, 68, 35, 80]
In [5]: a[:40]
Out[5]: [48, 27, 28, 4, 67, 76, 98, 68, 35, 80, ...]
It seems that random.sample is deterministic only if both the seed and the sample size are kept constant. Even after resetting the seed, generating a sample of a different length is not "the same" random operation: the same random numbers are generated internally, but sample uses them differently depending on how large a sample you ask for, so a larger sample need not begin with the same elements as a smaller one.
You are assuming an implementation of random.sample something like this:
def samples(lst, k):
    n = len(lst)
    indices = []
    while len(indices) < k:
        index = random.randrange(n)
        if index not in indices:
            indices.append(index)
    return [lst[i] for i in indices]
Which gives:
>>> random.seed(1)
>>> samples(list(range(20)), 5)
[4, 18, 2, 8, 3]
>>> random.seed(1)
>>> samples(list(range(20)), 10)
[4, 18, 2, 8, 3, 15, 14, 12, 6, 0]
However, that isn't how random.sample is actually implemented; seed does work how you think, it's sample that doesn't!
You simply need to re-seed it:
a = list(range(100))
random.seed(1) # seed first time
random.sample(a, 10)
>> [17, 72, 97, 8, 32, 15, 63, 57, 60, 83]
random.seed(1) # seed second time with same value
random.sample(a, 40)
>> [17, 72, 97, 8, 32, 15, 63, 57, 60, 83, 48, 26, 12, 62, 3, 49, 55, 77, 0, 92, 34, 29, 75, 13, 40, 85, 2, 74, 69, 1, 89, 27, 54, 98, 28, 56, 93, 35, 14, 22]
But in your case you rebind a to the 10-element result of the first sample, so the second call draws from that shrunken list instead of the original 100 values (and asking it for 40 items would in fact raise a ValueError). So keep the original list intact, and seed before every sampling.