I'm trying to address this question:
Generate 1,000 random samples of size 50 from population. Calculate the mean of each of these samples (so you should have 1,000 means) and put them in a list norm_samples_50.
My guess is that I have to use the randn function, but I can't quite work out the right syntax from the question above. I've done some research and can't find an answer that fits.
A very efficient solution using NumPy:
import numpy

sample_list = []
for i in range(1000):  # 1,000 samples
    rand_list = numpy.random.randint(0, 1000, 50)  # one sample: 50 random values in [0, 1000)
    sample_list.append(sum(rand_list) / 50)        # mean of this sample
Python one-liner:
from numpy.random import randint
sample_list = [sum(randint(0, 1000, 50)) / 50 for _ in range(1000)]
Why use NumPy? Its random number generation is implemented in C and operates on whole arrays at once, so for this kind of numerical work it is much faster than generating values one at a time with random from the standard library. The standard library is perfectly fine here, it just won't be as quick.
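If you want to skip the Python-level loop entirely, the same idea can be written as a single vectorized call (a minimal sketch, assuming the same 0-999 value range as above):

import numpy as np

# draw a 1000 x 50 matrix in one call, then average each row to get the 1,000 means
norm_samples_50 = np.random.randint(0, 1000, size=(1000, 50)).mean(axis=1).tolist()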
Is this what you wanted?
import random
# Creating a population replace with your own:
population = [random.randint(0, 1000) for x in range(1000)]
# Creating the list to store all the means of each sample:
means = []
for x in range(1000):
    # Creating a random sample of the population with size 50:
    sample = random.sample(population, 50)
    # Getting the sum of the values in the sample, then dividing by 50:
    mean = sum(sample) / 50
    # Adding this mean to the list of means:
    means.append(mean)
Suppose I have a coin that lands on heads with probability P. The experiment to be performed is to keep flipping the coin some number of times; this experiment is to be repeated 1000 times.
Question: Is there an efficient/vectorized approach to generate an array of random 1's (with probability P) and 0's (with probability 1-P)?
If I try something like this:
np.full(10,rng().choice((0,1),p= [.3,.7]))
The entire array is filled with the same selection. I have seen solutions that involve a fixed ratio of zeros to ones.
a = np.ones(n+m)
a[:m] = 0
np.random.shuffle(a)
However I'm not sure how to preserve the stochastic nature of the experiments with this set up.
Presently I am just looping through each iteration as follows, but it is very slow once the number of experiments gets large.
(The actual experiment involves terminating each trial when two consecutive heads are flipped, which is why there is a while loop in the code. For the purposes of making the question specific, I didn't want to address that here.)
Set = [0, 1]
T = np.ones(Episodes)
for i in range(Episodes):
    a = rng().choice(Set, p=[(1 - p), p])
    counter = 1
    while True:
        b = rng().choice(Set, p=[(1 - p), p])
        counter += 1
        if (a == 1) & (b == 1):
            break
        a = b
    T[i] = counter
Any insights would be appreciated, thanks!
Answers were provided by @Quang Hong and @Kevin in the comments above; I'm just reposting them with default_rng() so they are easier to reference later. They are the true heroes here.
from numpy.random import default_rng as rng
rng().binomial(1, p = .7, size=(10,10))
rng().choice((0,1),p = [.3,.7], size=(10,10))
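For the follow-up concern in the question (stopping each episode at the first pair of consecutive heads), one way such a pre-generated array might be used is sketched below. This is only a sketch: max_flips is an assumed cap on flips per episode, and episodes that never produce two consecutive heads within that cap would need special handling.

from numpy.random import default_rng as rng
import numpy as np

p, Episodes, max_flips = 0.7, 1000, 200   # max_flips is an assumed cap on flips per episode
flips = rng().binomial(1, p, size=(Episodes, max_flips))
# True where positions j and j+1 are both heads
pairs = (flips[:, :-1] == 1) & (flips[:, 1:] == 1)
# argmax gives the index of the first True in each row; +2 converts it to a flip count
T = pairs.argmax(axis=1) + 2
# caveat: a row with no pair inside the cap also yields index 0, so such episodes
# would need a retry with a larger max_flips (detectable with pairs.any(axis=1))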
I need to sample multiple (M) times from N different normal distributions. This repeated sampling will happen in turn several thousand times. I want to do this in the most efficient way, because I would like to not die of old age before this process ends. The code would look something like this:
import numpy as np
# bunch of stuff that is unrelated to the problem
number_of_repeated_processes = 5000
number_of_samples_per_process = 20
# the normal distributions I'm sampling from are described by 2 vectors:
#
# myMEANS <- a NumPy array of length 10 containing the means of the distributions
# myVAR <- a NumPy array of length 10 containing the variances of the distributions
for i in range(number_of_repeated_processes):
    # myRESULT is a list of arrays containing the results for the sampling
    #
    myRESULT = [np.random.normal(loc=myMEANS[j], scale=myVAR[j], size=number_of_samples_per_process)
                for j in range(10)]
    #
    # here do something with myRESULT
# end for loop
The question is: is there a better way to obtain the myRESULT matrix?
np.random.normal accepts arrays for loc and scale directly, and you can choose a size that covers all the sampling in one run, without loops:
myRESULT = np.random.normal(loc=myMEANS, scale=myVAR, size = (number_of_samples_per_process, number_of_repeated_processes,myMEANS.size))
This will return a number_of_samples_per_process by number_of_repeated_processes array for each mean-scale pair in your myMEANS-myVAR arrays. For example, to access the samples drawn with myMEANS[i] and myVAR[i], use myRESULT[..., i]. (Note that scale is the standard deviation, so pass np.sqrt(myVAR) if myVAR really holds variances.) This should boost your performance considerably, since it removes the Python-level loop.
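As a quick illustration with made-up numbers (the names below mirror the question; the concrete values are only for demonstration):

import numpy as np

# illustrative inputs
myMEANS = np.arange(10, dtype=float)
myVAR = np.full(10, 2.0)
number_of_repeated_processes = 5000
number_of_samples_per_process = 20

# scale is the standard deviation, hence the sqrt on the variances
myRESULT = np.random.normal(loc=myMEANS, scale=np.sqrt(myVAR),
                            size=(number_of_samples_per_process,
                                  number_of_repeated_processes, myMEANS.size))

print(myRESULT.shape)          # (20, 5000, 10)
print(myRESULT[..., 3].shape)  # (20, 5000): every sample drawn with myMEANS[3], myVAR[3]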
I'm trying to create a python version of the attentional network task. See this as a reference (page 3): http://www.researchgate.net/publication/7834908_The_Activation_of_Attentional_Networks
I have a total of 216 trials. Half of which will be "congruent", half are "incongruent". Furthermore, a third of the 216 will be "nocue", another third will be "center", and the final third will be "spatial"
Each of the 216 trials will be some combination of the above (e.g. congruent-spatial, incongruent-none)
This is how I'm creating those trials right now:
import pandas as pd
import numpy as np
import random
#set number of trials
numTrials = 216
numCongruent = numTrials / 2
numCue = numTrials / 3
#create shuffled congruency conditions
congruent = ["congruent"] * numCongruent
incongruent = ["incongruent"] * numCongruent
congruentConditions = congruent + incongruent
random.shuffle(congruentConditions)
#create shuffled cue conditions
noCue = ["none"] * numCue
centerCue = ["center"] * numCue
spatialCue = ["spatial"] * numCue
cueConditions = noCue + centerCue + spatialCue
random.shuffle(cueConditions)
#put everything into a dataframe
df = pd.DataFrame()
congruentArray = np.asarray(congruentConditions)
cueArray = np.asarray(cueConditions)
df["congruent"] = congruentArray
df["cue"] = cueArray
print df
2 questions...
Now, this works, but one important point is ensuring even distribution of the levels.
For example, I need to ensure that all of the "congruent" trials contain an equal number of "nocue", "center", and "spatial" trials. And conversely, all of the "nocue" trials, for example, need to have an equal number of "congruent" and "incongruent" trials.
This is currently not ensured given the way I'm randomly shuffling the conditions. This would even out over an infinite sample size, but that is not the case here.
How would I ensure an equal distribution?
I've taken a look at the Cartesian product (https://docs.python.org/2/library/itertools.html#itertools.product), but I'm not entirely sure that will help me solve the balance problem.
Once the above has been solved, I then need to ensure that in the final shuffled list, each trial type (e.g. congruent-spatial) follows each other trial type an equal number of times in the list order
One easy option is to generate a list of the 216 trials and shuffle it:
In [16]: opt1 = ["congruent", "incongruent"]
In [17]: opt2 = ["nocue", "center", "spatial"]
In [18]: from itertools import product
In [19]: trials = list(product(opt1, opt2))*36
In [20]: np.random.shuffle(trials)
trials will then be a randomly ordered list with 36 of each of the pairs.
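A quick sanity check on the counts, if you want one:

from collections import Counter
print(Counter(trials))  # each of the six (congruency, cue) pairs should appear 36 times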
EDIT: Your edit is a harder problem, and honestly, I'd need to think more about it to figure out if there is a solution or to prove that you can't have that desired property.
If "close enough" to even works, the best I could come up with is a bogosort approach: shuffle the list, check whether all of the a->b counts are between 4-8, and start over if they're not. Generally runs in 1-5 seconds on my machine:
def checkvals(v):
    # the counts of what follows a given trial type must all be between 4 and 8
    return all(x in (4, 5, 6, 7, 8) for x in v[1].value_counts().values)

def checkall(trials):
    # pair each trial with its successor, group by the first element of each pair
    return all(checkvals(v) for k, v in pd.DataFrame(zip(trials, trials[1:])).groupby(0))

while not checkall(trials):
    np.random.shuffle(trials)
import time
from random import shuffle

val = long(raw_input("Please enter the maximum value of the range:")) + 1
start_time = time.time()
numbers = range(0, val)
shuffle(numbers)
I cannot find a simple way to make this work with extremely large inputs - can anyone help?
I saw a question like this - but I could not implement the range function they described in a way that works with shuffle. Thanks.
To get a random permutation of the range [0, n) in a memory-efficient manner, you could use numpy.random.permutation():
import numpy as np
numbers = np.random.permutation(n)  # n is the size of the range, e.g. val from the question
If you need only a small fraction of the values from the range, e.g., to get k random values from the range [0, n):
import random
from functools import partial
def sample(n, k):
    # assume n is much larger than k
    randbelow = partial(random.randrange, n)
    # from random.py
    result = [None] * k
    selected = set()
    selected_add = selected.add
    for i in range(k):
        j = randbelow()
        while j in selected:
            j = randbelow()
        selected_add(j)
        result[i] = j
    return result
print(sample(10**100, 10))
If you don't need the full list of numbers (and if you are getting billions, it's hard to imagine why you would need them all), you might be better off taking a random.sample of your number range, rather than shuffling them all. In Python 3, random.sample can work on a range object too, so your memory use can be quite modest.
For example, here's code that will sample ten thousand random numbers from a range up to whatever maximum value you specify. It should require only a relatively small amount of memory beyond the 10000 result values, even if your maximum is 100 billion (or whatever enormous number you want):
import random
def get10kRandomNumbers(maximum):
    pop = range(1, maximum + 1)  # this is memory efficient in Python 3
    sample = random.sample(pop, 10000)
    return sample
Alas, this doesn't work as nicely in Python 2, since xrange objects don't allow maximum values greater than the system's integer type can hold.
An important point to note is that it will be impossible for a computer to have the list of numbers in memory if it is larger than a few billion elements: its memory footprint becomes larger than the typical RAM size (as it takes about 4 GB for 1 billion 32-bit numbers).
In the question, val is a long integer, which seems to indicate that you are indeed using more than a billion integers, so this cannot be done conveniently in memory (i.e., shuffling will be slow, as the operating system will swap).
That said, if the number of elements is small enough (let's say smaller than 0.5 billion), then a list of elements can fit in memory thanks to the compact representation offered by the array module, and be shuffled. This can be done with the standard module array:
import array, random
numbers = array.array('I', xrange(10**8)) # or 'L', if the number of bytes per item (numbers.itemsize) is too small with 'I'
random.shuffle(numbers)
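To check the footprint of such an array directly (just a small sanity check on the 4 GB estimate above):

print(numbers.itemsize)                        # bytes per element: typically 4 for 'I'
print(numbers.itemsize * len(numbers) / 1e9)   # ~0.4 GB for 10**8 unsigned 32-bit items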
What's the best (fastest) way to do this?
This generates what I believe is the correct answer, but obviously at N = 10e6 it is painfully slow. I think I need to keep the Xi values so I can correctly calculate the standard deviation, but are there any techniques to make this run faster?
import random
import numpy as np

def randomInterval(a, b):
    r = (b - a) * random.random() + a
    return r

N = 10e6
Sum = 0
x = []
for sample in range(0, int(N)):
    n = randomInterval(-5., 5.)
    while n == 5.0:
        n = randomInterval(-5., 5.)  # since X is [-5, 5)
    Sum += n
    x = np.append(x, n)
A = Sum / N

summation = 0
for sample in range(0, int(N)):
    summation += (x[sample] - A) ** 2.0
standard_deviation = np.sqrt((1. / N) * summation)
You made a decent attempt, but make sure you understand this rather than copying it outright, since it's homework:
import numpy as np
N = int(1e6)
a = np.random.uniform(-5,5,size=(N,))
standard_deviation = np.std(a)
This assumes you can use a package like numpy (you tagged it as such). If you can, there are a whole host of methods that allow you to create and do operations on arrays of data, thus avoiding explicit looping (it's done under the hood in an efficient manner). It would be good to take a look at the documentation to see what features are available and how to use them:
http://docs.scipy.org/doc/numpy/reference/index.html
Using the formulas found on this wiki page for Variance, you could compute it in one loop without storing a list of the random numbers (assuming you didn't need them elsewhere).
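A minimal sketch of that idea, using Welford's online algorithm (one pass, nothing stored; the variable names here are my own):

import random

N = 10**6
mean = 0.0
M2 = 0.0   # running sum of squared deviations from the current mean
for i in range(1, N + 1):
    x = random.uniform(-5.0, 5.0)   # X in [-5, 5), up to floating-point rounding
    delta = x - mean
    mean += delta / i
    M2 += delta * (x - mean)

variance = M2 / N                    # population variance, matching the (1/N) factor above
standard_deviation = variance ** 0.5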