Sequential Sampling - python

To draw a sample of size 100 from N(1,2) and calculate its mean, we can do this:
import numpy as np
s = np.random.normal(1, 2, 100)
mean = np.mean(s)
Now, if we want to produce 10000 samples and save the mean of each of them, we can do:
sample_means = []
for _ in range(10000):
    sample = np.random.normal(1, 2, 100)
    sample_means.append(sample.mean())
How can I do this when I want to sample sequentially from N(1,2) and estimate the distribution mean sequentially?

If I understand correctly, you mean the cumulative mean:
sample = np.random.normal(1, 2, (10000, 100))
sample_mean = []
for i in range(len(sample)):
    sample_mean.append(sample[:i+1, :].ravel().mean())
Then sample_mean contains the cumulative sample means:
sample_mean[:10]
[1.1185342714036368,
1.3270808654923423,
1.3266440422140355,
1.2542028664103761,
1.179358517854582,
1.1224645540064788,
1.1416887857272255,
1.1156887336750463,
1.0894328800573165,
1.0878896099712452]
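Re-slicing and re-averaging the growing array makes this O(n^2). A minimal vectorized sketch (same setup as above) gets every cumulative mean in one pass with np.cumsum:
import numpy as np

sample = np.random.normal(1, 2, (10000, 100))
flat = sample.ravel()
# running mean after every individual draw
running_mean = np.cumsum(flat) / np.arange(1, flat.size + 1)
# keep the mean after each complete batch of 100 draws
sample_mean = running_mean[99::100]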

Maybe a list comprehension?
sample_means = [np.random.normal(1, 2, 100).mean() for _ in range(10000)]
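A fully vectorized variant (a sketch, assuming the newer np.random.default_rng Generator API is available) avoids the Python loop entirely:
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
sample_means = rng.normal(1, 2, size=(10000, 100)).mean(axis=1)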
Tip: use lowercase names for variables in Python.

Related

Generating and Storing Samples of an Exponential Distribution with a name for each sample using a loop

I've got a weird question for a class project. Assuming X ~ Exp(lambda) with lambda = 1.6, I have to generate 100 samples of X, with the index corresponding to the sample size of each generated sample (S1, S2, ..., S100). I've worked out a simple loop that generates the required samples in an array, but I am not able to rename the array.
First attempt:
import numpy as np
import matplotlib.pyplot as plt

samples = []
for i in range(1, 101):
    samples.append(np.random.exponential(scale=1/1.6, size=i))
Second attempt:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df_samples = pd.DataFrame()
for i in range(1, 101):
    samples = np.random.exponential(scale=1/1.6, size=i)
    col = f'samples {i}'
    df_samples[col] = samples
An example of how I would like to visualize the data:
# drawing 50 random samples of size 2 from the exponentially distributed population
rate = 1.6
sample_size = 2
df2 = pd.DataFrame(index=['x1', 'x2'])
for i in range(1, 51):
    exponential_sample = np.random.exponential(1/rate, sample_size)
    col = f'sample {i}'
    df2[col] = exponential_sample
# Taking a peek at the samples
df2
But instead of a fixed sample size = 2, I would like sample size = i. This way I get 1 row for the first column (S1), 2 rows for the second column (S2), and so on up to 100 rows for the 100th column (S100).
You cannot easily stick vectors of different lengths into a DataFrame, so your mock-up code would not work, but you can concat one vector at a time:
df = pd.DataFrame()
for i in range(1, 101):
    tmp = pd.DataFrame({f'S{i}': np.random.exponential(scale=1/1.6, size=i)})
    df = pd.concat([df, tmp], axis=1)
Maybe use a dict instead?
samples = {}
for i in range(1, 101):
    samples[i] = np.random.exponential(scale=1/1.6, size=i)
Then you can convert it into a pandas DataFrame if you like.
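One way to do that conversion (a sketch): wrapping each unequal-length array in a pd.Series lets pandas align the columns, padding the shorter ones with NaN:
import numpy as np
import pandas as pd

samples = {f'S{i}': np.random.exponential(scale=1/1.6, size=i) for i in range(1, 101)}
# pandas aligns Series of different lengths, padding the short ones with NaN
df = pd.DataFrame({name: pd.Series(values) for name, values in samples.items()})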

Generate values in separate dataframe

I am trying to generate random data with pandas.
The data need to be stored in two columns. The first column needs to contain categorical variables (Stratum_1 through Stratum_19), and each stratum can contain a random number of values. The second column needs values in the range 1 to 180,000,000, with a standard deviation of 453,210, a mean of 170,000, and 100,000 rows.
I tried:
import numpy as np
import pandas as pd

categorical = {'name': ['Stratum_1','Stratum_2','Stratum_3','Stratum_4','Stratum_5','Stratum_6','Stratum_7','Stratum_8','Stratum_9',
                        'Stratum_10','Stratum_11','Stratum_12','Stratum_13','Stratum_14','Stratum_15','Stratum_16','Stratum_17','Stratum_18','Stratum_19']}
desired_mean = 170000
desired_std_dev = 453210
df = pd.DataFrame(np.random.randint(0, 180000000, size=(100000, 1)), columns=['1'])
I tried the code above, but I don't know how to combine the categorical and numerical values while hitting the desired mean and standard deviation. Can anybody help me solve this?
I decided to use the gamma distribution to generate your desired sample, since the given parameters are not suitable for the normal distribution: the standard deviation is far larger than the mean, yet every value must be at least 1, so a normal distribution would put a large share of its mass below the lower bound.
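For reference, the shape and scale below come from matching the gamma moments: a Gamma(k, θ) distribution has mean kθ and variance kθ², so setting kθ = μ − δ and kθ² = σ² gives θ = σ²/(μ − δ) and k = ((μ − δ)/σ)², which is exactly what the code computes.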
Code
import numpy as np
import pandas as pd

# desired parameters
n_rows = 100000
lower, upper = 1, 180000000
mu, sigma = 170000, 453210

# amount of shift
delta = lower

# parameters for the gamma distribution
shape = ((mu - delta) / sigma) ** 2
scale = sigma**2 / (mu - delta)

# Create a dataframe
categories = {'name': [f'Stratum_{i}' for i in range(1, 19 + 1)]}
df = pd.DataFrame(categories).sample(n=n_rows, replace=True).reset_index(drop=True)

# Generate samples along with your desired parameters
generator = np.random.default_rng()
while True:
    df['value'] = generator.gamma(shape=shape, scale=scale, size=n_rows) + delta
    if df.value.max() <= upper:
        break

# Show statistics
print(df.describe())
Output
             value
count      100,000
mean       169,403   (target: 170,000)
std        449,668   (target: 453,210)
min              1
25%        39.4267
50%        5529.28
75%        105,748
max    9.45114e+06
Try rejection sampling: oversample from the normal distribution, drop the out-of-range draws, and keep 100,000 of the survivors. With these parameters roughly 65% of normal draws are positive (and the upper bound sits about 400 standard deviations away), so 300,000 draws leave comfortably more than 100,000 in range:
import numpy as np
import pandas as pd

categorical = {'name': [f'Stratum_{i}' for i in range(1, 20)]}
desired_mean = 170000
desired_std_dev = 453210
df = pd.DataFrame({'num': np.random.normal(desired_mean, desired_std_dev, 300000),
                   'cat': np.random.choice(categorical['name'], 300000)})
df[(0 < df['num']) & (df['num'] < 180000000)].sample(100000)

How to generate random numbers at the tails of an exponential distribution?

I want to generate random numbers like from np.random.exponential but clipped / truncated at values a,b. For example, if a=100, b=500 then I want the function to generate random numbers following e^(-x) in the range [100, 500].
An inefficient way would be:
rands = np.random.exponential(size=10**7)
rands = rands[(rands > a) & (rands < b)]
Is there an existing package that can do this for me? Ideally for various distributions, not just exponential.
If we clip the values after using the exponential generator, there are two problems with the approach proposed in the question.
First, we lose values: if we wanted 10**7 values, we might be left with only on the order of 10**6 after filtering.
Second, np.random.exponential() with the default scale draws values with mean 1, so with bounds like a=100 and b=500 essentially nothing survives the filter; the bounds have to match the scale of the generated numbers.
I wrote a workaround using exp(uniform). I tested your solution with smaller values of a and b (so that we don't get empty arrays). Timing both shows this approach is faster by around 50%:
import time
import numpy as np
import matplotlib.pyplot as plt

def truncated_exp_OP(a, b, how_many):
    rands = np.random.exponential(size=how_many)
    rands = rands[(rands > a) & (rands < b)]
    return rands

def truncated_exp_NK(a, b, how_many):
    a = -np.log(a)
    b = -np.log(b)
    rands = np.exp(-(np.random.rand(how_many) * (b - a) + a))
    return rands

timeTakenOP = []
for i in range(20):
    startTime = time.time()
    r = truncated_exp_OP(0.001, 0.39, 10**7)
    endTime = time.time()
    timeTakenOP.append(endTime - startTime)

print("OP solution: ", np.mean(timeTakenOP))
plt.hist(r.flatten(), 300)
plt.show()

timeTakenNK = []
for i in range(20):
    startTime = time.time()
    r = truncated_exp_NK(100, 500, 10**7)
    endTime = time.time()
    timeTakenNK.append(endTime - startTime)

print("NK solution: ", np.mean(timeTakenNK))
plt.hist(r.flatten(), 300)
plt.show()
Average run times:
OP solution: 0.28491891622543336
NK solution: 0.1437338709831238
[Histogram plots of the generated numbers, one for the OP's approach and one for this approach; images not reproduced]
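A caveat: exponentiating a uniform draw yields a log-uniform (reciprocal) distribution, with density proportional to 1/x rather than e^(-x). If the exact truncated exponential density on [a, b] is required, scipy.stats.truncexpon provides it, and the inverse CDF is short to code by hand; a sketch (variable names are mine):
import numpy as np
from scipy.stats import truncexpon

a, b, n = 100, 500, 10**7

# scipy's truncexpon: loc is the lower bound, scale is the exponential scale,
# and the shape parameter is the truncation width in units of scale
samples = truncexpon(b=(b - a), loc=a, scale=1.0).rvs(size=n)

# equivalent pure-numpy inverse-CDF sampling, arranged to stay numerically
# stable for large a (it never evaluates exp(-a) directly)
u = np.random.rand(n)
samples_np = a - np.log1p(-u * (1 - np.exp(-(b - a))))
scipy.stats also has truncnorm and other truncated families, which covers the "various distributions" part of the question.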

How to efficiently convert a list into probability distribution?

I am trying to convert a list into a probability distribution.
x = [2, 4]
I want the following array, in that order:
probability_array = [1-(2+4)/10, 2/10, 4/10]
So I did the following...
y = 1 - (2 + 4)/10
new_x = [2/10, 4/10]
probability_array = [y] + new_x
The problem is that I'm working with 10,000 datasets like x. Is there a faster way to do this?
I think you can do this easily with numpy. Here is a small example to check correctness:
import numpy as np

x = [[1, 2], [3, 4]]
x = np.array(x)
sum1 = np.sum(x, axis=1).reshape(2, 1)
prob = x / sum1
I think it would be pretty fast even with more than 10,000 datasets. Let's reshape 1,000,000 random values into 100 rows of 10,000 values each:
import time

x = np.random.randint(1, 100, size=1000000)
print(x.shape)
start = time.time()
x = x.reshape(-1, 10000)
sum1 = np.sum(x, axis=1).reshape((-1, 1))
prob = x / sum1
stop = time.time()
print(stop - start)
This takes around 0.021 sec on my MBP.
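Note the snippet above normalizes each row by its own sum, which is not quite the format in the question (a leading complement term and a fixed divisor of 10). A sketch that reproduces the question's output exactly, assuming the divisor 10 from the example is a fixed constant of the problem:
import numpy as np

X = np.array([[2, 4], [3, 1]])  # one dataset per row
# prepend 1 - sum(x)/10, then divide the original entries by 10
probs = np.hstack([1 - X.sum(axis=1, keepdims=True) / 10, X / 10])
# probs[0] -> [0.4, 0.2, 0.4], matching the question's probability_array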

Root mean square of a function in python

I want to calculate the root mean square of a function in Python. My function has the simple form y = f(x), where x and y are arrays.
I looked through the NumPy and SciPy docs and couldn't find anything.
I'm going to assume that you want to compute the expression given by the following pseudocode:
ms = 0
for i = 1 ... N
    ms = ms + y[i]^2
ms = ms / N
rms = sqrt(ms)
i.e. the square root of the mean of the squared values of elements of y.
In numpy, you can simply square y, take its mean and then its square root as follows:
rms = np.sqrt(np.mean(y**2))
So, for example:
>>> y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 1]) # Six 1's
>>> y.size
10
>>> np.mean(y**2)
0.59999999999999998
>>> np.sqrt(np.mean(y**2))
0.7745966692414834
Do clarify your question if you mean to ask something else.
You could use the sklearn function:
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_actual, [0 for _ in y_actual], squared=False)
Comparing y_actual against an all-zeros target makes the RMSE equal the RMS of y_actual. (Newer scikit-learn versions replace the squared=False flag with a dedicated root_mean_squared_error function.)
numpy.std(x) tends to rms(x) when mean(x) tends to 0 (thanks to @Seb), as with sound recordings, vibrations, and other signals that fluctuate around zero. A plain-Python definition:
rms = lambda x_seq: (sum(x*x for x in x_seq) / len(x_seq)) ** 0.5
In case you'd like to frame your array before computing the RMS, this is a numpy solution:
nframes = 1000
rms = np.array([
    np.sqrt(np.mean(frame**2))
    for frame in np.array_split(arr, nframes)
])
If you'd like to specify the frame length instead of the frame count, compute nframes first:
frame_length = 200
arr_length = arr.shape[0]
nframes = arr_length // frame_length + 1
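Putting the two snippets together (a sketch; the signal arr here is a stand-in for your data, and the ceiling division is a variant that avoids an extra frame when the length divides evenly):
import numpy as np

arr = np.random.randn(100_000)               # stand-in signal
frame_length = 200
nframes = -(-arr.shape[0] // frame_length)   # ceiling division
rms_per_frame = np.array([
    np.sqrt(np.mean(frame**2))
    for frame in np.array_split(arr, nframes)
])
print(rms_per_frame.shape)                   # (500,)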
