Generate data based on exponential distribution - python

I want to generate a dataset of 30 entries in the range 50-5000 that follows an increasing, log-like curve, i.e. rising quickly at the start and then flattening out at the end.
I came across from scipy.stats import expon, but I am not sure how to use the package in my scenario.
Can anyone help?
A possible output would look like [300, 1000, 1500, 1800, 1900, ...].

First, generate 30 random x values (uniformly) and take log(x). Ideally, log(x) would already lie in [50, 5000), but that would require e^50 <= x < e^5000, which overflows. A possible solution is to generate random x values in [min_x, max_x), take their logarithms, and then scale the result to the desired range [50, 5000).
import numpy as np
min_y = 50
max_y = 5000
min_x = 1
# any number max_x can be chosen
# this number controls the shape of the logarithm, therefore the final distribution
max_x = 10
# generate (uniformly) and sort 30 random float x in [min_x, max_x)
x = np.sort(np.random.uniform(min_x, max_x, 30))
# get log(x), i.e. values in [log(min_x), log(max_x))
log_x = np.log(x)
# scale log(x) to the new range [min_y, max_y)
y = (max_y - min_y) * ((log_x - np.log(min_x)) / (np.log(max_x) - np.log(min_x))) + min_y
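The steps above can be wrapped into a small reusable function. This is only a sketch: the function name log_curve, its defaults and the fixed seed are assumptions added for illustration, not part of the original answer.

```python
import numpy as np

def log_curve(n=30, min_y=50, max_y=5000, min_x=1.0, max_x=10.0, seed=0):
    """Return n sorted values between min_y and max_y that follow a rescaled log curve."""
    rng = np.random.default_rng(seed)           # seeded so the result is reproducible
    x = np.sort(rng.uniform(min_x, max_x, n))   # sorted uniform x in [min_x, max_x)
    # position of log(x) between log(min_x) and log(max_x), as a fraction in [0, 1)
    frac = (np.log(x) - np.log(min_x)) / (np.log(max_x) - np.log(min_x))
    return min_y + (max_y - min_y) * frac

y = log_curve()
print(np.round(y[:5]))    # first values: large steps
print(np.round(y[-5:]))   # last values: small steps
```

A larger max_x should make the curve rise more steeply at the start and flatten more at the end, since a larger share of the x values then falls on the flat part of the logarithm.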

Related

Numpy mapping a 3D array onto a 2D array, but arrays don't match

I'm not sure of the best way to go about this, but I have a data file with a list of coordinates x, y and some magnitude, let's say population.
X, Y, POP
1.2, 1.3, 1000
22.5, 2.5, 250
...
98.6, 1.7, 1500
So first, I round X, and Y to the nearest int and create a meshgrid based on a range of min and max values.
Xmin = np.amin(df['X'])
Xmax = np.amax(df['X'])
Ymin = np.amin(df['Y'])
Ymax = np.amax(df['Y'])
## I use a step of 10 as I don't have enough memory for a step counter of 1
## This is where the problem is
X = np.arange(int(Xmin), int(Xmax), 10)
Y = np.arange(int(Ymin), int(Ymax), 10)
xx, yy = np.meshgrid(X, Y)
So now I have a meshgrid, which kind of contains all of my coordinates. The problem now lies in the fact, that I want this new mesh grid to contain my magnitude values from the dataframe.
Thus I want to map my df onto the meshgrid, to give me something like this.
X, Y, POP
1, 1, 1000
10, 1, N/A
20, 1, 250
...
90, 1, 1500
This is what I used to solve my problem, but it is slow, and compares every value. I suppose my question boils down to, is there a quicker/more efficient method compared to my code below? Ultimately, I plan to fill the N/A values using a mean of the surrounding cells. But to do that, I want to first map everything to a nice uniform grid.
state = np.empty([X.shape[0], Y.shape[0]])
state[:] = np.nan
for i in range(0, len(X), 1):
    for j in range(0, len(Y), 1):
        for z in range(0, len(df['X']), 1):
            if (abs(X[i] - df['X'][z]) < 10) and (abs(Y[j] - df['Y'][z]) < 10):
                state[i,j] = df['POP'][z]
precision = -1 # the number of digits to round off
df.groupby(
[df['X'].round(precision), df['Y'].round(precision)]
).POP.sum().unstack().reindex(np.arange(...), np.arange(...))
I summed up values falling into the same cell, you might want to use average.
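To make the groupby approach concrete, here is a minimal sketch on a tiny made-up frame. The sample rows and the grid extents passed to reindex are hypothetical, and it uses the mean instead of the sum, as suggested above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same layout as the question's file
df = pd.DataFrame({
    "X": [1.2, 22.5, 98.6],
    "Y": [1.3, 2.5, 1.7],
    "POP": [1000, 250, 1500],
})

# Snap each coordinate to the nearest multiple of 10 and use ints as labels
gx = df["X"].round(-1).astype(int)
gy = df["Y"].round(-1).astype(int)

# One row per grid X, one column per grid Y; mean of the points sharing a cell
grid = df.groupby([gx, gy])["POP"].mean().unstack()

# Hypothetical grid extents: reindex so that empty cells show up as NaN
grid = grid.reindex(index=np.arange(0, 101, 10), columns=np.arange(0, 11, 10))
print(grid)
```

Because the binning happens in one vectorised pass, this avoids comparing every grid cell against every row of the frame. The NaN cells can then be filled in a separate step.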

given percentiles find distribution function python

From https://stackoverflow.com/a/30460089/2202107, we can generate CDF of a normal distribution:
import numpy as np
import matplotlib.pyplot as plt
N = 100
Z = np.random.normal(size = N)
# method 1
H,X1 = np.histogram( Z, bins = 10, density = True )  # 'normed' was removed from NumPy; 'density' replaces it
dx = X1[1] - X1[0]
F1 = np.cumsum(H)*dx
#method 2
X2 = np.sort(Z)
F2 = np.array(range(N))/float(N)
# plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()
Question: How do we generate the "original" normal distribution, given only x (eg X2) and y (eg F2) coordinates?
My first thought was plt.plot(x,np.gradient(y)), but the gradient of y came out constant (the data points are evenly spaced in y, but not in x). This kind of data often comes up in percentile calculations. The key is to make the data evenly spaced in x rather than in y, using interpolation:
x=X2
y=F2
num_points=10
xinterp = np.linspace(-2,2,num_points)
yinterp = np.interp(xinterp, x, y)
# for normalizing that sum of all bars equals to 1.0
tot_val=1.0
normalization_factor = tot_val/np.trapz(np.ones(len(xinterp)),yinterp)
plt.bar(xinterp, normalization_factor * np.gradient(yinterp), width=0.2)
plt.show()
output looks good to me:
I put my approach here for examination. Let me know if my logic is flawed.
One issue: when num_points is large, the plot looks bad. This is a discretization issue, and I am not sure how to avoid it.
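As a way to examine the approach, the sketch below compares the recovered density with the exact standard normal density. It is an illustration under assumptions: the sample size, the seed and the grid are arbitrary choices, and it passes the x grid to np.gradient so that the result is already a density and needs no separate normalization factor.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x = np.sort(rng.normal(size=N))      # sorted sample (X2 in the question)
F = np.arange(N) / N                 # empirical CDF values (F2 in the question)

# Resample the CDF onto an evenly spaced x grid
xgrid = np.linspace(-3, 3, 41)
Fgrid = np.interp(xgrid, x, F)

# Density estimate: derivative of the CDF with respect to x
pdf = np.gradient(Fgrid, xgrid)

# Compare with the exact standard normal density
exact = np.exp(-xgrid**2 / 2) / np.sqrt(2 * np.pi)
max_err = np.max(np.abs(pdf - exact))
area = np.sum((pdf[1:] + pdf[:-1]) / 2 * np.diff(xgrid))  # trapezoid rule
print(max_err, area)
```

With a finer grid each difference of the empirical CDF spans fewer sample points, so the estimate becomes noisier. That is probably why the plot degrades for large num_points; a larger sample or a coarser grid trades resolution for smoothness.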
Related posts:
I failed to understand why the answer was so complicated in https://stats.stackexchange.com/a/6065/131632
I also didn't understand why my approach was different than Generate distribution given percentile ranks

Creating a matrix of random data in Python

I am trying to create a matrix in python that is 30 × 10 and has randomly generated numbers inside of it. But my numbers in the matrix have to follow the condition:
Randomly generate 30 data points from the sine function, where each data point (x,y) has the form
x = [x0, x1, x2,..., x10], x ∈ [0, 2π]
y = sin(x) + ε, ε ∈ N(0,0.3)
How might I be able to go about this?
Right now I only have a 1 × 10 matrix
def generate_sin_data():
    x = np.random.rand()
    y = np.sin(x)
    features = [x**0, x**1, x**2, x**3, x**4,x**5, x**6, x**7, x**8, x**9,x**10]
    return x,y,features
I'm not 100% certain I follow everything, but we can break it down. Here's how you can generate 30 random numbers between 0 and 2π:
import numpy as np
x = np.random.random(30) * 2*np.pi
Here, x is a 1D array of 30 numbers. Check this with x.shape.
Now if you add a dimension, it's easy to generate a matrix of powers up to 10 using NumPy's broadcasting feature. The question seems to ask for 11 numbers (0 to 10) not 10, so I'll do that:
X = x.reshape(-1, 1) ** np.arange(0, 11)
That reshape effectively turns x into a column vector. Now check X.shape and it's (30, 11), which is what I think you were after. Notice we use a big X for a matrix — this convention will help you keep track of things. Each column of X is the original function raised to a power from 0 to 10. (Note that each column comes from the same set of random numbers — I'm not sure if that's what you want?)
If you want y as a function of x (the vector) then do like so:
ϵ = np.random.normal(0, 0.3, 30)  # Gaussian noise, assuming 0.3 is the standard deviation
y = np.sin(x) + ϵ
import numpy as np
# 30 random uniform values in [0, 2*pi)
_x = np.random.uniform(0, 2*np.pi, 30)
# matrix of 30x10:
x = np.array([
    [v ** i for i in range(10)]
    for v in _x
])
# random 30x10 normal noise:
eps = np.random.normal(0, 0.3, [30, 10])
# final result 30x10 matrix:
y = np.sin(x) + eps
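For comparison, here is a compact sketch that builds the feature matrix and the noisy targets in one go. It assumes that each of the 30 data points gets one target y = sin(x) + ε, that the features are the powers x^0 to x^10, and that 0.3 is the standard deviation of the noise; the question leaves these points open.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, max_power = 30, 10

x = rng.uniform(0, 2 * np.pi, n_points)           # 30 values in [0, 2*pi)
X = np.vander(x, max_power + 1, increasing=True)  # columns x**0 ... x**10
eps = rng.normal(0, 0.3, n_points)                # assumes 0.3 is the standard deviation
y = np.sin(x) + eps                               # one noisy target per data point

print(X.shape, y.shape)  # (30, 11) (30,)
```

np.vander with increasing=True produces the same matrix as the broadcasting expression x.reshape(-1, 1) ** np.arange(0, 11).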

projectile motion simple simulation using numpy matplotlib python

I am trying to graph a projectile through time at various angles. The angles range from 25 to 60 and each initial angle should have its own line on the graph. The formula for "the total time the projectile is in the air" is the formula for t. I am not sure how this total time comes into play, because I am supposed to graph the projectile at various times with various initial angles. I imagine that I would need x,x1,x2,x3,x4,x5 and the y equivalents in order to graph all six of the various angles. But I am confused on what to do about the time spent.
import numpy as np
import matplotlib.pylab as plot
#initialize variables
#velocity, gravity
v = 30
g = -9.8
#increment theta 25 to 60 then find t, x, y
#define x and y as arrays
theta = np.arange(25,65,5)
t = ((2 * v) * np.sin(theta)) / g #the total time the projectile remains in the air
t1 = np.array(t) #why are some negative
x = ((v * t1) * np.cos(theta))
y = ((v * t1) * np.sin(theta)) - ((0.5 * g) * (t ** 2))
plot.plot(x,y)
plot.show()
First of all g is positive! After fixing that, let's see some equations:
You know this already, but let's take a second and discuss something. What do you need to know in order to get the trajectory of a particle?
Initial velocity and angle, right? The question is: find the position of the particle after some time, given that the initial velocity is v=something and theta=something. Initial is important! That's the time when we start our experiment. So time is a continuous parameter! You don't need the time of flight.
One more thing: angles can't just be written as 60, 45, etc. The trigonometric functions expect radians, so you need to convert the angles first: the range (0, 90) in degrees becomes (0, pi/2) in radians.
Let's see the code:
import numpy as np
import matplotlib.pylab as plot
import math as m
#initialize variables
#velocity, gravity
v = 30
g = 9.8
#increment theta 25 to 60 then find t, x, y
#define x and y as arrays
theta = np.arange(m.pi/6, m.pi/3, m.pi/36)
t = np.linspace(0, 5, num=100) # Set time as 'continuous' parameter.
for i in theta: # Calculate trajectory for every angle
    x1 = []
    y1 = []
    for k in t:
        x = ((v*k)*np.cos(i)) # get positions at every point in time
        y = ((v*k)*np.sin(i))-((0.5*g)*(k**2))
        x1.append(x)
        y1.append(y)
    p = [i for i, j in enumerate(y1) if j < 0] # Don't fall through the floor
    for i in sorted(p, reverse = True):
        del x1[i]
        del y1[i]
    plot.plot(x1, y1) # Plot for every angle
plot.show() # And show on one graphic
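The same idea can be written with array operations. The sketch below is an assumed variant, not the answer's code: it converts the question's 25 to 60 degree angles with np.deg2rad and uses a boolean mask instead of deleting the below-ground points one by one.

```python
import numpy as np

v, g = 30.0, 9.8
theta = np.deg2rad(np.arange(25, 65, 5))   # 25, 30, ..., 60 degrees in radians
t = np.linspace(0, 5, 100)                 # shared time axis

trajectories = []
for angle in theta:
    x = v * t * np.cos(angle)
    y = v * t * np.sin(angle) - 0.5 * g * t**2
    above = y >= 0                          # boolean mask: keep points above the ground
    trajectories.append((x[above], y[above]))

print(len(trajectories))  # 8
```

Each (x, y) pair in trajectories can then be passed to the plotting call inside a loop, one line per angle.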
You are making a number of mistakes.
Firstly, and this is less of a mistake: matplotlib.pylab is meant to give access to matplotlib.pyplot and numpy together (for a more MATLAB-like experience). In scripts it is generally recommended to use matplotlib.pyplot as plt instead (see also this Q&A).
Secondly, your angles are in degrees, but math functions by default expect radians. You have to convert your angles to radians before passing them to the trigonometric functions.
Thirdly, your current code sets t1 to have a single time point for every angle. This is not what you need: you need to compute the maximum time t for every angle (which you did in t), then for each angle create a time vector from 0 to t for plotting!
Lastly, you need to use the same plotting time vector in both terms of y, since that's the solution to your mechanics problem:
y(t) = v_{0y}*t - g/2*t^2
This assumes that g is positive, which is again wrong in your code. Unless you set the y axis to point downwards, but the word "projectile" makes me think this is not the case.
So here's what I'd do:
import numpy as np
import matplotlib.pyplot as plt
#initialize variables
#velocity, gravity
v = 30
g = 9.81 #improved g to standard precision, set it to positive
#increment theta 25 to 60 then find t, x, y
#define x and y as arrays
theta = np.arange(25,65,5)[None,:]/180.0*np.pi #convert to radians, watch out for integer division
plt.figure()
tmax = ((2 * v) * np.sin(theta)) / g
timemat = tmax*np.linspace(0,1,100)[:,None] #create time vectors for each angle
x = ((v * timemat) * np.cos(theta))
y = ((v * timemat) * np.sin(theta)) - ((0.5 * g) * (timemat ** 2))
plt.plot(x,y) #plot each dataset: columns of x and columns of y
plt.ylim([0,35])
plt.show()
I made use of the fact that plt.plot will plot the columns of two matrix inputs versus each other, so no loop over angles is necessary. I also used [None,:] and [:,None] to turn 1d numpy arrays to 2d row and column vectors, respectively. By multiplying a row vector and a column vector, array broadcasting ensures that the resulting matrix behaves the way we want it (i.e. each column of timemat goes from 0 to the corresponding tmax in 100 steps)
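As a sanity check on the formulas, the closed-form flight time, range and peak height can be compared with the trajectory equations. This is a sketch using the standard results for launch from ground level without air resistance; the variable names t_flight, x_range and y_peak are introduced here for illustration.

```python
import numpy as np

v, g = 30.0, 9.81
theta = np.deg2rad(np.arange(25, 65, 5))

t_flight = 2 * v * np.sin(theta) / g            # same expression as tmax above
x_range = v**2 * np.sin(2 * theta) / g          # horizontal distance at landing
y_peak = (v * np.sin(theta))**2 / (2 * g)       # maximum height

# Consistency checks against the trajectory equations
x_end = v * t_flight * np.cos(theta)
y_end = v * t_flight * np.sin(theta) - 0.5 * g * t_flight**2
print(np.allclose(x_end, x_range), np.allclose(y_end, 0))
print(y_peak.max())
```

The largest peak height stays below 35, which is consistent with the upper y limit chosen for the plot.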
Result:

Weighted mean in numpy/python

I have a big continuous array of values ranging from -100 to 100.
Now for this array I want to calculate the weighted average described here.
Since it's continuous, I also want to set breaks for the values every 20,
i.e. the values should be discretized as
-100
-80
-60
....
60
80
100
How can I do this in NumPy or python in general?
EDIT: the difference from the normal mean is that the mean is calculated according to the frequency of the values.
You actually have 2 different questions.
How to make data discrete, and
How to make a weighted average.
It's usually better to ask 1 question at a time, but anyway.
Given your specification:
xmin = -100
xmax = 100
binsize = 20
First, let's import numpy and make some data:
import numpy as np
data = np.array(range(xmin, xmax))
Then let's make the binnings you are looking for:
bins_arange = np.arange(xmin, xmax + 1, binsize)
From this we can convert the data to the discrete form:
counts, edges = np.histogram(data, bins=bins_arange)
Now to calculate the weighted average, we can use the binning middle (e.g. numbers between -100 and -80 will be on average -90):
bin_middles = (edges[:-1] + edges[1:]) / 2
Note that this method does not require the binnings to be evenly "spaced", contrary to the integer division method.
Then let's make some weights:
weights = np.array(range(len(counts))) / sum(range(len(counts)))
Then to bring it all together:
average = np.sum(bin_middles * counts * 1) / sum(counts)
weighted_average = np.sum(bin_middles * counts * weights) / sum(counts)
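If the goal is the mean according to the frequency of the values, as the question's edit suggests, np.average with the bin counts as weights gives it directly. A minimal sketch, using hypothetical uniform data in place of the real array:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(-100, 100, 1000)              # hypothetical continuous data

edges = np.arange(-100, 101, 20)                 # breaks every 20
counts, _ = np.histogram(data, bins=edges)
bin_middles = (edges[:-1] + edges[1:]) / 2

# Mean of the bin centres, each weighted by how many values fell into the bin
freq_mean = np.average(bin_middles, weights=counts)
print(freq_mean, data.mean())
```

This is the same quantity as np.sum(bin_middles * counts) / sum(counts), and it stays within half a bin width of the mean of the raw data.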
For the discretization (breaks), here is a method using the python integer division :
import numpy as np
values = np.array([0, 5, 10, 11, 21, 24, 48, 60])
(values // 20) * 20
# or (values / 20).astype(int) * 20 to truncate toward zero
that will print :
array([ 0, 0, 0, 0, 20, 20, 40, 60])
For the weighted mean, if you have another array with the weights for each point, you can use :
weighted_means = sum(w * v for w, v in zip(weights, values)) / sum(weights)
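Since the data in the question also contains negative values, it may matter which way the discretization rounds. The sketch below compares three options on a few made-up values; the names floored, truncated and nearest are illustrative only.

```python
import numpy as np

values = np.array([-95, -5, 5, 95])
step = 20

floored = (values // step) * step                        # rounds toward -infinity
truncated = (values / step).astype(int) * step           # rounds toward zero
nearest = np.round(values / step).astype(int) * step     # rounds to the nearest break

print(floored.tolist())    # [-100, -20, 0, 80]
print(truncated.tolist())  # [-80, 0, 0, 80]
print(nearest.tolist())    # [-100, 0, 0, 100]
```

Only the nearest variant can return both -100 and 100, the end breaks listed in the question.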
