The LHS method returns samples between 0 and 1. If I want to set bounds, for example so that one dimension's values lie between 0 and 15, how can I do that with pyDOE in Python?
from pyDOE import *
n = 2
samples = 50
d = lhs(n, samples, criterion='center')
x1 = d[:,0]
x2 = d[:,1]
My x1 values should be between -10 and 10, and x2 between 1 and 20.
Multiply each data point in x1 (or x2) by the range of your bounds, e.g. 10 - (-10) = 20, and add the lower bound.
x1_new = [None for i in range(len(x1))]
for i,j in enumerate(x1):
x1_new[i] = -10 + 20*j
... I think?
I figured it out
import numpy as np
import pyDOE as pyd
bounds = np.array([[-10,10],[1,20]]) # [lower, upper] bounds for each dimension
X = pyd.lhs(2, 100, criterion='centermaximin')
X[:,0] = (X[:,0]*(bounds[0,1]-bounds[0,0])+bounds[0,0])
X[:,1] = (X[:,1]*(bounds[1,1]-bounds[1,0])+bounds[1,0])
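The same rescaling generalizes to any number of dimensions in a single step with NumPy broadcasting (a small sketch using the bounds array defined above):
import numpy as np
import pyDOE as pyd
bounds = np.array([[-10, 10], [1, 20]])               # one [lower, upper] row per dimension
X = pyd.lhs(2, 100, criterion='centermaximin')        # samples on the unit hypercube
X = bounds[:, 0] + X * (bounds[:, 1] - bounds[:, 0])  # rescale every column at once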
I have two dataframes with the same format looking like the following:
df1
Value_0 Value_1 Value_2 ...
Date
2020-11-07 7.830 19.630 30.584 ...
2020-11-08 11.100 34.693 40.589 ...
2020-11-09 12.455 34.693 41.236 ...
...
df2
Value_0 Value_1 Value_2 ...
Date
2020-11-07 153.601 61.014 55.367 ...
2020-11-08 119.011 70.560 49.052 ...
2020-11-09 133.925 103.417 61.651 ...
...
I'm trying to:
1. Make a linear interpolation between each pair of consecutive matching points (so y1 = df1.Value_0, y2 = df1.Value_1, x1 = df2.Value_0, x2 = df2.Value_1).
2. Maximize the product of df1 and df2 for each Date and column pair, considering all possible values from the interpolation.
My current approach is the following (this goes inside a loop that evaluates each pair of columns and keeps only the optimisation with the highest value, but I'm omitting that here for the sake of simplicity):
from gekko import GEKKO
import numpy as np

i = 0 # Example for only one use case
# Initial model
m = GEKKO()
# Variables
y1 = np.array(df1['Value_'+str(i)])
y2 = np.array(df1['Value_'+str(i+1)])
x1 = np.array(df2['Value_'+str(i)])
x2 = np.array(df2['Value_'+str(i+1)])
s = [None]*len(y1)
c = [None]*len(y1)
ex = [None]*len(y1)
for j in range(len(y1)):
s[j] = (y1[j]-y2[j])/(x1[j]-x2[j]) # slope
c[j] = (x1[j]*y2[j] - x2[j]*y1[j])/(x1[j]-x2[j]) # y intersect
ex[j] = -c[j]/s[j] # x intersect
p = m.Var(lb=0, ub=y2) # specific boundaries for case when i=0
n = m.Var(lb=x2, ub=ex) # specific boundaries for case when i=0
# Constraint
m.Equation((s[j]*n)+c[j]==p for j in range(len(y1))) # equation of a line
# Objective function
m.Maximize(n*p)
m.solve(disp=False)
#print('p:'+str(p.value))
#print('n:'+str(n.value))
It's my first time using Gekko and I'm getting "#error: Inequality Definition invalid inequalities: z > x < y". I would appreciate any clues about what's wrong with the code or the variable definitions.
The lower and upper bounds need to be single values, unless you define a separate variable for each data row. Is this a suitable replacement?
p = m.Var(lb=0, ub=max(y2))
n = m.Var(lb=min(x2), ub=max(ex))
Try using IMODE=2 to define the equation once and apply it to each data row. Here is a modification of the script that runs in mode 2.
from gekko import GEKKO
import numpy as np
# Initial model
m = GEKKO()
# Variables
ns = 10
y1 = np.random.rand(ns)
y2 = np.random.rand(ns)+1
x1 = np.random.rand(ns)
x2 = np.random.rand(ns)+1
s = [None]*len(y1)
c = [None]*len(y1)
ex = [None]*len(y1)
for j in range(len(y1)):
# slope
s[j] = (y1[j]-y2[j])/(x1[j]-x2[j])
# y-intercept
c[j] = (x1[j]*y2[j] - x2[j]*y1[j])/(x1[j]-x2[j])
# x-intercept
ex[j] = -c[j]/s[j]
s = m.Param(s); c = m.Param(c)
p = m.Var(lb=0, ub=max(y2))
n = m.Var(lb=min(x2), ub=max(ex))
# Constraint
m.Equation(s*n+c==p)
m.options.IMODE=2
# Objective function
m.Maximize(n*p)
m.solve(disp=True)
If separate upper and lower bounds are needed for each interpolation, then create an array of variables such as:
p = m.Array(m.Var,ns,lb=0)
n = m.Array(m.Var,ns)
More information on the modes of calculation is shown in the documentation.
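If each row also needs its own lower and upper bound, one option is to build a list of scalar variables so every row gets its own lb/ub (a sketch, assuming the y2, x2, and ex arrays from the loop above):
p = [m.Var(lb=0, ub=y2[j]) for j in range(ns)]
n = [m.Var(lb=x2[j], ub=ex[j]) for j in range(ns)]
Each row then also needs its own equation, e.g. m.Equations([s[j]*n[j] + c[j] == p[j] for j in range(ns)]) with the plain s and c lists.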
Unique Inequality Constraints for Each Data Row
If unique inequality constraints are required for each data row, add the inequalities as equation definitions.
p = m.Var(lb=0)
n = m.Var()
# implement constraints
x2 = m.Param(x2)
y2 = m.Param(y2)
ex = m.Param(ex)
m.Equation(p<y2)
m.Equations([n>x2,n<ex])
Below is a full script with the inequality constraints. The problem may be infeasible if all of the inequality constraints cannot be simultaneously satisfied.
from gekko import GEKKO
import numpy as np
# Initial model
m = GEKKO()
# Variables
ns = 10
y1 = np.random.rand(ns)
y2 = np.random.rand(ns)+1
x1 = np.random.rand(ns)
x2 = np.random.rand(ns)+1
s = [None]*len(y1)
c = [None]*len(y1)
ex = [None]*len(y1)
for j in range(len(y1)):
# slope
s[j] = (y1[j]-y2[j])/(x1[j]-x2[j])
# y-intercept
c[j] = (x1[j]*y2[j] - x2[j]*y1[j])/(x1[j]-x2[j])
# x-intercept
ex[j] = -c[j]/s[j]
s = m.Param(s); c = m.Param(c)
p = m.Var(lb=0)
n = m.Var()
# implement constraints
x2 = m.Param(x2)
y2 = m.Param(y2)
ex = m.Param(ex)
m.Equation(p<y2)
m.Equations([n>x2,n<ex])
# Constraint
m.Equation(s*n+c==p)
m.options.IMODE=2
# Objective function
m.Maximize(n*p)
m.solve(disp=True)
I'm doing aperture photometry on a cluster of stars, and to make it easier to detect the background signal, I want to only look at stars that are further apart than n pixels (n = 16 in my case).
I have 2 arrays, xs and ys, with the x- and y-values of all the stars' coordinates.
Using np.where, I'm supposed to find the indices of all stars whose distance to every other star is >= n.
So far, my method has been a for-loop:
import numpy as np
# Lists of coordinates w. values between 0 and 2000 for 5000 stars
xs = np.random.rand(5000)*2000
ys = np.random.rand(5000)*2000
# for-loop, wherein the np.where statement in question is situated
n = 16
for i in range(len(xs)):
index = np.where( np.sqrt( pow(xs[i] - xs,2) + pow(ys[i] - ys,2)) >= n)
Due to the stars being clustered pretty closely together, I expected a severe reduction in data, but even when I tried n = 1000 I still had around 4000 data points left.
Using just numpy (and part of the answer here)
X = np.random.rand(5000,2) * 2000
XX = np.einsum('ij, ij ->i', X, X)
D_squared = XX[:, None] + XX - 2 * X.dot(X.T)  # pairwise squared distances
np.fill_diagonal(D_squared, np.inf)            # ignore each star's zero distance to itself
out = np.where(D_squared.min(axis = 0) > n**2)
Using scipy.spatial.distance.pdist
from scipy.spatial.distance import pdist, squareform
D_squared = squareform(pdist(X, metric = 'sqeuclidean'))
np.fill_diagonal(D_squared, np.inf)            # again, mask out the self-distances
out = np.where(D_squared.min(axis = 0) > n**2)
Using a KDTree for maximum speed:
from scipy.spatial import KDTree
X_tree = KDTree(X)
in_radius = np.array(list(X_tree.query_pairs(n))).flatten()
out = np.where(~np.in1d(np.arange(X.shape[0]), in_radius))
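An equivalent KDTree route is to query each star's two nearest neighbours; the first is the star itself at distance 0, so the second gives the closest other star (a sketch with the same random data as above):
from scipy.spatial import KDTree
import numpy as np
X = np.random.rand(5000, 2) * 2000
n = 16
tree = KDTree(X)
dist, _ = tree.query(X, k=2)       # dist[:, 0] is each star's distance to itself (0)
out = np.where(dist[:, 1] >= n)    # stars whose nearest neighbour is at least n away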
np.random.seed(seed=1)
xs = np.random.rand(5000,1)*2000
ys = np.random.rand(5000,1)*2000
n = 16
mask = (xs>=0)
for i in range(len(xs)):
if mask[i]:
index = np.where( np.sqrt( pow(xs[i] - xs,2) + pow(ys[i] - ys,2)) <= n)
mask[index] = False
mask[i] = True
x = xs[mask]
y = ys[mask]
print(len(x))
4220
You can use np.subtract.outer to create the pairwise differences. Then you check, for each row, whether the distance is below 16 for exactly one item (which is the comparison of that particular star with itself):
distances = np.sqrt(
np.subtract.outer(xs, xs)**2
+ np.subtract.outer(ys, ys)**2
)
indices = np.nonzero(np.sum(distances < 16, axis=1) == 1)
I'm trying to calculate the mean value of a quantity (in the form of a 2D array) as a function of its distance from the center of a 2D grid. I understand that the idea is to identify all the array elements that are at a distance R from the center, then add them up and divide by the number of elements. However, I'm having trouble coming up with an algorithm to actually do this.
I have attached a working example of the code that generates the 2D array below. The code calculates some quantities resulting from gravitational lensing, so the way the array is made is irrelevant to this problem, but I have attached the entire code so that you can create the output array for testing.
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt
n = 100 # grid size
c = 3e8
G = 6.67e-11
M_sun = 1.989e30
pc = 3.086e16 # parsec
Dds = 625e6*pc
Ds = 1726e6*pc #z=2
Dd = 1651e6*pc #z=1
FOV_arcsec = 0.0001
FOV_arcmin = FOV_arcsec/60.
pix2rad = ((FOV_arcmin/60.)/float(n))*np.pi/180.
rad2pix = 1./pix2rad
Renorm = (4*G*M_sun/c**2)*(Dds/(Dd*Ds))
#stretch = [10, 2]
# To create a random distribution of points
def randdist(PDF, x, n):
#Create a distribution following PDF(x). PDF and x
#must be of the same length. n is the number of samples
fp = np.random.rand(n,)
CDF = np.cumsum(PDF)
return np.interp(fp, CDF, x)
def get_alpha(args):
zeta_list_part, M_list_part, X, Y = args
alpha_x = 0
alpha_y = 0
for key in range(len(M_list_part)):
z_m_z_x = (X - zeta_list_part[key][0])*pix2rad
z_m_z_y = (Y - zeta_list_part[key][1])*pix2rad
alpha_x += M_list_part[key] * z_m_z_x / (z_m_z_x**2 + z_m_z_y**2)
alpha_y += M_list_part[key] * z_m_z_y / (z_m_z_x**2 + z_m_z_y**2)
return (alpha_x, alpha_y)
if __name__ == '__main__':
# number of processes, scale accordingly
num_processes = 1 # Number of CPUs to be used
pool = multiprocessing.Pool(processes=num_processes)
num = 100 # The number of points/microlenses
r = np.linspace(-n, n, n)
PDF = np.abs(1/r)
PDF = PDF/np.sum(PDF) # PDF should be normalized
R = randdist(PDF, r, num)
Theta = 2*np.pi*np.random.rand(num,)
x1= [R[k]*np.cos(Theta[k])*1 for k in range(num)]
y1 = [R[k]*np.sin(Theta[k])*1 for k in range(num)]
# Uniform distribution
#R = np.random.uniform(-n,n,num)
#x1= np.random.uniform(-n,n,num)
#y1 = np.random.uniform(-n,n,num)
zeta_list = np.column_stack((np.array(x1), np.array(y1))) # List of coordinates for the microlenses
x = np.linspace(-n,n,n)
y = np.linspace(-n,n,n)
X, Y = np.meshgrid(x,y)
M_list = np.array([0.1 for i in range(num)])
# split zeta_list, M_list, X, and Y
zeta_list_split = np.array_split(zeta_list, num_processes, axis=0)
M_list_split = np.array_split(M_list, num_processes)
X_list = [X for e in range(num_processes)]
Y_list = [Y for e in range(num_processes)]
alpha_list = pool.map(
get_alpha, zip(zeta_list_split, M_list_split, X_list, Y_list))
alpha_x = 0
alpha_y = 0
for e in alpha_list:
alpha_x += e[0]
alpha_y += e[1]
alpha_x_y = 0
alpha_x_x = 0
alpha_y_y = 0
alpha_y_x = 0
alpha_x_y, alpha_x_x = np.gradient(alpha_x*rad2pix*Renorm,edge_order=2)
alpha_y_y, alpha_y_x = np.gradient(alpha_y*rad2pix*Renorm,edge_order=2)
det_A = 1 - alpha_y_y - alpha_x_x + (alpha_x_x)*(alpha_y_y) - (alpha_x_y)*(alpha_y_x)
abs = np.absolute(det_A)
I = abs**(-1.)
O = np.log10(I+1)
plt.contourf(X,Y,O,100)
The array of interest is O, and I have attached a plot of how it should look. It can differ based on the random distribution of points.
What I'm trying to do is plot the mean value of O as a function of radius from the center of the grid. In the end, I want to be able to plot the average O as a function of distance from the center in a 2D line graph. So I suppose the first step is to define circles of radius R based on X and Y.
def circle(x,y):
r = np.sqrt(x**2 + y**2)
return r
Now I just have to figure out a way to find all the values of O that have the same indices as the equivalent values of R. I'm kind of confused on this part and would appreciate any help.
You can find the geometric coordinates of a circle with center (0,0) and radius R as follows:
phi = np.linspace(0, 1, 50)
x = R*np.cos(2*np.pi*phi)
y = R*np.sin(2*np.pi*phi)
These values, however, will not fall exactly on the regular pixel grid but in between pixels.
In order to use them as sampling points, you can either round the values and use them as indices, or interpolate the values from the neighbouring pixels.
Attention: the pixel indices and the x, y values are not the same. In your example, (0,0) is at picture location (50,50).
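Here is a minimal sketch of the rounding approach; O below is just a random stand-in for your output array, and the centre is assumed to be at pixel (n//2, n//2):
import numpy as np
O = np.random.rand(100, 100)                  # stand-in for the lensing map O
npix = O.shape[0]
cx = cy = npix // 2                           # grid centre in pixel coordinates
phi = np.linspace(0, 1, 200, endpoint=False)
radii = np.arange(1, npix // 2)               # radii to evaluate, in pixels
mean_O = np.empty(len(radii))
for k, R in enumerate(radii):
    ix = np.round(cx + R*np.cos(2*np.pi*phi)).astype(int)
    iy = np.round(cy + R*np.sin(2*np.pi*phi)).astype(int)
    mean_O[k] = O[iy, ix].mean()              # mean of O sampled on the circle of radius R
Plotting radii against mean_O gives the 2D line graph you describe; interpolating (e.g. with scipy.ndimage.map_coordinates) instead of rounding would be a smoother alternative.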
I'm trying to program the process shown in this image:
In the image, the 2 on the right side is mapped to bin "80" since its corresponding value on the left side is 80. The 4 on the right side, however, has a corresponding value of 10 on the left side, and because there is no bin for 10, the 4 needs to be split between two bins.
To accomplish this I am using numpy's histogram with the "weights" parameter, like this:
t1 = [80, 10]
t2 = [2, 4]
bins = np.arange(0, 200, 20)
h = np.histogram(t1,bins=bins,weights=t2)
The 2 gets mapped correctly, but the 4 gets mapped entirely to bin 0 (leftmost).
Output:
[4 0 0 0 2 0 0 0 0]
I think this is due to the fact that the first bin covers the whole range 0 to 20, instead of splitting the magnitude when the direction doesn't exactly match a bin value.
So, I was wondering if anybody knows how I can rewrite this so the output will be:
[2 2 0 0 2 0 0 0 0]
Let's consider an easier task first:
Assume you want to quantize the gradient direction (GD) as floor(GD/20). You could use the following:
h = np.bincount(np.floor(GD.reshape((-1)) / 20).astype(np.int64), GM.reshape((-1)).astype(np.float64), minlength=13)
Here np.bincount simply accumulates the gradient magnitude (GM) according to the quantized gradient direction (GD). Notice that minlength controls the length of the histogram; here it equals ceil(255/20).
However, since you wanted soft assignment, you have to weight the GM contributions; you might want to try:
GD = GD.reshape((-1))
GM = GM.reshape((-1))
w = ((GD / 20) - np.floor(GD / 20)).astype(np.float64)
h1 = np.bincount(np.floor(GD / 20).astype(np.int64), GM.astype(np.float64) * (1.0-w), minlength=13)
h2 = np.bincount(np.ceil(GD / 20).astype(np.int64), GM.astype(np.float64) * w, minlength=13)
h = h1 + h2
P.S. You might want to consult the np.bincount documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html
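As a quick check, applying this soft assignment to the t1/t2 example from the question (here with minlength=9, i.e. 9 bins of 20 degrees) reproduces the desired output:
import numpy as np
GD = np.array([80.0, 10.0])   # gradient directions from the question
GM = np.array([2.0, 4.0])     # gradient magnitudes from the question
w = (GD / 20) - np.floor(GD / 20)
h1 = np.bincount(np.floor(GD / 20).astype(np.int64), GM * (1.0 - w), minlength=9)
h2 = np.bincount(np.ceil(GD / 20).astype(np.int64), GM * w, minlength=9)
print(h1 + h2)                # [2. 2. 0. 0. 2. 0. 0. 0. 0.]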
Referring to Roy Jevnisek's answer, minlength should be 9, as there are 9 bins.
Also, since 180 degrees is equivalent to 0 degrees, the last element of h should be folded back into the first element, as both the first and last elements of h represent the weighted count of 0 degrees, i.e.:
h[0] += h[-1]
h = h[:-1]
Then the HOG can be plotted by:
import matplotlib.pyplot as plt

GD = GD.reshape(-1)
GM = GM.reshape(-1)
w1 = (GD / 20) - np.floor(GD / 20)
w2 = 1 - w1  # (using np.ceil(GD/20) - GD/20 here would zero out directions that fall exactly on a bin)
h1 = np.bincount(np.floor(GD / 20).astype('int32'), GM * w2, minlength=9)
h2 = np.bincount(np.ceil(GD / 20).astype('int32'), GM * w1, minlength=9)
h = h1 + h2
h[0] += h[-1]
h = h[:-1]
values = np.arange(9)  # one x position per bin
plt.title('Histogram of Oriented Gradients (HOG)')
plt.bar(values, h)
plt.show()
I'm trying to use scipy.optimize.linprog to optimize a cost function where the cost coefficients are functions of the variables; e.g.
Cost = c1 * x1 + c2 * x2 # (x1,x2 are the variables)
for example
if x1 = 1, c1 = 0.5
if x1 = 2, c1 = 1.25
etc.
* Just to clarify *
We are looking for the minimum total cost over the variables xi, i = 1, 2, 3, ...
The xi are non-negative integers.
However, the cost coefficient for each xi is a function of the value of xi.
The cost is x1*f1(x1) + x2*f2(x2) + ... + c0.
fi is a "rate" table; e.g. f1(0) = 0; f1(1) = 2.00; f1(2) = 3.00, etc.
The xi are constrained: they can't be negative and can't exceed qi, i.e.
0 <= xi <= qi
The fi() values are calculated for each possible value of xi.
I hope this clarifies the model.
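For a concrete toy instance (f1 follows the rates given above; the f2 values and c0 are made up for illustration):
f1 = [0, 2.00, 3.00]                    # rate table for x1: f1(0), f1(1), f1(2)
f2 = [0, 1.50, 2.75]                    # hypothetical rate table for x2
c0 = 10.0                               # fixed cost
x1, x2 = 2, 1                           # one candidate integer assignment
cost = x1*f1[x1] + x2*f2[x2] + c0       # 2*3.00 + 1*1.50 + 10.0 = 17.5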
Here is some prototype code to show you that your problem is quite hard (regarding both formulation and performance; the former is visible in the code).
The implementation uses cvxpy for modelling (convex programming only) and is based on a mixed-integer approach.
Code
import numpy as np
from cvxpy import *
"""
x0 == 0 -> f(x) = 0
x0 == 1 -> f(x) = 1
...
x1 == 0 -> f(x) = 1
x1 == 1 -> f(x) = 4
...
"""
rate_table = np.array([[0, 1, 3, 5], [1, 4, 5, 6], [1.3, 1.7, 2.25, 3.0]])
bounds_x = (0, 3) # inclusive; bounds are needed for linearization!
# Vars
# ----
n_vars = len(rate_table)
n_values_per_var = [len(x) for x in rate_table]
I = Bool(n_vars, n_values_per_var[0]) # simplified assumption: rate-table sizes equal
X = Int(n_vars)
X_ = Variable(n_vars, n_values_per_var[0]) # X_ = mul_elemwise(I*X) broadcasted
# Constraints
# -----------
constraints = []
# X is bounded
constraints.append(X >= bounds_x[0])
constraints.append(X <= bounds_x[1])
# only one value in rate-table active (often formulated with SOS-type-1 constraints)
for i in range(n_vars):
constraints.append(sum_entries(I[i, :]) <= 1)
# linearization of product of BIN * INT (INT needs to be bounded!)
# based on Erwin's answer here:
# https://www.or-exchange.org/questions/10775/how-to-linearize-product-of-binary-integer-and-integer-variables
for i in range(n_values_per_var[0]):
constraints.append(bounds_x[0] * I[:, i] <= X_[:, i])
constraints.append(X_[:, i] <= bounds_x[1] * I[:, i])
constraints.append(X - bounds_x[1]*(1-I[:, i]) <= X_[:, i])
constraints.append(X_[:, i] <= X - bounds_x[0]*(1-I[:, i]))
# Fix chosings -> if table-entry x used -> integer needs to be x
# assumptions:
# - table defined for each int
help_vec = np.arange(n_values_per_var[0])
constraints.append(I * help_vec == X)
# ONLY FOR DEBUGGING -> make simple max each X solution infeasible
constraints.append(sum_entries(mul_elemwise([1, 3, 2], square(X))) <= 15)
# Objective
# ---------
objective = Maximize(sum_entries(mul_elemwise(rate_table, X_)))
# Problem & Solve
# ---------------
problem = Problem(objective, constraints)
problem.solve() # choose other solver if needed, e.g. commercial ones like Gurobi, Cplex
print('Max-objective: ', problem.value)
print('X:\n' + str(X.value))
Output
('Max-objective: ', 20.70000000000001)
X:
[[ 3.]
[ 1.]
[ 1.]]
Idea
Transform the objective max: x0*f(x0) + x1*f(x1) + ...
into: x0*f(x0==0) + x0*f(x0==1) + ... + x1*f(x1==0) + x1*f(x1==1)+ ...
Introduce binary variables to formulate:
f(x0==0) as I[0,0]*table[0,0]
f(x1==2) as I[1,2]*table[1,2]
Add constraints to limit the above I to have only one nonzero entry for each variable x_i (only one of the expanded objective components will be active).
Linearize the product x0*f(x0==0) == x0*I[0,0]*table[0,0] (integer * binary * constant).
Fix the table lookup: using the table entry with index x (of x0) should force x0 == x.
Assuming there are no gaps in the table, this can be formulated as I * help_vec == X, where help_vec == vector(lower_bound, ..., upper_bound).
cvxpy automatically (by construction) proves that our formulation is convex, which is needed for most solvers (and is in general not easy to recognize).
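For reference, the linearization used above for z = b*x, with binary b and the integer x bounded by L <= x <= U (here L = bounds_x[0], U = bounds_x[1]), consists of the four inequalities
L*b <= z <= U*b
x - U*(1 - b) <= z <= x - L*(1 - b)
which force z = 0 when b = 0 and z = x when b = 1; these are exactly the constraints added per column of X_ in the code.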
Just for fun: a bigger problem and a commercial solver
Input generated by:
def gen_random_growing_table(size):
return np.cumsum(np.random.randint(1, 10, size))
SIZE = 100
VARS = 100
rate_table = np.array([gen_random_growing_table(SIZE) for v in range(VARS)])
bounds_x = (0, SIZE-1) # inclusive; bounds are needed for linearization!
...
...
constraints.append(sum_entries(square(X)) <= 150)
Output:
Explored 19484 nodes (182729 simplex iterations) in 129.83 seconds
Thread count was 4 (of 4 available processors)
Optimal solution found (tolerance 1.00e-04)
Warning: max constraint violation (1.5231e-05) exceeds tolerance
Best objective -1.594000000000e+03, best bound -1.594000000000e+03, gap 0.0%
('Max-objective: ', 1594.0000000000005)