How to create a design of experiments with both continuous and discrete random variables with OpenTURNS?
I get that we can do:
X0 = ot.Normal()
X1 = ot.Normal()
distribution = ot.ComposedDistribution([X0,X1])
But this creates only a continuous joint distribution, which I can sample from. How do I create a joint distribution of a continuous and a discrete variable? Can I then sample from it?
Actually, in general, OpenTURNS does not make much difference between continuous and discrete distributions. So, once we have created a Distribution, all we have to do is to use the getSample method to get a simple Monte-Carlo sample. The following example shows that we can push the idea a little further by creating a LHS design of experiments.
To create the first marginal of the distribution, we select a univariate discrete distribution. Many of them, like the Bernoulli or Geometric distributions, are implemented in the library. In this example we pick the UserDefined distribution that assigns equal weights to the values -2, -1, 1 and 2.
Then we create a Monte-Carlo experiment, first with the getSample method and then with the MonteCarloExperiment class. Any other type of design of experiments can be generated from this distribution, which is why we finally show how to create an LHS (Latin Hypercube Sampling) experiment.
import openturns as ot
sample = ot.Sample([-2., -1., 1., 2.],1)
X0 = ot.UserDefined(sample)
X1 = ot.Normal()
distribution = ot.ComposedDistribution([X0,X1])
# Monte-Carlo experiment, simplest version
sample = distribution.getSample(10)
print(sample)
# Monte-Carlo experiment
size = 100
experiment = ot.MonteCarloExperiment(distribution, size)
sample = experiment.generate()
The following script produces the associated graphics.
graph = ot.Graph("MonteCarloExperiment", "x0", "x1", True, "")
cloud = ot.Cloud(sample, "blue", "fsquare", "")
graph.add(cloud)
graph
The previous script prints:
[ v0 X0 ]
0 : [ 2 -0.0612243 ]
1 : [ 1 0.789099 ]
2 : [ -1 0.583868 ]
3 : [ -1 1.33198 ]
4 : [ -2 -0.934389 ]
5 : [ 2 0.559401 ]
6 : [ -1 0.860048 ]
7 : [ 1 -0.822009 ]
8 : [ 2 -0.548796 ]
9 : [ -1 1.46505 ]
and produces the following graphics:
It is straightforward to create a LHS on the same distribution.
size = 100
experiment = ot.LHSExperiment(distribution, size)
sample = experiment.generate()
I'm tasked with using Parzen windows with the radial basis function kernel to determine which label to give to a given point.
My training data set has 4 dimensions (4 features per point).
My training label set contains the labels (which can be 0,1,2,... depending on how many classes we have) for all the points in my training set (It's a 1D-array).
My test data set contains a couple of points with 4 dimensions but no labels so it's a nx4 array.
We're interested in giving labels for each of the points in my test data set.
I first compute the RBF kernel $k(x_i,x)$ (using Python and NumPy):
for (i, ex) in enumerate(test_data):
    squared_distances = (np.sum((np.abs(ex - self.train_inputs)) ** 2, axis=1)) ** (1.0 / 2)
    k = np.exp(- squared_distances/2*(np.square(self.sigma)))
Let's assume that test_data looks like this :
[[ 0.40614 1.3492 -1.4501 -0.55949]
[ -1.3887 -4.8773 6.4774 0.34179]
[ -3.7503 -13.4586 17.5932 -2.7771 ]
[ -3.5637 -8.3827 12.393 -1.2823 ]
[ -2.5419 -0.65804 2.6842 1.1952 ]]
ex is a point from the test data set. here as an example :
[ 0.40614 1.3492 -1.4501 -0.55949]
self.train_inputs is the training data set and it looks like this
[[ 3.6216 8.6661 -2.8073 -0.44699]
[ 4.5459 8.1674 -2.4586 -1.4621 ]
[ 3.866 -2.6383 1.9242 0.10645]
...
[-1.1667 -1.4237 2.9241 0.66119]
[-2.8391 -6.63 10.4849 -0.42113]
[-4.5046 -5.8126 10.8867 -0.52846]]
k is an array containing the kernel values between every x_i (in self.train_inputs) and our current test point x (which is ex in the code).
k = [0.99837982 0.9983832 0.99874063 ... 0.9988909 0.99706044 0.99698724]
It's of the same length as the number of points in self.train_inputs.
My understanding of the radial basis function is that the closer a training point is to the test point, the greater the value of k(current training point, test point). However, k can never exceed 1 or be below 0.
So the goal is to select the training point that is the closest to the test point. We do this by looking which has the greatest value in k. Then we take its index and use that same index on the array containing the labels only. Therefore we get the label we want our test point to take.
In code it translates to this (the additional code is put below the first code snippet above) :
best_arg = np.argmax(k) #selects the greatest value in k and gives back its index.
classes_pred[i] = self.train_labels[best_arg] #we use the index to select the label in the train labels array.
Here self.train_labels looks like :
[0. 0. 0. ... 1. 1. 1.]
This approach gives for ex = [ 0.40614 1.3492 -1.4501 -0.55949] and k = [0.99837982 0.9983832 0.99874063 ... 0.9988909 0.99706044 0.99698724] :
818 for the index containing the greatest value in the current k and 1. as the label given self.train_labels[818] = 1.
However, it seems that I'm doing this wrong. Compared with an algorithm already implemented by my teacher, I get some of the labels wrong (especially when we have more than two classes). My question is: am I doing this wrong? If so, where? I'm new to machine learning, by the way.
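For comparison, here is a minimal sketch of one common way to write a Parzen-window classifier with an RBF kernel. It is an assumption about what is intended, not the teacher's implementation. It differs from the snippet above in three places: the squared distance is used without taking a square root, the exponent is divided by 2*sigma**2 (note the parentheses), and the kernel values are summed per class instead of taking the single largest value. The function name parzen_rbf_predict is hypothetical.

```python
import numpy as np


def parzen_rbf_predict(train_inputs, train_labels, test_data, sigma):
    """Label each test point by the class with the largest summed RBF weight."""
    train_inputs = np.asarray(train_inputs, dtype=float)
    test_data = np.asarray(test_data, dtype=float)
    labels = np.asarray(train_labels).astype(int)

    # One-hot encode the labels: shape (n_train, n_classes).
    n_classes = labels.max() + 1
    one_hot = np.eye(n_classes)[labels]

    # Squared Euclidean distances, shape (n_test, n_train); no square root.
    diff = test_data[:, None, :] - train_inputs[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)

    # RBF kernel: exp(-||x_i - x||^2 / (2 * sigma^2)).
    k = np.exp(-sq_dist / (2.0 * sigma ** 2))

    # Sum the kernel weights per class and pick the heaviest class.
    votes = k @ one_hot
    return np.argmax(votes, axis=1)


if __name__ == "__main__":
    train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
    labels = [0.0, 0.0, 1.0, 1.0]
    test = [[0.2, 0.1], [5.1, 5.4]]
    print(parzen_rbf_predict(train, labels, test, sigma=1.0))
```

Taking np.argmax(k) as in the question amounts to a nearest-neighbour rule, which can disagree with the summed vote when several classes have points at similar distances.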
I need a good algorithm for calculating the point that is closest to a collection of lines in python, preferably by using least squares. I found this post on a python implementation that doesn't work:
Finding the centre of multiple lines using least squares approach in Python
And I found this resource in Matlab that everyone seems to like... but I'm not sure how to convert it to python:
https://www.mathworks.com/matlabcentral/fileexchange/37192-intersection-point-of-lines-in-3d-space
I find it hard to believe that someone hasn't already done this... surely this is part of numpy or a standard package, right? I'm probably just not searching for the right terms - but I haven't been able to find it yet. I'd be fine with defining lines by two points each or by a point and a direction. Any help would be greatly appreciated!
Here's an example set of points that I'm working with:
initial XYZ points for the first set of lines
array([[-7.07107037, 7.07106748, 1. ],
[-7.34818339, 6.78264559, 1. ],
[-7.61352972, 6.48335745, 1. ],
[-7.8667115 , 6.17372055, 1. ],
[-8.1072994 , 5.85420065, 1. ]])
the angles that belong to the first set of lines
[-44.504854, -42.029223, -41.278573, -37.145774, -34.097022]
initial XYZ points for the second set of lines
array([[ 0., -20. , 1. ],
[ 7.99789129e-01, -19.9839984, 1. ],
[ 1.59830153e+00, -19.9360366, 1. ],
[ 2.39423914e+00, -19.8561769, 1. ],
[ 3.18637019e+00, -19.7445510, 1. ]])
the angles that belong to the second set of lines
[89.13244, 92.39087, 94.86425, 98.91849, 99.83488]
The solution should be the origin or very near it (the data is just a little noisy, which is why the lines don't perfectly intersect at a single point).
Here's a numpy solution using the method described in this link
import numpy as np

def intersect(P0,P1):
    """P0 and P1 are NxD arrays defining N lines.
    D is the dimension of the space. This function
    returns the least squares intersection of the N
    lines from the system given by eq. 13 in
    http://cal.cs.illinois.edu/~johannes/research/LS_line_intersect.pdf.
    """
    # generate all line direction vectors
    n = (P1-P0)/np.linalg.norm(P1-P0,axis=1)[:,np.newaxis] # normalized
    # generate the array of all projectors
    projs = np.eye(n.shape[1]) - n[:,:,np.newaxis]*n[:,np.newaxis] # I - n*n.T
    # see fig. 1
    # generate R matrix and q vector
    R = projs.sum(axis=0)
    q = (projs @ P0[:,:,np.newaxis]).sum(axis=0)
    # solve the least squares problem for the
    # intersection point p: Rp = q
    p = np.linalg.lstsq(R,q,rcond=None)[0]
    return p
Works
Edit: here is a generator for noisy test data
n = 6
P0 = np.stack([np.array([5,5])+3*np.random.random(size=2) for i in range(n)])
a = np.linspace(0,2*np.pi,n)+np.random.random(size=n)*np.pi/5.0
P1 = np.array([5+5*np.sin(a),5+5*np.cos(a)]).T
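If this Wikipedia equation carries any weight:
$p = \left(\sum_i \left(I - \hat{n}_i \hat{n}_i^T\right)\right)^{-1} \left(\sum_i \left(I - \hat{n}_i \hat{n}_i^T\right) a_i\right)$
where $a_i$ is a point on line $i$ and $\hat{n}_i$ is its unit direction vector, then you can use: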
def nearest_intersection(points, dirs):
    """
    :param points: (N, 3) array of points on the lines
    :param dirs: (N, 3) array of unit direction vectors
    :returns: (3,) array of intersection point
    """
    # outer products n_i n_i^T, shape (N, 3, 3)
    dirs_mat = dirs[:, :, np.newaxis] @ dirs[:, np.newaxis, :]
    points_mat = points[:, :, np.newaxis]
    I = np.eye(3)
    return np.linalg.lstsq(
        (I - dirs_mat).sum(axis=0),
        ((I - dirs_mat) @ points_mat).sum(axis=0),
        rcond=None
    )[0]
If you want help deriving / checking that equation from first principles, then math.stackexchange.com would be a better place to ask.
Regarding "surely this is part of numpy": note that numpy already gives you enough tools to express this very concisely.
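To sanity-check any of these implementations, it helps to build lines that are known to pass through a chosen point and confirm that the point is recovered. The sketch below is a self-contained variant of the same projector formulation (sum of I - n n^T); the function name closest_point_to_lines and the test lines are made up for illustration.

```python
import numpy as np


def closest_point_to_lines(points, dirs):
    """Least-squares point closest to N lines given as point + direction."""
    points = np.asarray(points, dtype=float)
    dirs = np.asarray(dirs, dtype=float)
    # Normalise the directions so that I - n n^T is a true projector.
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    dim = points.shape[1]
    # Projectors onto the orthogonal complement of each line, shape (N, dim, dim).
    projs = np.eye(dim) - dirs[:, :, None] * dirs[:, None, :]
    R = projs.sum(axis=0)
    # q = sum_i (I - n_i n_i^T) a_i
    q = np.einsum("nij,nj->i", projs, points)
    return np.linalg.lstsq(R, q, rcond=None)[0]


if __name__ == "__main__":
    # Three non-parallel lines that all pass through `target`.
    target = np.array([1.0, -2.0, 0.5])
    dirs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
    offsets = np.array([2.0, -3.0, 0.7])
    points = target + offsets[:, None] * dirs
    print(closest_point_to_lines(points, dirs))
```

With noisy lines the result is no longer an exact intersection but the point minimising the sum of squared distances to the lines.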
Here's the final code that I ended up using. Thanks to kevinkayaks and everyone else who responded! Your help is very much appreciated!!!
The first half of this function simply converts the two collections of points and angles to direction vectors. I believe the rest of it is basically the same as what Eric and Eugene proposed. I just happened to have success first with Kevin's and ran with it until it was an end-to-end solution for me.
import numpy as np
def LS_intersect(p0,a0,p1,a1):
    """
    :param p0 : Nx2 (x,y) position coordinates
    :param p1 : Nx2 (x,y) position coordinates
    :param a0 : angles in degrees for each point in p0
    :param a1 : angles in degrees for each point in p1
    :return: least squares intersection point of N lines from eq. 13 in
             http://cal.cs.illinois.edu/~johannes/research/LS_line_intersect.pdf
    """
    ang = np.concatenate( (a0,a1) ) # create list of angles
    # create direction vectors with magnitude = 1
    n = []
    for a in ang:
        n.append([np.cos(np.radians(a)), np.sin(np.radians(a))])
    pos = np.concatenate((p0[:,0:2],p1[:,0:2])) # create list of points
    n = np.array(n)
    # generate the array of all projectors
    nnT = np.array([np.outer(nn,nn) for nn in n ])
    ImnnT = np.eye(len(pos[0]))-nnT # orthocomplement projectors to n
    # now generate R matrix and q vector
    R = np.sum(ImnnT,axis=0)
    q = np.sum(np.array([np.dot(m,x) for m,x in zip(ImnnT,pos)]),axis=0)
    # and solve the least squares problem for the intersection point p
    return np.linalg.lstsq(R,q,rcond=None)[0]
#sample data
pa = np.array([[-7.07106638, 7.07106145, 1. ],
[-7.34817263, 6.78264524, 1. ],
[-7.61354115, 6.48336347, 1. ],
[-7.86671133, 6.17371816, 1. ],
[-8.10730426, 5.85419995, 1. ]])
paa = [-44.504854321138524, -42.02922380123842, -41.27857390748773, -37.145774853341386, -34.097022454778674]
pb = np.array([[-8.98220431e-07, -1.99999962e+01, 1.00000000e+00],
[ 7.99789129e-01, -1.99839984e+01, 1.00000000e+00],
[ 1.59830153e+00, -1.99360366e+01, 1.00000000e+00],
[ 2.39423914e+00, -1.98561769e+01, 1.00000000e+00],
[ 3.18637019e+00, -1.97445510e+01, 1.00000000e+00]])
pba = [88.71923357743934, 92.55801427272372, 95.3038321024299, 96.50212060095349, 100.24177145619092]
print("Should return (-0.03211692, 0.14173216)")
solution = LS_intersect(pa,paa,pb,pba)
print(solution)
I did find a way to calculate the center coordinate of a cluster of points. However, my method is quite slow when the number of initial coordinates is increased (I have about 100 000 coordinates).
The bottleneck is the for-loop in the code. I tried to remove it by using np.apply_along_axis, but discovered that this is nothing more than a hidden python-loop.
Is it possible to detect and average out clusters of various sizes, made of points that are too close to each other, in a vectorized way?
import numpy as np
from scipy.spatial import cKDTree
np.random.seed(7)
max_distance=1
#Create random points
points = np.array([[1,1],[1,2],[2,1],[3,3],[3,4],[5,5],[8,8],[10,10],[8,6],[6,5]])
#Create trees and detect the points and neighbours which needs to be fused
tree = cKDTree(points)
rows_to_fuse = np.array(list(tree.query_pairs(r=max_distance))).astype('uint64')
#Split the points and neighbours into two groups
points_to_fuse = points[rows_to_fuse[:,0], :2]
neighbours = points[rows_to_fuse[:,1], :2]
#get unique points_to_fuse
nonduplicate_points = np.ascontiguousarray(points_to_fuse)
unique_points = np.unique(nonduplicate_points.view([('', nonduplicate_points.dtype)]\
*nonduplicate_points.shape[1]))
unique_points = unique_points.view(nonduplicate_points.dtype).reshape(\
(unique_points.shape[0],\
nonduplicate_points.shape[1]))
#Empty array to store fused points
fused_points = np.empty((len(unique_points), 2))
####BOTTLENECK LOOP####
for i, point in enumerate(unique_points):
    #Detect all locations where a unique point occurs
    locs=np.where(np.logical_and((points_to_fuse[:,0] == point[0]), (points_to_fuse[:,1]==point[1])))
    #Select all neighbours on these locations take the average
    fused_points[i,:] = (np.average(np.hstack((point[0],neighbours[locs,0][0]))),np.average(np.hstack((point[1],neighbours[locs,1][0]))))
#Get original points that didn't need to be fused
points_without_fuse = np.delete(points, np.unique(rows_to_fuse.reshape((1, -1))), axis=0)
#Stack result
points = np.row_stack((points_without_fuse, fused_points))
Expected output
>>> points
array([[ 8. , 8. ],
[ 10. , 10. ],
[ 8. , 6. ],
[ 1.33333333, 1.33333333],
[ 3. , 3.5 ],
[ 5.5 , 5. ]])
EDIT 1: Example of 1 loop with desired result
Step 1: Create variables for the loop
#outside loop
points_to_fuse = np.array([[100,100],[101,101],[100,100]])
neighbours = np.array([[103,105],[109,701],[99,100]])
unique_points = np.array([[100,100],[101,101]])
#inside loop
point = np.array([100,100])
i = 0
Step 2: Detect all locations where a unique point occurs in the points_to_fuse array
locs=np.where(np.logical_and((points_to_fuse[:,0] == point[0]), (points_to_fuse[:,1]==point[1])))
>>> (array([0, 2], dtype=int64),)
Step 3: Create an array of the point and the neighbouring points at these locations and calculate the average
array_of_points = np.column_stack((np.hstack((point[0],neighbours[locs,0][0])),np.hstack((point[1],neighbours[locs,1][0]))))
>>> array([[100, 100],
[103, 105],
[ 99, 100]])
fused_points[i, :] = np.average(array_of_points, 0)
>>> array([ 100.66666667, 101.66666667])
Loop output after a complete run:
>>> print(fused_points)
>>> array([[ 100.66666667, 101.66666667],
[ 105. , 401. ]])
The bottleneck is not the loop itself, which is necessary since the neighbourhoods do not all have the same size.
The pitfall is the points_to_fuse[:,0] == point[0] comparison inside the loop, which triggers quadratic complexity. You can avoid that by sorting the points by index.
Here is an example of how to do that, even if it doesn't solve the whole problem (to be run after the generation of rows_to_fuse):
sorter=np.lexsort(rows_to_fuse.T)
sorted_points=rows_to_fuse[sorter]
uniques,counts=np.unique(sorted_points[:,1],return_counts=True)
indices=counts.cumsum()
neighbourhood=np.split(sorted_points,indices)[:-1]
means=[(points[ne[:,0]].sum(axis=0)+points[ne[0,1]])/(len(ne)+1) \
for ne in neighbourhood] # a simple python loop.
# + manage unfused points.
Another improvement is to compute the means with numba if you want to speed up the code, but I think the complexity is now close to optimal.
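If grouping by connected components is acceptable (every chain of close points is fused into one centroid, which is a slightly different rule from the per-point averaging in the question), the Python loop over clusters can be avoided altogether. This is a sketch of that swapped-in approach using scipy.sparse.csgraph.connected_components and np.bincount; the function name fuse_close_points is hypothetical.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree


def fuse_close_points(points, max_distance):
    """Replace each connected group of close points by its centroid."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # All index pairs (i, j) with i < j whose distance is at most max_distance.
    pairs = cKDTree(points).query_pairs(r=max_distance, output_type="ndarray")
    # Sparse adjacency matrix of the "is close to" graph.
    adjacency = coo_matrix(
        (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n)
    )
    # Isolated points end up as clusters of size one.
    n_clusters, labels = connected_components(adjacency, directed=False)
    # Per-cluster sums and counts with bincount: no loop over clusters.
    counts = np.bincount(labels, minlength=n_clusters)
    sums = np.column_stack(
        [np.bincount(labels, weights=points[:, k], minlength=n_clusters)
         for k in range(points.shape[1])]
    )
    return sums / counts[:, None]


if __name__ == "__main__":
    pts = np.array([[1, 1], [1, 2], [2, 1], [3, 3], [3, 4],
                    [5, 5], [8, 8], [10, 10], [8, 6], [6, 5]])
    print(fuse_close_points(pts, max_distance=1))
```

On the ten example points from the question this gives the same six centroids as the expected output, although the rows come out in a different order.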
I'm trying to build a program to map a 2d coordinate (latitude, longitude) to a float value. I have about 1 million rows of training data like
(41.140359, -8.612964) -> 65
... -> ...
I think this is a regression problem, except all of the regression examples I've found are only using 1 dimension, so I'm not sure.
What algorithm (or category of algorithms) should I use in this instance?
Before trying to find a function, plot your data in Excel or with a Python plotting library; you may see what kind of function you are looking for.
In addition, Excel has a regression computation module.
It is a regression problem and you can freely use e.g. linear regression to solve it. The examples are often one-dimensional so it is easy to understand, however they work for an arbitrary number of dimensions.
You can try to use linear regression first.
Let's give an example using numpy.linalg.lstsq:
>>> import numpy as np
>>> x = np.random.rand(10, 2)
>>> x
array([[ 0.7920302 , 0.05650698],
[ 0.76380636, 0.07123805],
[ 0.18650694, 0.89150851],
[ 0.22730377, 0.83013102],
[ 0.72369719, 0.07772721],
[ 0.26277287, 0.44253368],
[ 0.44421399, 0.98533921],
[ 0.91476656, 0.27183732],
[ 0.74745802, 0.08840694],
[ 0.60000819, 0.67162258]])
>>> y = np.random.rand(10)
>>> y
array([ 0.53341968, 0.63964031, 0.46097061, 0.68602146, 0.20041928,
0.42642768, 0.34039486, 0.93539655, 0.29946688, 0.57526445])
>>> m, c = np.linalg.lstsq(x, y, rcond=None)[0]
>>> print(m, c)
0.605269341974 0.370359070752
See documentation for more information about plotting and what those values represent.
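Note that with a two-column x, lstsq returns one coefficient per column and no intercept. A minimal sketch of a fit that does include an intercept, assuming a plain linear model value = w_lat * lat + w_lon * lon + b, appends a column of ones. The helper names fit_plane and predict are made up for illustration, and the data is synthetic.

```python
import numpy as np


def fit_plane(coords, values):
    """Least-squares fit of values ~ w_lat * lat + w_lon * lon + b."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Design matrix: the two coordinates plus a column of ones for the intercept.
    A = np.column_stack([coords, np.ones(len(coords))])
    coef, *_ = np.linalg.lstsq(A, values, rcond=None)
    return coef  # [w_lat, w_lon, b]


def predict(coef, coords):
    """Evaluate the fitted plane at new (lat, lon) coordinates."""
    coords = np.asarray(coords, dtype=float)
    return coords @ coef[:2] + coef[2]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = rng.uniform(-10.0, 50.0, size=(50, 2))
    # Synthetic, noise-free target used only to check the fit.
    values = 2.0 * coords[:, 0] - 3.0 * coords[:, 1] + 5.0
    coef = fit_plane(coords, values)
    print(coef)
    print(predict(coef, [[41.140359, -8.612964]]))
```

A linear model is only a starting point; if the value varies non-linearly over the map, the same design-matrix idea extends to extra feature columns.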
I have a question similar to the question asked here:
simple way of fusing a few close points. I want to replace points that are located close to each other with the average of their coordinates. The closeness in cells is specified by the user (I am talking about euclidean distance).
In my case I have a lot of points (about 1-million). This method is working, but is time consuming as it uses a double for loop.
Is there a faster way to detect and fuse close points in a numpy 2d array?
To be complete I added an example:
points=array([[ 382.49056159, 640.1731949 ],
[ 496.44669161, 655.8583119 ],
[ 1255.64762859, 672.99699399],
[ 1070.16520917, 688.33538171],
[ 318.89390168, 718.05989421],
[ 259.7106383 , 822.2 ],
[ 141.52574427, 28.68594436],
[ 1061.13573287, 28.7094536 ],
[ 820.57417943, 84.27702407],
[ 806.71416007, 108.50307828]])
A scatterplot of the points is visible below. The red circle indicates the points located close to each other (in this case a distance of 27.91 between the last two points in the array). So if the user would specify a minimum distance of 30 these points should be fused.
In the output of the fuse function, the last two points are fused. This will look like:
#output
array([[ 382.49056159, 640.1731949 ],
[ 496.44669161, 655.8583119 ],
[ 1255.64762859, 672.99699399],
[ 1070.16520917, 688.33538171],
[ 318.89390168, 718.05989421],
[ 259.7106383 , 822.2 ],
[ 141.52574427, 28.68594436],
[ 1061.13573287, 28.7094536 ],
[ 813.64416975, 96.390051175]])
If you have a large number of points then it may be faster to build a k-D tree using scipy.spatial.KDTree, then query it for pairs of points that are closer than some threshold:
import numpy as np
from scipy.spatial import KDTree
tree = KDTree(points)
rows_to_fuse = tree.query_pairs(r=30)
print(repr(rows_to_fuse))
# {(8, 9)}
print(repr(points[list(rows_to_fuse)]))
# array([[ 820.57417943, 84.27702407],
# [ 806.71416007, 108.50307828]])
The major advantage of this approach is that you don't need to compute the distance between every pair of points in your dataset.
You can use scipy's distance functions such as pdist in order to quickly find which points should be merged:
import numpy as np
from scipy.spatial.distance import pdist, squareform
d = squareform(pdist(a))  # a is the (N, 2) array of points from the question
d = np.ma.array(d, mask=np.isclose(d, 0))
a[d.min(axis=1) < 30]
#array([[ 820.57417943, 84.27702407],
# [ 806.71416007, 108.50307828]])
NOTE
For large samples this method can cause memory errors since it is storing a full matrix containing the relative distances.
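To give a rough idea of the scale, the square matrix returned by squareform holds N*N entries. The back-of-the-envelope sketch below assumes 8-byte float64 entries; the helper name dense_distance_matrix_bytes is made up for illustration.

```python
def dense_distance_matrix_bytes(n_points, itemsize=8):
    """Bytes needed for a dense n_points x n_points matrix of float64 values."""
    return n_points * n_points * itemsize


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        gib = dense_distance_matrix_bytes(n) / 2**30
        print(f"{n:>9} points -> {gib:,.1f} GiB")
```

For the roughly one million points mentioned in the question this comes to 8 * 10**12 bytes, which is why the k-D tree approach is the more practical one at that size.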