We have boring CSV with 10000 rows of ages (float), titles (enum/int), scores (float), ....
We have N columns each with int/float values in a table.
You can imagine this as points in ND space
We want to pick K points that would have maximised distance between each other.
So if we have 100 points in a tightly packed cluster and one point in the distance we would get something like this for three points:
or this
For 4 points it will become more interesting and pick some point in the middle.
So how to select K most distant rows (points) from N (with any complexity)? It looks like an ND point cloud "triangulation" with a given resolution yet not for 3d points.
I search for a reasonably fast approach (approximate - no precise solution needed) for K=200 and N=100000 and ND=6 (probably multigrid or ANN on KDTree based, SOM or triangulation based..).. Does anyone know one?
From past experience with a pretty similar problem, a simple solution of computing the mean Euclidean distance of all pairs within each group of K points and then taking the largest mean, works very well. As someone noted above, it's probably hard to avoid a loop on all combinations (not on all pairs). So a possible implementation of all this can be as follows:
import itertools
import numpy as np
from scipy.spatial.distance import pdist
Npoints = 3 # or 4 or 5...
# making up some data:
data = np.matrix([[3,2,4,3,4],[23,25,30,21,27],[6,7,8,7,9],[5,5,6,6,7],[0,1,2,0,2],[3,9,1,6,5],[0,0,12,2,7]])
# finding row indices of all combinations:
c = [list(x) for x in itertools.combinations(range(len(data)), Npoints )]
distances = []
for i in c:
distances.append(np.mean(pdist(data[i,:]))) # pdist: a method of computing all pairwise Euclidean distances in a condensed way.
ind = distances.index(max(distances)) # finding the index of the max mean distance
rows = c[ind] # these are the points in question
I propose an approximate solution. The idea is to start from a set of K points chosen in a way I'll explain below, and repeatedly loop through these points replacing the current one with the point, among the N-K+1 points not belonging to the set but including the current one, that maximizes the sum of the distances from the points of the set. This procedure leads to a set of K points where the replacement of any single point would cause the sum of the distances among the points of the set to decrease.
To start the process we take the K points that are closest to the mean of all points. This way we have good chances that on the first loop the set of K points will be spread out close to its optimum. Subsequent iterations will make adjustments to the set of K points towards a maximum of the sum of distances, which for the current values of N, K and ND appears to be reachable in just a few seconds. In order to prevent excessive looping in edge cases, we limit the number of loops nonetheless.
We stop iterating when an iteration does not improve the total distance among the K points. Of course, this is a local maximum. Other local maxima will be reached for different initial conditions, or by allowing more than one replacement at a time, but I don't think it would be worthwhile.
The data must be adjusted in order for unit displacements in each dimension to have the same significance, i.e., in order for Euclidean distances to be meaningful. E.g., if your dimensions are salary and number of children, unadjusted, the algorithm will probably yield results concentrated in the extreme salary regions, ignoring that person with 10 kids. To get a more realistic output you could divide salary and number of children by their standard deviation, or by some other estimate that makes differences in salary comparable to differences in number of children.
To be able to plot the output for a random Gaussian distribution, I have set ND = 2 in the code, but setting ND = 6, as per your request, is no problem (except you cannot plot it).
import matplotlib.pyplot as plt
import numpy as np
import scipy.spatial as spatial
N, K, ND = 100000, 200, 2
MAX_LOOPS = 20
SIGMA, SEED = 40, 1234
rng = np.random.default_rng(seed=SEED)
means, variances = [0] * ND, [SIGMA**2] * ND
data = rng.multivariate_normal(means, np.diag(variances), N)
def distances(ndarray_0, ndarray_1):
if (ndarray_0.ndim, ndarray_1.ndim) not in ((1, 2), (2, 1)):
raise ValueError("bad ndarray dimensions combination")
return np.linalg.norm(ndarray_0 - ndarray_1, axis=1)
# start with the K points closest to the mean
# (the copy() is only to avoid a view into an otherwise unused array)
indices = np.argsort(distances(data, data.mean(0)))[:K].copy()
# distsums is, for all N points, the sum of the distances from the K points
distsums = spatial.distance.cdist(data, data[indices]).sum(1)
# but the K points themselves should not be considered
# (the trick is that -np.inf ± a finite quantity always yields -np.inf)
distsums[indices] = -np.inf
prev_sum = 0.0
for loop in range(MAX_LOOPS):
for i in range(K):
# remove this point from the K points
old_index = indices[i]
# calculate its sum of distances from the K points
distsums[old_index] = distances(data[indices], data[old_index]).sum()
# update the sums of distances of all points from the K-1 points
distsums -= distances(data, data[old_index])
# choose the point with the greatest sum of distances from the K-1 points
new_index = np.argmax(distsums)
# add it to the K points replacing the old_index
indices[i] = new_index
# don't consider it any more in distsums
distsums[new_index] = -np.inf
# update the sums of distances of all points from the K points
distsums += distances(data, data[new_index])
# sum all mutual distances of the K points
curr_sum = spatial.distance.pdist(data[indices]).sum()
# break if the sum hasn't changed
if curr_sum == prev_sum:
break
prev_sum = curr_sum
if ND == 2:
X, Y = data.T
marker_size = 4
plt.scatter(X, Y, s=marker_size)
plt.scatter(X[indices], Y[indices], s=marker_size)
plt.grid(True)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
Output:
Splitting the data into 3 equidistant Gaussian distributions the output is this:
Assuming that if you read your csv file with N (10000) rows and D dimension (or features) into a N*D martix X. You can calculate the distance between each point and store it in a distance matrix as follows:
import numpy as np
X = np.asarray(X) ### convert to numpy array
distance_matrix = np.zeros((X.shape[0],X.shape[0]))
for i in range(X.shape[0]):
for j in range(i+1,X.shape[0]):
## We compute triangle matrix and copy the rest. Distance from point A to point B and distance from point B to point A are the same.
distance_matrix[i][j]= np.linalg.norm(X[i]-X[j]) ## Here I am calculating Eucledian distance. Other distance measures can also be used.
#distance_matrix = distance_matrix + distance_matrix.T - np.diag(np.diag(distance_matrix)) ## This syntax can be used to get the lower triangle of distance matrix, which is not really required in your case.
K = 5 ## Number of points that you want to pick
indexes = np.unravel_index(np.argsort(distance_matrix.ravel())[-1*K:], distance_matrix.shape)
print(indexes)
Bottom Line Up Front: Dealing with multiple equally distant points and the Curse of Dimensionality are going to be larger problems than just finding the points. Spoiler alert: There's a surprise ending.
I think this an interesting question but I'm bewildered by some of the answers. I think this is, in part, due to the sketches provided. You've no doubt noticed the answers look similar -- 2d, with clusters -- even though you indicated a wider scope was needed. Because others will eventually see this, I'm going to step through my thinking a bit slowly so bear with me for the early part.
It makes sense to start with a simplified example to see if we can generalize a solution with data that's easy to grasp and a linear 2D model is easiest of the easy.
We don't need to calculate all the distances though. We just need the ones at the extremes. So we can then take the top and bottom few values:
right = lin_2_D.nlargest(8, ['x'])
left = lin_2_D.nsmallest(8, ['x'])
graph = sns.scatterplot(x="x", y="y", data=lin_2_D, color = 'gray', marker = '+', alpha = .4)
sns.scatterplot(x = right['x'], y = right['y'], color = 'red')
sns.scatterplot(x = left['x'], y = left['y'], color = 'green')
fig = graph.figure
fig.set_size_inches(8,3)
What we have so far: Of 100 points, we've eliminated the need to calculate the distance between 84 of them. Of what's left we can further drop this by ordering the results on one side and checking the distance against the others.
You can imagine a case where you have a couple of data points way off the trend line that could be captured by taking the greatest or least y values, and all that starts to look like Walter Tross's top diagram. Add in a couple of extra clusters and you get what looks his bottom diagram and it appears that we're sort of making the same point.
The problem with stopping here is the requirement you mentioned is that you need a solution that works for any number of dimensions.
The unfortunate part is that we run into four challenges:
Challenge 1: As you increase the dimensions you can run into a large number of cases where you have multiple solutions when seeking midpoints. So you're looking for k furthest points but have a large number of equally valid possible solutions and no way prioritizing them. Here are two super easy examples illustrate this:
A) Here we have just four points and in only two dimensions. You really can't get any easier than this, right? The distance from red to green is trivial. But try to find the next furthest point and you'll see both of the black points are equidistant from both the red and green points. Imagine you wanted the furthest six points using the first graphs, you might have 20 or more points that are all equidistant.
edit: I just noticed the red and green dots are at the edges of their circles rather than at the center, I'll update later but the point is the same.
B) This is super easy to imagine: Think of a D&D 4 sided die. Four points of data in a three-dimensional space, all equidistant so it's known as a triangle-based pyramid. If you're looking for the closest two points, which two? You have 4 choose 2 (aka, 6) combinations possible. Getting rid of valid solutions can be a bit of a problem because invariably you face questions such as "why did we get rid of these and not this one?"
Challenge 2: The Curse of Dimensionality. Nuff Said.
Challenge 3 Revenge of The Curse of Dimensionality Because you're looking for the most distant points, you have to x,y,z ... n coordinates for each point or you have to impute them. Now, your data set is much larger and slower.
Challenge 4 Because you're looking for the most distant points, dimension reduction techniques such as ridge and lasso are not going to be useful.
So, what to do about this?
Nothing.
Wait. What?!?
Not truly, exactly, and literally nothing. But nothing crazy. Instead, rely on a simple heuristic that is understandable and computationally easy. Paul C. Kainen puts it well:
Intuitively, when a situation is sufficiently complex or uncertain,
only the simplest methods are valid. Surprisingly, however,
common-sense heuristics based on these robustly applicable techniques
can yield results which are almost surely optimal.
In this case, you have not the Curse of Dimensionality but rather the Blessing of Dimensionality. It's true you have a lot of points and they'll scale linearly as you seek other equidistant points (k) but the total dimensional volume of space will increase to power of the dimensions. The k number of furthest points you're is insignificant to the total number of points. Hell, even k^2 becomes insignificant as the number of dimensions increase.
Now, if you had a low dimensionality, I would go with them as a solution (except the ones that are use nested for loops ... in NumPy or Pandas).
If I was in your position, I'd be thinking how I've got code in these other answers that I could use as a basis and maybe wonder why should I should trust this other than it lays out a framework on how to think through the topic. Certainly, there should be some math and maybe somebody important saying the same thing.
Let me reference to chapter 18 of Computer Intensive Methods in Control and Signal Processing and an expanded argument by analogy with some heavy(-ish) math. You can see from the above (the graph with the colored dots at the edges) that the center is removed, particularly if you followed the idea of removing the extreme y values. It's a though you put a balloon in a box. You could do this a sphere in a cube too. Raise that into multiple dimensions and you have a hypersphere in a hypercube. You can read more about that relationship here.
Finally, let's get to a heuristic:
Select the points that have the most max or min values per dimension. When/if you run out of them pick ones that are close to those values if there isn't one at the min/max. Essentially, you're choosing the corners of a box For a 2D graph you have four points, for a 3D you have the 8 corners of the box (2^3).
More accurately this would be a 4d or 5d (depending on how you might assign the marker shape and color) projected down to 3d. But you can easily see how this data cloud gives you the full range of dimensions.
Here is a quick check on learning; for purposes of ease, ignore the color/shape aspect: It's easy to graphically intuit that you have no problem with up to k points short of deciding what might be slightly closer. And you can see how you might need to randomize your selection if you have a k < 2D. And if you added another point you can see it (k +1) would be in a centroid. So here is the check: If you had more points, where would they be? I guess I have to put this at the bottom -- limitation of markdown.
So for a 6D data cloud, the values of k less than 64 (really 65 as we'll see in just a moment) points are pretty easy. But...
If you don't have a data cloud but instead have data that has a linear relationship, you'll 2^(D-1) points. So, for that linear 2D space, you have a line, for linear 3D space, you'd have a plane. Then a rhomboid, etc. This is true even if your shape is curved. Rather than do this graph myself, I'm using the one from an excellent post on by Inversion Labs on Best-fit Surfaces for 3D Data
If the number of points, k, is less than 2^D you need a process to decide what you don't use. Linear discriminant analysis should be on your shortlist. That said, you can probably satisfice the solution by randomly picking one.
For a single additional point (k = 1 + 2^D), you're looking for one that is as close to the center of the bounding space.
When k > 2^D, the possible solutions will scale not geometrically but factorially. That may not seem intuitive so let's go back to the two circles. For 2D you have just two points that could be a candidate for being equidistant. But if that were 3D space and rotate the points about the line, any point in what is now a ring would suffice as a solution for k. For a 3D example, they would be a sphere. Hyperspheres (n-spheres) from thereon. Again, 2^D scaling.
One last thing: You should seriously look at xarray if you're not already familiar with it.
Hope all this helps and I also hope you'll read through the links. It'll be worth the time.
*It would be the same shape, centrally located, with the vertices at the 1/3 mark. So like having 27 six-sided dice shaped like a giant cube. Each vertice (or point nearest it) would fix the solution. Your original k+1 would have to be relocated too. So you would select 2 of the 8 vertices. Final question: would it be worth calculating the distances of those points against each other (remember the diagonal is slightly longer than the edge) and then comparing them to the original 2^D points? Bluntly, no. Satifice the solution.
If you're interested in getting the most distant points you can take advantage of all of the methods that were developed for nearest neighbors, you just have to give a different "metric".
For example, using scikit-learn's nearest neighbors and distance metrics tools you can do something like this
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.neighbors.dist_metrics import PyFuncDistance
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
def inverted_euclidean(x1, x2):
# You can speed this up using cython like scikit-learn does or numba
dist = np.sum((x1 - x2) ** 2)
# We invert the euclidean distance and set nearby points to the biggest possible
# positive float that isn't inf
inverted_dist = np.where(dist == 0, np.nextafter(np.inf, 0), 1 / dist)
return inverted_dist
# Make up some fake data
n_samples = 100000
n_features = 200
X, _ = make_blobs(n_samples=n_samples, centers=3, n_features=n_features, random_state=0)
# We exploit the BallTree algorithm to get the most distant points
ball_tree = BallTree(X, leaf_size=50, metric=PyFuncDistance(inverted_euclidean))
# Some made up query, you can also provide a stack of points to query against
test_point = np.zeros((1, n_features))
distance, distant_points_inds = ball_tree.query(X=test_point, k=10, return_distance=True)
distant_points = X[distant_points_inds[0]]
# We can try to visualize the query results
plt.plot(X[:, 0], X[:, 1], ".b", alpha=0.1)
plt.plot(test_point[:, 0], test_point[:, 1], "*r", markersize=9)
plt.plot(distant_points[:, 0], distant_points[:, 1], "sg", markersize=5, alpha=0.8)
plt.show()
Which will plot something like:
There are many points that you can improve on:
I implemented the inverted_euclidean distance function with numpy, but you can try to do what the folks of scikit-learn do with their distance functions and implement them in cython. You could also try to jit compile them with numba.
Maybe the euclidean distance isn't the metric you would like to use to find the furthest points, so you're free to implement your own or simply roll with what scikit-learn provides.
The nice thing about using the Ball Tree algorithm (or the KdTree algorithm) is that for each queried point you have to do log(N) comparisons to find the furthest point in the training set. Building the Ball Tree itself, I think also requires log(N) comparison, so in the end if you want to find the k furthest points for every point in the ball tree training set (X), it will have almost O(D N log(N)) complexity (where D is the number of features), which will increase up to O(D N^2) with the increasing k.
Me and my team are participating in ESA Astro Pi challenge. Our program will ran on the ISS for 3 hours and we will our results back and analyze them.
We want to investigate the connection between the magnetic intensity measurements from Sense HAT magnetometer and predictions from the World Magnetic Model (WMM). We want to research the accuracy of the magnetometer on Sense HAT.
The program will get raw magnetometer data (X, Y and Z) in microteslas from Sense HAT and calculate values H and F as described in British geological survey's article (section 2.1). It will then save them to CSV file, along with timestamp and location calculated with ephem.
We will then compare values Z, H and F from ISS and WMM and create maps with our data and differences (like figures 6, 8 and 10). We will then research, how accurate are Sense HAT magnetometer data.
We want to compare our data with data from WMM to see how accurate is Sense HAT magnetometer, but we have a problem that orientation of magnetometer will always be different. Because of that, our data will always be (very) different from WMM so we won't be able to compare them correctly.
We talked with Astro Pi support team and they suggested to "normalise the angled measurements so it looks like they were taken by a device aligned North/South".
Unfortunately, we (and they) don't know how to do this, so they suggested to ask this question on Stack Exchange. I asked it on Math Stack Exchange, Physics Stack Exchange and Raspberry Pi Forums. Unforcenetly, they didn't received any answers, so I am asking this question again.
How can we do this? We have data for timestamp, ISS location (latitude, longitude, elevation), magnetic data (X, Y and Z) and also direction from the North.
We want to normalise our data so we will be able to correctly compare them with data from WMM.
Here is part of our program that calculates magnetometer values (which gets not normalised data):
compass = sense.get_compass_raw()
try:
# Get raw data (values are swapped because Sense HAT on ISS is in different position)
# x: northerly intensity
# y: easterly intensity
# z: vertical intensity
x = float(compass['z'])
y = float(compass['y'])
z = float(compass['x'])
except (ValueError, KeyError) as err:
# Write error to log (excluded from this snippet)
pass
try:
# h: horizontal intensity
# f: total intensity
# d: declination
# i: inclination
h = sqrt(x ** 2 + y ** 2)
f = sqrt(h ** 2 + z ** 2)
d = degrees(atan(y / x))
i = degrees(atan(z / h))
except (TypeError, ValueError, ZeroDivisionError) as err:
# Write error to log (excluded from this snippet)
pass
There is also some simple simulator available with our code: https://trinket.io/library/trinkets/cc87813ce7
Part of email from Astro Pi team about location and position of magnetometer:
Z is going down through the middle of the Sense Hat.
X runs between the USB ports and SD card slot.
Y runs across from the HDMI port to the 40 way pin header.
On the ISS the AstroPi orientation is that the Ethernet + USB ports face the deck and the SD card slot is towards the sky.
So, that's basically a rotation around the Y axis from flat. So you keep the Y axis the same and swap around Z and X.
It can help to look at the Google Street view of the interior of the ISS Columbus module to get a better idea how the AstroPi is positioned;
https://www.google.com/streetview/#international-space-station/columbus-research-laboratory
If you pan the camera down and to the right, you'll see a green light - that's the AstroPi. The direction of travel for the whole space station is towards the inflatable Earth ball you can see on the left.
So, broadly speaking, the SD card slot points towards azimuth as in away from the centre of the Earth (so the X axis).
The LED matrix is facing the direction of travel of the space station (the Z axis).
Because of the orbital path of the ISS the Z and Y axes will continually change direction relative to the poles as it moves around the Earth.
So, I am guessing you want to normalise the angled measurements so it looks like they were taken by a device aligned North/South?
I think you need to create local reference coordinate system similar to NEH (north,east,height/altitude/up) something like
Representing Points on a Circular Radar Math approach.
Its commonly used in aviation as a reference frame (heading is derived from it) so your reference frame is computed from your geo location and its axises pointing to North, East and Up.
Now the problem is what does it mean aligned North/South and normalizing.. ?
If reference device measure just projection than you would need to do something like this:
dot(measured_vector,reference_unit_direction)
where the direction would be the North direction but as unit vector.
If the reference device measure a full 3D too then you need to transform both reference and tested measured data into the same coordinate system. That is done by using
transform matrices
So simple matrix * vector multiplication will do ... Only then compute the values H,F,Z which I do not know what they are and too lazy to go through papers ... would expect E,H or B vectors instead.
However if you do not have the geo location at moment of measure then you have just the North direction in respect to the ISS in form of Euler angles so you can not construct 3D reference frame at all (unless you got 2 known vectors instead of just one like UP). In such case you need to go with the option 1 projection (using dot product and north direction vector). So you will handle just scalar values instead of 3D vectors afterwards.
[Edit1]
From the link of yours:
The geomagnetic field vector, B, is described by the orthogonal
components X (northerly intensity), Y (easterly intensity) and Z
(vertical intensity, positive downwards);
This is not my field of expertise so I might be wrong here but this is how I understand it:
B(Bx,By,Bz) - magnetic field vector
a(ax,ay,az) - acceleration
Now F is a magnitude of B so its invariant on rotation:
F = |B| = sqrt( Bx*Bx + By*By + Bz*Bz )
you need to compute the X,Y,Z values of B in the NED reference frame (North,East,Down) so you need the basis vectors first:
Down = a/|a| // gravity points down
North = B/|B| // north is close to B direction
East = cross(Down,North) // East is perpendicular to Down and North
North = cross(East,Down) // north is perpendicular to Down and East, this should convert North to the horizontal plane
You should render them to visually check if they point to correct directions if not negate them by reordering the cross operands (I might have the order wrong I am used to use Up vector instead). Now just convert B to NED :
X = dot(North,B)
Y = dot(East,B)
Z = dot(Down,B)
And now you can compute the H
H = sqrt( X*X +Y*Y )
The vector math needed for this you will find in the transform matrix link above.
beware this will work only if no accelerations are present (the sensor is not on a robotic arm during its operation, or ISS is not doing a burn...) Otherwise you need to obtain the NED frame differently (like from onboard systems)
If this not work correctly then you can compute NED from your ISS position but for that you would need to know the exact orientation and displacement of the sensor in respect to your simulation model that provide your location. I do not know what rotations ISS do so I would not touch that subject unless as a last resort.
I am afraid that I will not have time for coding for some time ... anyway coding without sample input data nor the coordinate system expalnations and all the input/output variables is insanity ... simple negation of axis will invalidate the whole thing and there is a lot of duplicities along the ways and to cover all of them you would end up with many many versions to try to...
Apps should be build up incrementally but I am afraid that without the access to simulation or real HW that is not possible. And there is a whole bunch of things that could go wrong ... making even simple programs a magnitude harder to code... I would first check the F as it does not require any "normalization" first to see if the results are off or not. If off it might suggest different units or god knows what ...
I've been tasked with writing a python based plugin for a graph drawing program that generates an STL model of a graph. A graph being an object made up of vertices and edges, where a vertex is represented by a 3D ball (a tessellated icosahedron), and an edge is represented with a cylinder that connects with two balls at either end. The end result of the 3D model is that it will get dumped out to an STL file for 3D printing. I'm able to generate the 3D models for the balls and cylinders without any issues, but I'm having some issues generating the overall model, and getting the balls and cylinders to connect properly.
My original idea was to create tessellated icosahedrons at the origin, then translate them out to the positions of the vertices. This works fine. I then, for each edge, I would create a cylinder at the origin, rotate it to the correct angle so that it points in the correct direction, then translate it to the midpoint between the two vertices so that the ends of the cylinders are embedded in the icosahedrons. This is where things are going wrong. I'm having some difficulties getting the rotations correct. To calculate the rotations, I'm doing the following:
First, I find the angle between the two points as follows (where source and target are both vertices in the graph, belonging to the edge that I'm currently processing):
deltaX = source.x - target.x
deltaY = source.y - target.y
deltaZ = source.z - target.z
xyAngle = math.atan2(deltaX, deltaY)
xzAngle = math.atan2(deltaX, deltaZ)
yzAngle = math.atan2(deltaY, deltaZ)
The angles being calculated seem reasonable, and as far as I can tell, do actually represent the angle between the vertices. For example, if I have a vertex at (1, 1, 0) and another vertex at (3, 3, 0), the angle edge connecting them does show up as a 45 degree angle between the two vertices. (That, or -135 degrees, depending which vertex is the source and which is the target).
Once I have the angles calculated, I create a cylinder and rotate it by the angles that have been calculated, like so, using some other classes that I've created:
c = cylinder()
c.createCylinder(edgeThickness, edgeLength)
c.rotateX(-yzAngle)
c.rotateY(xzAngle)
c.rotateZ(-xyAngle)
c.translate(edgePosition.x, edgePosition.y, edgePosition.z)
(Where edgePosition is the midpoint between the two vertices in the graph, edgeThickness is the radius of the cylinder being created, and edgeLength is the distance between the two vertices).
As mentioned, its the rotating of the cylinders that doesn't work as expected. It seems to do the correct rotation on the x/y plane, but as soon as an edge has vertices that differ in all three components (x, y, and z), the rotation fails. Here's an example of a graph that differs in the x, and y components, but not in the z component:
And here's the resulting STL file, as seen in Makerware (which is used to send the 3D models to the 3D printer):
(The extra cylinder looking bit in the bottom left is something I've currently left in for testing purposes - a cylinder that points in the direction of the z axis, located at the origin).
If I take that same graph and move the middle vertex out in the z axis, so now all the edges involve angles in all three axis, I get a result something like the following:
As show in the app:
The resulting STL file, as show in Makerware:
...and that same model as viewed from the side:
As you can see, the cylinders definitely aren't meeting up with the balls like I thought they would. My question is this: Is my approach to doing this flawed, or is it some small but critical mistake that I'm making somewhere in my rotations? I'm pretty sure it isn't a problem with the rotation functions themselves, as I've been able to independently verify that they work as expected. I also tried creating a rotate function that takes in a yaw, pitch, and roll and does all three at once, and it seemed to generate the same result, like so:
c.rotateYawPitchRoll(xzAngle, -yzAngle, -xyAngle)
So... anyone have any ideas on what I might be doing wrong?
UPDATE: As joojaa pointed out, it was a combination of calculating the correct angles as well as the order that they were applied. In order to get things working, I first calculate the rotation on the x axis, as follows:
zyAngle = math.atan2(deltaVector.z, deltaVector.y)
where deltaVector is the difference between the target and source vectors. This rotation is not yet applied though! The next step is to calculate the rotation on the y axis, as follows:
angle = vector.angleBetweenVectors(vector(target.x - source.x, target.y - source.y, target.z - source.z), vector(target.x - source.x, target.y - source.y, 0.0))
Once both rotations are calculated, they are then applied... in the reverse order! First, the x, then the y:
c.rotateY(angle)
c.rotateX(-zyAngle) #... where c is a cylinder object
There still seems to be a few bugs, but this seems to at least work for a simple test case.
Rotation happens in successive order, so the angles affect each other. It is not possible to use a Euler model to rotate them at once. This is why you can not just calculate the rotations based on the first static situation. Just imagine turning a cube so that it is standing on its corner upright. Yes the first rotation is 45 but the second is not since the cube is already turned by that time (draw a each step of the sequence and see what happens). Space rotations aren't trivial.
So you need to rotate one angle then re calculate the second angle and so forth. This is also why your first rotation works right. You only need 2 rotations unless your interested in making sure the rotation around the shaft has a certain direction.
I would suggest you use axis angles or matrices instead to do this. Mainly because in axis angles this is trivial the angle is the dot between the along tube start and end vectors and the axis is the cross between those 2. You can then convert those to Euler angles if you need. But probably you can just use the matrix directly. For ideas on how conversions and how the rotation could directly be calculated see: transformations.py by Christoph Gohlke. Also see the accompanying c source.
I think i need to expand this answer a bit
There is a really easy way out for this question that sidesteps all your and many other persons problems. The answer is do not use Euler angle rotation. Ive used a lot of brainpower to try to explain Euler rotations to problems that are ultimately solved more easily without Euler rotations. To justify i will leave just one reason for this if you want more think up of some more answers.
The reason most to use Euler rotation sequences is that you probably don't understand Euler angles. There are in fact only a handful of situations where they are good. No self respecting programmer uses Euler rotations to solve this issue. What you do is you use vector math instead.
So you have the direction vector from the source to target which is usually calculated:
along = normalize(target-source)
this is simply one of your matrix rows (or column notation is up to model maker), the one that corresponds to your cylinders original direction (the rows are just x y z w), then you need another vector perpendicular to this one. Choose a arbitrary vector like up (or left if your along is pointing close to up). cross product this up vector by your along for the second row direction. and finally put your source as the last row with 1 in the last column. Done fully formed affine matrix describing the cylinders prition. Much easier to understand since you can draw the vectors.
There are shorter ways but this one is easy to understand.
I'm working on a problem where I have a large set (>4 million) of data points located in a three-dimensional space, each with a scalar function value. This is represented by four arrays: XD, YD, ZD, and FD. The tuple (XD[i], YD[i], ZD[i]) refers to the location of data point i, which has a value of FD[i].
I'd like to superimpose a rectilinear grid of, say, 100x100x100 points in the same space as my data. This grid is set up as follows.
[XGrid, YGrid, ZGrid] = np.mgrid[Xmin:Xmax:Xstep, Ymin:Ymax:Ystep, Zmin:Zmax:Zstep]
XG = XGrid[:,0,0]
YG = YGrid[0,:,0]
ZG = ZGrid[0,0,:]
XGrid is a 3D array of the x-value at each point in the grid. XG is a 1D array of the x-values going from Xmin to Xmax, separated by a distance of XStep.
I'd like to use an interpolation algorithm I have to find the value of the function at each grid point based on the data surrounding it. In this algorithm I require 20 data points closest (or at least close) to my grid point of interest. That is, for grid point (XG[i], YG[j], ZG[k]) I want to find the 20 closest data points.
The only way I can think of is to have one for loop that goes through each data point and a subsequent embedded for loop going through all (so many!) data points, calculating the Euclidean distance, and picking out the 20 closest ones.
for i in range(0,XG.shape):
for j in range(0,YG.shape):
for k in range(0,ZG.shape):
Distance = np.zeros([XD.shape])
for a in range(0,XD.shape):
Distance[a] = (XD[a] - XG[i])**2 + (YD[a] - YG[j])**2 + (ZD[a] - ZG[k])**2
B = np.zeros([20], int)
for a in range(0,20):
indx = np.argmin(Distance)
B[a] = indx
Distance[indx] = float(inf)
This would give me an array, B, of the indices of the data points closest to the grid point. I feel like this would take too long to go through each data point at each grid point.
I'm looking for any suggestions, such as how I might be able to organize the data points before calculating distances, which could cut down on computation time.
Have a look at a seemingly simmilar but 2D problem and see if you cannot improve with ideas from there.
From the top of my head, I'm thinking that you can sort the points according to their coordinates (three separate arrays). When you need the closest points to the [X, Y, Z] grid point you'll quickly locate points in those three arrays and start from there.
Also, you don't really need the euclidian distance, since you are only interested in relative distance, which can also be described as:
abs(deltaX) + abs(deltaY) + abs(deltaZ)
And save on the expensive power and square roots...
No need to iterate over your data points for each grid location: Your grid locations are inherently ordered, so just iterate over your data points once, and assign each data point to the eight grid locations that surround it. When you're done, some grid locations may have too few data points. Check the data points of adjacent grid locations. If you have plenty of data points to go around (it depends on how your data is distributed), you can already select the 20 closest neighbors during the initial pass.
Addendum: You may want to reconsider other parts of your algorithm as well. Your algorithm is a kind of piecewise-linear interpolation, and there are plenty of relatively simple improvements. Instead of dividing your space into evenly spaced cubes, consider allocating a number of center points and dynamically repositioning them until the average distance of data points from the nearest center point is minimized, like this:
Allocate each data point to its closest center point.
Reposition each center point to the coordinates that would minimize the average distance from "its" points (to the "centroid" of the data subset).
Some data points now have a different closest center point. Repeat steps 1. and 2. until you converge (or near enough).