I have some questions regarding the Perlin noise and the pv.sample_function in general.
How would you go about applying Perlin noise to a sphere? I would like to have a slightly deformed sphere.
Can you apply Perlin noise to a mesh (sphere/plane) multiple times? I would like to have a plane with some rough 'waves' and high detailed noise on top of them (thus having big waves with little waves in them).
What exactly does the third parameter in the frequency do? After playing around with some values I haven't noticed how it affected the noise.
These are the two different frequencies/Perlin noises that I would like to apply to one plane. Additionally, it shows the plane they respectively create.
import pyvista as pv
def smooth_and_plot(sampled : pv.core.grid.UniformGrid):
    mesh = sampled.warp_by_scalar('scalars')
    mesh = mesh.extract_surface()
    # clean and smooth a little to reduce perlin noise artifacts
    mesh = mesh.smooth(n_iter=100, inplace=True, relaxation_factor=0.07)
    mesh.plot()
def gravel_plane():
    freq = [180, 180, 50]
    noise = pv.perlin_noise(0.2, freq, (0, 0, 0))
    sampled = pv.sample_function(noise,
                                 bounds=(-10, 2, -10, 10, -10, 10),
                                 dim=(500, 500, 1))
    smooth_and_plot(sampled)
def bumpy_plane():
    freq = [0.5, 0.7, 0]
    noise = pv.perlin_noise(0.5, freq, (-10, -10, -10))
    sampled = pv.sample_function(noise,
                                 bounds=(-10, 2, -10, 10, -10, 10),
                                 dim=(500, 500, 1))
    smooth_and_plot(sampled)
Let me answer your questions in reverse order for didactical reasons.
What exactly does the third parameter in the frequency do? After playing around with some values I haven't noticed how it affected the noise.
You didn't see an effect because you were looking at 2d samples, and changing the behaviour along the third axis. The three frequencies specify the granularity of the noise along the x, y and z axes, respectively. In other words, the generated implicit function is a scalar function of three variables. It's just that your sampling reduces the dimensionality to 2.
Frequency might be a surprising quantity when it comes to spatial quantities, but it works the same way as for time. High temporal frequency means short oscillation period, low temporal frequency means long oscillation period. High spatial frequency means short wavelength, low spatial frequency means long wavelength. To be specific, wavelength and frequency are inversely proportional.
So you'll see the effect of the third frequency when you start slicing along the z axis:
import pyvista as pv
freq = [0.5, 0.5, 2]
noise = pv.perlin_noise(0.5, freq, (0, 0, 0))
noise_cube = pv.sample_function(noise,
                                bounds=(-10, 10, -10, 10, -10, 10),
                                dim=(200, 200, 200))
noise_cube.slice_orthogonal(-9, -9, -9).plot()
As you can see, the blobs in the xy plane are circular, because the two in-plane frequencies are equal. But in both vertical planes the blobs are elongated: they are flatter in the z direction. This is because the frequency along the z axis is four times larger, leading to a wavelength that is four times smaller. This will lead to random blobs having a roughly 4:1 aspect ratio.
Can you apply Perlin noise to a mesh (sphere/plane) multiple times? I would like to have a plane with some rough 'waves' and high detailed noise on top of them (thus having big waves with little waves in them).
All that happens in your snippets is that a function is sampled on a pre-defined rectangular grid, and the resulting values are stored as scalars on the grid. If you want to superimpose two functions, all you have to do is sum up the scalars from two such function calls. This will be somewhat wasteful, as you are generating the same grid twice (and discarding one of the copies), but this is the least exhausting solution from a development point of view:
def bumpy_gravel_plane():
    bounds = (-10, 2, -10, 10, -10, 10)
    dim = (500, 500, 1)
    freq = [180, 180, 50]
    noise = pv.perlin_noise(0.2, freq, (0, 0, 0))
    sampled_gravel = pv.sample_function(noise, bounds=bounds, dim=dim)
    freq = [0.5, 0.7, 0]
    noise = pv.perlin_noise(0.5, freq, (-10, -10, -10))
    sampled_bumps = pv.sample_function(noise, bounds=bounds, dim=dim)
    # reuse the first grid and add the second set of scalars onto it
    sampled = sampled_gravel
    sampled['scalars'] += sampled_bumps['scalars']
    smooth_and_plot(sampled)
How would you go about applying Perlin noise to a sphere? I would like to have a slightly deformed sphere.
The usual solution of generating a 2d texture and applying that to a sphere won't work here, because the noise is not periodic, so you can't easily close it like that. But if you think about it, the generated Perlin noise is a 3d function. You can just sample this 3d function directly on your sphere!
There's one small problem: I don't think you can do that with just pyvista. We'll have to get our hands slightly dirty, and by that I mean using a bare vtk method (namely EvaluateFunction() of the noise). Generate your sphere, and then query the noise function of your choice on its points. If you want the result to look symmetric, you'll have to set the same frequency along all three Cartesian axes:
def bumpy_sphere(R=10):
    freq = [0.5, 0.5, 0.5]
    noise = pv.perlin_noise(0.5, freq, (0, 0, 0))
    sampled = pv.Sphere(radius=R, phi_resolution=100, theta_resolution=100)
    # query the noise at each point manually
    sampled['scalars'] = [noise.EvaluateFunction(point) for point in sampled.points]
    smooth_and_plot(sampled)
In a 3D Plotly plot the camera center defaults to (0,0,0), where, as far as I understand, (0,0,0) refers to the centre of the 3D volume occupied by the plot, not the coordinate (0,0,0).
These values can be changed via layout.scene.camera.center as documented here and here. However, I can't work out what units are being used, nor can I find this information in the documentation.
E.g. if I change the camera center to (1,1,1), where is this in relation to my plot? From a bit of experimenting I have discovered that:
(1,1,1) puts the camera center outside the volume occupied by my plot, but I can't figure out how far outside,
(0.5, 0.5, 0.5) puts the camera center near, but not exactly on, one of the edges of the volume occupied by the plot; sometimes it is near a corner of the volume, sometimes it is along an edge.
Note: I'm not 100% sure that my answer relates to plotly-python, but it works that way in plotly-js so I suppose it should be the same.
By default the camera's center is set to (0, 0, 0), which is the visual center of your plot. So, assuming the following edge values on the axes:
x: [10, 110],
y: [0, 50],
z: [1, 11],
the center point will have coordinates (60, 25, 6) (e.g. for x: (10 + 110) / 2 == 60).
To calculate the camera coordinates corresponding to some point within your plot's axes, you can use the following formula (the example given is for the x axis, but it is valid for any axis):
multiplier = 0.5 * aspectratio.x
x = ((point.x - center.x) / halfLengthOfAxisX) * multiplier
So, in our example, if we wanted to center the camera on point (1, 2, 3), given aspect ratio 1, we would have:
multiplier = 0.5
halfLengthOfAxisX = 50 // Math.abs(center.x - Math.min(x))
x = ((1 - 60) / 50) * 0.5 // -0.59
You mentioned that (0.5, 0.5, 0.5) puts the camera near, but not exactly on, one of the edges. That's probably caused by not taking aspectratio into consideration. As far as I know, there is no way to retrieve it if it is calculated by Plotly (at least using Plotly.js; it could work differently in Python), so you may need to set it manually.
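As a rough illustration, the formula above can be wrapped in a small Python helper. The function name `data_to_camera_center` and its arguments are made up for this sketch, and it assumes you know the axis ranges and the aspect ratio (which, as noted, may have to be set manually):

```python
def data_to_camera_center(point, axis_ranges, aspectratio=(1.0, 1.0, 1.0)):
    """Map a point in data coordinates to camera-center coordinates.

    Applies, per axis, the formula quoted above:
        ((p - center) / half_length) * 0.5 * aspect
    """
    camera = []
    for p, (lo, hi), aspect in zip(point, axis_ranges, aspectratio):
        center = (lo + hi) / 2.0        # visual center of this axis
        half_length = (hi - lo) / 2.0   # distance from center to either edge
        camera.append((p - center) / half_length * 0.5 * aspect)
    return tuple(camera)


# Axis ranges from the worked example: x in [10, 110], y in [0, 50], z in [1, 11]
ranges = [(10, 110), (0, 50), (1, 11)]
cam = data_to_camera_center((1, 2, 3), ranges)
print(cam)  # the x component reproduces the -0.59 computed by hand above
```

The resulting three numbers would then be the values to try for layout.scene.camera.center.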
I have a huge array of coordinates describing a 3D curve, ~20000 points. I am trying to use fewer points by ignoring some, say taking 1 of every 2 points. When I do this and plot the reduced set of points, the shape looks the same. However, I would like to compare the two curves properly, with something similar to the chi-squared test, to see how much the reduced plot differs from the original.
Is there an easy, built-in way of doing this, or does anyone have any ideas on how to approach the problem?
The general question of "line simplification" seems to be an entire field of research. I recommend having a look, for instance, at the Ramer–Douglas–Peucker algorithm. I could find several Python modules: rdp and simplification (which also implements the Visvalingam-Whyatt algorithm).
Anyway, I am trying something for evaluating the difference between two polylines, using interpolation. Any curves can be compared, even without common points.
The first idea is to compute the distance along the path for both polylines. They are used as landmarks to go from one given point on the first curve to a relatively close point on the other curve.
Then, the points of the first curve can be interpolated on the other curve. These two datasets can now be compared, point by point.
On the graph, the black curve is the interpolation of xy2 on the curve xy1. So the distances between the black squares and the orange circles can be computed, and averaged.
This gives an average distance measure, but nothing to compare against and decide if the applied reduction is good enough...
import numpy as np
from scipy.interpolate import interp1d
def normed_distance_along_path( polyline ):
    # cumulative arc length along the polyline, normalized to [0, 1]
    polyline = np.asarray(polyline)
    distance = np.cumsum( np.sqrt(np.sum( np.diff(polyline, axis=1)**2, axis=0 )) )
    return np.insert(distance, 0, 0)/distance[-1]
def average_distance_between_polylines(xy1, xy2):
    s1 = normed_distance_along_path(xy1)
    s2 = normed_distance_along_path(xy2)
    # evaluate curve 1 at the normalized path positions of curve 2
    interpol_xy1 = interp1d( s1, xy1 )
    xy1_on_2 = interpol_xy1(s2)
    node_to_node_distance = np.sqrt(np.sum( (xy1_on_2 - xy2)**2, axis=0 ))
    return node_to_node_distance.mean() # or use the max
# Two example polylines:
xy1 = [0, 1, 8, 2, 1.7], [1, 0, 6, 7, 1.9] # it should work in 3D too
xy2 = [.1, .6, 4, 8.3, 2.1, 2.2, 2], [.8, .1, 2, 6.4, 6.7, 4.4, 2.3]
average_distance_between_polylines(xy1, xy2) # 0.45004578069119189
If you subsample the original curve, a simple way to assess the approximation error is by computing the maximum distance between the original curve and the line segments between the resampled vertices. The maximum distance occurs at the original vertices and it suffices to evaluate at these points only.
By the way, this provides a simple way to perform the subsampling by setting a maximum tolerance and decimating until the tolerance is exceeded.
You can also think of computing the average distance, but this probably involves nasty integrals and might give less visually pleasing results.
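A minimal sketch of the maximum-distance check described above, assuming the curve is an (N, d) array of vertices and the subsampling keeps every `step`-th vertex plus the last one. The names `point_segment_distance` and `max_subsampling_error` are invented for this example. Each dropped vertex is measured against the segment that replaced it, which gives an upper bound on its distance to the whole simplified curve:

```python
import numpy as np


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (works in any dimension)."""
    p, a, b = (np.asarray(v, dtype=float) for v in (p, a, b))
    ab = b - a
    denom = np.dot(ab, ab)
    if denom == 0.0:  # degenerate segment: a and b coincide
        return float(np.linalg.norm(p - a))
    # parameter of the orthogonal projection, clamped onto the segment
    t = np.clip(np.dot(p - a, ab) / denom, 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * ab)))


def max_subsampling_error(points, step):
    """Largest distance from a dropped vertex to the segment that replaced it."""
    points = np.asarray(points, dtype=float)
    kept = list(range(0, len(points), step))
    if kept[-1] != len(points) - 1:
        kept.append(len(points) - 1)  # always keep the end point
    worst = 0.0
    for i0, i1 in zip(kept[:-1], kept[1:]):
        for j in range(i0 + 1, i1):  # vertices dropped between i0 and i1
            d = point_segment_distance(points[j], points[i0], points[i1])
            worst = max(worst, d)
    return worst


curve = [(0, 0, 0), (1, 1, 0), (2, 0, 0)]
print(max_subsampling_error(curve, 2))  # the middle vertex is 1.0 away from the chord
```

The result is in the same units as the coordinates, so it can be compared directly against a tolerance.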
I want to do unit testing of simulation models, and for that I run a simulation once and store the results (a time series) as a reference in a csv file (see an example here). Now when I change my model, I run the simulation again, store the new results as a csv file as well, and then compare the results.
The results are usually not 100% identical, an example plot is shown below:
The reference results are plotted in black and the new results are plotted in green.
The difference of the two is plotted in the second plot, in blue.
As can be seen, at a step the difference can become arbitrarily high, while everywhere else the difference is almost zero.
Therefore, I would prefer to use a different algorithm for the comparison than just subtracting the two, but I can only describe my idea graphically:
When plotting the reference line twice, first in a light color with a high line width and then again in a dark color and a small line width, then it will look like it has a pink tube around the centerline.
Note that during a step that tube will not only be in the direction of the ordinate axis, but also in the direction of the abscissa.
When doing my comparison, I want to know whether the green line stays within the pink tube.
Now comes my question: I do not want to compare the two time series using a graph, but using a python script. There must be something like this already, but I cannot find it because I am missing the right vocabulary, I believe. Any ideas? Is something like that in numpy, scipy, or similar? Or would I have to write the comparison myself?
Additional question: When the script says the two series are not sufficiently similar, I would like to plot it as described above (using matplotlib), but the line width has to be defined somehow in other units than what I usually use to define line width.
I would assume here that your problem can be simplified by assuming that your function has to be close to another function (e.g. the center of the tube) with the very same support points and then a certain number of discontinuities are allowed.
Then, I would implement a different discretization of the function compared to the typical one used for the L^2 norm (see for example some reference here).
Basically, in the continuous case, the L^2 norm relaxes the constraint that the two functions be close everywhere, and allows them to differ on a finite number of points, called singularities.
This works because there are an infinite number of points where to calculate the integral, and a finite number of points will not make a difference there.
However, since there are no continuous functions here, but only their discretization, the naive approach will not work, because any singularity will contribute potentially significantly to the final integral value.
Therefore, what you could do is to perform a point by point check whether the two functions are close (within some tolerance) and allow at most num_exceptions points to be off.
import numpy as np
def is_close_except(arr1, arr2, num_exceptions=0.01, **kwargs):
    # if float, calculate as percentage of number of points
    if isinstance(num_exceptions, float):
        num_exceptions = int(len(arr1) * num_exceptions)
    # number of points that are NOT close
    num = len(arr1) - np.sum(np.isclose(arr1, arr2, **kwargs))
    return num <= num_exceptions
By contrast the standard L^2 norm discretization would lead to something like this integrated (and normalized) metric:
import numpy as np
def is_close_l2(arr1, arr2, **kwargs):
    norm1 = np.sum(arr1 ** 2)
    norm2 = np.sum(arr2 ** 2)
    norm = np.sum((arr1 - arr2) ** 2)
    return np.isclose(2 * norm / (norm1 + norm2), 0.0, **kwargs)
This, however, will fail for arbitrarily large peaks, unless you set such a large tolerance that basically anything counts as "being close".
Note that kwargs is used if you want to pass additional tolerance constraints, or other options, to np.isclose().
As a test, you could run:
import numpy as np
import numpy.random
np.random.seed(0)
num = 1000
snr = 100
n_peaks = 5
x = np.linspace(-10, 10, num)
# generate ground truth
y = np.sin(x)
# distributed noise
y2 = y + np.random.random(num) / snr
# distributed noise + peaks
y3 = y + np.random.random(num) / snr
peak_positions = [np.random.randint(num) for _ in range(n_peaks)]
for i in peak_positions:
    y3[i] += np.random.random() * snr
# for distributed noise, both work with a 1/snr tolerance
is_close_l2(y, y2, atol=1/snr)
# output: True
is_close_except(y, y2, atol=1/snr)
# output: True
# for peak noise, since n_peaks < num_exceptions, this works
is_close_except(y, y3, atol=1/snr)
# output: True
# and if you allow 0 exceptions, then it fails, as expected
is_close_except(y, y3, num_exceptions=0, atol=1/snr)
# output: False
# for peak noise, this fails because the contribution from the peaks
# in the integral is much larger than the contribution from the rest
is_close_l2(y, y3, atol=1/snr)
# output: False
There are other approaches to this problem involving higher mathematics (e.g. Fourier or Wavelet transforms), but I would stick to the simplest.
EDIT (updated):
However, the working assumption may not hold, or you may not like it, for example because the two functions have different sampling or they are described by non-injective relations.
In that case, you can follow the center of the tube using (x, y) data, then calculate the Euclidean distance from the target (the tube center), and check that this distance is point-wise smaller than the maximum allowed (the tube size):
import numpy as np
# assume it is something with shape (N, 2) meaning (x, y)
target = ...
# assume it is something with shape (M, 2) meaning again (x, y)
trajectory = ...
# calculate the minimum distance between each point
# of the trajectory and the target
def is_close_trajectory(trajectory, target, max_dist):
    dist = np.zeros(trajectory.shape[0])
    for i in range(len(dist)):
        dist[i] = np.min(np.sqrt(
            (target[:, 0] - trajectory[i, 0]) ** 2 +
            (target[:, 1] - trajectory[i, 1]) ** 2))
    return np.all(dist < max_dist)
# same as above but faster and more memory-hungry
def is_close_trajectory2(trajectory, target, max_dist):
    # the pairwise distance matrix has shape (N, M): rows index target points,
    # columns index trajectory points, so the minimum is taken over axis 0
    dist = np.min(np.sqrt(
        (target[:, np.newaxis, 0] - trajectory[np.newaxis, :, 0]) ** 2 +
        (target[:, np.newaxis, 1] - trajectory[np.newaxis, :, 1]) ** 2),
        axis=0)
    return np.all(dist < max_dist)
The price of this flexibility is that this will be a significantly slower or memory-hungry function.
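To see why the distance-based check tolerates steps, here is a small self-contained experiment on a step function whose jump is slightly delayed. The helper name `distance_to_reference` and the test signal are assumptions made for this sketch, and it presumes that both axes have been scaled to comparable units, since the Euclidean distance mixes time and amplitude:

```python
import numpy as np


def distance_to_reference(trajectory, reference):
    """For each trajectory point, the distance to the nearest reference point."""
    diff = trajectory[:, np.newaxis, :] - reference[np.newaxis, :, :]  # (M, N, 2)
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)                # (M,)


t = np.linspace(0.0, 1.0, 201)
# reference: unit step shortly before t = 0.5; new result: same step, 4 samples later
reference = np.column_stack([t, (t > 0.4975).astype(float)])
delayed = np.column_stack([t, (t > 0.5175).astype(float)])

pointwise = np.abs(delayed[:, 1] - reference[:, 1])
tube = distance_to_reference(delayed, reference)

print(pointwise.max())  # 1.0: plain subtraction sees the full step height
print(tube.max())       # about 0.02: the tube only sees the small delay
```

With a tube size of, say, 0.05 the delayed signal would pass, whereas a point-wise amplitude tolerance of 0.05 would reject it.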
Assuming you have your list of results in the form we discussed in the comments already loaded:
from random import randint
import numpy
l1 = [(i,randint(0,99)) for i in range(10)]
l2 = [(i,randint(0,99)) for i in range(10)]
# I generate some random lists e.g:
# [(0, 46), (1, 33), (2, 85), (3, 63), (4, 63), (5, 76), (6, 85), (7, 83), (8, 25), (9, 72)]
# where the first element is the time and the second a value
print(l1)
# Then I just evaluate for each time step the difference between the values
differences = [abs(x[0][1]-x[1][1]) for x in zip(l1,l2)]
print(differences)
# And I can just print the maximum difference and its index:
print(max(differences))
print(differences.index(max(differences)))
With this data, if you define your "tube" to be, for example, 10 wide, you can just check whether the maximum value you find is greater than your threshold in order to decide whether the functions are similar enough.
You will have to remove outliers from your dataset first if you need to skip a random spike.
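The threshold check described above could look roughly like this. The function name `within_tube` is hypothetical, and the sketch assumes both lists hold (time, value) pairs sampled at the same time steps:

```python
def within_tube(series_a, series_b, tolerance):
    """Compare two lists of (time, value) pairs sampled at the same times.

    Returns (ok, worst_difference, index_of_worst_difference).
    """
    differences = [abs(a[1] - b[1]) for a, b in zip(series_a, series_b)]
    worst = max(differences)
    return worst <= tolerance, worst, differences.index(worst)


reference = [(0, 10), (1, 20), (2, 30)]
new_run = [(0, 12), (1, 35), (2, 29)]
print(within_tube(reference, new_run, 10))  # fails: the values at index 1 differ by 15
```

Returning the index of the worst point makes it easier to find where a failing run left the tube.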
You could also try the following:
from tslearn.metrics import dtw
# arr1 and arr2 are the two time series; scale the DTW distance by the array length
print(dtw(arr1, arr2) * 100 / len(arr1))
Bit late to the game, but I encountered the same conundrum recently and this seems to be the only question on the site discussing this particular problem.
A basic solution is to use time and amplitude tolerance values to create a 'bounding box' style envelope (similar to your pink tube) around the data.
I'm sure there are more elegant ways to do this, but a very crudely coded brute force example would be something like the following using pandas:
import pandas as pd
data = pd.DataFrame()
data['benchmark'] = [0.1, 0.2, 0.3] # or whatever you pull from your expected value data set
data['under_test'] = [0.2, 0.3, 0.1] # or whatever you pull from your simulation results data set
sample_rate = 20 # or whatever the data sample rate is
st = int(round(0.05 * sample_rate)) # shift tolerance adjusted to time series sample rate
# it has to be an integer so we can use the standard
# series shift functions and whatnot
at = 0.05 # amplitude tolerance
bounding = pd.DataFrame()
# if we didn't care about time shifts, the following two would be sufficient
# (i.e. if the data didn't have severe discontinuities between samples)
bounding['top'] = data['benchmark'] + at
bounding['bottom'] = data['benchmark'] - at
# if you want to be able to tolerate large discontinuities
# the bounds can be widened along the time axis to accommodate large jumps
bounding['bottomleft'] = data['benchmark'].shift(-st) - at
bounding['topleft'] = data['benchmark'].shift(-st) + at
bounding['topright'] = data['benchmark'].shift(st) + at
bounding['bottomright'] = data['benchmark'].shift(st) - at
# minimums and maximums give us a rough (but hopefully good enough) envelope
# these can be plotted as a parametric replacement of the 'pink tube' of line width
# (NaN values introduced by the shifts are skipped by min/max)
data['min'] = bounding.min(axis=1)
data['max'] = bounding.max(axis=1)
# see if the test data falls inside the envelope
data['pass/fail'] = data['under_test'].between(data['min'], data['max'])
# You now have a machine-readable column of booleans
# indicating which data points are outside the envelope
Given some list of numbers following some arbitrary distribution, how can I define bin positions for matplotlib.pyplot.hist() so that the area in each bin is equal to (or close to) some constant area, A? The area should be calculated by multiplying the number of items in the bin by the width of the bin and its value should be no greater than A.
Here is a MWE to display a histogram with normally distributed sample data:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(100)
plt.hist(x, bin_pos)
plt.show()
Here bin_pos is a list representing the positions of the boundaries of the bins (see the related question here).
I found this question intriguing. The solution depends on whether you want to plot a density function, or a true histogram. The latter case turns out to be quite a bit more challenging. Here is more info on the difference between a histogram and a density function.
Density Functions
This will do what you want for a density function:
def histedges_equalN(x, nbin):
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))
x = np.random.randn(1000)
n, bins, patches = plt.hist(x, histedges_equalN(x, 10), density=True)
Note the use of density=True (this keyword was called normed in older matplotlib versions and has since been removed), which specifies that we're calculating and plotting a density function. In this case the areas are identically equal (you can check by looking at n * np.diff(bins)). Also note that this solution involves finding bins that have the same number of points.
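As an aside, equal-count bin edges can also be obtained from quantiles. This is a swapped-in alternative rather than the interpolation used above, and the name `histedges_quantile` is made up for this sketch. It uses np.histogram instead of plt.hist so the areas can be checked numerically without opening a figure:

```python
import numpy as np


def histedges_quantile(x, nbin):
    """Bin edges at evenly spaced quantiles, so each bin holds about the same count."""
    return np.quantile(x, np.linspace(0.0, 1.0, nbin + 1))


rng = np.random.default_rng(0)
x = rng.normal(size=1000)
edges = histedges_quantile(x, 10)

counts, _ = np.histogram(x, bins=edges)
density, _ = np.histogram(x, bins=edges, density=True)
areas = density * np.diff(edges)  # area of each bar in the density plot

print(counts)  # roughly 100 points per bin
print(areas)   # each close to 1 / nbin
```

The edges can be passed to plt.hist in the same way as the output of histedges_equalN.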
Histograms
Here is a solution that gives approximately equal area boxes for a histogram:
def histedges_equalA(x, nbin):
    pow = 0.5
    dx = np.diff(np.sort(x))
    tmp = np.cumsum(dx ** pow)
    tmp = np.pad(tmp, (1, 0), 'constant')
    return np.interp(np.linspace(0, tmp.max(), nbin + 1),
                     tmp,
                     np.sort(x))
nbin = 10  # number of bins (same value as in the density example)
n, bins, patches = plt.hist(x, histedges_equalA(x, nbin), density=False)
These boxes, however, are not all equal area. The first and last, in particular, tend to be about 30% larger than the others. This is an artifact of the sparse distribution of the data at the tails of the normal distribution, and I believe it will persist any time there is a sparsely populated region in a data set.
Side note: I played with the value pow a bit, and found that a value of about 0.56 had a lower RMS error for the normal distribution. I stuck with the square-root because it performs best when the data is tightly-spaced (relative to the bin-width), and I'm pretty sure there is a theoretical basis for it that I haven't bothered to dig into (anyone?).
The issue with equal-area histograms
As far as I can tell it is not possible to obtain an exact solution to this problem. This is because it is sensitive to the discretization of the data. For example, suppose the first point in your dataset is an outlier at -13 and the next value is at -3, as depicted by the red dots in this image:
Now suppose the total "area" of your histogram is 150 and you want 10 bins. In that case the area of each histogram bar should be about 15, but you can't get there because as soon as your bar includes the second point, its area jumps from 10 to 20. That is, the data does not allow this bar to have an area between 10 and 20. One solution for this might be to adjust the lower-bound of the box to increase its area, but this starts to become arbitrary and does not work if this 'gap' is in the middle of the data set.
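The jump described above is easy to reproduce numerically. In this sketch the helper `bar_area` and the data values after the first two points are invented for illustration. The bar starts at the outlier, and its area is the count times the width as defined in the question:

```python
import numpy as np


def bar_area(data, left, right):
    """Histogram bar 'area': number of points in [left, right] times the bar width."""
    data = np.asarray(data, dtype=float)
    count = np.count_nonzero((data >= left) & (data <= right))
    return count * (right - left)


data = [-13.0, -3.0, -2.5, -2.0]  # outlier at -13, next value at -3

just_before = bar_area(data, -13.0, -3.0 - 1e-9)  # second point still excluded
just_after = bar_area(data, -13.0, -3.0)          # second point now included

print(just_before)  # about 10
print(just_after)   # 20
```

No choice of right edge gives this bar an area strictly between the two printed values, which is why a target of 15 cannot be reached.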
What I want:
To display the results of my simple classification algorithm (see below) as a colormap in python (the data is in 2D), where each class is assigned a color, and the confidence of a prediction anywhere on the 2D map is proportional to the saturation of the color associated with the class prediction. The image below sort of illustrates what I want for a binary (two class problem) in which the red parts might suggest strong confidence in class 1, whereas blue parts would speak for class 2. The intermediate colors would suggest uncertainty about either. Obviously I want the color scheme to generalize to multiple classes, so I would need many colors and the scale would then go from white (uncertainty) to very colorful color associated with a class.
illustration http://www.nicolacarlon.it/out.png
Some Sample Code:
My sample code just uses a simple kNN algorithm where the nearest k data points are allowed to 'vote' on the class of a new point on the map. The confidence of the prediction is simply given by relative frequency of the winning class, out of the k which voted. I haven't dealt with ties and I know there are better probabilistic versions of this method, but all I want is to visualize my data to show a viewer the chances of a class being in a particular part of the 2D plane.
import numpy as np
import matplotlib.pyplot as plt
# Generate some training data from three classes
n = 100 # Number of covariates (sample points) for each class in training set.
mean1, mean2, mean3 = [-1.5,0], [1.5, 0], [0,1.5]
cov1, cov2, cov3 = [[1,0],[0,1]], [[1,0],[0,1]], [[1,0],[0,1]]
X1 = np.asarray(np.random.multivariate_normal(mean1,cov1,n))
X2 = np.asarray(np.random.multivariate_normal(mean2,cov2,n))
X3 = np.asarray(np.random.multivariate_normal(mean3,cov3,n))
plt.plot(X1[:,0], X1[:,1], 'ro', X2[:,0], X2[:,1], 'bo', X3[:,0], X3[:,1], 'go' )
plt.axis('equal'); plt.show() #Display training data
# Prepare the data set as a 3n*3 array where each row is a data point and its associated class
D = np.zeros((3*n,3))
D[0:n,0:2] = X1; D[0:n,2] = 1
D[n:2*n,0:2] = X2; D[n:2*n,2] = 2
D[2*n:3*n,0:2] = X3; D[2*n:3*n,2] = 3
def kNN(x, D, k=3):
    x = np.asarray(x)
    dist = np.linalg.norm(x-D[:,0:2], axis=1)
    i = dist.argsort()[:k] #Return k indices of smallest to highest entries
    counts = np.bincount(D[i,2].astype(int))
    predicted_class = np.argmax(counts)
    confidence = float(np.max(counts))/k
    return predicted_class, confidence
print(kNN([-2,0], D, 20))
So, you can calculate two numbers for each point in the 2D plane
confidence (0 .. 1)
class (an integer)
One possibility is to calculate your own RGB map and show it with imshow. Like this:
import numpy as np
import matplotlib.pyplot as plt
# color vector with N x 3 colors, where N is the maximum number of classes and the colors are in RGB
mycolors = np.array([
[ 0, 0, 1],
[ 0, 1, 0],
[ 1, 0, 1],
[ 1, 1, 0],
[ 0, 1, 1],
[ 0, 0, 0],
[ 0, .5, 1]])
# negate the colors
mycolors = 1 - mycolors
# extents of the area
x0 = -2
x1 = 2
y0 = -2
y1 = 2
# grid over the area
X, Y = np.meshgrid(np.linspace(x0, x1, 1000), np.linspace(y0, y1, 1000))
# calculate the classification and probabilities
classes = classify_func(X, Y)
probabilities = prob_func(X, Y)
# create the basic color map by the class
img = mycolors[classes]
# fade the color by the probability (black for zero prob)
img *= probabilities[:,:,None]
# reverse the negative image back
img = 1 - img
# draw it
plt.imshow(img, extent=[x0,x1,y0,y1], origin='lower')
plt.axis('equal')
# save it
plt.savefig("mymap.png")
The trick of making negative colors is there just to make the maths a bit easier to understand. The code can of course be written much more densely.
I created two very simple functions to mimic the classification and probabilities:
def classify_func(X, Y):
    return np.round(abs(X+Y)).astype('int')
def prob_func(X,Y):
    return 1 - 2*abs(abs(X+Y)-classify_func(X,Y))
The former gives for the given area integer values from 0 to 4, and the latter gives smoothly changing probabilities.
The result:
If you do not like the way the colors fade towards zero probability, you can always introduce some non-linearity which is then applied when multiplying with the probabilities.
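One possible non-linearity is a simple power law. The function `fade_colors` and its `gamma` parameter are assumptions made for this sketch, not part of the code above. With gamma=1 it reproduces the linear fade, while gamma below 1 keeps colors stronger at low probabilities:

```python
import numpy as np


def fade_colors(class_colors, classes, probabilities, gamma=1.0):
    """Blend each pixel's class color towards white as the probability drops.

    class_colors : (N, 3) RGB colors in 0..1
    classes      : integer array of class indices, shape (H, W)
    probabilities: array of values in 0..1, shape (H, W)
    """
    class_colors = np.asarray(class_colors, dtype=float)
    negative = 1.0 - class_colors[classes]                 # negated colors, (H, W, 3)
    weight = np.clip(probabilities, 0.0, 1.0) ** gamma     # non-linear fade
    return 1.0 - negative * weight[..., None]


colors = [[0, 0, 1]]                      # a single blue class
classes = np.zeros((1, 3), dtype=int)
probabilities = np.array([[0.0, 0.25, 1.0]])
img = fade_colors(colors, classes, probabilities, gamma=0.5)
print(img[0])  # white, half-faded blue, full blue
```

The returned array has the same layout as img in the code above, so it can be passed to plt.imshow unchanged.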
Here the functions classify_func and prob_func are given two arrays as the arguments, first one being the X coordinates where the values are to be calculated, and second one Y coordinates. This works well, if the underlying calculations are fully vectorized. With the code in the question this is not the case, as it only calculates single values.
In that case the code changes slightly:
x = np.linspace(x0, x1, 1000)
y = np.linspace(y0, y1, 1000)
classes = np.empty((len(y), len(x)), dtype='int')
probabilities = np.empty((len(y), len(x)))
for yi, yv in enumerate(y):
    for xi, xv in enumerate(x):
        classes[yi, xi], probabilities[yi, xi] = kNN((xv, yv), D)
Also as your confidence estimates are not 0..1, they need to be scaled:
probabilities -= np.amin(probabilities)
probabilities /= np.amax(probabilities)
After this is done, your map should look like this with extents -4,-4..4,4 (as per the color map: green=1, magenta=2, yellow=3):
To vectorize or not to vectorize - that is the question
This question pops up from time to time. There is a lot of information about vectorization on the web, but as a quick search did not reveal any short summaries, I'll give some thoughts here. This is quite a subjective matter, so everything here just represents my humble opinion. Other people may have different opinions.
There are three factors to consider:
performance
legibility
memory use
Usually (but not always) vectorization makes code faster, more difficult to understand, and consume more memory. Memory use is not usually a big problem, but with large arrays it is something to think of (hundreds of megs are usually ok, gigabytes are troublesome).
Trivial cases aside (element-wise simple operations, simple matrix operations), my approach is:
write the code without vectorizations and check it works
profile the code
vectorize the inner loops if needed and possible (1D vectorization)
create a 2D vectorization if it is simple
For example, a pixel-by-pixel image processing operation may lead to a situation where I end up with one-dimensional vectorizations (for each row). Then the inner loop (for each pixel) is fast, and the outer loop (for each row) does not really matter. The code may look much simpler if it does not try to be usable with all possible input dimensions.
I am such a lousy algorithmist that in more complex cases I like to verify my vectorized code against the non-vectorized versions. Hence I almost invariably first create the non-vectorized code before optimizing it at all.
Sometimes vectorization does not offer any performance benefit. For example, the handy function numpy.vectorize can be used to vectorize practically any function, but its documentation states:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
(This function could have been used in the code above, as well. I chose the loop version for legibility for people not very familiar with numpy.)
Vectorization gives more performance only if the underlying vectorized functions are faster. They sometimes are, sometimes aren't. Only profiling and experience will tell. Also, it is not always necessary to vectorize everything. You may have an image processing algorithm which has both vectorized and pixel-by-pixel operations. There numpy.vectorize is very useful.
I would try to vectorize the kNN search algorithm above at least to one dimension. There is no conditional code (it wouldn't be a show-stopper, but it would complicate things), and the algorithm is rather straightforward. The memory consumption will go up, but with one-dimensional vectorization it does not matter.
And it may happen that along the way you notice that an n-dimensional generalization is not much more complicated. Then do that if memory allows.
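As a rough sketch of what a vectorization over the query points might look like, the function below classifies a whole batch of points in one call. The name `knn_batch` is hypothetical. It assumes the same layout for D as in the question (x, y, class label per row) and builds a full distance matrix, so memory grows with the number of queries times the number of training points:

```python
import numpy as np


def knn_batch(queries, D, k=3):
    """Classify many query points at once with a k-nearest-neighbour vote.

    queries: (Q, 2) coordinates; D: (N, 3) rows of x, y, integer class label.
    Returns (predicted_class, confidence), each of shape (Q,).
    """
    queries = np.atleast_2d(np.asarray(queries, dtype=float))
    coords = D[:, :2]
    labels = D[:, 2].astype(int)
    # (Q, N) matrix of distances between every query and every training point
    dist = np.linalg.norm(queries[:, None, :] - coords[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest, (Q, k)
    votes = labels[nearest]                     # their class labels, (Q, k)
    n_classes = labels.max() + 1
    # count the votes per class for each query, (Q, n_classes)
    counts = (votes[:, :, None] == np.arange(n_classes)).sum(axis=1)
    return counts.argmax(axis=1), counts.max(axis=1) / k


D = np.array([[0.0, 0.0, 1], [0.1, 0.0, 1], [0.0, 0.1, 1],
              [5.0, 5.0, 2], [5.1, 5.0, 2], [5.0, 5.1, 2]])
predicted, confidence = knn_batch([[0, 0], [5, 5]], D, k=3)
print(predicted, confidence)
```

For a 1000 x 1000 grid it would make sense to call this once per row, which is the one-dimensional vectorization suggested above, and to verify a few pixels against the loop version.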