I have data from distinct curves, and want to fit each of them individually. However, the data is mixed into a single array, so first I believe I need a way to separate the data.
I know that each of the individual curves belongs to the family A/x+B. As of now I cut out each of the curves by hand and curve fit, but I would like to automate this process and have the computer separate these curves and fit them. I attempted to use machine learning, but didn't know where to start or what packages to use. I am using Python, but can also use C++; in fact I hope to transfer it to C++ by the end. Where do you think I should start? Is it worth using unsupervised machine learning, or is there a better way to separate the data?
The expected curves:
An example of the data
Well, you sure do have an interesting problem.
I see that there are curves with Y-axis values that are considerably larger than the rest of them. I would simply take the first N-values with the largest Y-axis values and then fit them to an exponential decay curve (or that other curve you mention). You can then simply take the points that most fit that curve and then leave the other points alone.
Except...
This is a terrible way to extrapolate data. Doing this, you are cherry-picking the data you want. This is falsifying information and is very bad.
Your best bet is to create a single curve that all points fit to, if you cannot isolate those points into separate curves with external information.
But...
We do know some information: a valid function must have only 1 output given a single input.
If the X-axis is discrete, this means you can create a lookup table of outputs given the input. This allows you to count how many curves there are associated with the specific X-value (which could be a time unit). In other words, you have to have external information to separate points locally. You can then reorder the points in increasing Y-value, and now you have your separate curves defined in discrete points.
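A minimal sketch of that lookup-table idea, assuming every curve is sampled at every X value and the curves never cross. Both are assumptions, and the name `split_by_rank` and the synthetic (A, B) pairs are made up for illustration:

```python
import numpy as np
from collections import defaultdict

def split_by_rank(x, y):
    """Group points by x, sort each group's y values, and assign the
    k-th smallest y at every x to curve k."""
    table = defaultdict(list)              # lookup table: x -> list of y values
    for xi, yi in zip(x, y):
        table[xi].append(yi)
    n_curves = max(len(v) for v in table.values())
    curves = [([], []) for _ in range(n_curves)]
    for xi in sorted(table):
        for k, yi in enumerate(sorted(table[xi])):
            curves[k][0].append(xi)
            curves[k][1].append(yi)
    return [(np.array(cx), np.array(cy)) for cx, cy in curves]

# Synthetic test: three A/x + B curves on shared x values, shuffled together
xs = np.arange(1.0, 11.0)
params = [(10.0, 1.0), (50.0, 5.0), (200.0, 20.0)]   # hypothetical (A, B) pairs
x_all = np.tile(xs, len(params))
y_all = np.concatenate([a / xs + b for a, b in params])
order = np.random.default_rng(0).permutation(len(x_all))
separated = split_by_rank(x_all[order], y_all[order])
```

Each entry of `separated` can then be passed to a curve fit on its own. If the curves cross or some X values are missing for some curves, the rank at a given X no longer identifies the curve and this breaks down.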
Basically, this is an unsolvable problem in the general sense, but in your specific application, there might be extra rules that further define the domain and range such that you can do data filtering.
One more thing...
I am making these statements with the assumption that the (X,Y) values are floats that cannot maintain accuracy after some mathematical operations.
If you are using things like unum numbers, you might be able to keep enough information in the decimal such that your fitting functions can differentiate between points without extra filtering.
This case is more of a hope than anything, as adopting a new number representation to get more accuracy to isolate sampled points is a stretch at best.
Just for completeness, there are some mathematical libraries that might help you.
Boost.uBLAS
Eigen
LAPACK++
Hopefully, I have given you enough information to allow you to solve your problem.
I extracted data from the plot for analysis. Here is example code that loads, separates, fits and plots the three data sets. It works when the separate data files are appended into a single text file.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
##########################################################
# data load and separation section
datafilename = 'temp.dat'
textdata = open(datafilename, 'rt').read()
xLists = [[], [], []]
yLists = [[], [], []]
previousY = 0.0 # initialize
whichList = -1 # initialize
datalines = textdata.split('\n')
for line in datalines:
    if not line: # allow for blank lines in data file
        continue
    spl = line.split()
    x = float(spl[0])
    y = float(spl[1])
    if y > previousY + 50.0: # this separator must be greater than max noise
        whichList += 1
    previousY = y
    xLists[whichList].append(x)
    yLists[whichList].append(y)
##########################################################
# curve fitting section
def func(x, a, b):
    return a / x + b
parameterLists = []
for curveIndex in range(len(xLists)):
    # these are the same as the scipy defaults
    initialParameters = numpy.array([1.0, 1.0])
    xData = numpy.array(xLists[curveIndex], dtype=float)
    yData = numpy.array(yLists[curveIndex], dtype=float)
    # curve fit the test data
    fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
    parameterLists.append(fittedParameters)
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    for curveIndex in range(len(xLists)):
        # first the raw data as a scatter plot
        axes.plot(xLists[curveIndex], yLists[curveIndex], 'D')
        # create data for each fitted equation plot
        xModel = numpy.linspace(min(xLists[curveIndex]), max(xLists[curveIndex]))
        yModel = func(xModel, *parameterLists[curveIndex])
        # now the model as a line plot
        axes.plot(xModel, yModel)
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label
    plt.show()
    plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
The idea:
Create N naive, easy-to-calculate approximations that are sufficiently precise for clustering. Then "classify" each data point to the closest such approximation.
This is done like this:
The approximations are analytical approximations using these two equations I derived:
where (x1,y1) and (x2,y2) are coordinates of two points on the curve.
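Reading the formulas back from the code below, the approximation appears to assume the model y = A/(x+B) (a shifted form, not exactly the A/x+B of the question). Requiring the curve to pass through both points then gives:

```latex
\begin{aligned}
y_1 (x_1 + B) &= A, \qquad y_2 (x_2 + B) = A \\
A &= \frac{y_1 y_2 \,(x_1 - x_2)}{y_2 - y_1} \\
B &= \frac{A}{y_1} - x_1
\end{aligned}
```

Subtracting the two conditions in the first line eliminates A and yields B; substituting back gives the expression for A.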
To get these two points I assumed that (1) the first points (ordered along the x-axis) are distributed equally between the different real curves, and (2) the first 2 points of each real curve are all smaller or all bigger than the first 2 points of every other real curve. Thus sorting them and dividing them into N groups will successfully cluster the first 2N points. If these assumptions are false, you can still manually classify the first 2 points of each real curve and the rest will be classified automatically (this is actually the first approach I implemented).
Then cluster the rest of the points to each point's closest approximation, closest meaning the one with the smallest error.
Edit: A stronger approach for the initial approximation could be by calculating A and B for a couple of pairs of points and using their mean A and B as the approximation. And maybe even possibly doing K-means on these points/approximations.
The Code:
import numpy as np
import matplotlib.pyplot as plt
# You should probably edit this variable
NUM_OF_CURVES = 4
# <data> should be a 1-D array containing the Y values of the series
# <x_of_data> should be a 1-D array containing the corresponding X values of the series
data, x_of_data = np.loadtxt('...')
# clustering of first 2*num_of_curves points
# I started at NUM_OF_CURVES instead of 0 because my xs started at 0.
# The range (0:NUM_OF_CURVES*2) will probably be better for you.
raw_data = data[NUM_OF_CURVES:NUM_OF_CURVES*3]
raw_xs = x_of_data[NUM_OF_CURVES:NUM_OF_CURVES*3]
sort_ind = np.argsort(raw_data)
Y = raw_data[sort_ind].reshape(NUM_OF_CURVES,-1).T
X = raw_xs[sort_ind].reshape(NUM_OF_CURVES,-1).T
# approximation of A and B for each curve
A = ((Y[0]*Y[1])*(X[0]-X[1]))/(Y[1]-Y[0])
B = (A / Y[0]) - X[0]
# creating approximating curves
f = []
for i in range(NUM_OF_CURVES):
    f.append(A[i]/(x_of_data+B[i]))
curves = np.vstack(f)
# clustering the points to the approximating curves
raw_clusters = [[] for _ in range(NUM_OF_CURVES)]
for i in range(len(data)):
    raw_clusters[np.abs(curves[:,i]-data[i]).argmin()].append((x_of_data[i],data[i]))
# changing the clusters to np.arrays of the shape (2,-1)
# where row 0 contains the X coordinates and row 1 the Y coordinates
clusters = []
for i in range(len(raw_clusters)):
    clusters.append(np.array(list(zip(*raw_clusters[i]))))
Example:
raw series:
separated series:
So, I have three numpy arrays which store latitude, longitude, and some property value on a grid -- that is, I have LAT(y,x), LON(y,x), and, say temperature T(y,x), for some limits of x and y. The grid isn't necessarily regular -- in fact, it's tripolar.
I then want to interpolate these property (temperature) values onto a bunch of different lat/lon points (stored as lat1(t), lon1(t), for about 10,000 t...) which do not fall on the actual grid points. I've tried matplotlib.mlab.griddata, but that takes far too long (it's not really designed for what I'm doing, after all). I've also tried scipy.interpolate.interp2d, but I get a MemoryError (my grids are about 400x400).
Is there any sort of slick, preferably fast way of doing this? I can't help but think the answer is something obvious... Thanks!!
Try the combination of inverse-distance weighting and scipy.spatial.KDTree described in SO inverse-distance-weighted-idw-interpolation-with-python. Kd-trees work nicely in 2d 3d ..., inverse-distance weighting is smooth and local, and the k= number of nearest neighbours can be varied to tradeoff speed / accuracy.
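A rough sketch of that combination, not the code from the linked answer. The function name `idw_interpolate` and the synthetic points are made up, and it treats the coordinates as plain Euclidean, which for lon/lat is only an approximation (near the poles or the date line you would probably convert to 3-D Cartesian first):

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_interpolate(src_xy, src_values, query_xy, k=8, power=2.0, eps=1e-12):
    """Inverse-distance-weighted average of the k nearest source points (k >= 2)."""
    tree = cKDTree(src_xy)
    dist, idx = tree.query(query_xy, k=k)           # both have shape (n_query, k)
    weights = 1.0 / np.maximum(dist, eps) ** power  # eps avoids division by zero
    weights /= weights.sum(axis=1, keepdims=True)
    return np.sum(weights * src_values[idx], axis=1)

# Synthetic stand-in for flattened grid points, e.g.
# np.column_stack([LON.ravel(), LAT.ravel()]) and T.ravel()
rng = np.random.default_rng(0)
src = rng.uniform(0.0, 1.0, size=(400, 2))
vals = np.sin(src[:, 0]) + src[:, 1]
query = rng.uniform(0.1, 0.9, size=(1000, 2))
out = idw_interpolate(src, vals, query)
```

The tree is built once and each query only touches k neighbours, so 10,000 target points against a 400x400 grid should stay cheap in both time and memory.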
There is a nice inverse distance example by Roger Veciana i Rovira along with some code using GDAL to write to geotiff if you're into that.
This is of course to a regular grid, but it should still work provided you first project the data to a pixel grid with pyproj or something similar, all the while being careful about which projection is used for your data.
A copy of his algorithm and example script:
from math import pow
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
def pointValue(x,y,power,smoothing,xv,yv,values):
    nominator=0
    denominator=0
    for i in range(0,len(values)):
        dist = sqrt((x-xv[i])*(x-xv[i])+(y-yv[i])*(y-yv[i])+smoothing*smoothing)
        #If the point is really close to one of the data points, return the data point value to avoid singularities
        if(dist<0.0000000001):
            return values[i]
        nominator=nominator+(values[i]/pow(dist,power))
        denominator=denominator+(1/pow(dist,power))
    #Return NODATA if the denominator is zero
    if denominator > 0:
        value = nominator/denominator
    else:
        value = -9999
    return value
def invDist(xv,yv,values,xsize=100,ysize=100,power=2,smoothing=0):
    valuesGrid = np.zeros((ysize,xsize))
    for x in range(0,xsize):
        for y in range(0,ysize):
            valuesGrid[y][x] = pointValue(x,y,power,smoothing,xv,yv,values)
    return valuesGrid
if __name__ == "__main__":
    power=1
    smoothing=20
    #Creating some data, with each coordinate and the values stored in separated lists
    xv = [10,60,40,70,10,50,20,70,30,60]
    yv = [10,20,30,30,40,50,60,70,80,90]
    values = [1,2,2,3,4,6,7,7,8,10]
    #Creating the output grid (100x100, in the example)
    ti = np.linspace(0, 100, 100)
    XI, YI = np.meshgrid(ti, ti)
    #Creating the interpolation function and populating the output matrix value
    ZI = invDist(xv,yv,values,100,100,power,smoothing)
    # Plotting the result
    n = plt.Normalize(0.0, 100.0)
    plt.subplot(1, 1, 1)
    plt.pcolor(XI, YI, ZI)
    plt.scatter(xv, yv, 100, values)
    plt.title('Inv dist interpolation - power: ' + str(power) + ' smoothing: ' + str(smoothing))
    plt.xlim(0, 100)
    plt.ylim(0, 100)
    plt.colorbar()
    plt.show()
There's a bunch of options here, which one is best will depend on your data...
However I don't know of an out-of-the-box solution for you
You say your input data is from tripolar data. There are three main cases for how this data could be structured.
(1) Sampled from a 3d grid in tripolar space, projected back to 2d LAT, LON data.
(2) Sampled from a 2d grid in tripolar space, projected into 2d LAT LON data.
(3) Unstructured data in tripolar space projected into 2d LAT LON data.
The easiest of these is 2. Instead of interpolating in LAT LON space, "just" transform your point back into the source space and interpolate there.
Another option that works for 1 and 2 is to search for the cells that map from tripolar space to cover your sample point. (You can use a BSP or grid-type structure to speed up this search.) Pick one of the cells, and interpolate inside it.
Finally there's a heap of unstructured interpolation options .. but they tend to be slow.
A personal favourite of mine is to use a linear interpolation of the nearest N points; finding those N points can again be done with gridding or a BSP. Another good option is to Delaunay triangulate the unstructured points and interpolate on the resulting triangular mesh.
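For the Delaunay option, one ready-made route (my assumption about what fits here, not something tested on tripolar data) is scipy.interpolate.LinearNDInterpolator, which triangulates the scattered points and interpolates linearly inside each triangle. A small sketch on synthetic points:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 1.0, size=(500, 2))     # stand-in for flattened (lon, lat) pairs
temps = 2.0 * pts[:, 0] + 3.0 * pts[:, 1]      # synthetic linear field, easy to check

interp = LinearNDInterpolator(pts, temps)      # builds the Delaunay triangulation once
targets = rng.uniform(0.3, 0.7, size=(100, 2)) # points to interpolate onto
t_interp = interp(targets)                     # NaN for targets outside the convex hull
```

The triangulation is the expensive part, so build the interpolator once and reuse it for all target points.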
Personally if my mesh was case 1, I'd use an unstructured strategy as I'd be worried about having to handle searching through cells with overlapping projections. Choosing the "right" cell would be difficult.
I suggest taking a look at the interpolation features of GRASS (an open source GIS package) (http://grass.ibiblio.org/gdp/html_grass62/v.surf.bspline.html). It's not in Python, but you can reimplement it or interface with the C code.
Am I right in thinking your data grids look something like this (red is the old data, blue is the new interpolated data)?
alt text http://www.geekops.co.uk/photos/0000-00-02%20%28Forum%20images%29/DataSeparation.png
This might be a slightly brute-force-ish approach, but what about rendering your existing data as a bitmap (opengl will do simple interpolation of colours for you with the right options configured and you could render the data as triangles which should be fairly fast). You could then sample pixels at the locations of the new points.
Alternatively, you could sort your first set of points spatially and then find the closest old points surrounding your new point and interpolate based on the distances to those points.
There is a FORTRAN library called BIVAR, which is very suitable for this problem. With a few modifications you can make it usable in python using f2py.
From the description:
BIVAR is a FORTRAN90 library which interpolates scattered bivariate data, by Hiroshi Akima.
BIVAR accepts a set of (X,Y) data points scattered in 2D, with associated Z data values, and is able to construct a smooth interpolation function Z(X,Y), which agrees with the given data, and can be evaluated at other points in the plane.
I'm trying to draw the best fitting line for given (x,y) data points.
The image here shows the data points (red pixels) and the estimated line (green), which I obtained using the following library.
import numpy as np
m, c = np.linalg.lstsq(A, y)[0]
Documentation for used library module
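For context, a minimal sketch of how the design matrix A for such a call is usually built for a line y = m*x + c. The sample values are made up for illustration:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([-1.0, 0.2, 0.9, 2.1])

# Each row of A is [x_i, 1], so that A @ [m, c] approximates y
A = np.vstack([x, np.ones(len(x))]).T
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
```

This minimizes the sum of squared vertical distances between the points and the line, which is the same criterion np.polyfit(x, y, 1) uses.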
We can see the data points are roughly symmetrically distributed. Why does this line not have a gradient similar to the long symmetric axis through the data points? Can you please explain whether this result is correct, and if so, how it gives the minimum error? (The line is drawn correctly using the gradient returned by the lstsq method.) Thank you.
EDIT
Here is the code I'm trying. Input image can be downloaded from here. In this code I've not forced the line to pass through the center of the pixel distribution. (Note: here I've used polyfit instead of lstsq. Both give the same results.)
import numpy as np
import cv2
import math
img = cv2.imread('points.jpg', 1)
h, w = img.shape[:2]
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
points = np.argwhere(gray>10) # get (row, column) = (y, x) pairs where red pixels exist
y = points[:,0]
x = points[:,1]
m, c = np.polyfit(x, y, 1) # calculate least square fit line
# calculate two coordinates (x1,y1),(x2,y2) on the line
angle = np.arctan(m)
x1, y1, length = 0, int(c), 500
x2 = int(round(math.ceil(x1 + length * np.cos(angle)),0))
y2 = int(round(math.ceil(y1 + length * np.sin(angle)),0))
# draw line on the color image
cv2.line(img, (x1, y1), (x2, y2), (0,255,0), 1, cv2.LINE_8)
# show output the image
cv2.namedWindow("Display window", cv2.WINDOW_AUTOSIZE)
cv2.imshow("Display window", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
How can I have the line pass through the longest symmetric axis of the pixel distribution? Can I use principal component analysis?
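For what it's worth, a minimal sketch of the PCA idea on synthetic data. The helper name `principal_axis` and the blob parameters are made up. Ordinary least squares only minimizes vertical distances, so on an elongated blob its slope comes out flatter than the long axis, whereas the first principal component follows the direction of largest variance:

```python
import numpy as np

def principal_axis(x, y):
    """Return the centroid and the unit vector along the direction of largest variance."""
    pts = np.column_stack([x, y]).astype(float)
    center = pts.mean(axis=0)
    cov = np.cov((pts - center).T)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return center, eigvecs[:, -1]            # eigenvector of the largest eigenvalue

# Synthetic elongated blob whose long axis is tilted by 45 degrees
rng = np.random.default_rng(0)
t = rng.normal(scale=50.0, size=2000)        # spread along the long axis
s = rng.normal(scale=20.0, size=2000)        # spread across it
theta = np.deg2rad(45.0)
x = 200 + t * np.cos(theta) - s * np.sin(theta)
y = 200 + t * np.sin(theta) + s * np.cos(theta)

center, direction = principal_axis(x, y)
pca_slope = direction[1] / direction[0]      # undefined for an exactly vertical axis
ols_slope = np.polyfit(x, y, 1)[0]
```

The line through `center` along `direction` is the one to draw; comparing `pca_slope` with `ols_slope` shows how much the vertical-distance fit is flattened.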
It's hard to say why this would be the case. The bottom line is that I can't see the data you're using, and I can't see what the calculated slope and y intercept are for the data you're using.
Here are a couple of things that could explain what we're seeing:
(1) The density of data points is actually quite different than it appears to a casual glance and everything is working properly.
(2) You're sending the wrong arguments to the least squares function and you've got a GIGO situation. (I haven't used numpy's least squares algorithm, so I can't check this.)
(3) The scatter plot and the line plot don't agree on the scale of the axes.
(4) The least squares function in question is broken.
(5) You're not passing the same data to the least squares algorithm as you're passing to the plotting routine.
(6) The data formatting is funky so that the scatter plot and least squares routines are interpreting your data differently.
I can't know which of these is the problem, and unless it's (3), I expect we'd need more data to be able to distinguish between these possibilities.
Here's how I'd proceed if I were you: (1) Create a small artificial data set that sits on a line and pass it to the least squares function and see if it spits out the right numbers. See if these look right when plotted or not. (2) If this looks okay, record the output of the least squares algorithm, see if you can find another least squares program to calculate the slope and y intercept and compare them. If they're the same, it's probably not the routine, it's probably something to do with plotting.
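A small sketch of those two checks, with a made-up slope and intercept: points that lie exactly on a line go through np.linalg.lstsq and then through a second routine (np.polyfit) for comparison.

```python
import numpy as np

# Artificial data sitting exactly on y = 2x + 1 (hypothetical slope and intercept)
x_line = np.arange(10, dtype=float)
y_line = 2.0 * x_line + 1.0

# Check 1: the least squares routine should recover the known line
A = np.vstack([x_line, np.ones_like(x_line)]).T
m_lstsq, c_lstsq = np.linalg.lstsq(A, y_line, rcond=None)[0]

# Check 2: an independent routine should agree with it
m_poly, c_poly = np.polyfit(x_line, y_line, 1)
```

If both pairs match the known values, the fitting routine is fine and the plotting step is the next thing to inspect.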
If you get this far and it's still a mystery, let us know what you've found and maybe we can make another suggestion.
Good luck.
If the red dots truly represent your data, you are probably applying your linear regression function in a way that forces the line through the origin. How do I know? When using linear regression on two variables x and y, the line will pass through a few specific points: for example, the point given by the average of x and the average of y, and, depending on your specifications, a calculated or specified intercept on the y axis. If all variables of x and y are positive, you will have a line that looks like yours if the line is forced through the origin. Not much more can be said before you provide some reproducible data and code.
EDIT:
I didn't have much luck with the reproducible sample provided, so I built an example with random numbers to elaborate on my original answer. I think statsmodels is a decent library for linear regression analysis. First, I'll address this earlier comment:
If all variables of x and y are positive, you will have a line that looks like yours if the line is forced through the origin.
You'll see an increasing effect of this the larger your numbers are (the further away from the origin your numbers are). Using sm.OLS(y,sm.add_constant(x)).fit() and sm.OLS(y,x).fit() for two different sets of numbers will show you exactly what I mean. First, I'll run a regression on the dataset below without an estimated constant (the line goes through the origin). This will give us a plot that resembles your original plot:
# Libraries
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Data
np.random.seed(123)
x = np.random.normal(size=2500) + 100
y = x * 2 + np.random.normal(size=2500) + 100
# Regression
results1 = sm.OLS(y,x).fit()
regLine_origin = x*results1.params[0]
# Plot
fig, ax = plt.subplots()
ax.scatter(x, y, c='red', s=4)
ax.scatter(x, regLine_origin, c = 'green', s = 1)
ax.patch.set_facecolor('black')
plt.show()
Next, I'll include a constant in the regression. Now, the yellow line will represent what I think you were after in your question:
# Libraries
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Data
np.random.seed(123)
x = np.random.normal(size=2500) + 100
y = x * 2 + np.random.normal(size=2500) + 100
# Regression
results1 = sm.OLS(y,x).fit()
results2 = sm.OLS(y,sm.add_constant(x)).fit()
regLine_origin = x*results1.params[0]
regLine_constant = results2.params[0] + x*results2.params[1]
# Plot
fig, ax = plt.subplots()
ax.scatter(x, y, c='red', s=4)
ax.scatter(x, regLine_origin, c = 'green', s = 1)
ax.scatter(x, regLine_constant, c = 'yellow', s = 1)
ax.patch.set_facecolor('black')
plt.show()
And lastly, we can take a look at what happens when the numbers are closer to the origin, so to speak. Here, I'll remove the +100 part when the numbers are produced:
# The following is changed in the snippet above:
# Data
x = np.random.normal(size=2500)
y = x * 2 + np.random.normal(size=2500)
And that's why I think your original regression line is set to go through the origin. Have a look at the statsmodels package. Here you can study the details of the estimate by running print(results2.summary()):
And as you've already seen in the snippets above, you'll have direct access to the regression coefficients by using results2.params.
Edit2: My explanation still isn't 100% valid. The x and y values will have to differ a bit in size to see this effect. You'll certainly find situations where the line goes through the origin no matter the size of the numbers.
Have a look at the different x labels, and you'll see what I mean.
I am working on a project in which I am trying to model the movement of an object in a kymograph. In order to do so, I fit a curve to each line of pixels in an image, and append the location of the vertex to approximately model the location of the object in the image. Below is a sample image.
As you can see, early in the time series (at the top of the image) the position of the object is nicely focused and easily modeled with a Gaussian curve. However, closer to the end of the time series (at the bottom of the image), the peak is much more diffuse. I suspect that the data at the bottom of the image will be fit much more closely by a curve modeling a Poisson distribution (image below, right) while the data at the top/middle of the image will be fit much more closely by a Gaussian or polynomial curve (image below, left).
Is there any way to, for each line of pixels, fit more than one curve to the same data and then score each for a least-squares fit? This way, I could (hopefully) switch models midway through an image to accommodate changing behaviors of the object that I am trying to track. My current code is below:
from PIL import Image
def populateData(picture) :
    """Open an image and populate a list of lists with the grayscale value"""
    im = Image.open(picture)
    size = im.size
    width = size[0]
    height = size[1]
    allPixels = list(im.getdata())
    pixelList = [allPixels[width*i :
                 width * (i+1)] for i in range(height)]
    return(pixelList)
rawData = populateData("testTop.tif")
import numpy as np
from scipy.optimize import curve_fit
def findVertex(listOfRows) :
    xList = []
    for row in listOfRows :
        x = np.arange(len(row))
        ffunc = lambda x, a, x0, s: a*np.exp(-0.5*(x-x0)**2/s**2)
        p, _ = curve_fit(ffunc, x, row, p0=[100,5,2])
        x0 = p[1]
        xList.append(x0)
    xArray = np.array(xList)
    return(xArray)
xValues = findVertex(rawData)
def buildRows(listOfRows) :
    yArray = np.arange(len(listOfRows))
    return(yArray)
yValues = buildRows(rawData)
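from matplotlib import pyplot as plt
# scipy.ndimage.imread is no longer available in current SciPy;
# PIL's "F" mode gives the same float grayscale array as flatten=True did
image = np.asarray(Image.open("testTop.tif").convert("F"))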
fig = plt.figure()
axes = fig.add_subplot(111)
axes.imshow(image)
axes.plot(xValues, yValues, 'k-')
axes.set_title('testLine')
axes.grid()
axes.set_xlabel('x')
axes.set_ylabel('time')
plt.show()
EDIT:
This is the file I used as an input (testTop.tif)
You will need to work out some form of goodness of fit between the fit and your data. One option is the sum of the squared differences between your current fit (a Gaussian) and your data, divided by the variance.
sumerrsq = 0.
for i in range(yValues.shape[0]):
    sumerrsq += np.power(yValues[i] - xValues[i],2)
goodfit = np.sqrt(sumerrsq/var)
I think you can use the second output from curve_fit (the covariance) to get the variance,
p, pcov = curve_fit(ffunc, x, row, p0=[100,5,2])
var = np.diag(pcov)
You can then check the value of goodfit and if it is not sufficient, switch to a different distribution. In using a different distribution, you may need to use a different estimation of error (this assumes the errors are normally distributed).
Note, without the data (and not being sure what array was which) I couldn't test any of this code...
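As an untested-on-your-image sketch of the "fit several models and score each" idea for a single row: fit every candidate with curve_fit and keep the one with the smallest residual sum of squares. The Lorentzian is only a hypothetical stand-in for whatever broader model you want to try, and `best_model` is a made-up helper name:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, x0, s):
    return a * np.exp(-0.5 * (x - x0) ** 2 / s ** 2)

def lorentzian(x, a, x0, g):
    # hypothetical second candidate; swap in any other peak model
    return a * g ** 2 / ((x - x0) ** 2 + g ** 2)

def best_model(x, row, models, p0):
    """Fit each candidate to one row and return (name, params, rss) with the smallest RSS."""
    results = []
    for name, func in models.items():
        try:
            p, _ = curve_fit(func, x, row, p0=p0, maxfev=5000)
        except RuntimeError:          # raised when the fit does not converge
            continue
        rss = float(np.sum((row - func(x, *p)) ** 2))
        results.append((name, p, rss))
    return min(results, key=lambda r: r[2])

# Synthetic row: a noisy Gaussian peak centred at pixel 20
x = np.arange(50, dtype=float)
rng = np.random.default_rng(0)
row = gaussian(x, 100.0, 20.0, 3.0) + rng.normal(scale=1.0, size=x.size)
candidates = {"gaussian": gaussian, "lorentzian": lorentzian}
name, params, rss = best_model(x, row, candidates,
                               p0=[row.max(), float(np.argmax(row)), 2.0])
```

Comparing raw RSS is only fair here because both candidates have the same number of parameters; with differing parameter counts you would want a penalized score instead.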
According to the curve_fit docs:
To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
So if that's the value you're trying to compare, then you could take that second returned value from curve_fit (the one you are currently assigning to _), use it to calculate perr as above, and compare the error between multiple curves.
I would suggest you work with a 2D fit model. A 1d Gaussian distribution is the basis but the mean and variance depend on position and time. You then would fit the model against the 2d image data.
In case you want to stay with your approach, it looks like it's just the starting value for mean and variance which you need to tweak in order to get a better fit for the lines with large times.
To your question, you can model any score function you want, so you could do something like:
def score(x,y):
    if x < 10:
        return x**2 - y
    else:
        return x - y
So in order to work with two different models in different ranges, follow this example.
I am trying to interpolate 3D atmospheric data from one vertical coordinate to another using Numpy/Scipy. For example, I have cubes of temperature and relative humidity, both of which are on constant, regular pressure surfaces. I want to interpolate the relative humidity to constant temperature surface(s).
The exact problem I am trying to solve has been asked previously here, however, the solution there is very slow. In my case, I have approximately 3M points in my cube (30x321x321), and that method takes around 4 minutes to operate on one set of data.
That post is nearly 5 years old. Do newer versions of Numpy/Scipy perhaps have methods that handle this faster? Maybe new sets of eyes looking at the problem have a better approach? I'm open to suggestions.
EDIT:
Slow = 4 minutes for one set of data cubes. I'm not sure how else I can quantify it.
The code being used...
import numpy as np
from scipy import interpolate
def interpLevel(grid,value,data,interp='linear'):
    """
    Interpolate 3d data to a common z coordinate.
    Can be used to calculate the wind/pv/whatsoever values for a common
    potential temperature / pressure level.
    grid : numpy.ndarray
        The grid. For example the potential temperature values for the whole 3d
        grid.
    value : float
        The common value in the grid, to which the data shall be interpolated.
        For example, 350.0
    data : numpy.ndarray
        The data which shall be interpolated. For example, the PV values for
        the whole 3d grid.
    interp : str
        This indicates which kind of interpolation will be done. It is directly
        passed on to scipy.interpolate.interp1d().
    returns : numpy.ndarray
        A 2d array containing the *data* values at *value*.
    """
    ret = np.zeros_like(data[0,:,:])
    for yIdx in range(grid.shape[1]):
        for xIdx in range(grid.shape[2]):
            # check if we need to flip the column
            if grid[0,yIdx,xIdx] > grid[-1,yIdx,xIdx]:
                ind = -1
            else:
                ind = 1
            f = interpolate.interp1d(grid[::ind,yIdx,xIdx],
                                     data[::ind,yIdx,xIdx],
                                     kind=interp)
            ret[yIdx,xIdx] = f(value)
    return ret
EDIT 2:
I could share npy dumps of sample data, if anyone was interested enough to see what I am working with.
Since this is atmospheric data, I imagine that your grid does not have uniform spacing; however if your grid is rectilinear (such that each vertical column has the same set of z-coordinates) then you have some options.
For instance, if you only need linear interpolation (say for a simple visualization), you can just do something like:
# Find nearest grid point
idx = grid[:,0,0].searchsorted(value)
upper = grid[idx,0,0]
lower = grid[idx - 1, 0, 0]
s = (value - lower) / (upper - lower)
result = (1-s) * data[idx - 1, :, :] + s * data[idx, :, :]
(You'll need to add checks for value being out of range, of course.) For a grid your size, this will be extremely fast (as in tiny fractions of a second).
You can pretty easily modify the above to perform cubic interpolation if need be; the challenge is in picking the correct weights for non-uniform vertical spacing.
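A sketch of the linear version with the range check added, still assuming every column shares the same ascending vertical coordinates. The function name `interp_to_level` and the toy grid are made up:

```python
import numpy as np

def interp_to_level(grid, data, value):
    """Linearly interpolate data (nz, ny, nx) to `value`, assuming every column
    shares the ascending vertical coordinates grid[:, 0, 0]."""
    zs = grid[:, 0, 0]
    if not (zs[0] <= value <= zs[-1]):
        raise ValueError("value lies outside the vertical coordinate range")
    # clip so that value == zs[0] still uses the first layer pair
    idx = int(np.clip(zs.searchsorted(value), 1, len(zs) - 1))
    lower, upper = zs[idx - 1], zs[idx]
    s = (value - lower) / (upper - lower)
    return (1 - s) * data[idx - 1] + s * data[idx]

# Toy rectilinear grid with non-uniform vertical spacing
zs = np.array([0.0, 1.0, 3.0, 6.0])
grid = np.broadcast_to(zs[:, None, None], (4, 3, 5))
data = 2.0 * grid + 1.0              # linear in z, so the result is easy to check
plane = interp_to_level(grid, data, 2.0)
```

The whole horizontal plane is computed with two array slices, which is why this is so much faster than looping over columns.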
The problem with using scipy.ndimage.map_coordinates is that, although it provides higher order interpolation and can handle arbitrary sample points, it does assume that the input data be uniformly spaced. It will still produce smooth results, but it won't be a reliable approximation.
If your coordinate grid is not rectilinear, so that the z-value for a given index changes for different x and y indices, then the approach you are using now is probably the best you can get without a fair bit of analysis of your particular problem.
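That said, if linear interpolation is enough, one thing that may be worth trying is to vectorize the per-column search with NumPy instead of calling interp1d for every column. This is only a sketch under the assumption that the coordinate is monotonic along the vertical axis in each column; `interp_to_surface` and the synthetic fields are made-up names and data:

```python
import numpy as np

def interp_to_surface(coord, data, value):
    """For each column of (nz, ny, nx) arrays, linearly interpolate `data` to the
    level where `coord` equals `value`. Columns that never bracket `value` get NaN."""
    lo, hi = coord[:-1], coord[1:]
    # layer pairs that bracket the value, for increasing or decreasing columns
    brackets = ((lo <= value) & (value <= hi)) | ((hi <= value) & (value <= lo))
    found = brackets.any(axis=0)
    k = brackets.argmax(axis=0)                  # first bracketing layer per column
    c0 = np.take_along_axis(coord, k[None], axis=0)[0]
    c1 = np.take_along_axis(coord, (k + 1)[None], axis=0)[0]
    d0 = np.take_along_axis(data, k[None], axis=0)[0]
    d1 = np.take_along_axis(data, (k + 1)[None], axis=0)[0]
    denom = np.where(c1 == c0, 1.0, c1 - c0)     # guard against identical levels
    s = np.where(c1 == c0, 0.0, (value - c0) / denom)
    return np.where(found, d0 + s * (d1 - d0), np.nan)

# Synthetic cube: temperature falls with level and differs from column to column
nz, ny, nx = 10, 4, 5
rng = np.random.default_rng(0)
offset = rng.uniform(0.0, 4.0, size=(ny, nx))
z = np.arange(nz, dtype=float)[:, None, None]
temperature = 300.0 - 5.0 * z + offset
humidity = 10.0 * z + np.zeros((ny, nx))         # linear in level for easy checking
rh_at_280 = interp_to_surface(temperature, humidity, 280.0)
```

Everything here is whole-array work, so it avoids building one interpolator per column; higher-order interpolation would need more than the two bracketing levels.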
UPDATE:
One neat trick (again, assuming that each column has the same, not necessarily regular, coordinates) is to use interp1d to extract the weights, as follows:
NZ = grid.shape[0]
zs = grid[:,0,0]
ident = np.identity(NZ)
weight_func = interp1d(zs, ident, 'cubic')
You only need to do the above once per grid; you can even reuse weight_func as long as the vertical coordinates don't change.
When it comes time to interpolate then, weight_func(value) will give you the weights, which you can use to compute a single interpolated value at (x_idx, y_idx) with:
weights = weight_func(value)
interp_val = np.dot(data[:, x_idx, y_idx], weights)
If you want to compute a whole plane of interpolated values, you can use np.inner, although since your z-coordinate comes first, you'll need to do:
result = np.inner(data.T, weights).T
Again, the computation should be practically immediate.
This is quite an old question, but the best way to do this nowadays is to use MetPy's interpolate_1d function:
https://unidata.github.io/MetPy/latest/api/generated/metpy.interpolate.interpolate_1d.html
There is a new implementation of Numba accelerated interpolation on regular grids in 1, 2, and 3 dimensions:
https://github.com/dbstein/fast_interp
Usage is as follows:
from fast_interp import interp2d
import numpy as np
nx = 50
ny = 37
xv, xh = np.linspace(0, 1, nx, endpoint=True, retstep=True)
yv, yh = np.linspace(0, 2*np.pi, ny, endpoint=False, retstep=True)
x, y = np.meshgrid(xv, yv, indexing='ij')
test_function = lambda x, y: np.exp(x)*np.exp(np.sin(y))
f = test_function(x, y)
test_x = -xh/2.0
test_y = 271.43
fa = test_function(test_x, test_y)
interpolater = interp2d([0,0], [1,2*np.pi], [xh,yh], f, k=5, p=[False,True], e=[1,0])
fe = interpolater(test_x, test_y)
I have some data over a 2D range that I am interested in analyzing. These data were originally in lists x,y, and z where z[i] was the value for the point located at (x[i],y[i]). I then interpolated this data onto a regular grid using
x=np.array(x)
y=np.array(y)
z=np.array(z)
xi=np.linspace(minx,maxx,100)
yi=np.linspace(miny,maxy,100)
zi=griddata(x,y,z,xi,yi)
I then plotted the xi,yi,zi data using
plt.contour(xi,yi,zi)
plt.pcolormesh(xi,yi,zi,cmap=plt.get_cmap('PRGn'),norm=plt.Normalize(-10,10),vmin=-10,vmax=10)
This produced this plot:
In this plot you can see the S-like curve where the values are equal to zero (aside: the data doesn't vary as rapidly as shown in the colorbar -- that's simply a result of me normalizing the data to -10-10 when it actually extends far beyond that range; I did this to make the zero-valued region show up better -- maybe there's a better way of doing this too...).
The scattered dots are simply the points at which I have original data (yes, in this case my data was already on a regular grid). What I'm curious about is whether there is a good way for me to extract the values for which the curve is zero and obtain x,y pairs that, if plotted as a line, would trace that zero-region in the colormesh. I could interpolate to a really fine grid and then just brute force search for the values which are closest to zero. But is there a more automatic way of doing this, or a more automatic way of plotting this "zero-line"?
And a secondary question: I am using griddata correctly, right? I have these simple 1D arrays although elsewhere people use various meshgrids, loading texts, etc., before calling griddata.
Here is a full example:
import numpy as np
import matplotlib.pyplot as plt
y, x = np.ogrid[-1.5:1.5:200j, -1.5:1.5:200j]
f = (x**2 + y**2)**4 - (x**2 - y**2)**2
plt.figure(figsize=(9,4))
plt.subplot(121)
extent = [np.min(x), np.max(x), np.min(y), np.max(y)]
cs = plt.contour(f, extent=extent, levels=[0.1],
colors=["b", "r"], linestyles=["solid", "dashed"], linewidths=[2, 2])
plt.subplot(122)
# get the points on the lines
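# cs.allsegs holds, for each contour level, a list of (N, 2) vertex arrays
# (cs.collections is no longer available in recent Matplotlib releases)
for segments in cs.allsegs:
    for data in segments:
        plt.plot(data[:,0], data[:,1], color="b", linewidth=2)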
plt.show()
here is the output: