I have Mathematica code that calculates the 95% confidence intervals of a Cumulative Distribution Function (CDF) obtained from a specific Probability Distribution Function (PDF). The PDF is ugly, as it contains a hypergeometric 2F1 function, and I need to calculate the 2-sigma error bars of a data set of 15 values.
I want to translate this code to Python, but I get a very significant divergence on the second half of the values.
Mathematica code
results are the lower and upper 2-sigma confidence level for the values in xdata. That is, xdata should always fall between the two corresponding results values.
navs = {10, 10, 18, 30, 52, 87, 147, 245, 410, 684, 1141, 1903, 3173, 5290, 8816};
freqs = {0.00002, 0.00004, 0.0000666667, 0.000111111, 0.000185185, 0.000308642, 0.000514403, 0.000857339, 0.00142893, 0.00238166, 0.00396944, 0.00661594, 0.0165426, 0.0220568, 0.027571}
xdata = {0.578064980346793, 0.030812200935204, 0.316777979844816,
0.353718150091612, 0.287659600326548, 0.269254388840293,
0.16545714457921, 0.138759871084825, 0.0602382519940077,
0.10120771961, 0.065311134782518, 0.105235790998594,
0.124642033979457, 0.0271909963701794, 0.0686653810421847};
data = MapThread[{#1, #2, #3} &, {navs, freqs, xdata}]
post[x_, n_, y_] =
(n - 1) (1 - x)^n (1 - y)^(n - 2) Hypergeometric2F1[n, n, 1, x*y]
integral = Map[(values = #; mesh = Subdivide[0, 1, 1000];
SetPrecision[post[#, values[[1]], values[[3]]^2], 100] &,
mesh] // (Accumulate[#] - #/2 - #[[1]]/
2) & // #/#[[-1]] &,
mesh}\[Transpose], (#1[[1]] == #2[[1]] &)],
InterpolationOrder -> 1]) &, data];
results =
MapThread[{Sqrt[#1[.025]], Sqrt[#1[0.975]]} &, {integral, data}]
{{0.207919, 0.776508}, {0.0481485, 0.535278}, {0.0834002, 0.574447},
{0.137742, 0.551035}, {0.121376, 0.455097}, {0.136889, 0.403306},
{0.0674029, 0.279408}, {0.0612534, 0.228762}, {0.0158357, 0.134521},
{0.0525374, 0.156055}, {0.0270589, 0.108861}, {0.0740978, 0.137691},
{0.100498, 0.149646}, {0.00741129, 0.0525161}, {0.0507748, 0.0850961}}
Python code
Here's my translation: results are the same quantity as before, truncated to the 7th digit to increase readability.
The results values I get start diverging from the 7th pair of values on, and the last four points of xdata do not fall between the two corresponding results values.
import numpy as np
from scipy.integrate import cumtrapz
from scipy.interpolate import interp1d
from mpmath import *
mesh = list(np.linspace(0,1,1000));
navs = [10, 10, 18, 30, 52, 87, 147, 245, 410, 684, 1141, 1903, 3173, 5290, 8816]
freqs = [0.00002, 0.00004, 0.0000666667, 0.000111111, 0.000185185, 0.000308642, 0.000514403, 0.000857339, 0.00142893, 0.00238166, 0.00396944, 0.00661594, 0.0165426, 0.0220568, 0.027571]
xdata = [0.578064980346793, 0.030812200935204, 0.316777979844816,
0.353718150091612,0.287659600326548, 0.269254388840293,
0.16545714457921, 0.138759871084825, 0.0602382519940077,
0.10120771961, 0.065311134782518, 0.105235790998594,
0.124642033979457, 0.0271909963701794, 0.0686653810421847]
def post(x,n,y):
    post = (n-1)*((1-x)**n)*((1-y)**(n-2))*hyp2f1(n,n,1,x*y)
    return post
# setting the numeric precision to 100 as in Mathematica
# trying to get the most precise hypergeometric function values
mp.dps = 100
mp.pretty = True
results = []
for i in range(len(navs)):
    postprob = []
    for j in range(len(mesh)):
        posterior = post(mesh[j], navs[i], xdata[i]**2)
        postprob.append(posterior)
    # calculate the norm of the pdf for integration
    norm = np.trapz(np.array(postprob),mesh)
    # integrate pdf/norm to obtain cdf
    integrate = list(np.unique(cumtrapz(np.array(postprob)/norm, mesh, initial=0)))
    mesh2 = list(np.linspace(0,1,len(integrate)))
    # interpolate inverse cdf to obtain the 2sigma quantiles
    icdf = interp1d(integrate, mesh2, bounds_error=False, fill_value='extrapolate')
    results.append(list(np.sqrt(icdf([0.025, 0.975]))))
[[0.2079198, 0.7765088], [0.0481485, 0.5352773], [0.0834, 0.5744489],
[0.1377413, 0.5510352], [0.1218029, 0.4566994], [0.1399324, 0.4122767],
[0.0733743, 0.3041607], [0.0739691, 0.2762597], [0.0230135, 0.1954886],
[0.0871462, 0.2588804], [0.05637, 0.2268962], [0.1731199, 0.3217401],
[0.2665897, 0.3969059], [0.0315915, 0.2238736], [0.2224567, 0.3728803]]
Thanks to the comments to this question, I found out that:
The hypergeometric function gives different results in the two languages. With the same input values, Mathematica's Hypergeometric2F1 gives 1.0588267, while Python's mpmath.hyp2f1 gives 1.0588866. This is at the second point of the mesh, and the difference is in the fifth decimal place.
Is there somewhere a better definition of this special function I was not able to find?
I still don't know if this is only due to the Hypergeometric function or also to the integration method, but that is definitely a starting point.
(I am fairly new to Python, maybe the code is a bit naive)
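Two things may be worth checking here, offered as a sketch rather than a confirmed diagnosis. First, Subdivide[0, 1, 1000] returns 1001 points (step 1/1000), whereas np.linspace(0, 1, 1000) returns 1000 points (step 1/999), so from the second mesh point on the two codes evaluate the hypergeometric function at slightly different arguments. Second, np.unique shortens the CDF array, and pairing it with a fresh linspace of the same length no longer matches each CDF value to the x at which it was computed. The version below keeps the mesh aligned with the CDF. The function name two_sigma_bounds, the 30-digit working precision, and the rescaling by the peak value are assumptions made for illustration.

```python
import numpy as np
import mpmath

mpmath.mp.dps = 30  # assumed working precision for this sketch


def post(x, n, y):
    """Unnormalised posterior density, evaluated with mpmath."""
    x, y = mpmath.mpf(x), mpmath.mpf(y)
    return (n - 1) * (1 - x)**n * (1 - y)**(n - 2) * mpmath.hyp2f1(n, n, 1, x * y)


def two_sigma_bounds(n, xval, intervals=1000):
    # intervals + 1 points, like Subdivide[0, 1, 1000]
    mesh = np.linspace(0.0, 1.0, intervals + 1)
    pdf = [post(float(x), n, xval**2) for x in mesh]
    # rescale inside mpmath before converting to float, to avoid over/underflow
    peak = max(pdf)
    pdf = np.array([float(p / peak) for p in pdf])
    # cumulative trapezoid rule (the constant step cancels in the normalisation)
    cdf = np.concatenate(([0.0], np.cumsum((pdf[1:] + pdf[:-1]) / 2)))
    cdf /= cdf[-1]
    # drop repeated CDF values but keep the x that belongs to each survivor
    cdf_unique, keep = np.unique(cdf, return_index=True)
    lower, upper = np.interp([0.025, 0.975], cdf_unique, mesh[keep])
    return np.sqrt(lower), np.sqrt(upper)


# second mesh point of the two grids: 1/999 versus 1/1000
print(np.linspace(0, 1, 1000)[1], np.linspace(0, 1, 1001)[1])
print(two_sigma_bounds(10, 0.578064980346793))
```

If the mesh is the culprit, the later data points (large n, sharply peaked posterior, many repeated CDF values) are exactly where the misalignment would show up most.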
I am trying to create a program about non-linear regression. I have three parameters [R,G,B] and I want to obtain the temperature of any pixel on image with respect to my reference color code. For example:
Reference Files R,G,B,Temperature = [(157,158,19,300),(146,55,18,320),(136,57,22,340),(133,88,25,460),(141,105,27,500),(210,195,3,580),(203,186,10,580),(214,195,4,600),(193,176,10,580)]
As you can see above, the RGB values change non-linearly with temperature. Currently I use a "minimum error algorithm" to obtain the temperature from the RGB color codes, but I want to obtain a value for colors that do not exist in the reference file (e.g. if I have (155,200,40) and it does not exist in the reference file, I need to work out which temperature these three codes correspond to).
Here is the code to select the closest reference temperature given a RGB value:
from math import sqrt
referenceColoursRGB =[(157,158,19),
referenceTemperatures = [
def closest_color(rgb):
    r, g, b = rgb
    color_diffs = []
    counter = 0
    for color in referenceColoursRGB:
        cr, cg, cb = color
        color_diff = sqrt(abs(r - cr)**2 + abs(g - cg)**2 + abs(b - cb)**2)
        color_diffs.append((color_diff, color))
    minErrorIndex = color_diffs.index(min(color_diffs))
    return minErrorIndex
temperatureLocation = closest_color((149, 60, 25))
print("Temperature : ", referenceTemperatures[temperatureLocation])
# => Temperature : 320
temperatureLocation = closest_color((220, 145, 4))
print("Temperature : ", referenceTemperatures[temperatureLocation])
# => Temperature : 580
I really want to calculate temperature values that don't appear in the reference list, but I am having problems using all RGB values and calculating/predicting reasonable/accurate temperatures from them.
I tried to reduce the three values to a single parameter and then use polyfit, but this is problematic because every variable affects that one parameter in the same way, so I can't tell the color codes apart. For example, with "oneParameter = 1000 *R + 100 *G + 10 *B", the color codes (2,20,50) and (2,5,200) both give 4500, so they are indistinguishable as far as the "oneParameter" equation is concerned.
I hope I have explained my problem clearly. Any help is appreciated!
Thank you.
N.B.: I can't vouch for the physical accuracy of this prediction, but this might be along the lines of what you're looking for. I.e., this makes the predictions match your reference data exactly, but I have no idea how accurate the temperature predictions might be for non-reference RGB colors. If I knew the exact physics of the mapping from RGB to temperature, I'd use that.
Bad Model 1
One simple way to do nonlinear regression is to preprocess your data so that you have nonlinear terms for your regression. sklearn has a builtin preprocessing function to do this by generating powers and interactions of the original input data.
referenceColoursRGB =[(157,158,19),
referenceTemperatures = [
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly_RGB = poly.fit_transform(referenceColoursRGB)
ols = linear_model.LinearRegression()
ols.fit(poly_RGB, referenceTemperatures)
# array([300., 320., 340., 460., 500., 580., 600.])
To make non-reference RGB predictions, you would do something like:
ols.predict(poly.transform([(149, 60, 25)]))
# array([369.68980598])
ols.predict(poly.transform([(220, 145, 4)]))
# array([949.34548347])
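The exact match on the reference data is less surprising once the expanded features are counted. As a small check, assuming the nine reference rows listed in the question (the variable names below are illustrative), a degree-2 expansion of three inputs produces ten columns: a constant, R, G, B, and the six squares and pairwise products. With at least as many free coefficients as reference points, a least-squares fit can typically reproduce the training temperatures exactly, which says little about accuracy for other colors.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# reference RGB values copied from the question
reference_rgb = np.array([
    (157, 158, 19), (146, 55, 18), (136, 57, 22),
    (133, 88, 25), (141, 105, 27), (210, 195, 3),
    (203, 186, 10), (214, 195, 4), (193, 176, 10),
])

expanded = PolynomialFeatures(degree=2).fit_transform(reference_rgb)
n_samples, n_features = expanded.shape

# columns: 1, R, G, B, R^2, R*G, R*B, G^2, G*B, B^2
print(n_samples, n_features)
```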
EDIT: Bad Model 2
Previously I picked something simple, a nonlinear fit using PolynomialFeatures, without regard to any real physics that might be going on at the RGB sensor. You can decide if it fits your needs. Here's another model, which uses RGB ratios, again without any regard to whatever physics is happening. As before, you can decide if this model is appropriate.
rat_RGB = [(r, g, b, r/g, r/b, g/r, g/b, b/r, b/g) for r,g,b in referenceColoursRGB]
rat_ols = linear_model.LinearRegression()
rat_ols.fit(rat_RGB, referenceTemperatures)
# array([300., 320., 340., 460., 500., 580., 600.])
You can see that this model can also be fit perfectly to the reference data. It's interesting, and probably important to note that the other example predictions produce different temperatures with this model.
rat_ols.predict([(r, g, b, r/g, r/b, g/r, g/b, b/r, b/g) for r,g,b in [(149, 60, 25)]])
# array([481.79424789])
rat_ols.predict([(r, g, b, r/g, r/b, g/r, g/b, b/r, b/g) for r,g,b in [(220, 145, 4)]])
# array([653.06116368])
I hope you can find/develop a RGB/temp model that is physics based. I am wondering if the manufacturer of your RGB sensor has some specifications and/or engineering notes that might help.
For this lab I need to sample 150 x-values from a Normal distribution using a mean of 0 and standard deviation of 10, then from the x-values construct a design matrix using the features {1,x,x^2}.
We have to sample parameters and then use the design matrix to create y values for regression data.
The problem is that my design matrix isn't square, and the Moore-Penrose pseudoinverse needs square matrices, but I don't know how to get that to work given the earlier setup of the lab.
This is what I've done
#Linear Regression Lab
import numpy as np
import math
data = np.random.normal(0, 10, 150)
design_matrix = np.zeros((150,3))
for i in range(150):
    design_matrix[i][0] = 1
    design_matrix[i][1] = data[i]
    design_matrix[i][2] = pow(data[i], 2)
print("-------------------Design Matrix---------------------")
#sampling parameters
theta_0 = np.random.uniform(low = -30, high = 20)
theta_1 = np.random.uniform(low = -30, high = 20)
theta_2 = np.random.uniform(low = -30, high = 20)
print(theta_0, theta_1, theta_2)
theta = np.array([theta_0, theta_1, theta_2])
theta = np.transpose(theta)
#moore penrose pseudo inverse
MPpi = np.linalg.pinv(design_matrix) ##problem here
y_values = np.linalg.inv(MPpi)
Feel free to edit this incomplete answer
After running this code on Repl, I got the following error message
Traceback (most recent call last):
  File "main.py", line 32, in <module>
    y_values = np.linalg.inv(MPpi)
  File "<__array_function__ internals>", line 5, in inv
  File "/home/runner/.local/share/virtualenvs/python3/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 542, in inv
  File "/home/runner/.local/share/virtualenvs/python3/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 213, in _assert_stacked_square
    raise LinAlgError('Last 2 dimensions of the array must be square')
numpy.linalg.LinAlgError: Last 2 dimensions of the array must be square
The first error propagates from taking the inverse of MPpi: np.linalg.inv only accepts square matrices, and MPpi is not square.
By looking at the docs, it seems that pinv switches the last two dimensions (e.g., an m x n matrix becomes n x m), so we will need to format the matrix before calculating the pseudoinverse.
As far as the Moore-Penrose inverse (AKA pinv) is concerned, this article suggests multiplying MPpi by the data, which yields x_0 (notation from Ross MacAusland), the best fit for your least-squares regression.
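As a minimal sketch of how the pieces might fit together (an interpretation of the lab, not a confirmed solution): the y values come from multiplying the design matrix by the sampled parameters, and the pseudoinverse is then applied to y to recover those parameters. No square matrix and no call to np.linalg.inv is needed. The names X, y, and theta_hat, and the fixed seed, are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# 150 samples from N(0, 10) and the design matrix with features {1, x, x^2}
x = rng.normal(0, 10, 150)
X = np.column_stack([np.ones_like(x), x, x**2])  # shape (150, 3)

# sampled parameters and the regression targets they generate
theta = rng.uniform(-30, 20, 3)
y = X @ theta  # shape (150,)

# the pseudoinverse of a 150 x 3 matrix is 3 x 150
X_pinv = np.linalg.pinv(X)
theta_hat = X_pinv @ y  # least-squares estimate of theta

print(X_pinv.shape)
print(theta, theta_hat)
```

Because y is generated without noise here, theta_hat should agree with theta up to rounding; with noise added to y it becomes the least-squares estimate.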
How can I find anomalous values in the following data? I am simulating a sinusoidal pattern. I can plot the data and spot any anomalies or noise by eye, but how can I do it without plotting the data? I am looking for simple approaches other than machine learning methods.
import random
import numpy as np
import matplotlib.pyplot as plt
N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
print("in_array : ", in_array)
out_array = np.sin(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Inject random noise
noise_input = random.uniform(-.5, .5); print("Noise : ",noise_input)
in_array[random.randint(0,len(in_array)-1)] = noise_input
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Data with noise
I've thought of the following approach to your problem. Since only some values in the time vector are anomalous, the rest of the values follow a regular progression. If we group the steps between consecutive data points into clusters and calculate the average step for the biggest cluster (which is essentially the pool of values that represents the real signal), we can then use that average to do a triad detection over the vector, within a given threshold, and detect which of the elements are anomalous.
For this we need two functions: calculate_average_step which will calculate that average for the biggest cluster of close values, and then we need detect_anomalous_values which will yield the indexes of the anomalous values in our vector, based on that average calculated earlier.
After we detected the anomalous values, we can go ahead and replace them with an estimated value, which we can determine from our average step value and by using the adjacent points in the vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def calculate_average_step(array, threshold=5):
    """
    Determine the average step by doing a weighted average based on clustering of averages.
    array: our array
    threshold: the +/- offset (in percent) for grouping clusters. Applicable on all elements in the array.
    """
    # determine all the steps
    steps = []
    for i in range(0, len(array) - 1):
        steps.append(abs(array[i] - array[i+1]))
    # determine the steps clusters
    clusters = []
    skip_indexes = []
    cluster_index = 0
    for i in range(len(steps)):
        if i in skip_indexes:
            continue  # this step already belongs to a cluster
        # determine the cluster band (based on threshold)
        cluster_lower = steps[i] - (steps[i]/100) * threshold
        cluster_upper = steps[i] + (steps[i]/100) * threshold
        # create the new cluster
        clusters.append([])
        clusters[cluster_index].append(steps[i])
        # try to match elements from the rest of the array
        for j in range(i + 1, len(steps)):
            if not (cluster_lower <= steps[j] <= cluster_upper):
                continue
            clusters[cluster_index].append(steps[j])
            skip_indexes.append(j)
        cluster_index += 1 # increment the cluster id
    clusters = sorted(clusters, key=lambda x: len(x), reverse=True)
    biggest_cluster = clusters[0] if len(clusters) > 0 else None
    if biggest_cluster is None:
        return None
    return sum(biggest_cluster) / len(biggest_cluster) # return our most common average
def detect_anomalous_values(array, regular_step, threshold=5):
    """
    Will scan every triad (3 points) in the array to detect anomalies.
    array: the array to iterate over.
    regular_step: the step around which we form the upper/lower band for filtering
    threshold: +/- variation (in percent) between the steps of the first and median element and median and third element.
    """
    assert(len(array) >= 3) # must have at least 3 elements
    anomalous_indexes = []
    step_lower = regular_step - (regular_step / 100) * threshold
    step_upper = regular_step + (regular_step / 100) * threshold
    # detection will be forward from i (hence 3 elements must be available for the detection)
    for i in range(0, len(array) - 2):
        a = array[i]
        b = array[i+1]
        c = array[i+2]
        first_step = abs(a-b)
        second_step = abs(b-c)
        first_belonging = step_lower <= first_step <= step_upper
        second_belonging = step_lower <= second_step <= step_upper
        # detect that both steps are alright
        if first_belonging and second_belonging:
            continue # all is good here, nothing to do
        # detect if the first point in the triad is bad
        if not first_belonging and second_belonging:
            anomalous_indexes.append(i)
        # detect the last point in the triad is bad
        if first_belonging and not second_belonging:
            anomalous_indexes.append(i+2)
        # detect the mid point in triad is bad (or everything is bad)
        if not first_belonging and not second_belonging:
            anomalous_indexes.append(i+1)
            # we won't add here the others because they will be detected by
            # the rest of the triad scans
    return sorted(set(anomalous_indexes)) # return unique indexes
if __name__ == "__main__":
    N = 10 # Set signal sample length
    t1 = -np.pi # Simulation begins at t1
    t2 = np.pi # Simulation ends at t2
    in_array = np.linspace(t1, t2, N)
    # add some noise
    noise_input = random.uniform(-.5, .5)
    in_array[random.randint(0, len(in_array)-1)] = noise_input
    noisy_out_array = np.sin(in_array)
    # display noisy sin
    plt.plot(in_array, noisy_out_array, color = 'red', marker = "o")
    plt.title("noisy numpy.sin()")
    plt.show()
    # detect anomalous values
    average_step = calculate_average_step(in_array)
    anomalous_indexes = detect_anomalous_values(in_array, average_step)
    # replace anomalous points with an estimated value based on our calculated average
    for anomalous in anomalous_indexes:
        if anomalous > 0:
            # forward extrapolation from the previous point
            in_array[anomalous] = in_array[anomalous-1] + average_step
        else:
            # index 0 has no previous point (index -1 would wrap around),
            # so use backward extrapolation from the next point
            in_array[anomalous] = in_array[anomalous+1] - average_step
    # generate sine wave
    out_array = np.sin(in_array)
    plt.plot(in_array, out_array, color = 'green', marker = "o")
    plt.title("cleaned numpy.sin()")
    plt.show()
Noisy sine:
Cleaned sine:
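For comparison, the same idea can be sketched more compactly with NumPy. This is a variant, not the code above: it swaps the clustering for the median of the absolute steps as the "regular" step and flags a point when the steps on both of its sides are irregular. The function name flag_irregular_points, the 5% tolerance, and the injected value 0.3 are assumptions for illustration, and at least three points are required.

```python
import numpy as np


def flag_irregular_points(t, tolerance=0.05):
    """Return indices of points whose surrounding steps deviate from the median step."""
    steps = np.abs(np.diff(t))
    typical = np.median(steps)  # robust estimate of the regular step
    bad = np.abs(steps - typical) > tolerance * typical
    n = len(t)
    flagged = []
    for i in range(n):
        if i == 0:
            # first point: only its right-hand step can be checked
            is_bad = bad[0] and not bad[1]
        elif i == n - 1:
            # last point: only its left-hand step can be checked
            is_bad = bad[-1] and not bad[-2]
        else:
            # interior point: both neighbouring steps must be irregular
            is_bad = bad[i - 1] and bad[i]
        if is_bad:
            flagged.append(i)
    return flagged


t = np.linspace(-np.pi, np.pi, 10)
t[4] = 0.3  # hypothetical injected anomaly
print(flag_irregular_points(t))
```

The median is used because a single anomalous sample only disturbs two of the steps, so the middle of the sorted steps still reflects the regular spacing.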
Your problem lies in the time vector (which is one-dimensional). You will need to apply some sort of filter on that vector.
First thing that came to mind was medfilt (median filter) from scipy and it looks something like this:
from scipy.signal import medfilt
l1 = [0, 10, 20, 30, 2, 50, 70, 15, 90, 100]
l2 = medfilt(l1)
the output of this will be:
[ 0. 10. 20. 20. 30. 50. 50. 70. 90. 90.]
The problem with this filter, though, is that if we add noise values at the edges of the vector, like [200, 0, 10, 20, 30, 2, 50, 70, 15, 90, 100, -50], then the output would be something like [ 0. 10. 10. 20. 20. 30. 50. 50. 70. 90. 90. 0.]. Obviously this is not OK for the sine plot, since it will produce the same artifacts in the sine values array.
A better approach to this problem is to treat the time vector as a y output and its index values as the x input, and do a linear regression on the "time linear function". Note the quotes: it just means we're faking the two-dimensional model by supplying a fake X vector. The code relies on scipy's linregress (linear regression) function:
from scipy.stats import linregress
l1 = [5, 0, 10, 20, 30, -20, 50, 70, 15, 90, 100]
l1_x = range(0, len(l1))
slope, intercept, r_val, p_val, std_err = linregress(l1_x, l1)
l1 = intercept + slope * l1_x
whose output will be:
[-10.45454545 -1.63636364 7.18181818 16. 24.81818182
33.63636364 42.45454545 51.27272727 60.09090909 68.90909091
77.72727273]
Now let's apply this to your time vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
N = 20
# N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
# add some noise
noise_input = random.uniform(-.5, .5);
in_array[random.randint(0, len(in_array)-1)] = noise_input
# apply filter on time array
in_array_x = range(0, len(in_array))
slope, intercept, r_val, p_val, std_err = linregress(in_array_x, in_array)
in_array = intercept + slope * in_array_x
# generate sine wave
out_array = np.sin(in_array)
print("OUT ARRAY")
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
the output will be:
the resulting signal will be an approximation of the original, as it is with any form of extrapolation/interpolation/regression filtering.
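The same regression can also be used to point at the anomalous samples without plotting, by looking at how far each time value sits from the fitted line. The sketch below is one way to do that under stated assumptions: the anomaly at index 7, the use of the median absolute deviation as a robust scale, and the cutoff of 5 are illustrative choices, not part of the answer above.

```python
import numpy as np
from scipy.stats import linregress

t = np.linspace(-np.pi, np.pi, 20)
t[7] = 0.4  # hypothetical injected anomaly

# fit time value against sample index, as in the filter above
idx = np.arange(len(t))
fit = linregress(idx, t)
residuals = t - (fit.intercept + fit.slope * idx)

# robust spread of the residuals (median absolute deviation)
centred = residuals - np.median(residuals)
mad = np.median(np.abs(centred))

# samples whose residual is far outside the typical spread
suspects = np.flatnonzero(np.abs(centred) > 5 * mad)
print(suspects)
```

A robust scale is used because the outlier itself pulls the fitted line, and with it the ordinary standard deviation of the residuals, toward the anomaly.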
I have obtained the coefficients for the Legendre polynomial that best fits my data. Now I need to determine the value of that polynomial at each time-step of my data, so that I can subtract the fit from my data. I have looked at the documentation for the Legendre module, and I'm not sure if I just don't understand my options or if there isn't a native tool in place for what I want. If my data points were evenly spaced, linspace would be a good option, but that's not the case here. Does anyone have a suggestion for what to try?
For those who would like to demand a minimum working example of code, just use a random array, get the coefficients, and tell me from there how you would proceed. The values themselves don't matter. It's the technique that I'm asking about here. Thanks.
To simplify Ahmed's example
In [1]: from numpy.polynomial import Polynomial, Legendre
In [2]: p = Polynomial([0.5, 0.3, 0.1])
In [3]: x = np.random.rand(10) * 10
In [4]: y = p(x)
In [5]: pfit = Legendre.fit(x, y, 2)
In [6]: plot(*pfit.linspace())
Out[6]: [<matplotlib.lines.Line2D at 0x7f815364f310>]
In [7]: plot(x, y, 'o')
Out[7]: [<matplotlib.lines.Line2D at 0x7f81535d8bd0>]
The Legendre functions are scaled and offset, as the data should be confined to the interval [-1, 1] to get any advantage over the usual power basis. If you want the coefficients for plain old Legendre functions
In [8]: pfit.convert()
Out[8]: Legendre([ 0.53333333, 0.3 , 0.06666667], [-1., 1.], [-1., 1.])
But that isn't recommended.
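Since the object returned by Legendre.fit is callable, it can be evaluated directly at the original, unevenly spaced sample times, which is what subtracting the fit requires. A minimal sketch follows; the sample times, the small sinusoidal wiggle added to the data, and the names trend and detrended are made up for illustration.

```python
import numpy as np
from numpy.polynomial import Legendre

rng = np.random.default_rng(0)

# irregularly spaced sample times and some data with a quadratic trend
t = np.sort(rng.uniform(0, 10, 25))
signal = 0.5 + 0.3 * t + 0.1 * t**2
data = signal + 0.05 * np.sin(7 * t)  # hypothetical small wiggle on top of the trend

# fit, then evaluate the fitted series at the same irregular times
pfit = Legendre.fit(t, data, 2)
trend = pfit(t)

# subtract the fit from the data
detrended = data - trend
print(detrended)
```

Calling pfit(t) applies the domain scaling stored in the fitted object, so no manual mapping to [-1, 1] is needed.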
Once you have a function, you can just generate a numpy array for the timepoints:
>>> import numpy as np
>>> timepoints = [1,3,7,15,16,17,19]
>>> myarray = np.array(timepoints)
>>> def mypolynomial(bins, pfinal): #pfinal is just the estimate of the final array (i'll do quadratic)
...     a,b,c = pfinal # obviously, for a*x^2 + b*x + c
...     return (a*bins**2) + b*bins + c
>>> mypolynomial(myarray, (1,1,0))
array([ 2, 12, 56, 240, 272, 306, 380])
NumPy automatically evaluates the function for each timepoint in the array.
Now all you have to do is rewrite mypolynomial to go from a simple quadratic example to a proper one for a Legendre polynomial. Treat the function as if it were evaluating a float to return the value, and when called on the numpy array it will automatically evaluate it for each value.
Let's say I wanted to generalize this to all standard polynomials:
>>> import numpy as np
>>> timepoints = [1,3,7,15,16,17,19]
>>> myarray = np.array(timepoints)
>>> def mypolynomial(bins, pfinal): # pfinal holds the coefficients, highest power first
...     hist = np.zeros_like(bins) # define blank return with the same shape as bins
...     for i in range(len(pfinal)):
...         # fixed a typo here, was pfinal[-i] which would give -0 rather than -1, since negative indexing starts at -1, not -0
...         const = pfinal[-i-1] # negative index to go from 0 exponent to highest exponent
...         hist = hist + const*(bins**i) # not in-place, so float coefficients also work
...     return hist
>>> mypolynomial(myarray, (1,1,0))
array([ 2, 12, 56, 240, 272, 306, 380])
EDIT2: Typo fix
@Ahmed is perfectly right when he states Horner's rule is good for numerical stability. The implementation here would be as follows:
>>> def horner(coeffs, x):
...     acc = 0
...     for c in coeffs:
...         acc = acc * x + c
...     return acc
>>> horner((1,1,0), myarray)
array([ 2, 12, 56, 240, 272, 306, 380])
Slightly modified to keep the same argument order as before, from the code here:
When you're using a nice library to fit polynomials, the library will in my experience usually have a function to evaluate them. So I think it is useful to know how you're generating these coefficients.
In the example below, I used two functions in numpy, legfit and legval which made it trivial to both fit and evaluate the Legendre polynomials without any need to invoke Horner's rule or do the bookkeeping yourself. (Though I do use Horner's rule to generate some example data.)
Here's a complete example where I generate some sparse data from a known polynomial, fit a Legendre polynomial to it, evaluate that polynomial on a dense grid, and plot. Note that the fitting and evaluating part takes three lines thanks to the numpy library doing all the heavy lifting.
It produces the following figure:
import numpy as np
### Setup code
def horner(coeffs, x):
    """Evaluate a polynomial at a point or array"""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc
x = np.random.rand(10) * 10
true_coefs = [0.1, 0.3, 0.5]
y = horner(true_coefs, x)
### Fit and evaluate
legendre_coefs = np.polynomial.legendre.legfit(x, y, 2)
new_x = np.linspace(0, 10)
new_y = np.polynomial.legendre.legval(new_x, legendre_coefs)
### Plotting only
try:
    import pylab
    pylab.ion() # turn on interactive plotting
    pylab.plot(x, y, 'o', new_x, new_y, '-')
    pylab.title('Fitting Legendre polynomials and evaluating them')
    pylab.legend(['original sparse data', 'fit'])
except Exception:
    print("Can't start plots.")
Gridding the data (d) on an irregular grid (x and y) using SciPy's griddata is time-consuming when there are many datasets. But the longitudes and latitudes (x and y) are always the same; only the data (d) change. In this case, after using griddata once, how can I repeat the procedure with different d arrays to get faster results?
import numpy as np, matplotlib.pyplot as plt
from scipy.interpolate import griddata
x = np.array([110, 112, 114, 115, 119, 120, 122, 124]).astype(float)
y = np.array([60, 61, 63, 67, 68, 70, 75, 81]).astype(float)
d = np.array([4, 6, 5, 3, 2, 1, 7, 9]).astype(float)
ulx, lrx = np.min(x), np.max(x)
uly, lry = np.max(y), np.min(y)
xi = np.linspace(ulx, lrx, 15)
yi = np.linspace(uly, lry, 15)
grided_data = griddata((x, y), d, (xi.reshape(1,-1), yi.reshape(-1,1)), method='nearest',fill_value=0)
The above code works for one array of d.
But I have hundreds of other arrays.
griddata with nearest ends up using NearestNDInterpolator. That's a class that creates a callable interpolator object, which is then called with the xi:
elif method == 'nearest':
    ip = NearestNDInterpolator(points, values, rescale=rescale)
    return ip(xi)
So you could create your own NearestNDInterpolator and call it multiple times with different xi.
But I think in your case you want to change the values. Looking at the code for that class I see
self.tree = cKDTree(self.points)
self.values = y
the __call__ does:
dist, i = self.tree.query(xi)
return self.values[i]
I don't know the relative cost of creating the tree versus query.
So it should be easy to change values between uses of __call__. And it looks like values could have multiple columns, since it's just indexing on the 1st dimension.
This interpolator is simple enough that you could write your own using the same tree idea.
Here's a Nearest Interpolator that lets you repeat the interpolation for the same points, but different z values. I haven't done timings yet to see how much time it saves
import numpy as np
from scipy import interpolate

# note: interpnd._ndim_coords_from_arrays is a private SciPy helper;
# its location may differ between SciPy versions
class MyNearest(interpolate.NearestNDInterpolator):
    # normal interpolation, but returns the near neighbor indices as well
    def __call__(self, *args):
        xi = interpolate.interpnd._ndim_coords_from_arrays(args, ndim=self.points.shape[1])
        xi = self._check_call_shape(xi)
        xi = self._scale_x(xi)
        dist, i = self.tree.query(xi)
        return i, self.values[i]

def my_griddata(points, values, method='linear', fill_value=np.nan,
                rescale=False):
    points = interpolate.interpnd._ndim_coords_from_arrays(points)
    if points.ndim < 2:
        ndim = points.ndim
    else:
        ndim = points.shape[-1]
    # simplified call for 2d 'nearest'
    ip = MyNearest(points, values, rescale=rescale)
    return ip # ip(xi) # return the interpolator, not values
ip = my_griddata((xreg, yreg), z, method='nearest',fill_value=0)
xi = (xi.reshape(1,-1), yi.reshape(-1,1))
I, data = ip(xi)
z1 = xreg+yreg # new z data
data = z1[I] # should show diagonal color bars
So as long as z has the same shape as before (and as xreg), z[I] will return the nearest value for each xi.
And it can interpolate 2d data as well (e.g. (225,n) shaped)
z1 = np.array([xreg+yreg, xreg-yreg]).T
print(z1.shape) # (225,2)
data = z1[I]
print(data.shape) # (20,20,2)
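The same tree idea can also be written without subclassing or private SciPy helpers. The sketch below, using the x, y, and grid from the question, builds a cKDTree once, queries the nearest-neighbour indices for the output grid once, and then regrids any number of d arrays by plain indexing. The second row of d_stack is made-up data, and the variable names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

x = np.array([110, 112, 114, 115, 119, 120, 122, 124], dtype=float)
y = np.array([60, 61, 63, 67, 68, 70, 75, 81], dtype=float)

# output grid, as in the question
xi = np.linspace(x.min(), x.max(), 15)
yi = np.linspace(y.max(), y.min(), 15)
XI, YI = np.meshgrid(xi, yi)

# one-off cost: build the tree and find the nearest data point of every grid cell
tree = cKDTree(np.column_stack([x, y]))
_, nearest = tree.query(np.column_stack([XI.ravel(), YI.ravel()]))
nearest = nearest.reshape(XI.shape)  # (15, 15) array of indices into x, y, d

# many data arrays sharing the same x, y (second row is hypothetical)
d_stack = np.array([
    [4, 6, 5, 3, 2, 1, 7, 9],
    [9, 7, 1, 2, 3, 5, 6, 4],
], dtype=float)

# regridding every array is now a single indexing operation
gridded = d_stack[:, nearest]
print(gridded.shape)  # (2, 15, 15)
```

Only the tree construction and query depend on the coordinates, so they are done once; each additional d array costs one fancy-indexing step.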