I am trying to fit these values:
This is my code:
import numpy as np
import matplotlib.pyplot as plt

i0 = 400
j0 = 400
area = 50
stDev = ...  # a 2D np.array of shape (1300, 800); just an image of "noise" -- feel free to use any image / 2D np array you like

def function(fit, area):
    yfit = []
    for x in range(-area, area):
        yfit.append(fit[0] + fit[1]*x + fit[2]*x**2 + fit[3]*x**3 + fit[4]*x**4)
    return yfit

slices = {}  # slices of the image, keyed by the row offset
for i in range(-area, area):
    stDev1 = []
    for j in range(-area, area):
        stDev0 = stDev[i+i0][j+j0]
        stDev1.append(stDev0)
    slices[i] = stDev1

fitV = []
xV = []
for l in range(-area, area):
    y = np.asarray(slices[l])
    x = np.arange(0, 2*area, 1)
    for m in range(-area, area):
        fitV.append(slices[m][l])
        xV.append(l)

fit = np.polyfit(xV, fitV, 4)
yfit = function(fit, area)
x100 = np.arange(0, 100, 1)

plt.plot(xV, fitV, '.')
plt.savefig("fits1.png")
This yields:
This is obviously completely wrong, isn't it?
I assume I misunderstand the concept of polyfit? From the docs, the requirement is that I feed it two arrays of shape x[i], y[i]. My x values are in
xV = [ x_1_-50, x_1_-49, ..., x_1_49, x_2_-50, ..., x_49_49]
and my ys are:
fitV = [y_1_-50, y_1_-49, ..., y_1_49, ..., y_2_-50, ..., y_2_49]
I do not completely understand your program. In the future, it would be helpful if you were to distill your issue to an MCVE. But here are some thoughts:
It seems, in your data, that for a given value of x there are multiple values of y. Given (x, y) data, polyfit returns an array of coefficients that represents a polynomial function, but no function can map a single value of x onto multiple values of y. As a first step, consider collapsing each set of y values into a single representative value using, for example, the mean, median, or mode. Or perhaps, in your domain, there's a more natural way to do this.
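For example, a minimal sketch of that first step (just the idea, not a drop-in fix), assuming xV and fitV are the flat lists built in your loops:
import numpy as np

xs = np.asarray(xV, dtype=float)
ys = np.asarray(fitV, dtype=float)

# collapse repeated x values by averaging the y values recorded at each x
x_unique, inverse = np.unique(xs, return_inverse=True)
y_mean = np.bincount(inverse, weights=ys) / np.bincount(inverse)

# now there is exactly one y per x, which is what polyfit expects
fit = np.polyfit(x_unique, y_mean, 4)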
Second, there is an idiomatic way to use the pair of functions np.polyfit and np.polyval, and you're not using them in the standard way. Of course, numerous useful departures from this pattern exist, but first make sure you understand the basic pattern of these two functions.
a. Given your measurements y_data, taken at times or locations x_data, plot them and make a guess as to the order of the fit. That is, does it look like a line? Like a parabola? Let's assume you believe your data to be parabolic, and that you'll use a second order polynomial fit.
b. Make sure that your arrays are sorted in order of increasing x. There are many ways to do this, but np.argsort is an easy one.
c. Run polyfit: p = polyfit(x_data,y_data,2), which returns the coefficients in p, ordered from highest to lowest degree: (c2, c1, c0).
d. In the idiomatic use of polyfit and polyval, next you would generate your fit: polyval(p,x_data). Or perhaps you want the fit to be sampled more coarsely or finely, in which case you might take a subset of x_data or interpolate more values in x_data.
A complete example is below.
import numpy as np
from matplotlib import pyplot as plt
# these are your measurements, unsorted
x_data = np.array([18, 6, 9, 12 , 3, 0, 15])
y_data = np.array([583.26347805, 63.16059915, 100.94286909, 183.72581827, 62.24497418,
134.99558191, 368.78421529])
# first, sort both vectors in increasing-x order:
sorted_indices = np.argsort(x_data)
x_data = x_data[sorted_indices]
y_data = y_data[sorted_indices]
# now, plot and observe the parabolic shape:
plt.plot(x_data,y_data,'ks')
plt.show()
# generate the 2nd order fitting polynomial:
p = np.polyfit(x_data,y_data,2)
# make a more finely sampled x_fit vector with, for example
# 1024 equally spaced points between the first and last
# values of x_data
x_fit = np.linspace(x_data[0],x_data[-1],1024)
# now, compute the fit using your polynomial:
y_fit = np.polyval(p,x_fit)
# and plot them together:
plt.plot(x_data,y_data,'ks')
plt.plot(x_fit,y_fit,'b--')
plt.show()
Hope that helps.
I have a small set of data. I used python3 to read it and created a scatter plot. My next task is to set the slope a to 10 and the intercept b to 0 and calculate y for every value of x. The task says I should not use any existing linear regression functions. I have been stuck for some time on this. How can I do that?
If your slope is already set to 10, I don't see why you need to use Linear Regression. I hope I'm not missing anything from your task.
However, that aside: if you need a list in Python with every element of x multiplied by your slope a, you can build it with a list comprehension in the following way:
y_computed = [item*a for item in x]
You can literally just draw a line with a constant slope (10) on the same plot, then calculate the predicted y-value based on that line "estimate" (you can also find the error of the estimate if you want). That can be done using the following:
import numpy as np
from matplotlib import pyplot as plt
def const_line(x):
    y = 10 * x + 0  # Just to illustrate that the intercept is zero
    return y

x = np.linspace(0, 1)
y = const_line(x)
plt.plot(x, y, c='m')
plt.show()

# Find the y-values for each sample point in your data:
for x in data:
    const_line(x)
I have data from distinct curves, and want to fit each of them individually. However, the data is mixed into a single array, so first I believe I need a way to separate the data.
I know that each of the individual curves belongs to the family A/x + B. As of now I cut out each of the curves by hand and curve fit, but would like to automate this process and have the computer separate these curves and fit them. I attempted to use machine learning, but didn't know where to start or what packages to use. I am using Python, but can also use C++; in fact, I hope to port it to C++ in the end. Where do you think I should start? Is it worth it to use unsupervised machine learning, or is there a better way to separate the data?
The expected curves:
An example of the data
Well, you sure do have an interesting problem.
I see that there are curves with Y-axis values that are considerably larger than the rest. I would simply take the N points with the largest Y-axis values and then fit them to an exponential decay curve (or that other curve you mention). You can then take the points that best fit that curve and leave the other points alone.
Except...
This is a terrible way to extrapolate data. Doing this, you are cherry-picking the data you want. This is falsifying information and is very bad.
Your best bet is to create a single curve that all points fit to, if you cannot isolate all of those points into separate curves with external information.
But...
We do know some information: a valid function must have only 1 output given a single input.
If the X-axis is discrete, this means you can create a lookup table of outputs given the input. This allows you to count how many curves are associated with each specific X-value (which could be a time unit). In other words, you need external information to separate points locally. You can then reorder the points in increasing Y-value, and now you have your separate curves defined in discrete points.
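For what it's worth, a minimal sketch of that bookkeeping (assuming a discrete x-axis where every curve contributes exactly one point per x value, and that the curves never cross, so sorting by y at each x identifies the curve):
import numpy as np

def split_by_rank(x, y, n_curves):
    # group points by x, sort the y values at each x, and assign each rank to a curve
    x = np.asarray(x)
    y = np.asarray(y)
    curves = [([], []) for _ in range(n_curves)]
    for xv in np.unique(x):
        ys = np.sort(y[x == xv])          # all y values recorded at this x
        for rank, yv in enumerate(ys):    # rank within this x decides the curve
            curves[rank % n_curves][0].append(xv)
            curves[rank % n_curves][1].append(yv)
    return [(np.asarray(cx), np.asarray(cy)) for cx, cy in curves]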
Basically, this is an unsolvable problem in the general sense, but in your specific application, there might be extra rules that further define the domain and range such that you can do data filtering.
One more thing...
I am making these statements with the assumption that the (X,Y) values are floats that cannot maintain accuracy after some mathematical operations.
If you are using things like unum numbers, you might be able to keep enough information in the decimal such that your fitting functions can differentiate between points without extra filtering.
This case is more of a hope than anything, as adopting a new number representation to get more accuracy to isolate sampled points is a stretch at best.
Just for completeness, there are some mathematical libraries that might help you.
Boost.uBLAS
Eigen
LAPACK++
Hopefully, I have given you enough information to allow you to solve your problem.
I extracted data from the plot for analysis. Here is example code that loads, separates, fits and plots the three data sets. It works when the separate data files are appended into a single text file.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

##########################################################
# data load and separation section
datafilename = 'temp.dat'
textdata = open(datafilename, 'rt').read()

xLists = [[], [], []]
yLists = [[], [], []]
previousY = 0.0 # initialize
whichList = -1 # initialize
datalines = textdata.split('\n')
for line in datalines:
    if not line: # allow for blank lines in data file
        continue
    spl = line.split()
    x = float(spl[0])
    y = float(spl[1])
    if y > previousY + 50.0: # this separator must be greater than max noise
        whichList += 1
    previousY = y
    xLists[whichList].append(x)
    yLists[whichList].append(y)

##########################################################
# curve fitting section
def func(x, a, b):
    return a / x + b

parameterLists = []
for curveIndex in range(len(xLists)):
    # these are the same as the scipy defaults
    initialParameters = numpy.array([1.0, 1.0])
    xData = numpy.array(xLists[curveIndex], dtype=float)
    yData = numpy.array(yLists[curveIndex], dtype=float)
    # curve fit the test data
    fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
    parameterLists.append(fittedParameters)

##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    for curveIndex in range(len(xLists)):
        # first the raw data as a scatter plot
        axes.plot(xLists[curveIndex], yLists[curveIndex], 'D')
        # create data for each fitted equation plot
        xModel = numpy.linspace(min(xLists[curveIndex]), max(xLists[curveIndex]))
        yModel = func(xModel, *parameterLists[curveIndex])
        # now the model as a line plot
        axes.plot(xModel, yModel)
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label
    plt.show()
    plt.close('all') # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
The idea:
Create N naive, easy-to-calculate, sufficiently precise (for clustering) approximations. Then "classify" each data point to the closest such approximation.
This is done like this:
The approximations are analytical, using these two equations I derived for a curve of the form y = A/(x + B) (the form used in the code below):
A = y1*y2*(x1 - x2) / (y2 - y1)
B = A/y1 - x1
where (x1, y1) and (x2, y2) are the coordinates of two points on the curve.
To get these two points I assumed that (1) the first points (according to the x-axis) are distributed equally between the different real curves, and (2) the first 2 points of each real curve are all smaller or all bigger than the first 2 points of every other real curve. Thus sorting them and dividing them into N groups will successfully cluster the first 2N points. If these assumptions are false, you can still manually classify the first 2 points of each real curve and the rest will be classified automatically (this is actually the first approach I implemented).
Then cluster the rest of the points by assigning each point to its closest approximation, closest meaning the one with the smallest error.
Edit: A stronger initial approximation could be obtained by calculating A and B for several pairs of points and using their mean A and B as the approximation, and possibly even running k-means on these points/approximations.
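A rough sketch of that variant (a hypothetical helper; it reuses the two-point formulas above, and xs, ys are assumed to be the first few points already known to belong to one curve):
import numpy as np

def estimate_AB(xs, ys):
    # average the two-point estimates of A and B over consecutive pairs of points
    As, Bs = [], []
    for (x1, y1), (x2, y2) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        A = (y1 * y2 * (x1 - x2)) / (y2 - y1)
        As.append(A)
        Bs.append(A / y1 - x1)
    return np.mean(As), np.mean(Bs)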
The Code:
import numpy as np
import matplotlib.pyplot as plt
# You should probably edit this variable
NUM_OF_CURVES = 4
# <data> should be a 1-D array containing the Y values of the series
# <x_of_data> should be a 1-D array containing the corresponding X values of the series
data, x_of_data = np.loadtxt('...')
# clustering of first 2*num_of_curves points
# I started at NUM_OF_CURVES instead of 0 because my xs started at 0.
# The range (0:NUM_OF_CURVES*2) will probably be better for you.
raw_data = data[NUM_OF_CURVES:NUM_OF_CURVES*3]
raw_xs = x_of_data[NUM_OF_CURVES:NUM_OF_CURVES*3]
sort_ind = np.argsort(raw_data)
Y = raw_data[sort_ind].reshape(NUM_OF_CURVES,-1).T
X = raw_xs[sort_ind].reshape(NUM_OF_CURVES,-1).T
# approximation of A and B for each curve
A = ((Y[0]*Y[1])*(X[0]-X[1]))/(Y[1]-Y[0])
B = (A / Y[0]) - X[0]
# creating approximating curves
f = []
for i in range(NUM_OF_CURVES):
    f.append(A[i]/(x_of_data+B[i]))
curves = np.vstack(f)
# clustering the points to the approximating curves
raw_clusters = [[] for _ in range(NUM_OF_CURVES)]
for i in range(len(data)):
    raw_clusters[np.abs(curves[:,i]-data[i]).argmin()].append((x_of_data[i],data[i]))
# changing the clusters to np.arrays of the shape (2,-1)
# where row 0 contains the X coordinates and row 1 the Y coordinates
clusters = []
for i in range(len(raw_clusters)):
    clusters.append(np.array(list(zip(*raw_clusters[i]))))
Example:
raw series:
separated series:
I'm trying to compare a set of discrete data values with a model to estimate the "x" value where there is a good match between the discrete data points and the model. In other words, I'm trying to estimate the x value (or the range of x) where the differences between the data (discrete points) and the model are minimal. I have a model that provides Ya(x), Yb(x), Yc(x) (continuous lines), and I also have the data points A, B and C (filled circles). I would like to estimate the x value where the data points A, B and C (or most of them) match the corresponding continuous lines well. I also plot (model-data)^2 as a function of x; it appears from the second plot that a good match can be obtained for the x range 5.e3 to 1.e4. I was wondering if I could use any scipy.optimize subroutine to estimate it quantitatively.
Thanks for your time and any help would be greatly appreciated.
Here is some pseudo-code that might get you started. See if this actually matches what you want.
import scipy.optimize
import numpy as np
def yA(x):
    # whatever calculations here you do for curve A
    return 1.0  # return whatever yA is at x

def yB(x):
    # whatever calculations here you do for curve B
    return 1.0  # return whatever yB is at x

def yC(x):
    # whatever calculations here you do for curve C
    return 1.0  # return whatever yC is at x

def func(x, *data):  # minimize passes the args tuple as extra positional arguments
    A, B, C = data  # unpack tuple
    devA = np.abs((yA(x)-A)/yA(x))  # normalize the deviations
    devB = np.abs((yB(x)-B)/yB(x))  # to account for the order
    devC = np.abs((yC(x)-C)/yC(x))  # of magnitude variations
    return devA + devB + devC  # you want to minimize the sum of the deviations
A = 1.0E-10 # these are your data points (rough guess from plot)
B = 1.0E-11
C = 1.0E-8
x0 = 1000.0 # an initial guess
result = scipy.optimize.minimize(func,x0,args=(A,B,C))
print(result.x)
Suppose I have a process where I push a button, and after a certain amount of time (from 1 to 30 minutes), an event occurs. I then run a very large number of trials, and record how long it takes the event to occur for each trial. This raw data is then reduced to a set of 30 data points where the x value is the number of minutes it took for the event to occur, and the y value is the percentage of trials which fell into that bucket. I do not have access to the original data.
How can I use this set of 30 points to identify an appropriate probability distribution which I can then use to generate representative random samples?
I feel like scipy.stats has all the tools I need built in, but for the life of me I can't figure out how to go about it. Any tips?
If you don't have any prior information about the underlying function that produced the data, I suggest you use numpy.polyfit, which fits a polynomial of a given degree.
import matplotlib.pyplot as plt
import numpy as np
y = np.array([ 0.005995184, ...]) # your array
x = np.arange(len(y))
f = np.poly1d(np.polyfit(x, y, 10))
x_new = np.linspace(x[0], x[-1], 30)
y_new = f(x_new)
plt.plot(x,y,'o', x_new, y_new)
plt.xlim([x[0]-1, x[-1] + 1 ])
plt.show()
Here is an example for degree = 10.
In order to get an unknown value from the produced polynomial distribution, you simply:
f(13.5)
which in this case gives:
0.0206996531272
You can also use the histogram directly as a piecewise uniform distribution; then you get random numbers that correspond exactly to the data instead of an approximation.
The inverse cdf (ppf) is piecewise linear, and linear interpolation can be used to transform uniform random numbers appropriately.
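For instance, a minimal sketch with scipy.stats.rv_histogram, assuming data holds the 30 bucket percentages from the question and bucket i covers minutes i to i+1:
import numpy as np
from scipy import stats

counts = np.asarray(data)               # the 30 bucket percentages
bin_edges = np.arange(len(counts) + 1)  # bucket i covers [i, i+1) minutes

hist_dist = stats.rv_histogram((counts, bin_edges))
samples = hist_dist.rvs(size=10000)            # representative random samples
quartiles = hist_dist.ppf([0.25, 0.5, 0.75])   # piecewise-linear inverse cdf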
I was able to come up with a solution, but it doesn't feel like a very elegant one. Basically, take the percentage value (y value) for each x value, multiply by some large number (say, 10,000), then add that many copies of x to an array. Continue through all values of x, ending up with a single giant array. This array can then be fed into the .fit() methods of the scipy.stats continuous distributions. I'll leave the question open for now as I feel like there must be a better way.
import matplotlib.pyplot as plt
import scipy
import scipy.stats
import numpy as np

xRange = 30
x = np.arange(0, xRange+1)
data = [
    0.005995184,0.012209876,0.028232119,0.04711878,0.087894128,
    0.116652421,0.115370764,0.12774159,0.109731418,0.079767439,
    0.068016186,0.045287033,0.033403796,0.029145134,0.018925806,
    0.013340493,0.010087069,0.007998098,0.00984276,0.004906083,
    0.004720561,0.003186032,0.003028522,0.002942859,0.002780096,
    0.002450613,0.002733441,0.002217294,0.002072314,0.002063246]

y = []
for i in range(len(data)):
    for j in range(int(data[i]*10000)):
        y = np.append(y, i+1)

# creating the histogram
plt.figure(num=1, figsize=(22,12))
h = plt.hist(y, bins=x, density=True)  # density=True replaces the deprecated normed argument

dist_names = ['burr', 'f', 'rayleigh']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
    plt.plot(pdf_fitted, label=dist_name, lw=4)

plt.xlim(0, xRange)
plt.legend(loc='upper right')
plt.show()
I am trying to interpolate 3D atmospheric data from one vertical coordinate to another using Numpy/Scipy. For example, I have cubes of temperature and relative humidity, both of which are on constant, regular pressure surfaces. I want to interpolate the relative humidity to constant temperature surface(s).
The exact problem I am trying to solve has been asked previously here, however, the solution there is very slow. In my case, I have approximately 3M points in my cube (30x321x321), and that method takes around 4 minutes to operate on one set of data.
That post is nearly 5 years old. Do newer versions of Numpy/Scipy perhaps have methods that handle this faster? Maybe new sets of eyes looking at the problem have a better approach? I'm open to suggestions.
EDIT:
Slow = 4 minutes for one set of data cubes. I'm not sure how else I can quantify it.
The code being used...
import numpy as np
from scipy import interpolate

def interpLevel(grid, value, data, interp='linear'):
    """
    Interpolate 3d data to a common z coordinate.

    Can be used to calculate the wind/pv/whatsoever values for a common
    potential temperature / pressure level.

    grid : numpy.ndarray
        The grid. For example the potential temperature values for the whole 3d
        grid.
    value : float
        The common value in the grid, to which the data shall be interpolated.
        For example, 350.0
    data : numpy.ndarray
        The data which shall be interpolated. For example, the PV values for
        the whole 3d grid.
    interp : str
        This indicates which kind of interpolation will be done. It is directly
        passed on to scipy.interpolate.interp1d().

    returns : numpy.ndarray
        A 2d array containing the *data* values at *value*.
    """
    ret = np.zeros_like(data[0, :, :])
    for yIdx in range(grid.shape[1]):
        for xIdx in range(grid.shape[2]):
            # check if we need to flip the column
            if grid[0, yIdx, xIdx] > grid[-1, yIdx, xIdx]:
                ind = -1
            else:
                ind = 1
            f = interpolate.interp1d(grid[::ind, yIdx, xIdx],
                                     data[::ind, yIdx, xIdx],
                                     kind=interp)
            ret[yIdx, xIdx] = f(value)
    return ret
EDIT 2:
I could share npy dumps of sample data, if anyone was interested enough to see what I am working with.
Since this is atmospheric data, I imagine that your grid does not have uniform spacing; however if your grid is rectilinear (such that each vertical column has the same set of z-coordinates) then you have some options.
For instance, if you only need linear interpolation (say for a simple visualization), you can just do something like:
# Find nearest grid point
idx = grid[:,0,0].searchsorted(value)
upper = grid[idx,0,0]
lower = grid[idx - 1, 0, 0]
s = (value - lower) / (upper - lower)
result = (1-s) * data[idx - 1, :, :] + s * data[idx, :, :]
(You'll need to add checks for value being out of range, of course.) For a grid your size, this will be extremely fast (as in tiny fractions of a second).
You can pretty easily modify the above to perform cubic interpolation if need be; the challenge is in picking the correct weights for non-uniform vertical spacing.
The problem with using scipy.ndimage.map_coordinates is that, although it provides higher order interpolation and can handle arbitrary sample points, it does assume that the input data is uniformly spaced. It will still produce smooth results, but it won't be a reliable approximation.
If your coordinate grid is not rectilinear, so that the z-value for a given index changes for different x and y indices, then the approach you are using now is probably the best you can get without a fair bit of analysis of your particular problem.
UPDATE:
One neat trick (again, assuming that each column has the same, not necessarily regular, coordinates) is to use interp1d to extract the weights, doing something like the following:
NZ = grid.shape[0]
zs = grid[:,0,0]
ident = np.identity(NZ)
weight_func = interp1d(zs, ident, 'cubic')
You only need to do the above once per grid; you can even reuse weight_func as long as the vertical coordinates don't change.
When it comes time to interpolate then, weight_func(value) will give you the weights, which you can use to compute a single interpolated value at (x_idx, y_idx) with:
weights = weight_func(value)
interp_val = np.dot(data[:, x_idx, y_idx], weights)
If you want to compute a whole plane of interpolated values, you can use np.inner, although since your z-coordinate comes first, you'll need to do:
result = np.inner(data.T, weights).T
Again, the computation should be practically immediate.
This is quite an old question, but the best way to do this nowadays is to use MetPy's interpolate_1d function:
https://unidata.github.io/MetPy/latest/api/generated/metpy.interpolate.interpolate_1d.html
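A rough sketch of how that might look for the question's use case, with dummy stand-in arrays (the cube contents and the 280 K target surface are assumptions, not taken from the question):
import numpy as np
from metpy.interpolate import interpolate_1d

# dummy (30, 321, 321) cubes: temperature as the vertical coordinate, RH as the field
temperature = np.linspace(300.0, 200.0, 30)[:, None, None] * np.ones((1, 321, 321))
rh = np.random.rand(30, 321, 321)

# relative humidity interpolated onto the 280 K temperature surface, along axis 0
rh_on_280K = interpolate_1d(np.array([280.0]), temperature, rh, axis=0)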
There is a new implementation of Numba-accelerated interpolation on regular grids in 1, 2, and 3 dimensions:
https://github.com/dbstein/fast_interp
Usage is as follows:
from fast_interp import interp2d
import numpy as np
nx = 50
ny = 37
xv, xh = np.linspace(0, 1, nx, endpoint=True, retstep=True)
yv, yh = np.linspace(0, 2*np.pi, ny, endpoint=False, retstep=True)
x, y = np.meshgrid(xv, yv, indexing='ij')
test_function = lambda x, y: np.exp(x)*np.exp(np.sin(y))
f = test_function(x, y)
test_x = -xh/2.0
test_y = 271.43
fa = test_function(test_x, test_y)
interpolater = interp2d([0,0], [1,2*np.pi], [xh,yh], f, k=5, p=[False,True], e=[1,0])
fe = interpolater(test_x, test_y)