Error when fitting elements found in data to its cumulative distribution - python

I have a large simulated data set that I have run various values through for an analysis. My main objective is to take actual, recorded values and compare them to the simulated data via a cumulative distribution.
I start out by defining a function that goes through each bin of the data set, taking the simulated values at a certain value x and matching them to the "real" data analyzed at the same value x:
bins = np.linspace(SimData.min(), SimData.max(), 24)

def CumuProb(SimData, bins, x, realValue):
    h, bins_ = np.histogram(SimData, bins=bins)
    hcum = np.cumsum(h) / float(np.cumsum(h).max())
    cbins = np.zeros(len(bins) + 1)
    cbins[1:-1] = bins[1:] - np.diff(bins[:2])[0] / 2.
    cbins[-1] = bins[-1]
    hcumc = np.linspace(0, 1, len(cbins))
    hcumc[1:-1] = hcum
    p = [x, realValue]
    yi = np.interp(p[1], cbins, hcumc)
    return [p[1], yi]
This method works fine for large values. But if I pass in values that are << 1 but > 0, it fails miserably.
For example, running my project through this method gives a plot where, at the very bottom, there are only 2 points when there should be about 10 points, all on the blue line (the actual data).
The main culprit is found from this traceback:
RuntimeWarning: invalid value encountered in divide
  hcum = np.cumsum(h)/float(np.cumsum(h).max())
So this most likely has to do with how I am defining my bins, which are defined as bins = np.linspace(np.log(binding).min(), np.log(binding).max(), 24), i.e. binned over the logarithmic x-axis values in the plot above.
How do I fix this?

I can't be 100% sure, since the question lacks a lot of the relevant information needed, but judging from how you intend to use this function, it seems odd to put realValue into the interpolation. If, as the name suggests, x is the x-axis value of the data point to be investigated, the interpolation should take x:
yi = np.interp(x,cbins, hcumc)
return [x,yi]
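As for the RuntimeWarning itself: np.cumsum(h).max() is zero whenever no samples fall into any of the bins (which can easily happen if the bins were built from log-transformed values while the raw values are being histogrammed), and 0/0 yields NaN. A minimal guard, as a sketch under that assumption (SimData and bins as in the question):

import numpy as np

h, bins_ = np.histogram(SimData, bins=bins)
total = h.sum()
if total == 0:
    raise ValueError("No samples fell into the given bins; check that the data "
                     "and the bins are on the same (log vs. linear) scale.")
hcum = np.cumsum(h) / float(total)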

Related

lmfit Stepped functions and Step size

I want to fit a 2D shape in an image. In the past, I have successfully done this using lmfit in Python and wrapping the 2D function/data to 1D. On that occasion, the 2D model was a smooth function (a ring with a gaussian profile). Now I am trying to do the same but with a "non-smooth function" and it is not working as expected.
This is what I am trying to do (guessed and fitted are the same):
I have shifted the guessed parameters on purpose, to easily see whether the fit moves as expected, and nothing happens.
I have noticed that if instead of a swiss flag I use a 2D gaussian, which is a smooth function, this works fine (see MWE below):
So I guess the problem is related to the fact that the Swiss flag function is not smooth. I have tried to make it smooth by adding a gaussian filter (blur) but it still did not work, even though the swiss flag plot became very blurred.
After some time I came to the thought that maybe the step size that lmfit (or whatever is working in the background) uses is too small to produce any change in the Swiss flag. I would like to try to increase the step size to 1, but I don't know exactly how to do that.
This is my MWE (sorry, it is still quite long):
import numpy as np
import myplotlib as mpl # https://github.com/SengerM/myplotlib
import lmfit

def draw_swiss_flag(fig, center, side, **kwargs):
    fig.plot(
        np.array(2*[side] + 2*[side/2] + 2*[-side/2] + 2*[-side] + 2*[-side/2] + 2*[side/2] + 2*[side]) + center[0],
        np.array([0] + 2*[side/2] + 2*[side] + 2*[side/2] + 2*[-side/2] + 2*[-side] + 2*[-side/2] + [0]) + center[1],
        **kwargs,
    )

def swiss_flag(x, y, center: tuple, side: float):
    # x, y numpy arrays.
    if x.shape != y.shape:
        raise ValueError(f'<x> and <y> must have the same shape!')
    flag = np.zeros(x.shape)
    flag[(center[0]-side/2<x)&(x<center[0]+side/2)&(center[1]-side<y)&(y<center[1]+side)] = 1
    flag[(center[1]-side/2<y)&(y<center[1]+side/2)&(center[0]-side<x)&(x<center[0]+side)] = 1
    return flag

def gaussian_2d(x, y, center, side):
    return np.exp(-(x-center[0])**2/side**2-(y-center[1])**2/side**2)

def wrapper_for_lmfit(x, x_pixels, y_pixels, function_2D_to_wrap, *params):
    pixel_number = x # This is the pixel number in the data array
    # x_pixels and y_pixels are the number of pixels that the image has. This is needed to make the mapping.
    if (pixel_number > x_pixels*y_pixels - 1).any():
        raise ValueError('pixel_number (x) > x_pixels*y_pixels - 1')
    x = np.array([int(p%x_pixels) for p in pixel_number])
    y = np.array([int(p/x_pixels) for p in pixel_number])
    return function_2D_to_wrap(x, y, *params)

data = np.genfromtxt('data.txt') # Read data
data -= data.min().min()
data = data/data.max().max()

guessed_center = (data.sum(axis=0).argmax()+11, data.sum(axis=1).argmax()+11) # I am adding 11 on purpose.
guessed_side = 19

model = lmfit.Model(lambda x, xc, yc, side: wrapper_for_lmfit(x, data.shape[1], data.shape[0], swiss_flag, (xc,yc), side))
params = model.make_params()
params['xc'].set(value = guessed_center[0], min = 0, max = data.shape[1])
params['yc'].set(value = guessed_center[1], min = 0, max = data.shape[0])
params['side'].set(value = guessed_side, min = 0)
fit_results = model.fit(data.ravel(), params, x = [i for i in range(len(data.ravel()))])

mpl.manager.set_plotting_package('matplotlib')

fit_plot = mpl.manager.new(
    title = 'Data vs fit',
    aspect = 'equal',
)
fit_plot.colormap(data)
draw_swiss_flag(fit_plot, guessed_center, guessed_side, label = 'Guessed')
draw_swiss_flag(fit_plot, (fit_results.params['xc'],fit_results.params['yc']), fit_results.params['side'], label = 'Fitted')

swiss_flag_plot = mpl.manager.new(
    title = 'Swiss flag plot',
    aspect = 'equal',
)
xx, yy = np.meshgrid(np.array([i for i in range(data.shape[1])]), np.array([i for i in range(data.shape[0])]))
swiss_flag_plot.colormap(
    z = swiss_flag(xx, yy, center = (fit_results.params['xc'],fit_results.params['yc']), side = fit_results.params['side']),
)

mpl.manager.show()
and this is the content of data.txt.
It seems your code is all fine. The issue is, as you already guessed, that the algorithm used by lmfit is not dealing well with non-smooth data.
By default lmfit uses a least-squares method. Let's change it to the 'differential_evolution' method instead.
params['side'].set(value=guessed_side, min=0, max=len(data))
fit_results = model.fit(data.ravel(), params,
                        x=[i for i in range(len(data.ravel()))],
                        method='differential_evolution',
                        )
Note that I needed to add some finite value for the max value to prevent a "differential_evolution requires finite bound for all varying parameters" message.
After switching to the evolutionary algorithm, the fit now looks like this:
All the fitting algorithms in lmfit (and scipy.optimize, for that matter), including the "global optimizers", really work on continuous variables (double precision). When trying to find the optimal parameter values, most of the algorithms will make a very small step (at the ~1.e-7 level) in the value to determine the derivative, which will then be used to make the next guess of the optimal values.
The problem you're seeing is that your model function uses the parameter values as discrete values - as the index of an array, using int(). If a small change is made to the parameter value, no change in the result will be detected - the algorithm will decide that the fit result does not depend on small changes to that value.
The so-called "global solvers" like differential evolution, basin-hopping, shgo, take the view that the derivative approach can lead to "false minima" and so will "spray parameter space" with lots of candidate values and then use different strategies to refine the best of those results to find the optimal values. Generally speaking, these are much slower to run (OTOH runtime is cheap!) and very good for problems where there may be multiple "minima" and you really want to find the best of these, or where getting a decent guess of starting values is very hard.
For your problem, it is pretty clear that you can guess starting values (the center pixels must be on the image, say, so maybe guess "the middle"), and it seems likely from the image that there are not a lot of false minima that might be found. That means that the expense of a global solver might not be needed.
Another approach would be to allow your shaped object to be centered at any continuous position in the image, and not only at integer pixels. Of course, you do have to map that to the discrete image, but it doesn't need to be fully on/off. Using sigmoidal functions like scipy.special.erf() and erfc() will still give you a transition from "on" to "off", but with a small, finite width that bleeds into adjacent pixels. That would be enough to allow a fit to find a continuous (and so, sub-pixel!) value for the center position. In 1-d, that might look like:
from scipy.special import erf

def smoothed_window(x, edge1, edge2, width):
    return (erf((x-edge1)/width) + erf((edge2-x)/width))/2.0
For integer x values, a width of 0.5 (that is, half a pixel) will almost certainly allow a fit to find sub-integer values for edge1 and edge2. (Aside: either force the width parameter to be fixed or force it to be positive, either in the code or at the Parameter level.)
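At the Parameter level, that constraint might look like the following sketch (assuming your model has a parameter named 'width'; that name is illustrative, not from the code above):

params['width'].set(value=0.5, vary=False)   # fix the width at half a pixel
# or keep it free but strictly positive:
params['width'].set(value=0.5, min=1e-3)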
I have not tried to extend that to your more complicated "swiss flag" function, but it should be possible and also work for fitting center values.
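For what it's worth, here is one hedged sketch of how the smoothed-window idea could extend to the Swiss-flag shape. The geometry mirrors the question's swiss_flag (a cross built from two rectangles); smoothed_swiss_flag and its width argument are illustrative names, not part of the original code:

import numpy as np
from scipy.special import erf

def smoothed_window(x, edge1, edge2, width):
    # Smooth 0 -> 1 -> 0 window between edge1 and edge2, with soft edges of ~width.
    return (erf((x - edge1)/width) + erf((edge2 - x)/width))/2.0

def smoothed_swiss_flag(x, y, xc, yc, side, width=0.5):
    # Vertical bar: half-width side/2 in x, half-height side in y (same limits as swiss_flag).
    vertical = smoothed_window(x, xc - side/2, xc + side/2, width) * smoothed_window(y, yc - side, yc + side, width)
    # Horizontal bar: the transposed rectangle.
    horizontal = smoothed_window(x, xc - side, xc + side, width) * smoothed_window(y, yc - side/2, yc + side/2, width)
    # Union of the two bars; taking the maximum avoids double-counting the overlap.
    return np.maximum(vertical, horizontal)

Because every pixel now responds, if only slightly, to sub-pixel changes in xc, yc and side, the default least-squares fit has a usable gradient to follow.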

For loops to iterate through columns of a csv

I'm very new to python and programming in general (This is my first programming language, I started about a month ago).
I have a CSV file with data ordered like this (CSV file data at the bottom). There are 31 columns of data. The first column (wavelength) must be read in as the independent variable (x) and for the first iteration, it must read in the second column (i.e. the first column labelled as "observation") as the dependent variable (y). I am then trying to fit a Gaussian+line model to the data and extracting the value of the mean of the Gaussian (mu) from the data which should be stored in an array for further analysis. This process should be repeated for each set of observations, whilst the x values read in must stay the same (i.e. from the Wavelength column)
Here is the code for how I am currently reading in the data:
import numpy as np #importing necessary packages
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
from scipy.optimize import curve_fit
e=np.exp
spectral_data=np.loadtxt(r'C:/Users/Sidharth/Documents/Computing Labs/Project 1/Halpha_spectral_data.csv', delimiter=',', skiprows=2) #importing data file
print(spectral_data)
x=spectral_data[:,0] #selecting column 0 to be x-axis data
y=spectral_data[:,1] #selecting column 1 to be y-axis data
So I need to automate that process so that instead of having to change y=spectral_data[:,1] to y=spectral_data[:,2] manually all the way up to y=spectral_data[:,30] for each iteration, it can simply be automated.
My code for producing the Gaussian fit is as follows:
plt.scatter(x,y) #produce scatter plot
plt.title('Observation 1')
plt.ylabel('Intensity (arbitrary units)')
plt.xlabel('Wavelength (m)')
plt.plot(x,y,'*')
plt.plot(x,c+m*x,'-') #plots the fit
print('The slope and intercept of the regression is,', m,c)
m_best=m
c_best=c
def fit_gauss(x, a, mu, sig, m, c):
    gaus = a*sp.exp(-(x-mu)**2/(2*sig**2))
    line = m*x + c
    return gaus + line

initial_guess = [160, 7.1*10**-7, 0.2*10**-7, m_best, c_best]
po, po_cov = sp.optimize.curve_fit(fit_gauss, x, y, initial_guess)
The Gaussian seems to fit fine (as shown in the image of the plot) and so the mean value of this gaussian (i.e. the x-coordinate of its peak) is the value I must extract from it. The value of the mean is given in the console (denoted by mu):
The slope and intercept of the regression is, -731442221.6844947 616.0099144830941
The signal parameters are
Gaussian amplitude = 19.7 +/- 0.8
mu = 7.1e-07 +/- 2.1e-10
Gaussian width (sigma) = -0.0 +/- 0.0
and the background estimate is
m = 132654859.04 +/- 6439349.49
c = 40 +/- 5
So my questions are, how can I iterate the process of reading in data from the csv so that I don't have to manually change the column y takes data from, and then how do I store the value of mu from each iteration of the read-in so that I can do further analysis/calculations with that mean later?
My thoughts are I should use a for-loop but I'm not sure how to do it.
The orange line shown in the plot is a result of some code I tried earlier. I think it's irrelevant, which is why it isn't in the main part of the question, but if necessary, this is all it is:
x=spectral_data[:,0] #selecting column 0 to be x-axis data
y=spectral_data[:,1] #selecting column 1 to be y-axis data
plt.scatter(x,y) #produce scatter plot
plt.title('Observation 1')
plt.ylabel('Intensity (arbitrary units)')
plt.xlabel('Wavelength (m)')
plt.plot(x,y,'*')
plt.plot(x,c+m*x,'-') #plots the fit
Usually when you encounter a problem like this, try to break it into what has to be kept unchanged (in your example, the x data and the analysis code), what does have to be changed (the y data, or more specifically the index that tells the rest of the code which column is the right one for the y data), and which values you wish to keep further down the road.
Once you figure this out, we need to formalize the right loop and decide how to store the values we wish to keep. For the latter, an easy way is to store them in a list: we'll initialize an empty list and, at the end of each loop iteration, append the value to it.
mu_list = [] # will store our mu's in this list
for i in range(1, 31): # each iteration i gets a different value, starting with 1 and ending with 30 (and not 31)
    x = spectral_data[:, 0]
    y = spectral_data[:, i]

    # Your analysis and plot code here #

    mu = po[1] # Not sure po[1] is the right place where your mu is, please change it appropriately...
    mu_list.append(mu) # store mu at the end of our growing mu_list
And you will have a list of 30 mu's under mu_list.
Now, notice we don't have to do everything inside the loop. For example, x is the same regardless of i (loading x only once improves performance), and the analysis code is basically the same except for a different input (the y data), so we can define a function for it (good practice that makes bigger code much more readable). So most likely we can take them out of the loop: we can write x = spectral_data[:, 0] before the loop, and define a function which analyzes the data and returns mu:
def analyze(x, y):
    # Your analysis and plot code here #
    mu = po[1]
    return mu

x = spectral_data[:, 0]
mu_list = [] # will store our mu's in this list
for i in range(1, 31):
    y = spectral_data[:, i]
    mu_list.append(analyze(x, y)) # calculate mu with our function and store it at the end of our growing mu_list
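To make that concrete, here is a minimal sketch of what analyze() could look like using the question's own Gaussian-plus-line model. The starting guesses are generic placeholders (peak position from the maximum, width from the x range), not the asker's values, and spectral_data is the array loaded in the question; adjust both to your data:

import numpy as np
from scipy.optimize import curve_fit

def fit_gauss(x, a, mu, sig, m, c):
    # Gaussian peak on top of a straight-line background.
    return a*np.exp(-(x - mu)**2/(2*sig**2)) + m*x + c

def analyze(x, y):
    # Generic starting guesses: amplitude, peak position, rough width, flat background.
    guess = [y.max() - y.min(), x[np.argmax(y)], (x.max() - x.min())/10, 0.0, y.min()]
    po, po_cov = curve_fit(fit_gauss, x, y, p0=guess)
    return po[1]  # mu is the second fitted parameter

x = spectral_data[:, 0]
mu_list = [analyze(x, spectral_data[:, i]) for i in range(1, spectral_data.shape[1])]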

Having trouble plotting a log-log plot in python

Hey, so I'm trying to plot variables like age against frequency for a rotating body. I am given the period and period derivative, as well as their associated errors. Since frequency is related to period by:
f = 1/T
where frequency is f and period is T
then,
df = - (1/(T^2)) * dT
where dT and df are the derivatives of the period and frequency,
but when it comes to plotting the log of this, I can't do it in Python as it doesn't accept negative values for a log-log plot.
I've tried a work around of using only absolute values but then I only get half the errors when plotting error bars. Is there a way to make python plot both the negative and positive error bars? The frequency derivative itself is a negative quantity.
Unfortunately, the argument of log(x) cannot be negative, because log(x) = y <=> 10^y = x.
Is 10^y ever going to be -5?
It is impossible to make 10^y <= 0: as y goes to -infinity, x approaches 0 from above; it approaches, but never passes, 0.
Is it possible to plot log(x), where x is negative?
One simple solution to your problem, however, is to take the absolute value of df. By doing this, negative numbers become positive. The only downside is that after you've transformed the data this way, you will need to undo the transformation: if a number was negative (and turned positive due to abs(df)), then you must multiply it by -1 afterwards.
You may need to define your own absolute value function that records any values it needs to make positive:
changeList = []

def absRecordChanges(value):
    if value < 0:
        value = value * -1
        changeList.append(value)
    return value
There are other ways to solve the problem, but they are all centred around transforming your data to meet the conditions of a log transformation (x > 0), and having the data you changed recorded so you can change it back afterwards (before you plot it).
EDIT:
While fiddling around in desmos, I was able to plot log(x) where x is any integer. I used a piecewise function to do this: {x<0:-log(abs(x)),log (x)}.
from math import log

def piecewiseLog(x):
    if x < 0:
        return -log(abs(x))
    else:
        return log(x)
As I'm not familiar with matlab syntax, this link has an alternative solution: http://www.mathworks.com/matlabcentral/answers/31566-display-negative-values-on-logarithmic-graph
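In matplotlib itself, one hedged sketch of the absolute-value approach is below; the arrays are made-up placeholder data, not the asker's values, and matplotlib's 'symlog' axis scale is an alternative if you want to keep the negative values themselves:

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: negative frequency derivatives with symmetric errors.
age = np.array([1e3, 1e5, 1e7])
fdot = np.array([-1e-13, -5e-14, -2e-15])
fdot_err = np.array([2e-14, 1e-14, 4e-16])

fig, ax = plt.subplots()
ax.errorbar(age, np.abs(fdot), yerr=fdot_err, fmt='o')  # plot |df|, keep both error bars
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('age')
ax.set_ylabel('|df| (sign flipped for plotting)')
plt.show()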

Fast 3D interpolation of atmospheric data in Numpy/Scipy

I am trying to interpolate 3D atmospheric data from one vertical coordinate to another using Numpy/Scipy. For example, I have cubes of temperature and relative humidity, both of which are on constant, regular pressure surfaces. I want to interpolate the relative humidity to constant temperature surface(s).
The exact problem I am trying to solve has been asked previously here, however, the solution there is very slow. In my case, I have approximately 3M points in my cube (30x321x321), and that method takes around 4 minutes to operate on one set of data.
That post is nearly 5 years old. Do newer versions of Numpy/Scipy perhaps have methods that handle this faster? Maybe new sets of eyes looking at the problem have a better approach? I'm open to suggestions.
EDIT:
Slow = 4 minutes for one set of data cubes. I'm not sure how else I can quantify it.
The code being used...
def interpLevel(grid, value, data, interp='linear'):
    """
    Interpolate 3d data to a common z coordinate.

    Can be used to calculate the wind/pv/whatsoever values for a common
    potential temperature / pressure level.

    grid : numpy.ndarray
        The grid. For example the potential temperature values for the whole 3d
        grid.
    value : float
        The common value in the grid, to which the data shall be interpolated.
        For example, 350.0
    data : numpy.ndarray
        The data which shall be interpolated. For example, the PV values for
        the whole 3d grid.
    interp : str
        This indicates which kind of interpolation will be done. It is directly
        passed on to scipy.interpolate.interp1d().

    returns : numpy.ndarray
        A 2d array containing the *data* values at *value*.
    """
    ret = np.zeros_like(data[0,:,:])
    for yIdx in xrange(grid.shape[1]):
        for xIdx in xrange(grid.shape[2]):
            # check if we need to flip the column
            if grid[0,yIdx,xIdx] > grid[-1,yIdx,xIdx]:
                ind = -1
            else:
                ind = 1
            f = interpolate.interp1d(grid[::ind,yIdx,xIdx],
                                     data[::ind,yIdx,xIdx],
                                     kind=interp)
            ret[yIdx,xIdx] = f(value)
    return ret
EDIT 2:
I could share npy dumps of sample data, if anyone was interested enough to see what I am working with.
Since this is atmospheric data, I imagine that your grid does not have uniform spacing; however if your grid is rectilinear (such that each vertical column has the same set of z-coordinates) then you have some options.
For instance, if you only need linear interpolation (say for a simple visualization), you can just do something like:
# Find nearest grid point
idx = grid[:,0,0].searchsorted(value)
upper = grid[idx,0,0]
lower = grid[idx - 1, 0, 0]
s = (value - lower) / (upper - lower)
result = (1-s) * data[idx - 1, :, :] + s * data[idx, :, :]
(You'll need to add checks for value being out of range, of course.) For a grid of your size, this will be extremely fast (as in tiny fractions of a second).
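As one hedged example of such a check (clamping to the outermost pair of planes rather than extrapolating is an assumption about the behavior you want), the first line of the snippet above could become:

# Clamp so that idx-1 and idx always index valid planes.
idx = np.clip(grid[:, 0, 0].searchsorted(value), 1, grid.shape[0] - 1)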
You can pretty easily modify the above to perform cubic interpolation if need be; the challenge is in picking the correct weights for non-uniform vertical spacing.
The problem with using scipy.ndimage.map_coordinates is that, although it provides higher-order interpolation and can handle arbitrary sample points, it does assume that the input data are uniformly spaced. It will still produce smooth results, but it won't be a reliable approximation.
If your coordinate grid is not rectilinear, so that the z-value for a given index changes for different x and y indices, then the approach you are using now is probably the best you can get without a fair bit of analysis of your particular problem.
UPDATE:
One neat trick (again, assuming that each column has the same, not necessarily regular, coordinates) is to use interp1d to extract the weights, doing something like the following:
from scipy.interpolate import interp1d

NZ = grid.shape[0]
zs = grid[:,0,0]
ident = np.identity(NZ)
weight_func = interp1d(zs, ident, 'cubic')
You only need to do the above once per grid; you can even reuse weight_func as long as the vertical coordinates don't change.
When it comes time to interpolate then, weight_func(value) will give you the weights, which you can use to compute a single interpolated value at (x_idx, y_idx) with:
weights = weight_func(value)
interp_val = np.dot(data[:, x_idx, y_idx], weights)
If you want to compute a whole plane of interpolated values, you can use np.inner, although since your z-coordinate comes first, you'll need to do:
result = np.inner(data.T, weights).T
Again, the computation should be practically immediate.
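To make the pieces above concrete, here is a minimal sketch with synthetic arrays standing in for the temperature/humidity cubes (shapes chosen to match the question; the level spacing is deliberately non-uniform):

import numpy as np
from scipy.interpolate import interp1d

nz, ny, nx = 30, 321, 321
zs = 250.0 + 70.0*np.linspace(0, 1, nz)**1.5          # shared, non-uniform vertical coordinate
grid = zs[:, None, None]*np.ones((nz, ny, nx))        # e.g. the temperature cube
data = np.random.rand(nz, ny, nx)                     # e.g. the relative humidity cube

weight_func = interp1d(zs, np.identity(nz), 'cubic')  # one-time setup per grid
weights = weight_func(300.0)                          # weights for the 300 K surface
result = np.inner(data.T, weights).T                  # (ny, nx) plane of interpolated values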
This is quite an old question, but the best way to do this nowadays is to use MetPy's interpolate_1d function:
https://unidata.github.io/MetPy/latest/api/generated/metpy.interpolate.interpolate_1d.html
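Usage might look roughly like the sketch below; this is only a hedged illustration with tiny synthetic cubes, so check the linked documentation for the exact argument order and the handling of unsorted coordinates:

import numpy as np
from metpy.interpolate import interpolate_1d

# Synthetic stand-ins: interpolate rh to the 290 K temperature surface along axis 0.
temperature = np.linspace(250.0, 300.0, 10)[:, None, None]*np.ones((10, 4, 4))
rh = np.random.rand(10, 4, 4)

rh_on_290K = interpolate_1d(np.array([290.0]), temperature, rh, axis=0)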
There is a new implementation of Numba accelerated interpolation on regular grids in 1, 2, and 3 dimensions:
https://github.com/dbstein/fast_interp
Usage is as follows:
from fast_interp import interp2d
import numpy as np
nx = 50
ny = 37
xv, xh = np.linspace(0, 1, nx, endpoint=True, retstep=True)
yv, yh = np.linspace(0, 2*np.pi, ny, endpoint=False, retstep=True)
x, y = np.meshgrid(xv, yv, indexing='ij')
test_function = lambda x, y: np.exp(x)*np.exp(np.sin(y))
f = test_function(x, y)
test_x = -xh/2.0
test_y = 271.43
fa = test_function(test_x, test_y)
interpolater = interp2d([0,0], [1,2*np.pi], [xh,yh], f, k=5, p=[False,True], e=[1,0])
fe = interpolater(test_x, test_y)

Problems trying to calculate FWHM with scipy.interpolate

I am having problems trying to find the FWHM of some data. I initially tried to fit a curve using interpolate.interp1d. With this I was able to create a function that when I entered an x value it would return an interpolated y value. The issue is that I need the inverse of this functionality. In other words, I want to switch my independent and dependent variables. When I try to switch them, I get errors because the independent data has to be sorted. If I sort the data, I will lose the indexes, and therefore lose the shape of my graph.
I tried:
x = np.linspace(0, line.shape[0], line.shape[0])
self.x_curve = interpolate.interp1d(x, y, 'linear')
where y is my data.
To get the inverse, I tried:
self.x_curve = interpolate.interp1d(sorted(y), x, 'linear')
but the values are off.
I then moved on and tried to use UnivariateSpline and get the roots to find the FWHM (from this question here: Finding the full width half maximum of a peak), but the roots() method keeps giving me an empty list [].
This is what I used:
x_curve = interpolate.UnivariateSpline(x, y)
r = x_curve.roots()
print(r)
Here is an image of the data (with the UnivariateSpline):
Any ideas? Thanks.
Using UnivariateSpline.roots() to get FWHM will only work if you shift the data so that its value is 0 at FWHM.
Seeing that the background of the data is noisy, I'd first estimate the baseline. For example:
y_baseline = y[(x<200) | (x>350)].mean()
(adjust the limits for x as you see fit). Then shift the data so that the middle of the baseline and the peak is at 0. Seeing that your data has a minimum and not a maximum as in the example, I'm using y.min():
y_shifted = y - (y.min()+y_baseline)/2.0
Now fit a spline to this shifted data and roots() should be able to find the roots, the difference of which is the FWHM.
x_curve = interpolate.UnivariateSpline(x, y_shifted, s=0)
x_curve.roots()
Increase the s parameter if you want to estimate the FWHM from smoothed data.
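Putting the steps together, a minimal sketch (assuming x and y are the question's data, sorted by x, and the feature is a single dip):

import numpy as np
from scipy import interpolate

y_baseline = y[(x < 200) | (x > 350)].mean()        # noisy background level outside the feature
y_shifted = y - (y.min() + y_baseline)/2.0          # put the half-depth level at zero

x_curve = interpolate.UnivariateSpline(x, y_shifted, s=0)
r = x_curve.roots()
# With clean data there should be exactly two crossings around the dip; noise can add
# spurious roots near the baseline, so inspect r before trusting the difference.
if len(r) == 2:
    fwhm = abs(r[1] - r[0])
    print(fwhm)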
