Fitting a function to data using pyminuit - python

I wrote a Python (2.7) program to evaluate some scientific data. Its main task is to fit this data to a certain function (1). Since this is quite a lot of data, the program distributes the jobs (= "fit one set of data") across several cores using multiprocessing. In a first attempt I implemented the fitting process using curve_fit from scipy.optimize, which works pretty well.
So far, so good. Then we saw that the data is more precisely described by a convolution of function (1) and a Gaussian distribution. The idea was to fit the data first to function (1), use the resulting parameters as guess values, and then fit the data again to the convolution. Since the data is quite noisy and I am trying to fit it to a convolution with seven parameters, the results this time were rather bad. In particular, the Gaussian parameters were to some extent physically impossible.
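Schematically, the two-stage fit looks like this (conv_model and the extra Gaussian guesses are placeholders here, not the actual definitions):

from scipy.optimize import curve_fit

# First pass: fit the simple function (1) to get starting values.
p_simple, _ = curve_fit(efunc, temps, pol, p0=guess_values)

# Second pass: reuse them, plus rough guesses for the Gaussian part,
# as the starting point for the seven-parameter convolution fit.
p0_conv = list(p_simple) + [amp_guess, mu_guess, sigma_guess]
p_conv, cov_conv = curve_fit(conv_model, temps, pol, p0=p0_conv)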
So I tried implementing the fitting process in PyMinuit, because it allows limiting the parameters to certain ranges (e.g. a positive amplitude). Since I have never worked with Minuit before and want to start small, I rewrote the first ("easy") part of the fitting process. The code snippet doing the job looks like this (simplified):
import minuit
import numpy as np

# temps are the x-values
# pol are the y-values
def gettc(temps, pol, guess_values):
    try:
        efunc_chi2 = lambda a, b, c, d: np.sum((efunc(temps, a, b, c, d) - pol)**2)
        fit = minuit.Minuit(efunc_chi2)
        fit.values['a'] = guess_values[0]
        fit.values['b'] = guess_values[1]
        fit.values['c'] = guess_values[2]
        fit.values['d'] = guess_values[3]
        fit.fixed['d'] = True
        fit.maxcalls = 1000
        #fit.tol = 1000.0
        fit.migrad()
        param = fit.args
        popc = fit.covariance
    except minuit.MinuitError:
        return np.zeros(len(guess_values))
    return param, popc
Where efunc() is function (1). The parameter d is fixed since I don't use it at the moment.
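(For reference, the parameter bounding that motivated the switch would be set roughly like this, using PyMinuit's limits dictionary; the concrete bounds below are purely illustrative:)

# Illustrative only: constrain parameters to physically sensible ranges
# before minimizing.
fit.limits['a'] = (0.0, 10.0)   # e.g. keep the amplitude positive
fit.limits['b'] = (-5.0, 0.0)
fit.migrad()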
PyMinuit function reference
Finally, here comes the actual problem: When running the script, Minuit prints for almost every fit
VariableMetricBuilder: Tolerance is not sufficient - edm is 0.000370555 requested 1e-05 continue the minimization
to stdout, with different values for edm. The fit still works fine, but the printing slows the program down considerably. I tried increasing fit.tol, but there are a lot of datasets which return even higher edm values. Then I tried simply hiding the output of fit.migrad() using this solution, which actually works. And now something strange happens: somewhere in the middle of the program, the processes on all cores fail simultaneously. Not at the first fit, but in the middle of my whole dataset. The only thing I changed was
with suppress_stdout_stderr():
    fit.migrad()
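For reference, such a context manager typically redirects the underlying file descriptors along these lines (a sketch; the linked original may differ in details):

import os

class suppress_stdout_stderr(object):
    # Redirect the C-level stdout/stderr file descriptors to /dev/null
    # and restore them when the block is left.
    def __enter__(self):
        self.null_fds = [os.open(os.devnull, os.O_RDWR) for _ in range(2)]
        self.saved_fds = [os.dup(1), os.dup(2)]
        os.dup2(self.null_fds[0], 1)
        os.dup2(self.null_fds[1], 2)
        return self

    def __exit__(self, *args):
        os.dup2(self.saved_fds[0], 1)
        os.dup2(self.saved_fds[1], 2)
        for fd in self.null_fds + self.saved_fds:
            os.close(fd)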
I know this is quite a long introduction, but I think it helps to know the whole framework. I would be very grateful if someone has an idea of how to approach this problem.
Note: Function (1) is defined as
def efunc(x, a, b, c, d):
    if x < c:
        return a*np.exp(b*x**d)
    else:
        return a*np.exp(b*c**d)  # therefore constant

efunc = np.vectorize(efunc)


How is it possible to call functions in PuLP (Python) constraints?

Use case
This is just a simple example to illustrate how and why it is not working as expected.
There is a set of processes, each of which has a start and a finishing timestamp.
The start timestamp of a process must be after the finishing timestamp of its predecessor. So far, so good.
Consideration
Regarding constraints: Shouldn't it be possible to carry out more complex operations than arithmetic equations (e.g. queries and case distinctions)?
This is illustrated in the code below.
The standard formulation of the constraint works properly.
But it fails if you put a function call into the equation.
def func(p):
    if self.start_timestamps[p] >= self.end_timestamps[p-1]:
        return 1
    return 0

# constraint for precedences of processes
for process_idx in self.processes:
    if process_idx > 0:
        # works fine!
        model += self.start_timestamps[process_idx] >= self.end_timestamps[process_idx-1]
        # doesn't work, but should?!
        model += func(process_idx) == 1
Questions
Are there ways to resolve that via function call? (In a more complex case, for example, you would have to make different queries and iterations within the function.)
If it is not possible with PuLP, are there other OR libraries that can handle things like that?
Thanks!

Python SciPy ODE solver not converging

I'm trying to use SciPy's ode solver to plot the interaction in a 2D system of equations. I'm attempting to alter the parameters passed to the solver with the following block of code:
# define maximum number of iteration steps for ode solver iteration
m = 1     # power of iteration
N = 2**m  # number of steps
# setup a try-catch formulation to increase the number of steps as needed for solution to converge
while True:
    try:
        z = ode(stateEq).set_integrator("vode", nsteps=N, method='bdf', max_step=5e5)
        z.set_initial_value(x0, t0)
        for i in range(1, t.size):
            if i % 1e3 == 0:
                print 'still integrating...'
            x[i, :] = z.integrate(t[i])  # get one more value, add it to the array
            if not z.successful():
                raise RuntimeError("Could not integrate")
        break
    except:
        m += 1
        N = 2**m
        if m % 2 == 0:
            print 'increasing nsteps...'
            print 'nsteps = ', N
Running this never breaks the while loop. It keeps increasing the nsteps forever and the system never gets solved. If I don't put it in the while loop, the system gets solved, I think, because the solution gets plotted. Is the while loop necessary? Am I formulating the solver incorrectly?
The parameter nsteps regulates how many integration steps can be maximally performed during one sampling step (i.e., a call of z.integrate). Its default value is okay if your sampling step is sufficiently small to capture the dynamics. If you want to integrate over a huge time span in one large sampling step (e.g., to get rid of transient dynamics), the value can easily be too small.
The point of this parameter is to avoid problems arising from unexpectedly very long integrations. For example, if you want to perform a given integration for 100 values of a control parameter in a loop overnight, you do not want to find the next morning that No. 14 was pathological and is still running.
If this is not relevant to you, just set nsteps to a very high value and stop worrying about it. There is certainly no point in successively increasing nsteps; you are just performing the same calculations over and over again.
Running this never breaks the while loop. It keeps increasing the nsteps forever and the system never gets solved.
This suggests that you have a different problem than nsteps being exceeded, most likely that the problem is not well posed. Carefully read the error message produced by the integrator. I also recommend that you check your differential equations. It may help to look at the solutions until the integration fails to see what is going wrong, i.e., plot x after running this:
z = ode(stateEq)
z.set_integrator("vode", nsteps=1e10, method='bdf', max_step=5e5)
z.set_initial_value(x0, t0)
for i, time in enumerate(t):
    x[i, :] = z.integrate(time)
    if not z.successful():
        break
Your value for max_step is very high (this should not be higher than the time scale of your dynamics). Depending on your application, this may very well be reasonable, but then it suggests that you are working with large numbers. This in turn may mean that the default values of the parameters atol and first_step are not suited for your situation and you need to adjust them.
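For example (the numbers below are placeholders to be adapted to your scales, not recommendations):

# Pass tolerance and step-size options explicitly when choosing the integrator.
z.set_integrator("vode", method='bdf', nsteps=10**9,
                 atol=1e-3, first_step=1e-6, max_step=5e5)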

Proper method for combining Attention, Multi, Residual cell wrappers in 1.1.0-rc2

I am attempting to combine the following:
tf.contrib.rnn.AttentionCellWrapper
tf.contrib.rnn.MultiRNNCell
tf.contrib.rnn.ResidualWrapper
tf.contrib.rnn.LSTMCell
I am constructing the cell in the following manner:
cell = tf.contrib.rnn.AttentionCellWrapper(
    tf.contrib.rnn.MultiRNNCell([
        tf.contrib.rnn.ResidualWrapper(
            cell=tf.contrib.rnn.LSTMCell(dec_units))
        for _ in range(dec_layers)]),
    attn_length=attn_len)
This works fine if I keep attn_len small (1-2), but increasing attn_len to a larger value (5+) causes the script to hang indefinitely, with one CPU core pegged at 100% at the start of training (0 steps are completed).
Is this the appropriate way to combine these elements? Should I be overriding the defaults on the optional parameters?

Receiving SystemError: NULL result without error in PyObject_Call when using Python multiprocessing

I have searched around and there have been several questions on this issue, but none of them seem to fit my context.
I am working on an algorithm for image segmentation in Python, using the Anaconda distribution. In the early stages of the algorithm, a weight matrix must be calculated to give the weights of the edges between every pair of pixels in an image (using the pixels as nodes in a weighted graph data structure). Obviously, this results in a matrix of huge size, (imageWidth * imageHeight)**2. It takes a very long time to build for larger and larger images, so a while back I implemented a couple of functions to build the matrix using all CPUs on the machine.
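To give a feeling for the scale (rough arithmetic only), even a 64x64 test image already produces a sizeable dense matrix:

# Rough size of the dense weight matrix for a 64x64 image:
n_pixels = 64 * 64             # 4096 nodes
n_entries = n_pixels ** 2      # 16,777,216 edge weights
size_mb = n_entries * 8 / 1e6  # ~134 MB as float64; this grows with the
                               # fourth power of the image side length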
This has worked perfectly fine for me for weeks during testing. However, our algorithm has been steadily improving, and it is now ready to test on some real data (gathered from an x-ray microscope). Our tests have so far been no larger than 64x64, and that took quite a while, so we have moved to a machine with 40 processors.
Now that we have moved, all of a sudden I am getting an error in the multiprocessing function:
SystemError: NULL result without error in PyObject_Call
I don't understand this, because the only difference is the number of CPUs. Why should that cause an issue? If someone knows more about how Python's multiprocessing package actually works, maybe you can see something that I don't.
Here are the functions in question. The first initiates the process pool:
def CreateMatrix(self, sigmaI, sigmaX):
    cpus = mp.cpu_count()
    poolCount = cpus
    args = [(self, sigmaI, sigmaX, i,) for i in range(self.numPixels)]
    pool = mp.Pool(processes=poolCount)
    tempData = pool.map(unwrap_CreateMatrix, args)
    for pixelList in tempData:
        for pixel in pixelList:
            self.data[pixel[1]] = pixel[0]
    self.matrix = numpy.matrix(self.data.reshape(self.columns, self.rows), numpy.float64)
    return
The second one is called by each process (and has nothing to do with the creation or handling of the process pool, so I doubt the issue is there):
def CreateMatrixPixelA(self, sigmaI=1, sigmaX=1, i=0):
    pixelA = self.pixelArray[i]
    pixelAData = []
    j = 0
    for pixelB in self.pixelArray:
        stride = (i * self.numPixels) + j
        locationDiff = self.CalcLocationVectorNorm(pixelA, pixelB)
        if locationDiff < self.distance:
            intensityDiff = self.CalcPixelVectorNorm(pixelA, pixelB)
            locationDiff = -1 * pow(locationDiff, 2)
            intensityDiff = -1 * pow(intensityDiff, 2)
            value = math.exp(intensityDiff / sigmaI) * math.exp(locationDiff / sigmaX)
            pixelAData.append((value, stride))
        j += 1
    return pixelAData
Also this one, just to spare any confusion:
def unwrap_CreateMatrix(args):
    return WeightMatrix.CreateMatrixPixelA(*args)
Sorry for the big wall of text. I'm sure the answer to this is quite simple; I just don't know what kind of information will help, so I have included everything relevant. My only thoughts are that the image I'm using is too big (which I highly doubt, though), or that there may be an issue with the packages installed on this machine (though this machine is using Anaconda, just as all the other testing machines were).

Python OverflowError: math range error being raised differently in different runs

My program seems to be crashing almost arbitrarily.
My code includes these two lines:
z[i, j] = -math.exp(oneminusbeta[j, i])
weights[i,j] = math.exp(beta[j,i] + oneminusbeta[j,i])
I've run my whole code before on data that had 2 dimensions; it was 7341 x 648. I had no issues at all running that code.
But now the data I'm using is about ten times as big. It's 71678 x 648, and I'm getting this error:
OverflowError: math range error
And I'm not getting this on any specific point. I'm logging comments before every line of code so that I can see what's causing the crash, and it appears the crash is happening more often on the second line mentioned above (weights[i,j] = math.exp(beta[j,i] + oneminusbeta[j,i])).
The thing is, it crashes at different times.
At first, it crashed at weights[30816, 42]. Then at weights[55399, 43]. Then at z[33715,45]. But the data is the same in all 3 cases.
What could the problem be? Is this a memory related issue with python? I'm using Numba too, by the way.
Edit
I forgot to mention: I've put thresholds in place so that what goes into the exp() function doesn't exceed 709 or go below -708, so technically there shouldn't be an overflow.
The result of your calculation cannot be represented on your computer. This probably means that math.exp(...) is greater than about 10^308, or the argument passed to math.exp() is greater than about 710.
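For example (the exact boundary sits just below 710):

import math

math.exp(709)   # fine: roughly 8.2e307, still fits in a double
math.exp(710)   # raises OverflowError: math range error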
Try printing the values of beta[j,i] and oneminusbeta[j,i] before each calculation.
In fact, you don't have to print comments before every line of code. Instead, try wrapping the calculations with a try block, like so:
try:
    weights[i,j] = math.exp(beta[j,i] + oneminusbeta[j,i])
except OverflowError:
    print "Calculation failed! j=%d i=%d beta=%f oneminusbeta=%f" % (j, i, beta[j,i], oneminusbeta[j,i])
    raise
Your overflow is almost certainly a real overflow; one of your values is too large to fit in a Python float (meaning a C double).
So, why does it happen in different places each time?
Because you're using Numba.
Numba JIT-compiles your code. If it detects that there's no contention, it can reorder your code—or even, at least in theory, run it in parallel on multiple cores or on the GPU (although I think at present you only get GPU computation if you code it explicitly with numba.cuda).
At any rate, this means that the path through your code can be nondeterministic. If there's more than one place an overflow could happen, you can't predict which one will fail and trigger the exception.
At any rate, that doesn't really matter. If your calculations are overflowing, you have to fix that. And the fact that different ones overflow each time shouldn't make it that much harder to debug—especially given that it apparently usually happens in a single place, just not always.
