Restricting values for curve_fit (scipy.optimize) - python

I'm trying to fit a logistic growth curve to my data with curve_fit, using the following function as the input:
import numpy as np

def logistic(x, y0, k, d, a, b):
    if b > 0 and a > 0:
        y = (k * pow(1 + np.exp(d - (a * b * x)), (-1 / b))) + y0
    elif b >= -1 or b < 0 or a < 0:
        y = (k * pow(1 - np.exp(d - (a * b * x)), (-1 / b))) + y0
    return y
As you can see, the function I am using has some restrictions on the values it can accept for the parameters a and b. Any guess on how to handle the incorrect values? Should the input function raise an exception or return a dummy value?
Thanks in advance.

When the parameters fall outside the admissible range, return a wildly huge number (far from the data to be fitted). This will (hopefully) penalize that choice of parameters so much that curve_fit will settle on some other, admissible set of parameters as optimal:
def logistic(x, y0, k, d, a, b):
    if b > 0 and a > 0:
        y = (k * pow(1 + np.exp(d - (a * b * x)), (-1 / b))) + y0
    elif b >= -1 or b < 0 or a < 0:
        y = (k * pow(1 - np.exp(d - (a * b * x)), (-1 / b))) + y0
    else:
        y = 1e10
    return y
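For what it's worth, on SciPy 0.17+ curve_fit also accepts a bounds argument, which keeps the optimizer inside the admissible region without the penalty trick. A minimal sketch with made-up synthetic data (the starting values and bounds below are illustrative, not from the question):
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, y0, k, d, a, b):
    # admissible branch only; the bounds below enforce a > 0 and b > 0
    return (k * pow(1 + np.exp(d - (a * b * x)), (-1 / b))) + y0

xdata = np.linspace(0, 10, 50)
ydata = logistic(xdata, 1.0, 5.0, 2.0, 0.8, 1.2) + np.random.normal(0, 0.1, 50)

# lower bounds for (y0, k, d, a, b); tiny positive floors keep a and b > 0
popt, pcov = curve_fit(logistic, xdata, ydata, p0=[1, 5, 2, 1, 1],
                       bounds=([-np.inf, 0, -np.inf, 1e-9, 1e-9], np.inf))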

Related

How to find 2 parameters with gradient descent method in Python?

I have a few lines of code which don't converge. If anyone has an idea why, I would greatly appreciate it. The original equation is written in def f(x, y, b, m), and I need to find the parameters b and m.
import numpy as np

np.random.seed(42)
x = np.random.normal(0, 5, 100)
y = 50 + 2 * x + np.random.normal(0, 2, len(x))

def f(x, y, b, m):
    return (1 / len(x)) * np.sum((y - (b + m * x))**2)  # it is supposed to be a sum operator

def dfb(x, y, b, m):  # partial derivative with respect to b
    return b - m * np.mean(x) + np.mean(y)

def dfm(x, y, b, m):  # partial derivative with respect to m
    return np.sum(x * y - b * x - m * x**2)

b0 = np.mean(y)
m0 = 0
alpha = 0.0001
beta = 0.0001
epsilon = 0.01

while True:
    b = b0 - alpha * dfb(x, y, b0, m0)
    m = m0 - alpha * dfm(x, y, b0, m0)
    if np.sum(np.abs(m - m0)) <= epsilon and np.sum(np.abs(b - b0)) <= epsilon:
        break
    else:
        m0 = m
        b0 = b

print(m, f(x, y, b, m))
Both derivatives got some signs mixed up:
def dfb(x, y, b, m):  # partial derivative with respect to b
    # return b - m*np.mean(x)+np.mean(y)
    #          ^             ^------ these signs are incorrect
    return b + m * np.mean(x) - np.mean(y)

def dfm(x, y, b, m):  # partial derivative with respect to m
    #      v------ this should be negative
    return -np.sum(x*y - b*x - m*x**2)
In fact, these derivatives are still missing some constants:
dfb should be multiplied by 2
dfm should be multiplied by 2/len(x)
I imagine that's not too bad because the gradient is scaled by alpha anyway, but it could make the speed of convergence worse.
If you do use the correct derivatives, your code will converge after one iteration:
def dfb(x, y, b, m):  # partial derivative with respect to b
    return 2 * (b + m * np.mean(x) - np.mean(y))

def dfm(x, y, b, m):  # partial derivative with respect to m
    # Used `mean` here since (2/len(x)) * np.sum(...)
    # is the same as 2 * np.mean(...)
    return -2 * np.mean(x * y - b * x - m * x**2)
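As a sanity check (my addition, not part of the original answer), this particular f has a closed-form minimizer, the ordinary least squares solution, so you can verify what the corrected descent should converge to:
import numpy as np

np.random.seed(42)
x = np.random.normal(0, 5, 100)
y = 50 + 2 * x + np.random.normal(0, 2, len(x))

# OLS closed form: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m_star = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b_star = y.mean() - m_star * x.mean()
print(b_star, m_star)  # should come out close to 50 and 2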

Similar algorithm implementation producing different results

I'm trying to understand the Karatsuba multiplication algorithm. I've written the following code:
def karatsuba_multiply(x, y):
    # split x and y
    len_x = len(str(x))
    len_y = len(str(y))
    if len_x == 1 or len_y == 1:
        return x * y
    n = max(len_x, len_y)
    n_half = 10**(n // 2)
    a = x // n_half
    b = x % n_half
    c = y // n_half
    d = y % n_half
    ac = karatsuba_multiply(a, c)
    bd = karatsuba_multiply(b, d)
    ad_plus_bc = karatsuba_multiply((a + b), (c + d)) - ac - bd
    return (10**n * ac) + (n_half * ad_plus_bc) + bd
This test case does not work:
print(karatsuba_multiply(1234, 5678))  ## returns 11686652, should be 7006652
But if I use the following code from this answer, the test case produces the correct answer:
def karat(x, y):
    if len(str(x)) == 1 or len(str(y)) == 1:
        return x * y
    else:
        m = max(len(str(x)), len(str(y)))
        m2 = m // 2
        a = x // 10**(m2)
        b = x % 10**(m2)
        c = y // 10**(m2)
        d = y % 10**(m2)
        z0 = karat(b, d)
        z1 = karat((a + b), (c + d))
        z2 = karat(a, c)
        return (z2 * 10**(2 * m2)) + ((z1 - z2 - z0) * 10**(m2)) + (z0)
Both functions look like they're doing the same thing. Why doesn't mine work?
It seems that in your karatsuba_multiply implementation you don't use the correct formula in the last return.
In the original karat implementation, the value m2 = m // 2 is multiplied by 2 in the last return, (z2 * 10**(2*m2)) + ((z1 - z2 - z0) * 10**(m2)) + (z0) — note the (2*m2).
So I think you either need to add a new variable, as below, where n2 = n // 2, so that you can multiply it by 2 in the last return, or use the original implementation.
Hope it helps :)
EDIT: This is explained by the fact that n (i.e. 2 * n // 2) is different from 2 * (n // 2) when n is odd.
def karatsuba_multiply(x, y):
    # split x and y
    len_x = len(str(x))
    len_y = len(str(y))
    if len_x == 1 or len_y == 1:
        return x * y
    n = max(len_x, len_y)
    n_half = 10**(n // 2)
    n2 = n // 2
    a = x // n_half
    b = x % n_half
    c = y // n_half
    d = y % n_half
    ac = karatsuba_multiply(a, c)
    bd = karatsuba_multiply(b, d)
    ad_plus_bc = karatsuba_multiply((a + b), (c + d)) - ac - bd
    return (10**(2 * n2) * ac) + (n_half * ad_plus_bc) + bd
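A quick check (my addition) that the fixed version now reproduces the failing test case from the question:
print(karatsuba_multiply(1234, 5678))  # 7006652, which matches 1234 * 5678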

Making numeric double integration more efficient

So I've made a simple program for numerically approximating a double integral, which accepts that the bounds of the inner integral can be functions:
import time
import numpy as np

def double_integral(func, limits, res=1000):
    t = time.clock()
    t1 = time.clock()
    t2 = time.clock()
    s = 0
    a, b = limits[0], limits[1]
    outer_values = np.linspace(a, b, res)
    c_is_func = callable(limits[2])
    d_is_func = callable(limits[3])
    for y in outer_values:
        if c_is_func:
            c = limits[2](y)
        else:
            c = limits[2]
        if d_is_func:
            d = limits[3](y)
        else:
            d = limits[3]
        dA = ((b - a) / res) * ((d - c) / res)
        inner_values = np.linspace(c, d, res)
        for x in inner_values:
            t2 = time.clock() - t2
            s += func(x, y) * dA
        t1 = time.clock() - t1
    t = time.clock() - t
    return s, t, t1 / res, t2 / res**2
This is, however, terribly slow. When res=1000, so that the integral is a sum of a million parts, it takes about 5 seconds to run, but in my experience the answer is only correct to about the 3rd decimal. Is there any way to speed this up?
The code I am running to check the integral is:
def f(x, y):
    if (4 - y**2 - x**2) < 0:
        return 0  # this is to avoid taking the root of negative numbers
    return np.sqrt(4 - y**2 - x**2)

def c(y):
    return np.sqrt(2 * y - y**2)

def d(y):
    return np.sqrt(4 - y**2)

#  b  d
#  S  S  f(x,y) dx dy
#  a  c

a, b = 0, 2
print(double_integral(f, [a, b, c, d]))
The integral is equal to 16/9.
Edit
So I got a great answer over at Code Review, but I am still baffled by how scipy.integrate.dblquad seems to give me the wrong answer (see comment). Does anyone have an answer for this?
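Not from the thread, but one pitfall worth ruling out before blaming dblquad: scipy.integrate.dblquad(func, a, b, gfun, hfun) integrates func(y, x), i.e. the first argument of the integrand is the inner variable, while the bounds gfun/hfun are functions of the outer one. With f, c, and d as defined above (x inner, y outer), a call consistent with that convention would be a sketch like:
from scipy import integrate

# the outer variable runs from 0 to 2; the inner bounds are c(outer) and d(outer);
# f(x, y) already takes the inner variable first, so it can be passed directly
result, abserr = integrate.dblquad(f, 0, 2, c, d)
print(result)  # should land near 16/9 if the setup above is right
Getting the argument order backwards (outer variable first) silently integrates a different function, which is a common way to get a plausible-looking but wrong number out of dblquad.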

Way to solve constraint satisfaction faster than brute force?

I have a CSV that provides a y value for three different x values for each row. When read into a pandas DataFrame, it looks like this:
       5     10     20
0  -13.6  -10.7  -10.3
1  -14.1  -11.2  -10.8
2  -12.3   -9.4   -9.0
That is, for row 0, at 5 the value is -13.6, at 10 the value is -10.7, and at 20 the value is -10.3. These values are the result of an algorithm in the form:
def calc(x, r, b, c, d):
    if x < 10:
        y = (x * r + b) / x
    elif x >= 10 and x < 20:
        y = ((x * r) + (b - c)) / x
    else:
        y = ((x * r) + (b - d)) / x
    return y
I want to find the value of r, b, c, and d for each row. I know certain things about each of the values. For example, for each row: r is in np.arange(-.05, -.11, -.01), b is in np.arange(0, -20.05, -.05), and c and d are in np.arange(0, 85, 5). I also know that d <= c.
Currently, I am solving this with brute force. For each row, I iterate through every combination of r, b, c, and d and test if the value at the three x values is equal to the known value from the DataFrame. This works, giving me a few combinations for each row that are basically the same except for rounding differences.
The problem is that this approach takes a long time when I need to run it against 2,000+ rows. My question is: is there a faster way than iterating and testing every combination? My understanding is that this is a constraint satisfaction problem but, after that, I have no idea what to narrow in on; there are so many types of constraint satisfaction problems (it seems) that I'm still lost (I'm not even certain that this is such a problem!). Any help in pointing me in the right direction would be greatly appreciated.
I hope I understood the task correctly.
If you know the resolution/discretization of the parameters, it looks like a discrete-optimization problem (in general: hard), which could be solved by CP approaches.
But if you allow these values to be continuous, the formulas can be reformulated so that, for fixed x, each observation is linear in the parameters (e.g. (x*r + b)/x = r + b/x), and the problem is then:
(1) a Linear Program: if checking for feasible values (there needs to be a valid solution)
(2) a Linear Program: if optimizing parameters to minimize the sum of absolute differences (= errors)
(3) a Quadratic Program: if optimizing parameters to minimize the sum of squared differences (= errors), which is equivalent to minimizing the Euclidean norm
All three versions can be solved efficiently!
Here is a non-general (but easily generalized) implementation of (3), using cvxpy to formulate the problem and ecos to solve the QP. Both tools are open source.
Code
import numpy as np
import time
from cvxpy import *
from random import uniform

""" GENERATE TEST DATA """
def sample_params():
    while True:
        r = uniform(-0.11, -0.05)
        b = uniform(-20.05, 0)
        c = uniform(0, 85)
        d = uniform(0, 85)
        if d <= c:
            return r, b, c, d

def calc(x, r, b, c, d):
    if x < 10:
        y = (x * r + b) / x
    elif x >= 10 and x < 20:
        y = ((x * r) + (b - c)) / x
    else:
        y = ((x * r) + (b - d)) / x
    return y

N = 2000
sampled_params = [sample_params() for i in range(N)]
data_5 = np.array([calc(5, *sampled_params[i]) for i in range(N)])
data_10 = np.array([calc(10, *sampled_params[i]) for i in range(N)])
data_20 = np.array([calc(20, *sampled_params[i]) for i in range(N)])
data = np.empty((N, 3))
for i in range(N):
    data[i, :] = [data_5[i], data_10[i], data_20[i]]

""" SOLVER """
def solve(row):
    """ vars """
    R = Variable(1)
    B = Variable(1)
    C = Variable(1)
    D = Variable(1)
    E = Variable(3)

    """ constraints """
    constraints = []
    # bounds
    constraints.append(R >= -.11)
    constraints.append(R <= -.05)
    constraints.append(B >= -20.05)
    constraints.append(B <= 0.0)
    constraints.append(C >= 0.0)
    constraints.append(C <= 85.0)
    constraints.append(D >= 0.0)
    constraints.append(D <= 85.0)
    constraints.append(D <= C)
    # formula of model (note: each row includes the + R term, per the b/x + r form)
    constraints.append((1.0 / 5.0) * B + R == row[0] + E[0])                      # alternate function form: b/x + r
    constraints.append((1.0 / 10.0) * B - (1.0 / 10.0) * C + R == row[1] + E[1])  # alternate function form: b/x - c/x + r
    constraints.append((1.0 / 20.0) * B - (1.0 / 20.0) * D + R == row[2] + E[2])  # alternate function form: b/x - d/x + r

    """ Objective """
    objective = Minimize(norm(E, 2))

    """ Solve """
    problem = Problem(objective, constraints)
    problem.solve(solver=ECOS, verbose=False)

    return R.value, B.value, C.value, D.value, E.value

start = time.time()
for i in range(N):
    r, b, c, d, e = solve(data[i])
end = time.time()
print('seconds taken: ', end - start)
print('seconds per row: ', (end - start) / N)
Output
('seconds taken: ', 20.620506048202515)
('seconds per row: ', 0.010310253024101258)
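A side note of mine, not the answerer's: the snippet above uses the old cvxpy 0.x star-import API. Under cvxpy 1.x the same per-row QP would look roughly like this (a sketch; it assumes the ECOS solver is installed):
import cvxpy as cp

def solve_row(row):
    R, B, C, D = cp.Variable(), cp.Variable(), cp.Variable(), cp.Variable()
    E = cp.Variable(3)  # residuals
    constraints = [R >= -0.11, R <= -0.05,
                   B >= -20.05, B <= 0.0,
                   C >= 0.0, C <= 85.0,
                   D >= 0.0, D <= 85.0,
                   D <= C,
                   R + B / 5.0 == row[0] + E[0],
                   R + (B - C) / 10.0 == row[1] + E[1],
                   R + (B - D) / 20.0 == row[2] + E[2]]
    problem = cp.Problem(cp.Minimize(cp.norm(E, 2)), constraints)
    problem.solve(solver=cp.ECOS)
    return R.value, B.value, C.value, D.value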

Fitting piecewise function: running into NaN problems due to boolean array indexing

So I've been trying to fit an exponentially modified Gaussian function (if interested: https://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution)
import numpy as np
import scipy.optimize as sio
import scipy.special as sps

def exp_gaussian(x, h, u, c, t):
    z = 1.0 / np.sqrt(2.0) * (c / t - (x - u) / c)  # not important
    k1 = k2 = h * c / t * np.sqrt(np.pi / 2)        # not important
    n1 = 1 / 2 * (c / t)**2 - (x - u) / t           # not important
    n2 = -1 / 2 * ((x - u) / c)**2                  # not important
    y = np.zeros(len(x))
    y += (k1 * np.exp(n1) * sps.erfc(z)) * (z < 0)
    y += (k2 * np.exp(n2) * sps.erfcx(z)) * (z >= 0)
    return y
In order to prevent overflow problems, one of two equivalent forms must be used, depending on whether z is positive or negative (see "Alternative forms for computation" on the Wikipedia page above).
The problem I am having is this: the line y += (k2 * np.exp(n2) * sps.erfcx(z)) * (z >= 0) is only supposed to add to y when z is positive. But if z is, say, -30, sps.erfcx(-30) is inf, and inf * False is NaN. Therefore, instead of leaving y untouched, the resulting y is cluttered with NaN. Example:
x = np.linspace(400, 1000, 1001)
y = exp_gaussian(x, 100, 400, 10, 5)
y
array([ 84.27384586, 86.04516723, 87.57518493, ..., nan,
nan, nan])
I tried replacing the line in question with the following:
y += numpy.nan_to_num((k2 * np.exp(n2) * sps.erfcx(z)) * (z >= 0))
But doing this ran into serious runtime issues. Is there a way to evaluate (k2 * np.exp(n2) * sps.erfcx(z)) only where (z >= 0)? Is there some other way to solve this without sacrificing runtime?
Thanks!
EDIT: After Rishi's advice, the following code seems to work much better:
def exp_gaussian(x, h, u, c, t):
    z = 1.0 / np.sqrt(2.0) * (c / t - (x - u) / c)
    k1 = k2 = h * c / t * np.sqrt(np.pi / 2)
    n1 = 1 / 2 * (c / t)**2 - (x - u) / t
    n2 = -1 / 2 * ((x - u) / c)**2
    return np.where(z >= 0, k2 * np.exp(n2) * sps.erfcx(z), k1 * np.exp(n1) * sps.erfc(z))
How about using numpy.where, with something like: np.where(z >= 0, sps.erfcx(z), sps.erfc(z))? I'm no numpy expert, so I don't know if it's efficient. Looks elegant, at least!
One thing you could do is create a mask and reuse it, so it wouldn't need to be evaluated twice. Another idea is to use nan_to_num only once, at the end:
mask = (z < 0)
y += (k1 * np.exp(n1) * sps.erfc(z)) * mask
y += (k2 * np.exp(n2) * sps.erfcx(z)) * (~mask)
y = np.nan_to_num(y)
Try and see if this helps...
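One more option (my addition, not from the thread): boolean-mask indexing evaluates each branch only on the elements where its condition holds, so erfcx is never called at large negative z and no NaN or inf is produced in the first place:
import numpy as np
import scipy.special as sps

def exp_gaussian(x, h, u, c, t):
    z = 1.0 / np.sqrt(2.0) * (c / t - (x - u) / c)
    k = h * c / t * np.sqrt(np.pi / 2)
    y = np.empty_like(z)
    neg = z < 0  # boolean mask, computed once and reused

    # z < 0 branch: erfc form, evaluated only on the masked subarray
    n1 = 0.5 * (c / t)**2 - (x[neg] - u) / t
    y[neg] = k * np.exp(n1) * sps.erfc(z[neg])

    # z >= 0 branch: scaled erfcx form, evaluated only where z >= 0
    n2 = -0.5 * ((x[~neg] - u) / c)**2
    y[~neg] = k * np.exp(n2) * sps.erfcx(z[~neg])
    return y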
