I'm trying to use scipy.optimize.curve_fit on a large latitude/longitude/time xarray, using dask.distributed as the computing backend.
The idea is to run an individual data fitting for every (latitude, longitude) using the time series.
All of this runs fine outside xarray/dask; I tested it by passing the time series of a single location as a pandas DataFrame. However, if I run the same process on the same (latitude, longitude) directly on the xarray object, curve_fit simply returns the initial parameters.
I am performing this operation using xr.apply_ufunc like so (here I'm providing only the code that is strictly relevant to the problem):
# imports needed for the snippets below
import numpy as np
import xarray as xr
from inspect import signature
from scipy.optimize import curve_fit

# function to perform the fit
def _fit_rti_curve(data, data_rti, fit, loc=False):
    fit_func, linearize, find_init_params = _get_fit_functions(fit)
    # remove nans
    x, y = _filter_nodata(data_rti, data)
    # remove outliers
    x, y = _filter_for_outliers(x, y, linearize=linearize)
    # find a first guess for the maximum achievable value
    yscale = np.max(y) * 1.05
    # find a first guess for the other parameters
    # here loc can be passed manually if you have a good estimate
    init_parms = find_init_params(x, y, yscale, loc=loc, linearize=linearize)
    # fit the curve and keep only the optimal parameters
    parms = curve_fit(fit_func, x, y, p0=init_parms, maxfev=10000)
    parms = parms[0]
    return parms
# shell around _fit_rti_curve
def find_rti_func_parms(data, rti, fit):
    # sort and fit the highest n values
    top_data = np.sort(data)
    top_data = top_data[-len(rti):]
    # convert to float64 if needed
    top_data = top_data.astype(np.float64)
    rti = rti.astype(np.float64)
    # run the fit
    parms = _fit_rti_curve(top_data, rti, fit, loc=0)  # TODO maybe allow a free loc
    return parms
# the apply_ufunc call
# `fit` is a string that selects the distribution type
# `rti` is an array of the x values
parms_data = xr.apply_ufunc(
    find_rti_func_parms,
    xr_obj,
    input_core_dims=[['time']],
    output_core_dims=[[fit + ' parameters']],
    output_sizes={fit + ' parameters': len(signature(fit_func).parameters) - 1},
    vectorize=True,
    kwargs={'rti': return_time_interval, 'fit': fit},
    dask='parallelized',
    output_dtypes=['float64'],
)
My guess would be that it is a problem related to threading, or at least to some shared memory space that is not properly passed between the workers and the scheduler.
However, I am just not knowledgeable enough to test this within dask.
Any idea on this problem?
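For reference, this is roughly how I compare the plain-numpy path with the xarray path for a single pixel (a sketch only; the coordinate names and indices are placeholders for my actual ones):

# Sketch: compare both code paths on one pixel; `latitude`/`longitude`
# and the indices (10, 20) are hypothetical placeholders.
series = xr_obj.isel(latitude=10, longitude=20).values
parms_direct = find_rti_func_parms(series, return_time_interval, fit)
parms_dask = parms_data.isel(latitude=10, longitude=20).compute()
print(parms_direct, parms_dask.values)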
You should have a look at this issue: https://github.com/pydata/xarray/issues/4300
I had the same problem and I solved it using apply_ufunc. It is not optimized, since it has to perform rechunking operations, but it works!
I've created a GitHub Gist for it: https://gist.github.com/clausmichele/8350e1f7f15e6828f29579914276de71
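For reference, the key step is making sure the core dimension lives in a single chunk; a minimal sketch (assuming xr_obj is the chunked DataArray from the question):

# Sketch: apply_ufunc with dask='parallelized' cannot split a core
# dimension across chunks, so rechunk 'time' into one chunk first.
xr_obj = xr_obj.chunk({'time': -1})  # -1 means a single chunk along 'time'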
This previous answer might be helpful: it uses numpy.polyfit, but I think the general approach should be similar.
Applying numpy.polyfit to xarray Dataset
Also, I haven't tried it, but xr.polyfit() just got merged recently! Could also be something to look into: http://xarray.pydata.org/en/stable/generated/xarray.DataArray.polyfit.html#xarray.DataArray.polyfit
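For what it's worth, a minimal sketch of what using it might look like (made-up example data; untested against the question's setup):

import numpy as np
import xarray as xr

# Hypothetical DataArray with a numeric 'time' coordinate.
da = xr.DataArray(
    np.random.rand(10, 4, 4),
    dims=('time', 'lat', 'lon'),
    coords={'time': np.arange(10)},
)
# Fit a degree-1 polynomial along 'time' for every (lat, lon) cell.
fit = da.polyfit(dim='time', deg=1)
print(fit.polyfit_coefficients)  # one set of coefficients per (lat, lon)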
Hello, I am a newbie at Python and coding for the most part, and I have 5 non-linear ordinary differential equations that I want to model and plot. I have the parameters that are given; my main issue has been setting the independent variables to be a function of z, as well as setting the 'S' parameters to be a function of time, since they vary depending on the time of year.
Edit: updated code below.
I've been able to get the code to run with fixed parameters. I now wonder how I could make these parameters change at different times. The parameters set in this code apply to a specific stretch of days during the year; they are not meant to be constant throughout. How could I make them time-dependent?
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
from math import e

def func(z, t):
    xh, xf, y, m, n = z
    v1, v2, v3 = 0.05, 0.06, 0.07
    B1, B2, B3 = 0.1984, 0.1593, 0.04959
    d1, d2, d3 = 0.02272, 0.02272, 0.2
    o1, o2 = 0.25, 0.75
    S1 = S2 = S3 = 0.005
    S4 = S5 = 0.3
    p = 0
    u = 500
    k = 0.000075
    a = 0.4784
    r = 0.0165
    K = 8000
    i = 2
    H = e**(-m*k)
    g = ((xh+xf)**i)/((K**i)+((xh+xf)**i))
    R = o1-(o2*(xf/(xh+xf+.002)))
    P1 = (xh+xf)/(xh+y+xf+.002)
    P2 = 1-((m+n)/(a*(xh+y+xf+.002)))
    P3 = y/(xh+y+xf+.002)
    dxhdt = (u*g*H)-(B1*(m*(xh/(xh+y+xf+.002))))-((d1+S1)*xh)-((v1*(m+n))*xh)-(xh*R)
    dxfdt = (xh*R)-(B1*(m*(xf/(xh+y+xf+.002))))-((p+d2+S2)*xf)-(v2*(m+n)*xf)
    dydt = (B1*(m*P1))-((d3+S3)*y)-((v3*(m+n))*y)
    dmdt = (r*(m*P2))+(B2*(n*P3))-(B3*(m*P1))-(S4*m)
    dndt = (r*(n*P2))-(B2*(n*P3))+(B3*(m*P1))-(S5*n)
    return [dxhdt, dxfdt, dydt, dmdt, dndt]

z0 = [13000, 11000, 0, 0, 0]
t = np.linspace(0, 100, 1000)
xx = odeint(func, z0, t)

plt.figure(1)
plt.plot(t, xx[:, 0], 'b-', label='xh')
plt.plot(t, xx[:, 1], 'y-', label='xf')
plt.plot(t, xx[:, 2], 'g-', label='y')
plt.plot(t, xx[:, 3], 'r-', label='m')
plt.plot(t, xx[:, 4], 'm-', label='n')
plt.legend()
plt.ylabel('POPULATION')
plt.xlabel('TIME')
plt.show()
I thought about creating two different functions and looping the plot. How do you make "days" a function of t? Just by declaring it? I get the error "TypeError: 'float' object cannot be interpreted as an integer".
z0 = [13000, 11000, 0, 0, 0]
t = np.linspace(0, 91.25, 1000)
xx = odeint(func, z0, t)
xy = odeint(func2, z0, t)

plt.figure(1)
for t in range(1, 91.25):  # range() only takes integers -> TypeError
    plt.plot(t, xx[:, 0], 'b-', label='$x_h$')
    plt.plot(t, xx[:, 1], 'y-', label='$x_f$')
    plt.plot(t, xx[:, 2], 'g-', label='y')
    plt.plot(t, xx[:, 3], 'r-', label='m')
    plt.plot(t, xx[:, 4], 'm-', label='n')
for t in range(91.25, 182.50):  # same problem here
    plt.plot(t, xy[:, 0], 'b-', label='$x_h$')
    plt.plot(t, xy[:, 1], 'y-', label='$x_f$')
    plt.plot(t, xy[:, 2], 'g-', label='y')
    plt.plot(t, xy[:, 3], 'r-', label='m')
    plt.plot(t, xy[:, 4], 'm-', label='n')
plt.legend()
plt.ylabel('POPULATION')
plt.xlabel('TIME')
plt.show()
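For reference, the TypeError comes from calling range() with floats. A minimal sketch of the plotting rewritten to use the time arrays directly (this assumes a second parameter set defined in func2, and restarts the second segment from the end state of the first):

# range() only accepts integers; use the float time arrays from
# np.linspace directly instead of iterating over them.
t1 = np.linspace(0, 91.25, 1000)
t2 = np.linspace(91.25, 182.50, 1000)

xx = odeint(func, z0, t1)
xy = odeint(func2, xx[-1], t2)  # continue from the final state of segment 1

labels = ['$x_h$', '$x_f$', 'y', 'm', 'n']
colors = ['b', 'y', 'g', 'r', 'm']
plt.figure(1)
for i, (lab, col) in enumerate(zip(labels, colors)):
    plt.plot(t1, xx[:, i], col + '-', label=lab)
    plt.plot(t2, xy[:, i], col + '-')
plt.legend()
plt.ylabel('POPULATION')
plt.xlabel('TIME')
plt.show()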
I get what you mean by an ODE, but please expand the abbreviation (ordinary differential equation) so that readers who are less comfortable with the mathematics can follow.
If you want these to be a function of z, then you must declare a function, say something(z), and assign the variables from it. This way, your values will change with respect to changes in z.
Also, as a matter of style, I don't recommend this many separate variable declarations. Abstract these as much as possible. As an alternative, you can declare similar variables on the same line, like
v1, v2, v3 = 0.5, 0.6, 0.7
etc. This will make it much more readable.
If you don't get a syntax error from the multiple assignments in the first line, I recommend changing each of these to be a function of z. Divide your bigger function into smaller chunks and make each of them a separate function, as sketched below. This way you can manipulate results directly and the code will be much more readable.
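Applying the same idea to the time-dependent parameters from the question, a minimal sketch (the switch dates here are made up; replace them with the real calendar):

def seasonal_S(t):
    """Return (S1, S2, S3, S4, S5) for model day t.

    Hypothetical schedule: one parameter set for the first quarter
    of the year and another for the rest; adjust as needed.
    """
    day = t % 365.0
    if day < 91.25:
        return 0.005, 0.005, 0.005, 0.3, 0.3
    return 0.01, 0.01, 0.01, 0.1, 0.1

# Inside func(z, t), replace the constant assignments with:
#     S1, S2, S3, S4, S5 = seasonal_S(t)
# so a single odeint call covers the whole year.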
You prefer the state vector to be composed as
xh, xf, y, m, n
This interpretation of the state vector then needs to be applied everywhere, which means that you have to change the first line of the ODE function to
xh, xf, y, m, n = z
Also check that your fractions are implemented as they were on paper; especially P1 appears suspicious. But without the origin of the equation I cannot say that it is wrong as it stands.
I am trying to solve a second-order ODE with solve_bvp. I have split the second-order ODE into a system of two first-order ODEs. I have a set of constants that change with the x (mesh) value, so I am passing these as an array of shape (N,) into my function numdens. While trying to run solve_bvp I get the error that the returns have different shapes, namely (N,) and (N-1,), and thus cannot be broadcast into one array. But when I check each return manually outside of the function, it has the shape (N,).
If I run the solver without my changing constants, I get a solution close to the right one.
import numpy as np
from scipy.integrate import solve_bvp, odeint
import matplotlib.pyplot as plt

E_0 = 1 * 0.0000016021773   # erg: g cm^2/s^2
m_H = 1.6*10**(-24)         # g
c = 3e11                    # cm
sigma_c = 2*10**(-23)
n_0 = 1*10**(20)            # 1/cm^3
v_0 = (2*E_0/m_H)**(0.5)    # cm/s
T = 10**7
b = 20.3
n_eq = b*T**3
n_s = 2.03*10**(19)
Q = 1

def velocity(v, x):
    dvdx = -sigma_c*n_0*v_0*((8*v_0*v - 7*v**2 - v_0**2)/(2*v*c))
    return dvdx

n_num = 100
x_num = np.linspace(-1*10**(6), 3*10**(6), n_num)
sol_velo = odeint(velocity, 0.999999999999*v_0, x_num)
sol_new = np.reshape(sol_velo, n_num)

def constants(v):
    D1 = (c*v/(3*n_0*v_0*sigma_c))
    D2 = ((v**2 - 8*v_0*v + v_0**2)/(6*v))
    D3 = sigma_c*n_0*v_0*((8*v_0*v - 7*v**2 - v_0**2)/(2*v*c))
    return D1, D2, D3

def numdens(x, y):
    v = sol_new
    D1, D2, D3 = constants(v)
    return np.vstack((y[1], (-D2*y[1] - D3*y[0] + Q*((1 - y[0])/n_eq))/D1))

def bc_num(ya, yb):
    return np.array([ya[0] - n_s, yb[0] - n_eq])

y_num = np.array([np.linspace(n_s, n_eq, n_num), np.linspace(n_s, n_eq, n_num)])
sol_num = solve_bvp(numdens, bc_num, x_num, y_num)

plt.plot(sol_num.x, sol_num.y[0], label='$n(x)$')
plt.plot(x_num, sol_velo - v_0/7, label='$v(x)$')
plt.yscale('log')
plt.grid(alpha=0.5)
plt.legend(framealpha=1)
plt.show()
You need to take into account that the BVP solver uses an adaptive mesh. That is, after refining the initial guess on the initial grid, the solver identifies regions with overly large errors and creates new mesh nodes there. As far as I have seen, the opposite is not implemented, even though in some applications it might be sensible to reduce the number of mesh nodes on especially "nice" segments.
Thus what you are doing in the numdens function makes no sense; it has to behave exactly like any other function that you would pass to an ODE solver. If I had to propose a quick fix, and without knowing the underlying problem you want to solve, I would change the assignment of v to
v = np.interp(x,x_num,sol_velo)
as that should at least produce an array of the correct format.
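Concretely, a minimal sketch of the repaired function, using the 1-D sol_new from the question, since np.interp expects 1-D arrays:

def numdens(x, y):
    # x is whatever mesh solve_bvp is currently using, so interpolate
    # the precomputed velocity onto it instead of assuming a fixed grid
    v = np.interp(x, x_num, sol_new)
    D1, D2, D3 = constants(v)
    return np.vstack((y[1], (-D2*y[1] - D3*y[0] + Q*((1 - y[0])/n_eq))/D1))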
I would like to use scipy's ODR to fit a curve to a set of variables with variances. In this case I am fitting a linear function with a fixed y-axis intercept (e.g. a*x + 100). Due to my inability to find an estimator (I asked about that here), I am using scipy.optimize's curve_fit to estimate the initial a value. Now, the function works perfectly without standard deviations, but when I add them the output makes no sense at all (the curve is far above all of the points). What could be the cause of such behaviour? Thanks!
The code is attached here:
import scipy.odr as SODR
from scipy.optimize import curve_fit

def fun(kier, arg):
    '''
    Function to fit.
    :param kier: list of parameters
    :param arg: argument of the function
    :return: value of the function at arg
    '''
    y = kier[0]*arg + 100  # + kier[1]
    return y

zx = [30.120348566300354, 36.218214083626386, 52.86998374096616]
zy = [83.47033171149137, 129.10207165602722, 85.59465198231146]
dx = [2.537935346025827, 4.918719773247683, 2.5477221183398977]
dy = [3.3729431749276837, 5.33696690247701, 2.0937213187876]
sx = [6.605581618516947, 8.221194790372632, 22.980577676739113]
sy = [1.0936584882351936, 0.7749999999999986, 20.915359045447914]
dx_total = [9.143516964542775, 13.139914563620316, 25.52829979507901]
dy_total = [4.466601663162877, 6.1119669024770085, 23.009080364235516]

# curve fitting to estimate the initial slope
popt, pcov = curve_fit(fun, zx, zy)

danesd = SODR.RealData(x=zx, y=zy, sx=dx_total, sy=dy_total)
model = SODR.Model(fun)
onbig = SODR.ODR(danesd, model, beta0=[popt[0]])
outputbig = onbig.run()
biga = outputbig.beta[0]
print(biga)

daned = SODR.RealData(x=zx, y=zy, sx=dx, sy=dy)
on = SODR.ODR(daned, model, beta0=[popt[0]])
outputd = on.run()
normala = outputd.beta[0]
print(normala)
The outputs are:
30.926925885047254 (this is the output with standard deviation)
-0.25132703671513873 (this is without standard deviation)
This makes no sense, as shown in the plot I attached (not reproduced here).
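A minimal sketch to regenerate the comparison, assuming the data and the two fitted slopes from the code above:

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(min(zx), max(zx), 200)
plt.errorbar(zx, zy, xerr=dx_total, yerr=dy_total, fmt='o', label='data')
plt.plot(xs, fun([biga], xs), label='ODR, total deviations')
plt.plot(xs, fun([normala], xs), label='ODR, plain deviations')
plt.legend()
plt.show()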
Also, I'd be happy to get any feedback on whether my code is clear and on the formatting of this question. I am still very new here.
In my model, I need to obtain the value of my deterministic variable from a set of parent variables using a complicated python function.
Is it possible to do that?
The following pyMC3 code shows what I am trying to do, in a simplified case.
import numpy as np
import pymc as pm

# Predefine values on a two-parameter grid (x, w) for a set of i values (1, 2, 3)
idata = np.array([1, 2, 3])
size = 20
gridlength = size*size
Grid = np.empty((gridlength, 2 + len(idata)))
for x in range(size):
    for w in range(size):
        # A silly version of my real model evaluated on the grid.
        Grid[x*size + w, :] = np.array([x, w] + [(x**i + w**i) for i in idata])

# A function to find the nearest value in Grid and return its product with a third variable z
def FindFromGrid(x, w, z):
    return Grid[int(x)*size + int(w), 2:] * z

# Generate fake Y data with error
yerror = np.random.normal(loc=0.0, scale=9.0, size=len(idata))
ydata = Grid[16*size + 12, 2:]*3.6 + yerror  # i.e. true x = 16, w = 12 and z = 3.6

with pm.Model() as model:
    # Priors
    x = pm.Uniform('x', lower=0, upper=size)
    w = pm.Uniform('w', lower=0, upper=size)
    z = pm.Uniform('z', lower=-5, upper=10)
    # Expected value
    y_hat = pm.Deterministic('y_hat', FindFromGrid(x, w, z))
    # Data likelihood
    ysigmas = np.ones(len(idata))*9.0
    y_like = pm.Normal('y_like', mu=y_hat, sd=ysigmas, observed=ydata)
    # Inference...
    start = pm.find_MAP()        # find a starting value by optimization
    step = pm.NUTS(state=start)  # instantiate the MCMC sampling algorithm
    trace = pm.sample(1000, step, start=start, progressbar=False)  # draw 1000 posterior samples

print('The trace plot')
fig = pm.traceplot(trace, lines={'x': 16, 'w': 12, 'z': 3.6})
fig.show()
When I run this code, I get an error at the y_hat stage, because the int() function inside FindFromGrid(x, w, z) needs an integer, not a FreeRV.
Finding y_hat from a pre-calculated grid is important because my real model for y_hat does not have an analytical form to express.
I have earlier tried to use OpenBUGS, but I found out here that it is not possible to do this in OpenBUGS. Is it possible in PyMC?
Update
Based on an example on the pyMC GitHub page, I found that I need to add the following decorator to my FindFromGrid(x, w, z) function:
@pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar], otypes=[t.dvector])
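For clarity, a sketch of how the decorated function might look (this assumes theano.tensor is imported as t, matching the itypes above):

import theano.tensor as t

@pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar],
                             otypes=[t.dvector])
def FindFromGrid(x, w, z):
    # Same lookup as before; the wrapper turns it into a Theano op,
    # so it can consume the model's random variables at sample time.
    return Grid[int(x)*size + int(w), 2:] * z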
This seems to solve the issue mentioned above, but I cannot use the NUTS sampler any more, since it needs gradients.
Metropolis does not seem to converge.
Which step method should I use in a scenario like this?
You found the correct solution with as_op.
Regarding the convergence: are you using pm.Metropolis() instead of pm.NUTS() by any chance? One reason it might not converge is that Metropolis() by default samples in the joint space, while Gibbs-within-Metropolis is often more effective (and this was the default in pymc2). Having said that, I just merged this: https://github.com/pymc-devs/pymc/pull/587 which changes the default behavior of the Metropolis and Slice samplers to be non-blocked by default (i.e. within Gibbs). Other samplers like NUTS that are primarily designed to sample the joint space still default to blocked. You can always set this explicitly with the kwarg blocked=True.
Anyway, update pymc with the most recent master and see if convergence improves. If not, try the Slice sampler.
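For example, a minimal sketch of swapping in the Slice sampler for the model above:

with model:
    start = pm.find_MAP()
    step = pm.Slice()  # gradient-free; updates one variable at a time
    trace = pm.sample(1000, step, start=start, progressbar=False)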