I am trying to fit a model to some data. The independent variables, called A and B, are columns in a pandas DataFrame. I want to fit with two parameters against the column y in the same DataFrame.
Previously, with curve_fit from Scipy, I could do:
def fun(X, p1, p2):
    A, B = X
    return np.exp(p1*A) + p2*B

X = (df['A'].tolist(), df['B'].tolist())
popt, pcov = curve_fit(fun, X, df['y'].tolist())
But now, I'm using lmfit, where I cannot simply "pack" the independent variables like with curve_fit:
def fun(A, B, p1=1, p2=1):
    return np.exp(p1*A) + p2*B

model = Model(fun, independent_vars=['A', 'B'])
How do I run model.fit() here? The FAQ is not really helpful—what do I have to flatten in the first place?
I created a complete, working example with two independent variables:
import pandas as pd
import numpy as np
from lmfit import Model
df = pd.DataFrame({
    'A': pd.Series([1, 1, 1, 2, 2, 2, 2]),
    'B': pd.Series([5, 4, 6, 6, 5, 6, 5]),
    'target': pd.Series([87.79, 40.89, 215.30, 238.65, 111.15, 238.65, 111.15])
})

def fun(A, B, p1=1, p2=1):
    return p1 * np.exp(A) + p2 * np.exp(B)

model = Model(fun, independent_vars=['A', 'B'])
fit = model.fit(df['target'], A=df['A'], B=df['B'])
The trick is to specify all variables as keyword arguments in fit().
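Once the fit has run, the returned result object holds the fitted parameters. A minimal sketch of how you might inspect it, continuing the example above (these are standard lmfit ModelResult attributes):
print(fit.fit_report())  # fit statistics plus parameter values and uncertainties
print(fit.best_values)   # dict of best-fit parameter values, e.g. {'p1': ..., 'p2': ...}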
First, create a model from a function of multiple independent variables. For example:
def random_func(x, y, a, b, c):
    return a*x**3 + b*y**2 + c
Second, specify which of the function's arguments are the independent variables. For example:
from lmfit import Model
model = Model(random_func, independent_vars=['x', 'y'])
Third, set the parameters for the model. For example:
model.set_param_hint('a',value=2)
model.set_param_hint('b',value=3)
model.set_param_hint('c',value=4)
Finally, set the independent-variable values as well as the data to fit, and do the fit. Like this:
x = np.arange(0,2,0.1)
y = np.arange(0,2,0.1)
z = np.loadtxt('filename')
A direct fit actually does not work well here: the 2D data array has to be flattened into a 1D array, and so do the coordinates. For example, leaving the model as it is, we need to create new 1D coordinate arrays:
x1d = []
y1d = []
for i in x:
    for j in y:
        x1d.append(i)
        y1d.append(j)
z1d = z.flatten()
result = model.fit(z1d, x=x1d, y=y1d)
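If you prefer to avoid the explicit loops, the same flattened coordinate arrays can be built with np.meshgrid. A minimal sketch, assuming the z loaded from file has shape (len(x), len(y)):
xx, yy = np.meshgrid(x, y, indexing='ij')  # 'ij' reproduces the nested-loop order above
x1d = xx.ravel()
y1d = yy.ravel()
z1d = z.ravel()
result = model.fit(z1d, x=x1d, y=y1d)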
What is a better way to write the following code block? I want to create a 1D array for each scene containing features a-e, so that the result eventually has shape m x n, where m is the number of scenes and n is the combined length of all the features.
The shapes of features a-d are unknown and can differ from each other. For example, feature a could have shape 100 x 3 x 3 x 5 while feature b could have shape 30 x 4. Feature e is simply a boolean.
inputs = []
for scene in scenes:
    inp = np.concatenate((
        scene['a'].flatten(),
        scene['b'].flatten(),
        scene['c'].flatten(),
        scene['d'].flatten(),
        [scene['e'] == True]))
    inputs.append(inp)
inputs = torch.FloatTensor(inputs)
Let's say we know ['a', 'b', 'c', 'd', 'e'] are the only attributes in each scene (so they're accessible via scene.keys()). Then the following code works:
output = np.vstack([
    np.hstack([np.array(v).flatten() for v in s.values()]) for s in scenes
])
inputs = torch.FloatTensor(output)
To test that, I created a scene generator function that builds dictionaries similar to what you described, and synthetically created 10 scenes:
import numpy as np

def scene_generator():
    scene = dict()
    scene['a'] = np.random.random((100, 3, 3, 5))
    scene['b'] = np.random.random((30, 4))
    scene['c'] = np.random.random((15, 15, 2))
    scene['d'] = np.random.random((2, 2, 2))
    scene['e'] = True
    return scene

scenes = [scene_generator() for _ in range(10)]
output = np.vstack([
    np.hstack([np.array(v).flatten() for v in s.values()]) for s in scenes
])
print(output.shape)
# (10, 5079)
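Note that this relies on every scene dict yielding its values in a consistent order ('a' through 'e'). If you'd rather not depend on that, you can fix the feature order explicitly; a small sketch of that variant (torch import added for completeness):
import torch

keys = ['a', 'b', 'c', 'd', 'e']  # explicit feature order
output = np.vstack([
    np.hstack([np.asarray(s[k]).flatten() for k in keys]) for s in scenes
])
inputs = torch.FloatTensor(output)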
I have 50 variables in my dataframe. 46 are dependent variables and 4 are independent variables (precipitation, temperature, dew, snow). I want to calculate the mutual information of my dependent variables against my independent ones.
So in the end I want a dataframe with one row per dependent variable and one column per independent variable.
Right now I am calculating it as follows, but it's taking very long because I have to change my y each time:
from sklearn.feature_selection import mutual_info_regression

X = df[['Temperature', 'Precipitation', 'Dew', 'Snow']]  # features
y = df['N0037']  # target

mi = mutual_info_regression(X, y)
mi /= np.max(mi)
mi = pd.Series(mi, index=X.columns)
mi = mi.sort_values(ascending=False)
mi
Using a list comprehension:
from sklearn.feature_selection import mutual_info_regression as mi_reg

indep_vars = ['Temperature', 'Precipitation', 'Dew', 'Snow']  # independent vars
dep_vars = df.columns.difference(indep_vars).tolist()  # dependent vars

df_mi = pd.DataFrame(
    [mi_reg(df[indep_vars], df[dep_var]) for dep_var in dep_vars],
    index=dep_vars, columns=indep_vars,
).apply(lambda x: x / x.max(), axis=1)
Another way is to pass a custom method to the pandas.DataFrame.corr() function:
from sklearn.feature_selection import mutual_info_regression

def custom_mi_reg(a, b):
    a = a.reshape(-1, 1)  # X must be 2-D; y stays 1-D
    return mutual_info_regression(a, b)[0]  # should return a float value

df_mi = df.corr(method=custom_mi_reg)
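Note that df.corr() computes the full pairwise matrix, including the dependent-vs-dependent pairs, so it does more work than strictly needed. If you only want the dependent-vs-independent block, you can slice it out afterwards; a short sketch, reusing indep_vars and dep_vars as defined in the previous answer:
mi_block = df_mi.loc[dep_vars, indep_vars]  # rows: dependent, columns: independent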
I want to make a linear equation with some dynamic inputs. It can be
y = θ0*x0 + θ1*x1
or
y = θ0*x0 + θ1*x1 + θ2*x2 + θ3*x3 + θ4*x4
For that I have a dictionary for x0, x1, x2, ..., xn and an array for θ0, θ1, θ2, ..., θn.
I'm new to Python, so I tried the function below, but I'm stuck.
So my question is: how can I write a function that takes x_values and theta_values as parameters and gives y_values as output?
X = pd.DataFrame({'x0': np.ones(6), 'x1': np.linspace(0, 5, 6)})
θ = np.matrix('0 1')

def line_func(features, parameters):
    result = []
    for feat, param in zip(features.items(), parameters):
        for i in feat:
            result.append(i*param)
    return result

line_func(X, θ)
If you want to multiply your thetas with a list of features, then you technically multiply a matrix (the features) with a vector (theta).
You can do this as follows:
import numpy as np

x_array = x.values
theta = np.array([theta_0, theta_1])
x_array.dot(theta)
Just order your theta vector the way your columns are ordered in x. But note that this gives a row-wise sum of the products theta_i*x_i over all i. If you don't want it summed up row-wise, just write x_array * theta instead.
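A minimal concrete example of the difference, with made-up numbers:
import numpy as np

x_array = np.array([[1.0, 5.0],
                    [2.0, 6.0]])  # columns x0 and x1
theta = np.array([0.0, 1.0])      # theta_0, theta_1

print(x_array.dot(theta))  # row-wise sums: [5. 6.]
print(x_array * theta)     # elementwise products: [[0. 5.] [0. 6.]]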
If you also want to use pandas for the multiplication (which I wouldn't recommend) and get a dataframe with the products of each column value and the corresponding theta, you can do it as follows:
# define the theta-x mapping (theta value per column name in x)
thetas = {'x1': 1, 'x2': 3}

# create an empty result dataframe with the index of x
df_result = pd.DataFrame(index=x.index)

# assign the calculated columns in a loop
for col_name, col_series in x.items():
    df_result[col_name] = col_series * thetas[col_name]
df_result
This results in:
   x1  x2
0   1   6
1  -1   3
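The same result can be obtained without the loop, because pandas aligns on column names when a DataFrame is multiplied by a Series; a short sketch, reusing the thetas dict from above:
df_result = x * pd.Series(thetas)  # columns of x matched to the Series index by name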
I'll give a minimal example where I would create numpy arrays inside row elements of a pandas.DataFrame.
This code finds the minimum of a certain function using scipy.optimize.brute, which returns the minimum, the variable value at which the minimum is found, and two numpy arrays: the grid at which the function was evaluated and the function's values on that grid.
import itertools

import numpy as np
import pandas as pd
import scipy.optimize

sin = lambda r, phi, x: r * np.sin(phi * x)

def func(r, x):
    x0, fval, grid, Jout = scipy.optimize.brute(
        sin, ranges=[(-np.pi, np.pi)], args=(r, x), Ns=10, full_output=True)
    return dict(phi_at_min=x0[0], result_min=fval, phis=grid, result_at_grid=Jout)

rs = np.linspace(-1, 1, 10)
xs = np.linspace(0, 1, 10)
vals = list(itertools.product(rs, xs))
result = [func(r, x) for r, x in vals]

# idk whether this is the best way of generating the DataFrame, but it works
df = pd.DataFrame(vals, columns=['r', 'x'])
df = pd.concat((pd.DataFrame(result), df), axis=1)
df.head()
I expect that this is not how I am supposed to do this and should maybe expand the lists somehow. How do I handle this in a correct, beautiful, and clean way?
So, even though "beautiful and clean" is subject to interpretation, I'll give you mine, which should in turn give you some ideas. I'm leveraging a MultiIndex so that you can later easily select pairs of phi/result_at_grid for each point in the evaluation grid. I'm also using apply instead of creating two dataframes.
import itertools

import numpy as np
import pandas as pd
import scipy.optimize

sin = lambda r, phi, x: r * np.sin(phi * x)

def func(row):
    """
    Accepts a row of a dataframe (a pd.Series), applied via
        df.apply(func, axis=1)
    Returns a pd.Series with the initial (r, x) and the results.
    """
    r = row['r']
    x = row['x']
    x0, fval, grid, Jout = scipy.optimize.brute(
        sin, ranges=[(-np.pi, np.pi)], args=(r, x), Ns=10, full_output=True)
    # Create a multi-index series for the phis
    phis = pd.Series(grid)
    phis.index = pd.MultiIndex.from_product([['Phis'], phis.index])
    # Same for the results at the grid
    result_at_grid = pd.Series(Jout)
    result_at_grid.index = pd.MultiIndex.from_product([['result_at_grid'], result_at_grid.index])
    # Concatenate the two
    s = pd.concat([phis, result_at_grid])
    # Add the two float results
    s['phi_at_min'] = x0[0]
    s['result_min'] = fval
    # Add the initial r, x to reconstruct the index later
    s['r'] = r
    s['x'] = x
    return s

rs = np.linspace(-1, 1, 10)
xs = np.linspace(0, 1, 10)
vals = list(itertools.product(rs, xs))
df = pd.DataFrame(vals, columns=['r', 'x'])

# Apply func to each row (axis=1)
results = df.apply(func, axis=1)
results.set_index(['r', 'x'], inplace=True)
results.head().T  # transposing so we can see the output in one go
Now you can select all values at evaluation grid point 2, for example:
print(results.swaplevel(0,1, axis=1)[2].head()) # Showing only 5 first
                    Phis  result_at_grid
r    x
-1.0 0.000000  -1.745329        0.000000
     0.111111  -1.745329        0.193527
     0.222222  -1.745329        0.384667
     0.333333  -1.745329        0.571062
     0.444444  -1.745329        0.750415
Gridding data (d) on an irregular grid (x and y) using Scipy's griddata is time-consuming when there are many datasets. But the longitudes and latitudes (x and y) are always the same; only the data (d) change. In this case, after using griddata once, how can I repeat the procedure with different d arrays to get faster results?
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import griddata

x = np.array([110, 112, 114, 115, 119, 120, 122, 124]).astype(float)
y = np.array([60, 61, 63, 67, 68, 70, 75, 81]).astype(float)
d = np.array([4, 6, 5, 3, 2, 1, 7, 9]).astype(float)

ulx, lrx = np.min(x), np.max(x)
uly, lry = np.max(y), np.min(y)
xi = np.linspace(ulx, lrx, 15)
yi = np.linspace(uly, lry, 15)

grided_data = griddata((x, y), d, (xi.reshape(1, -1), yi.reshape(-1, 1)),
                       method='nearest', fill_value=0)
plt.imshow(grided_data)
plt.show()
The above code works for one array of d.
But I have hundreds of other arrays.
griddata with method='nearest' ends up using NearestNDInterpolator. That's a class that creates an interpolator object, which is then called with the xi:
elif method == 'nearest':
    ip = NearestNDInterpolator(points, values, rescale=rescale)
    return ip(xi)
So you could create your own NearestNDInterpolator and call it multiple times with different xi.
But I think in your case you want to change the values. Looking at the code for that class, I see:
self.tree = cKDTree(self.points)
self.values = y
the __call__ does:
dist, i = self.tree.query(xi)
return self.values[i]
I don't know the relative cost of creating the tree versus querying it.
So it should be easy to change values between uses of __call__. And it looks like values could have multiple columns, since it's just indexing on the 1st dimension.
This interpolator is simple enough that you could write your own using the same tree idea.
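For example, here is a minimal sketch of that idea using scipy.spatial.cKDTree directly, assuming the x, y and d arrays from the question: the tree is built and queried once, and the resulting indices are reused for every new data array.
import numpy as np
from scipy.spatial import cKDTree

# build the tree once from the fixed (x, y) locations
tree = cKDTree(np.column_stack((x, y)))

# query once for the fixed target grid
gx, gy = np.meshgrid(np.linspace(x.min(), x.max(), 15),
                     np.linspace(y.max(), y.min(), 15))
_, idx = tree.query(np.column_stack((gx.ravel(), gy.ravel())))

# reuse idx for each new data array d (same shape as x and y)
grided_data = d[idx].reshape(gx.shape)
# for the next (hypothetical) dataset d_new, just: d_new[idx].reshape(gx.shape)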
Here's a Nearest interpolator that lets you repeat the interpolation for the same points but different z values. I haven't done timings yet to see how much time it saves:
import numpy as np
from scipy import interpolate

class MyNearest(interpolate.NearestNDInterpolator):
    # normal interpolation, but returns the nearest-neighbor indices as well
    def __call__(self, *args):
        xi = interpolate.interpnd._ndim_coords_from_arrays(args, ndim=self.points.shape[1])
        xi = self._check_call_shape(xi)
        xi = self._scale_x(xi)
        dist, i = self.tree.query(xi)
        return i, self.values[i]

def my_griddata(points, values, method='linear', fill_value=np.nan,
                rescale=False):
    points = interpolate.interpnd._ndim_coords_from_arrays(points)
    if points.ndim < 2:
        ndim = points.ndim
    else:
        ndim = points.shape[-1]
    assert ndim == 2
    # simplified call for 2d 'nearest'
    ip = MyNearest(points, values, rescale=rescale)
    return ip  # return the interpolator itself, not ip(xi)
ip = my_griddata((xreg, yreg), z, method='nearest', fill_value=0)
print(ip)

xi = (xi.reshape(1, -1), yi.reshape(-1, 1))
I, data = ip(xi)
print(data.shape)
print(I.shape)
print(np.allclose(z[I], data))

z1 = xreg + yreg  # new z data
data = z1[I]  # should show diagonal color bars
So as long as z has the same shape as before (and as xreg), z[I] will return the nearest value for each xi.
And it can interpolate 2D data as well (e.g. (225, n) shaped):
z1 = np.array([xreg + yreg, xreg - yreg]).T
print(z1.shape)  # (225, 2)

data = z1[I]
print(data.shape)  # (20, 20, 2)
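If you want to quantify the saving, here is a rough timing sketch; many_d_arrays is a hypothetical list of data arrays shaped like z, and I is the index array returned above:
import time

t0 = time.perf_counter()
for d in many_d_arrays:  # hypothetical list of data arrays
    interpolate.griddata((xreg, yreg), d, xi, method='nearest')  # rebuilds the tree each time
print('full griddata per array:', time.perf_counter() - t0)

t0 = time.perf_counter()
for d in many_d_arrays:
    d[I]  # reuses the precomputed nearest-neighbor indices
print('index reuse:', time.perf_counter() - t0)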