I have a text file that contains dates and numerical values like
1.1.2020, 45.67
2.1.2020, 49.65
4.1.2020, 47.58
31.1.2020, 55.88
...
Note that value of some dates is missing.
I would like to fit a model of the form ae^(bx) to find an estimate what would be value in 1.1.2021. How can I do that? Is there some Sagemath function for that or some Python library to find such a model.
For your specified formula, this can be solved by fitting a linear model to the log-transformed values (a log-linear model):
Y = a * exp(bx)
log(Y) = log(a) + bx
1. Transform your dates into a numeric type (e.g. with as.numeric()).
2. Fit an ordinary linear model to log(Y).
3. Back-transform the results onto the original scale: Y = exp(intercept + slope*date)
In R, using some made-up data:
dates = sort(sample(1:100, 20))
values = exp(seq(0, 5, length.out = 20)) + rnorm(20)
# fit the linear model on the log scale
mod = lm(log(values) ~ dates)
# back-transform the fitted curve and overlay it on the data
new = 1:100
plot(values ~ dates)
points(exp(coef(mod)[1] + coef(mod)[2] * new) ~ new, col = "red")
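Since the question asks about Python/SageMath, here is a minimal sketch of the same log-linear approach with NumPy and pandas; the file name data.txt is an assumption, and the column layout follows the sample above.
import numpy as np
import pandas as pd

# hypothetical file with lines like "1.1.2020, 45.67"; missing dates are simply absent rows
df = pd.read_csv("data.txt", header=None, names=["date", "value"], skipinitialspace=True)
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")

# numeric x: days since the first observation
x = (df["date"] - df["date"].min()).dt.days.to_numpy(dtype=float)
y = df["value"].to_numpy(dtype=float)

# fit log(y) = log(a) + b*x with a degree-1 polynomial
b, log_a = np.polyfit(x, np.log(y), 1)
a = np.exp(log_a)

# extrapolate to 1.1.2021
x_new = (pd.Timestamp("2021-01-01") - df["date"].min()).days
print(a * np.exp(b * x_new))
Alternatively, scipy.optimize.curve_fit(lambda x, a, b: a * np.exp(b * x), x, y) fits the a*e^(bx) form directly, without the log transform.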
I'm trying to use scipy.optimize.curve_fit on a large latitude/longitude/time xarray using dask.distributed as computing backend.
The idea is to run an individual data fitting for every (latitude, longitude) using the time series.
All of this runs fine outside xarray/dask. I tested it using the time series of a single location passed as a pandas dataframe. However, if I try to run the same process on the same (latitude, longitude) directly on the xarray, the curve_fit operation returns the initial parameters.
I am performing this operation using xr.apply_ufunc like so (here I'm providing only the code that is strictly relevant to the problem):
# function to perform the fit
def _fit_rti_curve(data, data_rti, fit, loc=False):
    fit_func, linearize, find_init_params = _get_fit_functions(fit)
    # remove nans
    x, y = _filter_nodata(data_rti, data)
    # remove outliers
    x, y = _filter_for_outliers(x, y, linearize=linearize)
    # find a first guess for the maximum achievable value
    yscale = np.max(y) * 1.05
    # find a first guess for the other parameters
    # here loc can be manually passed if you have a good estimation
    init_parms = find_init_params(x, y, yscale, loc=loc, linearize=linearize)
    # fit the curve and return the optimized parameters
    parms = curve_fit(fit_func, x, y, p0=init_parms, maxfev=10000)
    parms = parms[0]
    return parms
# shell around _fit_rti_curve
def find_rti_func_parms(data, rti, fit):
    # sort and fit highest n values
    top_data = np.sort(data)
    top_data = top_data[-len(rti):]
    # convert to float64 if needed
    top_data = top_data.astype(np.float64)
    rti = rti.astype(np.float64)
    # run the fit
    parms = _fit_rti_curve(top_data, rti, fit, loc=0)  # TODO maybe add function to allow a free loc
    return parms
# call for the apply_ufunc
# `fit` is a string that defines the distribution type
# `rti` is an array for the x values
parms_data = xr.apply_ufunc(
    find_rti_func_parms,
    xr_obj,
    input_core_dims=[['time']],
    output_core_dims=[[fit + ' parameters']],
    output_sizes={fit + ' parameters': len(signature(fit_func).parameters) - 1},
    vectorize=True,
    kwargs={'rti': return_time_interval, 'fit': fit},
    dask='parallelized',
    output_dtypes=['float64']
)
My guess would be that it is a problem related to threading, or at least to some shared memory space that is not properly passed between workers and the scheduler.
However, I am just not knowledgeable enough to test this within dask.
Any idea on this problem?
You should have a look at this issue https://github.com/pydata/xarray/issues/4300
I had the same problem and I solved it using apply_ufunc. It is not optimized, since it has to perform rechunking operations, but it works!
I've created a GitHub Gist for it https://gist.github.com/clausmichele/8350e1f7f15e6828f29579914276de71
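One thing worth checking in that context (it may or may not be your issue): with dask='parallelized', the core dimension passed to apply_ufunc has to sit in a single chunk, so a rechunk along time before the call is often needed. A minimal sketch, where the spatial dimension names and chunk sizes are assumptions:
# put the whole 'time' axis into one chunk; 'lat'/'lon' names and sizes are placeholders
xr_obj = xr_obj.chunk({'time': -1, 'lat': 50, 'lon': 50})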
This previous answer might be helpful? It's using numpy.polyfit but I think the general approach should be similar.
Applying numpy.polyfit to xarray Dataset
Also, I haven't tried it but xr.polyfit() just got merged recently! Could also be something to look into. http://xarray.pydata.org/en/stable/generated/xarray.DataArray.polyfit.html#xarray.DataArray.polyfit
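I haven't run it against your data, but a minimal sketch of what the polyfit route could look like; the name xr_obj and the 'time' dimension are taken from your snippet, and since polyfit only fits polynomials, an exponential-like shape would have to be linearised (e.g. by taking logs) first:
import numpy as np

# assumes xr_obj is a DataArray with a 'time' dimension and strictly positive values
log_da = np.log(xr_obj)

# degree-1 fit along time for every (latitude, longitude) pixel
fit = log_da.polyfit(dim="time", deg=1, skipna=True)

# coefficients come back in a Dataset, indexed by a 'degree' coordinate
slope = fit.polyfit_coefficients.sel(degree=1)
intercept = fit.polyfit_coefficients.sel(degree=0)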
I am trying to do a linear regression. With the results I want to multiply each x by its own estimated coefficient: xi·βi.
However, I am doing a lot of transformations on xi.
For example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
def log_plus_1(x):
    return np.log(x + 1.0)
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
formule = 'Lottery ~ pow(Literacy,2) + log_plus_1(Wealth)'
mod = smf.ols(formula=formule, data=df)
res = mod.fit()
res.params
Now I would need the transformed values pow(Literacy, 2) and log_plus_1(Wealth). Since they already go into the model, I was hoping to get them back out of it instead of re-transforming the data from the original dataset.
In R I would use res$model to get it.
The data is stored as attributes of the model, e.g. the design matrix is mod.exog, the dependent or response variable is mod.endog.
(I'm not sure I remember correctly the details of the following: The data that patsy returns after creating the transformed design matrix should, in this case, be a pandas DataFrame, and should be stored in mod.data.orig_exog or something like that.)
res.predict automatically handles the transformation, i.e. patsy uses the formula information to transform the data for the explanatory variables in prediction in the same way as the data was transformed in creating the model.
predict only returns the prediction and not the internally transformed predict exog.
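To make this concrete with the example above, a sketch from memory; the attribute names are worth double-checking against your statsmodels version:
# design matrix as a NumPy array, including the intercept column
design = mod.exog
print(mod.exog_names)  # e.g. ['Intercept', 'pow(Literacy, 2)', 'log_plus_1(Wealth)']

# the patsy-built design matrix kept as a pandas DataFrame
transformed_df = mod.data.orig_exog
print(transformed_df.head())

# the response actually used in the fit
y_used = mod.endog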
I'm trying to predict the temperature at 12 UTC tomorrow at one location. To forecast, I use a basic linear regression model with the statsmodels module. My code is below:
x = ds_main
X = sm.add_constant(x)
y = ds_target_t
model = sm.OLS(y,X,missing='drop')
results = model.fit()
The summary shows that the fit is "good".
But the problem appears when I try to predict values with a new dataset that I consider to be my test set. The latter has the same number of columns and the same variable names, but the .predict() function returns an array of NaN, although my test set does contain values...
xnew = ts_main
Xnew = sm.add_constant(xnew)
ynewpred = results.predict(Xnew)
I really don't understand where the problem is ...
UPDATE: I think I have an explanation: my Xnew dataframe contains NaN values. The statsmodels .fit() path allows dropping missing values (NaN), but the .predict() function does not. Thus, it returns an array of NaN values...
That is the "why", but I still don't see how to fix it...
statsmodels.api.OLS by default will not accept data with NA values, so if you use it you need to drop your NA values first.
However, if you use statsmodels.formula.api.ols, then it will automatically drop the NA values to run regression and make predictions for you.
So you can try this:
import pandas as pd
import statsmodels.formula.api as smf
# adjust the formula to reference your actual column names
lm = smf.ols(formula="y ~ X", data=pd.concat([y, X], axis=1)).fit()
lm.predict(Xnew)
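If you prefer to stay with statsmodels.api.OLS, another option is to drop (or fill) the NaN rows of the test set yourself before predicting; a minimal sketch using the variable names from the question:
# predict only on complete rows of the test set
Xnew_clean = Xnew.dropna()
ynewpred = results.predict(Xnew_clean)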
Say I do an OLS regression using statsmodels of variable y on some explanatory variables x1 x2 x3 (contained in a dataframe df):
res = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
Is it possible to get a predicted value using only a subset of the explanatory variables? For example, I would like to get a predicted value for the observations in df using only x1 and x2 but not x3.
I have tried to do
res.predict(df[['x1','x2']])
but I get the error message: NameError: name 'x3' is not defined.
Edit: The reason I want to do this is the following. I'm running a regression of house values on house characteristics and dummies for metropolitan area, suburban status, and year. I would like to use the dummies for metropolitan area, suburban status and year to construct a price index for each location and time period.
Edit 2: This is how I ended up doing it, in case it can be helpful to anyone or someone can point me to a better way of doing it.
I'm interested in doing an OLS on the following specification:
model = 'price ~ C(MetroArea) + C(City) + C(Year) + x1 + ... + xK'
where 'x1 + ... + xK' is pseudo-code for a bunch of variables I'm using as controls but I'm not interested in, and the categorical variables are very large (e.g. 90 Metropolitan areas).
Next I fit the model with statsmodels and construct the design matrix that I'll use to predict prices using the variables of interest.
res = smf.ols(model, data=mydata).fit()
data_prediction = mydata[['MetroArea','City','Year']]
model_predict = 'C(MetroArea) + C(City) + C(Year)'
X = patsy.dmatrix(model_predict, data=data_prediction, return_type='dataframe')
The tricky part now is to select the right parameters for the variables of interest, since there are many and their names are not exactly those of their respective variables since I've used the categorical operator, C(), of patsy (e.g. variables for MetroArea look like: C(MetroArea)[0], C(MetroArea)[8], ...).
vars_interest = ['Intercept', 'MetroArea', 'City', 'Year']
params_interest = res.params[[any([word in var for word in vars_interest])
                              for var in res.params.index]]
Get prediction by doing the dot product of the selected parameters and variables of interest:
prediction = np.dot(X,params_interest)
In case anyone stumbles on this old question, there seems to be a cleaner solution using the information contained in the design matrix.
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
mydata = None
vars_of_interest = ['C(MetroArea)', 'C(City)', 'C(Year)']
formula = 'price ~' + " + ".join(vars_of_interest) + ' + x1 + ... + xK'
Y, X = dmatrices(formula, mydata)
# Get the slice names from patsy
slices = X.design_info.term_name_slices
model = sm.OLS(Y, X)
res = model.fit()
prediction = np.zeros(X.shape[0])
for var in vars_of_interest:
    prediction += X[:, slices[var]].dot(res.params[slices[var]])
What are you trying to do conceptually? When you predict using your regression you're just plugging values into an equation. So predicting "without x3" is the same as just plugging in x3=0.
In terms of implementing this, it looks like statsmodels is pretty draconian about requiring prediction data to use the same variable names as the fit. So this is not elegant, but it works:
df2 = df.copy()
df2['x3'] = 0
res.predict(df2[['x1','x2','x3']])
I am trying to fit OneVsAll classification output as training data; the rows of the output add up to 1.
One possible way is to read all the rows, find which column has the highest value, and prepare the data for training.
E.g. y = [[0.2,0.8,0],[0,1,0],[0,0.3,0.7]] can be reduced to y = [b,b,c], considering a, b, c as the corresponding classes of columns 0, 1, 2 respectively.
Is there a function in scikit-learn which helps to achieve such transformations?
This code does what you want:
import numpy as np
y = np.array([[0.2, 0.8, 0], [0, 1, 0], [0, 0.3, 0.7]])
def transform(y, labels):
    # map each row to the label of its highest-scoring column
    f = np.vectorize(lambda i: labels[i])
    return f(y.argmax(axis=1))
y = transform(y, 'abc')
EDIT: Using the comment by alko, I made it more general by letting the user supply the labels to the transform function.
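On the scikit-learn part of the question: I'm not aware of a dedicated helper, but LabelBinarizer.inverse_transform applies the same argmax mapping once it has been fitted on the label set. Treat this as a sketch to verify rather than guaranteed behaviour:
import numpy as np
from sklearn.preprocessing import LabelBinarizer

y_scores = np.array([[0.2, 0.8, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.3, 0.7]])

lb = LabelBinarizer()
lb.fit(['a', 'b', 'c'])                # establish the column-to-label mapping
print(lb.inverse_transform(y_scores))  # expected: ['b' 'b' 'c']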