Implementing a machine learning-like optimizer - python

I am trying to predict the trend of an internet post.
I have available the number of comments and votes the post has after 2 minutes of being posted (can change, but it should be enough).
Currently I use this formula:
predicted_votes = (votes_per_minute + n_comments * 60 * h) * k
And then I find k experimentally. I get the post data, wait an hour, do
k = (older_k + actual_votes/predicted_votes) / 2
And so on. This kind of works. The accuracy is pretty low (40 - 50%), but it gives me a rough idea on how the post is going to react.
I was wondering if I could employ a more complex equation, something like:
predicted_votes = ((votes_per_minute * x + n_comments * y) * 60 * hour) * k # Hour stands for 'how many hours to predict'
And then optimize the parameters to approximate a bit better.
I would assume that I could use Machine Learning, although I don't have a GPU available (that's right, I'm running on integrated graphics, blame Mojave), so I am trying this approach instead.
So the question boils down to, how do I optimize those parameters (k,x,y) to get a better accuracy?
EDIT:
I tried following what @Alexis said, and this is where I am at right now:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
initial_votes_list = [1.41, 0.9, 0.94, 0.47, 0]
initial_comment_list = [0, 3, 0, 1, 64]
def func(x, k, t, s):
    votes_per_minute = x[0]
    n_comments = x[1]
    return ((votes_per_minute * t + n_comments * s) * 60) * k
xdata = [1.41,0]
y = func(xdata, 2.5, 1.3, 0.5)
np.random.seed(1729)
ydata = y + 5
plt.plot(xdata, ydata, 'b-', label='data')
popt, pcov = curve_fit(func, xdata, ydata)
plt.plot(xdata, func(xdata, *popt), 'g--',
label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.xlabel('Time')
plt.ylabel('Score')
plt.legend()
plt.show()
I am not sure how to feed the data I have (votes_per_minute, n_comments), nor how I could tell the algorithm that y axis is actually time.
EDIT 2:
I tried doing what @Alexis told me, but I am unsure what to use as actual_score: a number doesn't work, and neither does a list. Also, I want to predict the 'score', not the number of comments.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
initial_votes_list = [1.41, 0.9, 0.94, 0.47, 0]
initial_comment_list = [0, 3, 0, 1, 64]
final_score = [26,12,13,14,229]
def func(x,k,t,s):
    return ((x[0]*k+x[1]*t)*60*x[2])*s
X = [[a,b,c] for a,b,c in zip(initial_votes_list,initial_comment_list,[i for i in range(len(initial_votes_list))])]
y = actual_votes # What is this?
popt, pcov = curve_fit(func, X, y)
plt.plot(xdata, func(xdata, *popt), 'g--',
label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.xlabel('Time')
plt.ylabel('Score')
plt.legend()
plt.show()

You don't need ML for this (I think it would be overkill here). SciPy provides a simple way to fit a curve to the observations you have.
scipy.optimize.curve_fit lets you fit a function with unknown parameters to your observations. Since you already know the general form of the function, estimating its parameters is a well-known statistics problem, so SciPy should be enough.
We can take a small example to demonstrate this:
First we generate the data:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy.optimize import curve_fit
>>>
>>> def func(x, a, b, c):
...     return a * np.exp(-b * x) + c
Define the data to be fit with some noise:
>>> xdata = np.linspace(0, 4, 50)
>>> y = func(xdata, 2.5, 1.3, 0.5)
>>> np.random.seed(1729)
>>> y_noise = 0.2 * np.random.normal(size=xdata.size)
>>> ydata = y + y_noise
>>> plt.plot(xdata, ydata, 'b-', label='data')
Then we fit the function (a * exp(-b * x) + c) to the data using SciPy:
popt, pcov = curve_fit(func, xdata, ydata)
You could add constraints (bounds) to this, but for your problem it is not necessary.
By the way, this example is at the end of the link I provided. Everything you need to know to use curve_fit is available on that page.
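As a hedged follow-up to the example above: the second return value, pcov, is the estimated covariance matrix of the fitted parameters, and the square root of its diagonal gives one-standard-deviation uncertainties. The sketch below regenerates the same synthetic data and also shows the optional bounds argument; the particular bound values are illustrative assumptions, not something the problem requires.

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

# Same synthetic data as in the example above
xdata = np.linspace(0, 4, 50)
np.random.seed(1729)
ydata = func(xdata, 2.5, 1.3, 0.5) + 0.2 * np.random.normal(size=xdata.size)

# Unconstrained fit
popt, pcov = curve_fit(func, xdata, ydata)
perr = np.sqrt(np.diag(pcov))  # one-sigma uncertainty per parameter
for name, value, err in zip("abc", popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")

# Constrained fit: lower bound 0 for every parameter, assumed upper bounds
popt_bounded, pcov_bounded = curve_fit(
    func, xdata, ydata, bounds=(0, [5.0, 3.0, 2.0])
)
print("bounded fit:", popt_bounded)
```

If an uncertainty comes out as large as the parameter itself, the data probably do not pin that parameter down well.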
Edit
It seems you are having a hard time figuring out how to use this. Let's go through it step by step to make sure every piece is right:
You want to predict the score; this is your y. It is known (observed), not calculated.
You have three inputs: votes_per_minute, n_comments and the hour h.
Last but not least, you have three parameters to fit, (x, y, k) in your formula; they are renamed (k, t, s) below.
So X[i] (one sample) should look like this: [votes_per_minute, n_comments, h]
With your formula y = ((votes_per_minute * k + n_comments * t) * 60 * h) * s, replacing the names gives:
def func(x,k,t,s):
    return ((x[0]*k+x[1]*t)*60*x[2])*s
X = np.array([[a,b,c] for a,b,c in zip(initial_votes_list,initial_comment_list,[i for i in range(len(initial_votes_list))])]).T
y = score
and then:
popt, pcov = curve_fit(func, X, y)
(If I understand your issue correctly; if not, I don't see where the problem is.)
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
initial_votes_list = [1.41, 0.9, 0.94, 0.47, 0]
initial_comment_list = [0, 3, 0, 1, 64]
final_score = [26,12,13,14,229]
def func(x,k,t,s):
    return ((x[0]*k+x[1]*t)*60*x[2])*s
X = np.array([[a,b,c] for a,b,c in zip(initial_votes_list,initial_comment_list,[i for i in range(len(initial_votes_list))])]).T
y = [0.12,0.20,0.5,0.9,1]
popt, pcov = curve_fit(func, X, y)
print(popt)
>>>[-6.65969099e+00 -6.99241803e-02 -9.33412000e-04]
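As a variation on the fit above, here is a minimal sketch that uses the observed final_score values as the target. It rests on two assumptions that are not in the original posts: every score was measured after one hour (so h = 1 for each sample), and the overall scale factor is dropped, because multiplying both weights by a constant and dividing the scale by the same constant gives identical predictions, so three parameters cannot be determined independently. The names predict, a and b are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

votes_per_minute = np.array([1.41, 0.9, 0.94, 0.47, 0.0])
n_comments = np.array([0.0, 3.0, 0.0, 1.0, 64.0])
final_score = np.array([26.0, 12.0, 13.0, 14.0, 229.0])

# Assumption: each final score was observed one hour after posting
hours = np.ones_like(votes_per_minute)

def predict(X, a, b):
    """Two-weight version of the formula; X has one row per input variable."""
    votes, comments, h = X
    return (votes * a + comments * b) * 60 * h

# curve_fit accepts a (k, M) array as xdata: k inputs, M samples
X = np.vstack([votes_per_minute, n_comments, hours])
popt, pcov = curve_fit(predict, X, final_score)

print("fitted weights (a, b):", popt)
print("predicted scores:", predict(X, *popt))
print("actual scores:   ", final_score)
```

With only five posts the weights are very uncertain; collecting more samples matters far more than the choice of optimizer.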

Related

How to do Non-Linear Curve fitting and find fitting parameter using Python with user defined function?

I have the following data-set:
x = 0, 5, 10, 15, 20, 25, 30
y = 0, 0.13157895, 0.31578947, 0.40789474, 0.46052632, 0.5, 0.53947368
Now I want to plot this data set, fit it with my function f(x) = (A*K*x/(1+K*x)), and find the parameters A and K.
I wrote the following Python script, but it doesn't seem to produce the fit I need:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
x = np.array([0, 5, 10, 15, 20, 25, 30])
y = np.array([0, 0.13157895, 0.31578947, 0.40789474, 0.46052632, 0.5, 0.53947368])
def func(x, A, K):
    return (A*K*x / (1+K*x))
plt.plot(x, y, 'b-', label='data')
popt, pcov = curve_fit(func, x, y)
plt.plot(x, func(x, *popt), 'r-', label='fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
Still, it's not giving the best curve fit. Can anyone help me with changes to the Python script, or with a new script that properly fits the data with my desired fitting function?
The classic problem: you didn't give an initial guess for A or K. In that case the default value is 1 for every parameter, which is not suitable for your data set, and the fit converges somewhere else. You can come up with guesses in different ways: by looking at the data, from the physical meaning of the parameters, and so on. You pass the guesses through the p0 parameter of scipy.optimize.curve_fit, which accepts a list of values in the order they appear in the function you want to optimize. I used 0.1 for both, and I got this curve:
popt, pcov = curve_fit(func, x, y, p0=[0.1, 0.1])
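One way to pick the guesses "by looking at the data" can be sketched as follows. The heuristic is an assumption for illustration, not part of the original answer: for f(x) = A*K*x/(1+K*x) the curve saturates at A and reaches A/2 at x = 1/K, so the largest observed y serves as a rough A and the first x where y exceeds half of it gives a rough 1/K.

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.array([0, 5, 10, 15, 20, 25, 30], dtype=float)
y = np.array([0, 0.13157895, 0.31578947, 0.40789474,
              0.46052632, 0.5, 0.53947368])

def func(x, A, K):
    return A * K * x / (1 + K * x)

# Rough, data-driven starting values (heuristic assumption)
A0 = y.max()                       # the plateau is at least this high
x_half = x[np.argmax(y >= A0 / 2)]  # first x where y passes half of A0
K0 = 1.0 / x_half

popt, pcov = curve_fit(func, x, y, p0=[A0, K0])
sse = np.sum((y - func(x, *popt)) ** 2)  # sum of squared residuals
print("initial guess:", (A0, K0))
print("fitted A, K:", popt, "SSE:", sse)
```

The guess only needs to land in the right region; the optimizer refines it from there.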
Try Minuit, which is a package implemented at Cern.
from iminuit import Minuit
import numpy as np
import matplotlib.pyplot as plt
def func(x, A, K):
    return (A*K*x / (1+K*x))
def least_squares(a, b):
    yvar = 0.01
    return sum((y - func(x, a, b)) ** 2 / yvar)
x = np.array([0, 5, 10, 15, 20, 25, 30])
y = np.array([0, 0.13157895, 0.31578947, 0.40789474, 0.46052632, 0.5, 0.53947368])
m = Minuit(least_squares, a=5, b=5)
m.migrad() # finds minimum of least_squares function
m.hesse() # computes errors
plt.plot(x, y, "o")
plt.plot(x, func(x, *m.values.values()))
# print parameter values and uncertainty estimates
for p in m.parameters:
    print("{} = {} +/- {}".format(p, m.values[p], m.errors[p]))
And the outcome:
a = 0.955697134431429 +/- 0.4957121286951612
b = 0.045175437602766676 +/- 0.04465599806912648

Non-linear curve-fitting program in python

I would like to find and plot a function f that represents a curve fitted to a set of points, x and y, that I already know.
After some research I started experimenting with scipy.optimize and curve_fit, but in the reference guide I found that it fits a given model function to the data, assuming ydata = f(xdata, *params) + eps.
So my question is this: what do I have to change in my code to use curve_fit, or any other library, to find the function of the curve from my set points? (Note: I want to know the function as well, so I can integrate it later for my project and plot it.) I know that it is going to be a decaying exponential function, but I don't know the exact parameters. This is what I tried in my program:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * np.exp(-b * x) + c
xdata = np.array([0.2, 0.5, 0.8, 1])
ydata = np.array([6, 1, 0.5, 0.2])
plt.plot(xdata, ydata, 'b-', label='data')
popt, pcov = curve_fit(func, xdata, ydata)
plt.plot(xdata, func(xdata, *popt), 'r-', label='fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
I am currently developing this project on a Raspberry Pi, if that changes anything. I would like to use the least squares method since it is precise, but any other method that works well is welcome.
Again, this is based on the SciPy reference guide. Also, I get the following graph, which is not even a curve: (image: Graph and curve based on set points)
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * np.exp(-b * x) + c
#c is a constant so taking the derivative makes it go to zero
def deriv(x, a, b, c):
    return -a * b * np.exp(-b * x)
#Integrating gives you another c coefficient (offset) let's call it c1 and set it equal to zero by default
def integ(x, a, b, c, c1 = 0):
    return -a/b * np.exp(-b * x) + c*x + c1
#There are only 4 (x,y) points here
xdata = np.array([0.2, 0.5, 0.8, 1])
ydata = np.array([6, 1, 0.5, 0.2])
#curve_fit already uses "non-linear least squares to fit a function, f, to data"
popt, pcov = curve_fit(func, xdata, ydata)
a,b,c = popt #these are the optimal parameters for fitting your 4 data points
#Now get more x values to plot the curve along so it looks like a curve
step = 0.01
fit_xs = np.arange(min(xdata),max(xdata),step)
#Plot the results
plt.plot(xdata, ydata, 'bx', label='data')
plt.plot(fit_xs, func(fit_xs,a,b,c), 'r-', label='fit')
plt.plot(fit_xs, deriv(fit_xs,a,b,c), 'g-', label='deriv')
plt.plot(fit_xs, integ(fit_xs,a,b,c), 'm-', label='integ')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
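Since the goal is to integrate the fitted function later, it may help to sanity-check the closed-form integ above against numerical quadrature. The sketch below does that with scipy.integrate.quad; the parameter values a, b, c are placeholder assumptions rather than the result of the four-point fit, and the integration limits are simply the range of the data.

```python
import numpy as np
from scipy.integrate import quad

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

def integ(x, a, b, c, c1=0):
    # Antiderivative of func with integration constant c1
    return -a / b * np.exp(-b * x) + c * x + c1

# Placeholder parameters (assumption); substitute the popt from your own fit
a, b, c = 2.5, 1.3, 0.5
lower, upper = 0.2, 1.0

# Definite integral from the antiderivative
analytic = integ(upper, a, b, c) - integ(lower, a, b, c)

# Same integral by numerical quadrature
numeric, abs_error = quad(func, lower, upper, args=(a, b, c))

print("analytic:", analytic)
print("numeric: ", numeric, "(estimated error", abs_error, ")")
```

The two values should agree to within the quadrature error, which confirms the antiderivative before it is used elsewhere.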

Curve fitting with python error

I'm trying to fit my data to (cos(x))^n. The value of n is 2 in theory, but my data should give me around 1.7. When I define my fitting function and try curve_fit, I get an error:
def f(x,a,b,c):
    return a+b*np.power(np.cos(x),c)
param, extras = curve_fit(f, x, y)
This is my data
x y error
90 3.3888756187 1.8408898986
60 2.7662844365 1.6632150903
45 2.137309503 1.4619540017
30 1.5256883339 1.2351875703
0 1.4665463518 1.2110104672
The error looks like this:
/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:4:
RuntimeWarning: invalid value encountered in power after removing
the cwd from sys.path.
/usr/lib/python3/dist-packages/scipy/optimize/minpack.py:690:
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
The problem is that cos(x) can become negative and then cos(x) ^ n can be undefined. Illustration:
np.cos(90)
-0.44807361612917013
and e.g.
np.cos(90) ** 1.7
nan
That causes the two error messages you receive.
It works fine if you modify your model, e.g. to a + b * np.cos(c * x + d). Then the plot looks as follows:
The code can be found below with some inline comments:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def f(x, a, b, c, d):
    return a + b * np.cos(c * x + d)
# your data
xdata = [90, 60, 45, 30, 0]
ydata = [3.3888756187, 2.7662844365, 2.137309503, 1.5256883339, 1.4665463518]
# plot data
plt.plot(xdata, ydata, 'bo', label='data')
# fit the data
popt, pcov = curve_fit(f, xdata, ydata, p0=[3., .5, 0.1, 10.])
# plot the result
xdata_new = np.linspace(0, 100, 200)
plt.plot(xdata_new, f(xdata_new, *popt), 'r-', label='fit')
plt.legend(loc='best')
plt.show()
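A note on the nan shown earlier: np.cos interprets its argument in radians, which is why np.cos(90) is negative. If the x column (90, 60, 45, 30, 0) is in degrees, which is an assumption based on the values and not stated in the question, converting with np.deg2rad keeps the cosine non-negative over that range, so a non-integer power stays defined. A minimal sketch of the difference:

```python
import numpy as np

angles_deg = np.array([90.0, 60.0, 45.0, 30.0, 0.0])

# Passing degrees straight to np.cos treats them as radians
cos_wrong = np.cos(angles_deg)
# Converting first gives the cosine of the intended angles
cos_right = np.cos(np.deg2rad(angles_deg))

# A negative base with a non-integer exponent yields nan
with np.errstate(invalid="ignore"):
    power_wrong = np.power(cos_wrong, 1.7)
power_right = np.power(cos_right, 1.7)

print("without conversion:", power_wrong)
print("with conversion:   ", power_right)
```

This only explains the warning; whether the original (cos(x))^n model then fits the data well is a separate question.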

SciPy curve_fit with np.log returns immediately with popt = p0, pcov = inf

I'm trying to optimize a logarithmic fit to a data set with scipy.optimize.curve_fit. Before trying it on an actual data set, I wrote code to run on a dummy data set.
def do_fitting():
    x = np.linspace(0, 4, 100)
    y = func(x, 1.1, .4, 5)
    y2 = y + 0.2 * np.random.normal(size=len(x))
    popt, pcov = curve_fit(func, x, y2, p0=np.array([2, 0.5, 1]))
    plt.figure()
    plt.plot(x, y, 'bo', label="Clean Data")
    plt.plot(x, y2, 'ko', label="Fuzzed Data")
    plt.plot(x, func(x, *popt), 'r-', label="Fitted Curve")
    plt.legend()
    plt.show()
Of course, do_fitting() relies on func(), which it passes to curve_fit. Here's the problem. When I pass a func() that contains np.log, i.e. the function that I actually want to fit to, curve_fit declares that p0 (the initial condition) is the optimal solution and returns immediately with an infinite covariance.
Here's what happens if I run do_fitting() with a non-logarithmic func():
def func(x, a, b, c):
    return a * np.exp(x*b) + c
popt = [ 0.90894173 0.44279212 5.19928151]
pcov = [[ 0.02044817 -0.00471525 -0.02601574]
[-0.00471525 0.00109879 0.00592502]
[-0.02601574 0.00592502 0.0339901 ]]
Here's what happens when I run do_fitting() with a logarithmic func():
def func(x, a, b, c):
    return a * np.log(x*b) + c
popt = [ 2. 0.5 1. ]
pcov = inf
You'll notice that the logarithmic solution for popt is equal to the value I gave curve_fit for p0 in the above do_fitting(). This is true, and pcov is infinite, for every value of p0 I have tried.
What am I doing wrong here?
The problem is very simple - since the first value in your x array is 0, you are taking the log of 0, which is equal to -inf:
x = np.linspace(0, 4, 100)
p0 = np.array([2, 0.5, 1])
print(func(x, *p0).min())
# -inf
I was able to fit a logarithmic function just fine using the following code (barely modified from your original):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * np.log(x+b) + c
def do_fitting():
    x = np.linspace(0, 4, 100)
    y = func(x, 1.1, .4, 5)
    y2 = y + 0.2 * np.random.normal(size=len(x))
    popt, pcov = curve_fit(func, x, y2, p0=np.array([2, 0.5, 1]))
    plt.figure()
    plt.plot(x, y, 'bo', label="Clean Data")
    plt.plot(x, y2, 'ko', label="Fuzzed Data")
    plt.plot(x, func(x, *popt), 'r-', label="Fitted Curve")
    plt.legend()
    plt.show()
do_fitting()
(Unfortunately I can't post a picture of the final fit, but it agrees quite nicely with the clean data).
Likely your problem is not the logarithm itself, but some difficulty curve_fit is having with the specific function you're trying to fit. Can you edit your question to provide an example of the exact logarithmic function you're trying to fit?
EDIT: The function you provided is not well-defined for x=0, and produces a RuntimeWarning upon execution. curve_fit is not good at handling NaNs, and will not be able to fit the function in this case. If you change x to
x = np.linspace(1, 4, 100)
curve_fit performs just fine.
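A quick way to catch this kind of problem before calling curve_fit is to evaluate the model at the initial guess and check that every value is finite. The sketch below does this for the two x ranges discussed above; it is a diagnostic aid under the same func and p0 as in the question, not a replacement for the fit.

```python
import numpy as np

def func(x, a, b, c):
    return a * np.log(x * b) + c

p0 = np.array([2, 0.5, 1])

x_with_zero = np.linspace(0, 4, 100)     # includes x = 0, so log(0) occurs
x_without_zero = np.linspace(1, 4, 100)  # strictly positive

# Suppress the divide-by-zero warning so the check itself stays quiet
with np.errstate(divide="ignore"):
    start_values_bad = func(x_with_zero, *p0)
start_values_ok = func(x_without_zero, *p0)

print("all finite with x from 0:", np.isfinite(start_values_bad).all())
print("all finite with x from 1:", np.isfinite(start_values_ok).all())
```

If the first check reports False, the optimizer has no usable residuals to start from, which matches the popt = p0, pcov = inf symptom.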

scipy.optimize.curve_fit setting a "fixed" parameter

I'm using scipy.optimize.curve_fit to approximate peaks in my data with Gaussian functions. This works well for strong peaks, but it is more difficult with weaker peaks. However, I think fixing a parameter (say, width of the Gaussian) would help with this. I know I can set initial "estimates" but is there a way that I can easily define a single parameter without changing the function I'm fitting to?
If you want to "fix" a parameter of your fit function, you can just define a new fit function which makes use of the original fit function, yet setting one argument to a fixed value:
custom_gaussian = lambda x, mu: gaussian(x, mu, 0.05)
Here's a complete example of fixing sigma of a Gaussian function to 0.05 (instead of optimal value 0.1). Of course, this doesn't really make sense here because the algorithm has no problem in finding optimal values. Yet, you can see how mu is still found despite the fixed sigma.
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize
def gaussian(x, mu, sigma):
    return 1 / sigma / np.sqrt(2 * np.pi) * np.exp(-(x - mu)**2 / 2 / sigma**2)
# Create sample data
x = np.linspace(0, 2, 200)
y = gaussian(x, 1, 0.1) + np.random.rand(*x.shape) - 0.5
plt.plot(x, y, label="sample data")
# Fit with original fit function
popt, _ = scipy.optimize.curve_fit(gaussian, x, y)
plt.plot(x, gaussian(x, *popt), label="gaussian")
# Fit with custom fit function with fixed `sigma`
custom_gaussian = lambda x, mu: gaussian(x, mu, 0.05)
popt, _ = scipy.optimize.curve_fit(custom_gaussian, x, y)
plt.plot(x, custom_gaussian(x, *popt), label="custom_gaussian")
plt.legend()
plt.show()
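If several different fixed values are needed, the same idea can be wrapped in a small factory that returns a model with sigma baked in. This is only a sketch of the closure approach shown above; the helper name with_fixed_sigma, the random seed and the starting guess for mu are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, mu, sigma):
    return 1 / sigma / np.sqrt(2 * np.pi) * np.exp(-(x - mu)**2 / 2 / sigma**2)

def with_fixed_sigma(sigma):
    """Hypothetical helper: return a model of (x, mu) with sigma held fixed."""
    def model(x, mu):
        return gaussian(x, mu, sigma)
    return model

# Synthetic data: true mu = 1, true sigma = 0.1, uniform noise in [-0.5, 0.5)
rng = np.random.default_rng(0)
x = np.linspace(0, 2, 200)
y = gaussian(x, 1, 0.1) + rng.uniform(-0.5, 0.5, size=x.shape)

# Only mu is fitted; curve_fit sees a one-parameter function
fixed_model = with_fixed_sigma(0.1)
popt, pcov = curve_fit(fixed_model, x, y, p0=[0.9])
mu_fit = popt[0]
print("fitted mu with sigma fixed at 0.1:", mu_fit)
```

Because curve_fit infers the number of parameters from the function signature, the returned model needs explicit positional parameters, as it has here.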
Hopefully this is helpful. I had to use a hack, because curve_fit is pretty strict about what it accepts.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as pl
def exp1(t, a1, tau1):
    # a1 * exp(-t / tau1), zero for t <= 0
    return (a1 * np.exp(-t / tau1)) * np.heaviside(t, 0)
def wrapper(t, *args):
    # Rebuild the full parameter list: held parameters come from p0,
    # free parameters come from args, in order
    free = iter(args)
    params = [p0[i] if hold[i] else next(free) for i in range(len(hold))]
    return exp1(t, *params)
p0 = np.array([1.5, 500.])  # initial guesses for (a1, tau1)
hold = np.array([0, 1])     # 1 = keep that parameter fixed at its p0 value
p1 = np.delete(p0, 1)       # initial guesses for the free parameters only
timepoints = np.arange(0., 2000., 20.)
y = exp1(timepoints, 1, 1000) + np.random.normal(0, .1, size=len(timepoints))
popt, pcov = curve_fit(exp1, timepoints, y, p0=p0)
print('unheld parameters:', popt, pcov)
popt, pcov = curve_fit(wrapper, timepoints, y, p0=p1)
# Put the held values back so popt lines up with exp1's arguments
for i in range(0, len(hold)):
    if hold[i]:
        popt = np.insert(popt, i, p0[i])
yfit = exp1(timepoints, popt[0], popt[1])
pl.plot(timepoints, y, timepoints, yfit)
pl.show()
print('hold parameters:', popt, pcov)
