Finding the highest R^2 value - python

I'm new to Python and my problem is that I have a given set of data:
import numpy as np
x=np.arange(1,5)
y=np.arange(5,9)
My problem is to find a number n (not necessarily an integer) that gives the highest R^2 value for a linear fit when I plot y^n vs x. I'm thinking of generating candidate values of n, for example as:
n=np.linspace(1,9,100)
I don't know how to execute my idea. My other option is to resort to brute force: generate each n and raise y to it. After getting that value (let's say y1), I would plot y1 vs x (which means I would have to generate 100 plots). But I have no clue how to get the R^2 value (for a linear fit) for a given plot.
What I want to do is to have a list (or array) of R^2 values:
R2 = np.array([])  # the R^2 values calculated from the plots
and find the max value in that array and, from there, the plot (and thus the particular n) that gave that R^2 value. I don't know how to do this.

If you are able to use the pandas library, this problem is very easy to express:
import pandas
import numpy as np
x = pandas.Series(np.arange(1,5))
y = pandas.Series(np.arange(5,9))
exponents = np.linspace(1, 9, 100)
r2s = {n: pandas.ols(x=x, y=y**n).r2 for n in exponents}
max(r2s.items(), key=lambda x: x[1])
#>>> (1.0, 1.0)
Breaking this down:
the pandas.Series object is an indexed column of data. It's like a numpy array, but with extra features. In this case, we only care about it because it is something we can pass to pandas.ols.
pandas.ols is a basic implementation of least-squares regression (note that it was deprecated and removed in later pandas versions; statsmodels is the usual replacement). You can do this directly in numpy with numpy.linalg.lstsq, but it won't directly report the R-squared values for you. To do it with pure numpy, you'll need to get the sum of squared residuals from numpy's lstsq and then perform the formulaic calculation of R-squared manually. You could write this as a function for yourself (probably a good exercise).
The stuff inside the {..} is a dict comprehension. It will iterate over the desired exponents, perform the ols function for each, and report the .r2 attribute (where the R-squared statistic is stored) indexed by whatever exponent number was used to get it.
The final step is to call max on a sequence of the key-value pairs in r2s; the key argument tells max to compare elements by their second item (the R-squared).
Here is an example function that does it with just np.linalg.lstsq (a good exercise in calculating R2 with numpy):
def r2(x, y):
    # design matrix: x plus an intercept column
    x_with_intercept = np.vstack([x, np.ones(len(x))]).T
    coeffs, resid = np.linalg.lstsq(x_with_intercept, y, rcond=None)[:2]
    # R^2 = 1 - SS_res / SS_tot, with SS_tot = n * var(y)
    return 1 - resid[0] / (y.size * y.var())
Then the above approach in pure numpy:
import numpy as np
x = np.arange(1,5)
y = np.arange(5,9)
exponents = np.linspace(1, 9, 100)
r2s = {n: r2(x=x, y=y**n) for n in exponents}
max(r2s.items(), key=lambda x: x[1])
#>>> (1.0, 1.0)
As a final note, there is a fancier way to specify getting the item at position 1 from something. You use the standard-library module operator and the callable itemgetter:
max(..., key=operator.itemgetter(1))
The expression itemgetter(1) results in a callable object -- when it is called on an argument r, it invokes the __getitem__ protocol to produce r[1].
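For example (a tiny illustration with made-up pairs):
from operator import itemgetter
pairs = [('a', 0.2), ('b', 0.9)]
max(pairs, key=itemgetter(1))
#>>> ('b', 0.9)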

Related

Is there a method for finding the root of a function represented as an ndarray?

I have a function represented as ndarrays, i.e. y = f(x), where y and x are two ndarrays.
I am searching for a method that finds the roots of f(x).
Reading the SciPy documentation, I was only able to find methods that work on user-defined functions, like scipy.optimize.root_scalar. I thought about using scipy.interpolate.interp1d to get an interpolated version of my function to be used in scipy.optimize.root_scalar, but I'm not sure it would work and it seems pretty complicated.
Is there some other function that I can use instead?
You have to interpolate a function defined by numpy arrays, as all the solvers require a function that can return a value for any input x, not just those in your array. But this is not complicated; here is an example:
import numpy as np
from scipy import optimize
from scipy import interpolate
# our xs and ys
xs = np.array([0,2,5])
ys = np.array([-3,-1,2])
# interpolated function
f = interpolate.interp1d(xs, ys)
sol = optimize.root_scalar(f, bracket = [xs[0],xs[-1]])
print(f'root is {sol.root}')
# check
f0 = f(sol.root)
print(f'value of function at the root: f({sol.root})={f0}')
output:
root is 3.0
value of function at the root: f(3.0)=0.0
You may also want to interpolate with higher-degree polynomials for higher accuracy of your root-finding, e.g. see 'How to perform cubic spline interpolation in python?'.
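As a sketch of that idea, assuming the same toy data as above (scipy's CubicSpline exposes the roots of its piecewise polynomial directly):
import numpy as np
from scipy.interpolate import CubicSpline
xs = np.array([0, 2, 5])
ys = np.array([-3, -1, 2])
cs = CubicSpline(xs, ys)
print(cs.roots(extrapolate=False))  # roots restricted to the data range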

scipy.misc.derivative for uneven space points

I want to calculate the second derivative at each point (except the first and last) in a set of points. The set is stored as a dictionary, like points = {x1:y1, x2:y2, ..., xn:yn}, where all the x are positive integers but unevenly spaced, e.g. x1=1, x2=2, x3=3, x4=5, x5=7. The x values do not increase linearly and the gap can be random, i.e. x_{i+1} - x_{i} can be any positive integer.
For this dictionary of points, I want to get the second derivative at each point, so I wrote:
import numpy as np
from scipy.misc import derivative
def wrapper(x):
    return np.array([points[int(i)] for i in x])
y_d2 = derivative(wrapper, np.array(list(points.keys()))[1:-1], dx=1.0, n=2)
In this case, I get KeyError: 4 at return np.array([points[int(i)] for i in x]). This is because x=4 does not exist in the points dictionary. How can I use scipy.misc.derivative in this situation? How should I set the dx parameter (spacing) for scipy.misc.derivative?
Do you have to use scipy.misc.derivative? It is very easy to calculate second derivatives without it.
Let's say you have your data as a dictionary:
import numpy as np
points = {1:2, 2:2, 4:4, 5:5}
Then all you do is first get them into x,y lists:
x,y = list(points.keys()), list(points.values())
and then calculate the derivatives using numpy's diff:
dy_dx = np.diff(y)/np.diff(x)
d2y_dx2 = np.diff(dy_dx)/np.diff(x[:-1])
output for d2y_dx2 is
array([1., 0.])
as expected.
Of course there are more sophisticated versions of this if you want higher-accuracy formulas for the derivatives; e.g. you can create a spline from your x, y and calculate derivatives of the spline (see the sketch below). But I would start with the basic scheme above unless there are compelling reasons for anything else.
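For instance, a minimal sketch of the spline variant, assuming the same toy dictionary as above (a CubicSpline can be evaluated at any derivative order):
import numpy as np
from scipy.interpolate import CubicSpline
points = {1:2, 2:2, 4:4, 5:5}
x = np.array(list(points.keys()), dtype=float)
y = np.array(list(points.values()), dtype=float)
cs = CubicSpline(x, y)
d2 = cs(x, 2)  # second derivative of the spline at each data point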

Program to fit a hyperbola to linear data using least squares (Levenberg-Marquardt algorithm) not working as expected

I have a 1D array data which I am trying to model as a hyperbola using three parameters. I am trying to implement the Levenberg-Marquardt algorithm using the leastsq function from the scipy.optimize library. However, my program gets stuck at an iteration where a number is divided by zero, and I don't understand why.
Some background: The 1D array data are basically lacunarity values for different box sizes. I've generated the lacunarity data from some sound files, the context to which can be found here.
In the algorithm, the least squares function takes three inputs:
(a) initial guess for the three parameters
(b) the x coordinate for the least squares problem - that's basically a 1D array of integers from 1 to 100 in my problem
(c) the y coordinate for the least squares problem - this is the 1D array that stores the lacunarity values. So the lacunarity values are a function of x, where x varies from 1 to 100.
The hyperbola is modeled using three parameters a, b and c as y = a/x^b + c.
The code gives the following error:
"OverflowError: cannot convert float infinity to integer"
The code:
#import
from scipy import *
from scipy.optimize import leastsq
import matplotlib.pylab as plt
import numpy as np
import codecs, json
from math import *
# Define your function to calculate the residuals.
# The fitting function holds your parameter values.
def residuals(p, y, x):
    err = y - pval(x, p)
    return err

def pval(x, p):
    z = x
    for i in range(100):
        print(x)
        print(x[i]**p[1])
        z[i] = p[0]/(x[i]**p[1]) + p[2]
    return z
#read in your data
obj_text = codecs.open('textfiles\CC1.json', 'r', encoding='utf-8').read()
b_new = json.loads(obj_text)
data = np.array(b_new)
x = np.arange(1,101)
y = data[1:101]
#guess at initial parameters
A1_0=1.0
A2_0=1.0
A3_0=0.5
#leastsq package calls the Levenberg-Marquardt algorithm
pname = (['A1','A2','A3'])
p0 = array([A1_0 , A2_0, A3_0])
plsq = leastsq(residuals, p0, args=(y, x), maxfev=2000)
# Now, plot your data
plt.plot(x,y,'xo',x,pval(x,plsq[0]),'x')
title('Least-squares fit to data')
xlabel('x')
ylabel('y')
legend(['Data', 'Fit'],loc=4)
# Your best-fit paramters are kept within plsq[0].
print(plsq[0])
According to the error, the value of x changes to 0 at some point during the iterations, and the first parameter a ends up being divided by zero, which gives the error.
To troubleshoot, I printed the values x[i]^b and the array x while executing the code, and you can see the values here. I see that the array x is getting modified, which shouldn't happen: x should remain a 1D array of natural numbers from 1 to 100 and not be modified during the iterations. I couldn't identify where exactly the code modifies the array x.
I expect the array x to remain unchanged and the code to print the final three values of the parameters a, b and c.
EDIT: I made some changes to my code after which it worked successfully. Following are those edits, in case anyone is interested:
Did not define z as z = x, but rather as z = np.arange(1,101). As a result, the array x no longer changed, which is what was expected. (z = x does not copy a NumPy array; it makes z an alias for the same array object, so writing to z[i] also modified x. See the short illustration after this list.)
Changed the datatype of arrays x and y to float using
x = np.array(x, dtype=np.float64)
I got stuck once more, at the piece of code which plots the data: I got errors like "name 'title' is not defined", and similar errors for xlabel and ylabel. So I just removed those lines and stuck with
plt.plot(x,y,'red',x,pval(x,plsq[0]),'blue')
plt.show()
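For reference, the aliasing behind the first edit in a nutshell (a minimal illustration, not part of the original program):
import numpy as np
x = np.arange(1, 101)
z = x           # no copy: z and x are the same array object
z[0] = 999      # this also changes x[0]
z = x.copy()    # an independent copy; writing to z leaves x untouched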
Not a direct answer to your question, but since you're using exponentiation (**), I strongly recommend that you convert all your numbers to Decimal beforehand, in order to avoid the precision-loss inherent in floating-point arithmetic on large values.
For example:
from decimal import Decimal, getcontext
getcontext().prec = 100
A1_0 = Decimal("1.0")
A2_0 = Decimal("1.0")
A3_0 = Decimal("0.5")
x = [Decimal(f) for f in x]
y = [Decimal(f) for f in y]
Perhaps your zero will turn out to be a small value close to zero...

Minimize n functions

I have two functions whose distance (y_1 - y_2) I need to minimize in order to obtain the best factor between the two (dfactor), so I can plot them together and fit them as well as possible. The difference from the examples in the documentation is that, in this case, I have n points where I can compute the difference, and thus n functions to minimize. With scipy.optimize.minimize_scalar I use the following syntax:
def chi(dfactor):
    for i in range(0, n):
        return abs(dfactor*y_1[i] - y_2[i])
res = minimize_scalar(chi)
print(res.x)
Now res.x is not the factor that best fits the two plots. I would expect to get an array of n elements, all very similar, from which I can obtain the single dfactor that I need. But I am not sure minimize_scalar works like this.
See the desired result, where I compute the difference between the red dots and the corresponding points in the blue plot (represented here as lines, as it is a spectrum) in order to overplot them as nicely as possible.
Several return statements in a single function don't accumulate results: a return exits the function immediately, so your loop returns on its first iteration. Instead, you need to return a single aggregate of the errors, such as the mean squared error:
# Convert to numpy arrays
y_1 = np.asarray(y_1)
y_2 = np.asarray(y_2)
def chi(dfactor):
residual = dfactor * y_1 - y_2
return np.mean(residual ** 2)
res = minimize_scalar(chi)
print(res.x)
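As a side note, for this particular objective the optimal scale factor also has a closed form, which you can use to sanity-check the minimizer (a sketch, using the same y_1 and y_2 as above):
# setting the derivative of mean((dfactor*y_1 - y_2)**2) with respect
# to dfactor to zero gives the least-squares solution directly:
dfactor_exact = np.dot(y_1, y_2) / np.dot(y_1, y_1)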

Sample a truncated integer power law in Python?

What function can I use in Python if I want to sample a truncated integer power law?
That is, given two parameters a and m, generate a random integer x in the range [1,m) that follows a distribution proportional to 1/x^a.
I've been searching around numpy.random, but I haven't found this distribution.
AFAIK, neither NumPy nor SciPy defines this distribution for you. However, using SciPy it is easy to define your own discrete distribution with scipy.stats.rv_discrete:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
def truncated_power_law(a, m):
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))
a, m = 2, 10
d = truncated_power_law(a=a, m=m)
N = 10**4
sample = d.rvs(size=N)
plt.hist(sample, bins=np.arange(m)+0.5)
plt.show()
I don't use Python, so rather than risk syntax errors I'll describe the solution algorithmically: a brute-force discrete inversion. It should translate quite easily into Python (see the sketch after the steps below). I'm assuming 0-based indexing for the array.
Setup:
Generate an array cdf of size m with cdf[0] = 1 as the first entry, cdf[i] = cdf[i-1] + 1/(i+1)**a for the remaining entries.
Scale all entries by dividing cdf[m-1] into each -- now they actually are CDF values.
Usage:
Generate your random values by generating a Uniform(0,1) and searching through cdf[] until you find an entry greater than your uniform. Return the index + 1 as your x-value.
Repeat for as many x-values as you want.
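A rough Python translation of those steps (my sketch, not the original author's code; np.searchsorted performs the search step in one call):
import numpy as np
def sample_truncated_power_law(a, m, size):
    # Setup: cumulative sums of 1/x**a for x = 1..m, scaled by the last entry
    cdf = np.cumsum(1.0 / np.arange(1, m + 1) ** a)
    cdf /= cdf[-1]
    # Usage: invert Uniform(0,1) draws; index of the first CDF entry >= u, plus 1
    u = np.random.uniform(size=size)
    return np.searchsorted(cdf, u) + 1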
For instance, with a,m = 2,10, I calculate the probabilities directly as:
[0.6452579827864142, 0.16131449569660355, 0.07169533142071269, 0.04032862392415089, 0.02581031931145657, 0.017923832855178172, 0.013168530260947229, 0.010082155981037722, 0.007966147935634743, 0.006452579827864143]
and the CDF is:
[0.6452579827864142, 0.8065724784830177, 0.8782678099037304, 0.9185964338278814, 0.944406753139338, 0.9623305859945162, 0.9754991162554634, 0.985581272236501, 0.9935474201721358, 1.0]
When generating, if I got a Uniform outcome of 0.90 I would return x=4 because 0.918... is the first CDF entry larger than my uniform.
If you're worried about speed you could build an alias table, but with a geometric decay the probability of early termination of a linear search through the array is quite high. With the given example, for instance, you'll terminate on the first comparison almost 2/3 of the time.
Use numpy.random.zipf and just reject any samples greater than or equal to m.
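A minimal rejection-sampling sketch of that idea (my code, assuming numpy's zipf, which requires a > 1):
import numpy as np
def truncated_zipf(a, m, size):
    # draw from the untruncated Zipf(a) and keep only samples < m
    out = np.empty(0, dtype=int)
    while out.size < size:
        draws = np.random.zipf(a, size)
        out = np.concatenate([out, draws[draws < m]])
    return out[:size]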
