Related
Good day!
I am using python v3.8 and I want to calculate the value range (minimum and maximum) of a given formula with (also given) bounds e.g.:
formula1: a * sqrt(b/c)
formula2: a^2 * b/1000 + 3 * (a+b)
formula3: (1/(2 * PI * (a * 1000) * (b * 1000)) * 10^12
..with a=[0,5], b=[10,20], c=[30,40]
I am not too familiar with scipy, numpy, sympy.. and I wonder if there is an "easy" way to just calculate the formula with different values, write it into an array and get the min/max from that? The problem I have with "writing into an array and get min/max" is, that there are some bounds [-100000, 100000] with given float numbers and that would generate way too much values.
I do not need the information with which values the minimum/maximum was reached, but only which min/max values can be reached.
Try SymPy solvers, they have solveset that can do the following:
>>> solveset(Eq(x**2, 1), x)
{-1, 1}
Symbolic:
Maximum and minimum points are called critical points. To find a critical point you need to a partial derivative respect every variable (f'_a = df/da, f'_b = df/db, f'_c = df/dc), and solve the system of equations where all of them are equal to 0. We can do this with sympy.
import sympy as sp
a = sp.Symbol('a', real=True)
b = sp.Symbol('b', real=True)
c = sp.Symbol('c', real=True)
functions = [
a * sp.sqrt(b / c),
a**2 * b/1000 + 3*(a+b),
10**6 / (2 * sp.pi * a * b),
]
for f in functions:
f_a, f_b, f_c = f.diff(a), f.diff(b), f.diff(c)
print(f"f(a, b, c) = {f}")
print(f"f'_a(a, b, c) = {f_a} = 0")
print(f"f'_b(a, b, c) = {f_b} = 0")
print(f"f'_c(a, b, c) = {f_c} = 0")
print("Critical points:", sp.solve([f_a, f_b, f_c], a, b, c))
print()
As you can see if you execute this code, there is no critical point, so no absolute maximum nor minimum value for any of these funtions in the Real domain (there are two critical points for the second equation in the Imaginary domain).
Numeric approach:
Using numpy and pandas we can create a matrix with all the possible combinations and then apply each of our functions. The maximum and minimum values are per column, they are not related between the columns. As expected, the max and min values for the a, b and c columns are the range lower and upper bounds.
import pandas as pd
from pandas.core.reshape.util import cartesian_product
import numpy as np
# Number of values per range
N = 21
# Functions
functions = [
lambda row: row['a'] * np.sqrt(row['b'] / row['c']),
lambda row: row['a']**2 * row['b']/1000 + 3*(row['a']+row['b']),
lambda row: 10**6 / (2 * np.pi * row['a'] * row['b']),
]
# Lower and upper bounds
a_lower, a_upper = 0, 5
b_lower, b_upper = 10, 20
c_lower, c_upper = 30, 40
def min_max(col):
return pd.Series(index=['min', 'max'], data=[col.min(), col.max()])
values = [
np.linspace(a_lower, a_upper, N),
np.linspace(b_lower, b_upper, N),
np.linspace(c_lower, c_upper, N),
]
df = pd.DataFrame(cartesian_product(values), index=['a', 'b', 'c']).T
for i, f in enumerate(functions):
df[f'f_{i + 1}'] = df.apply(f, axis=1)
print(df.apply(min_max))
Output:
a b c f_1 f_2 f_3
min 0.0 10.0 30.0 0.000000 30.0 1591.549431
max 5.0 20.0 40.0 4.082483 75.5 inf
N = 101 has exactly the same output (it takes a bit to process as it has to compute the formulas 101^3 > 1M times)
I have the following two pandas dataframes: new & outcome
new = pd.DataFrame([[5,5,1.6],[0.22,0.22,0.56]]).T
new.index = ['Visitor','Draw','Home']
new.columns = ['Decimal odds', 'Win prob']
new['Bet amount'] = np.zeros((len(new),1))
With output:
Decimal odds Win prob Bet amount
Visitor 5.0 0.22 0.0
Draw 5.0 0.22 0.0
Home 1.6 0.56 0.0
And dataframe 'outcome'
outcome = pd.DataFrame([[0.22,0.22,0.56],[100,100,100]]).T
outcome.index = ['Visitor win','Draw','Home win']
outcome.columns = ['Prob.','Starting bankroll']
outcome['Wins'] = ((new['Decimal odds'] - 1) * new['Bet amount']).values
outcome['Losses'] = [sum(new['Bet amount'][[1,2]]) , sum(new['Bet amount'][[0,2]]), sum(new['Bet amount'][[0,1]])]
outcome['Ending bankroll'] = outcome['Starting bankroll'] + outcome['Wins'] - outcome['Losses']
outcome['Logarithm'] = np.log(outcome['Ending bankroll'])
With output:
Prob. Starting bankroll Wins Losses Ending bankroll Logarithm
Visitor win 0.22 100.0 0.0 0.0 100.0 4.60517
Draw 0.22 100.0 0.0 0.0 100.0 4.60517
Home win 0.56 100.0 0.0 0.0 100.0 4.60517
Hereby the objective is calculated by the formula below:
objective = sum(outcome['Prob.'] * outcome['Logarithm'])
Now I want to maximize the objective by the values contained in column `new['Bet amount']. The constraints are that a, b, and c are bounded between 0 and 100. Also the summation of a, b and c must be below 100. Reason is that a,b,c resemble the ratio of your bankroll that is used to place a sports bet.
Want to achieve this using the scipy library. My code so far looks like:
from scipy.optimize import minimize
prob = new['Win prob']
decimal = new['Decimal odds']
bank = outcome['Starting bankroll'][0]
def constraint1(bet):
a,b,c = bet
return 100 - a + b + c
con1 = {'type': 'ineq', 'fun': constraint1}
cons = [con1]
b0, b1, b2 = (0,100), (0,100), (0,100)
bnds = (b0, b1, b2)
def f(bet, sign = -1):
global prob, decimal, bank
p0,p1,p2 = prob
d0,d1,d2 = decimal
a,b,c = bet
wins0 = a * (d0-1)
wins1 = b * (d1-1)
wins2 = c * (d2-1)
loss0 = b + c
loss1 = a + c
loss2 = a + b
log0 = np.log(bank + wins0 - loss0)
log1 = np.log(bank + wins1 - loss1)
log2 = np.log(bank + wins2 - loss2)
objective = (log0 * p0 + log1 * p1 + log2 * p2)
return sign * objective
bet = [5,8,7]
result = minimize(f, bet, method = 'SLSQP', bounds = bnds, constraints = cons)
This however, does not result in the desired result. Desired result would be:
a = 3.33
b = 3.33
c = 0
My question is also how to set the method and initial values? Results seem to differ a lot by assigning different method's and initial values for the bets.
Any help would be greatly appreciated!
(This is an example posted on the pinnacle website: https://www.pinnacle.com/en/betting-articles/Betting-Strategy/the-real-kelly-criterion/HZKJTFCB3KNYN9CJ)
If you print out the "bet" values inside your function, you can see where it's going wrong.
[5. 8. 7.]
[5.00000001 8. 7. ]
[5. 8.00000001 7. ]
[5. 8. 7.00000001]
[5.00040728 7.9990977 6.99975556]
[5.00040729 7.9990977 6.99975556]
[5.00040728 7.99909772 6.99975556]
[5.00040728 7.9990977 6.99975558]
[5.00244218 7.99458802 6.99853367]
[5.0024422 7.99458802 6.99853367]
The algorithm is trying to optimize the formula with very small adjustments relative to your initial values, and it never adjusts enough to get to the values you're looking for.
If you check scipy webpage, you find https://docs.scipy.org/doc/scipy/reference/optimize.minimize-slsqp.html#optimize-minimize-slsqp
eps float
Step size used for numerical approximation of the Jacobian.
result = minimize(f, bet, method='SLSQP', bounds=bnds, constraints=cons,
options={'maxiter': 100, 'ftol': 1e-06, 'iprint': 1, 'disp': True,
'eps': 1.4901161193847656e-08, 'finite_diff_rel_step': None})
So you're starting off with a step size of 1.0e-08, so your initial estimates are off by many orders of magnitude outside the range where the algorithm is going to be looking.
I'd recommend normalizing your bets to values between zero and 1. So instead of saying I'm placing a bet between 0 and 100, just say you're wagering a fraction of your net wealth between 0 and 1. A lot of algorithms are designed to work with standardized inputs (between 0 and 1) or normalized inputs (standard deviations from the mean).
Also, it looks like :
def constraint1(bet):
a,b,c = bet
return 100 - a + b + c
should be:
def constraint1(bet):
a,b,c = bet
return 100 - (a + b + c)
but I don't think that impacts your results
Assuming we have the below function to optimize for 4 parameters, we have to write the function as below, but if we want the same function with more number of parameters, we have to rewrite the function definition.
def radius (z,a0,a1,k0,k1,):
k = np.array([k0,k1,])
a = np.array([a0,a1,])
w = 1.0
phi = 0.0
rs = r0 + np.sum(a*np.sin(k*z +w*t +phi), axis=1)
return rs
The question is if this can be done easier in a more automatic way, and more intuitive than this question suggests.
example would be as following which has to be written by hand.
def radius (z,a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,k0,k1,k2,k3,k4,k5,k6,k7,k8,k9,):
k = np.array([k0,k1,k2,k3,k4,k5,k6,k7,k8,k9,])
a = np.array([a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,])
w = 1.0
phi = 0.0
rs = r0 + np.sum(a*np.sin(k*z +w*t +phi), axis=1)
return rs
I'm tring to approximate an empirical cumulative distribution function (ECDF I want to approximate) with a smooth function (with less than 5 parameter) such as the generalized logistic function.
However, using scipy.optimize.curve_fit, the fitting operation gives really bad approximations or it doesn't work at all (depending on the initial values). The variable series represents my data stored as pandas.Series.
from scipy.optimize import curve_fit
def fit_ecdf(x):
x = np.sort(x)
def result(v):
return np.searchsorted(x, v, side='right') / x.size
return result
ecdf = fit_ecdf(series)
def genlogistic(x, B, M, Q, v):
return 1 / (1 + Q * np.exp(-B * (x - M))) ** (1 / v)
params = curve_fit(genlogistic, xdata = series, ydata = ecdf(series), p0 = (0.1, 10.0, 0.1, 0.1))[0]
Should I use another type of function for the fit?
Are there any code mistakes?
UPDATE - 1
As asked, I link to a csv containing the data.
UPDATE - 2
After a lot of search and trial and error I find out this function
f(x; a, b, c) = 1 - 1 / (1 + (x / b) ** a) ** c
with a = 4.61320000, b = 2.94570952, c = 0.5886922
which fits a lot better than the other one. The only problem is the little step that the ECDF shows near x=1. How can I modify f to improve the quality of the fit? I was thinking of adding some sort of function that is "relevant" only in those kind of points. Here are the graphical results of the fit where the solid blue line is the ECDF and the dotted line represents the (x, f(x)) points.
I find out how to deal with that little step near x=1. As expressed in the question, adding some sort of function that is significant only in that interval was the game changer.
The "step" ends at about (1.7, 0.04) so I needed a sort of function that flattens for x > 1.7 and has y = 0.04 as asymptote. The natural choice (just to stay on point) was to take a function like f(x) = 1/exp(x).
Thanks to JamesPhillips, I also picked up the proper data for the regression (no double values = no overweighted points).
Python Code
from scipy.optimize import curve_fit
def fit_ecdf(x):
x = np.sort(x)
def result(v):
return np.searchsorted(x, v, side = 'right') / x.size
return result
ecdf = fit_ecdf(series)
unique_series = series.unique().tolist()
def cdf_interpolation(x, a, b, c, d):
f_1 = 0.95 + (0 - 0.95) / (1 + (x / b) ** a) ** c + 0.05
f_2 = (0 - 0.05)/(np.exp(d * x))
return f_1 + f_2
params = curve_fit(cdf_interpolation,
xdata = unique_series ,
ydata = ecdf(unique_series),
p0 = (6.0, 3.0, 0.4, 1.0))[0]
Parameters
a = 6.03256462
b = 2.89418871
c = 0.42997956
d = 1.06864006
Graphical results
I got an OK fit for a 5-parameter logistic equation (see image and code) using unique values, not sure if the low end curve is sufficient for your needs, please check.
import numpy as np
def Sigmoidal_FiveParameterLogistic_model(x_in): # from zunzun.com
# coefficients
a = 9.9220221252324947E-01
b = -3.1572339989462903E+00
c = 2.2303376075685142E+00
d = 2.6271495036080207E-02
f = 3.4399008905318986E+00
return d + (a - d) / np.power(1.0 + np.power(x_in / c, b), f)
I need to calculate binomial confidence intervals for large set of data within a script of python. Do you know any function or library of python that can do this?
Ideally I would like to have a function like this http://statpages.org/confint.html implemented on python.
Thanks for your time.
Just noting because it hasn't been posted elsewhere here that statsmodels.stats.proportion.proportion_confint lets you get a binomial confidence interval with a variety of methods. It only does symmetric intervals, though.
I would say that R (or another stats package) would probably serve you better if you have the option. That said, if you only need the binomial confidence interval you probably don't need an entire library. Here's the function in my most naive translation from javascript.
def binP(N, p, x1, x2):
p = float(p)
q = p/(1-p)
k = 0.0
v = 1.0
s = 0.0
tot = 0.0
while(k<=N):
tot += v
if(k >= x1 and k <= x2):
s += v
if(tot > 10**30):
s = s/10**30
tot = tot/10**30
v = v/10**30
k += 1
v = v*q*(N+1-k)/k
return s/tot
def calcBin(vx, vN, vCL = 95):
'''
Calculate the exact confidence interval for a binomial proportion
Usage:
>>> calcBin(13,100)
(0.07107391357421874, 0.21204372406005856)
>>> calcBin(4,7)
(0.18405151367187494, 0.9010086059570312)
'''
vx = float(vx)
vN = float(vN)
#Set the confidence bounds
vTU = (100 - float(vCL))/2
vTL = vTU
vP = vx/vN
if(vx==0):
dl = 0.0
else:
v = vP/2
vsL = 0
vsH = vP
p = vTL/100
while((vsH-vsL) > 10**-5):
if(binP(vN, v, vx, vN) > p):
vsH = v
v = (vsL+v)/2
else:
vsL = v
v = (v+vsH)/2
dl = v
if(vx==vN):
ul = 1.0
else:
v = (1+vP)/2
vsL =vP
vsH = 1
p = vTU/100
while((vsH-vsL) > 10**-5):
if(binP(vN, v, 0, vx) < p):
vsH = v
v = (vsL+v)/2
else:
vsL = v
v = (v+vsH)/2
ul = v
return (dl, ul)
While the scipy.stats module has a method .interval() to compute the equal tails confidence, it lacks a similar method to compute the highest density interval. Here is a rough way to do it using methods found in scipy and numpy.
This solution also assumes you want to use a Beta distribution as a prior. The hyper-parameters a and b are set to 1, so that the default prior is a uniform distribution between 0 and 1.
import numpy
from scipy.stats import beta
from scipy.stats import norm
def binomial_hpdr(n, N, pct, a=1, b=1, n_pbins=1e3):
"""
Function computes the posterior mode along with the upper and lower bounds of the
**Highest Posterior Density Region**.
Parameters
----------
n: number of successes
N: sample size
pct: the size of the confidence interval (between 0 and 1)
a: the alpha hyper-parameter for the Beta distribution used as a prior (Default=1)
b: the beta hyper-parameter for the Beta distribution used as a prior (Default=1)
n_pbins: the number of bins to segment the p_range into (Default=1e3)
Returns
-------
A tuple that contains the mode as well as the lower and upper bounds of the interval
(mode, lower, upper)
"""
# fixed random variable object for posterior Beta distribution
rv = beta(n+a, N-n+b)
# determine the mode and standard deviation of the posterior
stdev = rv.stats('v')**0.5
mode = (n+a-1.)/(N+a+b-2.)
# compute the number of sigma that corresponds to this confidence
# this is used to set the rough range of possible success probabilities
n_sigma = numpy.ceil(norm.ppf( (1+pct)/2. ))+1
# set the min and max values for success probability
max_p = mode + n_sigma * stdev
if max_p > 1:
max_p = 1.
min_p = mode - n_sigma * stdev
if min_p > 1:
min_p = 1.
# make the range of success probabilities
p_range = numpy.linspace(min_p, max_p, n_pbins+1)
# construct the probability mass function over the given range
if mode > 0.5:
sf = rv.sf(p_range)
pmf = sf[:-1] - sf[1:]
else:
cdf = rv.cdf(p_range)
pmf = cdf[1:] - cdf[:-1]
# find the upper and lower bounds of the interval
sorted_idxs = numpy.argsort( pmf )[::-1]
cumsum = numpy.cumsum( numpy.sort(pmf)[::-1] )
j = numpy.argmin( numpy.abs(cumsum - pct) )
upper = p_range[ (sorted_idxs[:j+1]).max()+1 ]
lower = p_range[ (sorted_idxs[:j+1]).min() ]
return (mode, lower, upper)
Just been trying this myself. If it helps here's my solution, which takes two lines of code and seems to give equivalent results to that JS page. This is the frequentist one-sided interval, I'm calling the input argument the MLE (maximum likelihood estimate) of the binomial parameter theta. I.e. mle = number of successes/number of trials. I find the upper bound of the one sided interval. The alpha value used here is therefore double the one in the JS page for the upper limit.
from scipy.stats import binom
from scipy.optimize import bisect
def binomial_ci( mle, N, alpha=0.05 ):
"""
One sided confidence interval for a binomial test.
If after N trials we obtain mle as the proportion of those
trials that resulted in success, find c such that
P(k/N < mle; theta = c) = alpha
where k/N is the proportion of successes in the set of trials,
and theta is the success probability for each trial.
"""
to_minimise = lambda c: binom.cdf(mle*N,N,c)-alpha
return bisect(to_minimise,0,1)
To find the two sided interval, call with (1-alpha/2) and alpha/2 as arguments.
The following gives exact (Clopper-Pearson) interval for binomial distribution in a simple way.
def binomial_ci(x, n, alpha=0.05):
#x is number of successes, n is number of trials
from scipy import stats
if x==0:
c1 = 0
else:
c1 = stats.beta.interval(1-alpha, x,n-x+1)[0]
if x==n:
c2=1
else:
c2 = stats.beta.interval(1-alpha, x+1,n-x)[1]
return c1, c2
You may check the code by e.g.:
p1,p2 = binomial_ci(2,7)
from scipy import stats
assert abs(stats.binom.cdf(1,7,p1)-.975)<1E-5
assert abs(stats.binom.cdf(2,7,p2)-.025)<1E-5
assert abs(binomial_ci(0,7, alpha=.1)[0])<1E-5
assert abs((1-binomial_ci(0,7, alpha=.1)[1])**7-0.05)<1E-5
assert abs(binomial_ci(7,7, alpha=.1)[1]-1)<1E-5
assert abs((binomial_ci(7,7, alpha=.1)[0])**7-0.05)<1E-5
I used the relation between the binomial proportion confidence interval and the regularized incomplete beta function, as described here:
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper%E2%80%93Pearson_interval
I needed to do this as well. I was using R and wanted to learn a way to work it out for myself. I would not say it is strictly pythonic.
The docstring explains most of it. It assumes you have scipy installed.
def exact_CI(x, N, alpha=0.95):
"""
Calculate the exact confidence interval of a proportion
where there is a wide range in the sample size or the proportion.
This method avoids the assumption that data are normally distributed. The sample size
and proportion are desctibed by a beta distribution.
Parameters
----------
x: the number of cases from which the proportion is calulated as a positive integer.
N: the sample size as a positive integer.
alpha : set at 0.95 for 95% confidence intervals.
Returns
-------
The proportion with the lower and upper confidence intervals as a dict.
"""
from scipy.stats import beta
x = float(x)
N = float(N)
p = round((x/N)*100,2)
intervals = [round(i,4)*100 for i in beta.interval(alpha,x,N-x+1)]
intervals.insert(0,p)
result = {'Proportion': intervals[0], 'Lower CI': intervals[1], 'Upper CI': intervals[2]}
return result
A numpy/scipy-free way of computing the same thing using the Wilson score and an approximation to the normal cumulative density function,
import math
def binconf(p, n, c=0.95):
'''
Calculate binomial confidence interval based on the number of positive and
negative events observed.
Parameters
----------
p: int
number of positive events observed
n: int
number of negative events observed
c : optional, [0,1]
confidence percentage. e.g. 0.95 means 95% confident the probability of
success lies between the 2 returned values
Returns
-------
theta_low : float
lower bound on confidence interval
theta_high : float
upper bound on confidence interval
'''
p, n = float(p), float(n)
N = p + n
if N == 0.0: return (0.0, 1.0)
p = p / N
z = normcdfi(1 - 0.5 * (1-c))
a1 = 1.0 / (1.0 + z * z / N)
a2 = p + z * z / (2 * N)
a3 = z * math.sqrt(p * (1-p) / N + z * z / (4 * N * N))
return (a1 * (a2 - a3), a1 * (a2 + a3))
def erfi(x):
"""Approximation to inverse error function"""
a = 0.147 # MAGIC!!!
a1 = math.log(1 - x * x)
a2 = (
2.0 / (math.pi * a)
+ a1 / 2.0
)
return (
sign(x) *
math.sqrt( math.sqrt(a2 * a2 - a1 / a) - a2 )
)
def sign(x):
if x < 0: return -1
if x == 0: return 0
if x > 0: return 1
def normcdfi(p, mu=0.0, sigma2=1.0):
"""Inverse CDF of normal distribution"""
if mu == 0.0 and sigma2 == 1.0:
return math.sqrt(2) * erfi(2 * p - 1)
else:
return mu + math.sqrt(sigma2) * normcdfi(p)
Astropy provides such a function (although installing and importing astropy may be a bit excessive):
astropy.stats.binom_conf_interval
I am not an expert on statistics, but binomtest is built into SciPy and produces the same results as the accepted answer:
from scipy.stats import binomtest
binomtest(13, 100).proportion_ci()
Out[11]: ConfidenceInterval(low=0.07107304618545972, high=0.21204067708744978)
binomtest(4, 7).proportion_ci()
Out[25]: ConfidenceInterval(low=0.18405156764007, high=0.9010117215575631)
It uses Clopper-Pearson exact method by default, which matches Curt's accepted answer, which gives these values, for comparison:
Usage:
>>> calcBin(13,100)
(0.07107391357421874, 0.21204372406005856)
>>> calcBin(4,7)
(0.18405151367187494, 0.9010086059570312)
It also has options for Wilson's method, with or without continuity correction, which matches TheBamf's astropy answer:
binomtest(4, 7).proportion_ci(method='wilson')
Out[32]: ConfidenceInterval(low=0.2504583645276572, high=0.8417801447485302)
binom_conf_interval(4, 7, 0.95, interval='wilson')
Out[33]: array([0.25045836, 0.84178014])
This also matches R's binom.test and statsmodels.stats.proportion.proportion_confint, according to cxrodgers' comment:
For 30 successes in 60 trials, both R's binom.test and statsmodels.stats.proportion.proportion_confint give (.37, .63) using Klopper-Pearson.
binomtest(30, 60).proportion_ci(method='exact')
Out[34]: ConfidenceInterval(low=0.3680620319424367, high=0.6319379680575633)