How can I create a function from this data? - python

I have a dataset in the form of a table:
Score Percentile
381 1
382 2
383 2
...
569 98
570 99
The complete table is here as a Google spreadsheet.
Currently, I am computing a score and then doing a lookup on this dataset (table) to find the corresponding percentile rank.
Is it possible to create a function to calculate the corresponding percentile rank for a given score using a formula instead of looking it up in the table?

It's impossible to recreate the exact function that generated a given table of data when no information is provided about the process behind that data.
That being said, we can speculate a little.
Since it's a "percentile" function, it probably represents the cumulative value of a probability distribution of some sort. A very common probability distribution is the normal distribution, whose "cumulative" counterpart (i.e. its integral) is the so called "error function" ("erf").
In fact, your tabulated data looks a lot like an error function for a variable whose average value is 473.09:
[plot: your dataset in orange; the fitted error function (erf) in blue]
However, the agreement is not perfect and that could be because of three reasons:
the fitting procedure I've used to generate the parameters for the error function didn't use the right constraints (because I have no idea what I'm modelling!)
your dataset doesn't represent an exact normal distribution, but rather real world data whose underlying distribution is the normal distribution. The features of your sample data that deviate from the model are being ignored altogether.
the underlying distribution is not a normal distribution at all, its integral just happens to look like the error function by chance.
There is literally no way for me to tell!
If you want to use this function, this is its definition:
import numpy as np
from scipy.special import erf

def fitted_erf(x):
    c = 473.09090474
    w = 37.04826334
    return 50 + 50*erf((x - c)/(w*np.sqrt(2)))
Tests:
In [2]: fitted_erf(439) # 17 from the table
Out[2]: 17.874052406601457
In [3]: fitted_erf(457) # 34 from the table
Out[3]: 33.20270318344252
In [4]: fitted_erf(474) # 51 from the table
Out[4]: 50.97883169390196
In [5]: fitted_erf(502) # 79 from the table
Out[5]: 78.23955071273468
However, I'd strongly advise you to check whether a fitted function, made without any knowledge of your data source, is the right tool for your task.
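If you'd rather stay faithful to the table itself, plain linear interpolation between the tabulated points is a simple alternative that makes no assumption about the underlying distribution. A minimal sketch, assuming the same 'table.csv' export mentioned in the P.S. below (the helper name table_percentile is just for illustration):
import numpy as np

tab = np.genfromtxt('table.csv', delimiter=',', skip_header=1)

def table_percentile(score):
    # linearly interpolate the percentile between the two nearest tabulated scores
    return np.interp(score, tab[:, 0], tab[:, 1])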
P.S.
In case you're interested, this is the code used to obtain the parameters:
import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit

# using a 'table.csv' file generated by Google Spreadsheets
tab = np.genfromtxt('table.csv', delimiter=',', skip_header=1)
x = tab[:, 0]
y = tab[:, 1]

def parametric_erf(x, c, w):
    return 50 + 50*erf((x - c)/(w*np.sqrt(2)))

pars, j = curve_fit(parametric_erf, x, y, p0=[475, 10])
print(pars)  # outputs [ 473.09090474, 37.04826334]
and to generate the plot
import matplotlib.pyplot as plt
plt.plot(x,parametric_erf(x,*pars))
plt.plot(x,y)
plt.show()

Your question is quite vague, but it seems whatever calculation you do ends up with a number in the range 381-570; is this correct? You have a multiline calculation which gives this number? I'm guessing you are repeating this in many places in your code, which is why you want to procedurise it?
You can wrap any calculation in a function. For instance:
answer = variable_1 * variable_2 + variable_3
can be written as:
def calculate(v1, v2, v3):
    '''Calculate the result from the inputs.'''
    return v1 * v2 + v3

answer = calculate(variable_1, variable_2, variable_3)
If you would like a definitive answer, simply post your calculation and I can make it into a function for you.

Related

Expected mean of correlated data in python

I have successfully generated three correlated random variables with Cholesky. I use the same mean (10) and the same standard deviation (5) for all of them. However, when I tried to calculate the expected mean of the correlated variables, I got some unpleasant results and I can't seem to figure out where exactly the problem is. Here is a working code:
import numpy as np
import pandas as pd
corr = np.array([[1,0.7,0.7], [0.7,1,0.7],[0.7,0.7,1]])
chol = np.linalg.cholesky(corr)
N=1000
rand_data = np.random.normal(10, 5, size=(3,N))
# generate uncorrelated data
uncorrelated_data = pd.DataFrame(rand_data, index=['A','B','C']).T/100
uncorrelated_data.corr() # shows barely any correlation as it should
uncorrelated_data.mean()*100 # shows each mean around 10
Output
A 10.308595
B 9.931958
C 10.165347
Generating correlation among them
x = np.dot(chol, rand_data) # cholesky
correlated_data = pd.DataFrame(x, index=['A','B','C']).T/100
print(correlated_data.corr()) # shows there are correlations among variable
correlated_data.mean()*100 # the mean keeps increasing across the variables
Output:
A 10.308595
B 14.308853
C 16.752117
The means of the uncorrelated variables were as expected, but the means of the correlated variables keep increasing from the first variable to the last. My expectation is that each mean would be around the actual mean (10). Could someone please help me figure out the problem or suggest an alternative solution?
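For what it's worth, a common way to keep each mean near 10 is to apply the Cholesky factor to zero-mean standard normal draws and only scale and shift afterwards; multiplying data that already has mean 10 by the Cholesky rows (whose entries sum to more than 1) is what inflates the means of B and C. A minimal sketch of that idea, reusing corr, chol and N from the code above:
mu, sigma = 10, 5
z = np.random.normal(0, 1, size=(3, N))   # standard normal draws, mean 0
x = mu + sigma * np.dot(chol, z)          # correlate first, then scale and shift
correlated_data = pd.DataFrame(x, index=['A', 'B', 'C']).T / 100
print(correlated_data.corr())             # off-diagonal entries near 0.7
print(correlated_data.mean() * 100)       # each mean stays around 10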

Python - Find coefficients minimizing error in csv data

I've recently run into a problem. I have data looking like this:
Value 1   Value 2   Target
1345      4590      2.45
1278      3567      2.48
1378      4890      2.46
1589      4987      2.50
...       ...       ...
The data goes on for a few thousand lines.
I need to find two values (A & B), that minimize the error when the data is inputted like so :
Value 1 * A + Value 2 * B = Target
I've looked into scipy.optimize.curve_fit, but I can't seem to understand how it would work, because the function changes at every iteration of the data (since Value 1 and Value 2 are not the same over every row).
Any help is greatly appreciated, thanks in advance!
The function curve_fit takes 3 arguments:
a function f that takes an input argument, let's call it X, and parameters params (as many as you want)
the input X_data you have from your dataset
the output Y_data you have from your dataset
The point of this function is to give you the best params to plug into f(X_data, params) to reproduce Y_data.
Intuitively, the X in your function f is a simple 1D numpy array, but it can actually have whatever form you want. Here your input is a tuple of two 1D arrays (or a 2D array, if you prefer to implement it that way).
Here's a code example:
import numpy as np
from scipy.optimize import curve_fit

X_data = (np.array([1345, 1278, 1378, 1589]),
          np.array([4590, 3567, 4890, 4987]))
Y_data = np.array([2.45, 2.48, 2.46, 2.50])

def my_func(X, A, B):
    x1, x2 = X
    return A*x1 + B*x2

(A, B), _ = curve_fit(my_func, X_data, Y_data)
interpolated_results = my_func(X_data, A, B)
relative_error_in_percent = abs((Y_data - interpolated_results)/Y_data)*100
print(relative_error_in_percent)
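Since the model is linear in A and B, ordinary least squares would also give the answer directly. A small sketch using numpy's lstsq on the X_data and Y_data defined above, offered only as an alternative to curve_fit:
A_matrix = np.column_stack(X_data)   # shape (n_rows, 2): the Value 1 and Value 2 columns
coef, _, _, _ = np.linalg.lstsq(A_matrix, Y_data, rcond=None)
A_ls, B_ls = coef
print(A_ls, B_ls)                    # should closely match the curve_fit result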
Unfortunately you have not provided any test data, so I have come up with my own:
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
def f(V1, V2, A, B):  # Target function
    return V1*A + V2*B

# Generate test data
def generateData(A, B):
    np.random.seed(0)
    V1 = np.random.uniform(low=1000, high=1500, size=(100,))
    V2 = np.random.uniform(low=3500, high=5000, size=(100,))
    Target = f(V1, V2, A, B) + np.random.normal(0, 1, 100)
    return V1, V2, Target

data = generateData(2, 3)  # Important: the true parameters here are A=2, B=3
data = {"Value 1": data[0], "Value 2": data[1], "Target": data[2]}
df = pd.DataFrame(data)  # Similar structure as given in the table
df.head() looks like this:
Value 1 Value 2 Target
0 1292.0525763109854 3662.162080896163 13570.276523473405
1 1155.0421489258965 4907.133274663096 17033.392287295104
2 1430.7172112685223 4844.422515098364 17395.412651006143
3 1396.0480757043242 4076.5845114488666 15022.720636830541
4 1346.2120476329646 3570.9567326419674 13406.565815022896
Your question is answered in the following:
## Plot Data to check whether linear function is useful
df.head()
fig=plt.figure()
ax1=fig.add_subplot(211)
ax2=fig.add_subplot(212)
ax1.scatter(df["Value 1"], df["Target"])
ax2.scatter(df["Value 2"], df["Target"])
def fmin(x, df):  # Returns the error for a given set of parameters
    def RMSE(y, y_target):  # Definition of the error term
        return np.sqrt(np.mean((y - y_target)**2))
    A, B = x
    V1, V2, y_target = df["Value 1"], df["Value 2"], df["Target"]
    y = f(V1, V2, A, B)  # Calculate the target value with the given parameter set
    return RMSE(y, y_target)

res = minimize(fmin, x0=[1, 1], args=df, options={"disp": True})
print(res.x)
I prefer scipy.optimize.minimize() over curve_fit since you can define the error function yourself. The documentation can be found here.
You need:
a function fun that returns the error for a given set of parameters x (here fmin with the RMSE as the error term)
an initial guess x0 (here [1, 1]); if your guess is totally off you will probably not find a solution, or (with more complex problems) only a local one
additional arguments args passed to fun (here the DataFrame df); also helpful for fixed parameters
options={"disp": True} for printing additional information
The fitted parameters, along with further information, can be found in the returned variable res.
For this case the result is:
[1.9987209 3.0004212]
This is close to the parameters used when generating the data (A=2, B=3).

lmfit for exponential data returns linear function

I'm working on fitting muon lifetime data to a curve to extract the mean lifetime using the lmfit function. The general process I'm using is to bin the 13,000 data points into 10 bins using the histogram function, calculate the uncertainty as the square root of the counts in each bin (it's an exponential model), and then use the lmfit module to determine the best fit along with the means and uncertainty. However, graphing the output of the model.fit() method returns this graph, where the red line is the fit (and obviously not the correct fit). [image: fit result output graph]
I've looked online and can't find a solution to this; I'd really appreciate some help figuring out what's going on. Here's the code.
import os
import numpy as np
import matplotlib.pyplot as plt
from numpy import sqrt, pi, exp, linspace
from lmfit import Model
class data():
    def __init__(self, file_name):
        times_dirty = sorted(np.genfromtxt(file_name, delimiter=' ', unpack=False)[:,0])
        self.times = []
        for i in range(len(times_dirty)):
            if times_dirty[i] < 40000:
                self.times.append(times_dirty[i])
        self.counts = []
        self.binBounds = []
        self.uncertainties = []
        self.means = []

    def binData(self, k):
        self.counts, self.binBounds = np.histogram(self.times, bins=k)
        self.binBounds = self.binBounds[:-1]

    def calcStats(self):
        if len(self.counts) == 0:
            print('Run binData function first')
        else:
            self.uncertainties = sqrt(self.counts)

    def plotData(self, fit):
        plt.errorbar(self.binBounds, self.counts, yerr=self.uncertainties, fmt='bo')
        plt.plot(self.binBounds, fit.init_fit, 'k--')
        plt.plot(self.binBounds, fit.best_fit, 'r')
        plt.show()

def decay(t, N, lamb, B):
    return N * lamb * exp(-lamb * t) + B

def main():
    muonEvents = data(r'C:\Users\Colt\Downloads\muon.data')
    muonEvents.binData(10)
    muonEvents.calcStats()
    mod = Model(decay)
    result = mod.fit(muonEvents.counts, t=muonEvents.binBounds, N=1, lamb=1, B=1)
    muonEvents.plotData(result)
    print(result.fit_report())
    print(len(muonEvents.times))

if __name__ == "__main__":
    main()
This might be a simple scaling problem. As a quick test, try dividing all raw data by a factor of 1000 (both X and Y) to see if changing the magnitude of the data has any effect.
Just to build on James Phillips' answer: I think the data you show in your graph imply values for N, lamb, and B that are very different from 1, 1, 1. Keep in mind that exp(-lamb*t) is essentially 0 for lamb = 1 and t > 100. So, if the algorithm starts at lamb=1 and varies that by a little bit to find a better value, it won't actually be able to see any difference in how well the model matches the data.
I would suggest trying to start with values that are more reasonable for the data you have, perhaps N=1.e6, lamb=1.e-4, and B=100.
As James suggested, having the variables have values on the order of 1 and putting in scale factors as necessary is often helpful in getting numerically stable solutions.
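To make that concrete, here is a minimal sketch of the same Model fit started from those more plausible values, reusing muonEvents and decay from the question (the numbers are just starting guesses, not fitted results):
mod = Model(decay)
params = mod.make_params(N=1.e6, lamb=1.e-4, B=100)   # plausible starting values, not fitted results
result = mod.fit(muonEvents.counts, params, t=muonEvents.binBounds)
print(result.fit_report())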

How to make user defined functions for binned_statistic

I am using the scipy stats package to take statistics along an axis, but I am having trouble taking the percentile statistic using binned_statistic. I have generalized the code below, where I am trying to take the 10th percentile of a dataset with x, y values within a series of x bins, and it fails.
I can of course use the built-in options, like median, and even the numpy standard deviation using np.std. However, I cannot figure out how to use np.percentile, because it requires 2 arguments (e.g. np.percentile(y, 10)), and then it gives me a ValueError: statistic not understood error.
import numpy as np
import scipy.stats as scist
y_median = scist.binned_statistic(x,y,statistic='median',bins=20,range=[(0,5)])[0]
y_std = scist.binned_statistic(x,y,statistic=np.std,bins=20,range=[(0,5)])[0]
y_10 = scist.binned_statistic(x,y,statistic=np.percentile(10),bins=20,range=[(0,5)])[0]
print y_median
print y_std
print y_10
I am at a loss and have even played around with user defined functions like this, but with no luck:
def percentile10():
    return np.percentile(y, 10)
Any help is greatly appreciated.
Thanks.
The problem with the function you defined is that it takes no arguments at all! It needs to take a y argument that corresponds to your sample, like this:
def percentile10(y):
    return np.percentile(y, 10)
You could also use a lambda function for brevity:
scist.binned_statistic(x, y, statistic=lambda y: np.percentile(y, 10),
                       bins=20, range=[(0, 5)])[0]
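For completeness, a small self-contained example of the callable-statistic usage; the toy x and y arrays here are made up purely for illustration:
import numpy as np
import scipy.stats as scist

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 1000)   # toy sample, not the question's data
y = rng.normal(size=1000)

def percentile10(y):
    return np.percentile(y, 10)

y_10, bin_edges, binnumber = scist.binned_statistic(x, y, statistic=percentile10,
                                                    bins=20, range=[(0, 5)])
print(y_10)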

Ways to Create Tables and Presentable Objects Other than Plots in Python

I have the following code that runs through the following steps:
Draw a number of points from a true distribution.
Use those points with curve_fit to extract the parameters.
Check if those parameters are, on average, close to the true values.
(You can do this by creating the "pull distribution" and seeing if it returns a standard normal variable.)
# This script calculates the mean and standard deviation for
# the pull distributions on the estimators that curve_fit returns
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import gauss
import format

numTrials = 10000

# Pull given by (a_j - a_true)/a_error
error_vec_A = []
error_vec_mean = []
error_vec_sigma = []

# Loop to determine pull distribution
for i in xrange(0, numTrials):
    # Draw from primary distribution
    mean = 0; var = 1; sigma = np.sqrt(var)
    N = 20000
    A = 1/np.sqrt((2*np.pi*var))
    points = gauss.draw_1dGauss(mean, var, N)

    # Histogram parameters
    bin_size = 0.1; min_edge = mean - 6*sigma; max_edge = mean + 9*sigma
    Nn = (max_edge - min_edge)/bin_size; Nplus1 = Nn + 1
    bins = np.linspace(min_edge, max_edge, Nplus1)

    # Obtain histogram from primary distributions
    hist, bin_edges = np.histogram(points, bins, density=True)
    bin_centres = (bin_edges[:-1] + bin_edges[1:])/2

    # Initial guess
    p0 = [5, 2, 4]
    coeff, var_matrix = curve_fit(gauss.gaussFun, bin_centres, hist, p0=p0)

    # Get the fitted curve
    hist_fit = gauss.gaussFun(bin_centres, *coeff)

    # Error on the estimates
    error_parameters = np.sqrt(np.array([var_matrix[0][0], var_matrix[1][1], var_matrix[2][2]]))

    # Obtain the error for each value: A, mu, sigma
    A_std = (coeff[0] - A)/error_parameters[0]
    mean_std = (coeff[1] - mean)/error_parameters[1]
    sigma_std = (np.abs(coeff[2]) - sigma)/error_parameters[2]

    # Store results in container
    error_vec_A.append(A_std)
    error_vec_mean.append(mean_std)
    error_vec_sigma.append(sigma_std)

# Plot the distribution of each estimator
plt.figure(1); plt.hist(error_vec_A, bins, normed=True); plt.title('Pull of A')
plt.figure(2); plt.hist(error_vec_mean, bins, normed=True); plt.title('Pull of Mu')
plt.figure(3); plt.hist(error_vec_sigma, bins, normed=True); plt.title('Pull of Sigma')

# Store key information regarding the distributions
mean_A = np.mean(error_vec_A); sigma_A = np.std(error_vec_A)
mean_mu = np.mean(error_vec_mean); sigma_mu = np.std(error_vec_mean)
mean_sigma = np.mean(error_vec_sigma); sigma_sig = np.std(error_vec_sigma)
info = np.array([[mean_A, sigma_A], [mean_mu, sigma_mu], [mean_sigma, sigma_sig]])
My problem is that I don't know how to use Python to format the data into a table. I have to manually go into the variables and copy them over to Google Docs to present the information. I'm just wondering how I can do that using pandas or some other library.
Here's an example of the manual insertion:
                          Trial 1    Trial 2    Trial 3
Seed                      [0.2,0,1]  [10,2,5]   [5,2,4]
Bins for individual runs  20         20         20
Points Thrown             1000       1000       1000
Number of Runs            5000       5000       5000
Bins for pull dist fit    20         20         20
Mean_A                    -0.11177   -0.12249   -0.10965
sigma_A                   1.17442    1.17517    1.17134
Mean_mu                   0.00933    -0.02773   -0.01153
sigma_mu                  1.38780    1.38203    1.38671
Mean_sig                  0.05292    0.06694    0.04670
sigma_sig                 1.19411    1.18438    1.19039
I would like to automate this table so that if I change the parameters in my code, I get a new table with the new data.
I would go with the CSV module to generate a presentable table.
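A minimal sketch of that idea, assuming the info array from the question's code (the file name and column labels are my own choices for illustration):
import csv

rows = zip(['A', 'mu', 'sigma'], info[:, 0], info[:, 1])
with open('pull_summary.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['estimator', 'pull mean', 'pull sigma'])
    writer.writerows(rows)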
If you're not already using it, the IPython notebook is really good for rendering rich display formats. It's really good in a lot of other ways, too.
It will render pandas DataFrame objects as an HTML table when they're either the last, unreturned value in a cell or if you explicitly call the IPython.core.display.display function instead of print.
If you're not already using pandas, I highly recommend it. It's basically a wrapper around 2D & 3D numpy arrays; it's just as fast, but it has nice naming conventions, data grouping and filtering functions, and some other cool stuff.
At that point, it depends on how you want to present it. You can use nbconvert to render a whole notebook as static html or a pdf. You can copy-paste the html table into Excel or PowerPoint or an E-mail.
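As a concrete sketch of the pandas route, you can wrap the info array from the question's code in a labelled DataFrame and then render or export it (the index and column names below are my own choices for illustration):
import pandas as pd

summary = pd.DataFrame(info,
                       index=['A', 'mu', 'sigma'],
                       columns=['pull mean', 'pull sigma'])
summary                              # renders as an HTML table when it is the last value in a notebook cell
summary.to_csv('pull_summary.csv')   # or export it for a spreadsheet / Google Docs
print(summary.to_string())           # plain-text table outside the notebook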
