I'm trying to use ready-made functions from the IneqPy package; below you can see the function and the data.
The function is also described in this link https://mmngreco.github.io/IneqPy/_modules/ineqpy/inequality.html#concentration
# Import libraries
import numpy as np
import pandas as pd
# from .statistics import mean
# from . import utils
import utils
import stats
# Functions
def concentration(income, weights=None, data=None, sort=True):
"""Calculate concentration's index.
This function calculate the concentration index, according to the notation
used in [Jenkins1988]_ you can calculate the:
C_x = 2 / x ยท cov(x, F_x)
if x = g(x) then C_x becomes C_y
when there are taxes:
y = g(x) = x - t(x)
Parameters
----------
income : array-like
weights : array-like
data : pandas.DataFrame
sort : bool
If true, will sort the values.
Returns
-------
concentration : array-like
References
----------
Jenkins, S. (1988). Calculating income distribution indices
from micro-data. National Tax Journal. http://doi.org/10.2307/41788716
"""
# TODO complete docstring
# check if DataFrame is passed, if yes then extract variables else make a
# copy
income, weights = utils.extract_values(data, income, weights)
if weights is None:
weights = utils.not_empty_weights(weights, like=income)
# if sort is true then sort the variables.
if sort:
income, weights = utils._sort_values(income, weights)
if weights.ndim == 2:
weights = np.squeeze(weights, axis=1)
if income.ndim == 2:
income = np.squeeze(income, axis=1)
# main calc
f_x = utils.normalize(weights)
F_x = f_x.cumsum(axis=0)
mu = np.sum(income * f_x)
cov = np.cov(income, F_x, rowvar=False, aweights=f_x)[0, 1]
return 2 * cov / mu
After executing the code above, I tried to apply these functions to the dataset described below.
data = {
    'income': list(range(1, 100))
}
df = pd.DataFrame(data, columns=['income'])
The DataFrame df has only one column, and when I tried to apply the function above to it I ran into several problems. The first is that nothing coming from the utils package can be executed (e.g. utils.extract_values, utils.not_empty_weights, utils._sort_values, utils.normalize). I have already installed utils==1.0.1 and don't know what the problem is.
So can anybody help me fix the problem with the utils package, so that I can run the function above with this line?
concentration(df['income'])
At the moment I can run this calculation with the installed IneqPy package; the line of code below gives 0.33, but I want to obtain the same result with the functions described above.
ineqpy.inequality.concentration(df['income'])
If you look at the contents of the IneqPy package, you'll see that there's a file named utils.py. That's the one that's supposed to be imported. Uninstall the utils package you installed and the code should work.
It should work anyway, but uninstalling it wouldn't hurt.
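For example, once the standalone utils package is uninstalled, a minimal sketch using IneqPy's own modules (assuming IneqPy is installed) looks like this:
import pandas as pd
from ineqpy import utils                      # IneqPy's bundled utils.py, not the standalone package
from ineqpy.inequality import concentration   # the ready-made function described above

df = pd.DataFrame({'income': list(range(1, 100))})
print(concentration(df['income']))            # should match the ~0.33 reported above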
I am estimating a model using the pyMC3 library in python. In my "real" model, there are four parameter arrays, two of which have over 170,000 parameters in them. Summarising this array of parameters is too computationally intensive on my computer. I have been trying to figure out if the summary function in arviz will allow me to only summarise one (or a small number) of parameters in the array. Below is a reprex where the same problem is present, though the model is a lot simpler. In the linear regression model below, the parameter array b has three parameters in it b[0], b[1], b[2]. I would like to know how to get the summary for just b[0] and b[1] or alternatively for just a single parameter, e.g., b[0].
import pandas as pd
import pymc3 as pm
import arviz as az
d = pd.read_csv("https://quantoid.net/files/mtcars.csv")
mpg = d['mpg'].values
hp = d['hp'].values
weight = d['wt'].values
with pm.Model() as model:
    b = pm.Normal("b", mu=0, sigma=10, shape=3)
    sig = pm.HalfCauchy("sig", beta=2)
    mu = pm.Deterministic('mu', b[0] + b[1]*hp + b[2]*weight)
    like = pm.Normal('like', mu=mu, sigma=sig, observed=mpg)
    fit = pm.fit(10000, method='advi')
    samp = fit.sample(1500)

with model:
    smry = az.summary(samp, var_names = ["b"])
It looked like the coords argument to the summary() function would do it, but after googling around and finding a few examples, like the one here with plot_posterior() instead of summary(), I was unable to get something to work. In particular, I tried the following in the hopes that it would return the summary for b[0] and b[1].
with model:
    smry = az.summary(samp, var_names = ["b"], coords={"b_dim_0": range(1)})
or this to return the summary of b[0]:
with model:
    smry = az.summary(samp, var_names = ["b"], coords={"b_dim_0": [0]})
I suspect I am missing something simple (I'm an R user who dabbles occasionally with Python). Any help is greatly appreciated.
(BTW, I am using Python 3.8.0, pyMC3 3.9.3, arviz 0.10.0)
To use coords for this, you need to update to the development version of ArviZ (which will still show 0.11.2 but has the code from GitHub), or to any release newer than 0.11.2. Up to and including 0.11.2, the coords argument in summary was not used to subset the data (as it is in all the plotting functions); instead it was only taken into account if the input was not already InferenceData, in which case it was passed to the converter.
With older versions, you need to use xarray to subset the data before passing it to summary. Therefore you need to explicitly convert the trace to inferencedata beforehand. In the example above it would look like:
with model:
    ...
    samp = fit.sample(1500)
    idata = az.from_pymc3(samp)
    az.summary(idata.posterior[["b"]].sel({"b_dim_0": [0]}))
Moreover, you may also want to tell summary to compute only a subset of the stats/diagnostics, as shown in the docstring examples.
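For instance, a minimal sketch of computing only a couple of statistics via the stat_funcs and extend arguments of az.summary (check the docstring of your ArviZ version for the exact behaviour):
import numpy as np

# extend=False means only the functions given in stat_funcs are computed,
# skipping the default stats/diagnostics
az.summary(
    idata.posterior[["b"]].sel({"b_dim_0": [0]}),
    stat_funcs={"mean": np.mean, "sd": np.std},
    extend=False,
)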
I'm trying to run this code from Kaggle for clustering time series using DTW.
More specifically, the part:
In[24/25]:
"""
From a list of series, compute a distance matrix by computing the
DTW distance of all pairwise combinations of series.
"""
diff_matrix = {}
cross = itertools.product(cols, cols)
for (col1, col2) in cross:
    series1 = daily_sales_item_lookup_scaled_weekly[col1]
    series2 = daily_sales_item_lookup_scaled_weekly[col2]
    diff = dtw(
        series1,
        series2,
        keep_internals=True,
        step_pattern=rabinerJuangStepPattern(2, "c")
    ).normalizedDistance
    diff_matrix[(col1, col2)] = [diff]
return diff_matrix
As one of the parameters, the authors pass step_pattern=rabinerJuangStepPattern(2, "c"); however, when I run it, I get the error mentioned.
Does anyone know what might be wrong?
Thank you!
You need to import this function from the dtw package like this first:
from dtw import *
If you scroll to the top of the Kaggle Page you can see that it is imported there too.
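Equivalently, a minimal sketch importing only the names the snippet actually uses (assuming the dtw-python package is installed):
import itertools                                 # for itertools.product in the snippet
from dtw import dtw, rabinerJuangStepPattern     # the distance function and the step pattern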
I have a dataset in the form of a table:
Score Percentile
381 1
382 2
383 2
...
569 98
570 99
The complete table is here as a Google spreadsheet.
Currently, I am computing a score and then doing a lookup on this dataset (table) to find the corresponding percentile rank.
Is it possible to create a function to calculate the corresponding percentile rank for a given score using a formula instead of looking it up in the table?
It's impossible to recreate the function that generated a given table of data, if no information is provided about the process behind that data.
That being said, we can make some speculation.
Since it's a "percentile" function, it probably represents the cumulative value of a probability distribution of some sort. A very common probability distribution is the normal distribution, whose "cumulative" counterpart (i.e. its integral) is the so called "error function" ("erf").
In fact, your tabulated data looks a lot like an error function for a variable whose average value is 473.09:
[Plot: the dataset (orange) against the fitted error function, erf (blue)]
However, the agreement is not perfect and that could be because of three reasons:
the fitting procedure I've used to generate the parameters for the error function didn't use the right constraints (because I have no idea what I'm modelling!)
your dataset doesn't represent an exact normal distribution, but rather real world data whose underlying distribution is the normal distribution. The features of your sample data that deviate from the model are being ignored altogether.
the underlying distribution is not a normal distribution at all, its integral just happens to look like the error function by chance.
There is literally no way for me to tell!
If you want to use this function, this is its definition:
import numpy as np
from scipy.special import erf
def fitted_erf(x):
    c = 473.09090474
    w = 37.04826334
    return 50+50*erf((x-c)/(w*np.sqrt(2)))
Tests:
In [2]: fitted_erf(439) # 17 from the table
Out[2]: 17.874052406601457
In [3]: fitted_erf(457) # 34 from the table
Out[3]: 33.20270318344252
In [4]: fitted_erf(474) # 51 from the table
Out[4]: 50.97883169390196
In [5]: fitted_erf(502) # 79 from the table
Out[5]: 78.23955071273468
However, I'd strongly advise you to check whether a fitted function, made without knowledge of your data source, is the right tool for your task.
P.S.
In case you're interested, this is the code used to obtain the parameters:
import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit
tab=np.genfromtxt('table.csv', delimiter=',', skip_header=1)
# using a 'table.csv' file generated by Google Spreadsheets
x = tab[:,0]
y = tab[:,1]
def parametric_erf(x, c, w):
    return 50+50*erf((x-c)/(w*np.sqrt(2)))
pars, j = curve_fit(parametric_erf, x, y, p0=[475,10])
print(pars)
# outputs [ 473.09090474, 37.04826334]
and to generate the plot
import matplotlib.pyplot as plt
plt.plot(x,parametric_erf(x,*pars))
plt.plot(x,y)
plt.show()
Your question is quite vague, but it seems whatever calculation you do ends up with a number in the range 381-570; is that correct? Do you have a multiline calculation which gives this number? I'm guessing you are repeating it in many places in your code, which is why you want to turn it into a procedure?
For any calculation you can wrap it in a function. For instance:
answer = variable_1 * variable_2 + variable_3
can be written as:
def calculate(v1, v2, v3):
    ''' calculate the result from the inputs
    '''
    return v1 * v2 + v3
answer = calculate(variable_1, variable_2, variable_3)
If you would like a definitive answer, then simply post your calculation and I can make it into a function for you.
Can anyone explain to me the difference between ols in statsmodels.formula.api versus ols in statsmodels.api?
Using the Advertising data from the ISLR text, I ran an ols using both, and got different results. I then compared with scikit-learn's LinearRegression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
df = pd.read_csv("C:\...\Advertising.csv")
x1 = df.loc[:,['TV']]
y1 = df.loc[:,['Sales']]
print "Statsmodel.Formula.Api Method"
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print model1.params
print "\nStatsmodel.Api Method"
model2 = sm.OLS(y1, x1)
results = model2.fit()
print results.params
print "\nSci-Kit Learn Method"
model3 = LinearRegression()
model3.fit(x1, y1)
print model3.coef_
print model3.intercept_
The output is as follows:
Statsmodel.Formula.Api Method
Intercept 7.032594
TV 0.047537
dtype: float64
Statsmodel.Api Method
TV 0.08325
dtype: float64
Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]
The statsmodels.api method returns a different parameter for TV than the statsmodels.formula.api and scikit-learn methods do.
What kind of OLS algorithm is statsmodels.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?
Came across this issue today and wanted to elaborate on @stellasia's answer, because the statsmodels documentation is perhaps a bit ambiguous.
Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.
Furthermore it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to
[Indicate] whether the RHS includes a user-supplied constant
because, in the footnotes
No constant is added by the model unless you are using formulas.
The example below shows that both .formula.api and .api will require a user-added column vector of 1s if not using R-style string formulas.
# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e
Now throw x and y into Excel and run Data>Data Analysis>Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:
Intercept 1.497761024
X Variable 1 0.012073045
X Variable 2 0.623936056
Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)
import statsmodels.formula.api as smf
import statsmodels.api as sm
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)
# smf models
# ----------
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
Now running things correctly with a column vector of 1.0s added to x. You can use smf here but it's really not necessary if you're not using formulas.
print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)
# sm models
# ----------
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
The difference is due to the presence or absence of an intercept:
in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept is fitted
in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api
x1 = sm.add_constant(x1)
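A minimal sketch of the full fix, reusing x1 and y1 from the question (the parameter values in the comment are the ones reported above):
import statsmodels.api as sm

x1_const = sm.add_constant(x1)        # adds a column of 1s named 'const'
results = sm.OLS(y1, x1_const).fit()
print(results.params)                 # const ~ 7.0326, TV ~ 0.0475, matching smf.ols and scikit-learn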
I had a similar issue with the Logit function.
(I used patsy to create my matrices, so the intercept was there.)
My sm.logit was not converging.
My sm.formula.logit was converging however.
Data going in was exactly the same.
I changed the solver method to 'newton' and the sm.logit converged also.
Is it possible the two versions have different default solver methods?
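A minimal sketch of what that looks like (y and X here stand in for the endog/exog arrays of the original model, which are not shown):
import statsmodels.api as sm

# explicitly pick the solver instead of relying on the default
result = sm.Logit(y, X).fit(method='newton')
print(result.summary())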
I'm trying to convert the following code from R to Python using the Statsmodels module:
model <- glm(goals ~ att + def + home - (1), data=df, family=poisson, weights=weight)
I've got a similar dataframe (named df) using pandas, and currently have the following line in Python (version 3.4 if it makes a difference):
model = sm.Poisson.from_formula("goals ~ att + def + home - 1", df).fit()
Or, using GLM:
smf.glm("goals ~ att + def + home - 1", df, family=sm.families.Poisson()).fit()
However, I can't get the weighting terms to work. Each record in the dataframe has a date, and I want more recent records to be more valuable for fitting the model than older ones. I've not seen an example of it being used, but surely if it can be done in R, it can be done on Statsmodels... right?
freq_weights is now supported on GLM Poisson, but unfortunately not on sm.Poisson
To use it, pass freq_weights when creating the GLM:
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula = "goals ~ att + def + home - 1"
smf.glm(formula, df, family=sm.families.Poisson(), freq_weights=df['freq_weight']).fit()
I've encountered the same issue.
There is a workaround that should lead to the same results: add the weight on a logarithmic scale (np.log(weight)) as one of the explanatory variables, with its coefficient fixed at 1 (the offset option).
I can see there is also an exposure option, which does the same thing as I explained above.
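A minimal sketch of that offset workaround, assuming the weight column in df is named 'weight':
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# log(weight) enters the linear predictor with its coefficient fixed at 1
fitted = smf.glm(
    "goals ~ att + def + home - 1",
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df['weight']),
).fit()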
There are two solutions for setting up weights for Poisson regression. The first is to use freq_weights in the GLM function, as mentioned by MarkWPiper. The second is to just go with Poisson regression and pass the weights as exposure. As documented here: "Log(exposure) is added to the linear prediction with coefficient equal to 1." This does the same mathematical trick as mentioned by Yaron, although the parameter has a different original meaning. Sample code is as follows:
import statsmodels.api as sm
# or: from statsmodels.discrete.discrete_model import Poisson
fitted = sm.Poisson.from_formula("goals ~ att + def + home - 1", data=df, exposure=df['weight']).fit()