Design-corrected Variance Estimation in Python

I am working with survey data in Python and am trying to estimate design-corrected variance. In R, I know I could use svydesign to specify the weights, strata, and ID, as in the following:
svydesign2019 <- svydesign(id=~HB9_METH_VPSUPU,
strata=~HB9_METH_VSTRATUMPU,
weights=~HB9_METH_WEIGHT,
data=uhb_2019)
And in STATA I know I could use svyset like so...
svyset [pweight=HB9_METH_WEIGHT],
strata(HB9_METH_VSTRATUMPU) psu(HB9_METH_VPSUPU)
singleunit(scaled)
Is there an equivalent package in Python?
Thank you!
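There is no single canonical equivalent in the core scientific Python stack, but one package worth looking at is samplics, which implements design-based survey estimation (Taylor linearization) with strata, PSUs, and weights. Below is a minimal sketch assuming samplics' TaylorEstimator API and a hypothetical "outcome" column; treat the exact call signature as an assumption to verify against the samplics documentation.
import pandas as pd
from samplics.estimation import TaylorEstimator

uhb_2019 = pd.read_csv("uhb_2019.csv")  # hypothetical file holding the survey data

# Design-corrected estimate of the mean of a (hypothetical) outcome column,
# using the weight, stratum, and PSU variables from the question.
mean_est = TaylorEstimator("mean")
mean_est.estimate(
    y=uhb_2019["outcome"],
    samp_weight=uhb_2019["HB9_METH_WEIGHT"],
    stratum=uhb_2019["HB9_METH_VSTRATUMPU"],
    psu=uhb_2019["HB9_METH_VPSUPU"],
)
print(mean_est)  # point estimate with a design-corrected standard error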

Related

Converting from R to Python, trying to understand a line

I have a fairly simple question. I have been converting some statistical analysis code from R to Python. Up until now, I have been doing just fine, but I have gotten stuck on this particular line:
nlsfit <- nls(N~pnorm(m, mean=mean, sd=sd),data=data4fit,start=list(mean=mu, sd=sig), control=list(maxiter=100,warnOnly = TRUE))
Essentially, the program is calculating the non-linear least-squares fit for a set of data via the "nls" command. In the original text, the "tilde" looks like an "enye"; I'm not sure if that is significant.
As I understand it, the equivalent of pnorm in Python is norm.cdf from scipy.stats. What I want to know is: what does the "tilde/enye" do before the pnorm function is invoked? "m" is a predefined variable, while "mean" and "sd" are not.
I also found some code essentially reproducing nls in Python: nls Python code. However, given the date of that post (2013), I was wondering if there are any more recent equivalents, preferably written in Python 3.
Any advice is appreciated, thanks!
As you can see from ?nls, the first argument to nls is formula:
formula: a nonlinear model formula including variables and parameters.
Will be coerced to a formula if necessary
Now, if you do ?formula, we can read this:
The models fit by, e.g., the lm and glm functions are specified in a
compact symbolic form. The ~ operator is basic in the formation of
such models. An expression of the form y ~ model is interpreted as a
specification that the response y is modelled by a linear predictor
specified symbolically by model
Therefore, the ~ in your nls call joins the response/dependent/regressand variable on the left with the regressors/explanatory variables on the right-hand side of your nonlinear least squares.
Best!
This minimizes
sum((N - pnorm(m, mean=mean, sd=sd))^2)
using the starting values for mean and sd specified in start. It will perform a maximum of 100 iterations and, because warnOnly = TRUE, it will return (with a warning) instead of signalling an error if it terminates before convergence.
The first argument to nls is an R formula which specifies the regression where the left hand side of the tilde (N) is the dependent variable and the right side is the function of the parameters (mean, sd) and data (m) used to predict it.
Note that formula objects do not have a fixed meaning in R; rather, each function can interpret them in any way it likes. For example, formula objects used by nls are interpreted differently than formula objects used by lm. In nls, the formula y ~ a + b * x would be used to specify a linear regression, but in lm the same regression would be expressed as y ~ x.
See ?pnorm, ?nls, ?nls.control and ?formula.
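On the Python side of the original question: the closest ready-made analogue of this nls call is scipy.optimize.curve_fit, which likewise minimizes the sum of squared residuals. A minimal sketch with made-up data standing in for data4fit (curve_fit's p0 plays the role of start; its maxfev caps function evaluations rather than iterations, so it is only a rough analogue of maxiter):
import numpy as np
from scipy.stats import norm
from scipy.optimize import curve_fit

# Made-up data standing in for data4fit: m is the predictor, N the response.
rng = np.random.default_rng(0)
m = np.linspace(-3, 3, 50)
N = norm.cdf(m, loc=0.2, scale=1.1) + rng.normal(0, 0.01, m.size)

def model(m, mean, sd):
    # Python equivalent of pnorm(m, mean=mean, sd=sd)
    return norm.cdf(m, loc=mean, scale=sd)

# Minimizes sum((N - model(m, mean, sd))**2), just as nls does.
(mean_hat, sd_hat), _ = curve_fit(model, m, N, p0=[0.0, 1.0])
print(mean_hat, sd_hat)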

How to implement this R Poisson distribution in Python?

I've coded something in R but I can't seem to do the same in Python.
Below is the code - it definitely works in R.
I am having trouble with the Python syntax to achieve the same with numpy.
myMaxAC = qpois(p=as.numeric(0.95),
lambda=(121412)*(0.005))
For clarity, 0.95 is the confidence level, 121412 is my population size, and 0.005 is a frequency within the population.
I just want to know how to get the same answer in Python, which incidentally is 648.
You can get this using poisson.ppf:
from scipy.stats import poisson
myMaxAC = poisson.ppf(0.95, (121412)*(0.005))
print(myMaxAC)
648.0
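For context, poisson.ppf is the quantile function (the inverse CDF), which is exactly what R's qpois computes: the smallest integer k whose cumulative probability reaches the requested level. A quick sanity check:
from scipy.stats import poisson

lam = 121412 * 0.005            # 607.06
k = poisson.ppf(0.95, lam)      # 648.0, matching R's qpois
print(poisson.cdf(k, lam))      # >= 0.95, so k covers the 95% level
print(poisson.cdf(k - 1, lam))  # < 0.95, so 648 is the smallest such k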

Exponential Moving Average Pandas vs Ta-lib

I'm currently writing some code involving financial calculations, in particular some exponential moving averages. To do the job I have tried Pandas and Talib:
talib_ex=pd.Series(talib.EMA(self.PriceAdjusted.values,timeperiod=200),self.PriceAdjusted.index)
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=True,min_periods=200-1).mean()
They both work fine, but they provide different results at the beginning of the array.
Is there some parameter to change in pandas's EWMA, or is it a bug I should worry about?
Thanks in advance
Luca
For the talib EMA, the formula is the standard recursive one: EMA[i] = (1 - alpha) * EMA[i-1] + alpha * price[i], with alpha = 2 / (timeperiod + 1).
So when using pandas, if you want to make the pandas EMA match talib's, you should use:
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=False,min_periods=200-1).mean()
Set adjust to False, per the documentation (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html), if you want to use the same formula as talib:
When adjust is True (default), weighted averages are calculated using weights (1-alpha)^(n-1), (1-alpha)^(n-2), ..., 1-alpha, 1.
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0]; weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i].
You can also reference here:
https://en.wikipedia.org/wiki/Moving_average
PS: however, in my project I still find some small differences between talib and pandas.ewm, and I don't know why yet...
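A plausible source of those residual differences is the seed value: TA-Lib is commonly described as seeding its EMA with the simple average of the first timeperiod values, whereas pandas with adjust=False seeds with the first observation alone, so the two series only converge as the seed's influence decays. A sketch of that seeding scheme (treat the seeding rule as an assumption to verify against the TA-Lib source):
import numpy as np
import pandas as pd

def talib_style_ema(prices: pd.Series, period: int) -> pd.Series:
    # Seed with the simple mean of the first `period` values,
    # then apply ema[i] = (1 - alpha) * ema[i-1] + alpha * price[i].
    alpha = 2.0 / (period + 1)
    ema = pd.Series(np.nan, index=prices.index, dtype=float)
    ema.iloc[period - 1] = prices.iloc[:period].mean()
    for i in range(period, len(prices)):
        ema.iloc[i] = (1 - alpha) * ema.iloc[i - 1] + alpha * prices.iloc[i]
    return ema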

Computing std deviation using RDD vs SparkSQL in Python

I am pretty new to the world of Spark (and to an extent even Python, though getting better). I am trying to compute the standard deviation and have used the following code. The first attempt uses SparkSQL:
sqlsd = spark.sql("SELECT STDDEV(temperature) as stdtemp from washing").first().stdtemp
print(sqlsd)
The above works fine (I think), and it gives the result as 6.070.
Now when I try to do this using an RDD with the following code:
from math import sqrt

def sdTemperature(df, spark):
    n = float(df.count())
    m = meanTemperature(df, spark)
    df = df.fillna({'_id': 0, '_rev': 0, 'count': 0, 'flowrate': 0,
                    'fluidlevel': 0, 'frequency': 0, 'hardness': 0,
                    'speed': 0, 'temperature': 0, 'ts': 0, 'voltage': 0})
    rddT = df.rdd.map(lambda r: r.temperature)
    c = rddT.count()
    s = rddT.map(lambda x: pow(x - m, 2)).sum()
    print(n, c, s)
    sd = sqrt(s / c)
    return sd
When I run the above code, I get a different result: 53.195.
What am I doing wrong? All I am trying to do above is compute the std deviation for the temperature column of a Spark dataframe using a lambda.
Thanks in advance for the help.
Thanks to Zero323, who gave me the clue: I now skip the null values. The modified code is as follows:
df2 = df.na.drop(subset=["temperature"])
rddT = df2.rdd.map(lambda r: r.temperature)
c = rddT.count()
s = rddT.map(lambda x: pow(x - m, 2)).sum()
sd = math.sqrt(s / c)
return sd
There are two types of standard deviation; please refer to this: https://math.stackexchange.com/questions/15098/sample-standard-deviation-vs-population-standard-deviation
Similar question -
Calculate the standard deviation of grouped data in a Spark DataFrame
The stddev() in Hive is an alias for stddev_samp() (the sample standard deviation, which divides by n - 1). stddev_pop() (which divides by n) is what you are looking for, as inferred from the second part of your code. So your SQL query should be: select stddev_pop(temperature) as stdtemp from washing
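In PySpark you can request either estimator explicitly, which makes the distinction easy to check side by side:
from pyspark.sql import functions as F

df.select(
    F.stddev_samp("temperature").alias("sample_sd"),    # divides by n - 1; stddev() is an alias for this
    F.stddev_pop("temperature").alias("population_sd"), # divides by n, matching the RDD computation above
).show()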

R summary() equivalent in numpy

Is there an equivalent of R's summary() function in numpy?
numpy has std, mean, average functions separately, but does it have a function that sums up everything, like summary does in R?
I found this question, which relates to pandas, and this article with R-to-numpy equivalents, but neither has what I'm looking for.
1. Load Pandas in console and load csv data file
import pandas as pd
data = pd.read_csv("data.csv", sep=",")
2. Examine first few rows of data
data.head()
3. Calculate summary statistics
summary = data.describe()
4. Transpose statistics to get similar format as R summary() function
summary = summary.transpose()
5. Visualize summary statistics in console
summary.head()
No. You'll need to use pandas.
R is a language for statistics, so much of the basic functionality you need, like summary() and lm(), is loaded when you boot it up. Python has many uses, so you need to install and import the appropriate statistical packages. numpy isn't a statistics package - it's for numerical computation more generally - so you need to use packages like pandas, scipy and statsmodels to allow Python to do what R can do out of the box.
If you are looking for details like summary() in R, i.e.:
5-point summary for numeric variables
Frequency of occurrence of each class for categorical variables
To achieve the above in Python, you can use df.describe(include='all').
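A small self-contained example of that call on a mixed-type frame:
import pandas as pd

df = pd.DataFrame({
    "height": [1.60, 1.75, 1.80, 1.75],     # numeric: count, mean, std, min, quartiles, max
    "color": ["red", "blue", "red", "red"], # categorical: count, unique, top, freq
})
print(df.describe(include="all"))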
