How to implement this R Poisson distribution in Python?

I've coded something in R but I can't seem to do the same in Python.
Below is the code - it definitely works in R.
I am having trouble with the Python syntax to achieve the same with numpy.
myMaxAC = qpois(p=as.numeric(0.95),
lambda=(121412)*(0.005))
For clarity: 0.95 is the confidence level, 121412 is my population size, and 0.005 is a frequency within the population.
I just want to know how to get the same answer in Python, which incidentally is 648.

You can get this using poisson.ppf:
from scipy.stats import poisson
myMaxAC = poisson.ppf(0.95, (121412)*(0.005))
print(myMaxAC)
648.0
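For reference, poisson.ppf is SciPy's quantile (inverse CDF) function, the counterpart of R's qpois. A quick sanity check along these lines (the variable names lam and q below are just illustrative):
from scipy.stats import poisson
lam = 121412 * 0.005          # Poisson mean: population size times frequency
q = poisson.ppf(0.95, lam)    # smallest k with CDF(k) >= 0.95, like R's qpois
print(q)                      # 648.0
print(poisson.cdf(q, lam))    # at least 0.95, confirming q is the 95th percentile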

Related

Design-corrected Variance Estimation in Python

I am working with survey data in Python and I am trying to estimate design-corrected variance. In R, I know I could use svydesign to specify the weights, strata, and ID, as with the following...
svydesign2019 <- svydesign(id=~HB9_METH_VPSUPU,
strata=~HB9_METH_VSTRATUMPU,
weights=~HB9_METH_WEIGHT,
data=uhb_2019)
And in Stata I know I could use svyset like so...
svyset [pweight=HB9_METH_WEIGHT],
strata(HB9_METH_VSTRATUMPU) psu(HB9_METH_VPSUPU)
singleunit(scaled)
Is there an equivalent package in Python?
Thank you!
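One possible direction, offered as a sketch rather than a definitive answer: packages such as samplics target design-based estimation in Python, and the standard stratified, with-replacement Taylor-linearization variance of a weighted total can also be coded directly with pandas. The function name and the analysis column "y" below are hypothetical.
import pandas as pd

# Sketch (not a drop-in replacement for svydesign/svyset): with-replacement
# Taylor-linearization variance of a weighted total, for a DataFrame with the
# columns named in the question plus a hypothetical analysis variable "y".
def weighted_total_variance(df, y, weight, stratum, psu):
    # PSU-level weighted totals t_hi = sum of w * y within each (stratum, PSU)
    psu_totals = (
        df.assign(_wy=df[y] * df[weight])
          .groupby([stratum, psu])["_wy"]
          .sum()
    )
    variance = 0.0
    for _, totals in psu_totals.groupby(level=0):   # loop over strata
        n_h = len(totals)                            # number of PSUs in the stratum
        if n_h < 2:
            continue                                 # cf. singleunit() handling in Stata
        dev = totals - totals.mean()
        variance += n_h / (n_h - 1) * (dev ** 2).sum()
    return variance

# Hypothetical usage:
# var_total = weighted_total_variance(uhb_2019, y="y",
#                                     weight="HB9_METH_WEIGHT",
#                                     stratum="HB9_METH_VSTRATUMPU",
#                                     psu="HB9_METH_VPSUPU")
# se_total = var_total ** 0.5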

Which gives the correct standard deviation: numpy.std() or statistics.stdev()?

import statistics
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
print(statistics.stdev(speed))
print(numpy.std(speed))
#9.636336148089395
#9.258292301032677
Why aren't both answers the same? And since they differ, which one is the correct standard deviation? Please explain.
For the standard deviation of the entire population, as computed by numpy.std(), use:
statistics.pstdev()
I think both are correct. statistics.stdev(speed) performs the calculation with n-1 degrees of freedom, while numpy.std(speed) uses n instead. If you're trying to estimate the standard deviation of a population from a sample of data, then you should use statistics.stdev(speed).
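To make the two libraries agree, the degrees-of-freedom correction can be set explicitly via numpy's ddof argument; a small check along those lines:
import statistics
import numpy as np

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Sample standard deviation (divides by n - 1): these two should match.
print(statistics.stdev(speed))
print(np.std(speed, ddof=1))

# Population standard deviation (divides by n): these two should match.
print(statistics.pstdev(speed))
print(np.std(speed))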

What simple filter could I use to de-noise my data?

I'm processing some experimental data in Python 3. The data (raw_data in my code) is pretty noisy:
One of my goals is to find the peaks, and for this I'd like to filter out the noise. Based on what I found in the documentation of SciPy's signal module, the theory of filtering seems to be really complicated, and unfortunately I have zero background. Of course I'll have to learn it sooner or later - and I intend to - but right now the profit isn't worth the time (and learning filter theory isn't the purpose of my work), so I shamefully copied the code in Lyken Syu's answer without a chance of understanding the background:
import numpy as np
from scipy import signal as sg
from matplotlib import pyplot as plt
# [...] code, resulting in this:
raw_data = [arr_of_xvalues, arr_of_yvalues] # xvalues are in decreasing order
# <magic beyond my understanding>
n = 20 # the larger n is, the smoother the curve will be
b = [1.0 / n] * n
a = 1  # denominator of 1 keeps this a plain FIR (moving-average) filter
filt = sg.lfilter(b, a, raw_data)
filtered = sg.lfilter(b, a, filt)
# </magic>
plt.plot(filtered[0], filtered[1], ".")
plt.show()
It kind of works:
What concerns me is the curve that the filter adds from 0 up to the beginning of my dataset. I guess it's a property of the IIR filter I used, but I don't know how to prevent this. Also, I couldn't make other filters work so far. I need to use this code on other, similar experimental results, so I need a somewhat more general solution than e.g. cutting out all y < 10 points.
Is there a better (possibly simpler) way, or a choice of filter, that is easy to implement without a serious theoretical background?
How, if at all, could I prevent my filter from adding that curve to my data?
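One possible direction, not taken from the original post: scipy.signal.filtfilt runs the same filter forward and then backward, which removes the phase delay and greatly reduces the startup transient that a single lfilter pass produces at the edges. A sketch with made-up data standing in for raw_data:
import numpy as np
from scipy import signal as sg
from matplotlib import pyplot as plt

# Made-up noisy data playing the role of raw_data = [x, y] from the question.
x = np.linspace(10, 0, 500)                      # x values in decreasing order
y = np.exp(-((x - 5) ** 2)) + 0.1 * np.random.randn(x.size)

n = 20
b = [1.0 / n] * n      # simple moving-average (FIR) filter
a = 1                  # denominator of 1, i.e. a pure FIR filter

# Forward-backward filtering: zero phase shift, and far less of the
# startup artefact that a single lfilter pass adds at the beginning.
y_smooth = sg.filtfilt(b, a, y)

plt.plot(x, y, ".", label="raw")
plt.plot(x, y_smooth, "-", label="filtfilt")
plt.legend()
plt.show()
scipy.signal.savgol_filter is another commonly suggested, low-theory option for smoothing data before peak finding.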

Exponential Moving Average Pandas vs Ta-lib

I'm currently writing some code involving financial calculations, in particular some exponential moving averages. To do the job I have tried Pandas and TA-Lib:
talib_ex=pd.Series(talib.EMA(self.PriceAdjusted.values,timeperiod=200),self.PriceAdjusted.index)
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=True,min_periods=200-1).mean()
They both work fine, but they provide different results at the beginning of the array:
So is there some parameter to be changed in Pandas' ewm, or is it a bug I should worry about?
Thanks in advance
Luca
For the talib EMA, the formula is the plain recursive one: EMA[i] = alpha * price[i] + (1 - alpha) * EMA[i-1], with alpha = 2 / (timeperiod + 1).
So if you want to make the Pandas EMA the same as the talib one, you should use it as:
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=False,min_periods=200-1).mean()
Set adjust to False, according to the documentation (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html), if you want to use the same formula as talib:
When adjust is True (default), weighted averages are calculated using weights (1-alpha)^(n-1), (1-alpha)^(n-2), ..., 1-alpha, 1.
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0]; weighted_average[i] = (1-alpha) * weighted_average[i-1] + alpha * arg[i].
You can also reference here:
https://en.wikipedia.org/wiki/Moving_average
PS: However, in my project I still find some small differences between talib and pandas.ewm, and I don't know why yet...
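A small check of the adjust=False claim, with made-up prices (TA-Lib itself is left out so the snippet runs without it). The remaining small differences mentioned in the PS are often attributed to TA-Lib seeding its EMA with the simple average of the first timeperiod values rather than with the first price, though that is an assumption here:
import numpy as np
import pandas as pd

# Made-up price series; with adjust=False, Pandas' ewm follows the recursion
# documented above: ema[i] = (1-alpha)*ema[i-1] + alpha*price[i].
prices = pd.Series(np.random.uniform(90, 110, size=500))
span = 200
alpha = 2.0 / (span + 1)

pandas_ema = prices.ewm(span=span, adjust=False).mean()

# Manual recursion, seeded with the first price.
manual = np.empty(len(prices))
manual[0] = prices.iloc[0]
for i in range(1, len(prices)):
    manual[i] = (1 - alpha) * manual[i - 1] + alpha * prices.iloc[i]

print(np.allclose(pandas_ema.to_numpy(), manual))  # True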

What algorithm does Pandas use for computing variance?

Which method does Pandas use for computing the variance of a Series?
For example, using Pandas (v0.14.1):
pandas.Series(numpy.repeat(500111,2000000)).var()
12.579462289731145
Obviously due to some numeric instability. However, in R we get:
var(rep(500111,2000000))
0
I wasn't able to make enough sense of the Pandas source-code to figure out what algorithm it uses.
This link may be useful: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Update: To summarize the comments below - if the Python bottleneck package for fast NumPy array functions is installed, a more stable two-pass algorithm similar to np.sqrt(((arr - arr.mean())**2).mean()) is used and gives 0.0 (as indicated by @Jeff); whereas if it is not installed, the naive implementation indicated by @BrenBarn is used.
The algorithm can be seen in nanops.py, in the function nanvar, the last line of which is:
return np.fabs((XX - X ** 2 / count) / d)
This is the "naive" implementation at the beginning of the Wikipedia article you mention. (d will be set to N-1 in the default case.)
The behavior you're seeing appears to be due to the sum of squared values overflowing the numpy datatypes. It's not an issue of how the variance is calculated per se.
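To make the difference between the two formulas concrete, here is a small illustration (independent of Pandas' actual source) of the one-pass expression used by nanvar versus a two-pass computation on the data from the question:
import numpy as np

arr = np.repeat(500111.0, 2000000)
n = arr.size

# One-pass "naive" formula, the same shape as the nanvar line above:
# sum of squares minus (squared sum / n); the large, nearly equal
# intermediate terms make it vulnerable to catastrophic cancellation.
XX = np.sum(arr ** 2)
X = np.sum(arr)
naive_var = np.fabs((XX - X ** 2 / n) / (n - 1))

# Two-pass formula: subtract the mean first, then square; this is the
# numerically stable route and yields 0.0 for constant data.
two_pass_var = np.sum((arr - arr.mean()) ** 2) / (n - 1)

print(naive_var)     # may deviate from the exact answer of 0.0
print(two_pass_var)  # 0.0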
I don't know the answer, but it seems related to how Series are stored, not necessarily the var function.
np.var(pd.Series(np.repeat(100000000, 100000)))
26848.788479999999
np.var(np.repeat(100000000, 100000))
0.0
Using Pandas 0.11.0.
