Coin tosses, arithmetic of random variables, and PyMC3 - python

I find myself wanting to perform arithmetic of random variables in Python; for the sake of example, let us consider the experiment of repeatedly tossing two independent fair coins and counting the number of heads.
Sampling from each random variable independently is straightforward with scipy.stats, and we can start getting results right away
In [5]: scipy.stats.bernoulli(0.5).rvs(10) + scipy.stats.bernoulli(0.5).rvs(10)
Out[5]: array([1, 0, 0, 0, 1, 1, 1, 2, 1, 2])
Now, a pessimist would remark that we wouldn't even have to go that far and could instead just do np.random.randint(2, size=10) + np.random.randint(2, size=10), and a cynic would notice that we could just calculate the sum and never have to sample anything.
And they'd be right. So, say that we have many more variables and more complex operations to perform on them, and graphical models quickly become useful. That is, we might want to operate on the random variables themselves and only start sampling when our graph of computation is set up. In lea, which does exactly that (albeit only for discrete distributions), the example above becomes
In [1]: from lea import Lea
In [7]: (Lea.bernoulli(0.5) + Lea.bernoulli(0.5)).random(10)
Out[7]: (0, 2, 0, 2, 0, 2, 1, 1, 1, 2)
Appears to be working like a charm. Enter PyMC3, one of the more popular libraries for probabilistic programming. Now, PyMC3 is intended for usage with MCMC and Bayesian modeling in particular, but it has the building blocks we need for our experiment above. Alas,
In [1]: import pymc3 as pm
In [2]: pm.__version__
Out[2]: '3.2'
In [3]: with pm.Model() as model:
   ...:     x = pm.Bernoulli('x', 0.5)
   ...:     y = pm.Bernoulli('y', 0.5)
   ...:     z = pm.Deterministic('z', x+y)
   ...:     trace = pm.sample(10)
   ...:
Assigned BinaryGibbsMetropolis to x
Assigned BinaryGibbsMetropolis to y
100%|███████████████████████████████████████| 510/510 [00:02<00:00, 254.22it/s]
In [4]: trace['z']
Out[4]: array([2, 0, 2, 0, 2, 0, 2, 0, 2, 0], dtype=int64)
Not exactly random. Unfortunately, I lack the theoretical understanding of why the Gibbs sampler produces this particular result (and really I should probably just hit the books). Using step=pm.Metropolis() instead, we get the correct distribution at the end of the day, even if the individual samples correlate strongly with their neighbours (as is to be expected from MCMC).
In [8]: with pm.Model() as model:
   ...:     x = pm.Bernoulli('x', 0.5)
   ...:     y = pm.Bernoulli('y', 0.5)
   ...:     z = pm.Deterministic('z', x+y)
   ...:     trace = pm.sample(10000, step=pm.Metropolis())
   ...:
100%|██████████████████████████████████████████████████████████████████████████████████████████| 10500/10500 [00:02<00:00, 5161.18it/s]
In [14]: collections.Counter(trace['z'])
Out[14]: Counter({0: 2493, 1: 5024, 2: 2483})
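For comparison, the exact distribution of the sum can be computed without sampling, which is the cynic's shortcut mentioned earlier. A minimal sketch, assuming only NumPy; the names coin and pmf_sum are illustrative and not part of any of the libraries above. The PMF of a sum of independent discrete variables is the convolution of their PMFs:

```python
import numpy as np

# PMF of one fair coin: index 0 is P(tails), index 1 is P(heads)
coin = np.array([0.5, 0.5])

# PMF of the sum of two independent coins is the convolution of the two PMFs
pmf_sum = np.convolve(coin, coin)

# Index k holds P(number of heads == k)
print(pmf_sum.tolist())
```

The probabilities 0.25, 0.5 and 0.25 are what the Metropolis counts above approximate.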
So, maybe I could just go ahead and use pm.Metropolis for simulating my post-arithmetic distribution, but I'd be afraid that I was missing something, and so the question finally becomes: Why does the step-less simulation above fail, are there any pitfalls in using PyMC3 for ordinary (non-Markov-chain) Monte Carlo, and is what I'm trying to do even possible in PyMC3 in the first place?

Comments by colcarroll:
[Feb. 21, 2018]: Definitely a bug - github.com/pymc-devs/pymc3/issues/2866 . What you are doing should work, but is not the intention of the library. You would use PyMC3 to reason about uncertainty (perhaps observing z and reasoning about the probabilities of x and y). I think your first two approaches, and perhaps the pomegranate library would be more efficient. See stackoverflow.com/questions/46454814/…
[Feb. 25, 2018]: This is now fixed on master (see github.com/pymc-devs/pymc3/pull/2867) by Junpeng Lao. See andrewgelman.com/2018/01/18/… for background on "Anticorrelated draws". I am not sure how stackoverflow wants to handle a question like this.

Related

Has the regularization parameter C changed in sklearn.LogisticRegression or is there a bug in newer versions (>0.24)?

In the past I used Python 3.5 and sklearn version <0.19 to study the effect of 3 different values of the regularization parameter C of sklearn.LogisticRegression.
My sample code:
from sklearn.linear_model import LogisticRegression
import numpy as np
X = np.array([[ 90], [ 90], [130], [170], [ 90], [180], [110], [ 70], [ 70],
              [140], [170], [110], [ 70], [160], [ 80]])
y = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0])
for cc in [0.1, 1, 100]:  # [10]:#
    clfLogR = LogisticRegression(C=cc).fit(X, y)
    a = clfLogR.intercept_
    b = clfLogR.coef_
    print(a, b)
As expected, this produced three very different values of the coefficient and intercept, since the values of C differ by orders of magnitude.
Recently I upgraded to Python 3.9.7 and scikit-learn 1.1.3, and I cannot reproduce the results. When running the same code, the values of a and b are much closer to each other: if I plot the logistic curves, they virtually coincide unless I greatly widen the range of C, e.g. C in [0.0001, 1, 10000].
Has the processing/interpretation of regularization parameter changed?
How to get to the original results without having to downgrade Python or sklearn?
I downgraded sklearn to version 0.24, with no success. I cannot downgrade further, as that would require downgrading Python, and I am not familiar with using environments. I checked the source code of sklearn.LogisticRegression but did not find a bug, nor did I find a hint in the release history of scikit-learn.
How could I get to the original results without downgrading?
If I should ask the developer team, how can I contact them?
Thanks
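One change that may be relevant here, offered as something to check rather than a confirmed diagnosis: scikit-learn 0.22 changed the default solver of LogisticRegression from liblinear to lbfgs, and liblinear also applies the penalty to the intercept, which matters when the feature is far from zero as it is here. The sketch below passes solver="liblinear" explicitly; whether it reproduces the original numbers exactly has not been verified against the old installation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[90], [90], [130], [170], [90], [180], [110], [70], [70],
              [140], [170], [110], [70], [160], [80]])
y = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0])

results = {}
for cc in [0.1, 1, 100]:
    # Request the solver that was the default before scikit-learn 0.22
    clf = LogisticRegression(C=cc, solver="liblinear").fit(X, y)
    results[cc] = (float(clf.intercept_[0]), float(clf.coef_[0, 0]))
    print(cc, results[cc])
```

Standardizing X before fitting is another option to consider, since it makes the fit less sensitive to how the intercept is treated.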

How can I use rv_continuous to generate random samples from a gaussian kde?

I've been stuck on this for awhile now. Here's my problem: I have a set of observed data. I want to turn this data into a pmf, use a gaussian kernel density estimator to estimate a continuous pdf from this observed data, and then be able to sample from that pdf.
I am using Scipy, and have managed to get the kernel density estimation to work. I think what I need to sample from it is to subclass rv_continuous and overwrite the _pdf method by returning the evaluation of my kernel density estimation. However, when I try to define the kernel density estimation in my rv_continuous class' init method, I am unable to call .rvs() on the resulting class to sample from it. But when I define a function separately and call this function independently in the rv_continuous class' _pdf method, it works fine.
This sounds confusing, I know, but see what I'm talking about in the code below.
from scipy.stats import gaussian_kde, rv_continuous
import numpy.random as npr
# Create fake data just to test if this works
test_data = [0, 8, 12, 35, 40, 4, 1, 0, 0, 0]
output = []
for entry in test_data:
    number_obs = entry
    for i in range(number_obs):
        mao = npr.uniform()
        output.append(mao)
# First, what I would like to work
class mao_pdf(rv_continuous):
    """
    Class for creating a pdf, round-by-round, from which samples may be drawn.
    """
    def __init__(self, data):
        super(rv_continuous, self).__init__()
        self.kde = gaussian_kde(data, bw_method = 0.18)
    def _pdf(self, x):
        return self.kde.evaluate(x)[0]
pdf = mao_pdf(output)
print(pdf.rvs()) # This does not work
# Now, what paradoxically works (but is really the exact same thing
# just in a convoluted way)
test_kde = gaussian_kde(output, bw_method = 0.18)
def f(x):
    return test_kde.evaluate(x)[0]
class test_pdf(rv_continuous):
    def _pdf(self, x):
        return f(x)
pdf = test_pdf(a = 0, b = 1)
print(pdf.rvs()) # This one works
So it seems like it might have something to do with the bounds (setting a = 0, b = 1), but for the life of me I cannot figure out why this is so critical or how to even implement this in my class mao_pdf. I tried just defining self.a = 0 and self.b = 1 in the __init__() method of my mao_pdf class, but that did not work.
This really should not be so complicated, I'm just trying to turn actual observed data into a sample-able continuous probability density function. Any help is greatly appreciated.
Actually, it's very simple to sample from a KDE distribution, without having to calculate the PDF for the KDE:
1. Choose a data point uniformly at random (with replacement).
2. Add a normally distributed random number to the data point, with a mean of 0 and a standard deviation equal to the bandwidth.
In fact, this distribution can be sampled even without SciPy, as long as you know the bandwidth:
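import numpy
gen = numpy.random.Generator(numpy.random.PCG64())
test_data = [0, 8, 12, 35, 40, 4, 1, 0, 0, 0]
# Choose 10 data points from `test_data` at random.
c = gen.integers(0, len(test_data), size=10)
c = numpy.asarray([float(test_data[d]) for d in c])
# Add an independent Gaussian jitter to each chosen point.
# gaussian_kde multiplies the bandwidth factor (0.18 in your example)
# by the sample standard deviation of the data to get the kernel width.
bandwidth = 0.18 * numpy.std(test_data, ddof=1)
c += gen.normal(0, bandwidth, size=c.size)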
Note that the code above requires NumPy 1.17 or later, which includes an improved system for random number generation. If you must still use NumPy 1.16 or earlier, the following code can be used:
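import numpy
test_data = [0, 8, 12, 35, 40, 4, 1, 0, 0, 0]
# Choose 10 data points from `test_data` at random.
c = numpy.random.randint(0, len(test_data), size=10)
c = numpy.asarray([float(test_data[d]) for d in c])
# Kernel width: the bandwidth factor in your example times the sample standard deviation.
bandwidth = 0.18 * numpy.std(test_data, ddof=1)
c += numpy.random.normal(0, bandwidth, size=c.size)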
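If SciPy is being used anyway, gaussian_kde has a resample method that draws from the fitted density directly, so no rv_continuous subclass is needed. A minimal sketch; the uniform stand-in data and the names data, kde and samples are assumptions for illustration, not the asker's actual observations:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in observations (assumption: any 1-D array of observed values works here)
rng = np.random.default_rng(0)
data = rng.uniform(size=100)

# Same bandwidth factor as in the question
kde = gaussian_kde(data, bw_method=0.18)

# resample returns an array of shape (dimensions, size); take the single row
samples = kde.resample(size=5)[0]
print(samples)
```

Because the kernels are Gaussian, some samples can fall slightly outside the range of the observed data.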

How to generate random values with a 99% probability of values falling into (3, 9) exclusive?

This code generates a 2 x 5 array of ints between 3 and 9, inclusive:
>>> np.random.randint(3,10, size=(2, 5))
array([[6, 8, 5, 7, 6],
[9, 4, 8, 4, 9]])
This code generates a 2 x 5 array of samples from the standard normal distribution:
>>> np.random.randn(2,5)
array([[-1.87600791, 0.01958029, -1.07254967, -1.15393634, -0.43278059],
[ 0.17111773, 1.45624528, -0.74829039, -0.60530629, -0.07440962]])
Any normal distribution is unbounded. Is it reasonable to generate a 2 x 5 array of normally distributed values that fall between 3 and 9, exclusive, with a specified probability such as 99%?
In other words, is it reasonable to implement a function whose generated random values each have a 99% probability of falling into (3, 9), exclusive?
Of course you can. You just need to find the most adequate parameters for the general normal distribution function: mean value and standard deviation.
Now, the first one is easy to calculate, as it is just midway between 3 and 9, which is 6. The second one is unfortunately not so easy, as you'll need more advanced mathematical tools or at least a very good GDC calculator, but I think you can find some app online that does just that. Anyway, if you trust me with this, I calculated it and it should be approximately 1.165.
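The standard deviation can also be computed rather than taken on trust. A minimal sketch, assuming SciPy is available; the variable names are illustrative. For a symmetric interval, 99% coverage leaves 0.5% in each tail, so the half-width of the interval must equal the 99.5% quantile of the standard normal times sigma:

```python
from scipy.stats import norm

low, high, coverage = 3.0, 9.0, 0.99

# Centre of the interval
mu = (low + high) / 2

# Standard-normal quantile that leaves (1 - coverage) / 2 in the upper tail
z = norm.ppf(0.5 + coverage / 2)

# Half-width of the interval divided by that quantile
sigma = (high - mu) / z
print(mu, sigma)
```

This gives a sigma of roughly 1.165, matching the value quoted above.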
All there is left to do is to implement them. Easier done than said, as in the documentation here it's written very clearly how to proceed.
This is the code plus the result outputted:
>>> 1.165 * np.random.randn(2, 5) + 6
[[5.24339407 6.7414676 4.13757041 7.58498417 5.68613585]
[6.73871503 8.09501399 7.57774228 4.41143519 5.69703988]]
I hope this answer satisfies your question.
To add to Michele's answer (which is correct) - you can verify your results as below. Also, I just made this interactive tool you can use for finding any such interval (x_low, x_high) - sigma & mu replaced w/ alpha & beta, as Desmos doesn't support them.
import matplotlib.pyplot as plt
import numpy as np
X = 1.165 * np.random.randn(50000) + 6
plt.hist(np.ndarray.flatten(X), bins=1000)
plt.axvline(x=3, color='r')
plt.axvline(x=9, color='r')
frac_between_3_and_9 = np.sum((X > 3) & (X < 9)) / X.size
print(frac_between_3_and_9)
# .99008

Neural Network Data Sparsity

I am using PyBrain to train a network on music. The input is two notes, and the output is the next two notes.
Each note is represented by an integer mapped to a note (e.g. C# = 11, F = 7), the octave, and the duration. So I was using a dataset like this:
ds = SupervisedDataSet(6, 6)
Which would look like ([note1, octave1, duration1, note2, octave2, duration2], [note1, octave1, duration1, note2, octave2, duration2])
However, I ran into a problem with chords (i.e. more than one note played at once). To solve this, I got rid of the first integer representing a note and replaced it with 22 integers, each set to either one or zero, to indicate which notes are being played. This is still followed by the octave and duration.
So for example, the following
[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 4, 0.5]
represents a chord of C#, E and A, with an octave of 4 and duration of 0.5.
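The encoding described above can be sketched as follows. This is an illustration under stated assumptions: encode_chord and NUM_NOTE_SLOTS are hypothetical names, and the zero-based positions 3, 8 and 17 are read off the example vector rather than taken from a known note-to-index mapping.

```python
NUM_NOTE_SLOTS = 22  # number of 0/1 note indicators described in the question


def encode_chord(note_indices, octave, duration):
    """Return a multi-hot note vector followed by octave and duration."""
    bits = [0] * NUM_NOTE_SLOTS
    for idx in note_indices:
        bits[idx] = 1  # mark each note that sounds in the chord
    return bits + [octave, duration]


# Positions taken from the example vector (assumed to stand for C#, E and A)
vector = encode_chord([3, 8, 17], octave=4, duration=0.5)
print(vector)
```

With this representation each input has 24 entries, so the dataset and network dimensions would need to match that length rather than 6.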
PyBrain always gives me an output of all zeros after training and testing. I understand why it's doing this but I don't know how to fix it.
Is there a better way to represent the notes/chords so that PyBrain won't have this problem?
EDIT: I have since converted the bit vector into a decimal number, and while the network isn't just giving zeros anymore it's still pretty clear it's not learning the patterns correctly.
I am using a network like this:
net = buildNetwork(6, 24, 6, bias=True, hiddenclass=LSTMLayer, recurrent=True)
and a trainer like this:
trainer = BackpropTrainer(net, ds, verbose = True)
When I train, I am getting a huge error, something like ten or a hundred thousand.
Your problem is not entirely clear to me and could use a more detailed explanation, but from what I understood, I suppose that you don't need recurrence in your network. Also try another activation function in the hidden layer, for example Softmax. I tested it on a data set of samples with 6 input nodes and 6 output nodes, and it trains properly, so here is my version:
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules import SoftmaxLayer
ds = SupervisedDataSet(6, 6)
#
# fill dataset
#
net = buildNetwork(6, 24, 6, bias=True, hiddenclass=SoftmaxLayer)
trainer = BackpropTrainer(net, ds)
# train
error = 10
while error > 0.00001:  # choose the error threshold you want
    error = trainer.train()
    print(error)  # just for logging
# and activate, replacing the placeholders with six input values
# print(net.activate([*, *, *, *, *, *]))

Custom PDF from scipy.stats.rv_continuous unwanted upper-bound

I am attempting to generate random samples from a probability density function for QSOs of a given luminosity, with the form:
1/( (L/L_B^* )^alpha + (L/L_B^* )^beta )
where L_B^*, alpha, and beta are all constants. To do this, the following code is used:
import scipy.stats as st
logLbreak = 43.88
alpha = 3.4
beta = 1.6
class my_pdf(st.rv_continuous):
    def _pdf(self, l_L):
        # "l_L" in this is always log L
        L = 10**(l_L/logLbreak)
        D = 1/(L**alpha + L**beta)
        return D
dist_Log_L = my_pdf(momtype = 0, a = 0,name='l_L_dist')
distro = dist_Log_L.rvs(size = 10000)
(L/L^* is raised to a power of 10 since everything is being done in a log scale)
The distribution is supposed to produce a graph that approximates this, trailing off to infinity, but in reality the graph it produces looks like this (10,000 samples). The upper bound is the same regardless of the number of samples that are used. Is there a reason it is being restricted in the way it is?
Your PDF is not properly normalized. The integral of a PDF over the domain must be 1. Your PDF integrates to approximately 3.4712:
In [72]: from scipy.integrate import quad
In [73]: quad(dist_Log_L._pdf, 0, 100)
Out[73]: (3.4712183965415373, 2.0134487716044682e-11)
In [74]: quad(dist_Log_L._pdf, 0, 800)
Out[74]: (3.4712184965748905, 2.013626296581202e-11)
In [75]: quad(dist_Log_L._pdf, 0, 1000)
Out[75]: (3.47121849657489, 8.412130378805368e-10)
This will break the class's implementation of inverse transform sampling. It will only generate samples from the domain up to where the integral of the PDF from 0 to x first reaches 1.0, which in your case is about 2.325.
In [81]: quad(dist_Log_L._pdf, 0, 2.325)
Out[81]: (1.0000875374350238, 1.1103202107010366e-14)
That is, in fact, what you see in your histogram.
As a quick fix to verify the issue, I modified the return statement of the _pdf() method to:
return D/3.47121849657489
and ran your script again. (In a real fix, that value will be a function of the other parameters.) Then the commands
In [85]: import matplotlib.pyplot as plt
In [86]: plt.hist(distro, bins=31)
generate this plot:
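Instead of hard-coding the constant, it can be computed from the parameters. The sketch below is one way to do that under stated assumptions: unnormalized_pdf is a hypothetical stand-alone copy of the question's _pdf, the integration limit of 1000 follows the quad calls above, and the sampling step uses inverse transform sampling on a numerical grid (a swapped-in technique, not the rv_continuous machinery) to show that normalized samples extend past 2.325.

```python
import numpy as np
from scipy.integrate import quad

logLbreak = 43.88
alpha = 3.4
beta = 1.6


def unnormalized_pdf(l_L):
    # Same expression as the question's _pdf, before normalization
    L = 10 ** (l_L / logLbreak)
    return 1 / (L**alpha + L**beta)


# Normalization constant: integral of the unnormalized density over the domain
norm_const, _ = quad(unnormalized_pdf, 0, 1000)

# Tabulate the normalized CDF on a grid with the trapezoidal rule
grid = np.linspace(0, 300, 30001)
pdf_vals = unnormalized_pdf(grid) / norm_const
steps = 0.5 * (pdf_vals[1:] + pdf_vals[:-1]) * np.diff(grid)
cdf = np.concatenate(([0.0], np.cumsum(steps)))
cdf /= cdf[-1]  # absorb the small discretization error

# Inverse transform sampling: map uniform draws through the inverse CDF
rng = np.random.default_rng(0)
samples = np.interp(rng.uniform(size=10000), cdf, grid)
print(norm_const, samples.max())
```

Dividing the return value of _pdf by norm_const inside the rv_continuous subclass has the same effect as the quick fix, with the constant recomputed whenever alpha, beta or logLbreak change.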
