I wish to create an sklearn GMM object with a predefined set of means, weights, and covariances (on a grid).
I managed to do it:
from sklearn.mixture import GaussianMixture
from functools import reduce  # reduce is a builtin on Python 2; this import keeps it working on Python 3
import numpy as np

def get_grid_gmm(subdivisions=[10, 10, 10], variance=0.05):
    n_gaussians = reduce(lambda x, y: x * y, subdivisions)
    step = [1.0 / (2 * subdivisions[0]), 1.0 / (2 * subdivisions[1]), 1.0 / (2 * subdivisions[2])]
    means = np.mgrid[step[0]: 1.0 - step[0]: complex(0, subdivisions[0]),
                     step[1]: 1.0 - step[1]: complex(0, subdivisions[1]),
                     step[2]: 1.0 - step[2]: complex(0, subdivisions[2])]
    means = np.reshape(means, [-1, 3])
    covariances = variance * np.ones_like(means)
    weights = (1.0 / n_gaussians) * np.ones(n_gaussians)
    gmm = GaussianMixture(n_components=n_gaussians, covariance_type='spherical')
    gmm.weights_ = weights
    gmm.covariances_ = covariances
    gmm.means_ = means
    return gmm

def main():
    xx = np.random.rand(100, 3)
    gmm = get_grid_gmm()
    y = gmm.predict_proba(xx)

if __name__ == "__main__":
    main()
The problem is that calling gmm.predict_proba(), which I need to use later on, fails on this object.
How can I overcome this?
UPDATE : I updated the code to be a complete example that shows the error
UPDATE2
I updated the code according to comments and answers
from sklearn.mixture import GaussianMixture
from sklearn.mixture.gaussian_mixture import _compute_precision_cholesky
from functools import reduce  # builtin on Python 2
import numpy as np

def get_grid_gmm(subdivisions=[10, 10, 10], variance=0.05):
    n_gaussians = reduce(lambda x, y: x * y, subdivisions)
    step = [1.0 / (2 * subdivisions[0]), 1.0 / (2 * subdivisions[1]), 1.0 / (2 * subdivisions[2])]
    means = np.mgrid[step[0]: 1.0 - step[0]: complex(0, subdivisions[0]),
                     step[1]: 1.0 - step[1]: complex(0, subdivisions[1]),
                     step[2]: 1.0 - step[2]: complex(0, subdivisions[2])]
    # transpose so means_ has the (n_components, n_features) shape sklearn expects
    means = np.reshape(means, [3, -1]).T
    covariances = variance * np.ones(n_gaussians)
    cov_type = 'spherical'
    weights = (1.0 / n_gaussians) * np.ones(n_gaussians)
    gmm = GaussianMixture(n_components=n_gaussians, covariance_type=cov_type)
    gmm.weights_ = weights
    gmm.covariances_ = covariances
    gmm.means_ = means
    gmm.precisions_cholesky_ = _compute_precision_cholesky(covariances, cov_type)
    gmm.precisions_ = gmm.precisions_cholesky_ ** 2  # valid for the spherical case

    return gmm

def main():
    xx = np.random.rand(100, 3)
    gmm = get_grid_gmm()
    log_prob = gmm._estimate_log_prob(xx)  # per-component log densities
    y = np.exp(log_prob)

if __name__ == "__main__":
    main()
No more errors, but _estimate_log_prob and predict_proba do not produce the same result for a fitted GMM. Why could that be?
Since you don't train the model but just use it for estimation, you don't need the full object; you can call the same function sklearn uses under the hood. You could try _estimate_log_gaussian_prob. That is what they do internally, I think.
Have a look at the source:
in particular at the base class
https://github.com/scikit-learn/scikit-learn/blob/ab93d657eb4268ac20c4db01c48065b5a1bfe80d/sklearn/mixture/base.py#L342
which calls the specific method, which in turn calls this function:
https://github.com/scikit-learn/scikit-learn/blob/ab93d657eb4268ac20c4db01c48065b5a1bfe80d/sklearn/mixture/gaussian_mixture.py#L671
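To make that concrete, here is a minimal sketch of calling those helpers directly (assuming an sklearn version like the one linked above, where the private helpers live in sklearn.mixture.gaussian_mixture; the component count, means, and variance below are made-up placeholders). It also shows why the raw per-component log probabilities differ from predict_proba: the latter adds the log mixture weights and normalizes per sample.

from sklearn.mixture.gaussian_mixture import (_compute_precision_cholesky,
                                              _estimate_log_gaussian_prob)
from scipy.special import logsumexp
import numpy as np

n_components = 8
means = np.random.rand(n_components, 3)        # (n_components, n_features)
covariances = 0.05 * np.ones(n_components)     # spherical: one variance per component
weights = np.full(n_components, 1.0 / n_components)

precisions_chol = _compute_precision_cholesky(covariances, 'spherical')
X = np.random.rand(100, 3)

# per-component log densities; this is what _estimate_log_prob returns
log_prob = _estimate_log_gaussian_prob(X, means, precisions_chol, 'spherical')

# predict_proba additionally weights each component and normalizes per sample
weighted = log_prob + np.log(weights)
proba = np.exp(weighted - logsumexp(weighted, axis=1, keepdims=True))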
I am trying to deconvolve complex gas chromatogram signals into individual Gaussian signals. Here is an example, where the dotted line represents the signal I am trying to deconvolve.
I was able to write the code to do this using scipy.optimize.curve_fit; however, once applied to real data the results were unreliable. I believe being able to set bounds on my parameters will improve my results, so I am attempting to use lmfit, which allows this. I am having a problem getting lmfit to work with a variable number of parameters. The signals I am working with may have an arbitrary number of underlying Gaussian components, so the number of parameters I need will vary. I found some hints here, but still can't figure it out...
Creating a python lmfit Model with arbitrary number of parameters
Here is the code I am currently working with. The code will run, but the parameter estimates do not change when the model is fit. Does anyone know how I can get my model to work?
import numpy as np
from collections import OrderedDict
from scipy.stats import norm
from lmfit import Parameters, Model

def add_peaks(x_range, *pars):
    y = np.zeros(len(x_range))
    for i in np.arange(0, len(pars), 3):
        curve = norm.pdf(x_range, pars[i], pars[i + 1]) * pars[i + 2]
        y = y + curve
    return y

# generate some fake data
x_range = np.linspace(0, 100, 1000)
peaks = [50., 40., 60.]
a = norm.pdf(x_range, peaks[0], 5) * 2
b = norm.pdf(x_range, peaks[1], 1) * 0.1
c = norm.pdf(x_range, peaks[2], 1) * 0.1
fake = a + b + c

param_dict = OrderedDict()
for i in range(0, len(peaks)):
    param_dict['pk' + str(i)] = peaks[i]
    param_dict['wid' + str(i)] = 1.
    param_dict['mult' + str(i)] = 1.

# In case you'd like to see the plot of the fake data:
# y = add_peaks(x_range, *param_dict.values())
# plt.plot(x_range, y)
# plt.show()

# Initialize the model and fit
pmodel = Model(add_peaks)
params = pmodel.make_params()
for i in param_dict.keys():
    params.add(i, value=param_dict[i])

result = pmodel.fit(fake, params=params, x_range=x_range)
print(result.fit_report())
I think you would be better off using lmfit's ability to build a composite model.
That is, with a single peak defined with
from scipy.stats import norm

def peak(x, amp, center, sigma):
    return amp * norm.pdf(x, center, sigma)
(see also lmfit.models.GaussianModel), you can build a model with many peaks:
npeaks = 3
model = Model(peak, prefix='p1_')
for i in range(1, npeaks):
    model = model + Model(peak, prefix='p%d_' % (i + 1))

params = model.make_params()
Now model will be a sum of 3 Gaussian functions, and the params created for that model will have names like p1_amp, p1_center, p2_amp, ..., to which you can assign sensible initial values and/or bounds and/or constraints.
Given your example data, you could pass in initial values to make_params like
params = model.make_params(p1_amp=2.0, p1_center=50., p1_sigma=2,
                           p2_amp=0.2, p2_center=40., p2_sigma=2,
                           p3_amp=0.2, p3_center=60., p3_sigma=2)
result = model.fit(fake, params, x=x_range)
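Since setting bounds was the reason for moving to lmfit in the first place, note that each of these parameters can also be constrained; a minimal sketch using the parameter names created above:

params['p1_sigma'].set(min=0)               # widths must stay positive
params['p2_sigma'].set(min=0)
params['p3_sigma'].set(min=0)
params['p2_center'].set(min=35., max=45.)   # keep this center near its expected location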
I was able to find a solution here:
https://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes
Building on the code above, the following accomplishes what I was trying to do...
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel

gauss1 = GaussianModel(prefix='g1_')
gauss2 = GaussianModel(prefix='g2_')
gauss3 = GaussianModel(prefix='g3_')
gauss4 = GaussianModel(prefix='g4_')
gauss5 = GaussianModel(prefix='g5_')
gauss = [gauss1, gauss2, gauss3, gauss4, gauss5]
prefixes = ['g1_', 'g2_', 'g3_', 'g4_', 'g5_']

mod = np.sum(gauss[0:len(peaks)])
pars = mod.make_params()
for i, prefix in zip(range(0, len(peaks)), prefixes[0:len(peaks)]):
    pars[prefix + 'center'].set(peaks[i])

init = mod.eval(pars, x=x_range)
out = mod.fit(fake, pars, x=x_range)
print(out.fit_report(min_correl=0.5))
out.plot_fit()
plt.show()
In trying to make my way through Bayesian Methods for Hackers, which is in pymc, I came across this code:
first_coin_flips = pm.Bernoulli("first_flips", 0.5, size=N)
I've tried to translate this to pymc3 with the following, but it just returns a numpy array rather than a tensor (?):
first_coin_flips = pm.Bernoulli("first_flips", 0.5).random(size=50)
The reason the size matters is that it's used later on in a deterministic variable. Here's the entirety of the code that I have so far:
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
import mpld3
import theano.tensor as tt

model = pm.Model()
with model:
    N = 100
    p = pm.Uniform("cheating_freq", 0, 1)
    true_answers = pm.Bernoulli("truths", p)
    print(true_answers)
    first_coin_flips = pm.Bernoulli("first_flips", 0.5)
    second_coin_flips = pm.Bernoulli("second_flips", 0.5)
    # print(first_coin_flips.value)

    # Create model variables
    def calc_p(true_answers, first_coin_flips, second_coin_flips):
        observed = first_coin_flips * true_answers + (1 - first_coin_flips) * second_coin_flips
        # NOTE: Where I think the size param matters, since we're dividing by it
        return observed.sum() / float(N)

    calced_p = pm.Deterministic("observed", calc_p(true_answers, first_coin_flips, second_coin_flips))

    step = pm.Metropolis(model.free_RVs)
    trace = pm.sample(1000, tune=500, step=step)

pm.traceplot(trace)
html = mpld3.fig_to_html(plt.gcf())
with open("output.html", 'w') as f:
    f.write(html)
And the output: the coin flips and the uniform cheating_freq traces look correct, but the observed trace doesn't look like anything to me, and I think it's because I'm not translating that size param correctly.
The pymc3 way to specify the size of a Bernoulli distribution is by using the shape parameter, like:
first_coin_flips = pm.Bernoulli("first_flips", 0.5, shape=N)
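Applied to the model in the question, every per-respondent variable needs the same shape so the element-wise arithmetic lines up; a minimal sketch (variable names taken from the question):

import pymc3 as pm

N = 100
with pm.Model() as model:
    p = pm.Uniform("cheating_freq", 0, 1)
    true_answers = pm.Bernoulli("truths", p, shape=N)
    first_coin_flips = pm.Bernoulli("first_flips", 0.5, shape=N)
    second_coin_flips = pm.Bernoulli("second_flips", 0.5, shape=N)
    observed = first_coin_flips * true_answers + (1 - first_coin_flips) * second_coin_flips
    calced_p = pm.Deterministic("observed", observed.sum() / float(N))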
Is there something big broken in the tensorflow framework, or am I just making some simple mistake here? I tried to get GMM or KMeans clustering to work, but I'm totally stuck.
https://pastebin.com/eNxs5mUQ
import numpy as np
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.estimators.kmeans import KMeansClustering

def make_random_centers(num_centers, num_dims):
    return np.round(np.random.rand(num_centers,
                                   num_dims).astype(np.float32) * 500)

def make_random_points(centers, num_points):
    num_centers, num_dims = centers.shape
    assignments = np.random.choice(num_centers, num_points)
    offsets = np.round(np.random.randn(num_points,
                                       num_dims).astype(np.float32) * 20)
    points = centers[assignments] + offsets
    return points

num_centers = 3
num_dims = 2
num_points = 100

true_centers = make_random_centers(num_centers, num_dims)
points = make_random_points(true_centers, num_points)

print(points.shape)
print(points.dtype)

km = KMeansClustering(num_centers)
km.fit(x=points)
clusters = km.clusters()
print(clusters)
I'm getting an InvalidArgumentError even though my data seems to have the correct shape and dtype ((100, 2), float32):
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'input' with dtype float and shape [?,2]
Try using an input_fn, like this:

def get_input_fn(x):
    def input_fn():
        return tf.constant(x.astype('float32')), None
    return input_fn
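The fit call then takes the input_fn in place of x; a sketch, assuming the usual tf.contrib.learn Estimator interface:

km = KMeansClustering(num_centers)
km.fit(input_fn=get_input_fn(points), steps=10)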
I am currently using scikit-learn for text classification on the 20ng dataset. I want to calculate the information gain for a vectorized dataset. It has been suggested to me that this can be accomplished using mutual_info_classif from sklearn. However, this method is really slow, so I was trying to implement information gain myself, based on this post.
I came up with the following solution:
from scipy.stats import entropy
import numpy as np

def information_gain(X, y):

    def _entropy(labels):
        counts = np.bincount(labels)
        return entropy(counts, base=None)

    def _ig(x, y):
        # indices where x is set/not set
        x_set = np.nonzero(x)[1]
        x_not_set = np.delete(np.arange(x.shape[1]), x_set)
        h_x_set = _entropy(y[x_set])
        h_x_not_set = _entropy(y[x_not_set])
        return entropy_full - (((len(x_set) / f_size) * h_x_set)
                               + ((len(x_not_set) / f_size) * h_x_not_set))

    entropy_full = _entropy(y)
    f_size = float(X.shape[0])
    scores = np.array([_ig(x, y) for x in X.T])
    return scores
Using a very small dataset, most scores from sklearn and my implementation are equal. However, sklearn seems to take frequencies into account, which my algorithm clearly doesn't. For example
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from time import time

categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
X, y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=100,
                     stop_words='english')
X_vec = cv.fit_transform(X)

t0 = time()
res_sk = mutual_info_classif(X_vec, y, discrete_features=True)
print("Time passed for sklearn method: %3f" % (time() - t0))
t0 = time()
res_ig = information_gain(X_vec, y)
print("Time passed for ig: %3f" % (time() - t0))

for name, res_mi, res_ig in zip(cv.get_feature_names(), res_sk, res_ig):
    print("%s: mi=%f, ig=%f" % (name, res_mi, res_ig))
sample output:
center: mi=0.011824, ig=0.003548
christian: mi=0.128629, ig=0.127122
color: mi=0.028413, ig=0.026397
com: mi=0.041184, ig=0.030458
computer: mi=0.020590, ig=0.012327
cs: mi=0.007291, ig=0.001574
data: mi=0.020734, ig=0.008986
did: mi=0.035613, ig=0.024604
different: mi=0.011432, ig=0.005492
distribution: mi=0.007175, ig=0.004675
does: mi=0.019564, ig=0.006162
don: mi=0.024000, ig=0.017605
earth: mi=0.039409, ig=0.032981
edu: mi=0.023659, ig=0.008442
file: mi=0.048056, ig=0.045746
files: mi=0.041367, ig=0.037860
ftp: mi=0.031302, ig=0.026949
gif: mi=0.028128, ig=0.023744
god: mi=0.122525, ig=0.113637
good: mi=0.016181, ig=0.008511
gov: mi=0.053547, ig=0.048207
So I was wondering whether my implementation is wrong, or whether it is correct but scikit-learn uses a different variant of the mutual information algorithm.
A little late with my answer, but you should look at Orange's implementation. Within their app it is used as a behind-the-scenes processor to help inform the dynamic model parameter building process.
The implementation itself looks fairly straightforward and could most likely be ported out. The entropy calculation comes first, in the section starting at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L233:
def _entropy(dist):
    """Entropy of class-distribution matrix"""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)
    return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))
Then the second portion, at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L305:
class GainRatio(ClassificationScorer):
    """
    Information gain ratio is the ratio between information gain and
    the entropy of the feature's value distribution. The score was
    introduced in [Quinlan1986]_ to alleviate overestimation for
    multi-valued features. See `Wikipedia entry on gain ratio
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

    .. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.
    """
    def from_contingency(self, cont, nan_adjustment):
        h_class = _entropy(np.sum(cont, axis=1))
        h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
        h_attribute = _entropy(np.sum(cont, axis=0))
        if h_attribute == 0:
            h_attribute = 1
        return nan_adjustment * (h_class - h_residual) / h_attribute
The actual scoring process happens at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L218
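To illustrate what from_contingency computes, here is a minimal sketch that applies the same _entropy helper to a hypothetical 2x2 contingency matrix (rows are classes, columns are feature absent/present; the counts are made up):

import numpy as np

def _entropy(dist):
    """Entropy of a class-distribution vector or matrix, as in Orange."""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)
    return np.sum(np.sum(-p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))

# hypothetical counts: rows are classes, columns are feature value 0 / 1
cont = np.array([[30., 5.],
                 [10., 55.]])

h_class = _entropy(np.sum(cont, axis=1))  # H(class)
h_residual = _entropy(cont)               # H(class | feature)
print(h_class - h_residual)               # plain information gain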
I am trying to make the simplest possible regression work in pyBrain, but somehow I'm failing.
The neural network should learn the function Y = 3*X.
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import SupervisedDataSet
from pybrain.structure import FullConnection, FeedForwardNetwork, TanhLayer, LinearLayer, BiasUnit
import matplotlib.pyplot as plt
from numpy import *

n = FeedForwardNetwork()
n.addInputModule(LinearLayer(1, name='in'))
n.addInputModule(BiasUnit(name='bias'))
n.addModule(TanhLayer(1, name='tan'))
n.addOutputModule(LinearLayer(1, name='out'))
n.addConnection(FullConnection(n['bias'], n['tan']))
n.addConnection(FullConnection(n['in'], n['tan']))
n.addConnection(FullConnection(n['tan'], n['out']))
n.sortModules()

# initialize the backprop trainer and train
t = BackpropTrainer(n, learningrate=0.1, momentum=0.0, verbose=True)

# DATASET
DS = SupervisedDataSet(1, 1)
X = random.rand(100, 1) * 100
Y = X * 3 + random.rand(100, 1) * 5

for r in xrange(X.shape[0]):
    DS.appendLinked((X[r]), (Y[r]))

t.trainOnDataset(DS, 200)

plt.plot(X, Y, '.b')
X = [[i] for i in arange(0, 100, 0.1)]
Y = map(n.activate, X)
plt.plot(X, Y, '-g')
It doesn't learn anything. I have tried to remove the hidden layer (because in this example we don't even need that) and the network started to predict NaNs.
What's going on?
EDIT: This is the code that solved my problem:
# DATASET
DS = SupervisedDataSet(1, 1)
X = random.rand(100, 1) * 100
Y = X * 3 + random.rand(100, 1) * 5
maxy = float(max(Y))
maxx = 100.0

for r in xrange(X.shape[0]):
    DS.appendLinked((X[r] / maxx), (Y[r] / maxy))

t.trainOnDataset(DS, 200)

plt.plot(X, Y, '.b')
X = [[i] for i in arange(0, 100, 0.1)]
Y = map(lambda x: n.activate(array(x) / maxx) * maxy, X)
plt.plot(X, Y, '-g')
The basic pybrain neurons squash their output into a small fixed range (a TanhLayer, for example, outputs values between -1 and 1). Divide your Y by 300 (the maximum possible value), and you'll get better results.
More generally, find the maximum Y for your dataset, and scale everything by that.