I have fit a linearmodels.PanelOLS model and stored it in m. I now want to test if certain coefficients are simultaneously equal to zero.
Does a fitted linearmodels.PanelOLS object have an F-test function where I can pass my own restriction matrix?
I am looking for something like statsmodels' f_test method.
Here's a minimal reproducible example.
# Libraries
from linearmodels.panel import PanelOLS
from linearmodels.datasets import wage_panel
# Load data and set index
df = wage_panel.load()
df = df.set_index(['nr','year'])
# Add constant term
df['const'] = 1
# Fit model
m = PanelOLS(dependent=df['lwage'], exog=df[['const','expersq','married']])
m = m.fit(cov_type='clustered', cluster_entity=True)
# Is there an f_test method for m???
m.f_test(r_mat=some_matrix_here) # Something along these lines?
You can use wald_test (a standard F-test is numerically identical to a Wald test under some assumptions on the covariance).
# Libraries
from linearmodels.panel import PanelOLS
from linearmodels.datasets import wage_panel
# Load data and set index
df = wage_panel.load()
df = df.set_index(['nr','year'])
# Add constant term
df['const'] = 1
# Fit model
m = PanelOLS(dependent=df['lwage'], exog=df[['const','expersq','married']])
m = m.fit(cov_type='clustered', cluster_entity=True)
Then the test
import numpy as np
# Use matrix notation RB - q = 0 where R is restr and q is value
# Restrictions: expersq = 0.01 & expersq+married = 0.2
restr = np.array([[0,1,0],[0,1,1]])
value = np.array([0.01, 0.2])
m.wald_test(restr, value)
This returns
Linear Equality Hypothesis Test
H0: Linear equality constraint is valid
Statistic: 0.2608
P-value: 0.8778
Distributed: chi2(2)
WaldTestStatistic, id: 0x2271cc6fdf0
You can also use formula syntax if you used formulas to define your model, which can be easier to code up.
fm = PanelOLS.from_formula("lwage~ 1 + expersq + married", data=df)
fm = fm.fit(cov_type='clustered', cluster_entity=True)
fm.wald_test(formula="expersq = 0.01,expersq+married = 0.2")
The result is the same as above.
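For the original question (testing that a set of coefficients is simultaneously zero), the same wald_test call works with a restriction matrix that picks out those coefficients and a zero value vector. A minimal sketch using the fitted results m from above (restr_zero is just an illustrative name):
import numpy as np
# H0: expersq = 0 and married = 0 (columns are const, expersq, married)
restr_zero = np.array([[0, 1, 0],
                       [0, 0, 1]])
m.wald_test(restr_zero, np.zeros(2))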
I am trying to deconvolve complex gas chromatogram signals into individual gaussian signals. Here is an example, where the dotted line represents the signal I am trying to deconvolve.
I was able to write the code to do this using scipy.optimize.curve_fit; however, once applied to real data the results were unreliable. I believe being able to set bounds on my parameters will improve my results, so I am attempting to use lmfit, which allows this. I am having a problem getting lmfit to work with a variable number of parameters. The signals I am working with may have an arbitrary number of underlying Gaussian components, so the number of parameters I need will vary. I found some hints here, but still can't figure it out...
Creating a python lmfit Model with arbitrary number of parameters
Here is the code I am currently working with. The code will run, but the parameter estimates do not change when the model is fit. Does anyone know how I can get my model to work?
import numpy as np
from collections import OrderedDict
from scipy.stats import norm
from lmfit import Parameters, Model
def add_peaks(x_range, *pars):
y = np.zeros(len(x_range))
for i in np.arange(0, len(pars), 3):
curve = norm.pdf(x_range, pars[i], pars[i+1]) * pars[i+2]
y = y + curve
return(y)
# generate some fake data
x_range = np.linspace(0, 100, 1000)
peaks = [50., 40., 60.]
a = norm.pdf(x_range, peaks[0], 5) * 2
b = norm.pdf(x_range, peaks[1], 1) * 0.1
c = norm.pdf(x_range, peaks[2], 1) * 0.1
fake = a + b + c
param_dict = OrderedDict()
for i in range(0, len(peaks)):
param_dict['pk' + str(i)] = peaks[i]
param_dict['wid' + str(i)] = 1.
param_dict['mult' + str(i)] = 1.
# In case you'd like to see the plot of fake data
#y = add_peaks(x_range, *param_dict.values())
#plt.plot(x_range, y)
#plt.show()
# Initialize the model and fit
pmodel = Model(add_peaks)
params = pmodel.make_params()
for i in param_dict.keys():
params.add(i, value=param_dict[i])
result = pmodel.fit(fake, params=params, x_range=x_range)
print(result.fit_report())
I think you would be better off using lmfit's ability to build a composite model.
That is, with a single peak defined with
from scipy.stats import norm
def peak(x, amp, center, sigma):
return amp * norm.pdf(x, center, sigma)
(see also lmfit.models.GaussianModel), you can build a model with many peaks:
npeaks = 3
model = Model(peak, prefix='p1_')
for i in range(1, npeaks):
model = model + Model(peak, prefix='p%d_' % (i+1))
params = model.make_params()
Now model will be a sum of 3 Gaussian functions, and the params created for that model will have names like p1_amp, p1_center, p2_amp, ..., to which you can add sensible initial values and/or bounds and/or constraints.
Given your example data, you could pass in initial values to make_params like
params = model.make_params(p1_amp=2.0, p1_center=50., p1_sigma=2,
p2_amp=0.2, p2_center=40., p2_sigma=2,
p3_amp=0.2, p3_center=60., p3_sigma=2)
result = model.fit(fake, params, x=x_range)
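If you also want bounds (the original motivation for switching to lmfit), you can set them on the created parameters before fitting. A minimal sketch, reusing the model and params objects from above:
# widths and amplitudes must be positive; keep the second peak near 40
params['p1_sigma'].set(min=0)
params['p1_amp'].set(min=0)
params['p2_center'].set(min=30, max=50)
result = model.fit(fake, params, x=x_range)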
I was able to find a solution here:
https://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes
Building on the code above, the following accomplishes what I was trying to do...
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel
gauss1 = GaussianModel(prefix='g1_')
gauss2 = GaussianModel(prefix='g2_')
gauss3 = GaussianModel(prefix='g3_')
gauss4 = GaussianModel(prefix='g4_')
gauss5 = GaussianModel(prefix='g5_')
gauss = [gauss1, gauss2, gauss3, gauss4, gauss5]
prefixes = ['g1_', 'g2_', 'g3_', 'g4_', 'g5_']
mod = np.sum(gauss[0:len(peaks)])
pars = mod.make_params()
for i, prefix in zip(range(0, len(peaks)), prefixes[0:len(peaks)]):
pars[prefix + 'center'].set(peaks[i])
init = mod.eval(pars, x=x_range)
out = mod.fit(fake, pars, x=x_range)
print(out.fit_report(min_correl=0.5))
out.plot_fit()
plt.show()
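If the number of peaks is not known ahead of time, the same idea can be written as a loop instead of hard-coding gauss1 through gauss5. A rough sketch, building on the peaks, fake and x_range variables defined in the question:
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel

# build one GaussianModel per expected peak, each with its own prefix
mod = None
for i, center in enumerate(peaks):
    g = GaussianModel(prefix='g%d_' % (i + 1))
    mod = g if mod is None else mod + g

# initial values and bounds for each component
pars = mod.make_params()
for i, center in enumerate(peaks):
    pars['g%d_center' % (i + 1)].set(value=center)
    pars['g%d_sigma' % (i + 1)].set(value=1, min=0)
    pars['g%d_amplitude' % (i + 1)].set(value=0.5, min=0)

out = mod.fit(fake, pars, x=x_range)
print(out.fit_report(min_correl=0.5))
out.plot_fit()
plt.show()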
I'm new to PyCharm and so far my impression of the debugger is that it's marvelous! However, it behaves strangely in my code and I cannot figure out what is going wrong.
If I set a breakpoint on these lines of code and then press "step over" or "step into my code", it runs until the end, ignoring all other upcoming breakpoints. Any idea what I am doing wrong? Breakpoints before that line work perfectly fine.
for ind, fit in zip(pop, fitnesses):
ind.fitness.values = fit
My code:
You need to pip install deap, efel and brian2 for the code to run.
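For example, from a terminal:
pip install deap efel brian2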
# DEAP
# https://github.com/DEAP/deap/tree/54b83e2cc7a73be7657cb64b01033a31584b891d
# import array
import matplotlib.pyplot as plt
import pandas as pd
from scipy.io import loadmat
import random, numpy, os, efel, scipy, math, time, array, json
from deap import algorithms, base, creator, tools, benchmarks
from deap.benchmarks.tools import diversity, convergence # , hypervolume
from machine_hh_model_v02 import brian_hh_model
# parallel processing
# from scoop import futures
parallel_processing = "no"
# Starting values
channels = {"ENa": 65,
"EK": -90,
"El": -70,
"ECa": 120,
"gNa": 0.05,
"gK": 0.005,
"gL": 1e-4,
"gM": 8e-5,
"gCa": 1e-5}
# Boundaries
bounds = {"ENa": [50, 70],
"EK": [-100, -80],
"El": [-50, -100],
"ECa": [100, 120],
"gNa": [0, 1],
"gK": [0, 1],
"gL": [0, 1],
"gM": [0, 1],
"gCa": [0, 1]}
low, up = [x[0] for x in bounds.values()], [x[1] for x in bounds.values()]
# Set parameters
ext = 2.5 # external current stimulation [nA]
num_gen = 2 # number of generations
num_parents = 40 # number of parents
num_params = len(channels) # number of parameters to optimize
prob_crossover = 0.9 # probability that crossover takes place
# How to generate individuals
def initIndividual(container, sigma):
return container(random.gauss(x, sigma) for x in channels.values())
# CREATOR
# http://deap.readthedocs.io/en/master/tutorials/basic/part1.html
# The create() function takes at least two arguments, a name for the newly created class and a base class. Any
# subsequent argument becomes an attribute of the class. Neg. weights relate to minimizing, pos. weights to maximizing
# problems.
# -- define fitness problem (which params are min./max. problems with which weight)
creator.create("FitnessMulti", base.Fitness, weights=tuple(numpy.ones(num_params) * -1))
# Next we will create the class Individual, which will inherit the class list and contain our previously defined
# FitnessMulti class in its fitness attribute. Note that upon creation all our defined classes will be part of the
# creator container and can be called directly.
# -- associate fitness problem to individuals, that are going to be created
creator.create("Individual", list, fitness=creator.FitnessMulti)
# TOOLBOX
# http://deap.readthedocs.io/en/master/examples/ga_onemax.html
# http://deap.readthedocs.io/en/master/api/tools.html#module-deap.tools
# All the objects we will use on our way, an individual, the population, as well as all functions, operators, and
# arguments will be stored in a DEAP container called Toolbox. It contains two methods for adding and removing content,
# register() and unregister().
toolbox = base.Toolbox()
# The newly introduced register() method takes at least two arguments: an alias and a function. Toolbox.attr_bool(),
# when called, will draw a random integer between -100 and 100. Toolbox.attr_float(), when called, will draw a random
# floating point number.
# -- how to generate values for each individual
# toolbox.register("attr_float", random.uniform, -100, 100)
# toolbox.register("attr_float", lambda: [random.gauss(x, 5) for x in channels.values()])
# Our individuals will be generated using the function initRepeat(). Its first argument is a container class, the
# Individual one we defined in the previous section. This container will be filled using the method attr_float(),
# provided as second argument, and will contain 10 integers, as specified using the third argument. When called, the
# individual() method will thus return an individual initialized with what would be returned by calling the attr_float()
# method 100 times.
# -- how and how many individuals to create
# toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_float, num_params)
toolbox.register("individual", initIndividual, creator.Individual, sigma=1)
# Finally, the population() method uses the same paradigm.
# -- how and how many parents to create
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
# LOAD EXPERIMENTAL DATA
# set path
# mainpath = r'C:\OwnCloud\Masterarbeit' # mainpath
# pathfit = os.path.join(mainpath, r'fitness_params') # fitness files
# os.chdir(os.path.join(mainpath, pathfit)) # change directory
# load fitness file
# xl = pd.ExcelFile('fitness.xlsx') # load excel file
# xl_mean = xl.parse("median") # load sheet 'mean' containing mean/median values
# xl_var = xl.parse("quartile-to-median distance") # load sheet 'std containing std/quantiles values
xl_mean = pd.read_json("median")
xl_var = pd.read_json("distance")
########################
############ SOMETHING IS WRONG HERE
########################
# EFEL: median
def efel_stats(features):
# get latency of first spike
if features['peak_time'].any():
features['first_peak_time'] = features['peak_time'][0]
del features['peak_time']
# get median
for key, val in features.items():
if val is None or numpy.isnan(val) or not val:
features[key] = 9999
else:
features[key] = scipy.nanmedian(val)
# get median
# if features['Spikecount'] == 0:
# for key, val in features.items():
# features[key] = 9999
# else:
# for key, val in features.items():
# features[key] = scipy.nanmedian(val)
return features
# ERROR FUNCTION
# The returned value must be iterable and of a length equal to the number of objectives (weights).
def error_function(external_current, indi, xl_mean=xl_mean, xl_var=xl_var):
# output variable
allerrors = []
# BRIAN: run model
stim_start, stim_duration = 500, 1000
voltage, time = brian_hh_model(1, 0, stim_start, stim_duration,
ENa=indi[0], EK=indi[1], El=indi[2], ECa=indi[3], gNa=indi[4], gK=indi[5],
gL=indi[6], gM=indi[7], gCa=indi[8])
# EFEL: extract features and get median
feature_names = ['Spikecount', 'peak_voltage', 'min_AHP_values', 'AP_begin_voltage', 'spike_half_width',
'voltage_base', 'steady_state_voltage_stimend',
'AP_begin_time', 'peak_time']
trace = {'T': time, 'V': voltage, 'stim_start': [stim_start], 'stim_end': [stim_start + stim_duration]}
features = efel.getFeatureValues([trace], feature_names)[0]
features = efel_stats(features)
# # ERROR FUNCTION: get error value
for feature, value in features.items():
# median for one external current (experimental data)
experiment_vals = xl_mean.loc[xl_mean['stimulus'] == external_current, feature].values[0]
error = float(abs(value - experiment_vals))
# my model can produce the same, less or more #spikes, peakvoltage, ...
if value == experiment_vals:
error = 0.
elif value < experiment_vals:
error = error / float(xl_var.loc[xl_var['stimulus'] == external_current, feature].values[0][0])
elif value > experiment_vals:
error = error / float(xl_var.loc[xl_var['stimulus'] == external_current, feature].values[0][1])
# append error value of this feature
allerrors.append(error)
return allerrors
# GENETIC OPERATORS
# Within DEAP there are two ways of using operators. We can 1) simply call a function from the tools module or
# 2) register it with its arguments in a toolbox, as we have already seen for our initialization methods. The second
# option allows us to easily switch between the operators if desired.
# see http://deap.readthedocs.io/en/master/api/tools.html#module-deap.tools
# Crossover
# Register the crossover function to the toolbox
# tools.cxOnePoint --> one point crossover
# tools.cxTwoPoint --> two-point crossover
toolbox.register("mate", tools.cxSimulatedBinary, eta=20.0)
# Mutation
# Register the mutation function to the toolbox
# tools.mutGaussian --> applies a gaussian mutation of mean mu and standard deviation sigma on the input individual. The indpb argument is the probability of each attribute to be mutated.
# tools.mutPolynomialBounded --> Polynomial mutation as implemented in original NSGA-II algorithm in C by Deb.
# toolbox.register("mutate", tools.mutPolynomialBounded, eta=20, low=low, up=up, indpb=1.0/num_params)
toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=2, indpb=0.9)
# Selection
# tools.sortNondominated(individuals, k, first_front_only=False) --> Sort the first k individuals into different nondomination levels using the “Fast Nondominated Sorting Approach” proposed by Deb et al., see [Deb2002].
# tools.sortLogNondominated(individuals, k, first_front_only=False) --> Sort individuals in pareto non-dominated fronts using the Generalized Reduced Run-Time Complexity Non-Dominated Sorting Algorithm presented by Fortin et al. (2013).
toolbox.register("select", tools.selNSGA2)
# Evaluation
# Register error function in the toolbox.
# The evaluation will be performed by calling the alias "evaluate".
# toolbox.register("evaluate", error_function, ext, model_vals)
# ALGORITHM
# Now that everything is ready, we can start to write our own algorithm. It is usually done in a main function.
if parallel_processing=="yes":
toolbox.register("map", futures.map)
def main():
# register statistics to the toolbox to maintain stats of the evolution
stats = tools.Statistics(lambda ind: ind.fitness.values)
stats.register("avg", numpy.mean, axis=0)
stats.register("std", numpy.std, axis=0)
stats.register("min", numpy.min, axis=0)
stats.register("max", numpy.max, axis=0)
logbook = tools.Logbook()
logbook.header = "gen", "evals", "min"
###
### NSGA-II algorithm as in "Deb 2002: A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II"
### https://github.com/DEAP/deap/blob/master/examples/ga/nsga2.py
###
# create random parent population pop
pop = toolbox.population(n=num_parents)
# register error function
toolbox.register("evaluate", error_function, ext)
# evaluate parent population
invalid_ind = [ind for ind in pop if not ind.fitness.valid]
fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
for ind, fit in zip(invalid_ind, fitnesses):
ind.fitness.values = fit
# assign crowding distance to the individuals, no actual selection is done
pop = toolbox.select(pop, len(pop))
# print logbook
record = stats.compile(pop)
logbook.record(gen=0, evals=len(invalid_ind), **record)
print(logbook.stream)
# print(record)
# Begin the generational process
for gen in range(1, num_gen):
# increase the variance in my population
offspring = tools.selTournamentDCD(pop, len(pop))
# I have no idea why
offspring = [toolbox.clone(ind) for ind in offspring]
# crossover
for ind1, ind2 in zip(offspring[::2], offspring[1::2]):
if random.random() <= prob_crossover:
toolbox.mate(ind1, ind2)
# mutation
toolbox.mutate(ind1)
toolbox.mutate(ind2)
del ind1.fitness.values, ind2.fitness.values
# Fitness
invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
for ind, fit in zip(invalid_ind, fitnesses):
ind.fitness.values = fit
# Select the next generation population
pop = toolbox.select(pop + offspring, num_parents)
record = stats.compile(pop)
logbook.record(gen=gen, evals=len(invalid_ind), **record)
print(logbook.stream)
# print(record)
# print("Final population hypervolume is %f" % hypervolume(pop, [11.0, 11.0]))
return pop, logbook
# create pareto front (all non-dominated individuals that ever lived)
# pareto = tools.ParetoFront()
if __name__ == "__main__":
pop, logbook = main()
print(logbook)
print("POPULATION", pop)
for indi in pop:
stim_start, stim_duration = 500, 1000
voltage, time = brian_hh_model(1, 1, stim_start, stim_duration,
ENa=indi[0], EK=indi[1], El=indi[2], ECa=indi[3], gNa=indi[4], gK=indi[5],
gL=indi[6], gM=indi[7], gCa=indi[8])
print("INDIVIDUAL ", indi)
machine_hh_model_v02.py:
# brian user guide
# http://brian2.readthedocs.io/en/2.0rc/user/index.html
from brian2 import *
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style()
def brian_hh_model(input_current, plotflag, stim_start, stim_duration, **parameter):
# C++ standalone mode
# At the beginning of the script, i.e. after the import statements, add:
# set_device('cpp_standalone', build_on_run=False)
# ********
# Handling units
# You can generate a physical quantity by multiplying a scalar or vector value with its physical unit:
# tau = 20*ms --> 20. ms
# rates = [10, 20, 30] * Hz --> [ 10. 20. 30.] Hz
# Most Brian functions will also complain about non-specified or incorrect units:
# G = NeuronGroup(10, 'dv/dt = -v/tau: volt', dt=0.5) --> "dt" has wrong dimensions, dimensions were (1) (s)
# Directly get the unitless value of a state variable by appending an underscore to the name
# print(rates_) --> [ 10. 20. 30.]
# ********
# Default parameters
# Set default parameters
default = {"ENa": 65, "EK": -90, "El": -70, "ECa": 120, "gNa": 0.05, "gK": 0.005, "gL": 1e-4, "gM": 8e-5,
"gCa": 1e-5}
# Use default parameter, if not defined as an input parameter
for key, val in default.items():
if key not in parameter:
parameter[key] = val
# Parameters
# Extract parameters that were given as an input (as a dictionary).
d = 96 * umetre # 79.8 umetre
area = d*d*3.141
Cm = 1*ufarad*cm**-2 * area # 1 ufarad
ENa = parameter["ENa"]*mV
EK = parameter["EK"]*mV
El = parameter["El"]*mV
ECa = parameter["ECa"]*mV
g_na = parameter["gNa"]*siemens*cm**-2 * area # 0.05 siemens
g_kd = parameter["gK"]*siemens*cm**-2 * area # 0.005 siemens
gl = parameter["gL"]*siemens*cm**-2 * area # 1e-4 siemens
gm = parameter["gM"]*siemens*cm**-2 * area # 8e-5 siemens
gCa = parameter["gCa"]*siemens*cm**-2 * area
tauMax = 4000 * ms
VT = -63*mV
# Equations
# Define both state variables and continuous-updates on these variables through differential equations. An Equation is
# a set of single lines in a string:
# 1. dx/dt = f : unit (differential equation)
# 2. x = f : unit (subexpression)
# 3. x : unit (parameter)
# There are three special units, "1" -> floating point number, "boolean" and "integer"
# Some special variables are defined: t, dt (time) and xi (white noise). Some other variable names (e.g. _pre) are
# forbidden. Flags -- dx/dt = f : unit (constant), parameter will not be changed during a run.
# The model
eqs = Equations('''
Im = gm * p * (v-EK) : amp
Ica = gCa * q*q * r * (v-ECa) : amp
dv/dt = (gl*(El-v) - g_na*(m*m*m)*h*(v-ENa) - g_kd*(n*n*n*n)*(v-EK) - Im - Ica + I)/Cm : volt
dm/dt = 0.32*(mV**-1)*(13.*mV-v+VT)/
(exp((13.*mV-v+VT)/(4.*mV))-1.)/ms*(1-m)-0.28*(mV**-1)*(v-VT-40.*mV)/
(exp((v-VT-40.*mV)/(5.*mV))-1.)/ms*m : 1
dn/dt = 0.032*(mV**-1)*(15.*mV-v+VT)/
(exp((15.*mV-v+VT)/(5.*mV))-1.)/ms*(1.-n)-.5*exp((10.*mV-v+VT)/(40.*mV))/ms*n : 1
dh/dt = 0.128*exp((17.*mV-v+VT)/(18.*mV))/ms*(1.-h)-4./(1+exp((40.*mV-v+VT)/(5.*mV)))/ms*h : 1
# K+ current
dp/dt = (1/(1+exp(-(v-VT+35.*mV)/(10.*mV))) - p) / (tauMax / (3.3 * exp((v - VT + 35.*mV)/(20.*mV) + exp(-(v - VT + 35.*mV)/(20.*mV))))) : 1
# Ca2+ current
dq/dt = 0.055*(mV**-1) * (-27.*mV - v) / (exp((-27.*mV - v) / (3.8*mV)) - 1.)/ms * (1.-q) - 0.94*exp((-75.*mV - v) / (17.*mV))/ms*q : 1
dr/dt = 0.000457 * exp((-13.*mV - v) / (50.*mV))/ms * (1.-r) - 0.0065 / (1. + exp((-15.*mV - v) / (28.*mV)))/ms*r : 1
I : amp
''')
# NeuronGroup
# The core of every simulation is a NeuronGroup, a group of neurons that share the same equations defining their
# properties. Minimum inputs are "number of neurons" and "model description in the form of equations". Threshold and
# refractoriness are only used for emitting spikes. To make a neuron non-excitable for a certain time period after a
# spike, the refractory keyword can be used.
# G = NeuronGroup(10, 'dv/dt = -v/tau : volt', threshold='v > -50*mV', reset='v = -70*mV', refractory=5*ms)
# Dictionary
# You can set multiple initial values at once using a dictionary and the Group.get_states() and Group.set_states()
# methods.
# initial_values = {'v': 1, 'tau': 10*ms}
# group.set_states(initial_values)
# group.v[:] --> 1)
# states = group.get_states()
# states['v'] --> 1)
group = NeuronGroup(1, eqs, method="exponential_euler")
group.v = El
group.I = 0*nA
# Recording
# Recording variables during a simulation is done with “monitor” objects. Specifically, spikes are recorded with
# SpikeMonitor, the time evolution of variables with StateMonitor and the firing rate of a population of neurons with
# PopulationRateMonitor. You can get all the stored values in a monitor with the Group.get_states().
# In this example, we record two variables v and u, and record from indices 0, 10 and 100 --> three neurons.
# G = NeuronGroup(...)
# M = StateMonitor(G, ('v', 'u'), record=[0, 10, 100])
# M.v[1] will return the values for the second recorded neuron which is the neuron with the index 10.
M = StateMonitor(group, 'v', record=0)
# Run the model
# The command run(100*ms) runs the simulation for 100 ms.
run(stim_start*ms)
group.I[0] = input_current*nA # current injection at one end
run(stim_duration*ms)
group.I = 0*nA
run(stim_start*ms)
# C++ standalone mode
# After the last run() call, call device.build() explicitly:
# device.build(directory='output', compile=True, run=True, debug=False)
# Timing
# profiling_summary(show=5) -- show the 5 objects that took the longest
# profiling_summary(show=2)
# Output
time = M.t/ms
voltage = M.v[0]/mV
# plot output
if plotflag:
plt.plot(time, voltage)
xlabel('Time [ms]')
ylabel('Membrane potential [mV]')
plt.show()
# For multiple calls
# device.reinit()
# device.activate()
return (voltage, time)
#brian_hh_model(1, 1, 500, 1000, ENa=65, EK=-90, El=-70, ECa=120, gNa=0.05, gK=0.005, gL=1e-4, gM=8e-5, gCa=1e-5)
Thanks a lot in advance!!
I am currently using scikit-learn for text classification on the 20ng dataset. I want to calculate the information gain for a vectorized dataset. It has been suggested to me that this can be accomplished using mutual_info_classif from sklearn. However, this method is really slow, so I was trying to implement information gain myself based on this post.
I came up with the following solution:
from scipy.stats import entropy
import numpy as np
def information_gain(X, y):
def _entropy(labels):
counts = np.bincount(labels)
return entropy(counts, base=None)
def _ig(x, y):
# indices where x is set/not set
x_set = np.nonzero(x)[1]
x_not_set = np.delete(np.arange(x.shape[1]), x_set)
h_x_set = _entropy(y[x_set])
h_x_not_set = _entropy(y[x_not_set])
return entropy_full - (((len(x_set) / f_size) * h_x_set)
+ ((len(x_not_set) / f_size) * h_x_not_set))
entropy_full = _entropy(y)
f_size = float(X.shape[0])
scores = np.array([_ig(x, y) for x in X.T])
return scores
Using a very small dataset, most scores from sklearn and my implementation are equal. However, sklearn seems to take frequencies into account, which my algorithm clearly doesn't. For example
from time import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
categories=categories)
X, y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
max_features=100,
stop_words='english')
X_vec = cv.fit_transform(X)
t0 = time()
res_sk = mutual_info_classif(X_vec, y, discrete_features=True)
print("Time passed for sklearn method: %3f" % (time()-t0))
t0 = time()
res_ig = information_gain(X_vec, y)
print("Time passed for ig: %3f" % (time()-t0))
for name, res_mi, res_ig in zip(cv.get_feature_names(), res_sk, res_ig):
print("%s: mi=%f, ig=%f" % (name, res_mi, res_ig))
sample output:
center: mi=0.011824, ig=0.003548
christian: mi=0.128629, ig=0.127122
color: mi=0.028413, ig=0.026397
com: mi=0.041184, ig=0.030458
computer: mi=0.020590, ig=0.012327
cs: mi=0.007291, ig=0.001574
data: mi=0.020734, ig=0.008986
did: mi=0.035613, ig=0.024604
different: mi=0.011432, ig=0.005492
distribution: mi=0.007175, ig=0.004675
does: mi=0.019564, ig=0.006162
don: mi=0.024000, ig=0.017605
earth: mi=0.039409, ig=0.032981
edu: mi=0.023659, ig=0.008442
file: mi=0.048056, ig=0.045746
files: mi=0.041367, ig=0.037860
ftp: mi=0.031302, ig=0.026949
gif: mi=0.028128, ig=0.023744
god: mi=0.122525, ig=0.113637
good: mi=0.016181, ig=0.008511
gov: mi=0.053547, ig=0.048207
So I was wondering whether my implementation is wrong, or whether it is correct but scikit-learn uses a different variation of the mutual information algorithm.
A little late with my answer, but you should look at Orange's implementation. Within their app it is used as a behind-the-scenes processor to help inform the dynamic model parameter building process.
The implementation itself looks fairly straightforward and could most likely be ported out. The entropy calculation comes first, in the section starting at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L233:
def _entropy(dist):
"""Entropy of class-distribution matrix"""
p = dist / np.sum(dist, axis=0)
pc = np.clip(p, 1e-15, 1)
return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))
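For intuition: called on a plain vector of class counts, this reduces to the ordinary class entropy. A tiny sketch with made-up counts:
import numpy as np
# class counts 9 and 5 -> roughly 0.940 bits
print(_entropy(np.array([9., 5.])))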
Then the second portion:
https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L305
class GainRatio(ClassificationScorer):
"""
Information gain ratio is the ratio between information gain and
the entropy of the feature's
value distribution. The score was introduced in [Quinlan1986]_
to alleviate overestimation for multi-valued features. See `Wikipedia entry on gain ratio
<http://en.wikipedia.org/wiki/Information_gain_ratio>`_.
.. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.
"""
def from_contingency(self, cont, nan_adjustment):
h_class = _entropy(np.sum(cont, axis=1))
h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
h_attribute = _entropy(np.sum(cont, axis=0))
if h_attribute == 0:
h_attribute = 1
return nan_adjustment * (h_class - h_residual) / h_attribute
The actual scoring process happens at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L218
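If you want to apply this to a CountVectorizer output like in your example, you mainly need to build the class-by-feature-value contingency table yourself. A rough sketch under the assumption that you score term presence/absence rather than raw counts; information_gain_orange_style is a name I am introducing here, and it reuses the _entropy function quoted above:
import numpy as np

def information_gain_orange_style(X, y):
    # H(class) - H(class | term present/absent), Orange-style
    Xb = (np.asarray(X.todense()) > 0).astype(int)   # binarize the document-term matrix
    classes = np.unique(y)
    class_counts = np.array([np.sum(y == c) for c in classes], dtype=float)
    h_class = _entropy(class_counts)
    scores = []
    for j in range(Xb.shape[1]):
        # contingency: rows = classes, columns = (term absent, term present)
        cont = np.array([[np.sum((y == c) & (Xb[:, j] == 0)),
                          np.sum((y == c) & (Xb[:, j] == 1))] for c in classes], dtype=float)
        # drop empty columns before computing the conditional entropy, as GainRatio does
        h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
        scores.append(h_class - h_residual)
    return np.array(scores)

# e.g. res_orange = information_gain_orange_style(X_vec, y)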
I wish to create a sklearn GMM object with a predefined set of means, weights, and covariances (on a grid).
I managed to do it:
from functools import reduce  # reduce is not a builtin on Python 3
from sklearn.mixture import GaussianMixture
import numpy as np
def get_grid_gmm(subdivisions=[10,10,10], variance=0.05 ):
n_gaussians = reduce(lambda x, y: x*y,subdivisions)
step = [ 1.0/(2*subdivisions[0]), 1.0/(2*subdivisions[1]), 1.0/(2*subdivisions[2])]
means = np.mgrid[ step[0] : 1.0-step[0]: complex(0,subdivisions[0]),
step[1] : 1.0-step[1]: complex(0,subdivisions[1]),
step[2] : 1.0-step[2]: complex(0,subdivisions[2])]
means = np.reshape(means,[-1,3])
covariances = variance*np.ones_like(means)
weights = (1.0/n_gaussians)*np.ones(n_gaussians)
gmm = GaussianMixture(n_components=n_gaussians, covariance_type='spherical' )
gmm.weights_ = weights
gmm.covariances_ = covariances
gmm.means_ = means
return gmm
def main():
xx = np.random.rand(100,3)
gmm = get_grid_gmm()
y= gmm.predict_proba(xx)
if __name__ == "__main__":
main()
The problem is that gmm.predict_proba(), which I need to use later on, does not work on the object built this way.
How can I overcome this?
UPDATE: I updated the code to be a complete example that shows the error.
UPDATE 2: I updated the code according to the comments and answers.
from functools import reduce  # reduce is not a builtin on Python 3
from sklearn.mixture import GaussianMixture
import numpy as np
def get_grid_gmm(subdivisions=[10,10,10], variance=0.05 ):
n_gaussians = reduce(lambda x, y: x*y,subdivisions)
step = [ 1.0/(2*subdivisions[0]), 1.0/(2*subdivisions[1]), 1.0/(2*subdivisions[2])]
means = np.mgrid[ step[0] : 1.0-step[0]: complex(0,subdivisions[0]),
step[1] : 1.0-step[1]: complex(0,subdivisions[1]),
step[2] : 1.0-step[2]: complex(0,subdivisions[2])]
means = np.reshape(means,[3,-1])
covariances = variance*np.ones(n_gaussians)
cov_type = 'spherical'
weights = (1.0/n_gaussians)*np.ones(n_gaussians)
gmm = GaussianMixture(n_components=n_gaussians, covariance_type=cov_type )
gmm.weights_ = weights
gmm.covariances_ = covariances
gmm.means_ = means
from sklearn.mixture.gaussian_mixture import _compute_precision_cholesky
gmm.precisions_cholesky_ = _compute_precision_cholesky(covariances, cov_type)
gmm.precisions_ = gmm.precisions_cholesky_ ** 2
return gmm
def main():
xx = np.random.rand(100,3)
gmm = get_grid_gmm()
_, y = gmm._estimate_log_prob(xx)
y = np.exp(y)
if __name__ == "__main__":
main()
No more errors but _estimate_log_prob and predict_proba do not produce the same result for a fitted GMM. Why could that be?
Since you don't train the model but just use it for estimation, you don't need the fitted object; you can use the same function they use under the hood. You could try _estimate_log_gaussian_prob. That is what they do internally, I think.
Have a look at the source, in particular at the base class:
https://github.com/scikit-learn/scikit-learn/blob/ab93d657eb4268ac20c4db01c48065b5a1bfe80d/sklearn/mixture/base.py#L342
which calls the specific method, which in turn calls a function:
https://github.com/scikit-learn/scikit-learn/blob/ab93d657eb4268ac20c4db01c48065b5a1bfe80d/sklearn/mixture/gaussian_mixture.py#L671
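A rough sketch of how that could be used with the grid GMM from the question. These are private scikit-learn helpers, so names and import paths may change between versions, and grid_gmm_predict_proba is just an illustrative name. It also explains the difference you observed: predict_proba additionally adds the log mixture weights and normalizes across components, which the raw per-component log probabilities do not.
import numpy as np
from scipy.special import logsumexp
from sklearn.mixture.gaussian_mixture import (_compute_precision_cholesky,
                                              _estimate_log_gaussian_prob)

def grid_gmm_predict_proba(X, means, covariances, weights, cov_type='spherical'):
    # per-component log densities, as GaussianMixture computes them internally
    prec_chol = _compute_precision_cholesky(covariances, cov_type)
    log_prob = _estimate_log_gaussian_prob(X, means, prec_chol, cov_type)
    # add log mixture weights and normalize, mirroring predict_proba
    weighted = log_prob + np.log(weights)
    log_norm = logsumexp(weighted, axis=1)
    return np.exp(weighted - log_norm[:, np.newaxis])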