I am trying to write a function that properly calculates the entropy of a given dataset. However, I am getting very weird entropy values.
My understanding is that entropy values must fall between 0 and 1, yet I am consistently getting values above 2.
Note: I must use log base 2 for this.
Can someone explain why I am getting these incorrect entropy results?
The dataset I am testing with is the ecoli dataset from the UCI Machine Learning Repository.
import numpy
import math
#################### DATA HANDLING LIBRARY ####################
def csv_to_array(file):
    # Open the file, and load it in delimiting on the ',' for a comma separated value file
    data = open(file, 'r')
    data = numpy.loadtxt(data, delimiter=',')
    # Loop through the data in the array
    for index in range(len(data)):
        # Utilize a try catch to try and convert to float, if it can't convert to float, converts to 0
        try:
            data[index] = [float(x) for x in data[index]]
        except Exception:
            data[index] = 0
        except ValueError:
            data[index] = 0
    # Return the now type-formatted data
    return data
# Function that utilizes the numpy library to randomize the dataset.
def randomize_data(csv):
    # numpy.random.shuffle shuffles the rows in place and returns None,
    # so shuffle first and then return the (now shuffled) array
    numpy.random.shuffle(csv)
    return csv
# Function to split the data into test, training set, and validation sets
def split_data(csv):
    # Call the randomize data function
    randomize_data(csv)
    # Grab the number of rows and calculate where to split
    num_rows = csv.shape[0]
    validation_split = int(num_rows * 0.10)
    training_split = int(num_rows * 0.72)
    testing_split = int(num_rows * 0.18)
    # Validation set as the first 10% of the data
    validation_set = csv[:validation_split]
    # Training set as the next 72%
    training_set = csv[validation_split:training_split + validation_split]
    # Testing set as the last 18%
    testing_set = csv[training_split + validation_split:]
    # Split the data into classes vs actual data
    training_cols = training_set.shape[1]
    testing_cols = testing_set.shape[1]
    validation_cols = validation_set.shape[1]
    training_classes = training_set[:, training_cols - 1]
    testing_classes = testing_set[:, testing_cols - 1]
    validation_classes = validation_set[:, validation_cols - 1]
    # Take the sets and remove the last (classification) column
    # (slice columns, not rows, so the class column is actually dropped)
    training_set = training_set[:, :-1]
    testing_set = testing_set[:, :-1]
    validation_set = validation_set[:, :-1]
    # Return the datasets
    return testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes
#################### DATA HANDLING LIBRARY ####################
# This function returns the list of classes, and their associated weights (i.e. distributions)
# for a given dataset
def class_distribution(dataset):
    # Ensure the dataset is a numpy array
    dataset = numpy.asarray(dataset)
    # Collect # of total rows and columns, using numpy
    num_total_rows = dataset.shape[0]
    num_columns = dataset.shape[1]
    # Create a numpy array of just the classes
    classes = dataset[:, num_columns - 1]
    # Use numpy.unique to remove duplicates
    classes = numpy.unique(classes)
    # Create an empty array for the class weights
    class_weights = []
    # Loop through the classes one by one
    for aclass in classes:
        # Create storage variables
        total = 0
        weight = 0
        # Now loop through the dataset
        for row in dataset:
            # If the class of the row is equal to the current class you are evaluating, increase the total
            if numpy.array_equal(aclass, row[-1]):
                total = total + 1
            # If not, continue
            else:
                continue
        # Divide the # of occurrences by total rows
        weight = float((total / num_total_rows))
        # Add that weight to the list of class weights
        class_weights.append(weight)
    # Turn the weights into a numpy array
    class_weights = numpy.asarray(class_weights)
    # Return the classes and their weights
    return classes, class_weights
# This function returns the entropy for a given dataset
# Can be used across an entire csv, or just for a column of data (feature)
def get_entropy(dataset):
    # Set initial entropy
    entropy = 0.0
    # Determine the classes and their frequencies (weights) of the dataset
    classes, class_freq = class_distribution(dataset)
    # Sort descending so the most occurring class is tested first
    # (numpy.sort returns a copy, so it has to be reassigned)
    class_freq = numpy.sort(class_freq)[::-1]
    # Determine the max entropy for the dataset
    max_entropy = math.log(len(classes), 2)
    print("MAX ENTROPY FOR THIS DATASET: ", max_entropy)
    # Loop through the frequencies and use given formula to calculate entropy
    # For...Each simulates the sequence operator
    for freq in class_freq:
        entropy += float(-freq * math.log(freq, 2))
    # Return the entropy value
    return entropy
def main():
    ecol = csv_to_array('ecoli.csv')
    testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes = split_data(ecol)
    entropy = get_entropy(ecol)
    print(entropy)

main()
The following function was used to calculate Entropy:
# Function to return Shannon's Entropy
def entropy(attributes, dataset, targetAttr):
    freq = {}
    entropy = 0.0
    index = 0
    for item in attributes:
        if (targetAttr == item):
            break
        else:
            index = index + 1
    index = index - 1
    for item in dataset:
        if ((item[index]) in freq):
            # Increase the frequency count
            freq[item[index]] += 1.0
        else:
            # Initialize it by setting it to 1
            freq[item[index]] = 1.0
    for freq in freq.values():
        entropy = entropy + (-freq / len(dataset)) * math.log(freq / len(dataset), 2)
    return entropy
As @MattTimmermans had indicated, the value of entropy is contingent on the number of classes. For strictly 2 classes it is contained in the range 0 to 1 (inclusive). However, for more than 2 classes (which is what was being tested here), the maximum possible entropy is log2(k) for k classes, so values above 1 are expected; the calculation itself (converted to Pythonic code above) stays the same. This post here explains those mathematics and calculations a bit more in detail.
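For example, a quick check of the upper bound (my own snippet, assuming a uniform distribution over k classes; the UCI ecoli data has 8 class labels):
import math

# Maximum Shannon entropy (log base 2) for a uniform distribution over k classes
# is log2(k): 1 bit for 2 classes, but 3 bits for 8 classes, so values above 2
# are perfectly plausible for this dataset.
for k in (2, 3, 8):
    uniform = [1.0 / k] * k
    h = -sum(p * math.log(p, 2) for p in uniform)
    print(k, "classes -> max entropy =", h)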
I'm new to PyCharm and so far my impression of the debugger is that it's marvelous! However, it behaves strangely in my code and I cannot figure out what is going wrong.
If I set a breakpoint on these lines of code and then press "step over" or "step into my code", it runs to the end and ignores all other upcoming breakpoints. Any idea what I am doing wrong? Breakpoints before that line work perfectly fine.
for ind, fit in zip(pop, fitnesses):
    ind.fitness.values = fit
My code:
You need to pip install deap, efel and brian2 for the code to run.
# DEAP
# https://github.com/DEAP/deap/tree/54b83e2cc7a73be7657cb64b01033a31584b891d
# import array
import matplotlib.pyplot as plt
import pandas as pd
from scipy.io import loadmat
import random, numpy, os, efel, scipy, math, time, array, json
from deap import algorithms, base, creator, tools, benchmarks
from deap.benchmarks.tools import diversity, convergence # , hypervolume
from machine_hh_model_v02 import brian_hh_model
# parallel processing
# from scoop import futures
parallel_processing = "no"
# Starting values
channels = {"ENa": 65,
            "EK": -90,
            "El": -70,
            "ECa": 120,
            "gNa": 0.05,
            "gK": 0.005,
            "gL": 1e-4,
            "gM": 8e-5,
            "gCa": 1e-5}
# Boundaries
bounds = {"ENa": [50, 70],
          "EK": [-100, -80],
          "El": [-50, -100],
          "ECa": [100, 120],
          "gNa": [0, 1],
          "gK": [0, 1],
          "gL": [0, 1],
          "gM": [0, 1],
          "gCa": [0, 1]}
low, up = [x[0] for x in bounds.values()], [x[1] for x in bounds.values()]
# Set parameters
ext = 2.5 # external current stimulation [nA]
num_gen = 2 # number of generations
num_parents = 40 # number of parents
num_params = len(channels) # number of parameters to optimize
prob_crossover = 0.9 # probability that crossover takes place
# How to generate individuals
def initIndividual(container, sigma):
    return container(random.gauss(x, sigma) for x in channels.values())
# CREATOR
# http://deap.readthedocs.io/en/master/tutorials/basic/part1.html
# The create() function takes at least two arguments, a name for the newly created class and a base class. Any
# subsequent argument becomes an attribute of the class. Neg. weights relate to minizing, pos. weight to maximizing
# problems.
# -- define fitness problem (which params are min./max. problems with which weight)
creator.create("FitnessMulti", base.Fitness, weights=tuple(numpy.ones(num_params) * -1))
# Next we will create the class Individual, which will inherit the class list and contain our previously defined
# FitnessMulti class in its fitness attribute. Note that upon creation all our defined classes will be part of the
# creator container and can be called directly.
# -- associate fitness problem to individuals, that are going to be created
creator.create("Individual", list, fitness=creator.FitnessMulti)
# TOOLBOX
# http://deap.readthedocs.io/en/master/examples/ga_onemax.html
# http://deap.readthedocs.io/en/master/api/tools.html#module-deap.tools
# All the objects we will use on our way, an individual, the population, as well as all functions, operators, and
# arguments will be stored in a DEAP container called Toolbox. It contains two methods for adding and removing content,
# register() and unregister().
toolbox = base.Toolbox()
# The newly introduced register() method takes at least two arguments: an alias and a function. Toolbox.attr_bool(),
# when called, will draw a random integer between -100 and 100. Toolbox.attr_float(), when called, will draw a random
# floating point number.
# -- how to generate values for each individual
# toolbox.register("attr_float", random.uniform, -100, 100)
# toolbox.register("attr_float", lambda: [random.gauss(x, 5) for x in channels.values()])
# Our individuals will be generated using the function initRepeat(). Its first argument is a container class, the
# Individual one we defined in the previous section. This container will be filled using the method attr_float(),
# provided as second argument, and will contain 10 integers, as specified using the third argument. When called, the
# individual() method will thus return an individual initialized with what would be returned by calling the attr_float()
# method 100 times.
# -- how and how many individuals to create
# toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_float, num_params)
toolbox.register("individual", initIndividual, creator.Individual, sigma=1)
# Finally, the population() method uses the same paradigm.
# -- how and how many parents to create
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
# LOAD EXPERIMENTAL DATA
# set path
# mainpath = r'C:\OwnCloud\Masterarbeit' # mainpath
# pathfit = os.path.join(mainpath, r'fitness_params') # fitness files
# os.chdir(os.path.join(mainpath, pathfit)) # change directory
# load fitness file
# xl = pd.ExcelFile('fitness.xlsx') # load excel file
# xl_mean = xl.parse("median") # load sheet 'mean' containing mean/median values
# xl_var = xl.parse("quartile-to-median distance") # load sheet 'std containing std/quantiles values
xl_mean = pd.read_json("median")
xl_var = pd.read_json("distance")
########################
############ SOMETHING IS WRONG HERE
########################
# EFEL: median
def efel_stats(features):
    # get latency of first spike
    if features['peak_time'].any():
        features['first_peak_time'] = features['peak_time'][0]
        del features['peak_time']
    # get median
    for key, val in features.items():
        if val is None or numpy.isnan(val) or not val:
            features[key] = 9999
        else:
            features[key] = scipy.nanmedian(val)
    # get median
    # if features['Spikecount'] == 0:
    #     for key, val in features.items():
    #         features[key] = 9999
    # else:
    #     for key, val in features.items():
    #         features[key] = scipy.nanmedian(val)
    return features
# ERROR FUNCTION
# The returned value must be iterable and of a length equal to the number of objectives (weights).
def error_function(external_current, indi, xl_mean=xl_mean, xl_var=xl_var):
    # output variable
    allerrors = []

    # BRIAN: run model
    stim_start, stim_duration = 500, 1000
    voltage, time = brian_hh_model(1, 0, stim_start, stim_duration,
                                   ENa=indi[0], EK=indi[1], El=indi[2], ECa=indi[3], gNa=indi[4], gK=indi[5],
                                   gL=indi[6], gM=indi[7], gCa=indi[8])

    # EFEL: extract features and get median
    feature_names = ['Spikecount', 'peak_voltage', 'min_AHP_values', 'AP_begin_voltage', 'spike_half_width',
                     'voltage_base', 'steady_state_voltage_stimend',
                     'AP_begin_time', 'peak_time']
    trace = {'T': time, 'V': voltage, 'stim_start': [stim_start], 'stim_end': [stim_start + stim_duration]}
    features = efel.getFeatureValues([trace], feature_names)[0]
    features = efel_stats(features)

    # ERROR FUNCTION: get error value
    for feature, value in features.items():
        # median for one external current (experimental data)
        experiment_vals = xl_mean.loc[xl_mean['stimulus'] == external_current, feature].values[0]
        error = float(abs(value - experiment_vals))
        # my model can produce the same, fewer or more spikes, peak voltage, ...
        if value == experiment_vals:
            error = 0.
        elif value < experiment_vals:
            error = error / float(xl_var.loc[xl_var['stimulus'] == external_current, feature].values[0][0])
        elif value > experiment_vals:
            error = error / float(xl_var.loc[xl_var['stimulus'] == external_current, feature].values[0][1])
        # append error value of this feature
        allerrors.append(error)
    return allerrors
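# Sanity check (my own commented-out sketch, assuming the Brian model and the
# experimental-data frames load): DEAP requires that the evaluated fitness has
# exactly as many values as FitnessMulti has weights, so the number of eFEL
# features above must match num_params.
# test_indi = list(channels.values())
# errors = error_function(ext, test_indi)
# assert len(errors) == len(creator.FitnessMulti.weights)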
# GENETIC OPERATORS
# Within DEAP there are two ways of using operators. We can 1) simply call a function from the tools module or
# 2) register it with its arguments in a toolbox, as we have already seen for our initialization methods. The second
# option allows us to to easily switch between the operators if desired.
# see http://deap.readthedocs.io/en/master/api/tools.html#module-deap.tools
# Crossover
# Register the crossover function to the toolbox
# tools.cxOnePoint --> one point crossover
# tools.cxTwoPoint --> two-point crossover
toolbox.register("mate", tools.cxSimulatedBinary, eta=20.0)
# Mutation
# Register the mutation function to the toolbox
# tools.mutGaussian --> applies a gaussian mutation of mean mu and standard deviation sigma on the input individual. The indpb argument is the probability of each attribute to be mutated.
# tools.mutPolynomialBounded --> Polynomial mutation as implemented in original NSGA-II algorithm in C by Deb.
# toolbox.register("mutate", tools.mutPolynomialBounded, eta=20, low=low, up=up, indpb=1.0/num_params)
toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=2, indpb=0.9)
# Selection
# tools.sortNondominated(individuals, k, first_front_only=False) --> Sort the first k individuals into different nondomination levels using the “Fast Nondominated Sorting Approach” proposed by Deb et al., see [Deb2002].
# tools.sortLogNondominated(individuals, k, first_front_only=False) --> Sort individuals in pareto non-dominated fronts using the Generalized Reduced Run-Time Complexity Non-Dominated Sorting Algorithm presented by Fortin et al. (2013).
toolbox.register("select", tools.selNSGA2)
# Evaluation
# Register error function in the toolbox.
# The evaluation will be performed by calling the alias "evaluate".
# toolbox.register("evaluate", error_function, ext, model_vals)
# ALGORITHM
# Now that everything is ready, we can start to write our own algorithm. It is usually done in a main function.
if parallel_processing == "yes":
    toolbox.register("map", futures.map)
def main():
    # register statistics to the toolbox to maintain stats of the evolution
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("avg", numpy.mean, axis=0)
    stats.register("std", numpy.std, axis=0)
    stats.register("min", numpy.min, axis=0)
    stats.register("max", numpy.max, axis=0)
    logbook = tools.Logbook()
    logbook.header = "gen", "evals", "min"

    ###
    ### NSGA-II algorithm as in "Deb 2002: A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II"
    ### https://github.com/DEAP/deap/blob/master/examples/ga/nsga2.py
    ###

    # create random parent population pop
    pop = toolbox.population(n=num_parents)
    # register error function
    toolbox.register("evaluate", error_function, ext)

    # evaluate parent population
    invalid_ind = [ind for ind in pop if not ind.fitness.valid]
    fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
    for ind, fit in zip(invalid_ind, fitnesses):
        ind.fitness.values = fit

    # assign crowding distance to the individuals, no actual selection is done
    pop = toolbox.select(pop, len(pop))

    # print logbook
    record = stats.compile(pop)
    logbook.record(gen=0, evals=len(invalid_ind), **record)
    print(logbook.stream)
    # print(record)

    # Begin the generational process
    for gen in range(1, num_gen):
        # increase the variance in my population
        offspring = tools.selTournamentDCD(pop, len(pop))
        # I have no idea why
        offspring = [toolbox.clone(ind) for ind in offspring]

        # crossover
        for ind1, ind2 in zip(offspring[::2], offspring[1::2]):
            if random.random() <= prob_crossover:
                toolbox.mate(ind1, ind2)
            # mutation
            toolbox.mutate(ind1)
            toolbox.mutate(ind2)
            del ind1.fitness.values, ind2.fitness.values

        # Fitness
        invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
        fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
        for ind, fit in zip(invalid_ind, fitnesses):
            ind.fitness.values = fit

        # Select the next generation population
        pop = toolbox.select(pop + offspring, num_parents)
        record = stats.compile(pop)
        logbook.record(gen=gen, evals=len(invalid_ind), **record)
        print(logbook.stream)
        # print(record)

    # print("Final population hypervolume is %f" % hypervolume(pop, [11.0, 11.0]))
    return pop, logbook
# create pareto front (all non-dominated individuals that ever lived)
# pareto = tools.ParetoFront()
if __name__ == "__main__":
    pop, logbook = main()
    print(logbook)
    print("POPULATION", pop)

    for indi in pop:
        stim_start, stim_duration = 500, 1000
        voltage, time = brian_hh_model(1, 1, stim_start, stim_duration,
                                       ENa=indi[0], EK=indi[1], El=indi[2], ECa=indi[3], gNa=indi[4], gK=indi[5],
                                       gL=indi[6], gM=indi[7], gCa=indi[8])
        print("INDIVIDUAL ", indi)
machine_hh_model_v02.py:
# brian user guide
# http://brian2.readthedocs.io/en/2.0rc/user/index.html
from brian2 import *
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style()
def brian_hh_model(input_current, plotflag, stim_start, stim_duration, **parameter):
    # C++ standalone mode
    # At the beginning of the script, i.e. after the import statements, add:
    # set_device('cpp_standalone', build_on_run=False)

    # ********
    # Handling units
    # You can generate a physical quantity by multiplying a scalar or vector value with its physical unit:
    # tau = 20*ms --> 20. ms
    # rates = [10, 20, 30] * Hz --> [ 10. 20. 30.] Hz
    # Most Brian functions will also complain about non-specified or incorrect units:
    # G = NeuronGroup(10, 'dv/dt = -v/tau: volt', dt=0.5) --> "dt" has wrong dimensions, dimensions were (1) (s)
    # Directly get the unitless value of a state variable by appending an underscore to the name
    # print(rates_) --> [ 10. 20. 30.]
    # ********

    # Default parameters
    # Set default parameters
    default = {"ENa": 65, "EK": -90, "El": -70, "ECa": 120, "gNa": 0.05, "gK": 0.005, "gL": 1e-4, "gM": 8e-5,
               "gCa": 1e-5}
    # Use default parameter, if not defined as an input parameter
    for key, val in default.items():
        if key not in parameter:
            parameter[key] = val

    # Parameters
    # Extract parameters that were given as an input (as a dictionary).
    d = 96 * umetre  # 79.8 umetre
    area = d*d*3.141
    Cm = 1*ufarad*cm**-2 * area  # 1 ufarad
    ENa = parameter["ENa"]*mV
    EK = parameter["EK"]*mV
    El = parameter["El"]*mV
    ECa = parameter["ECa"]*mV
    g_na = parameter["gNa"]*siemens*cm**-2 * area  # 0.05 siemens
    g_kd = parameter["gK"]*siemens*cm**-2 * area  # 0.005 siemens
    gl = parameter["gL"]*siemens*cm**-2 * area  # 1e-4 siemens
    gm = parameter["gM"]*siemens*cm**-2 * area  # 8e-5 siemens
    gCa = parameter["gCa"]*siemens*cm**-2 * area
    tauMax = 4000 * ms
    VT = -63*mV
    # Equations
    # Define both state variables and continuous-updates on these variables through differential equations. An Equation is
    # a set of single lines in a string:
    # 1. dx/dt = f : unit (differential equation)
    # 2. x = f : unit (subexpression)
    # 3. x : unit (parameter)
    # There are three special units, "1" -> floating point number, "boolean" and "integer"
    # Some special variables are defined: t, dt (time) and xi (white noise). Some other variable names (e.g. _pre) are
    # forbidden. Flags -- dx/dt = f : unit (constant), parameter will not be changed during a run.

    # The model
    eqs = Equations('''
    Im = gm * p * (v-EK) : amp
    Ica = gCa * q*q * r * (v-ECa) : amp
    dv/dt = (gl*(El-v) - g_na*(m*m*m)*h*(v-ENa) - g_kd*(n*n*n*n)*(v-EK) - Im - Ica + I)/Cm : volt
    dm/dt = 0.32*(mV**-1)*(13.*mV-v+VT)/
        (exp((13.*mV-v+VT)/(4.*mV))-1.)/ms*(1-m)-0.28*(mV**-1)*(v-VT-40.*mV)/
        (exp((v-VT-40.*mV)/(5.*mV))-1.)/ms*m : 1
    dn/dt = 0.032*(mV**-1)*(15.*mV-v+VT)/
        (exp((15.*mV-v+VT)/(5.*mV))-1.)/ms*(1.-n)-.5*exp((10.*mV-v+VT)/(40.*mV))/ms*n : 1
    dh/dt = 0.128*exp((17.*mV-v+VT)/(18.*mV))/ms*(1.-h)-4./(1+exp((40.*mV-v+VT)/(5.*mV)))/ms*h : 1
    # K+ current
    dp/dt = (1/(1+exp(-(v-VT+35.*mV)/(10.*mV))) - p) / (tauMax / (3.3 * exp((v - VT + 35.*mV)/(20.*mV) + exp(-(v - VT + 35.*mV)/(20.*mV))))) : 1
    # Ca2+ current
    dq/dt = 0.055*(mV**-1) * (-27.*mV - v) / (exp((-27.*mV - v) / (3.8*mV)) - 1.)/ms * (1.-q) - 0.94*exp((-75.*mV - v) / (17.*mV))/ms*q : 1
    dr/dt = 0.000457 * exp((-13.*mV - v) / (50.*mV))/ms * (1.-r) - 0.0065 / (1. + exp((-15.*mV - v) / (28.*mV)))/ms*r : 1
    I : amp
    ''')
    # NeuronGroup
    # The core of every simulation is a NeuronGroup, a group of neurons that share the same equations defining their
    # properties. Minimum inputs are "number of neurons" and "model description in the form of equations". Threshold and
    # refractoriness are only used for emitting spikes. To make a neuron non-excitable for a certain time period after a
    # spike, the refractory keyword can be used.
    # G = NeuronGroup(10, 'dv/dt = -v/tau : volt', threshold='v > -50*mV', reset='v = -70*mV', refractory=5*ms)
    # Dictionary
    # You can set multiple initial values at once using a dictionary and the Group.get_states() and Group.set_states()
    # methods.
    # initial_values = {'v': 1, 'tau': 10*ms}
    # group.set_states(initial_values)
    # group.v[:] --> 1)
    # states = group.get_states()
    # states['v'] --> 1)
    group = NeuronGroup(1, eqs, method="exponential_euler")
    group.v = El
    group.I = 0*nA

    # Recording
    # Recording variables during a simulation is done with “monitor” objects. Specifically, spikes are recorded with
    # SpikeMonitor, the time evolution of variables with StateMonitor and the firing rate of a population of neurons with
    # PopulationRateMonitor. You can get all the stored values in a monitor with the Group.get_states().
    # In this example, we record two variables v and u, and record from indices 0, 10 and 100 --> three neurons.
    # G = NeuronGroup(...)
    # M = StateMonitor(G, ('v', 'u'), record=[0, 10, 100])
    # M.v[1] will return the values for the second recorded neuron which is the neuron with the index 10.
    M = StateMonitor(group, 'v', record=0)

    # Run the model
    # The command run(100*ms) runs the simulation for 100 ms.
    run(stim_start*ms)
    group.I[0] = input_current*nA  # current injection at one end
    run(stim_duration*ms)
    group.I = 0*nA
    run(stim_start*ms)

    # C++ standalone mode
    # After the last run() call, call device.build() explicitly:
    # device.build(directory='output', compile=True, run=True, debug=False)

    # Timing
    # profiling_summary(show=5) -- show the 5 objects that took the longest
    # profiling_summary(show=2)

    # Output
    time = M.t/ms
    voltage = M.v[0]/mV

    # plot output
    if plotflag:
        plt.plot(time, voltage)
        plt.xlabel('Time [ms]')
        plt.ylabel('Membrane potential [mV]')
        plt.show()

    # For multiple calls
    # device.reinit()
    # device.activate()

    return (voltage, time)
#brian_hh_model(1, 1, 500, 1000, ENa=65, EK=-90, El=-70, ECa=120, gNa=0.05, gK=0.005, gL=1e-4, gM=8e-5, gCa=1e-5)
Thanks a lot in advance!!
I am currently using scikit-learn for text classification on the 20 newsgroups dataset. I want to calculate the information gain for a vectorized dataset. It has been suggested to me that this can be accomplished using mutual_info_classif from sklearn. However, this method is really slow, so I was trying to implement information gain myself, based on this post.
I came up with the following solution:
from scipy.stats import entropy
import numpy as np

def information_gain(X, y):

    def _entropy(labels):
        counts = np.bincount(labels)
        return entropy(counts, base=None)

    def _ig(x, y):
        # indices where x is set/not set
        x_set = np.nonzero(x)[1]
        x_not_set = np.delete(np.arange(x.shape[1]), x_set)
        h_x_set = _entropy(y[x_set])
        h_x_not_set = _entropy(y[x_not_set])

        return entropy_full - (((len(x_set) / f_size) * h_x_set)
                               + ((len(x_not_set) / f_size) * h_x_not_set))

    entropy_full = _entropy(y)
    f_size = float(X.shape[0])
    scores = np.array([_ig(x, y) for x in X.T])
    return scores
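One detail worth noting when comparing the two sets of numbers (my observation, not from the linked post): scipy.stats.entropy with base=None uses the natural logarithm, and sklearn's mutual_info_classif also reports values in nats, so the two columns below are at least in the same units.
from scipy.stats import entropy

# base=None means natural log (nats), which matches sklearn's mutual information units
print(entropy([1, 1]))          # ln(2) ≈ 0.6931 nats
print(entropy([1, 1], base=2))  # 1.0 bit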
Using a very small dataset, most scores from sklearn and my implementation are equal. However, sklearn seems to take frequencies into account, which my algorithm clearly doesn't. For example
from time import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
X, y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=100,
                     stop_words='english')
X_vec = cv.fit_transform(X)

t0 = time()
res_sk = mutual_info_classif(X_vec, y, discrete_features=True)
print("Time passed for sklearn method: %3f" % (time() - t0))
t0 = time()
res_ig = information_gain(X_vec, y)
print("Time passed for ig: %3f" % (time() - t0))

for name, res_mi, res_ig in zip(cv.get_feature_names(), res_sk, res_ig):
    print("%s: mi=%f, ig=%f" % (name, res_mi, res_ig))
sample output:
center: mi=0.011824, ig=0.003548
christian: mi=0.128629, ig=0.127122
color: mi=0.028413, ig=0.026397
com: mi=0.041184, ig=0.030458
computer: mi=0.020590, ig=0.012327
cs: mi=0.007291, ig=0.001574
data: mi=0.020734, ig=0.008986
did: mi=0.035613, ig=0.024604
different: mi=0.011432, ig=0.005492
distribution: mi=0.007175, ig=0.004675
does: mi=0.019564, ig=0.006162
don: mi=0.024000, ig=0.017605
earth: mi=0.039409, ig=0.032981
edu: mi=0.023659, ig=0.008442
file: mi=0.048056, ig=0.045746
files: mi=0.041367, ig=0.037860
ftp: mi=0.031302, ig=0.026949
gif: mi=0.028128, ig=0.023744
god: mi=0.122525, ig=0.113637
good: mi=0.016181, ig=0.008511
gov: mi=0.053547, ig=0.048207
So I was wondering whether my implementation is wrong, or whether it is correct but scikit-learn uses a different variation of the mutual information algorithm.
A little late with my answer but you should look at Orange's implementation. Within their app it is used as a behind-the-scenes processor to help inform the dynamic model parameter building process.
The implementation itself looks fairly straightforward and could most likely be ported out. The entropy calculation comes first, in the section starting at
https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L233
def _entropy(dist):
    """Entropy of class-distribution matrix"""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)
    return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))
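To see what it expects, here is a small hand-check with my own toy numbers (not from the Orange source): dist is a class-by-feature-value contingency matrix, and the result is the per-column class entropy averaged with weights proportional to each column's total count.
import numpy as np

# rows = classes, columns = feature values; 10 samples in total
dist = np.array([[4, 1],
                 [1, 4]])
# each column splits its 5 samples 4/1 across the two classes -> H ≈ 0.722 bits,
# and both columns carry equal weight, so the weighted result is ≈ 0.722
print(_entropy(dist))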
Then the second portion.
https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L305
class GainRatio(ClassificationScorer):
    """
    Information gain ratio is the ratio between information gain and
    the entropy of the feature's
    value distribution. The score was introduced in [Quinlan1986]_
    to alleviate overestimation for multi-valued features. See `Wikipedia entry on gain ratio
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.

    .. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.
    """
    def from_contingency(self, cont, nan_adjustment):
        h_class = _entropy(np.sum(cont, axis=1))
        h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
        h_attribute = _entropy(np.sum(cont, axis=0))
        if h_attribute == 0:
            h_attribute = 1
        return nan_adjustment * (h_class - h_residual) / h_attribute
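Putting the two pieces together by hand with the same toy matrix (again my own numbers, assuming nan_adjustment = 1, i.e. no missing values):
cont = np.array([[4, 1],
                 [1, 4]])
h_class = _entropy(np.sum(cont, axis=1))      # class prior entropy: 1.0 bit
h_residual = _entropy(cont)                   # class entropy given the feature: ≈ 0.722
# (np.compress in the class only drops all-zero feature columns, of which this toy matrix has none)
h_attribute = _entropy(np.sum(cont, axis=0))  # feature value entropy: 1.0 bit
print((h_class - h_residual) / h_attribute)   # gain ratio ≈ 0.278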
The actual scoring process happens at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L218