Alternating Least Squares parameter tuning - Python

Context: I am building a recommender system from implicit feedback (orders) using the implicit library in Python.
Issue: when I try to tune the hyperparameters to find the best ones to use, the output does not appear to loop over all variables and does not show the AUC being calculated. How can I make sure it loops over all combinations, and adds a combination to the dictionary when it leads to a higher AUC score?
Also, please let me know if there is a library I can use for the tuning, as I did not know how to use GridSearchCV, for example, for this use case (an ALS model).
In the code:
training2_set - The altered version of the original training_set, with a certain percentage of the user-item pairs that originally had an interaction set back to zero.
validation_set - A copy of the original training_set matrix, unaltered, so it can be used to see how the rank order compares with the actual interactions.
Expected output: a dictionary with all the combinations, sorted so that the last one is the combination with the highest AUC score. This is the combination I will use for my test set.
import numpy as np
from sklearn import metrics # provides roc_curve and auc
def auc_score(predictions, test):
    fpr, tpr, thresholds = metrics.roc_curve(test, predictions)
    return metrics.auc(fpr, tpr)
def calc_mean_auc(training_set, altered_users, predictions, test_set):
    '''
    This function will calculate the mean AUC by user for any user that had their user-item matrix altered.
    '''
    store_auc = [] # An empty list to store the AUC for each user that had an item removed from the training set
    item_vecs = predictions[1]
    for user in altered_users: # Iterate through each user that had an item altered
        training_row = training_set[user,:].toarray().reshape(-1) # Get the training set row
        zero_inds = np.where(training_row == 0) # Find where the interaction had not yet occurred
        # Get the predicted values based on our user/item vectors
        user_vec = predictions[0][user,:]
        pred = user_vec.dot(item_vecs).toarray()[0,zero_inds].reshape(-1)
        # Get only the items that were originally zero
        # Select all ratings from the MF prediction for this user that originally had no interaction
        actual = test_set[user,:].toarray()[0,zero_inds].reshape(-1)
        # Select the binarized yes/no interaction pairs from the original full data
        # that align with the same pairs in training
        store_auc.append(auc_score(pred, actual)) # Calculate AUC for the given user and store
    # End users iteration
    return float('%.3f'%np.mean(store_auc))
...
latent_factors = [5, 10, 20, 40, 80]
regularizations = [0.01, 0.1, 1., 10., 100.]
regularizations.sort()
iter_array = [1, 2, 5, 10, 25, 50, 100]
best_params = {}
best_params['n_factors'] = latent_factors[0]
best_params['reg'] = regularizations[0]
best_params['n_iter'] = 0
best_params['auc_result'] = np.inf
best_params['model'] = None
for fact in latent_factors:
    print('Factors: {}'.format(fact))
    for reg in regularizations:
        print ('Regularization: {}'.format(reg))
        for ite in iter_array:
            print ('Iteration: {}'.format(ite))
            model = implicit.als.AlternatingLeastSquares(
                factors=fact,
                regularization=reg,
                iterations=ite)
            model.fit((training2_set.T * 15).astype('double'))
            customers_vecs = model.user_factors
            restaurant_vecs = model.item_factors
            auc_result = calc_mean_auc(training2_set, cust_altered2,
                [sparse.csr_matrix(customers_vecs), sparse.csr_matrix(restaurant_vecs.T)], validation_set)
            if auc_result > best_params['auc_result']:
                best_params['n_factors'] = fact
                best_params['reg'] = reg
                best_params['n_iter'] = ite
                best_params['auc_result'] = auc_result
                best_params['model'] = 'AlternatingLeastSquare'
                print ('New optimal hyperparameters')
            print (pd.Series(best_params))
I cannot post a picture but this is the output that I get:
Factors: 5
Regularization: 0.01
Iteration: 1
n_factors 5.00
reg 0.01
n_iter 0.00
auc_result inf
model NaN
dtype: float64
Iteration: 2
n_factors 5.00
reg 0.01
n_iter 0.00
auc_result inf
model NaN
dtype: float64
Iteration: 5

Your if test only ever keeps a single "best" combination, and it never fires: best_params['auc_result'] starts at np.inf, so auc_result > best_params['auc_result'] is always False and nothing is stored.
Use a collections.defaultdict to accumulate the parameters of every combination in lists:
...
from collections import defaultdict
best_params = defaultdict(list)
##best_params = {}
##best_params['n_factors'] = latent_factors[0]
##best_params['reg'] = regularizations[0]
##best_params['n_iter'] = 0
##best_params['auc_result'] = np.inf
##best_params['model'] = None
...
for fact in latent_factors:
    ...
            ...
            ##if auc_result > best_params['auc_result']:
            best_params['n_factors'].append(fact)
            best_params['reg'].append(reg)
            best_params['n_iter'].append(ite)
            best_params['auc_result'].append(auc_result)
            best_params['model'].append('AlternatingLeastSquare')
            ##print ('New optimal hyperparameters')
It appears you are using Pandas, so make a DataFrame and sort it. Sorting in ascending order of AUC puts the best combination last:
df = pd.DataFrame(best_params)
df = df.sort_values('auc_result')
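As a complement, the three nested loops can be collapsed with itertools.product. The following is only a minimal sketch under stated assumptions: grid_search and fake_auc are hypothetical names that do not come from the implicit library, and fake_auc is a made-up stand-in for "fit the ALS model, then call calc_mean_auc". It uses plain lists so that it runs without any third-party package.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Score every combination in `grid`; return all results sorted ascending by 'auc'."""
    names = list(grid)
    rows = []
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        # Record every combination, not only the ones that improve on the best so far
        rows.append({**params, "auc": score_fn(**params)})
    return sorted(rows, key=lambda row: row["auc"])

# Hypothetical stand-in for fitting ALS and calling calc_mean_auc (assumption, not real scores)
def fake_auc(n_factors, reg, n_iter):
    return 1.0 - abs(n_factors - 20) / 100 - abs(reg - 1.0) / 1000 - abs(n_iter - 10) / 1000

grid = {
    "n_factors": [5, 10, 20, 40, 80],
    "reg": [0.01, 0.1, 1.0, 10.0, 100.0],
    "n_iter": [1, 2, 5, 10, 25, 50, 100],
}
results = grid_search(fake_auc, grid)
best = results[-1]  # ascending sort, so the last entry has the highest AUC
print(len(results), best)
```

Replacing fake_auc with a function that builds the model, fits it, and returns the mean AUC gives the same table for the real search; pd.DataFrame(results) turns the list into a DataFrame if that is more convenient.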

Related

Odd Results on Entropy Calculation

I am trying to write a function that properly calculates the entropy of a given dataset. However, I am getting very weird entropy values.
I am following the understanding that all entropy calculations must fall between 0 and 1, yet I am consistently getting values above 2.
Note: I must use log base 2 for this
Can someone explain why I am getting incorrect entropy results?
The dataset I am testing is the ecoli dataset from the UCI Machine Learning Repository
import numpy
import math
#################### DATA HANDLING LIBRARY ####################
def csv_to_array(file):
    # Open the file, and load it in delimiting on the ',' for a comma separated value file
    with open(file, 'r') as handle:
        data = numpy.loadtxt(handle, delimiter=',')
    # Loop through the data in the array
    for index in range(len(data)):
        # Utilize a try catch to try and convert to float, if it can't convert to float, converts to 0
        try:
            data[index] = [float(x) for x in data[index]]
        except ValueError:
            data[index] = 0
        except Exception:
            data[index] = 0
    # Return the now type-formatted data
    return data
# Function that utilizes the numpy library to randomize the dataset.
def randomize_data(csv):
    # numpy.random.shuffle shuffles in place and returns None, so return the array itself
    numpy.random.shuffle(csv)
    return csv
# Function to split the data into test, training set, and validation sets
def split_data(csv):
    # Call the randomize data function
    randomize_data(csv)
    # Grab the number of rows and calculate where to split
    num_rows = csv.shape[0]
    validation_split = int(num_rows * 0.10)
    training_split = int(num_rows * 0.72)
    testing_split = int(num_rows * 0.18)
    # Validation set as the first 10% of the data
    validation_set = csv[:validation_split]
    # Training set as the next 72
    training_set = csv[validation_split:training_split + validation_split]
    # Testing set as the last 18
    testing_set = csv[training_split + validation_split:]
    # Split the data into classes vs actual data
    training_cols = training_set.shape[1]
    testing_cols = testing_set.shape[1]
    validation_cols = validation_set.shape[1]
    training_classes = training_set[:, training_cols - 1]
    testing_classes = testing_set[:, testing_cols - 1]
    validation_classes = validation_set[:, validation_cols - 1]
    # Take the sets and remove the last (classification) column
    training_set = training_set[:, :-1]
    testing_set = testing_set[:, :-1]
    validation_set = validation_set[:, :-1]
    # Return the datasets
    return testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes
#################### DATA HANDLING LIBRARY ####################
# This function returns the list of classes, and their associated weights (i.e. distributions)
# for a given dataset
def class_distribution(dataset):
    # Ensure the dataset is a numpy array
    dataset = numpy.asarray(dataset)
    # Collect # of total rows and columns, using numpy
    num_total_rows = dataset.shape[0]
    num_columns = dataset.shape[1]
    # Create a numpy array of just the classes
    classes = dataset[:, num_columns - 1]
    # Use numpy.unique to remove duplicates
    classes = numpy.unique(classes)
    # Create an empty array for the class weights
    class_weights = []
    # Loop through the classes one by one
    for aclass in classes:
        # Create storage variables
        total = 0
        weight = 0
        # Now loop through the dataset
        for row in dataset:
            # If the class of the dataset is equal to the current class you are evaluating, increase the total
            if numpy.array_equal(aclass, row[-1]):
                total = total + 1
            # If not, continue
            else:
                continue
        # Divide the # of occurrences by total rows
        weight = float((total / num_total_rows))
        # Add that weight to the list of class weights
        class_weights.append(weight)
    # Turn the weights into a numpy array
    class_weights = numpy.asarray(class_weights)
    # Return the array
    return classes, class_weights
# This function returns the entropy for a given dataset
# Can be used across an entire csv, or just for a column of data (feature)
def get_entropy(dataset):
    # Set initial entropy
    entropy = 0.0
    # Determine the classes and their frequencies (weights) of the dataset
    classes, class_freq = class_distribution(dataset)
    # Utilize numpy's quicksort to test the most occurring class first
    numpy.sort(class_freq)
    # Determine the max entropy for the dataset
    max_entropy = math.log(len(classes), 2)
    print("MAX ENTROPY FOR THIS DATASET: ", max_entropy)
    # Loop through the frequencies and use given formula to calculate entropy
    # For...Each simulates the sequence operator
    for freq in class_freq:
        entropy += float(-freq * math.log(freq, 2))
    # Return the entropy value
    return entropy
def main():
    ecol = csv_to_array('ecoli.csv')
    testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes = split_data(ecol)
    entropy = get_entropy(ecol)
    print(entropy)
main()
The following function was used to calculate entropy:
# Function to return Shannon's Entropy
def entropy(attributes, dataset, targetAttr):
    freq = {}
    entropy = 0.0
    index = 0
    for item in attributes:
        if (targetAttr == item):
            break
        else:
            index = index + 1
    index = index - 1
    for item in dataset:
        if ((item[index]) in freq):
            # Increase the count for this value
            freq[item[index]] += 1.0
        else:
            # First occurrence: initialize the count to 1
            freq[item[index]] = 1.0
    for count in freq.values():
        entropy = entropy + (-count / len(dataset)) * math.log(count / len(dataset), 2)
    return entropy
As @MattTimmermans indicated, the range of the entropy depends on the number of classes. For exactly 2 classes, it lies in the range 0 to 1 (inclusive). For k > 2 classes (which is what was being tested), the same formula applies (converted to Python code above), but the value can reach log2(k), so results above 1 are expected. This post here explains the mathematics and calculations in a bit more detail.
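To illustrate how the upper bound grows with the number of classes, here is a minimal standard-library sketch. shannon_entropy is a hypothetical helper written for this illustration, not the function from the post, and the label lists are made up.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy in bits of the empirical class distribution of `labels`."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

two_classes = ["a", "b"] * 4          # 2 equally likely classes
eight_classes = list("abcdefgh") * 4  # 8 equally likely classes

h2 = shannon_entropy(two_classes)
h8 = shannon_entropy(eight_classes)
print(h2, math.log2(2))
print(h8, math.log2(8))
```

With equally likely classes the entropy equals log2 of the number of classes, which is the maximum; any other distribution over the same classes gives a smaller value.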

Simulations Confidence Interval Not Equal to conf_int Results

Given this simulated data:
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.statespace.structural import UnobservedComponents
np.random.seed(12345)
ar = np.r_[1, 0.9]
ma = np.array([1])
arma_process = ArmaProcess(ar, ma)
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * X + np.random.normal(size=100)
We build an UnobservedComponents model with the first 70 points to run inference on the last 30 points, like so:
model = UnobservedComponents(y[:70], level='llevel', exog=X[:70])
f_model = model.fit()
forecaster = f_model.get_forecast(
    steps=30,
    exog=X[70:].reshape(-1, 1)
)
conf_int = forecaster.conf_int()
If we observe the mean for the 95% confidence interval, we get the following:
conf_int.mean(axis=0)
array([118.19789195, 122.14101161])
But when trying to get the same values through model simulations, we don't quite get the same results. Here's the script we run for the simulated boundaries:
sim_model = UnobservedComponents(np.zeros(30), level='llevel', exog=X[70:])
res = []
predicted_state = f_model.predicted_state[..., -1]
predicted_state_cov = f_model.predicted_state_cov[..., -1]
for i in range(1000):
    init_state = np.random.multivariate_normal(
        predicted_state,
        predicted_state_cov
    )
    sim = sim_model.simulate(
        f_model.params,
        30,
        initial_state=init_state)
    res.append(sim.mean())
Printing the lower 2.5 and upper 97.5 percentile we get:
np.percentile(res, [2.5, 97.5])
array([119.06735028, 121.26810407])
As we use model simulations to distinguish signal from noise in data, this difference ended up being big enough to lead to contradictory conclusions. For instance, if we shift the last 30 points:
y[70:] += 1
Then according to the first technique we conclude that the new y carries no signal, as its mean is lower than 122.14. But the same is not true for the second technique: as the upper boundary is 121.27, we conclude that there is signal.
What we are trying to understand now is whether this is expected. Shouldn't the lower and upper 95% bounds from both techniques be equal?
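One thing worth checking when comparing the two numbers is that they are different statistics: the first averages pointwise interval bounds over the 30 steps, while the second takes percentiles of each simulated path's mean. The sketch below is only an illustration under a strong assumption: independent standard-normal draws stand in for the forecast errors, whereas errors from the model above are correlated across steps, so the size of the gap will differ. percentile is a small hypothetical helper, not a statsmodels or NumPy function.

```python
import random

def percentile(values, q):
    """Nearest-rank percentile of `values` for q in [0, 100]."""
    ordered = sorted(values)
    idx = round(q / 100 * (len(ordered) - 1))
    return ordered[idx]

random.seed(0)
n_paths, n_steps = 2000, 30
# Assumption: independent N(0, 1) errors at every step of every simulated path
paths = [[random.gauss(0.0, 1.0) for _ in range(n_steps)] for _ in range(n_paths)]

# (a) pointwise 95% band at each step, then averaged over the steps
lower = [percentile([path[t] for path in paths], 2.5) for t in range(n_steps)]
upper = [percentile([path[t] for path in paths], 97.5) for t in range(n_steps)]
band_pointwise = (sum(lower) / n_steps, sum(upper) / n_steps)

# (b) 95% band of the per-path mean
means = [sum(path) / n_steps for path in paths]
band_of_means = (percentile(means, 2.5), percentile(means, 97.5))

print(band_pointwise)
print(band_of_means)
```

In this toy setting the band of the per-path mean is narrower than the averaged pointwise band, because averaging over steps reduces variance. Whether this accounts for the whole difference in the example above is not something this sketch can establish.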

Pycharm debugger skips breakpoints after specific line of code - why?

I'm new to PyCharm and so far my impression of the debugger is that it's marvelous! However, it behaves weirdly with my code and I cannot figure out what is going wrong.
If I set a breakpoint on these lines of code and then press "Step Over" or "Step Into My Code", it runs until the end, ignoring all other upcoming breakpoints. Any idea what I am doing wrong? Breakpoints before that line work perfectly fine.
for ind, fit in zip(pop, fitnesses):
    ind.fitness.values = fit
My code
You need to pip install deap, efel and brian2 for the code to run.
# DEAP
# https://github.com/DEAP/deap/tree/54b83e2cc7a73be7657cb64b01033a31584b891d
# import array
import matplotlib.pyplot as plt
import pandas as pd
from scipy.io import loadmat
import random, numpy, os, efel, scipy, math, time, array, json
from deap import algorithms, base, creator, tools, benchmarks
from deap.benchmarks.tools import diversity, convergence # , hypervolume
from machine_hh_model_v02 import brian_hh_model
# parallel processing
# from scoop import futures
parallel_processing = "no"
# Starting values
channels = {"ENa": 65,
"EK": -90,
"El": -70,
"ECa": 120,
"gNa": 0.05,
"gK": 0.005,
"gL": 1e-4,
"gM": 8e-5,
"gCa": 1e-5}
# Boundaries
bounds = {"ENa": [50, 70],
"EK": [-100, -80],
"El": [-50, -100],
"ECa": [100, 120],
"gNa": [0, 1],
"gK": [0, 1],
"gL": [0, 1],
"gM": [0, 1],
"gCa": [0, 1]}
low, up = [x[0] for x in bounds.values()], [x[1] for x in bounds.values()]
# Set parameters
ext = 2.5 # external current stimulation [nA]
num_gen = 2 # number of generations
num_parents = 40 # number of parents
num_params = len(channels) # number of parameters to optimize
prob_crossover = 0.9 # probability that crossover takes place
# How to generate individuals
def initIndividual(container, sigma):
    return container(random.gauss(x, sigma) for x in channels.values())
# CREATOR
# http://deap.readthedocs.io/en/master/tutorials/basic/part1.html
# The create() function takes at least two arguments, a name for the newly created class and a base class. Any
# subsequent argument becomes an attribute of the class. Neg. weights relate to minizing, pos. weight to maximizing
# problems.
# -- define fitness problem (which params are min./max. problems with which weight)
creator.create("FitnessMulti", base.Fitness, weights=tuple(numpy.ones(num_params) * -1))
# Next we will create the class Individual, which will inherit the class list and contain our previously defined
# FitnessMulti class in its fitness attribute. Note that upon creation all our defined classes will be part of the
# creator container and can be called directly.
# -- associate fitness problem to individuals, that are going to be created
creator.create("Individual", list, fitness=creator.FitnessMulti)
# TOOLBOX
# http://deap.readthedocs.io/en/master/examples/ga_onemax.html
# http://deap.readthedocs.io/en/master/api/tools.html#module-deap.tools
# All the objects we will use on our way, an individual, the population, as well as all functions, operators, and
# arguments will be stored in a DEAP container called Toolbox. It contains two methods for adding and removing content,
# register() and unregister().
toolbox = base.Toolbox()
# The newly introduced register() method takes at least two arguments: an alias and a function. Toolbox.attr_bool(),
# when called, will draw a random integer between -100 and 100. Toolbox.attr_float(), when called, will draw a random
# floating point number.
# -- how to generate values for each individual
# toolbox.register("attr_float", random.uniform, -100, 100)
# toolbox.register("attr_float", lambda: [random.gauss(x, 5) for x in channels.values()])
# Our individuals will be generated using the function initRepeat(). Its first argument is a container class, the
# Individual one we defined in the previous section. This container will be filled using the method attr_float(),
# provided as second argument, and will contain 10 integers, as specified using the third argument. When called, the
# individual() method will thus return an individual initialized with what would be returned by calling the attr_float()
# method 100 times.
# -- how and how many individuals to create
# toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_float, num_params)
toolbox.register("individual", initIndividual, creator.Individual, sigma=1)
# Finally, the population() method uses the same paradigm.
# -- how and how many parents to create
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
# LOAD EXPERIMENTAL DATA
# set path
# mainpath = r'C:\OwnCloud\Masterarbeit' # mainpath
# pathfit = os.path.join(mainpath, r'fitness_params') # fitness files
# os.chdir(os.path.join(mainpath, pathfit)) # change directory
# load fitness file
# xl = pd.ExcelFile('fitness.xlsx') # load excel file
# xl_mean = xl.parse("median") # load sheet 'mean' containing mean/median values
# xl_var = xl.parse("quartile-to-median distance") # load sheet 'std containing std/quantiles values
xl_mean = pd.read_json("median")
xl_var = pd.read_json("distance")
########################
############ SOMETHING IS WRONG HERE
########################
# EFEL: median
def efel_stats(features):
# get latency of first spike
if features['peak_time'].any():
features['first_peak_time'] = features['peak_time'][0]
del features['peak_time']
# get median
for key, val in features.items():
if val is None or numpy.isnan(val) or not val:
features[key] = 9999
else:
features[key] = scipy.nanmedian(val)
# get median
# if features['Spikecount'] == 0:
# for key, val in features.items():
# features[key] = 9999
# else:
# for key, val in features.items():
# features[key] = scipy.nanmedian(val)
return features
# ERROR FUNCTION
# The returned value must be iterable and of a length equal to the number of objectives (weights).
def error_function(external_current, indi, xl_mean=xl_mean, xl_var=xl_var):
    # output variable
    allerrors = []
    # BRIAN: run model
    stim_start, stim_duration = 500, 1000
    voltage, time = brian_hh_model(1, 0, stim_start, stim_duration,
                                   ENa=indi[0], EK=indi[1], El=indi[2], ECa=indi[3], gNa=indi[4], gK=indi[5],
                                   gL=indi[6], gM=indi[7], gCa=indi[8])
    # EFEL: extract features and get median
    feature_names = ['Spikecount', 'peak_voltage', 'min_AHP_values', 'AP_begin_voltage', 'spike_half_width',
                     'voltage_base', 'steady_state_voltage_stimend',
                     'AP_begin_time', 'peak_time']
    trace = {'T': time, 'V': voltage, 'stim_start': [stim_start], 'stim_end': [stim_start + stim_duration]}
    features = efel.getFeatureValues([trace], feature_names)[0]
    features = efel_stats(features)
    # # ERROR FUNCTION: get error value
    for feature, value in features.items():
        # median for one external current (experimental data)
        experiment_vals = xl_mean.loc[xl_mean['stimulus'] == external_current, feature].values[0]
        error = float(abs(value - experiment_vals))
        # my model can produce the same, less or more #spikes, peakvoltage, ...
        if value == experiment_vals:
            error = 0.
        elif value < experiment_vals:
            error = error / float(xl_var.loc[xl_var['stimulus'] == external_current, feature].values[0][0])
        elif value > experiment_vals:
            error = error / float(xl_var.loc[xl_var['stimulus'] == external_current, feature].values[0][1])
        # append error value of this feature
        allerrors.append(error)
    return allerrors
# GENETIC OPERATORS
# Within DEAP there are two ways of using operators. We can 1) simply call a function from the tools module or
# 2) register it with its arguments in a toolbox, as we have already seen for our initialization methods. The second
# option allows us to to easily switch between the operators if desired.
# see http://deap.readthedocs.io/en/master/api/tools.html#module-deap.tools
# Crossover
# Register the crossover function to the toolbox
# tools.cxOnePoint --> one point crossover
# tools.cxTwoPoint --> two-point crossover
toolbox.register("mate", tools.cxSimulatedBinary, eta=20.0)
# Mutation
# Register the mutation function to the toolbox
# tools.mutGaussian --> applies a gaussian mutation of mean mu and standard deviation sigma on the input individual. The indpb argument is the probability of each attribute to be mutated.
# tools.mutPolynomialBounded --> Polynomial mutation as implemented in original NSGA-II algorithm in C by Deb.
# toolbox.register("mutate", tools.mutPolynomialBounded, eta=20, low=low, up=up, indpb=1.0/num_params)
toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=2, indpb=0.9)
# Selection
# tools.sortNondominated(individuals, k, first_front_only=False) --> Sort the first k individuals into different nondomination levels using the “Fast Nondominated Sorting Approach” proposed by Deb et al., see [Deb2002].
# tools.sortLogNondominated(individuals, k, first_front_only=False) --> Sort individuals in pareto non-dominated fronts using the Generalized Reduced Run-Time Complexity Non-Dominated Sorting Algorithm presented by Fortin et al. (2013).
toolbox.register("select", tools.selNSGA2)
# Evaluation
# Register error function in the toolbox.
# The evaluation will be performed by calling the alias "evaluate".
# toolbox.register("evaluate", error_function, ext, model_vals)
# ALGORITHM
# Now that everything is ready, we can start to write our own algorithm. It is usually done in a main function.
if parallel_processing=="yes":
    toolbox.register("map", futures.map)
def main():
    # register statistics to the toolbox to maintain stats of the evolution
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("avg", numpy.mean, axis=0)
    stats.register("std", numpy.std, axis=0)
    stats.register("min", numpy.min, axis=0)
    stats.register("max", numpy.max, axis=0)
    logbook = tools.Logbook()
    logbook.header = "gen", "evals", "min"
    ###
    ### NSGA-II algorithm as in "Deb 2002: A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II"
    ### https://github.com/DEAP/deap/blob/master/examples/ga/nsga2.py
    ###
    # create random parent population pop
    pop = toolbox.population(n=num_parents)
    # register error function
    toolbox.register("evaluate", error_function, ext)
    # evaluate parent population
    invalid_ind = [ind for ind in pop if not ind.fitness.valid]
    fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
    for ind, fit in zip(invalid_ind, fitnesses):
        ind.fitness.values = fit
    # assign crowding distance to the individuals, no actual selection is done
    pop = toolbox.select(pop, len(pop))
    # print logbook
    record = stats.compile(pop)
    logbook.record(gen=0, evals=len(invalid_ind), **record)
    print(logbook.stream)
    # print(record)
    # Begin the generational process
    for gen in range(1, num_gen):
        # increase the variance in my population
        offspring = tools.selTournamentDCD(pop, len(pop))
        # I have no idea why
        offspring = [toolbox.clone(ind) for ind in offspring]
        # crossover
        for ind1, ind2 in zip(offspring[::2], offspring[1::2]):
            if random.random() <= prob_crossover:
                toolbox.mate(ind1, ind2)
            # mutation
            toolbox.mutate(ind1)
            toolbox.mutate(ind2)
            del ind1.fitness.values, ind2.fitness.values
        # Fitness
        invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
        fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
        for ind, fit in zip(invalid_ind, fitnesses):
            ind.fitness.values = fit
        # Select the next generation population
        pop = toolbox.select(pop + offspring, num_parents)
        record = stats.compile(pop)
        logbook.record(gen=gen, evals=len(invalid_ind), **record)
        print(logbook.stream)
        # print(record)
    # print("Final population hypervolume is %f" % hypervolume(pop, [11.0, 11.0]))
    return pop, logbook
# create pareto front (all non-dominated individuals that ever lived)
# pareto = tools.ParetoFront()
if __name__ == "__main__":
    pop, logbook = main()
    print(logbook)
    print("POPULATION", pop)
    for indi in pop:
        stim_start, stim_duration = 500, 1000
        voltage, time = brian_hh_model(1, 1, stim_start, stim_duration,
                                       ENa=indi[0], EK=indi[1], El=indi[2], ECa=indi[3], gNa=indi[4], gK=indi[5],
                                       gL=indi[6], gM=indi[7], gCa=indi[8])
        print("INDIVIDUAL ", indi)
machine_hh_model_v02
# brian user guide
# http://brian2.readthedocs.io/en/2.0rc/user/index.html
from brian2 import *
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style()
def brian_hh_model(input_current, plotflag, stim_start, stim_duration, **parameter):
# C++ standalone mode
# At the beginning of the script, i.e. after the import statements, add:
# set_device('cpp_standalone', build_on_run=False)
# ********
# Handling units
# You can generate a physical quantity by multiplying a scalar or vector value with its physical unit:
# tau = 20*ms --> 20. ms
# rates = [10, 20, 30] * Hz --> [ 10. 20. 30.] Hz
# Most Brian functions will also complain about non-specified or incorrect units:
# G = NeuronGroup(10, 'dv/dt = -v/tau: volt', dt=0.5) --> "dt" has wrong dimensions, dimensions were (1) (s)
# Directly get the unitless value of a state variable by appending an underscore to the name
# print(rates_) --> [ 10. 20. 30.]
# ********
# Default parameters
# Set default parameters
default = {"ENa": 65, "EK": -90, "El": -70, "ECa": 120, "gNa": 0.05, "gK": 0.005, "gL": 1e-4, "gM": 8e-5,
"gCa": 1e-5}
# Use default parameter, if not defined as an input parameter
for key, val in default.items():
if key not in parameter:
parameter[key] = val
# Parameters
# Extract parameters that were given as an input (as a dictionary).
d = 96 * umetre # 79.8 umetre
area = d*d*3.141
Cm = 1*ufarad*cm**-2 * area # 1 ufarad
ENa = parameter["ENa"]*mV
EK = parameter["EK"]*mV
El = parameter["El"]*mV
ECa = parameter["ECa"]*mV
g_na = parameter["gNa"]*siemens*cm**-2 * area # 0.05 siemens
g_kd = parameter["gK"]*siemens*cm**-2 * area # 0.005 siemens
gl = parameter["gL"]*siemens*cm**-2 * area # 1e-4 siemens
gm = parameter["gM"]*siemens*cm**-2 * area # 8e-5 siemens
gCa = parameter["gCa"]*siemens*cm**-2 * area
tauMax = 4000 * ms
VT = -63*mV
# Equations
# Define both state variables and continuous-updates on these variables through differential equations. An Equation is
# a set of single lines in a string:
# 1. dx/dt = f : unit (differential equation)
# 2. x = f : unit (subexpression)
# 3. x : unit (parameter)
# There are three special units, "1" -> floating point number, "boolean" and "integer"
# Some special variables are defined: t, dt (time) and xi (white noise). Some other variable names (e.g. _pre) are
# forbidden. Flags -- dx/dt = f : unit (constant), parameter will not be changed during a run.
# The model
eqs = Equations('''
Im = gm * p * (v-EK) : amp
Ica = gCa * q*q * r * (v-ECa) : amp
dv/dt = (gl*(El-v) - g_na*(m*m*m)*h*(v-ENa) - g_kd*(n*n*n*n)*(v-EK) - Im - Ica + I)/Cm : volt
dm/dt = 0.32*(mV**-1)*(13.*mV-v+VT)/
(exp((13.*mV-v+VT)/(4.*mV))-1.)/ms*(1-m)-0.28*(mV**-1)*(v-VT-40.*mV)/
(exp((v-VT-40.*mV)/(5.*mV))-1.)/ms*m : 1
dn/dt = 0.032*(mV**-1)*(15.*mV-v+VT)/
(exp((15.*mV-v+VT)/(5.*mV))-1.)/ms*(1.-n)-.5*exp((10.*mV-v+VT)/(40.*mV))/ms*n : 1
dh/dt = 0.128*exp((17.*mV-v+VT)/(18.*mV))/ms*(1.-h)-4./(1+exp((40.*mV-v+VT)/(5.*mV)))/ms*h : 1
# K+ current
dp/dt = (1/(1+exp(-(v-VT+35.*mV)/(10.*mV))) - p) / (tauMax / (3.3 * exp((v - VT + 35.*mV)/(20.*mV) + exp(-(v - VT + 35.*mV)/(20.*mV))))) : 1
# Ca2+ current
dq/dt = 0.055*(mV**-1) * (-27.*mV - v) / (exp((-27.*mV - v) / (3.8*mV)) - 1.)/ms * (1.-q) - 0.94*exp((-75.*mV - v) / (17.*mV))/ms*q : 1
dr/dt = 0.000457 * exp((-13.*mV - v) / (50.*mV))/ms * (1.-r) - 0.0065 / (1. + exp((-15.*mV - v) / (28.*mV)))/ms*r : 1
I : amp
''')
# NeuronGroup
# The core of every simulation is a NeuronGroup, a group of neurons that share the same equations defining their
# properties. Minimum inputs are "number of neurons" and "model description in the form of equations". Threshold and
# refractoriness are only used for emiting spikes. To make a neuron non-excitable for a certain time period after a
# spike, the refractory keyword can be used.
# G = NeuronGroup(10, 'dv/dt = -v/tau : volt', threshold='v > -50*mV', reset='v = -70*mV', refractory=5*ms)
# Dictionary
# You can set multiple initial values at once using a dictionary and the Group.get_states() and Group.set_states()
# methods.
# initial_values = {'v': 1, 'tau': 10*ms}
# group.set_states(initial_values)
# group.v[:] --> 1)
# states = group.get_states()
# states['v'] --> 1)
group = NeuronGroup(1, eqs, method="exponential_euler")
group.v = El
group.I = 0*nA
# Recording
# Recording variables during a simulation is done with “monitor” objects. Specifically, spikes are recorded with
# SpikeMonitor, the time evolution of variables with StateMonitor and the firing rate of a population of neurons with
# PopulationRateMonitor. You can get all the stored values in a monitor with the Group.get_states().
# In this example, we record two variables v and u, and record from indices 0, 10 and 100 --> three neurons.
# G = NeuronGroup(...)
# M = StateMonitor(G, ('v', 'u'), record=[0, 10, 100])
# M.v[1] will return the values for the second recorded neuron which is the neuron with the index 10.
M = StateMonitor(group, 'v', record=0)
# Run the model
# The command run(100*ms) runs the simulation for 100 ms.
run(stim_start*ms)
group.I[0] = input_current*nA # current injection at one end
run(stim_duration*ms)
group.I = 0*nA
run(stim_start*ms)
# C++ standalone mode
# After the last run() call, call device.build() explicitly:
# device.build(directory='output', compile=True, run=True, debug=False)
# Timing
# profiling_summary(show=5) -- show the 5 objects that took the longest
# profiling_summary(show=2)
# Output
time = M.t/ms
voltage = M.v[0]/mV
# plot output
if plotflag:
plt.plot(time, voltage)
xlabel('Time [ms]')
ylabel('Membrane potential [mV]')
plt.show()
# For multiple calls
# device.reinit()
# device.activate()
return (voltage, time)
#brian_hh_model(1, 1, 500, 1000, ENa=65, EK=-90, El=-70, ECa=120, gNa=0.05, gK=0.005, gL=1e-4, gM=8e-5, gCa=1e-5)
Thanks a lot in advance!!

How to remove outliers correctly and define predictors for linear model?

I am learning how to build a simple linear model to predict the price of a flat based on its area in square meters and its number of rooms. I have a .csv data set with several features, and of course 'Price' is one of them, but it contains several suspicious values like '1' or '4000'. I want to remove these values based on the mean and standard deviation, so I use the following function to remove outliers:
import numpy as np
import pandas as pd
def reject_outliers(data):
    u = np.mean(data)
    s = np.std(data)
    data_filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return data_filtered
Then I construct a function to build the linear regression:
def linear_regression(data):
    data_filtered = reject_outliers(data['Price'])
    print(len(data_filtered)) # based on the length I see that several outliers have been removed
The next step is to define the data/predictors. I set my features:
    features = data[['SqrMeters', 'Rooms']]
    target = data_filtered
    X = features
    Y = target
And here is my question. How can I get the same set of observations for my X and Y? Now I have inconsistent numbers of samples (5000 for my X and 4995 for my Y after removing outliers). Thank you for any help in this topic.
The features and labels should have the same length, so you should pass the whole data object to reject_outliers and filter its rows instead of filtering the 'Price' column alone:
def reject_outliers(data):
    u = np.mean(data["Price"])
    s = np.std(data["Price"])
    data_filtered = data[(data["Price"]>(u-2*s)) & (data["Price"]<(u+2*s))]
    return data_filtered
You can use it in this way:
data_filtered=reject_outliers(data)
features = data_filtered[['SqrMeters', 'Rooms']]
target = data_filtered['Price']
X=features
y=target
The following works for pandas DataFrames (data):
def reject_outliers(data):
    u = np.mean(data.Price)
    s = np.std(data.Price)
    data_filtered = data[(data.Price > u-2*s) & (data.Price < u+2*s)]
    return data_filtered
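As a rough end-to-end sketch of the approach in the answers above: filtering rows of the whole DataFrame (rather than the 'Price' column alone) keeps X and y aligned, because both are taken from the same filtered frame. The numbers below are made up for illustration, and the `column` and `n_std` parameters are assumptions added here, not part of the original code.

```python
import pandas as pd

def reject_outliers(data, column="Price", n_std=2):
    """Keep rows whose `column` value lies within n_std standard deviations of the mean."""
    u = data[column].mean()
    s = data[column].std(ddof=0)  # ddof=0 matches the default of np.std
    mask = (data[column] > u - n_std * s) & (data[column] < u + n_std * s)
    return data[mask]

# Made-up data for illustration only; the last price is the suspicious value
data = pd.DataFrame({
    "SqrMeters": [50, 60, 55, 45, 52, 48, 51, 49, 53, 58],
    "Rooms": [2, 3, 2, 2, 2, 2, 2, 2, 2, 3],
    "Price": [100, 110, 105, 95, 102, 98, 101, 99, 103, 4000],
})

data_filtered = reject_outliers(data)
X = data_filtered[["SqrMeters", "Rooms"]]
y = data_filtered["Price"]
print(len(X), len(y))  # both lengths come from the same filtered frame
```

Since X and y share the index of data_filtered, they can be passed directly to a model's fit method without any further alignment.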

Python Information gain implementation

I am currently using scikit-learn for text classification on the 20ng dataset. I want to calculate the information gain for a vectorized dataset. It has been suggested to me that this can be accomplished, using mutual_info_classif from sklearn. However, this method is really slow, so I was trying to implement information gain myself based on this post.
I came up with the following solution:
from scipy.stats import entropy
import numpy as np
def information_gain(X, y):
    def _entropy(labels):
        counts = np.bincount(labels)
        return entropy(counts, base=None)
    def _ig(x, y):
        # indices where x is set/not set
        x_set = np.nonzero(x)[1]
        x_not_set = np.delete(np.arange(x.shape[1]), x_set)
        h_x_set = _entropy(y[x_set])
        h_x_not_set = _entropy(y[x_not_set])
        return entropy_full - (((len(x_set) / f_size) * h_x_set)
                               + ((len(x_not_set) / f_size) * h_x_not_set))
    entropy_full = _entropy(y)
    f_size = float(X.shape[0])
    scores = np.array([_ig(x, y) for x in X.T])
    return scores
Using a very small dataset, most scores from sklearn and my implementation are equal. However, sklearn seems to take frequencies into account, which my algorithm clearly doesn't. For example
from time import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
X, y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=100,
                     stop_words='english')
X_vec = cv.fit_transform(X)
t0 = time()
res_sk = mutual_info_classif(X_vec, y, discrete_features=True)
print("Time passed for sklearn method: %3f" % (time()-t0))
t0 = time()
res_ig = information_gain(X_vec, y)
print("Time passed for ig: %3f" % (time()-t0))
# separate loop names so res_ig is not shadowed
for name, mi, ig in zip(cv.get_feature_names(), res_sk, res_ig):
    print("%s: mi=%f, ig=%f" % (name, mi, ig))
sample output:
center: mi=0.011824, ig=0.003548
christian: mi=0.128629, ig=0.127122
color: mi=0.028413, ig=0.026397
com: mi=0.041184, ig=0.030458
computer: mi=0.020590, ig=0.012327
cs: mi=0.007291, ig=0.001574
data: mi=0.020734, ig=0.008986
did: mi=0.035613, ig=0.024604
different: mi=0.011432, ig=0.005492
distribution: mi=0.007175, ig=0.004675
does: mi=0.019564, ig=0.006162
don: mi=0.024000, ig=0.017605
earth: mi=0.039409, ig=0.032981
edu: mi=0.023659, ig=0.008442
file: mi=0.048056, ig=0.045746
files: mi=0.041367, ig=0.037860
ftp: mi=0.031302, ig=0.026949
gif: mi=0.028128, ig=0.023744
god: mi=0.122525, ig=0.113637
good: mi=0.016181, ig=0.008511
gov: mi=0.053547, ig=0.048207
So I was wondering whether my implementation is wrong, or whether it is correct but a different variation of the mutual information algorithm than the one scikit-learn uses.
A little late with my answer but you should look at Orange's implementation. Within their app it is used as a behind-the-scenes processor to help inform the dynamic model parameter building process.
The implementation itself looks fairly straightforward and could most likely be ported. The entropy calculation comes first, in the section starting at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L233
def _entropy(dist):
    """Entropy of class-distribution matrix"""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)
    return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))
Then the second portion, at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L305
class GainRatio(ClassificationScorer):
    """
    Information gain ratio is the ratio between information gain and
    the entropy of the feature's
    value distribution. The score was introduced in [Quinlan1986]_
    to alleviate overestimation for multi-valued features. See `Wikipedia entry on gain ratio
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.
    .. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.
    """
    def from_contingency(self, cont, nan_adjustment):
        h_class = _entropy(np.sum(cont, axis=1))
        h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
        h_attribute = _entropy(np.sum(cont, axis=0))
        if h_attribute == 0:
            h_attribute = 1
        return nan_adjustment * (h_class - h_residual) / h_attribute
The actual scoring process happens at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L218
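To make the contingency-table idea concrete, here is a minimal standalone sketch in NumPy. It is not a port of Orange's code: the function names entropy_bits and info_gain_from_contingency are hypothetical, and the table layout (rows are classes, columns are feature values) is an assumption chosen to mirror the h_class - h_residual term quoted above. It uses log base 2 like the quoted _entropy, so results are in bits, whereas scipy's entropy with base=None (as in the question) uses the natural logarithm.

```python
import numpy as np

def entropy_bits(counts):
    """Shannon entropy, in bits, of a 1-D array of non-negative counts."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total  # drop empty cells to avoid log2(0)
    return float(-(p * np.log2(p)).sum())

def info_gain_from_contingency(cont):
    """Information gain H(class) - H(class | feature).

    cont[i, j] is the number of samples with class i and feature value j
    (layout assumed for this sketch).
    """
    cont = np.asarray(cont, dtype=float)
    n = cont.sum()
    h_class = entropy_bits(cont.sum(axis=1))
    # Conditional entropy: per-column class entropy, weighted by column mass
    h_residual = sum(
        (col.sum() / n) * entropy_bits(col) for col in cont.T if col.sum() > 0
    )
    return h_class - h_residual

# A feature that perfectly separates two balanced classes
print(info_gain_from_contingency([[5, 0], [0, 5]]))
# A feature that is independent of the class
print(info_gain_from_contingency([[5, 5], [5, 5]]))
```

With a binary feature (word present or absent) this computes the same kind of quantity as the _ig function in the question, up to the log base; with one column per distinct word count, the frequencies would enter the score as well.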
