I am trying to write a function that correctly calculates the entropy of a given dataset. However, I am getting very strange entropy values.
My understanding is that all entropy values must fall between 0 and 1, yet I am consistently getting values above 2.
Note: I must use log base 2 for this.
Can someone explain why I am getting these entropy values?
The dataset I am testing is the ecoli dataset from the UCI Machine Learning Repository.
import numpy
import math

#################### DATA HANDLING LIBRARY ####################
def csv_to_array(file):
    # Open the file and load it, delimiting on ',' for a comma-separated-value file
    data = open(file, 'r')
    data = numpy.loadtxt(data, delimiter=',')

    # Loop through the data in the array
    for index in range(len(data)):
        # Try to convert each row to float; if a value can't be converted, set the row to 0
        try:
            data[index] = [float(x) for x in data[index]]
        except ValueError:
            data[index] = 0

    # Return the now type-formatted data
    return data
# Function that utilizes the numpy library to randomize the dataset.
def randomize_data(csv):
    # numpy.random.shuffle shuffles in place and returns None,
    # so return the shuffled array itself
    numpy.random.shuffle(csv)
    return csv
# Function to split the data into test, training, and validation sets
def split_data(csv):
    # Call the randomize data function
    randomize_data(csv)

    # Grab the number of rows and calculate where to split
    num_rows = csv.shape[0]
    validation_split = int(num_rows * 0.10)
    training_split = int(num_rows * 0.72)

    # Validation set as the first 10% of the data
    validation_set = csv[:validation_split]
    # Training set as the next 72%
    training_set = csv[validation_split:training_split + validation_split]
    # Testing set as the last 18%
    testing_set = csv[training_split + validation_split:]

    # Split the data into classes vs actual data
    training_cols = training_set.shape[1]
    testing_cols = testing_set.shape[1]
    validation_cols = validation_set.shape[1]
    training_classes = training_set[:, training_cols - 1]
    testing_classes = testing_set[:, testing_cols - 1]
    validation_classes = validation_set[:, validation_cols - 1]

    # Take the sets and remove the last (classification) column
    # (slice columns, not rows)
    training_set = training_set[:, :-1]
    testing_set = testing_set[:, :-1]
    validation_set = validation_set[:, :-1]

    # Return the datasets
    return testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes
#################### DATA HANDLING LIBRARY ####################

# This function returns the list of classes and their associated weights
# (i.e. distributions) for a given dataset
def class_distribution(dataset):
    # Ensure the dataset is a numpy array
    dataset = numpy.asarray(dataset)

    # Collect the number of total rows and columns, using numpy
    num_total_rows = dataset.shape[0]
    num_columns = dataset.shape[1]

    # Create a numpy array of just the classes
    classes = dataset[:, num_columns - 1]
    # Use numpy.unique to remove duplicates
    classes = numpy.unique(classes)

    # Create an empty list for the class weights
    class_weights = []

    # Loop through the classes one by one
    for aclass in classes:
        # Create storage variables
        total = 0
        weight = 0
        # Now loop through the dataset
        for row in dataset:
            # If the class of the row equals the current class being evaluated, increase the total
            if numpy.array_equal(aclass, row[-1]):
                total = total + 1
            # If not, continue
            else:
                continue
        # Divide the number of occurrences by the total rows
        weight = float(total / num_total_rows)
        # Add that weight to the list of class weights
        class_weights.append(weight)

    # Turn the weights into a numpy array
    class_weights = numpy.asarray(class_weights)
    # Return the classes and their weights
    return classes, class_weights
# This function returns the entropy for a given dataset
# Can be used across an entire csv, or just for a column of data (feature)
def get_entropy(dataset):
    # Set initial entropy
    entropy = 0.0

    # Determine the classes and their frequencies (weights) of the dataset
    classes, class_freq = class_distribution(dataset)

    # numpy.sort returns a sorted copy, so reassign it; sorting puts the most
    # frequent class first but does not change the entropy sum
    class_freq = numpy.sort(class_freq)[::-1]

    # Determine the max entropy for the dataset
    max_entropy = math.log(len(classes), 2)
    print("MAX ENTROPY FOR THIS DATASET: ", max_entropy)

    # Loop through the frequencies and use the given formula to calculate entropy
    for freq in class_freq:
        entropy += float(-freq * math.log(freq, 2))

    # Return the entropy value
    return entropy
def main():
    ecol = csv_to_array('ecoli.csv')
    testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes = split_data(ecol)
    entropy = get_entropy(ecol)
    print(entropy)

main()
The following function was used to calculate entropy:
# Function to return Shannon's Entropy
def entropy(attributes, dataset, targetAttr):
    freq = {}
    entropy = 0.0
    index = 0

    # Find the index of the target attribute
    for item in attributes:
        if (targetAttr == item):
            break
        else:
            index = index + 1
    index = index - 1

    # Count the frequency of each value of the target attribute
    for item in dataset:
        if ((item[index]) in freq):
            # Increase the count
            freq[item[index]] += 1.0
        else:
            # Initialize the count at 1
            freq[item[index]] = 1.0

    # Sum -p * log2(p) over the value frequencies
    # (use a separate loop variable so the freq dict is not shadowed)
    for count in freq.values():
        entropy = entropy + (-count / len(dataset)) * math.log(count / len(dataset), 2)
    return entropy
As @MattTimmermans had indicated, the range of entropy values is actually contingent on the number of classes. For exactly 2 classes, entropy is contained in the 0 to 1 (inclusive) range. However, for more than 2 classes (which is what was being tested), the same formula can yield values up to log2 of the number of classes (converted to Pythonic code above). This post here explains those mathematics and calculations a bit more in detail.
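To make that bound concrete, here is a minimal sketch (plain Python with the standard math module) showing that a uniform distribution over k classes yields an entropy of exactly log2(k):

import math

# Entropy of a uniform distribution over k classes: the maximum possible value
def uniform_entropy(k):
    freqs = [1.0 / k] * k
    return sum(-f * math.log(f, 2) for f in freqs)

print(uniform_entropy(2))  # 1.0 -- the familiar 0..1 range only holds for 2 classes
print(uniform_entropy(8))  # 3.0 -- e.g. the 8 class labels in the ecoli dataset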
Related
I am working on an ID3 algorithm implementation. The issue I am running into is processing the branches from the new root attribute.
As the print output shows:
gain: 1.263221025628615 for Material
processing attribute Volume
processing branch 1 for Volume
processing branch 6 for Volume
processing branch 4 for Volume
processing branch 2 for Volume
processing branch 5 for Volume
processing branch 3 for Volume
gain: 0.6036978279454468 for Volume
attribute Venue has the max gain of 0.6036978279454468
removing Venue
new root Venue has branches [2 1]
The last step, Step 3, should filter the dataframe by the unique values of the selected attribute:
import pandas as pd
import numpy as np
from math import ceil, floor, log2
from sklearn.decomposition import PCA
from numpy import linalg as LA
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
def calculate_metrics(tp, tn, fn, p, n, fp):
    # Calculate the accuracy, error rate, sensitivity, specificity, and precision
    # for the selected classifier in reference to the corresponding test set.
    # Note the parentheses: the numerators must be summed before dividing.
    accuracy = (tp + tn) / (p + n)
    error_rate = (fp + fn) / (p + n)
    sensitivity = tp / p
    precision = tp / (tp + fp)
    specificity = tn / n
    display_metrics(accuracy, error_rate, sensitivity, precision, specificity)

def display_metrics(accuracy, error_rate, sensitivity, precision, specificity):
    print(f'Accuracy: {accuracy}, Error_rate: {error_rate}, Sensitivity: {sensitivity}, Precision: {precision}, Specificity: {specificity}')
def mc(columnName, training_set):
    # Message conveyed (entropy) of one column: -sum(p * log2(p))
    column = training_set[columnName]
    probs = column.value_counts(normalize=True)
    messageConveyed = -1 * np.sum(np.log2(probs) * probs)
    return messageConveyed

def isUnique(s):
    # True if every value in the Series is identical
    a = s.to_numpy()  # s.values (pandas < 0.24)
    return (a[0] == a).all()
def ID3(threshold, g):
    # Use the Assignment 2 training set to extract rules and test their quality
    # against the Assignment 2 test set for ID3.
    test_set = pd.read_csv("Assignment 2--Test set for ID3.csv")
    training_set = pd.read_csv("Assignment 2--Training set for ID3.csv")

    print('***********************************')
    print('TRAINING SET')
    print(training_set)
    print('***********************************')
    print('***********************************')
    print('TEST SET')
    print(test_set)
    print('***********************************')

    # Step 1 - Calculate MC (Message Conveyed) for the given data set
    # in reference to the class attribute
    print(f'Step 1- Calculate MC (Message Conveyed) for the given data set in reference to the class attribute')
    # For n classes: MC = -p1*log2(p1) - p2*log2(p2) - ... - pn*log2(pn)

    # For each column calculate the gain.
    numberOfColumns = 0
    mcDictionary = {}
    print('***********************************')
    print('For each column calculate the gain.')
    for (columnName, columnData) in training_set.items():  # .iteritems() in older pandas
        messageConveyed = mc(columnName, training_set)
        mcDictionary.update({columnName: round(messageConveyed)})
        numberOfColumns += 1
    print('***********************************')
    print(f'numberOfColumns {numberOfColumns}')
    print(f'mcDictionary {mcDictionary}')

    # The column with the highest gain is the root.
    print(f'The column with the highest gain is the root.')
    values = mcDictionary.values()
    max_value = max(values)
    print(f'The max value is {max_value}')
    columnWithMaximumInformationGain = list(mcDictionary.keys())[list(mcDictionary.values()).index(max_value)]
    print(f'The max value, {max_value}, is associated with column {columnWithMaximumInformationGain}')

    # Select the column with the max gain; this is the new root
    root = columnWithMaximumInformationGain
    print(f'root is {root}')
    print("******************************************")
    print("************** ROOT ******************")
    print(f"TF is {root}**********************")
    print("******************************************")

    print(f'isUnique = {isUnique(training_set[root])}')
    if isUnique(training_set[root]):
        return

    # Step 2 - Repeat for every attribute
    print(f'Step 2 - Repeat for every attribute')
    # Loop 1
    attribute = ""
    maximum = 0
    for (F, columnData) in training_set.items():  # .iteritems() in older pandas
        print(f'processing attribute {F}')
        # Loop 2
        uniques = training_set[F].unique()
        for k in uniques:
            print(f'processing branch {k} for {F}')
            # Calculate MC for the column
            messageConveyed = mc(F, training_set)
            # Calculate the weight for F
            F_D = training_set[F].count()
            TF_D = training_set[root].count()
            weight = F_D / TF_D
            total = weight * messageConveyed
        gain = mcDictionary[root] - total
        if gain > maximum:
            attribute = F
            maximum = gain
        print(f"gain: {gain} for {F}")
    print(f'attribute {attribute} has the max gain of {gain}')
    print(f'removing {attribute}')
    root = attribute
    print(f'new root {root} has branches {training_set[root].unique()}')
    del training_set[attribute]

    # Step 3 - Examine the dataset of each leaf
    print(f'')
def BayesClassifier(training_set, test_set):
    # Use the Assignment 2 training set for Bayes to classify the records
    # of the Assignment 2 test set for Bayes
    X = test_set.values
    Y = training_set.values
    clf = GaussianNB()
    clf.fit(X, Y)

# Prompt the user to select either the ID3 or the Bayes classifier.
selection = "ID3"  # = input("Please enter your selection for either ID3 or Bayes classification: ")
threshold = 0.9    # = input("Please enter a threshold: ")
g = 0.05           # = input("Please enter a value for g: ")
if selection == "ID3":
    ID3(threshold, g)
if selection == "Bayes":
    # note: the original called BayesClassifier() with no arguments, which would
    # raise a TypeError; the two frames must be loaded and passed in here
    BayesClassifier(training_set, test_set)
Given the training set
Venue,color,Model,Category,Location,weight,Veriety,Material,Volume
2,6,4,4,4,2,2,1,1
1,2,4,4,4,1,6,2,6
1,5,4,4,4,1,2,1,6
2,4,4,4,4,2,6,1,4
1,4,4,4,4,1,2,2,2
2,4,3,3,3,2,1,1,1
1,5,2,1,4,1,6,2,6
1,2,3,3,3,1,2,1,6
2,6,4,4,4,2,3,1,1
1,4,4,4,4,1,2,1,6
1,5,4,4,4,1,2,1,4
The data frame should be split into two frames by 1 and 2.
i.e.
Venue, color, Model....
1
1
1
1
1
1
1
1
Venue, color, Model....
2
2
2
2
2
2
2
2
2
Can someone explain how this can be done? Thanks.
This seems to do it:
unique_values = training_set[root].unique()
datasets = []
for unique_value in unique_values:
    print(f'processing for file : {unique_value} ')
    df_1 = training_set[training_set[attribute] > unique_value]
    df_2 = training_set[training_set[attribute] < unique_value]
    datasets.append(df_1)
    datasets.append(df_2)
del training_set[attribute]
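If the goal is strictly one sub-frame per branch value, an equality-based split via pandas' groupby is a more direct sketch (assuming root holds the selected attribute's column name, as in the code above):

# One sub-frame per unique branch value, dropping the split column itself
datasets = [group.drop(columns=root) for _, group in training_set.groupby(root)]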
This loop takes 5 hours to do its work. How can I speed it up? I read something about using numpy functions instead of pandas. I tried that, as you can see, but I am too new to Python to do it right. The big issue here is the high-dimensional data with 6000 columns. All of the data is static except for the random weights. How do I write better code?
import numpy as np
import pandas as pd
import os

# Covariance matrix as pandas DataFrame, 6000 columns x 6000 rows
cov = input_table_1.copy()
# Mean returns as pandas DataFrame, 6000 columns x 1800 rows
mean_returns = input_table_2.copy().squeeze()
# Looping number (note: 100.000 in the original is the float 100.0 in Python; use 100_000)
num_portfolios = 100_000
# Empty results matrix
results_matrix = np.zeros((len(cov.columns)+1, num_portfolios))
rf = 0

# Loop corpus
for i in range(num_portfolios):
    # Random numbers between 0 and 1 for every column
    weights = np.random.uniform(0, 1, len(cov.columns))
    # Ensure the sum of all random numbers is 1
    weights /= np.sum(weights)
    # Some easy math operations
    portfolio_return = np.sum(mean_returns * weights) * 252
    portfolio_std = np.sqrt(np.dot(weights.T, np.dot(cov, weights))) * np.sqrt(252)
    sharpe_ratio = (portfolio_return - rf) / portfolio_std
    # Write sharpe_ratio into the results matrix for every loop
    results_matrix[0, i] = sharpe_ratio
    # Iterate through the weight vector and add the data to the results array
    for j in range(len(weights)):
        results_matrix[j+1, i] = weights[j]

# Output table as pandas DataFrame
output_table = pd.DataFrame(results_matrix.T, columns=['sharpe'] + [ticker for ticker in list(cov.columns)])
There is no generic way to do that: first you must identify where your code is slow, and only then can you apply optimizations.
You have a nested loop, so the complexity is O(n^2). That is not a big deal here, because a lot of the work can be done with a vectorized approach.
In Python, creating new objects is slow. So, for example, if it fits in RAM, the np.random.uniform call can be done once up front and consumed during the loop, as sketched below.
The nested iteration can also be done in vectorized form; this is the best candidate for a performance gain.
In any case, I suggest using a tool like perf_tool, which will point you exactly at the slow pieces of code [*].
[*] I'm the main developer of this tool.
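A minimal sketch of the "generate once" idea, reusing cov and num_portfolios from the code above: draw all the weight vectors in a single call, normalize each row, then just index into the array inside the loop instead of calling np.random.uniform on every iteration:

n_assets = len(cov.columns)
# All random weights at once, one row per portfolio
all_weights = np.random.uniform(0, 1, size=(num_portfolios, n_assets))
# Normalize each row so its weights sum to 1
all_weights /= all_weights.sum(axis=1, keepdims=True)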
@AmilaMGunawardana Here is my first try with TensorFlow, but it is not fast enough. In the end I waited 5 hours for 100,000 rounds. Maybe I have to do something better?
perf_tool showed me that everything in the code is fast except this part, which takes 90% of the execution time:
vol_arr[x] = tnp.sqrt(tnp.dot(multi_randoms[x].T, np.dot(covData*252, multi_randoms[x])))
import numpy as np
import pandas as pd
import tensorflow.experimental.numpy as tnp
from perf_tool import PerfTool

# Covariance matrix as pandas DataFrame, 6000 columns x 6000 rows
covData = input_table_1.copy()
# Mean returns as pandas DataFrame, 6000 columns x 1800 rows
mean_returns = input_table_2.copy().squeeze()
# Looping number (named consistently; the original mixed num_portfolios and num_ports)
num_ports = 100_000
rf = 0

all_weights = np.zeros((num_ports, len(mean_returns.columns)))
ret_arr = pd.to_numeric(np.zeros(num_ports))
vol_arr = pd.to_numeric(np.zeros(num_ports))
sharpe_arr = pd.to_numeric(np.zeros(num_ports))
multi_randoms = np.random.normal(0, 1., size=(num_ports, len(covData.columns)))

def main():
    for x in range(num_ports):
        with PerfTool('preparation1'):
            # Save weights
            all_weights[x, :] = multi_randoms[x]
        with PerfTool('preparation2'):
            # Expected return
            ret_arr[x] = tnp.sum(mean_returns * multi_randoms[x] * 252)
        with PerfTool('preparation3'):
            # Expected volatility
            vol_arr[x] = tnp.sqrt(tnp.dot(multi_randoms[x].T, np.dot(covData*252, multi_randoms[x])))
        with PerfTool('preparation4'):
            # Sharpe ratio (note the parentheses around the numerator)
            sharpe_arr[x] = (ret_arr[x] - rf) / vol_arr[x]

PerfTool.set_enabled()
main()
PerfTool.show_stats_if_enabled()
This shows one way of getting better performance with parallel loading. How could I get rid of the loop entirely? Is there a way to do these calculations in just one step, using the all_weights array once instead of looping over it?
import pandas as pd
import numpy as np
from perf_tool import PerfTool, perf_tool
from joblib import Parallel, delayed, parallel_backend

# Covariance matrix as pandas DataFrame, 6000 columns x 6000 rows
covData = input_table_1.copy()
# Mean returns as pandas DataFrame, 6000 columns x 1800 rows
mean_returns = input_table_2.copy().squeeze()
# Looping number
num_ports = 100_000
rf = 0

# Pre-generate all normalized weight vectors, one row per portfolio
all_weights = np.zeros((num_ports, len(mean_returns.columns)))
for x in range(num_ports):
    weights = np.array(np.random.random(len(mean_returns.columns)))
    weights = weights / np.sum(weights)
    all_weights[x, :] = weights

ret_arr = pd.to_numeric(np.zeros(num_ports))
vol_arr = pd.to_numeric(np.zeros(num_ports))
sharpe_arr = pd.to_numeric(np.zeros(num_ports))

def test(x):
    # Load the pre-generated weights
    weights = all_weights[x]
    # Expected return
    ret_arr[x] = np.sum(mean_returns * weights * 252)
    # Expected volatility
    vol_arr[x] = np.sqrt(np.dot(weights.T, np.dot(covData*252, weights)))
    # Sharpe ratio (parenthesize the numerator)
    return x, (ret_arr[x] - rf) / vol_arr[x]

sharpe = []
weighttable = []
weighttable, sharpe = zip(*Parallel(n_jobs=-1)([delayed(test)(i) for i in range(num_ports)]))
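To drop the loop entirely, the whole computation can be expressed as array operations over all_weights in one go. This is a sketch under the assumptions above (covData is the 6000x6000 covariance DataFrame; mean_returns is reduced to one mean value per asset first); np.einsum computes every portfolio's quadratic form w.T @ C @ w in a single call:

# Assumption: one mean return per asset (reduce the time series first)
mu = np.asarray(mean_returns.mean(axis=0))
C = covData.values * 252

rets = all_weights @ mu * 252  # expected return of every portfolio at once
vols = np.sqrt(np.einsum('ij,jk,ik->i', all_weights, C, all_weights))  # all w.T @ C @ w
sharpe_arr = (rets - rf) / vols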
So I have 1000 samples of class 1 and 2500 samples of class 2. Naturally, when using sklearn's train_test_split(test_size=200, stratify=y), I get an imbalanced test set, since it preserves the class distribution of the original data set. However, I would like to have 100 samples of class 1 and 100 samples of class 2 in the test set.
How would I do it? Any suggestions would be appreciated.
Split Manually
A manual solution isn't that scary. Main steps explained:
Isolate the index of class-1 and class-2 rows.
Use np.random.permutation() to select random n1 and n2 test samples for class 1 and 2 respectively.
Use df.index.difference() to perform inverse selection for the train samples.
The code can easily be generalized to an arbitrary number of classes and arbitrary numbers of test samples (just put n1/n2, idx1/idx2, etc. into lists and process them in loops), but that's out of the scope of the question itself; a short sketch follows after the code below.
Code
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd

# data
df = pd.DataFrame(
    data={
        "label": np.array([1]*1000 + [2]*2500),
        # label 1 has value > 0, label 2 has value < 0
        "value": np.hstack([np.random.uniform(0, 1, 1000),
                            np.random.uniform(-1, 0, 2500)])
    }
)
df = df.sample(frac=1).reset_index(drop=True)

# sampling number for each class
n1 = 100
n2 = 100

# 1. get indexes and lengths for the classes respectively
idx1 = df.index.values[df["label"] == 1]
idx2 = df.index.values[df["label"] == 2]
len1 = len(idx1)  # 1000
len2 = len(idx2)  # 2500

# 2. draw index for test dataset
draw1 = np.random.permutation(len1)[:n1]  # keep the first n1 entries to be selected
idx1_test = idx1[draw1]
draw2 = np.random.permutation(len2)[:n2]
idx2_test = idx2[draw2]

# combine the drawn indexes
idx_test = np.hstack([idx1_test, idx2_test])

# 3. derive index for train dataset
idx_train = df.index.difference(idx_test)

# split
df_train = df.loc[idx_train, :]  # optional: .reset_index(drop=True)
df_test = df.loc[idx_test, :]

# len(df_train) = 3300
# len(df_test) = 200

# verify that no row was missing
idx_merged = np.hstack([df_train.index.values, df_test.index.values])
assert len(np.unique(idx_merged)) == 3500
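As for the generalization mentioned above, a compact sketch (assuming pandas >= 1.1, where DataFrameGroupBy.sample is available) draws the same number of test rows from every class in one call:

# n_test rows per class, whatever the number of classes
n_test = 100
df_test = df.groupby("label").sample(n=n_test)
df_train = df.loc[df.index.difference(df_test.index)]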
Context: I am working on building a recommender system from implicit feedback (orders) using the implicit library in Python.
Issue: when trying to tune the parameters in order to find the best ones to use, the output is not looping over all variables and is not calculating the AUC. How can I make sure it loops over all combinations, and adds a combination to the dictionary when it leads to the highest AUC score?
Also, please feel free to let me know if there is a library I can use to tune this, as I did not figure out how to use GridSearchCV, for example, for this use case (ALS model).
In the code:
training_set2 - the altered version of the original training_set, with a certain percentage of the user-item pairs that originally had an interaction set back to zero.
validation_set - a copy of the original training_set matrix, unaltered, so it can be used to see how the rank order compares with the actual interactions.
Expected output: a dictionary with all the combinations, sorted by AUC, with the last one being the combination that has the highest AUC score. This combination will be the one I use for my test set.
from sklearn import metrics
import numpy as np

def auc_score(predictions, test):
    fpr, tpr, thresholds = metrics.roc_curve(test, predictions)
    return metrics.auc(fpr, tpr)

def calc_mean_auc(training_set, altered_users, predictions, test_set):
    '''
    This function calculates the mean AUC by user for any user that had
    their user-item matrix altered.
    '''
    store_auc = []  # An empty list to store the AUC for each user that had an item removed from the training set
    item_vecs = predictions[1]
    for user in altered_users:  # Iterate through each user that had an item altered
        training_row = training_set[user, :].toarray().reshape(-1)  # Get the training set row
        zero_inds = np.where(training_row == 0)  # Find where the interaction had not yet occurred
        # Get the predicted values based on our user/item vectors,
        # keeping only the items that were originally zero, i.e. all ratings
        # from the MF prediction for this user that originally had no interaction
        user_vec = predictions[0][user, :]
        pred = user_vec.dot(item_vecs).toarray()[0, zero_inds].reshape(-1)
        # Select the binarized yes/no interaction pairs from the original full data
        # that align with the same pairs in training
        actual = test_set[user, :].toarray()[0, zero_inds].reshape(-1)
        store_auc.append(auc_score(pred, actual))  # Calculate AUC for the given user and store it
    return float('%.3f' % np.mean(store_auc))
...
latent_factors = [5, 10, 20, 40, 80]
regularizations = [0.01, 0.1, 1., 10., 100.]
regularizations.sort()
iter_array = [1, 2, 5, 10, 25, 50, 100]

best_params = {}
best_params['n_factors'] = latent_factors[0]
best_params['reg'] = regularizations[0]
best_params['n_iter'] = 0
best_params['auc_result'] = np.inf
best_params['model'] = None

for fact in latent_factors:
    print('Factors: {}'.format(fact))
    for reg in regularizations:
        print('Regularization: {}'.format(reg))
        for ite in iter_array:
            print('Iteration: {}'.format(ite))
            model = implicit.als.AlternatingLeastSquares(
                factors=fact,
                regularization=reg,
                iterations=ite)
            model.fit((training2_set.T * 15).astype('double'))
            customers_vecs = model.user_factors
            restaurant_vecs = model.item_factors
            auc_result = calc_mean_auc(training2_set, cust_altered2,
                                       [sparse.csr_matrix(customers_vecs), sparse.csr_matrix(restaurant_vecs.T)],
                                       validation_set)
            if auc_result > best_params['auc_result']:
                best_params['n_factors'] = fact
                best_params['reg'] = reg
                best_params['n_iter'] = ite
                best_params['auc_result'] = auc_result
                best_params['model'] = 'AlternatingLeastSquare'
                print('New optimal hyperparameters')
            print(pd.Series(best_params))
I cannot post a picture, but this is the output that I get:
Factors: 5
Regularization: 0.01
Iteration: 1
n_factors        5.00
reg              0.01
n_iter           0.00
auc_result        inf
model             NaN
dtype: float64
Iteration: 2
n_factors        5.00
reg              0.01
n_iter           0.00
auc_result        inf
model             NaN
dtype: float64
Iteration: 5
You are only keeping the best parameters, via the condition if auc_result > best_params['auc_result']: (which, starting from np.inf, is never true; that is why the printed series never changes).
Use a collections.defaultdict to accumulate the parameters of all combinations in lists:
...
from collections import defaultdict

best_params = defaultdict(list)
##best_params = {}
##best_params['n_factors'] = latent_factors[0]
##best_params['reg'] = regularizations[0]
##best_params['n_iter'] = 0
##best_params['auc_result'] = np.inf
##best_params['model'] = None
...
for fact in latent_factors:
    ...
            ...
            ##if auc_result > best_params['auc_result']:
            best_params['n_factors'].append(fact)
            best_params['reg'].append(reg)
            best_params['n_iter'].append(ite)
            best_params['auc_result'].append(auc_result)
            best_params['model'].append('AlternatingLeastSquare')
            ##print('New optimal hyperparameters')
It appears you are using pandas, so build a DataFrame and sort it:
df = pd.DataFrame(best_params)
df = df.sort_values('auc_result')
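Sorted ascending on auc_result, the best combination then sits in the last row; a short usage sketch:

best = df.sort_values('auc_result').iloc[-1]
print(best['n_factors'], best['reg'], best['n_iter'], best['auc_result'])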
I am learning how to build a simple linear model to predict a flat's price based on its square meters and the number of rooms. I have a .csv data set with several features, and of course 'Price' is one of them, but it contains several suspicious values like '1' or '4000'. I want to remove these values based on the mean and standard deviation, so I use the following function to remove the outliers:
import numpy as np
import pandas as pd

def reject_outliers(data):
    u = np.mean(data)
    s = np.std(data)
    data_filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return data_filtered
Then I construct a function to build the linear regression:
def linear_regression(data):
    data_filtered = reject_outliers(data['Price'])
    print(len(data_filtered))  # based on the length I see that several outliers have been removed
The next step is to define the data/predictors. I set my features:
features = data[['SqrMeters', 'Rooms']]
target = data_filtered
X = features
Y = target
And here is my question: how can I get the same set of observations for my X and Y? Right now I have inconsistent numbers of samples (5000 for my X and 4995 for my Y after removing the outliers). Thank you for any help on this topic.
The features and labels should have the same length, so you should pass the whole data object to reject_outliers:
def reject_outliers(data):
    u = np.mean(data["Price"])
    s = np.std(data["Price"])
    data_filtered = data[(data["Price"] > (u - 2*s)) & (data["Price"] < (u + 2*s))]
    return data_filtered
You can use it in this way:
data_filtered = reject_outliers(data)
features = data_filtered[['SqrMeters', 'Rooms']]
target = data_filtered['Price']
X = features
y = target
The following works for pandas DataFrames (data):
def reject_outliers(data):
    u = np.mean(data.Price)
    s = np.std(data.Price)
    data_filtered = data[(data.Price > u - 2*s) & (data.Price < u + 2*s)]
    return data_filtered
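Putting it together with the model from the question, a minimal sketch (assuming scikit-learn and the column names above) keeps X and y aligned by filtering whole rows first and slicing afterwards:

from sklearn.linear_model import LinearRegression

data_filtered = reject_outliers(data)       # filter whole rows first
X = data_filtered[['SqrMeters', 'Rooms']]   # then take the features...
y = data_filtered['Price']                  # ...and the target from the same frame
model = LinearRegression().fit(X, y)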