genetic algorithm, implementation of generation of new population

I have a problem implementing the generation of a new population. I have a population of matrices and a fitness function, and I need the fitness value to be as low as possible. So I implemented a function that re-creates the population, keeping only the best individuals from the previous one, as follows:
def new_generation(old_generation, real_coords, elit_rate = ELIT, mutation_rate = MUTATION_RATE, half = HALF, mtype = "strong"):
    fit = [fitness(individual, real_coords) for individual in old_generation]
    idx = np.argsort(fit)
    print(idx)
    new_gen = []
    for i in range(n_population):
        if i < ELIT:
            new_gen.append(old_generation[idx[i]])
        else:
            new_gen.append(crossover(old_generation[idx[np.random.randint(0,n_population)]], old_generation[idx[np.random.randint(0,n_population)]]))
    for i in range(n_population):
        if np.random.rand() < mutation_rate:
            if i > ELIT:
                new_gen[i] = mutate(new_gen[i],mtype)
    print("new gen")
    print([fitness(individual, real_coords) for individual in new_gen])
    return new_gen
My problem is that the new generation I get is not ordered so that the first individuals have the lowest fitness.
(ELIT is 10 and n_population is 100.)
I think my problem is in the part where I do:
    for i in range(n_population):
        if i < ELIT:
            new_gen.append(old_generation[idx[i]])
because, as I understand it, this should guarantee that the first individuals in new_gen are the ones I want.
Where am I wrong?
Thank you!
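The elitist branch by itself does place the ELIT lowest-fitness individuals first, in ascending order. One possible explanation, which is only an assumption because crossover and mutate are not shown, is that one of them modifies its argument in place: new_gen then holds references to the same arrays as old_generation, so an elite can be altered after it has been appended. Note also that only the first ELIT entries are sorted (the children come out in random order) and that the i > ELIT test leaves index ELIT unmutated. A minimal sketch that copies every individual before reusing it; the function name and its parameters are hypothetical, not the asker's API:

```python
import numpy as np

def new_generation_sketch(old_generation, fitness_fn, crossover_fn, mutate_fn,
                          n_elite=10, mutation_rate=0.1, rng=None):
    """Elitist replacement that never aliases arrays of the old generation.

    fitness_fn, crossover_fn and mutate_fn are placeholders for the
    asker's own functions (lower fitness is better).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(old_generation)
    fit = np.array([fitness_fn(ind) for ind in old_generation])
    order = np.argsort(fit)  # ascending: best individuals first

    # Copy the elites so later in-place operations cannot change them.
    new_gen = [old_generation[i].copy() for i in order[:n_elite]]

    # Fill the rest with children built from copies of random parents.
    while len(new_gen) < n:
        a, b = rng.integers(0, n, size=2)
        child = crossover_fn(old_generation[a].copy(), old_generation[b].copy())
        if rng.random() < mutation_rate:
            child = mutate_fn(child)
        new_gen.append(child)
    return new_gen
```

With this layout the first n_elite fitness values of the new generation equal the n_elite smallest values of the old one, even when crossover_fn or mutate_fn work in place.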

Related

Heuristic to choose five column arrays that maximise the dot product

I have a sparse 60000x10000 matrix M where each element is either a 1 or 0. Each column in the matrix is a different combination of signals (ie. 1s and 0s). I want to choose five column vectors from M and take the Hadamard (ie. element-wise) product of them; I call the resulting vector the strategy vector. After this step, I compute the dot product of this strategy vector with a target vector (that does not change). The target vector is filled with 1s and -1s such that having a 1 in a specific row of the strategy vector is either rewarded or penalised.
Is there some heuristic or linear algebra method that I could use to help me pick the five vectors from the matrix M that result in a high dot product? I don't have any experience with Google's OR tools nor Scipy's optimization methods so I am not too sure if they can be applied to my problem. Advice on this would be much appreciated! :)
Note: the five column vectors given as the solution do not need to be optimal; I'd rather have something that does not take months or years to run.

First of all, thanks for a good question. I don't get to practice numpy that often. Also, I don't have much experience in posting to SE, so any feedback, code critique, and opinions relating to the answer are welcome.
This was an attempt at finding an optimal solution at first, but I didn't manage to deal with the complexity. The algorithm should, however, give you a greedy solution that might prove to be adequate.
Colab Notebook (Python code + Octave validation)
Core Idea
Note: During runtime, I've transposed the matrix. So, the column vectors in the question correspond to row vectors in the algorithm.
Notice that you can multiply the target by one vector at a time, effectively getting a new target with some 0s in it. These 0s will never change, so you can skip some computation by removing those rows (columns, in the algorithm) entirely from all further steps, both from the target and from the matrix. You are then left with a valid target again (only 1s and -1s in it).
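A small sketch of this reduction, assuming the vectors are stored as rows (as in the algorithm). The function name reduced_score is hypothetical. Dropping the positions where a picked vector is 0, one vector at a time, leaves a target whose plain sum equals the dot product of the Hadamard product with the original target:

```python
import numpy as np

def reduced_score(target, vectors):
    """Score a set of 0/1 row vectors by shrinking the target step by step."""
    target = np.asarray(target, dtype=np.int64)
    vectors = np.asarray(vectors, dtype=np.int64)
    for k in range(len(vectors)):
        keep = vectors[k] == 1      # positions that survive this vector
        target = target[keep]       # the target stays made of 1s and -1s
        vectors = vectors[:, keep]  # every vector shrinks the same way
    return int(target.sum())

target = np.array([1, -1, 1, 1])
vectors = np.array([[1, 1, 0, 1],
                    [1, 0, 1, 1]])
direct = int(vectors.prod(axis=0) @ target)  # Hadamard product, then dot product
print(reduced_score(target, vectors), direct)
```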
That's the basic idea of the algorithm. Given:
n: number of vectors you need to pick
b: number of best vectors to check
m: complexity of matrix operations to check one vector
Do an exponentially-complex O((n*m)^b) depth-first search, but decrease the complexity of the calculations in deeper layers by reducing target/matrix size, while cutting down a few search paths with some heuristics.
Heuristics used
The best score achieved so far is known in every recursion step. Compute an optimistic vector (turn -1 to 0) and check what scores can still be achieved. Do not search in levels where the score cannot be surpassed.
This is useless if the best vectors in the matrix have 1s and 0s equally distributed. The optimistic scores are just too high. However, it gets better with more sparsity.
Ignore duplicates. Basically, do not check duplicate vectors in the same layer. Because we reduce the matrix size, the chance for ending up with duplicates increases in deeper recursion levels.
Further Thoughts on Heuristics
The most valuable ones are those that eliminate vector choices at the start. There's probably a way to find vectors that are worse-or-equal than others, with respect to their effects on the target. Say, if v1 only differs from v2 by an extra 1, and the target has a -1 in that row, then v1 is worse-or-equal than v2.
The problem is that we need to find more than 1 vector, and can't readily discard the rest. If we have 10 vectors, each worse-or-equal than the one before, we still have to keep 5 at the start (in case they're still the best option), then 4 in the next recursion level, 3 in the following, etc.
Maybe it's possible to produce a tree and pass it on into the recursion? Still, that doesn't help trim down the search space at the start... Maybe it would help to only consider 1 or 2 of the vectors in the worse-or-equal chain? That would explore more diverse solutions, but doesn't guarantee a better result.
Warning: Note that the MATRIX and TARGET in the example are in int8. If you use these for the dot product, it will overflow. Though I think all operations in the algorithm are creating new variables, so are not affected.
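To illustrate the warning, here is a small sketch on a made-up array (200 ones, not the data above). A dot product of two int8 arrays is computed in int8 and wraps around silently, whereas sum() uses a wider accumulator for small integer types by default, which is presumably why the sums in the algorithm are not affected:

```python
import numpy as np

# Hypothetical example data: 200 ones stored as int8.
a = np.ones(200, dtype=np.int8)

# int8 @ int8 stays int8, so the true value (200) cannot be represented.
wrapped = a @ a

# sum() over an int8 array accumulates in the platform integer by default,
# and an explicit cast makes the dot product safe as well.
safe_sum = (a * a).sum()
safe_dot = a.astype(np.int64) @ a.astype(np.int64)

print(wrapped.dtype, safe_sum, safe_dot)
```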
Code
import numpy as np
# Given:
TARGET = np.random.choice([1, -1], size=60000).astype(np.int8)
MATRIX = np.random.randint(0, 2, size=(10000,60000), dtype=np.int8)
# Tunable - increase to search more vectors, at the cost of time.
# Performs better if the best vectors in the matrix are sparse
MAX_BRANCHES = 3 # can give more for sparser matrices
# Usage (run this after the function below has been defined)
score, picked_vectors_idx = pick_vectors(TARGET, MATRIX, 5)
# Function
def pick_vectors(init_target, init_matrix, vectors_left_to_pick: int, best_prev_result=float("-inf")):
    assert vectors_left_to_pick >= 1
    if init_target.shape == (0, ) or len(init_matrix.shape) <= 1 or init_matrix.shape[0] == 0 or init_matrix.shape[1] == 0:
        return float("inf"), None
    target = init_target.copy()
    matrix = init_matrix.copy()
    neg_matrix = np.multiply(target, matrix)
    neg_matrix_sum = neg_matrix.sum(axis=1)
    if vectors_left_to_pick == 1:
        picked_id = np.argmax(neg_matrix_sum)
        score = neg_matrix[picked_id].sum()
        return score, [picked_id]
    else:
        sort_order = np.argsort(neg_matrix_sum)[::-1]
        sorted_sums = neg_matrix_sum[sort_order]
        sorted_neg_matrix = neg_matrix[sort_order]
        sorted_matrix = matrix[sort_order]
        best_score = best_prev_result
        best_picked_vector_idx = None
        # Heuristic 1 (H1) - optimistic target.
        # Set a maximum score that can still be achieved
        optimistic_target = target.copy()
        optimistic_target[target == -1] = 0
        if optimistic_target.sum() <= best_score:
            # This check can be removed - the scores are too high at this point
            return float("-inf"), None
        # Heuristic 2 (H2) - ignore duplicates
        vecs_tried = set()
        # MAIN GOAL: for picked_id, picked_vector in enumerate(sorted_matrix):
        for picked_id, picked_vector in enumerate(sorted_matrix[:MAX_BRANCHES]):
            # H2
            picked_tuple = tuple(picked_vector)
            if picked_tuple in vecs_tried:
                continue
            else:
                vecs_tried.add(picked_tuple)
            # Discard picked vector
            new_matrix = np.delete(sorted_matrix, picked_id, axis=0)
            # Discard matrix and target rows where vector is 0
            ones = np.argwhere(picked_vector == 1).squeeze()
            new_matrix = new_matrix[:, ones]
            new_target = target[ones]
            if len(new_matrix.shape) <= 1 or new_matrix.shape[0] == 0:
                return float("-inf"), None
            # H1: Do not compute if best score cannot be improved
            new_optimistic_target = optimistic_target[ones]
            optimistic_matrix = np.multiply(new_matrix, new_optimistic_target)
            optimistic_sums = optimistic_matrix.sum(axis=1)
            optimistic_viable_vector_idx = optimistic_sums > best_score
            if optimistic_sums.max() <= best_score:
                continue
            new_matrix = new_matrix[optimistic_viable_vector_idx]
            score, next_picked_vector_idx = pick_vectors(new_target, new_matrix, vectors_left_to_pick - 1, best_prev_result=best_score)
            if score <= best_score:
                continue
            # Convert idx of trimmed-down matrix into sorted matrix IDs
            for i, returned_id in enumerate(next_picked_vector_idx):
                # H1: Loop until you hit the required number of 'True'
                values_passed = 0
                j = 0
                while True:
                    value_picked: bool = optimistic_viable_vector_idx[j]
                    if value_picked:
                        values_passed += 1
                        if values_passed-1 == returned_id:
                            next_picked_vector_idx[i] = j
                            break
                    j += 1
                # picked_vector index
                if returned_id >= picked_id:
                    next_picked_vector_idx[i] += 1
            best_score = score
            # Convert from sorted matrix to input matrix IDs before returning
            matrix_id = sort_order[picked_id]
            next_picked_vector_idx = [sort_order[x] for x in next_picked_vector_idx]
            best_picked_vector_idx = [matrix_id] + next_picked_vector_idx
        return best_score, best_picked_vector_idx

Maybe it's too naive, but the first thing that occurs to me is to choose the 5 columns with the shortest distance to the target:
import numpy as np
import scipy.sparse
from sklearn.metrics.pairwise import pairwise_distances
def sparse_prod_axis0(A):
    """Sparse equivalent of np.prod(arr, axis=0)
    From https://stackoverflow.com/a/44321026/3381305
    """
    valid_mask = A.getnnz(axis=0) == A.shape[0]
    out = np.zeros(A.shape[1], dtype=A.dtype)
    out[valid_mask] = np.prod(A[:, valid_mask].A, axis=0)
    return np.matrix(out)
def get_strategy(M, target, n=5):
    """Guess n best vectors.
    """
    dists = np.squeeze(pairwise_distances(X=M, Y=target))
    idx = np.argsort(dists)[:n]
    return sparse_prod_axis0(M[idx])
# Example data.
M = scipy.sparse.rand(m=6000, n=1000, density=0.5, format='csr').astype('bool')
target = np.atleast_2d(np.random.choice([-1, 1], size=1000))
# Try it.
strategy = get_strategy(M, target, n=5)
result = strategy @ target.T
It strikes me that you could add another step of taking the top few percent from the M–target distances and check their mutual distances — but this could be quite expensive.
I have not checked how this compares to an exhaustive search.
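For a tiny instance, a brute-force baseline over every combination can serve as a reference when judging a heuristic. This is only a sketch under assumptions: the signals are stored as columns of a dense 0/1 array, the function name is hypothetical, and the approach is far too slow for the full 60000x10000 problem.

```python
import itertools
import numpy as np

def best_by_brute_force(columns, target, n_pick):
    """Try every combination of n_pick columns and return the best score."""
    columns = np.asarray(columns, dtype=np.int64)
    target = np.asarray(target, dtype=np.int64)
    best_score, best_idx = None, None
    for idx in itertools.combinations(range(columns.shape[1]), n_pick):
        strategy = columns[:, list(idx)].prod(axis=1)  # Hadamard product
        score = int(strategy @ target)
        if best_score is None or score > best_score:
            best_score, best_idx = score, idx
    return best_score, best_idx

# Hand-made example: 4 rows, 3 candidate columns, pick 2.
columns = np.array([[1, 1, 0],
                    [1, 1, 1],
                    [1, 0, 1],
                    [1, 1, 0]])
target = np.array([1, 1, -1, 1])
print(best_by_brute_force(columns, target, 2))  # (3, (0, 1))
```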

Creating an efficient local search technique in Python for Latin Squares

I need to create a local search technique using the cost function below. The new function should randomly swap entries of the current solution in the Latin square and then calculate the cost; if the cost is better than that of the current solution, the swapped square becomes the new solution. This should repeat until either the cost is 0 or enough iterations have been done. Any help at all would be massively appreciated. Thanks!
def cost(sol):
    nolist1 = [i for i in range(0,dim)]
    costcol = []
    missno = []
    for i in range(0,dim):
        nolist1 = [i for i in range(0,dim)]
        for j in range(0,dim):
            for k in range(0,len(nolist1)):
                if sol[j][i] not in nolist1:
                    continue
                elif sol[j][i] == nolist1[k]:
                    nolist1.remove(sol[j][i])
        missno.append(nolist1)
        costcol.append(len(missno[i]))
    totalcost = sum(costcol)
    return(totalcost,costcol,missno)
cost = cost(sol)
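One way the requested loop could look is sketched below. It rests on assumptions not stated in the question: every row of sol is already a permutation of 0..dim-1, dim is at least 2, swaps are made inside a single row so that rows stay valid, and the cost is the number of symbols missing from each column (which should match the total returned by cost above). The names column_cost and local_search are hypothetical.

```python
import random

def column_cost(sol):
    """Total number of symbols missing from the columns of a square grid."""
    dim = len(sol)
    return sum(dim - len({sol[r][c] for r in range(dim)}) for c in range(dim))

def local_search(sol, max_iters=10_000, seed=None):
    """Swap two cells of a random row; keep the swap if the cost does not rise."""
    rng = random.Random(seed)
    current = [row[:] for row in sol]  # work on a copy of the input
    best = column_cost(current)
    dim = len(current)
    for _ in range(max_iters):
        if best == 0:
            break  # a valid Latin square has been reached
        r = rng.randrange(dim)
        a, b = rng.sample(range(dim), 2)
        current[r][a], current[r][b] = current[r][b], current[r][a]
        new_cost = column_cost(current)
        if new_cost <= best:
            best = new_cost  # accept improving and sideways moves
        else:
            # Undo the swap because it made the square worse.
            current[r][a], current[r][b] = current[r][b], current[r][a]
    return current, best

start = [list(range(5)) for _ in range(5)]  # every column repeats one symbol
square, final_cost = local_search(start, max_iters=2000, seed=1)
print(final_cost)
```

Accepting equal-cost moves lets the search drift across plateaus; like any simple local search, it can still stall before reaching cost 0.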

Generalized Random Response for local differential privacy implementation

I have been tasked with implementing a local (non-interactive) differential privacy mechanism. I am working with a large database of census data. The only sensitive attribute is "Number of children" which is a numerical value ranging from 0 to 13.
I decided to go with the Generalized Random Response mechanism as it seems like the most intuitive method. This mechanism is described here and presented here.
After loading each value into an array (ignoring the other attributes for now), I perform the perturbation as follows.
d = 14 # values may range from 0 to 13
eps = 1 # epsilon level of privacy
p = (math.exp(eps)/(math.exp(eps)+d-1))
q = 1/(math.exp(eps)+d-1)
p_dataset = []
for row in dataset:
    coin = random.random()
    if coin <= p:
        p_dataset.append(row)
    else:
        p_dataset.append(random.randint(0,13))
Unless I have misinterpreted the definition, I believe this will guarantee epsilon differential privacy on p_dataset.
However, I am having difficulty understanding how the aggregator must interpret this dataset. Following the presentation above, I attempted to implement a method for estimating the number of individuals who answered a particular value.
v = 0 # we are estimating the number of individuals in the dataset who answered 0
nv = 0 # number of users in the perturbed dataset who answered the value
for row in p_dataset:
    if row == v:
        nv += 1
Iv = nv * p + (n - nv) * q
estimation = (Iv - (n*q)) / (p-q)
I do not know if I have correctly implemented the method described as I do not completely understand what it is doing, and cannot find a clear definition.
Regardless, I used this method to estimate the total amount of individuals who answered each value in the dataset with a value for epsilon ranging from 1 to 14, and then compared this to the actual values. The results are below (please excuse the formatting).
As you can see, the utility of the dataset suffers greatly for low values of epsilon. Additionally, when executed multiple times, there was relatively little deviation in estimations, even for small values of epsilon.
For example, when estimating the number of participants who answered 0 and using an epsilon of 1, all estimations seemed to be centered around 1600, with the largest distance between estimations being 100. Considering the actual value of this query is 5969, I am led to believe that I may have implemented something incorrectly.
Is this the expected behaviour of the Generalized Random Response mechanism, or have I made a mistake in my implementation?

I think that when generating a false answer we cannot directly use p_dataset.append(random.randint(0,13)), because its result can still be the true answer.

max_v = 13
min_v = 0
for row in dataset: # row is the true value from the dataset
    coin = random.random()
    if coin <= p:
        p_dataset.append(row)
    else:
        ans = []
        if row == min_v:
            ans = np.arange(min_v + 1, max_v + 1).tolist()
        elif row == max_v:
            ans = np.arange(min_v, max_v).tolist()
        else:
            a = np.arange(min_v, row).tolist()
            b = np.arange(row + 1, max_v + 1).tolist()
            [ans.append(i) for i in a]
            [ans.append(i) for i in b]
        # The false answer must be one of the values other than the true one;
        # random.randint(0, 13) would still include the true value.
        p_dataset.append(random.sample(ans, 1)[0])
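On the aggregation side, a sketch of the usual frequency estimator may help. It follows from the expected count of reports equal to v: if c_v users truly hold v, then E[n_v] = c_v*p + (n - c_v)*q, so c_v can be estimated as (n_v - n*q) / (p - q), applied directly to the observed count n_v. The function names below are hypothetical, and the code assumes each false answer is drawn uniformly from the other d-1 values.

```python
import math
import random
from collections import Counter

def grr_perturb(value, domain, eps, rng):
    """Report the true value with probability p, otherwise another value."""
    d = len(domain)
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    if rng.random() < p:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)  # each other value has probability q overall

def grr_estimate(reports, domain, eps):
    """Unbiased estimate of the true count of every value in the domain."""
    n, d = len(reports), len(domain)
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    q = 1.0 / (math.exp(eps) + d - 1)
    counts = Counter(reports)
    return {v: (counts[v] - n * q) / (p - q) for v in domain}

rng = random.Random(0)
domain = list(range(14))                            # values 0..13
truth = [rng.randrange(14) for _ in range(2000)]    # made-up data
reports = [grr_perturb(v, domain, 1.0, rng) for v in truth]
estimates = grr_estimate(reports, domain, 1.0)
print(round(sum(estimates.values())))
```

The estimates always sum to n, but for small epsilon each individual estimate has a large variance, so noisy counts are expected; a consistent bias across runs, on the other hand, would point to an implementation problem.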

Simple Genetic Algorithm meeting local optimum for "Hello World"

My goal was simple: use a genetic algorithm to reproduce the classic "Hello, World" string.
My code was based on this post. The code mainly contains 4 parts:
Generate a population made up of several different individuals.
Define the fitness and grade functions, which evaluate how good or bad an individual is by comparing it with the target.
Filter the population and keep len(pop)*retain individuals.
Add some other individuals and mutate randomly.
The parents' DNA is passed on to their children to make up the whole population.
I modified the code as shown here:
import numpy as np
import string
from operator import add
from random import random, randint
def population(GENSIZE,target):
    p = []
    for i in range(0,GENSIZE):
        individual = [np.random.choice(list(string.printable[:-5])) for j in range(0,len(target))]
        p.append(individual)
    return p
def fitness(source, target):
    fitval = 0
    for i in range(0,len(source)-1):
        fitval += (ord(target[i]) - ord(source[i])) ** 2
    return (fitval)
def grade(pop, target):
    'Find average fitness for a population.'
    summed = reduce(add, (fitness(x, target) for x in pop))
    return summed / (len(pop) * 1.0)
def evolve(pop, target, retain=0.2, random_select=0.05, mutate=0.01):
    graded = [ (fitness(x, target), x) for x in p]
    graded = [ x[1] for x in sorted(graded)]
    retain_length = int(len(graded)*retain)
    parents = graded[:retain_length]
    # randomly add other individuals to
    # promote genetic diversity
    for individual in graded[retain_length:]:
        if random_select > random():
            parents.append(individual)
    # mutate some individuals
    for individual in parents:
        if mutate > random():
            pos_to_mutate = randint(0, len(individual)-1)
            individual[pos_to_mutate] = chr(ord(individual[pos_to_mutate]) + np.random.randint(-1,1))
    #
    parents_length = len(parents)
    desired_length = len(pop) - parents_length
    children = []
    while len(children) < desired_length:
        male = randint(0, parents_length-1)
        female = randint(0, parents_length-1)
        if male != female:
            male = parents[male]
            female = parents[female]
            half = len(male) / 2
            child = male[:half] + female[half:]
            children.append(child)
    parents.extend(children)
    return parents
GENSIZE = 40
target = "Hello, World"
p = population(GENSIZE,target)
fitness_history = [grade(p, target),]
for i in xrange(20):
    p = evolve(p, target)
    fitness_history.append(grade(p, target))
    # print p
for datum in fitness_history:
    print datum
But it seems that the result can't fit the target well.
I tried changing GENSIZE and the number of loops (more generations).
But the result always gets stuck. Sometimes increasing the number of loops helps to find an optimal solution. But when I change the loop count to a much larger number, such as for i in xrange(10000), I get this error:
individual[pos_to_mutate] = chr(ord(individual[pos_to_mutate]) + np.random.randint(-1,1))
ValueError: chr() arg not in range(256)
Anyway, how can I modify my code to get a good result?
Any advice would be appreciated.

The chr function in Python2 only accepts values in the range 0 <= i < 256.
You are passing:
ord(individual[pos_to_mutate]) + np.random.randint(-1,1)
So you need to check that the result of
ord(individual[pos_to_mutate]) + np.random.randint(-1,1)
is not going to be outside that range, and take corrective action before passing to chr if it is outside that range.
EDIT
A reasonable fix for the ValueError might be to take the amended value modulo 256 before passing to chr:
chr((ord(individual[pos_to_mutate]) + np.random.randint(-1, 1)) % 256)
There is another bug: the fitness calculation doesn't take the final element of the candidate list into account: it should be:
def fitness(source, target):
    fitval = 0
    for i in range(0,len(source)): # <- len(source), not len(source) -1
        fitval += (ord(target[i]) - ord(source[i])) ** 2
    return (fitval)
Given that source and target must be of equal length, the function can be written as:
def fitness(source, target):
    return sum((ord(t) - ord(s)) ** 2 for (t, s) in zip(target, source))
The real question was, why doesn't the code provided evolve random strings until the target string is reached.
The answer, I believe, is it may, but will take a lot of iterations to do so.
Consider, in the blog post referenced in the question, each iteration generates a child which replaces the least fit member of the gene pool if the child is fitter. The selection of the child's parent is biased towards fitter parents, increasing the likelihood that the child will enter the gene pool and increase the overall "fitness" of the pool. Consequently the members of the gene pool converge on the desired result within a few thousand iterations.
In the code in the question, the probability of mutation is much lower, based on the initial conditions, that is the defaults for the evolve function.
Parents that are retained have only a 1% chance of mutating. Moreover, the question's code draws the offset with np.random.randint(-1, 1), whose upper bound is exclusive, so the offset is only ever -1 or 0: half the time the "mutation" changes nothing, and a character code can never increase.
Discarded parents are replaced by individuals created by merging two retained individuals. Since only 20% of parents are retained, the population can converge on a local minimum where each new child is effectively a copy of an existing parent, and so no diversity is introduced.
So apart from fixing the two bugs, the way to converge more quickly on the target is to experiment with the initial conditions and to consider changing the code in the question to inject more diversity, for example by mutating children as in the original blog post, or by extending the range of possible mutations.
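As an illustration of "extending the range of possible mutations", here is a sketch of a symmetric mutation step. It relies on the fact that NumPy's integer samplers exclude the upper bound, so np.random.randint(-1, 1) only yields -1 or 0. Clamping to the printable ASCII range 32..126 is an assumption made for this example (the answer above suggests modulo 256 instead), and mutate_char is a hypothetical helper.

```python
import numpy as np

def mutate_char(ch, rng, lo=32, hi=126):
    """Shift a character by -1, 0 or +1 and clamp it to [lo, hi]."""
    step = int(rng.integers(-1, 2))  # upper bound is exclusive: -1, 0 or +1
    code = min(max(ord(ch) + step, lo), hi)
    return chr(code)

# The call used in the question never produces +1.
legacy_draws = set(np.random.randint(-1, 1, size=1000).tolist())
print(legacy_draws)

rng = np.random.default_rng(0)
print(mutate_char("H", rng))
```

Because the result is clamped, chr never receives a value outside the valid range, which also avoids the ValueError.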

k-means clustering implementation in python, running out of memory

 Note: updates/solutions at the bottom of this question
As part of a product recommendation engine, I'm trying to segment my users based on their product preferences starting with using the k-means clustering algorithm.
My data is a dictionary of the form:
prefs = {
    'user_id_1': { 1L: 3.0f, 2L: 1.0f, },
    'user_id_2': { 4L: 1.0f, 8L: 1.5f, },
}
where the product ids are the longs and the ratings are floats. The data is sparse. I currently have about 60,000 users, most of whom have only rated a handful of products. The dictionary of values for each user is implemented using a defaultdict(float) to simplify the code.
My implementation of k-means clustering is as follows:
def kcluster(prefs,sim_func=pearson,k=100,max_iterations=100):
    from collections import defaultdict
    users = prefs.keys()
    centroids = [prefs[random.choice(users)] for i in range(k)]
    lastmatches = None
    for t in range(max_iterations):
        print 'Iteration %d' % t
        bestmatches = [[] for i in range(k)]
        # Find which centroid is closest for each row
        for j in users:
            row = prefs[j]
            bestmatch=(0,0)
            for i in range(k):
                d = simple_pearson(row,centroids[i])
                if d < bestmatch[1]: bestmatch = (i,d)
            bestmatches[bestmatch[0]].append(j)
        # If the results are the same as last time, this is complete
        if bestmatches == lastmatches: break
        lastmatches=bestmatches
        centroids = [defaultdict(float) for i in range(k)]
        # Move the centroids to the average of their members
        for i in range(k):
            len_best = len(bestmatches[i])
            if len_best > 0:
                items = set.union(*[set(prefs[u].keys()) for u in bestmatches[i]])
                for user_id in bestmatches[i]:
                    row = prefs[user_id]
                    for m in items:
                        if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)
    return bestmatches
As far as I can tell, the algorithm is handling the first part (assigning each user to its nearest centroid) fine.
The problem is when doing the next part, taking the average rating for each product in each cluster and using these average ratings as the centroids for the next pass.
Basically, before it's even managed to do the calculations for the first cluster (i=0), the algorithm bombs out with a MemoryError at this line:
if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)
Originally the division step was in a separate loop, but fared no better.
This is the exception I get:
malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Any help would be greatly appreciated.
Update: Final algorithms
Thanks to the help received here, this is my fixed algorithm. If you spot anything blatantly wrong please add a comment.
First, the simple_pearson implementation
def simple_pearson(v1,v2):
    si = [val for val in v1 if val in v2]
    n = len(si)
    if n==0: return 0.0
    sum1 = 0.0
    sum2 = 0.0
    sum1_sq = 0.0
    sum2_sq = 0.0
    p_sum = 0.0
    for v in si:
        sum1+=v1[v]
        sum2+=v2[v]
        sum1_sq+=pow(v1[v],2)
        sum2_sq+=pow(v2[v],2)
        p_sum+=(v1[v]*v2[v])
    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = (sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n)
    if temp < 0.0:
        temp = -temp
    den = sqrt(temp)
    if den==0: return 1.0
    r = num/den
    return r
A simple method to turn simple_pearson into a distance value:
def distance(v1,v2):
    return 1.0-simple_pearson(v1,v2)
And finally, the k-means clustering implementation:
def kcluster(prefs,k=21,max_iterations=50):
    from collections import defaultdict
    users = prefs.keys()
    centroids = [prefs[u] for u in random.sample(users, k)]
    lastmatches = None
    for t in range(max_iterations):
        print 'Iteration %d' % t
        bestmatches = [[] for i in range(k)]
        # Find which centroid is closest for each row
        for j in users:
            row = prefs[j]
            bestmatch=(0,2.0)
            for i in range(k):
                d = distance(row,centroids[i])
                if d <= bestmatch[1]: bestmatch = (i,d)
            bestmatches[bestmatch[0]].append(j)
        # If the results are the same as last time, this is complete
        if bestmatches == lastmatches: break
        lastmatches=bestmatches
        centroids = [defaultdict(float) for i in range(k)]
        # Move the centroids to the average of their members
        for i in range(k):
            len_best = len(bestmatches[i])
            if len_best > 0:
                for user_id in bestmatches[i]:
                    row = prefs[user_id]
                    for m in row:
                        centroids[i][m]+=row[m]
                for key in centroids[i].keys():
                    centroids[i][key]/=len_best
                    # We may have made the centroids quite dense which significantly
                    # slows down subsequent iterations, so we delete values below a
                    # threshold to speed things up
                    if centroids[i][key] < 0.001:
                        del centroids[i][key]
    return centroids, bestmatches

Not all these observations are directly relevant to your issues as expressed, but..:
a. why are the keys in prefs, as shown, longs? unless you have billions of users, simple ints will be fine and save you a little memory.
b. your code:
centroids = [prefs[random.choice(users)] for i in range(k)]
can give you repeats (two identical centroids), which in turn would not make the K-means algorithm happy. Just use the faster and more solid
centroids = [prefs[u] for u in random.sample(users, k)]
c. in your code as posted you're calling a function simple_pearson which you never define anywhere; I assume you mean to call sim_func, but it's really hard to help on different issues while at the same time having to guess how the code you posted differs from any code that might actually be working
d. one more indication that this posted code may be different from anything that might actually work: you set bestmatch=(0,0) but then test with if d < bestmatch[1]: -- how is the test ever going to succeed? is the distance function returning negative values?
e. the point of a defaultdict is that just accessing row[m] magically adds an item to row at index m (with the value obtained by calling the defaultdict's factory, here 0.0). That item will then take up memory forevermore. You absolutely DON'T need this behavior, and therefore your code:
row = prefs[user_id]
for m in items:
    if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)
is wasting huge amount of memory, making prefs into a dense matrix (mostly full of 0.0 values) from the sparse one it used to be. If you code instead
row = prefs[user_id]
for m in row:
    centroids[i][m]+=(row[m]/len_best)
there will be no growth in row and therefore in prefs because you're looping over the keys that row already has.
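The effect described in point e can be reproduced with a short sketch on made-up data (the keys and ratings below are hypothetical). Reading a missing key with get leaves a defaultdict unchanged, while indexing it inserts the key:

```python
from collections import defaultdict

# One sparse user row with two rated products.
row = defaultdict(float, {1: 3.0, 2: 1.0})
items = range(1000)  # stands in for the union of all product ids in a cluster

# Reading with get() never inserts anything.
total = sum(row.get(m, 0.0) for m in items)
size_after_get = len(row)

# Indexing a missing key calls the factory and stores the result.
for m in items:
    if row[m] > 0.0:
        pass
size_after_index = len(row)

print(total, size_after_get, size_after_index)  # 4.0 2 1000
```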
There may be many other such issues, major like the last one or minor ones -- as an example of the latter,
f. don't divide a bazillion times by len_best: compute its inverse once outside the loop and multiply by that inverse -- also you don't need to do that multiplication inside the loop, you can do it at the end in a separate loop since it's the same value that's multiplying every item -- this saves no memory but avoids wantonly wasting CPU time;-). OK, these are two minor issues, I guess, not just one;-).
As I mentioned there may be many others, but with the density of issues already shown by these six (or seven), plus the separate suggestion already advanced by S.Lott (which I think would not fix your main out-of-memory problem, since his code is still addressing the row defaultdict by too many keys it doesn't contain), I think it wouldn't be very productive to keep looking for even more -- maybe start by fixing these ones and if problems persist post a separate question about those...?

Your centroids does not need to be an actual list.
You never appear to reference anything other than centroids[i][m]. If you only want centroids[i], then perhaps it doesn't need to be a list; a simple dictionary would probably do.
centroids = defaultdict(float)
# Move the centroids to the average of their members
for i in range(k):
    len_best = len(bestmatches[i])
    if len_best > 0:
        items = set.union(*[set(prefs[u].keys()) for u in bestmatches[i]])
        for user_id in bestmatches[i]:
            row = prefs[user_id]
            for m in items:
                if row[m] > 0.0: centroids[m]+=(row[m]/len_best)
May work better.
