is k-means ++ suitable for large data?

I used this k-means++ Python code to initialize k centers, but it is very slow on large data, for example 400000 points in 2 dimensions:
class KPlusPlus(KMeans):
    def _dist_from_centers(self):
        cent = self.mu
        X = self.X
        D2 = np.array([min([np.linalg.norm(x-c)**2 for c in cent]) for x in X])
        self.D2 = D2
    def _choose_next_center(self):
        self.probs = self.D2/self.D2.sum()
        self.cumprobs = self.probs.cumsum()
        r = random.random()
        ind = np.where(self.cumprobs >= r)[0][0]
        return(self.X[ind])
    def init_centers(self):
        self.mu = random.sample(self.X, 1)
        while len(self.mu) < self.K:
            self._dist_from_centers()
            self.mu.append(self._choose_next_center())
    def plot_init_centers(self):
        X = self.X
        fig = plt.figure(figsize=(5,5))
        plt.xlim(-1,1)
        plt.ylim(-1,1)
        plt.plot(zip(*X)[0], zip(*X)[1], '.', alpha=0.5)
        plt.plot(zip(*self.mu)[0], zip(*self.mu)[1], 'ro')
        plt.savefig('kpp_init_N%s_K%s.png' % (str(self.N),str(self.K)), \
                    bbox_inches='tight', dpi=200)
Is there a way to speed up k-means++?

Initial seeding has a large impact on k-means execution time. In this post you can find some strategies to speed it up.
Perhaps you could consider using Siddhesh Khandelwal's k-means variant, which was published in the Proceedings of the European Conference on Information Retrieval (ECIR 2017).
Siddhesh provided a Python implementation on GitHub, accompanied by some earlier heuristic algorithms.

K-means++ initialization takes O(n*k) time. This is reasonably fast for small k and large n, but if you choose k too large, it will take some time. It is about as expensive as one iteration of the (slow) Lloyd variant, so it will usually pay off to use k-means++.
Your implementation is worse, at least O(n*k²), because it recomputes the distances to every existing center each time a new center is added. It also probably always chooses the same point as the next center.
Note that you also only have the initialization, not the actual kmeans yet.
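To make the O(n*k) claim concrete, here is a minimal NumPy sketch of the seeding step. It is not the asker's class and not any particular library's implementation; the function name `kmeanspp_init` and its parameters are made up for illustration. The idea is to keep one running array of squared distances to the nearest chosen center and update it against the newest center only.

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """Pick k initial centers from X (shape n x d) with D^2 weighting."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = np.empty((k, X.shape[1]), dtype=float)
    # First center: chosen uniformly at random.
    centers[0] = X[rng.integers(n)]
    # Squared distance from every point to its nearest center chosen so far.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for j in range(1, k):
        # Sample the next center with probability proportional to d2.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers[j] = X[idx]
        # Only distances to the NEW center are computed: O(n) work per step.
        np.minimum(d2, np.sum((X - centers[j]) ** 2, axis=1), out=d2)
    return centers

# Same problem size as in the question: 400000 points in 2 dimensions.
X = np.random.default_rng(0).uniform(-1, 1, size=(400_000, 2))
centers = kmeanspp_init(X, k=10, rng=1)
print(centers.shape)
```

Each new center costs one vectorized pass over the data, so the total work grows linearly in k rather than quadratically.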

I haven't run any experiment yet, but Scalable K-Means++ seems rather good for very large data sets (perhaps for those even larger than what you describe).
You can find the paper here and another post explaining it here.
Unfortunately, I haven't seen any code around I'd trust...

Related

Implement method of lines to solve PDE in Python scipy with comparable performance to Matlab's ode15s

I want to use the method of lines to solve the thin-film equation. I have implemented it (with gamma=mu=0) in Matlab using ode15s and it seems to work fine:
N = 64;
x = linspace(-1,1,N+1);
x = x(1:end-1);
dx = x(2)-x(1);
T = 1e-2;
h0 = 1+0.1*cos(pi*x);
[t,h] = ode15s(@(t,y) thinFilmEq(t,y,dx), [0,T], h0);
function dhdt = thinFilmEq(t,h,dx)
    phi = 0;
    hxx = (circshift(h,1) - 2*h + circshift(h,-1))/dx^2;
    p = phi - hxx;
    px = (circshift(p,-1)-circshift(p,1))/dx;
    flux = (h.^3).*px/3;
    dhdt = (circshift(flux,-1) - circshift(flux,1))/dx;
end
The film just flattens after some time, and for large time the film should tend to h(t->inf)=1. I haven't done any rigorous check and convergence analysis, but at least the result looks promising after only spending less than 5 mins to code it.
I want to do the same thing in Python, and I tried the following:
import numpy as np
import scipy.integrate as spi
def thin_film_eq(t,h,dx):
    print(t) # to check the current evaluation time for debugging
    phi = 0
    hxx = (np.roll(h,1) - 2*h + np.roll(h,-1))/dx**2
    p = phi - hxx
    px = (np.roll(p,-1) - np.roll(p,1))/dx
    flux = h**3*px/3
    dhdt = (np.roll(flux,-1) - np.roll(flux,1))/dx
    return dhdt
N = 64
x = np.linspace(-1,1,N+1)[:-1]
dx = x[1]-x[0]
T = 1e-2
h0 = 1 + 0.1*np.cos(np.pi*x)
sol = spi.solve_ivp(lambda t,h: thin_film_eq(t,h,dx), (0,T), h0, method='BDF', vectorized=True)
I added a print statement inside the function so I can check the current progress of the program. For some reason, it takes very tiny time steps, and after a few minutes it is still stuck at t=3.465e-5, with dt smaller than 1e-10. (It had not finished by the time I finished typing this question, and it probably won't within any reasonable time.) The Matlab program finishes within a second with only 14 time steps taken (I only specify the time span, and it outputs 14 time steps with everything else kept at default). I want to ask the following:
Have I done anything wrong which dramatically slows down the computation time for my Python code? What settings should I choose for the solve_ivp function call? One thing I'm not sure is if I do the vectorization properly. Also did I write the function in the correct way? I know this is a stiff ODE, but the ultra-small time step taken by
Is the difference really just down to the difference in the ODE solver? scipy.integrate.solve_ivp(f, method='BDF') is the recommended substitute for ode15s according to the official numpy website. But for this particular example the difference is one second versus taking ages to solve, which is a lot bigger than I expected.
Are there other alternative methods I can try in Python for solving similar PDEs? (something along the line of finite difference/method of lines) I mean utilizing existing libraries, preferably those in scipy.
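One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: with vectorized=True, solve_ivp may call the function with a 2-D array of shape (n, k), one state vector per column, and np.roll without an axis argument shifts the flattened array, so neighbouring columns would leak into each other. The sketch below is the same right-hand side with axis=0 added to every roll and the print removed; the name `thin_film_rhs` is only for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

def thin_film_rhs(t, h, dx):
    """Right-hand side; h may be 1-D (n,) or 2-D (n, k) when vectorized."""
    phi = 0.0
    # axis=0 keeps each column (one state vector) periodic on its own.
    hxx = (np.roll(h, 1, axis=0) - 2 * h + np.roll(h, -1, axis=0)) / dx**2
    p = phi - hxx
    px = (np.roll(p, -1, axis=0) - np.roll(p, 1, axis=0)) / dx
    flux = h**3 * px / 3
    return (np.roll(flux, -1, axis=0) - np.roll(flux, 1, axis=0)) / dx

N = 64
x = np.linspace(-1, 1, N + 1)[:-1]
dx = x[1] - x[0]
T = 1e-2
h0 = 1 + 0.1 * np.cos(np.pi * x)

sol = solve_ivp(lambda t, h: thin_film_rhs(t, h, dx), (0, T), h0,
                method="BDF", vectorized=True)
print(sol.success, sol.t.size)
```

Dropping vectorized=True from the original call would be another way to test the same hypothesis, since the function would then only ever receive 1-D arrays.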

Bayesian fit of cosine wave taking longer than expected

In a recent homework, I was asked to perform a Bayesian fit over a set of data a and b using a Metropolis algorithm. The relationship between a and b is given:
e(t) = e_0*cos(w*t)
w = 2 * pi
The Metropolis algorithm is (it works fine with other fits):
def metropolis(logP, args, v0, Nsteps, stepSize):
    vCur = v0
    logPcur = logP(vCur, *args)
    v = []
    Nattempts = 0
    for i in range(Nsteps):
        while(True):
            #Propose step:
            vNext = vCur + stepSize*np.random.randn(*vCur.shape)
            logPnext = logP(vNext, *args)
            Nattempts += 1
            #Accept/reject step:
            Pratio = (1. if logPnext>logPcur else np.exp(logPnext-logPcur))
            if np.random.rand() < Pratio:
                vCur = vNext
                logPcur = logPnext
                v.append(vCur)
                break
    acceptRatio = Nsteps*(1./Nattempts)
    return np.array(v), acceptRatio
I tried to do a Bayesian fit of the cosine wave using the Metropolis algorithm above:
e_0 = -0.00155
def strain_t(e_0,t):
    return e_0*np.cos(2*np.pi*t)
data = pd.read_csv('stressStrain.csv')
t = np.array(data['t'])
e = strain_t(e_0,t)
def logfitstrain_t(params,t,e):
    e_0 = params[0]
    sigmaR = params[1]
    strainModel = strain_t(e_0,t)
    return np.sum(-0.5*((e-strainModel)/sigmaR)**2 - np.log(sigmaR))
params0 = np.array([-0.00155,np.std(t)])
params, accRatio = metropolis(logfitstrain_t, (t,e), params0, 1000, 0.042)
print('Acceptance ratio:', accRatio)
e0 = np.mean(params[0])
print('e0=',e0)
e_t = e0*np.cos(2*np.pi*t)
sns.jointplot(t, e_t, kind='hex',color='purple')
The data in .csv looks like
There isn't any error message showing after I hit run, but it takes forever for python to give me an output. What did I do wrong here?

Why it might "take forever"
Your algorithm is designed to run until it accepts a given number of proposals (1000 in the example). Thus, if it's running for a long time, you're likely rejecting a bunch of proposals. This can happen when the step size is too large, leading new proposals to end up in distant, low probability regions of the likelihood space. Try reducing your step size. This may require you to also increase the number of samples to ensure the posterior space becomes adequately explored.
A more serious concern
Because you only append accepted proposals to the chain v, you haven't actually implemented the Metropolis algorithm; instead you obtain a biased set of samples that will tend to overrepresent less likely regions of the posterior space. A true Metropolis implementation re-appends the current state (the most recently accepted sample) whenever the new proposal is rejected. You can still enforce a minimum number of accepted proposals, but you really must append something every time.
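A minimal sketch of the second point, assuming a fixed number of proposals rather than a fixed number of acceptances (that change, the snake_case names and the `rng` argument are my own choices, not part of the original code). The current state is recorded on every iteration, so a rejected proposal repeats the previous sample and the loop always terminates after n_steps proposals.

```python
import numpy as np

def metropolis(logP, args, v0, n_steps, step_size, rng=None):
    rng = np.random.default_rng(rng)
    v_cur = np.asarray(v0, dtype=float)
    logp_cur = logP(v_cur, *args)
    chain = np.empty((n_steps,) + v_cur.shape)
    n_accepted = 0
    for i in range(n_steps):
        # Propose a Gaussian random-walk step.
        v_next = v_cur + step_size * rng.standard_normal(v_cur.shape)
        logp_next = logP(v_next, *args)
        # Accept with probability min(1, P_next / P_cur), evaluated in log space.
        if np.log(rng.random()) < logp_next - logp_cur:
            v_cur, logp_cur = v_next, logp_next
            n_accepted += 1
        # Record the current state whether or not the proposal was accepted.
        chain[i] = v_cur
    return chain, n_accepted / n_steps

# Toy target for illustration only: a standard normal log-density.
def log_std_normal(v):
    return -0.5 * np.sum(v**2)

chain, acc_ratio = metropolis(log_std_normal, (), np.zeros(1), 20_000, 1.0, rng=0)
print("Acceptance ratio:", acc_ratio)
```

With this structure a step size that is too large shows up as a low acceptance ratio instead of an apparently endless run.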

Numerical Stability of Forward Substitution in Python

I am implementing some basic linear equation solvers in Python.
I have currently implemented forward and backward substitution for triangular systems of equations (so very straightforward to solve!), but the precision of the solutions becomes very poor even with systems of about 50 equations (50x50 coefficient matrix).
The following code performs the forward/backward substitution:
FORWARD_SUBSTITUTION = 1
BACKWARD_SUBSTITUTION = 2
def solve_triang_subst(A: np.ndarray, b: np.ndarray,
                       substitution=FORWARD_SUBSTITUTION) -> np.ndarray:
    """Solves a triangular system via
    forward or backward substitution.
    A must be triangular. FORWARD_SUBSTITUTION means A should be
    lower-triangular, BACKWARD_SUBSTITUTION means A should be upper-triangular.
    """
    rows = len(A)
    x = np.zeros(rows, dtype=A.dtype)
    row_sequence = reversed(range(rows)) if substitution == BACKWARD_SUBSTITUTION else range(rows)
    for row in row_sequence:
        delta = b[row] - np.dot(A[row], x)
        cur_x = delta / A[row][row]
        x[row] = cur_x
    return x
I am using numpy and 64-bit floats.
Simple Testing Tool
I have set up a simple test suite which generates coefficient matrices and x vectors, computes b, and then uses forward or backward substitution to recover x, comparing it to its known value for validity.
The following code performs these checks:
import numpy as np
import scipy.linalg as sp_la
RANDOM_SEED = 1984
np.random.seed(RANDOM_SEED)
def check(sol: np.ndarray, x_gt: np.ndarray, description: str) -> None:
    if not np.allclose(sol, x_gt, rtol=0.1):
        print("Found inaccurate solution:")
        print(sol)
        print("Ground truth (not achieved...):")
        print(x_gt)
        raise ValueError("{} did not work!".format(description))
def fuzz_test_solving():
    N_ITERATIONS = 100
    refine_result = True
    for mode in [FORWARD_SUBSTITUTION, BACKWARD_SUBSTITUTION]:
        print("Starting mode {}".format(mode))
        for iteration in range(N_ITERATIONS):
            N = np.random.randint(3, 50)
            A = np.random.uniform(0.0, 1.0, [N, N]).astype(np.float64)
            if mode == BACKWARD_SUBSTITUTION:
                A = np.triu(A)
            elif mode == FORWARD_SUBSTITUTION:
                A = np.tril(A)
            else:
                raise ValueError()
            x_gt = np.random.uniform(0.0, 1.0, N).astype(np.float64)
            b = np.dot(A, x_gt)
            x_est = solve_triang_subst(A, b, substitution=mode,
                                       refine_result=refine_result)
            # TODO report error and count, don't throw!
            # Keep track of error norm!!
            check(x_est, x_gt,
                  "Mode {} custom triang iteration {}".format(mode, iteration))
if __name__ == '__main__':
    fuzz_test_solving()
Note that the maximum size of a test matrix is 49x49. Even in this case, the system cannot always compute decent solutions, and fails by more than a margin of 0.1. Here's an example of such a failure (this is doing backward substitution, so the biggest error is in the 0th coefficient; all the test data are sampled uniformly from [0, 1[):
Solution found with Mode 2 custom triang iteration 24:
[ 0.27876067 0.55200497 0.49499509 0.3259397 0.62420183 0.47041149
0.63557676 0.41155446 0.47191956 0.74385864 0.03002819 0.4700286
0.37989592 0.56527691 0.15072607 0.05659282 0.52587574 0.82252197
0.65662833 0.50250729 0.74139748 0.10852731 0.27864265 0.42981232
0.16327331 0.74097937 0.24411709 0.96934199 0.890266 0.9183985
0.14842446 0.51806495 0.36966843 0.18227989 0.85399593 0.89615663
0.39819336 0.90445931 0.21430972 0.61212349 0.85205597 0.66758689
0.1793689 0.38067267 0.39104614 0.6765885 0.4118123 ]
Ground truth (not achieved...)
[ 0.20881608 0.71009766 0.44735271 0.31169033 0.63982328 0.49075813
0.59669585 0.43844108 0.47764942 0.72222069 0.03497499 0.4707452
0.37679884 0.56439738 0.15120397 0.05635977 0.52616387 0.82230625
0.65670245 0.50251426 0.74139956 0.10845974 0.27864289 0.42981226
0.1632732 0.74097939 0.24411707 0.96934199 0.89026601 0.91839849
0.14842446 0.51806495 0.36966843 0.18227989 0.85399593 0.89615663
0.39819336 0.90445931 0.21430972 0.61212349 0.85205597 0.66758689
0.1793689 0.38067267 0.39104614 0.6765885 0.4118123 ]
I have also implemented the iterative refinement method described in Section 2.5 of [0], and while it did help a little, the results are still poor for larger matrices.
MATLAB Sanity Check
I also did this experiment in MATLAB, and even there, once there are more than 100 equations, the estimation error shoots up exponentially.
Here is the MATLAB code I used for this experiment:
err_norms = [];
range = 1:3:120;
for size=range
    A = rand(size, size);
    A = tril(A);
    x_gt = rand(size, 1);
    b = A * x_gt;
    x_sol = A\b;
    err_norms = [err_norms, norm(x_gt - x_sol)];
end
plot(range, err_norms);
set(gca, 'YScale', 'log')
And here is the resulting plot:
Main Question
My question is: Is this normal behavior, seeing as there is essentially no structure in the problem, given that I randomly generate the A matrix and x?
What about solving linear systems of 100s of equations for various practical applications? Are these limitations simply an accepted fact, and e.g., optimization algorithms are just naturally robust to these issues? Or am I missing some important facets of this problem?
[0]: Press, William H. Numerical recipes 3rd edition: The art of scientific computing. Cambridge university press, 2007.

There are no limitations. This is a very fruitful exercise that we have all gone through at some point; writing linear solvers is not that easy, and that's why LAPACK or its cousins in other languages are almost always used with full confidence.
You are being hit by almost singular matrices, and because you are using Matlab's backslash you don't see that Matlab is switching to least squares solutions behind the scenes when near singularity is hit. If you change A\b to linsolve(A,b), which restricts the solver to square systems, you'll probably see lots of warnings on your console.
I didn't test it because I don't have a license anymore, but writing blind, this should show you the condition numbers of the matrices at each step.
err_norms = [];
range = 1:3:120;
for i=1:40
    size = range(i);
    A = rand(size, size);
    A = tril(A);
    x_gt = rand(size, 1);
    b = A * x_gt;
    x_sol = linsolve(A,b);
    err_norms = [err_norms, norm(x_gt - x_sol)];
    zzz(i) = rcond(A);
end
semilogy(range, err_norms);
figure,semilogy(range,zzz);
Note that because you are drawing numbers from a uniform distribution, it becomes more and more likely to hit ill-conditioned matrices (with respect to inversion) as the size grows, since the rows are more likely to be nearly rank deficient. That's why the error becomes bigger and bigger. Add a scalar multiple of the identity matrix and all errors should come back to eps*n levels.
But it is best to leave this to expert algorithms which have been tested over decades. It is really not that trivial to write any of these. You can read the Fortran codes; for example, dtrsm solves the triangular system.
On the Python side, you can use scipy.linalg.solve_triangular which uses ?trtrs routines from LAPACK.
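As a rough illustration of the conditioning argument, the sketch below builds one random lower-triangular matrix like those in the question's test, solves it with scipy.linalg.solve_triangular, and repeats the solve after adding a scaled identity. The shift n*I is an arbitrary choice of mine to make the effect obvious; the seed and variable names are likewise illustrative.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1984)
n = 50
A = np.tril(rng.uniform(0.0, 1.0, (n, n)))
x_gt = rng.uniform(0.0, 1.0, n)

# Same matrix plus a scaled identity, which keeps the diagonal away from zero.
A_shift = A + n * np.eye(n)

results = {}
for name, M in [("random tril", A), ("tril + n*I", A_shift)]:
    b = M @ x_gt
    x = solve_triangular(M, b, lower=True)
    results[name] = (np.linalg.cond(M), np.linalg.norm(x - x_gt))
    print(name, "cond =", results[name][0], "error =", results[name][1])
```

Comparing the two printed condition numbers next to the two error norms shows how much of the error comes from the matrix rather than from the substitution loop.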

SVM with python and CPLEX, load the quadratic part of the objective function

''In general, you get better performance by creating batches of linear constraints rather than creating them one at a time. I am just wondering whether this still holds for a huge problem.'' - The wise programmer.
To be clear, I have a (35k x 40) dataset and I want to run an SVM on it. I need to produce the Gram matrix of this dataset, which is fine, but passing the coefficients to CPLEX is a mess and takes hours. Here is my code:
nn = 35000
XXt = np.random.rand(nn,nn) # the Gram matrix of the dataset
yy = np.random.rand(nn) # the label vector of the dataset
temp = ((yy*XXt).T)*yy
xg, yg = np.meshgrid(range(nn), range(nn))
indici = np.dstack([yg,xg])
quadratic_part = []
for ii in xrange(nn):
    for indd in indici[ii][ii:]:
        quadratic_part.append([indd[0],indd[1],temp[indd[0],indd[1]]])
'quadratic_part' is a list of entries of the form [i,j,c_ij], where c_ij is the coefficient stored in temp. It will be passed to the function 'objective.set_quadratic_coefficients()' of the CPLEX Python API.
Is there a wiser way to do that?
P.S. I may also have a memory problem, so instead of storing the whole list 'quadratic_part' it would be better to call 'objective.set_quadratic_coefficients()' several times, if you see what I mean.

Under the hood, objective.set_quadratic makes use of the CPXXcopyquad function in the C Callable Library. Whereas, objective.set_quadratic_coefficients uses CPXXcopyqpsep.
Here is an example (bear in mind that I am not a numpy expert; it's quite possible there's a better way to do that part):
import numpy as np
import cplex
nn = 5 # a small example size here
XXt = np.random.rand(nn,nn) # the gramm matrix of the dataset
yy = np.random.rand(nn) # the label vector of the dataset
temp = ((yy*XXt).T)*yy
# create symmetric matrix
tempu = np.triu(temp) # upper triangle
iu1 = np.triu_indices(nn, 1)
tempu.T[iu1] = tempu[iu1] # copy upper into lower
ind = np.array([[x for x in range(nn)] for x in range(nn)])
qmat = []
for i in range(nn):
    qmat.append([np.arange(nn), tempu[i]])
c = cplex.Cplex()
c.variables.add(lb=[0]*nn)
c.objective.set_quadratic(qmat)
c.write("test2.lp")
Your Q matrix is completely dense so depending on the amount of memory you have, this technique may not scale. When it's possible, though, you should get better performance initializing your Q matrix with objective.set_quadratic. Perhaps you'll need to use some hybrid technique where you use both set_quadratic and set_quadratic_coefficients.
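For the hybrid or batched route, one possible sketch is a generator that produces the upper-triangle triples a block of rows at a time, so the full list never has to sit in memory. It uses only NumPy; the helper name `upper_triangle_triples` and the `rows_per_batch` parameter are made up for illustration, and the CPLEX call is left as a comment because I have not run it against a real model.

```python
import numpy as np

def upper_triangle_triples(Q, rows_per_batch):
    """Yield lists of (i, j, Q[i, j]) with j >= i, one block of rows at a time."""
    n = Q.shape[0]
    cols = np.arange(n)
    for start in range(0, n, rows_per_batch):
        stop = min(start + rows_per_batch, n)
        block = Q[start:stop]
        # Column j is in the upper triangle of row i when j >= i.
        mask = cols[None, :] >= np.arange(start, stop)[:, None]
        i_local, j = np.nonzero(mask)
        i = i_local + start
        # tolist() converts to plain Python ints and floats.
        yield list(zip(i.tolist(), j.tolist(), block[i_local, j].tolist()))

nn = 6  # a small example size here
rng = np.random.default_rng(0)
XXt = rng.random((nn, nn))  # stand-in for the Gram matrix
yy = rng.random(nn)         # stand-in for the label vector
temp = ((yy * XXt).T) * yy

batches = list(upper_triangle_triples(temp, rows_per_batch=4))
# Assumed usage, not executed here:
# for batch in upper_triangle_triples(temp, rows_per_batch=1000):
#     c.objective.set_quadratic_coefficients(batch)
print(len(batches), sum(len(b) for b in batches))
```

The batch size trades memory against the number of API calls, so it would need tuning on the real 35k problem.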

Python - multiprocessing for matplotlib griddata

Following my former question [1], I would like to apply multiprocessing to matplotlib's griddata function. Is it possible to split the griddata into, say 4 parts, one for each of my 4 cores? I need this to improve performance.
For example, try the code below, experimenting with different values for size:
import numpy as np
import matplotlib.mlab as mlab
import time
size = 500
Y = np.arange(size)
X = np.arange(size)
x, y = np.meshgrid(X, Y)
u = x * np.sin(5) + y * np.cos(5)
v = x * np.cos(5) + y * np.sin(5)
test = x + y
tic = time.clock()
test_d = mlab.griddata(
    x.flatten(), y.flatten(), test.flatten(), x+u, y+v, interp='linear')
toc = time.clock()
print 'Time=', toc-tic

I ran the example code below in Python 3.4.2, with numpy version 1.9.1 and matplotlib version 1.4.2, on a Macbook Pro with 4 physical CPUs (i.e., as opposed to "virtual" CPUs, which the Mac hardware architecture also makes available for some use cases):
import numpy as np
import matplotlib.mlab as mlab
import time
import multiprocessing
# This value should be set much larger than nprocs, defined later below
size = 500
Y = np.arange(size)
X = np.arange(size)
x, y = np.meshgrid(X, Y)
u = x * np.sin(5) + y * np.cos(5)
v = x * np.cos(5) + y * np.sin(5)
test = x + y
tic = time.clock()
test_d = mlab.griddata(
    x.flatten(), y.flatten(), test.flatten(), x+u, y+v, interp='linear')
toc = time.clock()
print('Single Processor Time={0}'.format(toc-tic))
# Put interpolation points into a single array so that we can slice it easily
xi = x + u
yi = y + v
# My example test machine has 4 physical CPUs
nprocs = 4
jump = int(size/nprocs)
# Enclose the griddata function in a wrapper which will communicate its
# output result back to the calling process via a Queue
def wrapper(x, y, z, xi, yi, q):
    test_w = mlab.griddata(x, y, z, xi, yi, interp='linear')
    q.put(test_w)
# Measure the elapsed time for multiprocessing separately
ticm = time.clock()
queue, process = [], []
for n in range(nprocs):
    queue.append(multiprocessing.Queue())
    # Handle the possibility that size is not evenly divisible by nprocs
    if n == (nprocs-1):
        finalidx = size
    else:
        finalidx = (n + 1) * jump
    # Define the arguments, dividing the interpolation variables into
    # nprocs roughly evenly sized slices
    argtuple = (x.flatten(), y.flatten(), test.flatten(),
                xi[:,(n*jump):finalidx], yi[:,(n*jump):finalidx], queue[-1])
    # Create the processes, and launch them
    process.append(multiprocessing.Process(target=wrapper, args=argtuple))
    process[-1].start()
# Initialize an array to hold the return value, and make sure that it is
# null-valued but of the appropriate size
test_m = np.asarray([[] for s in range(size)])
# Read the individual results back from the queues and concatenate them
# into the return array
for q, p in zip(queue, process):
    test_m = np.concatenate((test_m, q.get()), axis=1)
    p.join()
tocm = time.clock()
print('Multiprocessing Time={0}'.format(tocm-ticm))
# Check that the result of both methods is actually the same; should raise
# an AssertionError exception if assertion is not True
assert np.all(test_d == test_m)
and I got the following result:
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/matplotlib/tri/triangulation.py:110: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  self._neighbors)
Single Processor Time=8.495998
Multiprocessing Time=2.249938
I'm not really sure what is causing the "future warning" from triangulation.py (evidently my version of matplotlib did not like something about the input values that were originally provided for the question), but regardless, the multiprocessing does appear to achieve the desired speedup of 8.50/2.25 = 3.8 (edit: see comments), which is close to the 4X that we would expect for a machine with 4 CPUs. The assertion statement at the end also executes successfully, showing that the two methods get the same answer, so in spite of the slightly weird warning message, I believe that the code above is a valid solution.
EDIT: A commenter has pointed out that both my solution, as well as the code snippet posted by the original author, are likely using the wrong method, time.clock(), for measuring execution time; he suggests using time.time() instead. I think I'm also coming around to his point of view. (Digging into the Python documentation a bit further, I'm still not convinced that even this solution is 100% correct, as newer versions of Python appear to have deprecated time.clock() in favor of time.perf_counter() and time.process_time(). But regardless, I do agree that whether or not time.time() is absolutely the most correct way of taking this measurement, it's still probably more correct than what I had been using before, time.clock().)
Assuming the commenter's point is correct, then it means the approximately 4X speedup that I thought I had measured is in fact wrong.
However, that does not mean that the underlying code itself wasn't correctly parallelized; rather, it just means that parallelization didn't actually help in this case; splitting up the data and running on multiple processors didn't improve anything. Why would this be? Other users have pointed out that, at least in numpy/scipy, some functions run on multiple cores, and some do not, and it can be a seriously challenging research project for an end-user to try to figure out which ones are which.
Based on the results of this experiment, if my solution correctly achieves parallelization within Python, but no further speedup is observed, then I would suggest the simplest likely explanation is that matplotlib is probably also parallelizing some of its functions "under the hood", so to speak, in compiled C++ libraries, just like numpy/scipy already do. Assuming that's the case, then the correct answer to this question would be that nothing further can be done: further parallelizing in Python will do no good if the underlying C++ libraries are already silently running on multiple cores to begin with.
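To illustrate the timing issue raised in the edit: on Unix, time.clock() reported CPU time of the calling process, so time the parent spends waiting on child processes is not counted. The sketch below compares time.perf_counter() (wall clock) with time.process_time() (CPU time of this process), using time.sleep() as a stand-in for waiting on workers; the helper name `measure` is made up for illustration.

```python
import time

def measure(fn):
    """Return (wall-clock seconds, CPU seconds of this process) spent in fn()."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

# Sleeping stands in for a parent process waiting on its workers.
wall, cpu = measure(lambda: time.sleep(0.2))
print("wall={:.3f}s cpu={:.3f}s".format(wall, cpu))
```

For comparing the single-process and multiprocessing versions, the wall-clock number is the one that reflects what the user actually waits for.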
