Modify k-means algorithm for a 1D array where order matters - python

I want to find groups in a one-dimensional array where order/position matters. I tried to use scipy's kmeans2, but it only works when I have numbers in increasing order.
I have to maximize the average difference between neighbouring sub-arrays.
For example: if I have the array [1,2,2,8,9,0,0,0,1,1,1] and I want to get 4 groups, the result should be something like [1,2,2], [8,9], [0,0,0], [1,1,1].
Is there a way to do it in better than O(n^k)?
Answer: I ended up with a modified dendrogram, where I merge neighbours only.
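For reference, here is a minimal sketch of that idea; the function name segment_1d and the greedy merge criterion are one possible reading of "merge neighbours only", not the exact code from the original post. It repeatedly merges the adjacent pair of segments whose means are closest until only k segments remain:
import numpy as np

def segment_1d(values, k):
    # Start with every element as its own segment.
    segments = [[v] for v in values]
    while len(segments) > k:
        # Merge the adjacent pair whose means are closest; this keeps the
        # average difference between neighbouring segments high.
        means = [np.mean(s) for s in segments]
        diffs = [abs(means[i + 1] - means[i]) for i in range(len(segments) - 1)]
        i = int(np.argmin(diffs))
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return segments

print(segment_1d([1, 2, 2, 8, 9, 0, 0, 0, 1, 1, 1], 4))
# [[1, 2, 2], [8, 9], [0, 0, 0], [1, 1, 1]]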

K-means is about minimizing least squares. Among its largest drawbacks (there are many) is that you need to know k. Why do you want to inherit this drawback?
Instead of hacking k-means into not ignoring the order, why don't you look at time-series segmentation and change-detection approaches, which are much more appropriate for this problem?
E.g. split your time series if abs(x[i] - x[i-1]) > stddev, where stddev is the standard deviation of your data set, or the standard deviation of the last 10 samples. (In the above series, the standard deviation is about 3, so it would split as [1,2,2], [8,9], [0,0,0,1,1,1], because the change from 0 to 1 is not significant.)
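A minimal sketch of that split rule, assuming the global standard deviation as the threshold (the function name split_on_change is just illustrative):
import numpy as np

def split_on_change(x, stddev=None):
    if stddev is None:
        stddev = np.std(x)    # global std; a rolling std of the last 10 samples works the same way
    segments, current = [], [x[0]]
    for prev, cur in zip(x[:-1], x[1:]):
        if abs(cur - prev) > stddev:    # significant jump: start a new segment
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments

print(split_on_change([1, 2, 2, 8, 9, 0, 0, 0, 1, 1, 1]))
# [[1, 2, 2], [8, 9], [0, 0, 0, 1, 1, 1]]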

Related

Is it possible to calculate running statistics taking neighbours into account in Numpy?

I have a rank-3 matrix A of shape (Nx, Ny, Nt), and I need to calculate statistics along the third (temporal) dimension for all grid points (the first two dimensions correspond to horizontal space). For this, you can just do something like np.percentile(A, 0.5, axis=2). So far so good.
Now, my actual problem is more convoluted. I'd like to take neighbours into account and make some sort of calculation over a "running window" of a given size, or even an arbitrary shape. This sounds like a convolution, but I want to calculate something other than just the average: I need percentiles and the like.
How can I achieve this result efficiently without explicit looping in the spatial dimensions?
I was thinking that a plausible solution would be to "enlarge" the original matrix by including an additional dimension that copies the data from the neighbours. Something like B of shape (Nx, Ny, Nt, Nn), where Nn is the number of neighbours (let's assume we take the 8 closest neighbours for the sake of simplicity). Then I could calculate np.percentile(B.reshape(Nx, Ny, Nt*Nn), 0.5, axis=2). The problem with this approach is two-fold:
I don't know how to build this matrix without explicit looping.
I'm concerned about the memory cost of keeping Nn redundant copies of the original array. In the actual application, the matrix will be very large and the number of neighbours will also be larger than 8.
Overall, I do not think this procedure is any better than explicitly looping and calculating things for each grid point in each iteration.
I'm pretty sure there must be a more convenient and standardised way to take neighbours into account within the Numpy matricial approach, but I could not find the recipe online.
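One sketch of the "enlarged matrix" idea, assuming a square 3x3 window (an arbitrary neighbour shape would need a different construction) and small shapes just for illustration, uses numpy.lib.stride_tricks.sliding_window_view so the neighbour axis is a view rather than Nn explicit copies:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

Nx, Ny, Nt = 6, 7, 4                       # small shapes, just for illustration
A = np.random.rand(Nx, Ny, Nt)

# Pad the spatial edges so the output keeps the original (Nx, Ny) shape.
A_pad = np.pad(A, ((1, 1), (1, 1), (0, 0)), mode="edge")

# Shape (Nx, Ny, Nt, 3, 3): a strided view over the 3x3 spatial neighbourhood.
windows = sliding_window_view(A_pad, (3, 3), axis=(0, 1))

# Pool time and the 9 neighbours, then take the median (50th percentile);
# this mirrors np.percentile(B.reshape(Nx, Ny, Nt*Nn), ..., axis=2) above.
p50 = np.percentile(windows.reshape(Nx, Ny, -1), 50, axis=2)
print(p50.shape)                           # (Nx, Ny)
Note that the reshape and np.percentile still materialise copies internally, so the saving is only in never storing B explicitly.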

Moments of a Numpy Array

I have a numpy array, say
rho = np.arange(25).reshape((5,5))
where each value represents the mass at that coordinate. I want a fast method to calculate the first 3 moments (so sum rho(x,y) * x^n * y^m for n + m <= 3), though I may need higher orders. My actual matrix is much larger. I can do it easily with for loops (compute the mean, then subtract it in the following summations), but there must be an efficient way with numpy operations that I have not figured out. Is there an easy way to do this?
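One way to vectorise this is to build coordinate grids with np.indices and sum the products directly; the following is a sketch, and the dictionary layout for the results is just one possible choice:
import numpy as np

rho = np.arange(25).reshape((5, 5)).astype(float)

# Coordinate grids matching rho's indexing: x varies along axis 0, y along axis 1.
x, y = np.indices(rho.shape)

# Raw moments M[n, m] = sum over the grid of rho(x, y) * x**n * y**m for n + m <= 3.
M = {(n, m): np.sum(rho * x**n * y**m)
     for n in range(4) for m in range(4) if n + m <= 3}

total_mass = M[(0, 0)]
x_bar, y_bar = M[(1, 0)] / total_mass, M[(0, 1)] / total_mass   # centre of mass

# Central moments: the same sums with coordinates shifted by the centre of mass.
mu = {(n, m): np.sum(rho * (x - x_bar)**n * (y - y_bar)**m)
      for n in range(4) for m in range(4) if n + m <= 3}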

How important are the rows vs columns in PCA?

So I have a dataset with pictures, where each column consists of a vector that can be reshaped into a 32x32 picture. The specific dimensions of my dataset are 1024 x 20000, meaning 20000 samples of images.
Now when I look at various ways of doing PCA without using the built-in functions from something like scikit-learn, people tend to take the mean of the rows and subtract it from the original matrix before computing the covariance matrix, i.e. the following
A = ...                 # numpy array with dimensions 1024 x 20000
mean_rows = A.mean(0)
new_A = A - mean_rows
Other times people take the mean of the columns and subtract that from the original matrix:
A = ...                 # numpy array with dimensions 1024 x 20000
mean_cols = A.mean(1, keepdims=True)
new_A = A - mean_cols
Now my question is: when are you supposed to do what? Say I have a dataset like my example, which of the methods would I use?
I've looked at a variety of websites such as https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/,
http://sebastianraschka.com/Articles/2014_pca_step_by_step.html
I think you're talking about normalizing the dataset to have zero mean. You should compute the mean across the axis that contains each observation.
In your example, you have 20,000 observations with 1,024 dimensions each, and your matrix lays out each observation as a column, so you should compute the mean over all the columns (the mean image) and subtract it from each column.
In code that would be:
A = A - A.mean(axis=1, keepdims=True)
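If it helps, a sketch of the full centring-plus-covariance step for the observations-as-columns layout might look like this (the shapes are smaller than 1024 x 20000 so the example runs quickly, and the variable names are just illustrative):
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((64, 500))                 # rows are pixels, columns are observations

# Subtract the mean image (mean over the observation axis) from every column.
A_centred = A - A.mean(axis=1, keepdims=True)

# Pixel-by-pixel covariance matrix, shape (64, 64).
cov = A_centred @ A_centred.T / (A.shape[1] - 1)

# The principal components are the eigenvectors of this covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)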

How to generate a random covariance matrix in Python?

So I would like to generate a 50 X 50 covariance matrix for a random variable X given the following conditions:
one variance is 10 times larger than the others
the parameters of X are only slightly correlated
Is there a way of doing this in Python/R etc? Or is there a covariance matrix that you can think of that might satisfy these requirements?
Thank you for your help!
OK, you only need one matrix and randomness isn't important. Here's a way to construct a matrix according to your description. Start with an identity matrix 50 by 50. Assign 10 to the first (upper left) element. Assign a small number (I don't know what's appropriate for your problem, maybe 0.1? 0.01? It's up to you) to all the other elements. Now take that matrix and square it (i.e. compute transpose(X) . X where X is your matrix). Presto! You've squared the eigenvalues so now you have a covariance matrix.
If the small element is small enough, X is already positive definite. But squaring guarantees it (assuming there are no zero eigenvalues, which you can verify by computing the determinant -- if the determinant is nonzero then there are no zero eigenvalues).
I assume you can find Python functions for these operations.
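For example, a sketch of that construction in numpy, reading "all the other elements" as the off-diagonal entries (note that squaring also squares the 10, so the large variance ends up roughly 100 times the others; start from sqrt(10) if you want a factor of exactly 10):
import numpy as np

n, small = 50, 0.01                # "small" controls how slight the correlation is

X = np.eye(n)
X[X == 0] = small                  # small off-diagonal entries
X[0, 0] = 10.0                     # one variance much larger than the others

cov = X.T @ X                      # squaring the eigenvalues guarantees a valid covariance matrix

assert np.allclose(cov, cov.T)                   # symmetric
assert np.all(np.linalg.eigvalsh(cov) > 0)       # positive definite
print(np.round(np.diag(cov)[:3], 3))             # first variance dominates the rest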

How do I compute the variance of a column of a sparse matrix in Scipy?

I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:
E[X^2] - (E[X])^2
E[X] stands for the mean. So to calculate E[X^2] you would have to square the csc_matrix elementwise and then use the mean function. To get (E[X])^2 you simply need to square the result of the mean function applied to the original input.
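A sketch of that formula applied column-wise; the random sparse input is only there to make the example self-contained, and X.multiply(X) is used for the elementwise square because ** on a csc_matrix means matrix power:
import numpy as np
from scipy.sparse import random as sparse_random

X = sparse_random(1000, 50, density=0.1, format="csc")

mean = np.asarray(X.mean(axis=0)).ravel()                    # E[X] per column
mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()     # E[X^2] per column

var = mean_sq - mean**2                                      # E[X^2] - (E[X])^2
std = np.sqrt(var)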
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower than converting the whole matrix at once):
import numpy as np

# mat is the sparse matrix
cols = mat.shape[1]                 # number of columns
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X)
Then the variances are in the var_ attribute:
X_var = scaler.var_
The curious thing, though, is that when I densified first using pandas (which is very slow), my answer was off by a few percent. I don't know which is more accurate.
The efficient way is actually to densify the entire matrix, then standardize it in the usual way with
X = X.toarray()
X -= X.mean()
X /= X.std()
As @Sebastian has noted in his comments, standardizing destroys the sparsity structure (it introduces lots of non-zero elements) in the subtraction step, so there is no point keeping the matrix in a sparse format.
