Scale (apply function?) sparse matrix logarithmically - python

I am using scikit-learn preprocessing scaling for sparse matrices.
My goal is to "scale" each feature column by taking the logarithm with the column's maximum value as the base. My wording may be inexact, so let me explain with an example.
Say a feature column has the values 0, 8, 2:
Max value = 8
Log-8 of feature value 0 should be 0.0 = math.log(0+1, 8+1) (the +1 is to cope with zeros; so yes, we are actually taking log-base 9)
Log-8 of feature value 8 should be 1.0 = math.log(8+1, 8+1)
Log-8 of feature value 2 should be 0.5 = math.log(2+1, 8+1)
Yes, I can easily apply an arbitrary function with FunctionTransformer, but I want the base of the logarithm to change with each column (specifically, to be that column's maximum value). That is, I want something like MaxAbsScaler, only with logarithms.
I see that MaxAbsScaler first computes a vector (scale) of the maximum absolute values of each column and then multiplies the original matrix by 1 / scale.
However, I don't know what to do if I want to take logarithms based on the scale vector. Is it possible to turn the logarithm operation into a multiplication, or are there other efficient scipy sparse operations I could use?
I hope my intent is clear (and possible).

The logarithm of x in base b is the same as log(x)/log(b), where the logs are natural. So the process you describe amounts to first applying the log(x+1) transformation to everything and then scaling by the maximum absolute value: since log is increasing, the column maximum of log(x+1) is log(max+1), which is exactly the log of the base you want. Conveniently, log(x+1) is a built-in function, log1p. Example:
from sklearn.preprocessing import FunctionTransformer, maxabs_scale
from scipy.sparse import csc_matrix
import numpy as np
logtran = FunctionTransformer(np.log1p, accept_sparse=True)
X = csc_matrix([[ 1., 0, 8], [ 2., 0, 0], [ 0, 1., 2]])
Y = maxabs_scale(logtran.transform(X))
Output (sparse matrix Y):
(0, 0) 0.630929753571
(1, 0) 1.0
(2, 1) 1.0
(0, 2) 1.0
(2, 2) 0.5
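As a sanity check on the identity above, the following sketch reproduces the same scaling using only scipy sparse operations and compares two entries against math.log with a per-column base. It assumes non-negative data (so the column maximum equals the maximum absolute value); the names L, col_max and Y_check are illustrative only.

```python
import math

import numpy as np
from scipy.sparse import csc_matrix, diags

X = csc_matrix([[1., 0, 8], [2., 0, 0], [0, 1., 2]])

# Elementwise log(1 + x); zeros stay zero, so sparsity is preserved.
L = X.log1p()

# Column maxima of the transformed data, i.e. log(max + 1) per column.
col_max = L.max(axis=0).toarray().ravel()
col_max[col_max == 0] = 1.0  # leave all-zero columns untouched

# Dividing by log(max + 1) changes the base of the logarithm per column.
Y_check = (L @ diags(1.0 / col_max)).toarray()

print(Y_check[0, 0], math.log(1 + 1, 2 + 1))  # both are log base 3 of 2
print(Y_check[2, 2], math.log(2 + 1, 8 + 1))  # both are log base 9 of 3
```

The multiplication by a diagonal matrix is the same trick MaxAbsScaler uses, which is why the logarithm can be handled as "transform, then scale".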

Related

Scipy CSR matrix: subtract only from non-zero values

I have a large matrix that I want to perform calculations on.
To make things easier to understand, here are examples of what I want to do using smaller data:
I use a sparse CSR matrix like this (shape of the actual matrix is (9000, 900)):
x = sp.csr_matrix(np.array([[1,0,2],[1,1,0]]))
# > (0, 0) 1
# (0, 2) 2
# (1, 0) 1
# (1, 1) 1
I then have a vector of appropriate shape that I want to subtract from the matrix (shape of the actual vector is (9000,)):
y = np.array([1, 1])
res = x - np.array([y]).T
# > matrix([[ 0, -1, 1],
# [ 0, 0, -1]])
This also subtracts from values that are zero in the sparse matrix, but I want to subtract only from non-zero values. I tried using scipy's .nonzero(), like this:
x[x.nonzero()] - np.array([y]).T
which works on this small example, but when I try it on my actual data more than 32 GB of RAM are being used. Performing the calculation without .nonzero() works perfectly fine and barely takes a second.
What is an efficient way of performing the subtraction only on non-zero values?
EDIT:
I realized that my question is a bit unclear, and I have also found the solution, so here is a clarification of the question and then an answer:
I have a matrix with 9000 rows and a column-vector with 9000 rows. I wanted to subtract the value in a row of the column vector from all non-zero values of the corresponding matrix row. So for my example matrix sp.csr_matrix(np.array([[1,0,2],[1,1,0]])) and vector np.array([1, 1]), the result should be
[[0 0 1]
[0 0 0]]
I thought that my attempt at using .nonzero() calculated just that, but I was wrong. However, I found the correct way of doing it. This is the working code, which also does not cause any RAM issues:
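import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix(np.array([[1,0,2],[1,1,0]]))
y = np.array([1,1])
nz = x.nonzero()    # (row indices, column indices) of the non-zero entries
x[nz] -= y[nz[0]]   # subtract each row's value of y from that row's non-zero entries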
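An alternative sketch, not from the original post, is to operate on the CSR matrix's internal arrays directly: x.data holds the stored values row by row and x.indptr marks where each row starts, so the per-row value of y can be repeated once per stored entry. One difference to be aware of: this touches every stored entry, including explicitly stored zeros, whereas x.nonzero() skips those.

```python
import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix(np.array([[1, 0, 2], [1, 1, 0]]))
y = np.array([1, 1])

# Number of stored entries in each row.
entries_per_row = np.diff(x.indptr)

# Repeat y[i] once for every stored entry of row i and subtract in place.
x.data -= np.repeat(y, entries_per_row)

print(x.toarray())
```

Entries that become zero remain stored explicitly; calling x.eliminate_zeros() afterwards drops them if that matters.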

How to cluster very big sparse data set using low memory in Python?

I have data which forms a sparse matrix in shape of 1000 x 1e9. I want to cluster the 1000 examples into 10 clusters using K-means.
The matrix is very sparse: fewer than 1 in 1e6 values are non-zero.
My laptop has 16 GB of RAM. I tried a sparse matrix in scipy. Unfortunately, the matrix makes the clustering process need much more memory than I have. Could anyone suggest a way to do this?
My system crashed when running the following test snippet
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
row = np.array([0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8])
col = np.array([0, 2, 2, 0, 1, 2] * 3)
data = np.array([1, 2, 3, 4, 5, 6] * 3)
X = csr_matrix((data, (row, col)), shape=(9, int(1e9)))
resC = KMeans(n_clusters=3).fit(X)
resC.labels_
Any helpful suggestion is appreciated.
KMeans centers will not be sparse anymore, so this would need careful optimization for the sparse case (that may be costly for the usual case, so it probably isn't optimized this way).
You can try ELKI (Java, not Python), which is often much faster and also has sparse data types. Using single-precision floats will also help.
But in the end, the results will be questionable: k-means is statistically rooted in least-squares. It assumes your data is coming from k signals plus some Gaussian error. Because your data is sparse, it obviously does not have this kind of Gaussian shape. When the majority of values is 0, it cannot be a Gaussian.
With just 1000 data points, I'd rather use HAC.
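A rough sketch of what HAC could look like with scipy, assuming the pairwise distances between samples can be computed (here via the Gram matrix X X^T, which for 1000 samples is only 1000 x 1000). The toy matrix, the Euclidean distance, and average linkage are illustrative assumptions, not recommendations for your data. For a matrix as wide as the one in the question, the product itself may need care.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.sparse import csr_matrix
from scipy.spatial.distance import squareform

# Toy data: two pairs of nearby samples.
X = csr_matrix(np.array([[1.0, 0, 0, 0],
                         [1.1, 0, 0, 0],
                         [0, 0, 5.0, 0],
                         [0, 0, 5.2, 0]]))

# Squared Euclidean distances from the Gram matrix:
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
G = (X @ X.T).toarray()
sq_norms = np.diag(G)
D2 = sq_norms[:, None] + sq_norms[None, :] - 2 * G
D = np.sqrt(np.maximum(D2, 0))  # clip tiny negative rounding errors
np.fill_diagonal(D, 0.0)

# Average-linkage HAC on the condensed distance vector, cut into 2 clusters.
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```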
Whatever you do (for your data, given your memory constraints): k-means is not ready for that!
This includes:
- Online KMeans / MiniBatch KMeans, as proposed in another answer: it only helps to handle many samples (and is hurt by the same effect mentioned below)!
- Various KMeans implementations in different languages (it's an algorithmic problem, not bound to an implementation)
Ignoring potential theoretical reasons (high dimensionality and non-convex heuristic optimization), I'm just mentioning the practical problem here:
- your centroids might become non-sparse! (mentioned in a side note by SO's clustering expert; that link also mentions alternatives)
- this means the sparse data structures used will get very non-sparse and eventually blow up your memory! (I changed sklearn's code to observe what the above link already mentioned)
- relevant sklearn code: center_shift_total = squared_norm(centers_old - centers)
Even if you remove or turn off all the memory-heavy components, such as:
- init=some_sparse_ndarray (instead of k-means++)
- n_init=1 instead of 10
- precompute_distances=False instead of True (unclear if it helps)
- n_jobs=1 instead of -1
the above will still be the problem you have to deal with!
Although KMeans accepts sparse matrices as input, the centroids used within the algorithm have a dense representation, and your feature space is so big that even 10 centroids will not fit into 16GB of RAM.
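The arithmetic behind that statement, as a quick back-of-the-envelope check (assuming one dense floating-point value per centroid coordinate and nothing else):

```python
n_clusters = 10
n_features = 10**9

bytes_float64 = n_clusters * n_features * 8  # 8 bytes per double
bytes_float32 = n_clusters * n_features * 4  # 4 bytes per single

print(bytes_float64 / 1e9, "GB with float64")  # 80.0 GB
print(bytes_float32 / 1e9, "GB with float32")  # 40.0 GB
```

Even in single precision, the centroids alone exceed 16 GB.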
I have two ideas:
1. Can you fit the clustering into RAM if you discard all empty columns? If you have 1000 samples and only about 1 in 1e6 values is occupied, then there are at most about 1e6 non-zero entries, so fewer than 1 in 1000 columns can contain any non-zero entries.
2. Several clustering algorithms in scikit-learn will accept a matrix of distances between samples instead of the full data, e.g. sklearn.cluster.SpectralClustering. You could precompute the pairwise distances in a 1000x1000 matrix and pass that to your clustering algorithm instead. (I can't make a specific recommendation of a clustering method, or a suitable function to calculate the distances, as it will depend on your application.)
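A minimal sketch of the first idea, assuming the data is already a scipy CSR matrix. Rather than slicing columns, it renumbers the column indices that actually occur, so nothing proportional to the original number of columns is allocated. The toy data and the names used_cols and X_small are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

rows = np.array([0, 0, 1, 2])
cols = np.array([5, 999_999_999, 5, 123_456_789])
vals = np.array([1.0, 2.0, 3.0, 4.0])
X = csr_matrix((vals, (rows, cols)), shape=(3, 10**9))

# Distinct column ids that actually occur, and each entry's position among them.
used_cols, new_indices = np.unique(X.indices, return_inverse=True)

# Same values and row pointers, but with columns renumbered 0..len(used_cols)-1.
X_small = csr_matrix((X.data, new_indices, X.indptr),
                     shape=(X.shape[0], used_cols.size))

print(X_small.shape)      # (3, 3)
print(X_small.toarray())
```

Dropping all-zero columns does not change Euclidean distances between samples, so X_small can be clustered in place of X, and used_cols maps the compressed columns back to the original ones.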
Consider using a dict, since it only stores the values which were assigned. I guess a nice way to do this is to create a SparseMatrix object like this:
class SparseMatrix(dict):
    def __init__(self, mapping=()):
        # store each element of the initial sequence under its index
        dict.__init__(self, {i: mapping[i] for i in range(len(mapping))})

    # overriding this method makes never-assigned indexes return 0.0
    def __getitem__(self, i):
        try:
            return dict.__getitem__(self, i)
        except KeyError:
            return 0.0

>>> my_matrix = SparseMatrix([1,2,3])
>>> my_matrix[0]
1
>>> my_matrix[5]
0.0
Edit:
For the multi-dimensional case you need to override the two item-access methods as follows:
    def __setitem__(self, ij, value):
        i, j = ij
        dict.__setitem__(self, i*self.n + j, value)

    def __getitem__(self, ij):
        try:
            i, j = ij
            return dict.__getitem__(self, i*self.n + j)
        except KeyError:
            return 0.0

>>> my_matrix[0,0] = 10
>>> my_matrix[1,2]
0.0
>>> my_matrix[0,0]
10
This assumes you defined self.n as the length of the matrix rows (the number of columns).
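Put together, a self-contained version of the two-dimensional case might look like the sketch below. The class name SparseMatrix2D and the n_cols constructor argument are hypothetical choices made for this illustration.

```python
class SparseMatrix2D(dict):
    """Dict-backed 2-D matrix that stores only the entries that were assigned."""

    def __init__(self, n_cols):
        super().__init__()
        self.n = n_cols  # row length, used to flatten (i, j) into a single key

    def __setitem__(self, ij, value):
        i, j = ij
        super().__setitem__(i * self.n + j, value)

    def __getitem__(self, ij):
        i, j = ij
        try:
            return super().__getitem__(i * self.n + j)
        except KeyError:
            return 0.0  # never-assigned entries read as zero


m = SparseMatrix2D(n_cols=3)
m[0, 0] = 10
print(m[0, 0], m[1, 2], len(m))  # 10 0.0 1
```

Only one entry is stored, regardless of how many positions are read.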

Scipy fitting polynomial model to some data

I am trying to find an appropriate function for the permeability of cells under varying conditions. If I assume constant permeability, I can fit it to the experimental data and use sklearn's PolynomialFeatures together with a LinearModel (as explained in this post) to determine a correlation between the conditions and the permeability. However, the permeability is not constant, so I am now trying to fit my model with the permeability as a function of the process conditions. The PolynomialFeatures module of sklearn is quite nice to use.
Is there an equivalent function within scipy or numpy which allows me to create a polynomial model (including interaction terms e.g. a*x[0]*x[1] etc.) of varying order without writing the whole function by hand ?
The standard polynomial class in numpy seems not to support interaction terms.
I'm not aware of such a function that does exactly what you need, but you can achieve it using a combination of itertools and numpy.
If you have n_features predictor variables, you essentially must generate all vectors of length n_features whose entries are non-negative integers that sum to the specified order. Each new feature column is obtained by raising the original features component-wise to the powers in one of these vectors and multiplying the results.
For example, if order = 3 and n_features = 2, one of the new features will be the old features raised to the respective powers [2,1], i.e. x[0]**2 * x[1]. I've written some code below for arbitrary order and number of features. I've adapted the generation of vectors that sum to order from this post.
import itertools
import numpy as np
from scipy.special import binom

def polynomial_features_with_cross_terms(X, order=2):
    """
    X: numpy ndarray
        Matrix of shape, `(n_samples, n_features)`, to be transformed.
    order: integer, default 2
        Order of polynomial features to be computed.
    returns: T, powers.
        `T` is a matrix of shape, `(n_samples, n_poly_features)`.
        Note that `n_poly_features` is equal to:
            `n_features+order-1` Choose `n_features-1`
        See: https://en.wikipedia.org/wiki/Stars_and_bars_%28combinatorics%29#Theorem_two
        `powers` is a matrix of shape, `(n_features, n_poly_features)`.
        Each column specifies the power by row of the respective feature,
        in the respective column of `T`.
    """
    n_samples, n_features = X.shape
    n_poly_features = int(binom(n_features+order-1, n_features-1))
    powers = np.zeros((n_features, n_poly_features))
    T = np.zeros((n_samples, n_poly_features), dtype=X.dtype)
    combos = itertools.combinations(range(n_features+order-1), n_features-1)
    for i,c in enumerate(combos):
        # gaps between consecutive "bars" give the exponent of each feature
        powers[:,i] = np.array([
            b-a-1 for a,b in zip((-1,)+c, c+(n_features+order-1,))
        ])
        T[:,i] = np.prod(np.power(X, powers[:,i]), axis=1)
    return T, powers
Here's some example usage:
>>> X = np.arange(-5,5).reshape(5,2)
>>> T,p = polynomial_features_with_cross_terms(X, order=3)
>>> print(X)
[[-5 -4]
 [-3 -2]
 [-1  0]
 [ 1  2]
 [ 3  4]]
>>> print(p)
[[0. 1. 2. 3.]
 [3. 2. 1. 0.]]
>>> print(T)
[[ -64  -80 -100 -125]
 [  -8  -12  -18  -27]
 [   0    0    0   -1]
 [   8    4    2    1]
 [  64   48   36   27]]
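Note that the function above produces only the terms whose total degree is exactly order, whereas sklearn's PolynomialFeatures by default also includes all lower degrees and a constant column. If that is what you need, one possible sketch (the function name polynomial_features_up_to is made up for this example) builds each term from a multiset of column indices:

```python
import itertools

import numpy as np


def polynomial_features_up_to(X, order):
    """Return all monomials of the columns of X with total degree 0..order."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    columns = []
    for degree in range(order + 1):
        # Each multiset of column indices of size `degree` is one monomial,
        # e.g. (0, 0, 1) stands for x[0] * x[0] * x[1].
        for idx in itertools.combinations_with_replacement(range(n_features), degree):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


P = polynomial_features_up_to(np.array([[2, 3]]), order=2)
print(P)  # columns: 1, x0, x1, x0^2, x0*x1, x1^2
```

The resulting matrix can be passed to any linear least-squares routine, such as numpy.linalg.lstsq.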
Finally, I should mention that the SVM polynomial kernel achieves much the same effect without explicitly computing the polynomial map. There are of course pros and cons to this, but I figured I should mention it for you to consider if you have not yet.

Should I store the values in dictionary or compute on-the-fly?

I have a problem where I have tuples called state and action and I want to compute their "binary features". The functions that compute the features of a state and action are given below. Mind you, this is just toy code.
I have about 700,000 combinations of states and actions. I also need to have the features as a numpy array or scipy sparse matrix.
Now, the problem is that I have to compute the features of states and actions millions of times. I have two options in mind.
One option is to compute beforehand using the function below all the 700,000 combinations and store it in a dictionary. The keys are (state, action) and the values are the binary features.
The other option is to call the function below every time I want to find the value of the binary feature of each state and action.
My objective is to get a good performance and also be memory efficient.
from numpy import array, uint8
from scipy import sparse

def compute_features(state, action):
    # state and action are 3-tuples of integers.
    # e.g. (1, 2, 3)
    return array(state) - array(action)

def compute_binary_features(state, action, coord_size):
    # e.g.
    # features = (1, 0, 2)
    # sizes = (2, 2, 3)
    # Meaning, the size of the first coordinate is 2, the second is 2 and the third is 3.
    # It means the first coordinate can only take the integer values 0 to 1.
    #
    # What this function does is turn (1, 0, 2) into binary features.
    # For the first coordinate, the value is 1 and the size is 2, so the binary
    # features of the first coordinate are (0, 1).
    # Second coordinate: the value is 0 and the size is 2. The binary features
    # are (1, 0).
    # Third coordinate: the value is 2 and the size is 3. The binary features are
    # (0, 0, 1).
    #
    # So the binary features of (1, 0, 2) are: (0, 1, 1, 0, 0, 0, 1)
    #
    # This function does not do concatenation but rather finds the positions of
    # the ones in the binary features of size sum(sizes).
    # Returns a COO sparse 0-1 valued 1 x n matrix.
    features = compute_features(state, action)
    coord_size = array(coord_size)
    col = []
    index = 0  # offset of the current coordinate's block
    for i in range(len(features)):
        # Assumes 0 <= features[i] < coord_size[i]; subtract a per-coordinate
        # minimum here if the features can be negative.
        col.append(index + features[i])
        index = index + coord_size[i]
    row = [0] * len(col)
    data = [1] * len(col)
    mtx = sparse.coo_matrix((data, (row, col)), (1, sum(coord_size)),
                            dtype=uint8)
    return mtx
If it is highly critical to return the results as fast as possible, then you should consider option one. However, you should keep in mind the memory and setup time overhead, which might be too expensive.
If performance is not an issue at all, you should prefer option two. This will make your code simpler and will not increase memory consumption and setup time unnecessarily.
If performance does play some role, but does not have to be as good as possible every single time, I suggest combining the two options.
To get the best of both worlds, you can use memoization. Basically, it means you calculate results on demand, but just before returning them, you put them in a dictionary, as you suggested. The function first looks for the result in the dictionary and calculates it only if necessary.
Here is a nice tutorial for implementing memoization in python.
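As a minimal sketch of that idea, the standard library's functools.lru_cache can do the dictionary bookkeeping. The function below is a simplified stand-in for the one in the question: it returns only the column positions of the ones, and COORD_SIZE and one_hot_columns are names invented for this example.

```python
from functools import lru_cache

COORD_SIZE = (2, 2, 3)  # hypothetical block sizes, as in the question's example


@lru_cache(maxsize=None)
def one_hot_columns(state, action):
    """Return the column positions of the ones for a (state, action) pair."""
    cols, offset = [], 0
    for s, a, size in zip(state, action, COORD_SIZE):
        cols.append(offset + (s - a))  # assumes 0 <= s - a < size
        offset += size
    return tuple(cols)


first = one_hot_columns((2, 1, 3), (1, 1, 1))   # computed: features are (1, 0, 2)
second = one_hot_columns((2, 1, 3), (1, 1, 1))  # served from the cache
print(first, one_hot_columns.cache_info().hits)  # (1, 2, 6) 1
```

This works because state and action are tuples and therefore hashable. Setting maxsize to a finite number bounds the memory used, at the cost of recomputing entries that were evicted.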

Numpy- weight and sum rows of a matrix

Using Python & Numpy, I would like to:
- Consider each row of an (n columns x m rows) matrix as a vector
- Weight each row (scalar multiplication on each component of the vector)
- Add each row to create a final vector (vector addition).
The weights are given in a regular numpy array, n x 1, so that each vector m in the matrix should be multiplied by weight n.
Here's what I've got (with test data; the actual matrix is huge), which is perhaps very un-Numpy and un-Pythonic. Can anyone do better? Thanks!
import numpy
# test data
mvec1 = numpy.array([1,2,3])
mvec2 = numpy.array([4,5,6])
start_matrix = numpy.matrix([mvec1,mvec2])
weights = numpy.array([0.5,-1])
#computation
wmatrix = [ weights[n]*start_matrix[n] for n in range(len(weights)) ]
vector_answer = [0,0,0]
for x in wmatrix: vector_answer+=x
Even though a technically correct answer has already been given, I'll give my straightforward answer:
from numpy import array, dot
dot(array([0.5, -1]), array([[1, 2, 3], [4, 5, 6]]))
# array([-3.5, -4. , -4.5])
This is much more in the spirit of linear algebra (and of the three bullet-point requirements at the top of the question).
Update:
And this solution is really fast, not marginally, but easily some 10-15x faster than the one already proposed!
It will be more convenient to use a two-dimensional numpy.array than a numpy.matrix in this case.
start_matrix = numpy.array([[1,2,3],[4,5,6]])
weights = numpy.array([0.5,-1])
final_vector = (start_matrix.T * weights).sum(axis=1)
# array([-3.5, -4. , -4.5])
The multiplication operator * does the right thing here due to NumPy's broadcasting rules.
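To see that the approaches agree, here is a small check on the test data from the question. The variable names are illustrative; the explicit loop is included only as a reference for what the weighted sum means.

```python
import numpy as np

start_matrix = np.array([[1, 2, 3], [4, 5, 6]])
weights = np.array([0.5, -1])

via_dot = np.dot(weights, start_matrix)                  # vector-matrix product
via_matmul = weights @ start_matrix                      # same product, operator form
via_broadcast = (start_matrix.T * weights).sum(axis=1)   # broadcasting, then row sums
via_loop = sum(w * row for w, row in zip(weights, start_matrix))  # reference

print(via_dot)
print(np.allclose(via_dot, via_matmul),
      np.allclose(via_dot, via_broadcast),
      np.allclose(via_dot, via_loop))
```

The dot-product forms avoid building the intermediate weighted matrix, which is likely where most of the speed difference on large inputs comes from.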
