Speeding up summing certain columns in a matrix - python

Question in short
Given a large sparse csr_matrix A and a numpy array B, what is the fastest way to construct a numpy matrix C, such that C[i,j] = sum(A[k,j]) for all k where B[k] == i?
Details of question
I found a solution to do this, but I am not really content with how long it takes. I will first explain the problem, then my solution, then show my code, and then show my timings.
Problem
I am working on a clustering algorithm in Python, and I'd like to speed it up. I have a sparse csr_matrix pam, in which I have per person per article how many items they bought of that article. Furthermore, I have a numpy array clustering, in which the cluster that person belongs to is denoted. Example:
pam pam.T clustering
article person
p [[1 0 0 0] a
e [0 2 0 0] r [[1 0 1 0 0 0] [0 0 0 0 1 1]
r [1 1 0 0] t [0 2 1 0 0 0]
s [0 0 1 0] i [0 0 0 1 0 1]
o [0 0 0 1] c [0 0 0 0 1 2]]
n [0 0 1 2]] l
e
What I like to calculate is acm: the amount of items all people in one cluster together bought. This amounts to, for every column i in acm, adding those columns p of pam.T for which clustering[p] == i.
acm
cluster
a
r [[2 0]
t [3 0]
i [1 1]
c [0 3]]
l
e
Solution
First, I create another sparse matrix pcm, in which I indicate per element [i,j] if person i is in cluster j. Result (when cast to dense matrix):
pcm
cluster
p [[False True]
e [False True]
r [ True False]
s [False True]
o [False True]
n [ True False]]
Next, I matrix multiply pam.T with pcm to get the matrix that I want.
Code
I wrote the following program to test the duration of this method in practice.
import numpy as np
from scipy.sparse.csr import csr_matrix
from timeit import timeit
def _clustering2pcm(clustering):
'''
Converts a clustering (np array) into a person-cluster matrix (pcm)
'''
N_persons = clustering.size
m_person = np.arange(N_persons)
clusters = np.unique(clustering)
N_clusters = clusters.size
m_data = [True] * N_persons
pcm = csr_matrix( (m_data, (m_person, clustering)), shape = (N_persons, N_clusters))
return pcm
def pam_clustering2acm():
'''
Convert a person-article matrix and a given clustering into an
article-cluster matrix
'''
global clustering
global pam
pcm = _clustering2pcm(clustering)
acm = csr_matrix.transpose(pam).dot(pcm).todense()
return acm
if __name__ == '__main__':
global clustering
global pam
N_persons = 200000
N_articles = 400
N_shoppings = 400000
N_clusters = 20
m_person = np.random.choice(np.arange(N_persons), size = N_shoppings, replace = True)
m_article = np.random.choice(np.arange(N_articles), size = N_shoppings, replace = True)
m_data = np.random.choice([1, 2], p = [0.99, 0.01], size = N_shoppings, replace = True)
pam = csr_matrix( (m_data, (m_person, m_article)), shape = (N_persons, N_articles))
clustering = np.random.choice(np.arange(N_clusters), size = N_persons, replace = True)
print timeit(pam_clustering2acm, number = 100)
Timing
It turns out that for these 100 runs, I need 5.1 seconds. 3.6 seconds of these are spent on creating pcm. I have the feeling there could be a faster way to calculate this matrix without creating a temporary sparse matrix, but I don't see one without looping. Is there a faster way of construction?
EDIT
After Martino's answer, I have tried to implement the loop over clusters and slicing algorithm, but that is even slower. It takes now 12.5 seconds to calculate acm 100 times, of which 4.1 seconds remain if I remove the line acm[:,i] = pam[p,:].sum(axis = 0).
def pam_clustering2acm_loopoverclusters():
global clustering
global pam
N_articles = pam.shape[1]
clusters = np.unique(clustering)
N_clusters = clusters.size
acm = np.zeros([N_articles, N_clusters])
for i in clusters:
p = np.where(clustering == i)[0]
acm[:,i] = pam[p,:].sum(axis = 0)
return acm

This is about 50x faster than your _clustering2pcm function:
def pcm(clustering):
n = clustering.size
data = np.ones((n,), dtype=bool)
indptr = np.arange(n+1)
return csr_matrix((data, clustering, indptr))
I haven't looked at the source code, but when you pass the CSR constructor the (data, (rows, cols)) structure, it is almost certainly using that to create a COO matrix, then converting it to CSR. Because your matrix is so simple, it is very easy to put the actual CSR matrix description arrays together as above, and skip all of that.
This almost cuts your execution time down by three:
In [38]: %timeit pam_clustering2acm()
10 loops, best of 3: 36.9 ms per loop
In [40]: %timeit pam.T.dot(pcm(clustering)).A
100 loops, best of 3: 12.8 ms per loop
In [42]: np.all(pam.T.dot(pcm(clustering)).A == pam_clustering2acm())
Out[42]: True

I refer you to the scipy.sparse docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix). Where they say the row slicing is efficient (as opposed to column splicing), so it is probably better to to stick to the non-transposed matrix. Then if you browse down there is a sum function where the axis can be specified. It is probably better to use the methods that come with your object as they are likely to use compiled code. This is at the cost of looping through clusters (of which I am assuming there are not too many).

Related

Pandas drop rows lower then others in all colums

I have a dataframe with a lot of rows with numerical columns, such as:
A
B
C
D
12
7
1
0
7
1
2
0
1
1
1
1
2
2
0
0
I need to reduce the size of the dataframe by removing those rows that has another row with all values bigger.
In the previous example i need to remove the last row because the first row has all values bigger (in case of dubplicate rows i need to keep one of them).
And return This:
A
B
C
D
12
7
1
0
7
1
2
0
1
1
1
1
My faster solution are the folowing:
def complete_reduction(df, columns):
def _single_reduction(row):
df["check"] = True
for col in columns:
df["check"] = df["check"] & (df[col] >= row[col])
drop_index.append(df["check"].sum() == 1)
df = df.drop_duplicates(subset=columns)
drop_index = []
df.apply(lambda x: _single_reduction(x), axis=1)
df = df[numpy.array(drop_index).astype(bool)]
return df
Any better ideas?
Update:
A new solution has been found here
https://stackoverflow.com/a/68528943/11327160
but i hope for somethings faster.
An more memory-efficient and faster solution than the one proposed so far is to use Numba. There is no need to create huge temporary array with Numba. Moreover, it is easy to write a parallel implementation that makes use of all CPU cores. Here is the implementation:
import numba as nb
#nb.njit
def is_dominated(arr, k):
n, m = arr.shape
for i in range(n):
if i != k:
dominated = True
for j in range(m):
if arr[i, j] < arr[k, j]:
dominated = False
if dominated:
return True
return False
# Precompile the function to native code for the most common types
#nb.njit(['(i4[:,::1],)', '(i8[:,::1],)'], parallel=True, cache=True)
def dominated_rows(arr):
n, m = arr.shape
toRemove = np.empty(n, dtype=np.bool_)
for i in nb.prange(n):
toRemove[i] = is_dominated(arr, i)
return toRemove
# Special case
df2 = df.drop_duplicates()
# Main computation
result = df2[~dominated_rows(np.ascontiguousarray(df.values))]
Benchmark
The input test is two random dataframes of shape 20000x5 and 5000x100 containing small integers (ie. [0;100[). Tests have been done on a (6-core) i5-9600KF processor with 16 GiB of RAM on Windows. The version of #BingWang is the updated one of the 2022-05-24. Here are performance results of the proposed approaches so far:
Dataframe with shape 5000x100
- Initial code: 114_340 ms
- BENY: 2_716 ms (consume few GiB of RAM)
- Bing Wang: 2_619 ms
- Numba: 303 ms <----
Dataframe with shape 20000x5
- Initial code: (too long)
- BENY: 8.775 ms (consume few GiB of RAM)
- Bing Wang: 578 ms
- Numba: 21 ms <----
This solution is respectively about 9 to 28 times faster than the fastest one (of #BingWang). It also has the benefit of consuming far less memory. Indeed, the #BENY implementation consume few GiB of RAM while this one (and the one of #BingWang) only consumes no more than few MiB for this used-case. The speed gain over the #BingWang implementation is due to the early stop, parallelism and the native execution.
One can see that this Numba implementation and the one of #BingWang are quite efficient when the number of column is small. This makes sense for the #BingWang since the complexity should be O(N(logN)^(d-2)) where d is the number of columns. As for Numba, it is significantly faster because most rows are dominated on the second random dataset causing the early stop to be very effective in practice. I think the #BingWang algorithm might be faster when most rows are not dominated. However, this case should be very uncommon on dataframes with few columns and a lot of rows (at least, clearly on uniformly random ones).
We can do numpy board cast
s = df.values
out = df[np.sum(np.all(s>=s[:,None],-1),1)==1]
Out[44]:
A B C D
0 12 7 1 0
1 7 1 2 0
2 1 1 1 1
Here is a try based on Kung et al 1975
http://www.eecs.harvard.edu/~htk/publication/1975-jacm-kung-luccio-preparata.pdf
Brutal force solution is from https://stackoverflow.com/a/68528943/11327160
I didn't robustly test it, but using these parameters it looks to be the same answer
There is no guarantee it is correct, or I am even following the paper. Please test thoroughly. In addition, there is very likely to be a commercial solution to calculate it.
D=5 #dimension, or number of columns
N=2000 #number of data rows
M=1000 #upper bound for random integers
Changing to D=20 and N=20000 you can see Kung75 completes in <1 minute but Brutal Force will use more than 10x the time.
Even at Dimension=1000,Rows=20000,value range 0~999, it can still complete slightly over 1 minute
This can be revised similar to merge sort (compute small chunks by brutal force, then merge up with Filter), which is easier to switch to parallel computing.
Another way of speeding up is to turn off array boundary check after you are comfortable with the code. This is due to heavy array indexing here. I would recommend C# if you want to try this path.
import pandas as pd
import numpy as np
import datetime
#generate fake data
D=1000 #dimension, or number of columns
N=20000 #number of data rows
M=1000 #upper bound for random integers
np.random.seed(12345) #set seed so this is reproducible
data=np.random.randint(0,M,(N,D))
for i in range(0,12):
print(i,data[i])
#Compare w and v starting dimention d
def Compare(w,v,d=0):
cmp=3 #0x11, low bit is GE, high bit is LE, together means EQ
while d<D:
if w[d]>v[d]:
cmp&=1
elif w[d]<v[d]:
cmp&=2
if cmp>0:
d+=1
else:
break
return cmp # 0=uncomparable, 1=GT, 2=LT, 3=EQ
#unit test:
#print(Compare(data[0],data[1]))
#print(Compare(data[0],data[1],4))
#print(Compare(data[1],data[11]))
#print(Compare(data[11],data[1]))
#print(Compare(data[1],data[1]))
def AuxSort(d,ndxArray): #stable sort desc by dimention d
return [x[1] for x in sorted([(-data[n][d],n) for n in ndxArray])]
#unit test
#print(AuxSort(data,0,[0,4,3]))
#print(AuxSort(data,2,[0,1,2]))
#cumulatively find the pareto front. Time O(N^2), space O(N)
def N2BrutalForce(data,ndxArray=None,d=0):
if len(data)==0:
return []
if not ndxArray: #by default check the entire data
ndxArray=list(range(len(data)))
#up to this point ndxArray is not empty
result={ndxArray[0]:data[ndxArray[0]]}
for i in range(1,len(ndxArray)):
dominated=[]
j=ndxArray[i]
for k,v in result.items():
c=Compare(data[j],v,d)
if c>1:
break
elif c==1:
dominated.append(k)
else:
for o in dominated:
del result[o]
result[j]=data[j]
return [r for r in result]
def resultPrinter(res, ShowCountOnly=False):
if not ShowCountOnly:
for r in sorted(res):
print(r,data[r])
print(len(res),'results found',datetime.datetime.today())
#unit rest
#resultPrinter(N2BrutalForce(data),True)
#resultPrinter(N2BrutalForce(data,list(range(15))))
def FindT(R1,R2,S1,S2,d):
S1R1=set(Filter(data,d,R1,S1))
T1=[s for s in S1 if s in S1R1]
S2R1=Filter(data,d+1,R1,S2)
S2R2=set(Filter(data,d,R2,S2))
T2=[s for s in S2R1 if s in S2R2]
return T1+T2
def BreakAtPseudoMedian(sArray,d):
sArray=AuxSort(d,sArray) #this could speed up by moving the sort to caller and avoid redo sorting
if data[sArray[0]][d]==data[sArray[-1]][d]:
return [],sArray
L=len(sArray)
mHigh=mLow=L//2
while mLow>0 and data[sArray[mLow]][d]==data[sArray[mLow-1]][d]:
mLow-=1
if mLow>0:
return sArray[:mLow],sArray[mLow:]
while mHigh<L-1 and data[sArray[mHigh]][d]==data[sArray[mHigh+1]][d]:
mHigh+=1
return sArray[:mHigh],sArray[mHigh:]
def Filter(data,d,rArray,sArray):
L=len(rArray)+len(sArray)
if d==D-1 and rArray:
R=max(data[r][d] for r in rArray)
return [s for s in sArray if data[s][d]>R]
elif len(rArray)*len(sArray)<=30 or len(rArray)<=2 or len(sArray)<=2:
nonDominated=[]
for s in sArray:
for r in rArray:
c=Compare(data[s],data[r],d)
if c>1:
break
else:
nonDominated.append(s)
return nonDominated
S1,S2=BreakAtPseudoMedian(sArray,d)
R1,R2=BreakAtRefValue(rArray,d,data[S2[0]][d])
if not S1 and not R1:
return Filter(data,d+1,rArray,sArray)
return FindT(R1,R2,S1,S2,d)
#Filter(data,0,[0,1,2,3,4,5,6,7,8,9],[11])
def BreakAtRefValue(rArray,d,br):
rArray=AuxSort(d,rArray)
if data[rArray[0]][d]<=br:
return [],rArray
if data[rArray[-1]][d]>br:
return rArray,[]
mLow,mHigh=0,len(rArray)-1
while mLow<mHigh-1 and data[rArray[mLow]][d]>br and data[rArray[mHigh]][d]<br:
mid=(mLow+mHigh)//2
if data[rArray[mid]][d]>br:
mLow=mid
elif data[rArray[mid]][d]<br:
mHigh=mid
else:
mLow=mid
break
if data[rArray[mLow]][d]>br and data[rArray[mHigh]][d]<br:
return rArray[:mHigh],rArray[mHigh:]
if data[rArray[mLow]][d]==br:
while data[rArray[mLow-1]][d]==br:
mLow-=1
return rArray[:mLow],rArray[mLow:]
while data[rArray[mHigh-1]][d]==br:
mHigh-=1
return rArray[:mHigh],rArray[mHigh:]
def Kung75(data,d,ndxArray):
L=len(ndxArray)
if L<10:
return N2BrutalForce(data,ndxArray,d)
elif d==D-1:
x,y=-1,-1
for n in ndxArray:
if y<0 or data[n][d]>x:
x,y=data[n][d],n
return [y]
if data[ndxArray[0]][d]==data[ndxArray[-1]][d]:
return Kung75(data,d+1,AuxSort(d+1,ndxArray))
R,S=BreakAtPseudoMedian(ndxArray,d)
R=Kung75(data,d,R)
S=Kung75(data,d,S)
T=Filter(data,d+1,R,S)
return R+T
print('started at',datetime.datetime.today())
resultPrinter(Kung75(data,0,AuxSort(0,list(range(len(data))))),True)
We take the cumulative maximum value per column in the dataframe.
We want to keep all rows that have a single column value that is equal to the maximum. We then drop duplicates using pandas drop_duplicates
In [14]: df = pd.DataFrame(
...: [[12, 7, 1, 0], [7, 1, 2, 0], [1, 1, 1, 1], [2, 2, 0, 0]],
...: columns=["A", "B", "C", "D"],
...: )
In [15]: df[(df == df.cummax(axis=0)).any(axis=1)].drop_duplicates()
Out[15]:
A B C D
0 12 7 1 0
1 7 1 2 0
2 1 1 1 1
df.sort_values(by=['A', 'B', 'C', 'D'], ascending=False, inplace=True)
df = df.iloc[:cutoff]
If this takes too long you could do it on subsets of the df until
it is small enough.

What's a more efficient way to calculate the max of each row in a matrix excluding its own column?

For a given 2D matrix np.array([[1,3,1],[2,0,5]]) if one needs to calculate the max of each row in a matrix excluding its own column, with expected example return np.array([[3,1,3],[5,5,2]]), what would be the most efficient way to do so?
Currently I implemented it with a loop to exclude its own col index:
n=x.shape[0]
row_max_mat=np.zeros((n,n))
rng=np.arange(n)
for i in rng:
row_max_mat[:,i] = np.amax(s_a_array_sum[:,rng!=i],axis=1)
Is there a faster way to do so?
Similar idea to yours (exclude columns one by one), but with indexing:
mask = ~np.eye(cols, dtype=bool)
a[:,np.where(mask)[1]].reshape((a.shape[0], a.shape[1]-1, -1)).max(1)
Output:
array([[3, 1, 3],
[5, 5, 2]])
You could do this using np.accumulate. Compute the forward and backward accumulations of maximums along the horizontal axis and then combine them with an offset of one:
import numpy as np
m = np.array([[1,3,1],[2,0,5]])
fmax = np.maximum.accumulate(m,axis=1)
bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
r = np.full(m.shape,np.min(m))
r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
print(r)
# [[3 1 3]
# [5 5 2]]
This will require 3x the size of your matrix to process (although you could take that down to 2x if you want an in-place update). Adding a 3rd&4th dimension could also work using a mask but that will require columns^2 times matrix's size to process and will likely be slower.
If needed, you can apply the same technique column wise or to both dimensions (by combining row wise and column wise results).
a = np.array([[1,3,1],[2,0,5]])
row_max = a.max(axis=1).reshape(-1,1)
b = (((a // row_max)+1)%2)
c = b*row_max
d = (a // row_max)*((a*b).max(axis=1).reshape(-1,1))
c+d # result
Since, we are looking to get max excluding its own column, basically the output would have each row filled with the max from it, except for the max element position, for which we will need to fill in with the second largest value. As such, argpartition seems would fit right in there. So, here's one solution with it -
def max_exclude_own_col(m):
out = np.full(m.shape, m.max(1, keepdims=True))
sidx = np.argpartition(-m,2,axis=1)
R = np.arange(len(sidx))
s0,s1 = sidx[:,0], sidx[:,1]
mask = m[R,s0]>m[R,s1]
L1c,L2c = np.where(mask,s0,s1), np.where(mask,s1,s0)
out[R,L1c] = m[R,L2c]
return out
Benchmarking
Other working solution(s) for large arrays -
# #Alain T.'s soln
def max_accum(m):
fmax = np.maximum.accumulate(m,axis=1)
bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
r = np.full(m.shape,np.min(m))
r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
return r
Using benchit package (few benchmarking tools packaged together; disclaimer: I am its author) to benchmark proposed solutions.
So, we will test out with large arrays of various shapes for timings and speedups -
In [54]: import benchit
In [55]: funcs = [max_exclude_own_col, max_accum]
In [170]: inputs = [np.random.randint(0,100,(100000,n)) for n in [10, 20, 50, 100, 200, 500]]
In [171]: T = benchit.timings(funcs, inputs, indexby='shape')
In [172]: T
Out[172]:
Functions max_exclude_own_col max_accum
Shape
100000x10 0.017721 0.014580
100000x20 0.028078 0.028124
100000x50 0.056355 0.089285
100000x100 0.103563 0.200085
100000x200 0.188760 0.407956
100000x500 0.439726 0.976510
# Speedups with max_exclude_own_col over max_accum
In [173]: T.speedups(ref_func_by_index=1)
Out[173]:
Functions max_exclude_own_col Ref:max_accum
Shape
100000x10 0.822783 1.0
100000x20 1.001660 1.0
100000x50 1.584334 1.0
100000x100 1.932017 1.0
100000x200 2.161241 1.0
100000x500 2.220725 1.0

fastest way to iterate a numpy array and update each element

This might be weird to you people, but I happen to have this weird goal to achieve, code goes as follows.
# A is a numpy array, dtype=int32,
# and each element is actually an ID(int), the ID range might be wide,
# but the actually existing values are quite fewer than the dense range,
A = array([[379621, 552965, 192509],
[509849, 252786, 710979],
[379621, 718598, 591201],
[509849, 35700, 951719]])
# and I need to map these sparse ID to dense ones,
# my idea is to have a dict, mapping actual_sparse_ID -> dense_ID
M = {}
# so I iterate this numpy array, and check if this sparse ID has a dense one or not
for i in np.nditer(A, op_flags=['readwrite']):
if i not in M:
M[i] = len(M) # sparse ID got a dense one
i[...] = M[i] # replace sparse one with the dense ID
My goal could be achieved with np.unique(A, return_inverse=True), and the return_inverse result is what I want.
However, the numpy array I have is too huge to fully load into memory, so I cannot run np.unique over the whole data, and this is why I came up with this dict-mapping idea...
Is this the right way to go? Any possible improvement?
I will make an attempt to provide an alternative way of doing this by using numpy.unique() on sub-arrays. This solution is not fully tested. I also did not do any side-by-side performance evaluation since your solution is not fully working for me.
Let's say we have an array c that we split into two smaller arrays. Let's create some test data, for example:
>>> a = np.array([[1,1,2,3,4],[1,2,6,6,2],[8,0,1,1,4]])
>>> b = np.array([[11,2,-1,12,6],[12,2,6,11,2],[7,0,3,1,3]])
>>> c = np.vstack([a, b])
>>> print(c)
[[ 1 1 2 3 4]
[ 1 2 6 6 2]
[ 8 0 1 1 4]
[11 2 -1 12 6]
[12 2 6 11 2]
[ 7 0 3 1 3]]
Here we assume that c is the large array and a and b are sub-arrays. Of course, one could build c first and then extract sub-arrays.
Next step is to run numpy.unique() on the two sub-arrays:
>>> ua, ia = np.unique(a, return_inverse=True)
>>> ub, ib = np.unique(b, return_inverse=True)
>>> uc, ic = np.unique(c, return_inverse=True) # this is for future reference
Now, here is an algorithm for combining the results from subarrays:
def merge_unique(ua, ia, ub, ib):
# make copies *if* changing inputs is undesirable:
ua = ua.copy()
ia = ia.copy()
ub = ub.copy()
ib = ib.copy()
# find differences between unique values in the two arrays:
diffab = np.setdiff1d(ua, ub, assume_unique=True)
diffba = np.setdiff1d(ub, ua, assume_unique=True)
# find indices in ua, ub where to insert "other" unique values:
ssa = np.searchsorted(ua, diffba)
ssb = np.searchsorted(ub, diffab)
# throw away values that are too large:
ssa = ssa[np.where(ssa < len(ua))]
ssb = ssb[np.where(ssb < len(ub))]
# increment indices past previously computed "insert" positions:
for v in ssa[::-1]:
ia[ia >= v] += 1
for v in ssb[::-1]:
ib[ib >= v] += 1
# combine results:
uc = np.union1d(ua, ub) # or use ssa, ssb, diffba, diffab to update ua, ub
ic = np.concatenate([ia, ib])
return uc, ic
Now, let's run this function on the results of numpy.unique() from sub-arrays and then compare merged indices and unique values with the reference results uc and ic:
>>> uc2, ic2 = merge_unique(ua, ia, ub, ib)
>>> np.all(uc2 == uc)
True
>>> np.all(ic2 == ic)
True
Splitting into more than two sub-arrays can be handled with little additional work - simply keep accumulating "unique" values and indices, like this:
uacc, iacc = np.unique(subarr1, return_inverse=True)
ui, ii = np.unique(subarr2, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr3, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr4, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
................................ (etc.)

Create a sparse matrix from a generator

I would like to create a big sparse matrix where its source data can't be fully loaded because of the memory issues. You may think that we have a very big file on disk and we can't read it.
I think about it but I couldn't find a way to create a sparse matrix from a generator.
from scipy.sparse import coo_matrix
matrix1 = coo_matrix(xrange(10)) # it works. Create a sparse matrix with 9 elements.
data = ((0, 1, random.randint(0,5)) for i in xrange(10)) # generator example
matrix2 = coo_matrix(data) # does not work.
Any idea?
Edit: I found this, haven't tried it yet but it looks helpful.
Here's an example of using a generator to populate a sparse matrix. I use the generator to fill a structured array, and create the sparse matrix from its fields.
import numpy as np
from scipy import sparse
N, M = 3,4
def foo(N,M):
# just a simple dense matrix of random data
cnt = 0
for i in xrange(N):
for j in xrange(M):
yield cnt, (i, j, np.random.random())
cnt += 1
dt = dt=np.dtype([('i',int), ('j',int), ('data',float)])
X = np.empty((N*M,), dtype=dt)
for cnt, tup in foo(N,M):
X[cnt] = tup
print X.shape
print X['i']
print X['j']
print X['data']
S = sparse.coo_matrix((X['data'], (X['i'], X['j'])), shape=(N,M))
print S.shape
print S.A
producing something like:
(12,)
[0 0 0 0 1 1 1 1 2 2 2 2]
[0 1 2 3 0 1 2 3 0 1 2 3]
[ 0.99268494 0.89277993 0.32847213 0.56583702 0.63482291 0.52278063
0.62564791 0.15356269 0.1554067 0.16644956 0.41444479 0.75105334]
(3, 4)
[[ 0.99268494 0.89277993 0.32847213 0.56583702]
[ 0.63482291 0.52278063 0.62564791 0.15356269]
[ 0.1554067 0.16644956 0.41444479 0.75105334]]
All of the nonzero data points will exist in memory in 2 forms - the fields of X, and the row,col,data arrays of the sparse matrix.
A structured array like X could also be loaded from the columns of a csv file.
A couple of the sparse matrix formats let you set data elements, e.g.
S = sparse.lil_matrix((N,M))
for cnt, tup in foo(N,M):
i,j,value = tup
S[i,j] = value
print S.A
sparse tells me that lil is the least expensive format for this type of assignment.

Calculating the null space of a matrix

I'm attempting to solve a set of equations of the form Ax = 0. A is known 6x6 matrix and I've written the below code using SVD to get the vector x which works to a certain extent. The answer is approximately correct but not good enough to be useful to me, how can I improve the precision of the calculation? Lowering eps below 1.e-4 causes the function to fail.
from numpy.linalg import *
from numpy import *
A = matrix([[0.624010149127497 ,0.020915658603923 ,0.838082638087629 ,62.0778180312547 ,-0.336 ,0],
[0.669649399820597 ,0.344105317421833 ,0.0543868015800246 ,49.0194290212841 ,-0.267 ,0],
[0.473153758252885 ,0.366893577716959 ,0.924972565581684 ,186.071352614705 ,-1 ,0],
[0.0759305208803158 ,0.356365401030535 ,0.126682113674883 ,175.292109352674 ,0 ,-5.201],
[0.91160934274653 ,0.32447818779582 ,0.741382053883291 ,0.11536775372698 ,0 ,-0.034],
[0.480860406786873 ,0.903499596111067 ,0.542581424762866 ,32.782593418975 ,0 ,-1]])
def null(A, eps=1e-3):
u,s,vh = svd(A,full_matrices=1,compute_uv=1)
null_space = compress(s <= eps, vh, axis=0)
return null_space.T
NS = null(A)
print "Null space equals ",NS,"\n"
print dot(A,NS)
A is full rank --- so x is 0
Since it looks like you need a least-squares solution, i.e. min ||A*x|| s.t. ||x|| = 1, do the SVD such that [U S V] = svd(A) and the last column of V (assuming that the columns are sorted in order of decreasing singular values) is x.
I.e.,
U =
-0.23024 -0.23241 0.28225 -0.59968 -0.04403 -0.67213
-0.1818 -0.16426 0.18132 0.39639 0.83929 -0.21343
-0.69008 -0.59685 -0.18202 0.10908 -0.20664 0.28255
-0.65033 0.73984 -0.066702 -0.12447 0.088364 0.0442
-0.00045131 -0.043887 0.71552 -0.32745 0.1436 0.59855
-0.12164 0.11611 0.5813 0.59046 -0.47173 -0.25029
S =
269.62 0 0 0 0 0
0 4.1038 0 0 0 0
0 0 1.656 0 0 0
0 0 0 0.6416 0 0
0 0 0 0 0.49215 0
0 0 0 0 0 0.00027528
V =
-0.002597 -0.11341 0.68728 -0.12654 0.70622 0.0050325
-0.0024567 0.018021 0.4439 0.85217 -0.27644 0.0028357
-0.0036713 -0.1539 0.55281 -0.4961 -0.6516 0.00013067
-0.9999 -0.011204 -0.0068651 0.0013713 0.0014128 0.0052698
0.0030264 0.17515 0.02341 -0.020917 -0.0054032 0.98402
0.012996 -0.96557 -0.15623 0.10603 0.014754 0.17788
So,
x =
0.0050325
0.0028357
0.00013067
0.0052698
0.98402
0.17788
And, ||A*x|| = 0.00027528 as opposed to your previous solution for x where ||A*x_old|| = 0.079442
Attention: There might be confusion with the SVD in python vs. matlab-syntax(?):
in python, numpy.linalg.svd(A) returns matrices u,s,v such that u*s*v = A
(strictly: dot(u, dot(diag(s), v) = A, because s is a vector and not a 2D-matrix in numpy).
The uppermost Answer is correct in that sense that usually you write u*s*vh = A and vh is returned, and this answer discusses v AND NOT vh.
To make a long story short: if you have matrices u,s,v such that u*s*v = A, then the last rows of v, not the last colums of v, describe the nullspace.
Edit: [for people like me:] each of the last rows is a vector v0 such that A*v0 = 0 (if the corresponding singular value is 0)

Categories