I am trying to create a sparse matrix representation of a random hash map h: [n] -> [d] which maps each i to exactly s random locations out of the d available locations; the values at those locations are drawn from some discrete distribution.
:param d: number of bins
:param n: number of items hashed
:param s: sparsity of each column
:param distribution: distribution object.
Here is my attempt:
# (needs: import time, numpy, scipy.stats, scipy.sparse and from math import sqrt)
start_time = time.time()
distribution = scipy.stats.rv_discrete(values=([-1.0, +1.0], [0.5, 0.5]), name='dist')
data = (1.0/sqrt(self._s)) * distribution.rvs(size=self._n*self._s)
col = numpy.empty(self._s*self._n)
for i in range(self._n):
    col[i*self._s:(i+1)*self._s] = i
row = numpy.empty(self._s*self._n)
print time.time()-start_time
for i in range(self._n):
    row[i*self._s:(i+1)*self._s] = numpy.random.choice(self._d, self._s, replace=False)
S = scipy.sparse.csr_matrix((data, (row, col)), shape=(self._d, self._n))
print time.time()-start_time
return S
Now, creating this map for n=500000, s=10, d=1000 takes around 20 s on my decent workstation, and about 90% of that time is spent generating the row indices. Is there anything I can do to speed this up? Any alternatives? Thanks.
col = numpy.empty(self._s*self._n)
for i in range(self._n):
    col[i*self._s:(i+1)*self._s] = i
looks like something that could be written as one non-looping expression; though it probably isn't a big time consumer
My first guess (I'd need to play with this to be sure) is that the loop just gives all _s entries of column i the same column index, which broadcasting can do in one shot:
col = np.empty((self._n, self._s))
col[:, :] = np.arange(self._n)[:, None]
col = col.ravel()
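Equivalently, if I'm reading the loop right, the whole thing is a one-liner:
col = np.repeat(np.arange(self._n), self._s)   # e.g. np.repeat(np.arange(3), 2) -> [0, 0, 1, 1, 2, 2]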
Something similar for:
for i in range(self._n):
    row[i*self._s:(i+1)*self._s] = numpy.random.choice(self._d, self._s, replace=False)
is, I think, picking _s values out of _d, repeated _n times. Doing the no-replacement draw along _s while allowing repeats across the _n iterations could be tricky to vectorize.
Without running the code myself (with a smaller n) I'm stumbling around a bit. Which is the slow part: generating col, generating row, or the final csr? Iterating n=500000 times is going to be slow.
The matrix will be (1000, 500000), but with 10*500000 nonzero items, so a density of 0.01. Just for comparison, it would be interesting to generate a sparse random matrix of similar size and density:
In [5]: %timeit sparse.random(1000, 500000, .01)
1 loop, best of 3: 24.6 s per loop
and the dense random choices:
In [8]: timeit np.random.choice(1000,(10,500000)).shape
10 loops, best of 3: 53 ms per loop
In [9]: np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
Out[9]: (500000, 10)
In [10]: timeit np.array([np.random.choice(1000,(10,)) for i in range(500000)]).shape
1 loop, best of 3: 12.7 s per loop
So, yes, the large iteration loop is expensive. But given the replacement policy there might not be a way around that. Or is there?
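One possible way around it, if an (n, d) block of randoms fits in memory (a sketch I haven't timed): draw one uniform per (item, bin) and argsort each row, which turns every row into a random permutation of the d bins, so its first _s entries are a without-replacement sample:
u = np.random.rand(self._n, self._d)           # ~4 GB of float64 for n=500000, d=1000; chunk over n if that's too much
row = u.argsort(axis=1)[:, :self._s].ravel()   # first _s of each random permutation: no repeats within a row
# np.argpartition(u, self._s, axis=1)[:, :self._s] makes the same selection with less sorting work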
So at a first guess, creating row takes half the time and creating the sparse matrix the other half. I'm not surprised. You are using the coo style of input, which requires lexsorting and summing of duplicates when converting to csr. We might be able to gain speed by using the indptr style of input: there won't be duplicates to sum, and since there are consistently 10 nonzero terms per row (oops, that's the transpose: per column), generating the indptr values won't be hard. But I can't do that off the top of my head.
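For what it's worth, here is roughly what the indptr-style construction could look like (a sketch, untested): since every column holds exactly _s entries, laid out consecutively in data and row, the matrix can be built directly as csc from (data, indices, indptr), with no lexsort or duplicate summing:
indptr = numpy.arange(0, self._n*self._s + 1, self._s)   # column i owns entries indptr[i]:indptr[i+1]
S = scipy.sparse.csc_matrix((data, row, indptr), shape=(self._d, self._n))
# S.sort_indices()   # only if a later routine needs sorted row indices within each column
# S = S.tocsr()      # only if csr is really required downstream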
random sparse to csr is just a bit slower:
In [11]: %timeit sparse.random(1000, 500000, .01, 'csr')
1 loop, best of 3: 28.3 s per loop
Related
I have two large matrices (40000*4096) and I would like to compare and match each row of the first matrix against all of the rows of the second matrix; the output will have size (40000*40000). However, since I need to do this several thousand times, it is wildly time consuming: about 26k seconds per iteration, so for 5000 iterations ...
I would be glad if you could give me some smart suggestions. Thank you.
P.S. this is what I did so far for just one iteration (1 of 5000)
def matcher(Antigens, Antibodies, ind):
    temp = np.zeros((Antibodies.shape[0], Antibodies.shape[1]))
    output = np.zeros((Antibodies.shape[0], 1))
    for i in range(len(Antibodies)):
        temp[i] = np.int32(np.equal(Antigens[ind], Antibodies[i]))
        output[i] = np.sum(temp[i])
    return output

output = [matcher(Antigens, Antibodies, ind) for ind in range(len(Antigens))]
Okay, I think I understand what your goal is:
Count the number of row matches (antigen vs antibody matrix). Each entry of the resulting vector (40,000 x 1) represents the count of exact matches between one antigen row and all of the antibody rows (so values from 0 to 40,000).
I made some fake data:
import numpy as np
import numba as nb
num_mat = 5 # number of matrices
num_row = 10_000 # number of rows per matrix
num_elm = 4_096 # number of elements per row
dim = (num_mat,num_row,num_elm)
Antigens = np.random.randint(0,256,dim,dtype=np.uint8)
Antibodies = np.random.randint(0,256,dim,dtype=np.uint8)
There's one important point here, I reduced the matrices to the smallest datatype that can represent the data in order to reduce their memory foot-print. I'm not sure what your data looks like, but hopefully you can do this as well.
Also, the following code assumes your dimensions look like the fake data:
(number of matrices, rows, elements)
@nb.njit
def match_arr(arr1, arr2):
    for i in range(arr1.shape[0]):  # 4096 vs 4096
        if arr1[i] != arr2[i]:
            return False
    return True

@nb.njit
def match_mat_sum(ag, ab):
    out = np.zeros((ag.shape[0]))  # 40000
    for i in range(ag.shape[0]):
        tmp = 0
        for j in range(ab.shape[0]):
            tmp += match_arr(ag[i], ab[j])
        out[i] = tmp
    return out

@nb.njit(parallel=True)
def match_sets(Antigens, Antibodies):
    out = np.empty((Antigens.shape[0] * Antibodies.shape[0], Antigens.shape[1]))  # 5000 x 40000
    # multiprocessing per antigen matrix, may want to move this as suits your data
    for i in nb.prange(Antigens.shape[0]):
        for j in range(Antibodies.shape[0]):
            out[j + Antibodies.shape[0] * i] = match_mat_sum(Antigens[i], Antibodies[j])  # unique index per (i, j) to avoid race conditions
    return out
I lean on Numba heavily. One of the key optimizations is not to check the equivalence of entire rows with np.equal() but to write a custom function match_arr() that breaks as soon as it finds a mis-matched element. Hopefully, this lets us skip a ton of comparisons.
Time comparison:
%timeit match_arr(arr1, arr2)
314 ns ± 0.361 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.equal(arr1, arr2)
1.07 µs ± 5.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
match_mat_sum
This function simply calculates the middle step (the 40,000 x 1 vector) that represents the sum of exact matches between two matrices. This step reduces two matrices like: (m x n), (o x n) -> (m)
match_sets()
The last function parallelizes this operation with explicit parallel loops through nb.prange. You might want to move this function to a different loop depending on what your data looks like (like if you have one antigen matrix, but 5000 antibody matrices, you should move prange to the inner loop or you'll not be leveraging parallelization). The fake data assumes some antigen and some antibody matrices.
Another important thing to note here is the indexing of the out array. In order to avoid race conditions, each explicit loop needs to write to a unique space. Again, depending on your data, you'll need to index the proper "place" to put the result.
On a Ryzen 1600 (6-core) with 16 gigs of RAM, using this fake data, I generated a result in 10.2 seconds.
Your data is about 3200x larger. Assuming linear scaling, the full set would take approximately 9 hours, provided you have enough memory.
You could write some kind of batch loader as well, rather than loading 5000 giant matrices directly into memory.
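For example, something along these lines (a hypothetical sketch; it assumes each antibody matrix lives in its own .npy file, which may not match how your data is actually stored):
import numpy as np

def antibody_batches(npy_paths, batch_size=4):
    # yield a few antibody matrices at a time instead of holding all 5000 in RAM
    for i in range(0, len(npy_paths), batch_size):
        yield np.stack([np.load(p) for p in npy_paths[i:i + batch_size]])

# for batch in antibody_batches(paths):       # 'paths' is a hypothetical list of .npy filenames
#     partial = match_sets(Antigens, batch)   # reuse the numba function above on each batch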
This problem can be tackled with a mixture of numpy broadcasting and the numexpr module, which performs operations quickly while minimizing the storage of intermediate values:
import numexpr as ne
# expand arrays dimensions to support broadcasting when doing comparison
Antigens, Antibodies = Antigens[None, :, :], Antibodies[:, None, :]
output = ne.evaluate('sum((Antigens==Antibodies)*1, axis=2)')
# *1 is a hack because numexpr does not currently support sum on bool
This may be faster than your current solution, but for such large arrays it will take a while.
The performance of numexpr for this operations is a bit lackluster, but you can at least use broadcasting inside the loop:
output = np.zeros((Antibodies.shape[0],)*2, dtype=np.int32)
for row, out_row in zip(Antibodies, output):
    (row[None,:]==Antigens).sum(1, out=out_row)
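A tiny self-contained run of that loop (made-up shapes, and generalized to a non-square output just to make the layout obvious):
import numpy as np

Antigens = np.random.randint(0, 4, (6, 8))     # 6 antigen rows, 8 elements each
Antibodies = np.random.randint(0, 4, (5, 8))   # 5 antibody rows

output = np.zeros((Antibodies.shape[0], Antigens.shape[0]), dtype=np.int32)
for row, out_row in zip(Antibodies, output):
    (row[None, :] == Antigens).sum(1, out=out_row)

print(output.shape)                            # (5, 6): match counts of each antibody row against every antigen row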
I am writing code for proposing typo corrections using an HMM and the Viterbi algorithm. At some point, for each word in the text, I have to do the following (let's assume I have 10,000 words):
# FYI: Windows 10, 64-bit, Intel i7, 4 GB RAM, Python 2.7.3
import numpy as np
import pandas as pd
for k in range(10000):
    tempWord = corruptList20[k]  # temp word read from the list which has all of the words
    delta = np.zeros((26, len(tempWord)))
    sai = np.chararray((26, len(tempWord)))
    sai[:] = '#'

    # INITIALIZATION DELTA
    for i in range(26):
        delta[i][0] = # CALCULATION: matrix read and multiplication; each cell is different
    # INITIALIZATION END

    # 6. DELTA CALCULATION
    for deltaIndex in range(1, len(tempWord)):
        for j in range(26):
            tempDelta = 0.0
            maxDelta = 0.0
            maxState = ''
            for i in range(26):
                # CALCULATION to fill each cell involves:
                # 1- matrix read and multiplication
                # 2- finding the column max
                # logical operations and if-then-else operations

    # 7. SAI BACKWARD TRACKING
    delta2 = pd.DataFrame(delta)
    sai2 = pd.DataFrame(sai)

    proposedWord = np.zeros(len(tempWord), str)
    editId = 0
    for col in delta2.columns:
        # CALCULATION to fill each cell involves:
        # 1- matrix read and multiplication
        # 2- finding the column max
        # logical operations and if-then-else operations

    editList20.append(''.join(editWord))
# END OF LOOP
As you can see it is computationally involved, and when I run it, it takes too much time.
My laptop was recently stolen, so currently I run this on Windows 10, 64-bit, 4 GB RAM, Python 2.7.3.
My question: can anybody see anything I could optimize? Do I have to delete the matrices I created in the loop before the loop goes to the next round in order to free memory, or is this done automatically?
Edit: after the comments below and switching from range to xrange, the performance increased by almost 30%.
I don't think the range discussion makes much difference. In Python 3, where range returns a lazy range object, expanding it into a list before iterating doesn't change the time much:
In [107]: timeit for k in range(10000):x=k+1
1000 loops, best of 3: 1.43 ms per loop
In [108]: timeit for k in list(range(10000)):x=k+1
1000 loops, best of 3: 1.58 ms per loop
With numpy and pandas the real key to speeding up loops is to replace them with compiled operations that work on the whole array or dataframe. But even in pure Python, focus on streamlining the contents of the iteration, not the iteration mechanism.
======================
for i in range(26):
    delta[i][0] = # CALCULATION: matrix read and multiplication
A minor change: delta[i, 0] = ...; this is the array way of addressing a single element. Functionally it is often the same, but the intent is clearer. But think: can't you set all of that column at once?
delta[:,0] = ...
====================
N = len(tempWord)
delta = np.zeros((26, N))
etc
In tight loops, temporary variables like this can save time. This isn't a tight loop, so here it just adds clarity.
===========================
This is one ugly nested triple loop; admittedly 26 steps isn't large, but 26*26*N is:
for deltaIndex in range(1, N):
    for j in range(26):
        tempDelta = 0.0
        maxDelta = 0.0
        maxState = ''
        for i in range(26):
            # CALCULATION
            # 1- matrix read and multiplication
            # 2- finding the column max
            # logical operations and if-then-else operations
Focus on replacing this with array operations. It's those 3 commented lines that need to be changed, not the iteration mechanism.
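For what that could look like: assuming the hidden calculation is the standard (multiplicative) Viterbi recursion, with a 26x26 transition matrix A and per-position emission scores B (both assumptions on my part, since the real calculation is only described in the comments), the whole 26x26 inner block collapses to a couple of array operations per position:
import numpy as np

N = 7                        # stand-in for len(tempWord)
A = np.random.rand(26, 26)   # assumed: transition score from state i to state j
B = np.random.rand(26, N)    # assumed: emission score of state j at position t

delta = np.zeros((26, N))
sai = np.zeros((26, N), dtype=np.int64)
delta[:, 0] = B[:, 0]        # stand-in initialization; substitute the real calculation

for t in range(1, N):
    scores = delta[:, t - 1, None] * A          # (26, 26): every (previous, current) state pair at once
    sai[:, t] = scores.argmax(axis=0)           # best predecessor per current state (replaces the i loop)
    delta[:, t] = scores.max(axis=0) * B[:, t]  # column max times emission (replaces the j loop body)

# backward tracking from the best final state
path = [int(delta[:, -1].argmax())]
for t in range(N - 1, 0, -1):
    path.append(int(sai[path[-1], t]))
path.reverse()                                  # one state index per position of tempWord
If the cells are log-probabilities, the multiplications become additions; the max/argmax structure stays the same.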
================
Making proposedWord a list rather than an array might be faster. Small list operations are often faster than array ones, since numpy arrays have a creation overhead.
In [136]: timeit np.zeros(20,str)
100000 loops, best of 3: 2.36 µs per loop
In [137]: timeit x=[' ']*20
1000000 loops, best of 3: 614 ns per loop
You have to be careful when creating 'empty' lists that the elements are truly independent, not just copies of the same thing.
In [159]: %%timeit
x = np.zeros(20,str)
for i in range(20):
x[i] = chr(65+i)
.....:
100000 loops, best of 3: 14.1 µs per loop
In [160]: timeit [chr(65+i) for i in range(20)]
100000 loops, best of 3: 7.7 µs per loop
As noted in the comments, the behavior of range changed between Python 2 and 3.
In 2, range constructs an entire list populated with the numbers to iterate over, then iterates over the list. Doing this in a tight loop is very expensive.
In 3, range instead constructs a simple object that (as far as I know) consists of just 3 numbers: the starting number, the step (distance between numbers), and the end number. Using simple math, you can calculate any point along the range instead of needing to iterate. This makes "random access" on it O(1) instead of the O(n) you get when an entire list has to be iterated, and it prevents the creation of a costly list.
In 2, use xrange to iterate over a range object instead of a list.
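A quick Python 3 illustration of that:
r = range(0, 10**12, 7)     # no list is ever built
print(len(r))               # 142857142858, computed from start/stop/step
print(r[10**9])             # 7000000000, O(1) "random access"
print(5000000012 in r)      # True, decided arithmetically rather than by scanning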
(@Tom: I'll delete this if you post an answer.)
It's hard to see exactly what you need to do because of the missing code, but it's clear that you need to learn how to vectorize your numpy code. This can lead to a 100x speedup.
You can probably get rid of all the inner for-loops and replace them with vectorized operations.
e.g. instead of
for i in range(26):
    delta[i][0] = # CALCULATION: matrix read and multiplication; each cell is different
do
delta[:, 0] = # vectorized form of whatever operation you were going to do
I am working with huge numbers, such as 150!. Calculating the result is not a problem; for example,
f = factorial(150) is
57133839564458545904789328652610540031895535786011264182548375833179829124845398393126574488675311145377107878746854204162666250198684504466355949195922066574942592095735778929325357290444962472405416790722118445437122269675520000000000000000000000000000000000000.
But I also need to store an array with N of those huge numbers, in full precision. A Python list can store them, but it is slow. A numpy array is fast, but it cannot handle the full precision, which is required for some operations I perform later; as I have tested, a number in scientific notation (float) does not produce an accurate result.
Edit:
150! is just an example of a huge number; it does not mean I am working only with factorials. Also, the full set of numbers (NOT always the result of a factorial) changes over time, and I need to update them and re-evaluate a function for which those numbers are a parameter, and yes, full precision is required.
numpy arrays are very fast when they can internally work with a simple data type that can be directly manipulated by the processor. Since there is no simple, native data type that can store huge numbers, they are converted to a float. numpy can be told to work with Python objects but then it will be slower.
Here are some times on my computer. First the setup.
a is a Python list containing the first 50 factorials. b is a numpy array with all the values converted to float64. c is a numpy array storing Python objects.
import numpy as np
import math
a=[math.factorial(n) for n in range(50)]
b=np.array(a, dtype=np.float64)
c=np.array(a, dtype=np.object)
a[30]
265252859812191058636308480000000L
b[30]
2.6525285981219107e+32
c[30]
265252859812191058636308480000000L
Now to measure indexing.
%timeit a[30]
10000000 loops, best of 3: 34.9 ns per loop
%timeit b[30]
1000000 loops, best of 3: 111 ns per loop
%timeit c[30]
10000000 loops, best of 3: 51.4 ns per loop
Indexing into a Python list is fastest, followed by extracting a Python object from a numpy array, and slowest is extracting a 64-bit float from an optimized numpy array.
Now let's measure multiplying each element by 2.
%timeit [n*2 for n in a]
100000 loops, best of 3: 4.73 µs per loop
%timeit b*2
100000 loops, best of 3: 2.76 µs per loop
%timeit c*2
100000 loops, best of 3: 7.24 µs per loop
Since b*2 can take advantage of numpy's optimized array, it is the fastest. The Python list takes second place. And a numpy array using Python objects is the slowest.
At least with the tests I ran, indexing into a Python list doesn't seem slow. What is slow for you?
Store it as a tuple of prime factors and their powers. The factorization of a factorial (of, let's say, N) will contain ALL primes less than N, so the k'th place in each tuple holds the power of the k'th prime, and you keep a separate list of all the primes you've found. You can easily store factorials as high as a few hundred thousand in this notation. If you really need the digits, you can easily restore them from this representation (for the trailing zeros, drop the power of 5 and subtract it from the power of 2 before multiplying the factors out, then append that many zeros, since 5*2 = 10).
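A rough sketch of that scheme (my own illustration, using Legendre's formula: the exponent of a prime p in n! is floor(n/p) + floor(n/p^2) + ...):
from math import factorial

def primes_upto(n):
    # simple sieve of Eratosthenes: all primes <= n
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b'\x00\x00'
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(sieve[p * p::p]))
    return [i for i, is_p in enumerate(sieve) if is_p]

def factorial_powers(n, primes):
    # n! as a tuple of exponents, one per prime <= n
    powers = []
    for p in primes:
        if p > n:
            break
        e, q = 0, p
        while q <= n:
            e += n // q
            q *= p
        powers.append(e)
    return tuple(powers)

primes = primes_upto(150)
powers = factorial_powers(150, primes)

# restoring the digits when needed: multiply the prime powers back together
restored = 1
for p, e in zip(primes, powers):
    restored *= p ** e
assert restored == factorial(150)
Multiplying or dividing two factorials stored this way (e.g. for binomial coefficients) then amounts to adding or subtracting the exponent tuples.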
If you need the exact value of a factorial later, why don't you save in an array not the result but the number you want to 'factorialize'?
E.G.
You have f = factorial(150)
and you have the result 57133839564458545904789328652610540031895535786011264182548375833179829124845398393126574488675311145377107878746854204162666250198684504466355949195922066574942592095735778929325357290444962472405416790722118445437122269675520000000000000000000000000000000000000
But you can simply:
def values():
    to_factorial_list = []
    ...
    to_factorial_list.append(values_you_want_to_factorialize)
    return to_factorial_list

def setToFactorial(number):
    return factorial(number)

print setToFactorial(values()[302])
EDIT:
Fair enough; then my advice is to combine the logic I suggested with getsizeof(number): keep two arrays, one to save the small 'factorialized' numbers and another to save the big ones, splitting whenever getsizeof(number) exceeds some size, along the lines of the sketch below.
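If I understand that suggestion, the split could look something like this (a sketch; the 64-byte threshold is arbitrary and only for illustration):
import sys
from math import factorial

threshold = 64                       # bytes; pick whatever cutoff suits the data
small_results, big_results = [], []
for n in (5, 20, 150):               # stand-ins for the numbers actually being factorialized
    f = factorial(n)
    (big_results if sys.getsizeof(f) > threshold else small_results).append(f)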
I've made my own version of insertion sort that uses pop and insert - to pick out the current element and insert it before the smallest element larger than the current one - rather than the standard swapping backwards until a larger element is found. When I run the two on my computer, mine is about 3.5 times faster. When we did it in class, however, mine was much slower, which is really confusing. Anyway, here are the two functions:
def insertionSort(alist):
    for x in range(len(alist)):
        for y in range(x, 0, -1):
            if alist[y] < alist[y-1]:
                alist[y], alist[y-1] = alist[y-1], alist[y]
            else:
                break

def myInsertionSort(alist):
    for x in range(len(alist)):
        for y in range(x):
            if alist[y] > alist[x]:
                alist.insert(y, alist.pop(x))
                break
Which one should be faster? Does alist.insert(y,alist.pop(x)) change the size of the list back and forth, and how does that affect time efficiency?
Here's my quite primitive test of the two functions:
from time import time
from random import shuffle

listOfLists = []
for x in range(100):
    a = list(range(1000))
    shuffle(a)
    listOfLists.append(a)

start = time()
for i in listOfLists:
    myInsertionSort(i[:])
myInsertionTime = time() - start

start = time()
for i in listOfLists:
    insertionSort(i[:])
insertionTime = time() - start

print("regular:", insertionTime)
print("my:", myInsertionTime)
I had underestimated your question, but it actually isn't easy to answer. There are a lot of different elements to consider.
Doing lst.insert(y, lst.pop(x)) is an O(n) operation: lst.pop(x) costs O(len(lst) - x), because list elements must be contiguous, so the list has to shift left by one all the elements after index x; dually, lst.insert(y, _) costs O(len(lst) - y), since it has to shift all the elements after index y right by one.
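A minimal way to see that cost difference for pop (my own sketch; the absolute numbers will vary by machine):
from timeit import timeit

setup = "lst = list(range(100000))"
print(timeit("lst.pop(0)", setup=setup, number=10000))   # pop from the front: every remaining element shifts left, O(n) each
print(timeit("lst.pop()", setup=setup, number=10000))    # pop from the end: no shifting, O(1) each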
This means that a naive analysis gives an upper bound of O(n^3) complexity in the worst case for your code. As you suggested, this is actually correct [remember that O(n^2) is a subset of O(n^3)], but it's not a tight upper bound, because you move each element with insert+pop only once. So n times you do O(n) work, and the complexity is indeed O(n*n + n^2) = O(n^2), where the second n^2 counts the comparisons, which are n^2 in the worst case. So, asymptotically your solution is the same as insertion sort.
The first and the second algorithm iterate over y in opposite orders. As I have already commented, this changes the worst case for the algorithm.
While insertion sort has its worst-case with reverse-sorted sequences, your algorithm doesn't (which is actually good). This might be a factor that adds to the difference in timings since if you do not use random lists you might use an input that is worst-case for one algorithm but not worst-case for the other.
In [2]: %timeit insertionSort(list(range(10)))
100000 loops, best of 3: 5.46 us per loop
In [3]: %timeit myInsertionSort(list(range(10)))
100000 loops, best of 3: 8.47 us per loop
In [4]: %timeit insertionSort(list(reversed(range(10))))
10000 loops, best of 3: 20.4 us per loop
In [5]: %timeit myInsertionSort(list(reversed(range(10))))
100000 loops, best of 3: 9.81 us per loop
You should always test with random inputs of different lengths as well.
The average complexity of insertion sort is O(n^2). Your algorithm might have a lower average time, however it's not entirely trivial to compute it.
I don't get why you use the insert+pop at all when you can use the swap. Trying this on my machine yields a quite big improvement in efficiency since you reduce an O(n^2) component to a O(n) component.
Now, you ask why there was such a big change between the execution at home and in class.
There can be various reasons. For example, if you did not use a randomly generated list, you might have used an almost best-case input for insertion sort that was at the same time an almost worst-case input for your algorithm, and similar considerations. Without seeing what you did in class it is not possible to give an exact answer.
However I believe there is a very simple answer: you forgot to copy the list before profiling. This is the same error I did when I first posted this answer (quote from the previous answer):
If you want to compare the two functions you should use random
lists:
In [6]: import random
...: input_list = list(range(10))
...: random.shuffle(input_list)
...:
In [7]: %timeit insertionSort(input_list) # Note: no input_list[:]!! Argh!
100000 loops, best of 3: 4.82 us per loop
In [8]: %timeit myInsertionSort(input_list)
100000 loops, best of 3: 7.71 us per loop
Also you should use big inputs to see the difference clearly:
In [11]: input_list = list(range(1000))
...: random.shuffle(input_list)
In [12]: %timeit insertionSort(input_list) # Note: no input_list[:]! Argh!
1000 loops, best of 3: 508 us per loop
In [13]: %timeit myInsertionSort(input_list)
10 loops, best of 3: 55.7 ms per loop
Note also that I, unfortunately, always executed the pairs of profilings in the same order, confirming my previous ideas.
As you can see, all calls to insertionSort except the first one used a sorted list as input (the first run sorts the list in place, and %timeit then re-runs on the already-sorted list), which is the best case for insertion sort! This means that the timing for insertion sort is wrong (and I'm sorry for having written this before!). Meanwhile, myInsertionSort was also always executed with an already-sorted list, and guess what? It turns out that one of the worst cases for myInsertionSort is the sorted list!
think about it:
for x in range(len(alist)):
    for y in range(x):
        if alist[y] > alist[x]:
            alist.insert(y, alist.pop(x))
            break
If you have a sorted list the alist[y] > alist[x] comparison will always be false. You might say "perfect! no swaps => no O(n) work => better timing", unfortunately this is false because no swaps also mean no break and hence you are doing n*(n+1)/2 iterations, i.e. the worst-case performance.
Note that this is very bad!!! Real-world data really often is partially sorted, so an algorithm whose worst-case is the sorted list is usually not a good algorithm for real-world use.
Note that this does not change if you replace insert + pop with a simple swap, hence the algorithm itself is not good from this point of view, independently from the implementation.
So, I'm doing some Kmeans classification using numpy arrays that are quite sparse-- lots and lots of zeroes. I figured that I'd use scipy's 'sparse' package to reduce the storage overhead, but I'm a little confused about how to create arrays, not matrices.
I've gone through this tutorial on how to create sparse matrices:
http://www.scipy.org/SciPy_Tutorial#head-c60163f2fd2bab79edd94be43682414f18b90df7
To mimic an array, I just create a 1xN matrix, but as you may guess, Asp.dot(Bsp) doesn't quite work because you can't multiply two 1xN matrices. I'd have to transpose each array to Nx1, and that's pretty lame, since I'd be doing it for every dot-product calculation.
Next up, I tried to create an NxN matrix where column 1 == row 1 (such that you can multiply two matrices and just take the top-left corner as the dot product), but that turned out to be really inefficient.
I'd love to use scipy's sparse package as a magic replacement for numpy's array(), but as yet, I'm not really sure what to do.
Any advice?
Use a scipy.sparse format that is row or column based: csc_matrix and csr_matrix.
These use efficient, C implementations under the hood (including multiplication), and transposition is a no-op (esp. if you call transpose(copy=False)), just like with numpy arrays.
EDIT: some timings via ipython:
import numpy, scipy.sparse
n = 100000
x = (numpy.random.rand(n) * 2).astype(int).astype(float) # 50% sparse vector
x_csr = scipy.sparse.csr_matrix(x)
x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))
Now x_csr and x_dok are 50% sparse:
print repr(x_csr)
<1x100000 sparse matrix of type '<type 'numpy.float64'>'
with 49757 stored elements in Compressed Sparse Row format>
And the timings:
timeit numpy.dot(x, x)
10000 loops, best of 3: 123 us per loop
timeit x_dok * x_dok.T
1 loops, best of 3: 1.73 s per loop
timeit x_csr.multiply(x_csr).sum()
1000 loops, best of 3: 1.64 ms per loop
timeit x_csr * x_csr.T
100 loops, best of 3: 3.62 ms per loop
So it looks like I told a lie. Transposition is very cheap, but there is no efficient C implementation of csr * csc (in the latest scipy 0.9.0). A new csr object is constructed in each call :-(
As a hack (though scipy is relatively stable these days), you can do the dot product directly on the sparse data:
timeit numpy.dot(x_csr.data, x_csr.data)
10000 loops, best of 3: 62.9 us per loop
Note this last approach does a numpy dense multiplication again. The sparsity is 50%, so it's actually faster than dot(x, x) by a factor of 2.
You could create a subclass of one of the existing 2d sparse arrays
from scipy.sparse import dok_matrix

class sparse1d(dok_matrix):
    def __init__(self, v):
        dok_matrix.__init__(self, (v,))
    def dot(self, other):
        return dok_matrix.dot(self, other.transpose())[0, 0]

a = sparse1d((1, 2, 3))
b = sparse1d((4, 5, 6))
print a.dot(b)
I'm not sure that it is really much better or faster, but you could do this to avoid using the transpose:
Asp.multiply(Bsp).sum()
This just takes the element-by-element product of the two matrices and sums the products. You could make a subclass of whatever matrix format you are using that has the above statement as the dot product.
However, it is probably just easier to transpose them:
Asp*Bsp.T
That doesn't seem like so much to do, but you could also make a subclass and modify the __mul__() method.
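For instance, a sketch of such a subclass (overriding dot() rather than __mul__(), and untested beyond the toy call below) that uses the multiply-and-sum expression from above so no transpose is needed:
from scipy.sparse import csr_matrix

class SparseVec(csr_matrix):
    # 1xN sparse "vector" whose dot product is elementwise multiply + sum
    def dot(self, other):
        return self.multiply(other).sum()

a = SparseVec([[1.0, 0.0, 2.0, 0.0, 3.0]])
b = SparseVec([[0.0, 4.0, 5.0, 0.0, 6.0]])
print(a.dot(b))    # 28.0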