I would like to evaluate
E = np.einsum('ij,jk,kl->ijkl',A,A,A)
F = np.einsum('ijki->ijk',E)
where A is a matrix (no more than 1000 by 1000 in size). Computing E is slow. I would like to speed this up by only computing the "diagonal" elements, which I store in F. Is it possible to combine these two expressions? Are there any better ways to speed up this computation?
I'm not sure if there is an automatic way, but you can always do the maths yourself and give einsum the final expression:
F = np.einsum('ij,jk,ki->ijk', A, A, A)
In [86]: A=np.random.randint(0,100,(100,100))
In [88]: E1=np.einsum('ijki->ijk',np.einsum('ij,jk,kl->ijkl',A,A,A))
In [89]: E2=np.einsum('ij,jk,ki->ijk',A,A,A)
In [90]: np.allclose(E1,E2)
Out[90]: True
Good time improvement: about 100x, corresponding to the size of the saved dimension (l, which has length 100 here).
In [91]: timeit np.einsum('ijki->ijk',np.einsum('ij,jk,kl->ijkl',A,A,A))
1 loops, best of 3: 1.1 s per loop
In [92]: timeit np.einsum('ij,jk,ki->ijk',A,A,A)
100 loops, best of 3: 10.9 ms per loop
einsum performs a combined iteration over all the indices, albeit in compiled C code, so reducing the number of indices can give significant time savings. It looks like the repeated index (i...i) can be handled in the initial calculation, without building the 4-d array first.
With only 2 GB of memory, the (1000,1000) case is too large for me: I get 'iterator too large' in the E1 case and 'memory error' in the E2 case.
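When the full result does not fit in memory, one option is to build F in blocks along the first index. The following is a minimal sketch, not part of the original answer: the function names and the chunk parameter are hypothetical, and it uses plain broadcasting, which computes the same products as the combined einsum expression.

```python
import numpy as np

def diag_triple_product(A):
    """Return F with F[i, j, k] = A[i, j] * A[j, k] * A[k, i]."""
    return A[:, :, None] * A[None, :, :] * A.T[:, None, :]

def diag_triple_product_chunks(A, chunk=10):
    """Yield (start, block) pairs where block[m] equals F[start + m]."""
    n = A.shape[0]
    for start in range(0, n, chunk):
        rows = slice(start, start + chunk)
        # Shapes: (c, n, 1) * (1, n, n) * (c, 1, n) -> (c, n, n)
        yield start, A[rows, :, None] * A[None, :, :] * A.T[rows, None, :]

# Small check against the combined einsum expression
A = np.random.RandomState(0).randint(0, 100, (20, 20))
F = np.einsum('ij,jk,ki->ijk', A, A, A)
```

Each block can be reduced or written to disk before the next one is computed. For a 1000 by 1000 matrix with 8-byte elements, a block of 10 rows takes about 80 MB, plus temporaries.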
Related
In the following question,
https://stackoverflow.com/a/40056135/5714445
Numpy's broadcasting provides a solution that's almost 6x faster than using np.setdiff1d() paired with np.view(). How does it manage to do this?
And using A[~((A[:,None,:] == B).all(-1)).any(1)] speeds it up even more.
Interesting, but raises yet another question. How does this perform even better?
I will try to answer the second part of the question.
So, we are comparing:
A[np.all(np.any((A-B[:, None]), axis=2), axis=0)] (I)
and
A[~((A[:,None,:] == B).all(-1)).any(1)]
To compare with a matching perspective against the first one, we could write down the second approach like this -
A[(((~(A[:,None,:] == B)).any(2))).all(1)] (II)
The major difference, when considering performance, is that with the first one we get the non-matches by subtraction and then check for non-zeros with .any(). Thus, any() is made to operate on an array of non-boolean dtype. In the second approach, we instead feed it a boolean array obtained with A[:,None,:] == B.
Let's do a small runtime test to see how .any() performs on int dtype vs boolean array -
In [141]: A = np.random.randint(0,9,(1000,1000)) # An int array
In [142]: %timeit A.any(0)
1000 loops, best of 3: 1.43 ms per loop
In [143]: A = np.random.randint(0,9,(1000,1000))>5 # A boolean array
In [144]: %timeit A.any(0)
10000 loops, best of 3: 164 µs per loop
So, with close to a 9x speedup on this part, we see a huge advantage in using any() with boolean arrays. I think this is the biggest reason the second approach is faster.
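To make the comparison concrete, here is a small sketch that wraps the two expressions in functions and runs them on toy data. The function names and the sample arrays are made up for illustration; the expressions themselves are (I) and the original form of the second approach.

```python
import numpy as np

def rows_not_in_subtract(A, B):
    """Approach (I): subtract, then test for non-zeros on an integer array."""
    return A[np.all(np.any(A - B[:, None], axis=2), axis=0)]

def rows_not_in_compare(A, B):
    """Second approach: compare first, so all()/any() only see boolean arrays."""
    return A[~((A[:, None, :] == B).all(-1)).any(1)]

A = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
B = np.array([[3, 4], [7, 8], [9, 9]])

kept_subtract = rows_not_in_subtract(A, B)
kept_compare = rows_not_in_compare(A, B)
```

Both calls keep the rows of A that do not appear in B; only the dtype of the intermediate array that the reductions run over differs.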
I am writing code to propose typo corrections using an HMM and the Viterbi algorithm. At some point, for each word in the text, I have to do the following (let's assume I have 10,000 words).
# FYI: Windows 10, 64-bit, Intel i7, 4 GB RAM, Python 2.7.3
import numpy as np
import pandas as pd
for k in range(10000):
    tempWord = corruptList20[k]  # Temp word read from the list which has all of the words
    delta = np.zeros((26, len(tempWord)))
    sai = np.chararray((26, len(tempWord)))
    sai[:] = '#'
    # INITIALIZATION DELTA
    for i in range(26):
        delta[i][0] = #CALCULATION matrix read and multiplication each cell is different
    # INITIALIZATION END
    # 6. DELTA CALCULATION
    for deltaIndex in range(1, len(tempWord)):
        for j in range(26):
            tempDelta = 0.0
            maxDelta = 0.0
            maxState = ''
            for i in range(26):
                # CALCULATION to fill each cell involve in:
                # 1-matrix read and multiplication
                # 2 Finding Column Max
                # logical operation and if-then-else operations
    # 7. SAI BACKWARD TRACKING
    delta2 = pd.DataFrame(delta)
    sai2 = pd.DataFrame(sai)
    proposedWord = np.zeros(len(tempWord), str)
    editId = 0
    for col in delta2.columns:
        # CALCULATION to fill each cell involve in:
        # 1-matrix read and multiplication
        # 2 Finding Column Max
        # logical operation and if-then-else operations
    editList20.append(''.join(editWord))
# END OF LOOP
As you can see, it is computationally involved, and it takes too much time to run.
My laptop was stolen, so I am currently running this on Windows 10, 64-bit, 4 GB RAM, Python 2.7.3.
My question: can anybody see anything I could optimize? Do I have to delete the matrices I created in the loop before the loop goes to the next round to free memory, or is this done automatically?
After the comments below and using xrange instead of range, the performance increased by almost 30%. I am adding the screenshot here after this change.
I don't think that range discussion makes much difference. With Python 3, where range is a lazy sequence rather than a list, expanding it into a list before iteration doesn't change the time much.
In [107]: timeit for k in range(10000):x=k+1
1000 loops, best of 3: 1.43 ms per loop
In [108]: timeit for k in list(range(10000)):x=k+1
1000 loops, best of 3: 1.58 ms per loop
With numpy and pandas the real key to speeding up loops is to replace them with compiled operations that work on the whole array or dataframe. But even in pure Python, focus on streamlining the contents of the iteration, not the iteration mechanism.
======================
for i in range(26):
    delta[i][0] = #CALCULATION matrix read and multiplication
A minor change: delta[i, 0] = ...; this is the array way of addressing a single element. Functionally it is often the same, but the intent is clearer. But think: can't you set all of that column at once?
delta[:,0] = ...
====================
N = len(tempWord)
delta = np.zeros((26, N))
etc
In tight loops temporary variables like this can save time. This isn't tight, so here it just adds clarity.
===========================
This is one ugly nested triple loop; admittedly 26 steps isn't large, but 26*26*N is:
for deltaIndex in range(1,N):
    for j in range(26):
        tempDelta = 0.0
        maxDelta = 0.0
        maxState = ''
        for i in range(26):
            # CALCULATION
            # 1-matrix read and multiplication
            # 2 Finding Column Max
            # logical operation and if-then-else operations
Focus on replacing this with array operations. It's those 3 commented lines that need to be changed, not the iteration mechanism.
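Since the actual calculation is elided in the question, the following is only a sketch of what replacing the two inner loops could look like, assuming the standard Viterbi recursion. The names init, trans, emit and obs (initial probabilities, transition matrix, emission matrix and observed symbol indices) are hypothetical and not taken from the question.

```python
import numpy as np

def viterbi_loops(init, trans, emit, obs):
    """Reference version with the nested loops, as in the question."""
    S, N = len(init), len(obs)
    delta = np.zeros((S, N))
    psi = np.zeros((S, N), dtype=int)
    for i in range(S):
        delta[i, 0] = init[i] * emit[i, obs[0]]
    for t in range(1, N):
        for j in range(S):
            max_delta, max_state = -1.0, 0
            for i in range(S):
                cand = delta[i, t - 1] * trans[i, j]
                if cand > max_delta:
                    max_delta, max_state = cand, i
            delta[j, t] = max_delta * emit[j, obs[t]]
            psi[j, t] = max_state
    return delta, psi

def viterbi_vectorized(init, trans, emit, obs):
    """Same recursion; the two inner loops become array operations."""
    S, N = len(init), len(obs)
    delta = np.zeros((S, N))
    psi = np.zeros((S, N), dtype=int)
    delta[:, 0] = init * emit[:, obs[0]]          # whole first column at once
    for t in range(1, N):
        scores = delta[:, t - 1, None] * trans    # scores[i, j], shape (S, S)
        psi[:, t] = scores.argmax(axis=0)         # best previous state per j
        delta[:, t] = scores.max(axis=0) * emit[:, obs[t]]
    return delta, psi

# Toy model with random (hypothetical) parameters
rng = np.random.RandomState(0)
init, trans, emit = rng.rand(5), rng.rand(5, 5), rng.rand(5, 4)
obs = [0, 3, 1, 2, 2]
d_loop, p_loop = viterbi_loops(init, trans, emit, obs)
d_vec, p_vec = viterbi_vectorized(init, trans, emit, obs)
```

The loop over t remains because each column depends on the previous one, but the 26*26 work per column is done in compiled code.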
================
Making proposedWord a list rather than an array might be faster. Small list operations are often faster than array ones, since numpy arrays have a creation overhead.
In [136]: timeit np.zeros(20,str)
100000 loops, best of 3: 2.36 µs per loop
In [137]: timeit x=[' ']*20
1000000 loops, best of 3: 614 ns per loop
You have to be careful when creating 'empty' lists that the elements are truly independent, not just copies of the same thing.
In [159]: %%timeit
x = np.zeros(20,str)
for i in range(20):
    x[i] = chr(65+i)
.....:
100000 loops, best of 3: 14.1 µs per loop
In [160]: timeit [chr(65+i) for i in range(20)]
100000 loops, best of 3: 7.7 µs per loop
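To illustrate the earlier caution about list elements being independent, here is a short sketch in plain Python (the variable names are just for illustration). Repeating a list with * copies references, which only matters when the elements are mutable.

```python
# Repeating a list that holds a mutable object copies references, not objects.
shared = [[]] * 3
shared[0].append('a')        # visible through every slot

# A comprehension creates a separate inner list for each slot.
independent = [[] for _ in range(3)]
independent[0].append('a')   # only the first slot changes

# Immutable elements such as strings are safe to repeat.
letters = [' '] * 3
letters[0] = 'A'
```

So [' ']*20 is fine for building a word character by character, since each slot is simply reassigned.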
As noted in the comments, the behavior of range changed between Python 2 and 3.
In 2, range constructs an entire list populated with the numbers to iterate over, then iterates over the list. Doing this in a tight loop is very expensive.
In 3, range instead constructs a simple object that (as far as I know) consists of only 3 numbers: the start, the step (distance between numbers), and the end. Using simple math, it can calculate any element of the range on demand instead of storing them all. This avoids creating a costly list up front, and indexing into the range stays O(1).
In 2, use xrange to iterate over a range object instead of a list.
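As a small sketch of the Python 3 behaviour described above (Python 3 only; the bounds are arbitrary), a range can be measured, indexed and searched without ever materializing its elements:

```python
r = range(5, 10**9, 3)        # no list of ~3.3e8 numbers is built

length = len(r)               # computed from start, stop and step
element = r[10**6]            # start + index * step
found = (5 + 3 * 10**6) in r  # integer membership is checked by arithmetic
```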
(@Tom: I'll delete this if you post an answer).
It's hard to see exactly what you need to do because of the missing code, but it's clear that you need to learn how to vectorize your numpy code. This can lead to a 100x speedup.
You can probably get rid of all the inner for-loops and replace them with vectorized operations.
E.g. instead of
for i in range(26):
    delta[i][0] = #CALCULATION matrix read and multiplication each cell is different
do
delta[:, 0] = # Vectorized form of whatever operation you were going to do.
I am working with huge numbers, such as 150!. Calculating the result is not a problem; for example,
f = factorial(150) is
57133839564458545904789328652610540031895535786011264182548375833179829124845398393126574488675311145377107878746854204162666250198684504466355949195922066574942592095735778929325357290444962472405416790722118445437122269675520000000000000000000000000000000000000.
But I also need to store an array with N of those huge numbers, in full precision. A Python list can store them, but it is slow. A numpy array is fast, but cannot handle the full precision, which is required for some operations I perform later. As I have tested, a number in scientific notation (float) does not produce an accurate result.
Edit:
150! is just an example of a huge number; it does not mean I am working only with factorials. Also, the full set of numbers (NOT always the result of a factorial) changes over time, and I need to update them and re-evaluate a function for which those numbers are a parameter. And yes, full precision is required.
numpy arrays are very fast when they can internally work with a simple data type that can be directly manipulated by the processor. Since there is no simple, native data type that can store huge numbers, they are converted to a float. numpy can be told to work with Python objects but then it will be slower.
Here are some times on my computer. First the setup.
a is a Python list containing the first 50 factorials. b is a numpy array with all the values converted to float64. c is a numpy array storing Python objects.
import numpy as np
import math
a=[math.factorial(n) for n in range(50)]
b=np.array(a, dtype=np.float64)
c=np.array(a, dtype=object)
a[30]
265252859812191058636308480000000L
b[30]
2.6525285981219107e+32
c[30]
265252859812191058636308480000000L
Now to measure indexing.
%timeit a[30]
10000000 loops, best of 3: 34.9 ns per loop
%timeit b[30]
1000000 loops, best of 3: 111 ns per loop
%timeit c[30]
10000000 loops, best of 3: 51.4 ns per loop
Indexing into a Python list is fastest, followed by extracting a Python object from a numpy array, and slowest is extracting a 64-bit float from an optimized numpy array.
Now let's measure multiplying each element by 2.
%timeit [n*2 for n in a]
100000 loops, best of 3: 4.73 µs per loop
%timeit b*2
100000 loops, best of 3: 2.76 µs per loop
%timeit c*2
100000 loops, best of 3: 7.24 µs per loop
Since b*2 can take advantage of numpy's optimized array, it is the fastest. The Python list takes second place. And a numpy array using Python objects is the slowest.
At least with the tests I ran, indexing into a Python list doesn't seem slow. What is slow for you?
Store it as tuples of prime factors and their powers. The factorization of a factorial (of, let's say, N) contains ALL primes up to N, so the k'th place in each tuple can hold the power of the k'th prime. You'll want to keep a separate list of all the primes you've found. You can easily store factorials of numbers as high as a few hundred thousand in this notation. If you really need the digits, you can easily restore them by multiplying the factors back together (to skip the trailing zeros, ignore the power of 5 and subtract it from the power of 2, since 5*2=10).
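A minimal sketch of this representation, using pure Python integers. The helper names are hypothetical, and the exponents are computed with Legendre's formula (the power of p in N! is the sum of N // p**k over k >= 1), which is one way to obtain them without ever forming the factorial.

```python
import math

def primes_up_to(n):
    """Sieve of Eratosthenes: all primes p with p <= n (n >= 0)."""
    sieve = [True] * (n + 1)
    sieve[:2] = [False] * min(2, n + 1)
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            for m in range(p * p, n + 1, p):
                sieve[m] = False
    return [p for p, is_prime in enumerate(sieve) if is_prime]

def factorial_exponents(n):
    """Tuple whose k'th entry is the power of the k'th prime in n!."""
    exps = []
    for p in primes_up_to(n):
        e, q = 0, p
        while q <= n:
            e += n // q   # multiples of p, p**2, p**3, ... up to n
            q *= p
        exps.append(e)
    return tuple(exps)

def from_exponents(primes, exps):
    """Multiply the prime powers back together to recover the integer."""
    result = 1
    for p, e in zip(primes, exps):
        result *= p ** e
    return result

primes = primes_up_to(150)
exps = factorial_exponents(150)
restored = from_exponents(primes, exps)
```

Multiplying two numbers stored this way is just an element-wise addition of their exponent tuples. Note that this only fits numbers whose factorization is known or cheap to compute, which may not hold for the general case described in the question's edit.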
If you will need the exact value of a factorial later, why don't you store in the array not the result but the number you want to 'factorialize'?
E.g.
You have f = factorial(150)
and you have the result 57133839564458545904789328652610540031895535786011264182548375833179829124845398393126574488675311145377107878746854204162666250198684504466355949195922066574942592095735778929325357290444962472405416790722118445437122269675520000000000000000000000000000000000000
But you can simply:
from math import factorial

def values():
    to_factorial_list = []
    ...
    to_factorial_list.append(values_you_want_to_factorialize)
    return to_factorial_list

def setToFactorial(number):
    return factorial(number)

print(setToFactorial(values()[302]))
EDIT:
Fair. Then my advice is to combine the logic I suggested with getsizeof(number): you can work with two arrays, one to store the small factorialized numbers and another to store the big ones, e.g. those for which getsizeof(number) exceeds some size.
I'm using numpy to do linear algebra. I want to do fast subset-indexed dot and other linear operations.
When dealing with big matrices, a slicing solution like A[:,subset].dot(x[subset]) may take longer than doing the multiplication on the full matrix.
A = np.random.randn(1000,10000)
x = np.random.randn(10000,1)
subset = np.sort(np.random.randint(0,10000,500))
Timings show that sub-indexing can be faster when columns are in one block.
%timeit A.dot(x)
100 loops, best of 3: 4.19 ms per loop
%timeit A[:,subset].dot(x[subset])
100 loops, best of 3: 7.36 ms per loop
%timeit A[:,:500].dot(x[:500])
1000 loops, best of 3: 1.75 ms per loop
Still, the acceleration is not what I would expect (20x faster!).
Does anyone know of a library/module that allows this kind of fast operation through numpy or scipy?
For now I'm using cython to code a fast column-indexed dot product through the cblas library. But for more complex operations (pseudo-inverse, or sub-indexed least-squares solving) I'm not sure I can reach a good acceleration.
Thanks!
Well, this is faster.
%timeit A.dot(x)
#4.67 ms
%%timeit
y = numpy.zeros_like(x)
y[subset]=x[subset]
d = A.dot(y)
#4.77ms
%timeit c = A[:,subset].dot(x[subset])
#7.21ms
And d matches c (check with np.allclose(d, c)), provided subset contains no repeated indices: a repeated index is counted once in y but once per occurrence in A[:,subset].dot(x[subset]).
Notice that how fast this is depends on the input. With subset = array([1,2,3]), the time of my solution stays pretty much the same, while the last solution takes only 46 µs.
Basically, this will be faster if the size of subset is not much smaller than the size of x.
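The zero-filling trick can be sketched as follows. The function names are made up, the sizes are smaller than in the question to keep it quick, and the indices are drawn without replacement (np.random.randint, as used in the question, can return duplicates).

```python
import numpy as np

def subset_dot_sliced(A, x, subset):
    """Fancy-index the columns first; this copies A[:, subset]."""
    return A[:, subset].dot(x[subset])

def subset_dot_zero_filled(A, x, subset):
    """Zero out the unused entries of x and multiply by the full matrix."""
    y = np.zeros_like(x)
    y[subset] = x[subset]
    return A.dot(y)

rng = np.random.RandomState(0)
A = rng.randn(100, 1000)
x = rng.randn(1000, 1)
# Unique, sorted column indices
subset = np.sort(rng.choice(1000, 50, replace=False))

c = subset_dot_sliced(A, x, subset)
d = subset_dot_zero_filled(A, x, subset)
```

The two results agree up to floating-point rounding, so np.allclose is a safer check than exact equality.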
I have two Pandas dataframes, with some common information between them
n_classes = 100
classes = range(n_classes)
activity_data = pd.DataFrame(columns=['Class','Activity'], data=list(zip(classes,rand(n_classes))))
weight_lookuptable = pd.DataFrame(index=classes, columns=classes, data=rand(n_classes,n_classes))
#Important for comprehension: the classes are both the indices and the columns. Every class has a relationship with every other class.
I then want to perform this operation:
q =[sum(activity_data['Activity']*activity_data['Class'].map(weight_lookuptable[c])) for c in activity_data['Class']]
Description: For every class, look up that class' class-to-class weights in the lookup table, and multiply them by their respective classes. Then sum.
Is there a smarter way to do this so as to be faster? It's pretty fast now, but I'll be doing this millions of times and could really use an order of magnitude or two reduction.
Maybe there is something clever with making activity_data['Class'] an index. But obviously the biggest opportunity for gains would be to not have the for c in activity_data['Class'] component. I just don't see how to do it.
IIUC, you could use dot, I think:
>>> q = [sum(activity_data['Activity']*activity_data['Class'].map(weight_lookuptable[c])) for c in activity_data['Class']]
>>> new_q = activity_data["Activity"].dot(weight_lookuptable)
>>> np.allclose(q, new_q)
True
which is much faster for me:
>>> %timeit q = [sum(activity_data['Activity']*activity_data['Class'].map(weight_lookuptable[c])) for c in activity_data['Class']]
10 loops, best of 3: 28.8 ms per loop
>>> %timeit new_q = activity_data["Activity"].dot(weight_lookuptable)
1000 loops, best of 3: 218 µs per loop
You can sometimes squeeze out a bit more performance by dropping to bare numpy (although then you have to be more careful to make sure that your indices are aligned):
>>> %timeit new_q = activity_data["Activity"].values.dot(weight_lookuptable.values)
10000 loops, best of 3: 43.4 µs per loop
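The alignment caveat can be sketched as follows. This is a small illustration with made-up data (5 classes, seeded random values): the plain dot works in the question because the row index of activity_data happens to equal the class labels. If the rows were in a different order, indexing by 'Class' first lets pandas align the labels.

```python
import numpy as np
import pandas as pd

n_classes = 5
classes = list(range(n_classes))
rng = np.random.RandomState(0)

activity_data = pd.DataFrame({'Class': classes,
                              'Activity': rng.rand(n_classes)})
weight_lookuptable = pd.DataFrame(rng.rand(n_classes, n_classes),
                                  index=classes, columns=classes)

# Original loop-based computation, one entry per class
q = [sum(activity_data['Activity']
         * activity_data['Class'].map(weight_lookuptable[c]))
     for c in activity_data['Class']]

# Same rows in reverse order: the row index no longer matches 'Class'
reordered = activity_data.iloc[::-1].reset_index(drop=True)

# Index by class label so that dot() aligns activities with weight rows
aligned = reordered.set_index('Class')['Activity']
new_q = aligned.dot(weight_lookuptable)

# Bare numpy: reorder the weight rows explicitly to match the activities
bare_q = aligned.values.dot(weight_lookuptable.loc[aligned.index].values)
```

With .values there is no label alignment at all, so the explicit .loc reordering is what keeps the result correct.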