I am trying to write some nested loops in my algorithm, and I am running into the problem that the whole algorithm takes far too long because of these nested loops. I am quite new to Python (as you may see from my unpolished code below :( ) and hopefully someone can guide me to a way to speed up my code!
The whole algorithm is for fire detection in multiple 1500*6400 arrays. A small contextual analysis is applied while going through the whole array. The contextual analysis is performed with a dynamically sized window. The window size can grow from 11*11 to 31*31 until there are enough valid values inside the sampling window for the next round of calculation, for example like below:
def ContextualWindows (arrb4,arrb5,pfire):
    ####arrb4,arrb5,pfire are 31*31 sampling windows from large 1500*6400 numpy array
    i=5
    while i in range (5,16):
        arrb4back=arrb4[15-i:16+i,15-i:16+i]
        ## only output the array data when it is 'large' enough
        ## to have enough good quality data to do calculation
        if np.ma.count(arrb4back)>=min(10,0.25*i*i):
            arrb5back=arrb5[15-i:16+i,15-i:16+i]
            pfireback=pfire[15-i:16+i,15-i:16+i]
            canfire=0
            i=20
        else:
            i=i+1
    ###unknown pixel: background condition could not be characterized
    if i!=20:
        canfire=1
        arrb5back=arrb5
        pfireback=pfire
        arrb4back=arrb4
    return (arrb4back,arrb5back,pfireback,canfire)
Then this dynamic window will be fed into the next round of tests, for example:
b4backave=np.mean(arrb4Windows)
b4backdev=np.std(arrb4Windows)
if b4>b4backave+3.5*b4backdev:
    firetest=True
Running the whole code over my multiple 1500*6400 numpy arrays takes over half an hour, or even longer. Just wondering if anyone has an idea how to deal with it? A general idea of which part I should put my effort into would be greatly helpful!
Many thanks!
Avoid while loops if speed is a concern. The loop lends itself to a for loop as start and end are fixed. Additionally, your code does a lot of copying which isn't really necessary. The rewritten function:
def ContextualWindows (arrb4, arrb5, pfire):
    ''' arrb4,arrb5,pfire are 31*31 sampling windows from
        large 1500*6400 numpy array '''
    for i in range (5, 16):
        lo = 15 - i         # 10..0
        hi = 16 + i         # 21..31
        # only output the array data when it is 'large' enough
        # to have enough good quality data to do calculation
        if np.ma.count(arrb4[lo:hi, lo:hi]) >= min(10, 0.25*i*i):
            return (arrb4[lo:hi, lo:hi], arrb5[lo:hi, lo:hi],
                    pfire[lo:hi, lo:hi], 0)
    else:   # unknown pixel: background condition could not be characterized
        return (arrb4, arrb5, pfire, 1)
For clarity I've used style guidelines from PEP 8 (like extended comments, number of comment chars, spaces around operators etc.). Copying of a windowed arrb4 occurs twice here, but only if the condition is fulfilled, and this will happen only once per function call. The else clause will be executed only if the for-loop has run to its end. We don't even need a break from the loop as we exit the function altogether.
Let us know if that speeds up the code a bit. I don't think it'll be much but then again there isn't much code anyway.
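For completeness, here is a rough sketch of how the rewritten function might be called for a single pixel of the big arrays. The names big_b4, big_b5, big_pfire, row and col are mine, and I'm assuming the inputs are numpy masked arrays with invalid pixels masked out:
import numpy as np

# Hypothetical 1500*6400 masked inputs; invalid pixels would normally be masked out.
big_b4 = np.ma.masked_invalid(np.random.rand(1500, 6400))
big_b5 = np.ma.masked_invalid(np.random.rand(1500, 6400))
big_pfire = np.ma.masked_invalid(np.random.rand(1500, 6400))

row, col = 700, 3000    # pixel under test, far enough from the array edges
win = (slice(row - 15, row + 16), slice(col - 15, col + 16))    # 31*31 window

arrb4back, arrb5back, pfireback, canfire = ContextualWindows(
    big_b4[win], big_b5[win], big_pfire[win])

if canfire == 0:    # background could be characterized
    b4backave = np.mean(arrb4back)
    b4backdev = np.std(arrb4back)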
I've run some time tests with ContextualWindows and variants. One i step takes about 50 us, all ten about 500 us.
This simple iteration takes about the same time:
[np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
The iteration mechanism and the 'copying' of arrays are minor parts of the time. Where possible numpy is making views, not copies.
I'd focus on either minimizing the number of these count steps, or speeding up the count.
Comparing times for various operations on these windows:
First, the time for 1 step:
In [167]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,6)]
10000 loops, best of 3: 43.9 us per loop
now for the 10 steps:
In [139]: timeit [arrb4[15-i:16+i,15-i:16+i].shape for i in range(5,16)]
10000 loops, best of 3: 33.7 us per loop
In [140]: timeit [np.sum(arrb4[15-i:16+i,15-i:16+i]>500) for i in range(5,16)]
1000 loops, best of 3: 390 us per loop
In [141]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
1000 loops, best of 3: 464 us per loop
Simply indexing does not take much time, but testing for conditions takes substantially more.
cumsum is sometimes used to speed up sums over sliding windows. Instead of taking the sum (or mean) over each window, you calculate the cumsum and then use the differences between the front and end of the window.
Trying something like that, but in 2d - cumsum in both dimensions, followed by differences between diagonally opposite corners:
In [164]: %%timeit
.....: cA4=np.cumsum(np.cumsum(arrb4,0),1)
.....: [cA4[15-i,15-i]-cA4[15+i,15+i] for i in range(5,16)]
.....:
10000 loops, best of 3: 43.1 us per loop
This is almost 10x faster than the (nearly) equivalent sum. Values don't quite match, but the timing suggests that this may be worth refining.
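For comparison, here is a rough sketch of the full summed-area-table version, which subtracts all four corners of each window, so the values should exactly match the plain windowed sums (assuming an ordinary unmasked array; for the masked arrays in the question you would work on .filled(0)):
import numpy as np

def window_sums(arr, radii):
    ''' Sums over the (2*i+1)x(2*i+1) windows centred on element (15, 15),
        all computed from one zero-padded 2-D cumulative sum. '''
    S = np.zeros((arr.shape[0] + 1, arr.shape[1] + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(arr, 0), 1)
    out = []
    for i in radii:
        lo, hi = 15 - i, 16 + i           # same bounds as the slices above
        out.append(S[hi, hi] - S[lo, hi] - S[hi, lo] + S[lo, lo])
    return out

# window_sums(arrb4.filled(0), range(5, 16)) should match
# [arrb4.filled(0)[15-i:16+i, 15-i:16+i].sum() for i in range(5, 16)]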
Related
results is a 2d numpy array with 300,000 rows:
for i in range(np.size(results,0)):
    if results[i][0]>=0.7:
        count+=1
This Python code takes me 0.7 seconds, but when I run the same thing in C++ it takes less than 0.07 seconds.
So how can I make this Python code as fast as possible?
When doing numerical computation for speed, especially in Python, you never want to use for loops if possible. Numpy is optimized for "vectorized" computation, so you want to pass off the work you'd typically do in for loops to special numpy indexing and functions like where.
I did a quick test on a 300,000 x 600 array of random values from 0 to 1 and found the following.
Your code, non-vectorized with one for loop:
226 ms per run
%%timeit
count = 0
for i in range(np.size(results,0)):
    if results[i][0]>=0.7:
        count+=1
emilaz Solution:
8.36 ms per run
%%timeit
first_col = results[:,0]
x = len(first_col[first_col>.7])
Ethan's Solution:
7.84 ms per run
%%timeit
np.bincount(results[:,0]>=.7)[1]
Best I came up with:
6.92 ms per run
%%timeit
len(np.where(results[:,0] > 0.7)[0])
All 4 methods yielded the same answer, which for my data was 90,134. Hope this helps!
Try
first_col=results[:,0]
res = len(first_col[first_col>.7])
Depending on the shape of your matrix, this can be 2-10 times faster than your approach.
You could give the following a try:
np.bincount(results[:,0]>=.7)[1]
Not sure it’s faster, but should produce the correct answer
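Another vectorized one-liner worth timing on your own data (not benchmarked above) is numpy's dedicated counting function:
count = np.count_nonzero(results[:, 0] >= 0.7)   # counts the True entries of the boolean mask
# equivalently: (results[:, 0] >= 0.7).sum()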
I am writing code for proposing typo corrections using an HMM and the Viterbi algorithm. At some point, for each word in the text, I have to do the following (let's assume I have 10,000 words).
#FYI Windows 10, 64bit, Intel i7, 4GB RAM, Python 2.7.3
import numpy as np
import pandas as pd

for k in range(10000):
    tempWord = corruptList20[k]  #Temp word read from the list which has all of the words
    delta = np.zeros((26, len(tempWord)))
    sai = np.chararray((26, len(tempWord)))
    sai[:] = '#'

    # INITIALIZATION DELTA
    for i in range(26):
        delta[i][0] = #CALCULATION matrix read and multiplication, each cell is different
    # INITIALIZATION END

    # 6. DELTA CALCULATION
    for deltaIndex in range(1, len(tempWord)):
        for j in range(26):
            tempDelta = 0.0
            maxDelta = 0.0
            maxState = ''
            for i in range(26):
                # CALCULATION to fill each cell involves:
                # 1- matrix read and multiplication
                # 2- finding the column max
                # logical operations and if-then-else operations

    # 7. SAI BACKWARD TRACKING
    delta2 = pd.DataFrame(delta)
    sai2 = pd.DataFrame(sai)
    proposedWord = np.zeros(len(tempWord), str)
    editId = 0
    for col in delta2.columns:
        # CALCULATION to fill each cell involves:
        # 1- matrix read and multiplication
        # 2- finding the column max
        # logical operations and if-then-else operations

    editList20.append(''.join(editWord))
#END OF LOOP
As you can see it is computationally involved, and when I run it, it takes too much time.
Currently my laptop is stolen and I run this on Windows 10, 64-bit, 4GB RAM, Python 2.7.3.
My question: can anybody see anything here that I can use to optimize? Do I have to delete the matrices I create in the loop before the loop goes to the next round to free memory, or is this done automatically?
After the comments below and using xrange instead of range, the performance increased by almost 30%. I am adding the screenshot here after this change.
I don't think the range discussion makes much difference. With Python 3, where range is already lazy, expanding it into a list before iteration doesn't change the time much.
In [107]: timeit for k in range(10000):x=k+1
1000 loops, best of 3: 1.43 ms per loop
In [108]: timeit for k in list(range(10000)):x=k+1
1000 loops, best of 3: 1.58 ms per loop
With numpy and pandas the real key to speeding up loops is to replace them with compiled operations that work on the whole array or dataframe. But even in pure Python, focus on streamlining the contents of the iteration, not the iteration mechanism.
======================
for i in range(26):
    delta[i][0] = #CALCULATION matrix read and multiplication
A minor change: delta[i, 0] = ...; this is the array way of addressing a single element. Functionally it is often the same, but the intent is clearer. But think: can't you set all of that column at once?
delta[:,0] = ...
====================
N = len(tempWord)
delta = np.zeros((26, N))
etc.
In tight loops temporary variables like this can save time. This isn't tight, so here it just adds clarity.
===========================
This is one ugly nested triple loop; admittedly 26 steps isn't large, but 26*26*N is:
for deltaIndex in range(1,N):
    for j in range(26):
        tempDelta = 0.0
        maxDelta = 0.0
        maxState = ''
        for i in range(26):
            # CALCULATION
            # 1-matrix read and multiplication
            # 2 Finding Column Max
            # logical operation and if-then-else operations
Focus on replacing this with array operations. It's those 3 commented lines that need to be changed, not the iteration mechanism.
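Purely as an illustration (the real calculation is hidden in the question), here is a rough sketch of how such a Viterbi step usually collapses into whole-array operations; trans, emit_t and delta_prev are hypothetical stand-ins for whatever the commented-out code reads:
import numpy as np

trans = np.random.rand(26, 26)     # hypothetical: trans[i, j] = P(state i -> state j)
emit_t = np.random.rand(26)        # hypothetical: emission prob of character t in each state
delta_prev = np.random.rand(26)    # the previous column, delta[:, t-1]

# candidate[i, j] = delta_prev[i] * trans[i, j], all 26*26 values at once
candidate = delta_prev[:, None] * trans

delta_t = candidate.max(axis=0) * emit_t    # new column delta[:, t], no Python loops
sai_t = candidate.argmax(axis=0)            # backpointers for the backward tracking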
================
Making proposedWord a list rather than an array might be faster. Small list operations are often faster than array ones, since numpy arrays have a creation overhead.
In [136]: timeit np.zeros(20,str)
100000 loops, best of 3: 2.36 µs per loop
In [137]: timeit x=[' ']*20
1000000 loops, best of 3: 614 ns per loop
You have to be careful when creating 'empty' lists that the elements are truly independent, not just copies of the same thing.
In [159]: %%timeit
   .....: x = np.zeros(20, str)
   .....: for i in range(20):
   .....:     x[i] = chr(65+i)
   .....:
100000 loops, best of 3: 14.1 µs per loop
In [160]: timeit [chr(65+i) for i in range(20)]
100000 loops, best of 3: 7.7 µs per loop
As noted in the comments, the behavior of range changed between Python 2 and 3.
In 2, range constructs an entire list populated with the numbers to iterate over, then iterates over the list. Doing this in a tight loop is very expensive.
In 3, range instead constructs a simple object that (as far as I know) consists of only 3 numbers: the starting number, the step (distance between numbers), and the end number. Using simple math, you can calculate any point along the range instead of needing to iterate through it. This makes "random access" on it O(1) instead of O(n) when the entire list is iterated, and prevents the creation of a costly list.
In 2, use xrange to iterate over a range object instead of a list.
(#Tom: I'll delete this if you post an answer).
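A minimal sketch of that Python 2 idiom (do_work is just a hypothetical stand-in for the loop body):
for k in xrange(10000):     # Python 2: yields numbers lazily, no 10,000-element list is built
    do_work(k)              # hypothetical per-word processing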
It's hard to see exactly what you need to do because of the missing code, but it's clear that you need to learn how to vectorize your numpy code. This can lead to a 100x speedup.
You can probably get rid of all the inner for-loops and replace them with vectorized operations.
e.g. instead of
for i in range(26):
    delta[i][0] = #CALCULATION matrix read and multiplication each cell is different
do
delta[:, 0] = # Vectorized form of whatever operation you were going to do.
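For instance (illustrative only, since the real calculation isn't shown), if the initialization multiplies a start-probability vector by the emission column of the first character, the whole first column can be filled in one statement; start_prob and emit_first are hypothetical names:
import numpy as np

start_prob = np.random.rand(26)    # hypothetical initial state probabilities
emit_first = np.random.rand(26)    # hypothetical emission probs for the first character

delta = np.zeros((26, 10))
delta[:, 0] = start_prob * emit_first    # sets all 26 cells without a Python loop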
I am working with huge numbers, such as 150!. Calculating the result is not a problem; for example
f = factorial(150) is
57133839564458545904789328652610540031895535786011264182548375833179829124845398393126574488675311145377107878746854204162666250198684504466355949195922066574942592095735778929325357290444962472405416790722118445437122269675520000000000000000000000000000000000000.
But I also need to store an array with N of those huge numbers, in full precision. A Python list can store them, but it is slow. A numpy array is fast, but cannot handle the full precision, which is required for some operations I perform later; as I have tested, a number in scientific notation (float) does not produce an accurate result.
Edit:
150! is just an example of a huge number; it does not mean I am working only with factorials. Also, the full set of numbers (NOT always the result of a factorial) changes over time, and I need to update them and re-evaluate a function for which those numbers are a parameter, and yes, full precision is required.
numpy arrays are very fast when they can internally work with a simple data type that can be directly manipulated by the processor. Since there is no simple, native data type that can store huge numbers, they are converted to a float. numpy can be told to work with Python objects but then it will be slower.
Here are some times on my computer. First the setup.
a is a Python list containing the first 50 factorials. b is a numpy array with all the values converted to float64. c is a numpy array storing Python objects.
import numpy as np
import math
a=[math.factorial(n) for n in range(50)]
b=np.array(a, dtype=np.float64)
c=np.array(a, dtype=np.object)
a[30]
265252859812191058636308480000000L
b[30]
2.6525285981219107e+32
c[30]
265252859812191058636308480000000L
Now to measure indexing.
%timeit a[30]
10000000 loops, best of 3: 34.9 ns per loop
%timeit b[30]
1000000 loops, best of 3: 111 ns per loop
%timeit c[30]
10000000 loops, best of 3: 51.4 ns per loop
Indexing into a Python list is fastest, followed by extracting a Python object from a numpy array, and slowest is extracting a 64-bit float from an optimized numpy array.
Now let's measure multiplying each element by 2.
%timeit [n*2 for n in a]
100000 loops, best of 3: 4.73 µs per loop
%timeit b*2
100000 loops, best of 3: 2.76 µs per loop
%timeit c*2
100000 loops, best of 3: 7.24 µs per loop
Since b*2 can take advantage of numpy's optimized array, it is the fastest. The Python list takes second place. And a numpy array using Python objects is the slowest.
At least with the tests I ran, indexing into a Python list doesn't seem slow. What is slow for you?
Store it as tuples of prime factors and their powers. A factorization of a factorial (of, let's say, N) will contain ALL primes up to N. So the k'th place in each tuple will be the k'th prime. And you'll want to keep a separate list of all the primes you've found. You can easily store factorials as high as a few hundred thousand in this notation. If you really need the digits, you can easily restore them from this (just ignore the power of 5 and subtract the power of 5 from the power of 2 when you multiply the factors to get the factorial... because 5*2=10).
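A minimal sketch of that representation for factorials specifically (my own illustration; it uses Legendre's formula to get each prime's exponent directly, and assumes sympy is available for the prime generator):
from sympy import primerange

def factorial_exponents(n):
    ''' Return [(p, e), ...] such that n! == product of p**e. '''
    factors = []
    for p in primerange(2, n + 1):
        e, q = 0, n
        while q:                 # Legendre's formula: e = sum(n // p**k)
            q //= p
            e += q
        factors.append((p, e))
    return factors

def to_int(factors):
    ''' Restore the exact integer from the compact factor list. '''
    result = 1
    for p, e in factors:
        result *= p ** e
    return result

assert to_int(factorial_exponents(10)) == 3628800
Multiplying two numbers stored this way reduces to adding the exponents of matching primes, which is where the real savings come from.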
If you need the exact value of a factorial later, why don't you save in an array not the result but the number you want to 'factorialize'?
E.G.
You have f = factorial(150)
and you have the result 57133839564458545904789328652610540031895535786011264182548375833179829124845398393126574488675311145377107878746854204162666250198684504466355949195922066574942592095735778929325357290444962472405416790722118445437122269675520000000000000000000000000000000000000
But you can simply:
def values():
    to_factorial_list = []
    ...
    to_factorial_list.append(values_you_want_to_factorialize)
    return to_factorial_list

def setToFactorial(number):
    return factorial(number)

print setToFactorial(values()[302])
EDIT:
Fair enough; then my advice is to combine the logic I suggested with getsizeof(number): you can merge or work with two arrays, one to save the small factorialized numbers and another to save the big ones, e.g. whenever getsizeof(number) exceeds some threshold.
Imagine that you have some counter or other data element that needs to be stored in a field of a binary protocol. The field naturally has some fixed number n of bits and the protocol specifies that you should store the n least significant bits of the counter, so that it wraps around when it is too large. One possible way to implement that is actually taking the modulus by a power of two:
field_value = counter % 2 ** n
But this certainly isn't the most efficient way and maybe not even the easiest to understand, taking into account that the specification is talking about the least significant bits and does not mention a modulus operation. Thus, investigating alternatives is appropriate. Some examples are:
field_value = counter % (1 << n)
field_value = counter & (1 << n) - 1
field_value = counter & ~(-1 << n)
What is the way preferred by experienced Python programmers to implement such a requirement trying to maximize code clarity without sacrificing too much performance?
There is of course no right or wrong answer to this question, so I would like to use this question to collect all the reasonable implementations of this seemingly trivial requirement. An answer should list the alternatives and shortly describe in what circumstance what alternative would preferably be used.
Bit shifting and bitwise operations are more readable in your case, because they simply tell the reader that you are doing bitwise operations here. If you use a numeric operation, the reader may not understand what taking the modulus by that number means.
Talking about performance, you actually don't have to worry too much about this in Python, because operating on a Python object is itself expensive enough that it simply doesn't matter whether you do it with numeric or bitwise operations. Here I explain it in a visual way:
<-------------- Python object operation cost --------------><- bit op ->
<-------------- Python object operation cost --------------><----- num op ----->
This is just a rough idea of what it costs to perform the simplest bit operation or number operation. As you can see, the Python object operation cost makes up the majority, so it doesn't matter whether you use bitwise or numeric operations; the difference is too small to matter.
If you really need performance, for example if you have to process a massive amount of data, you should consider:
Write the logic in a C/C++ module for Python; you can use a library like Boost.Python
Use a third-party library for mass number processing, such as numpy
You should simply throw away the top bits:
#field_value = counter & (1 << n) - 1
field_value = counter & ALLOWED_BIT_WIDTH
If this was implemented in an embedded device, the registers used could be the limiting factor. In my experience this is the way it is normally done.
The "limitation" in the protocol is a way of constraining the overhead bandwidth needed by the protocol.
It will probably depend on the Python implementation, but in CPython 2.6 it looks like this:
In [1]: counter = 0xfedcba9876543210
In [10]: %timeit counter % 2**15
1000000 loops, best of 3: 304 ns per loop
In [11]: %timeit counter % (1<<15)
1000000 loops, best of 3: 302 ns per loop
In [12]: %timeit counter & ((1<<15)-1)
10000000 loops, best of 3: 104 ns per loop
In [13]: %timeit counter & ~(1<<15)
10000000 loops, best of 3: 170 ns per loop
In this case, counter & ((1<<15)-1) is the clear winner. Interestingly, 2**15 and 1<<15 take (more or less) the same amount of time; I am guessing Python internally optimizes this case and turns 2**15 into 1<<15 anyway.
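For readability it may be worth hiding the winning expression behind a named constant or helper; a small sketch (the names are mine):
FIELD_BITS = 15
FIELD_MASK = (1 << FIELD_BITS) - 1      # 0x7fff, the 15 least significant bits

def field_value(counter):
    ''' Keep only the FIELD_BITS least significant bits of the counter. '''
    return counter & FIELD_MASK

assert field_value(0xfedcba9876543210) == 0x3210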
I once wrote a class that lets you just do this:
bc = BitSliceLong(counter)
bc = bc[15:0]
It is derived from long, but it's a more general implementation (it lets you take any range of the bits, not just x:0), and the extra overhead for that makes it slower by an order of magnitude, even though it's using the same method inside.
Edit: BTW, precalculating the values doesn't appear to provide any benefit - the dominant factor here is not the actual math operation. If we do
cx_mask = 2**15
counter % cx_mask
the time is the same as when it had to calculate 2**15. This was also true for our 'best case' - precalculating ((1<<15)-1) has no benefit.
Also, in the previous case, I used a large number that is implemented as a long in python. This is not really a native type - it supports arbitrary length numbers, and so needs to handle nearly anything, so implementing operations is not just a single ALU call - it involves a series of bit-shifting and arithmetic operations.
If you can keep the counter below sys.maxint, you'll be using int types instead, and they both appear to be faster & also more dominated by actual math code:
In [55]: %timeit x % (1<<15)
10000000 loops, best of 3: 53.6 ns per loop
In [56]: %timeit x & ((1<<15)-1)
10000000 loops, best of 3: 49.2 ns per loop
In [57]: %timeit x % (2**15)
10000000 loops, best of 3: 53.9 ns per loop
These are all about the same, so it doesn't matter which one you use here really. (mod slightly slower, but within random variation). It makes sense for div/mod to be an expensive operation on very large numbers, with a more complex algorithm, while for 'small' ints it can be done in hardware.
I've made my own version of insertion sort that uses pop and insert - to pick out the current element and insert it before the smallest element larger than the current one - rather than the standard swapping backwards until a larger element is found. When I run the two on my computer, mine is about 3.5 times faster. When we did it in class, however, mine was much slower, which is really confusing. Anyway, here are the two functions:
def insertionSort(alist):
    for x in range(len(alist)):
        for y in range(x,0,-1):
            if alist[y]<alist[y-1]:
                alist[y], alist[y-1] = alist[y-1], alist[y]
            else:
                break

def myInsertionSort(alist):
    for x in range(len(alist)):
        for y in range(x):
            if alist[y]>alist[x]:
                alist.insert(y,alist.pop(x))
                break
Which one should be faster? Does alist.insert(y,alist.pop(x)) change the size of the list back and forth, and how does that affect time efficiency?
Here's my quite primitive test of the two functions:
from time import time
from random import shuffle

listOfLists=[]
for x in range(100):
    a=list(range(1000))
    shuffle(a)
    listOfLists.append(a)

start=time()
for i in listOfLists:
    myInsertionSort(i[:])
myInsertionTime=time()-start

start=time()
for i in listOfLists:
    insertionSort(i[:])
insertionTime=time()-start

print("regular:",insertionTime)
print("my:",myInsertionTime)
I had underestimated your question, but it actually isn't easy to answer. There are a lot of different elements to consider.
Doing lst.insert(y, lst.pop(x)) is an O(n) operation, because lst.pop(x) costs O(len(lst) - x): list elements must be contiguous, so the list has to shift left by one all the elements after index x; dually, lst.insert(y, _) costs O(len(lst) - y) since it has to shift all the elements after index y right by one.
This means that a naive analysis gives an upper bound of O(n^3) complexity in the worst case for your code. As you suggested this is actually correct [remember that O(n^2) is a subset of O(n^3)], however it's not a tight upper bound because you move each element only once. So n times you do O(n) work, and this complexity is indeed O(n * n + n^2) = O(n^2), where the second n^2 refers to the number of comparisons, which is n^2 in the worst case. So, asymptotically your solution is the same as insertion sort.
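A small illustration of that shifting cost (my own sketch, not part of the timings below): popping from the front of a list has to move every remaining element, while popping from the end does not.
from timeit import timeit

setup = "lst = list(range(100000))"
print(timeit("lst.pop(0)", setup, number=10000))   # O(n) per call: the whole tail shifts left
print(timeit("lst.pop()",  setup, number=10000))   # O(1) per call: nothing shifts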
The first algorithm and the second algorithm iterate over y in opposite directions. As I have already commented, this changes the worst case for the algorithm.
While insertion sort has its worst-case with reverse-sorted sequences, your algorithm doesn't (which is actually good). This might be a factor that adds to the difference in timings since if you do not use random lists you might use an input that is worst-case for one algorithm but not worst-case for the other.
In [2]: %timeit insertionSort(list(range(10)))
100000 loops, best of 3: 5.46 us per loop
In [3]: %timeit myInsertionSort(list(range(10)))
100000 loops, best of 3: 8.47 us per loop
In [4]: %timeit insertionSort(list(reversed(range(10))))
10000 loops, best of 3: 20.4 us per loop
In [5]: %timeit myInsertionSort(list(reversed(range(10))))
100000 loops, best of 3: 9.81 us per loop
You should always test (also) with random inputs of different lengths.
The average complexity of insertion sort is O(n^2). Your algorithm might have a lower average time, however it's not entirely trivial to compute it.
I don't get why you use the insert+pop at all when you can use the swap. Trying this on my machine yields quite a big improvement in efficiency, since you reduce an O(n^2) component to an O(n) component.
Now, you ask why there was such a big change between the execution at home and in class.
There can be various reasons: for example, if you did not use a randomly generated list, you might have used an input that is an almost best case for insertion sort while being an almost worst case for your algorithm, and similar considerations. Without seeing what you did in class it is not possible to give an exact answer.
However I believe there is a very simple answer: you forgot to copy the list before profiling. This is the same error I did when I first posted this answer (quote from the previous answer):
If you want to compare the two functions you should use random lists:
In [6]: import random
...: input_list = list(range(10))
...: random.shuffle(input_list)
...:
In [7]: %timeit insertionSort(input_list) # Note: no input_list[:]!! Argh!
100000 loops, best of 3: 4.82 us per loop
In [8]: %timeit myInsertionSort(input_list)
100000 loops, best of 3: 7.71 us per loop
Also you should use big inputs to see the difference clearly:
In [11]: input_list = list(range(1000))
...: random.shuffle(input_list)
In [12]: %timeit insertionSort(input_list) # Note: no input_list[:]! Argh!
1000 loops, best of 3: 508 us per loop
In [13]: %timeit myInsertionSort(input_list)
10 loops, best of 3: 55.7 ms per loop
Note also that I, unfortunately, always executed the pairs of profilings in the same order, confirming my previous ideas.
As you can see, all calls to insertionSort except the first one used a sorted list as input, which is the best case for insertion sort! This means that the timing for insertion sort is wrong (and I'm sorry for having written this before!). Meanwhile myInsertionSort was also always executed with an already sorted list, and guess what? It turns out that one of the worst cases for myInsertionSort is the sorted list!
Think about it:
for x in range(len(alist)):
    for y in range(x):
        if alist[y]>alist[x]:
            alist.insert(y,alist.pop(x))
            break
If you have a sorted list, the alist[y] > alist[x] comparison will always be false. You might say "perfect! no swaps => no O(n) work => better timing"; unfortunately this is false, because no swaps also means no break, and hence you are doing n*(n+1)/2 iterations, i.e. the worst-case performance.
Note that this is very bad!!! Real-world data really often is partially sorted, so an algorithm whose worst-case is the sorted list is usually not a good algorithm for real-world use.
Note that this does not change if you replace insert + pop with a simple swap, hence the algorithm itself is not good from this point of view, independently from the implementation.
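One way to soften that forward-scan problem (my own addition, not something from the question) is to binary-search the insertion point in the already-sorted prefix with the bisect module: on an already sorted list nothing is moved and only O(n log n) comparisons are made, although the insert/pop shifting still keeps the overall worst case at O(n^2).
import bisect

def bisectInsertionSort(alist):
    for x in range(1, len(alist)):
        # alist[:x] is already sorted; find where alist[x] belongs in it
        y = bisect.bisect_right(alist, alist[x], 0, x)
        if y != x:                          # already in place on sorted input
            alist.insert(y, alist.pop(x))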