results is a 2D numpy array with 300,000 rows.
count = 0
for i in range(np.size(results, 0)):
    if results[i][0] >= 0.7:
        count += 1
This Python code takes me about 0.7 seconds, but the equivalent C++ code takes less than 0.07 seconds.
So how can I make this Python code as fast as possible?
When doing numerical computation for speed, especially in Python, you never want to use for loops if possible. Numpy is optimized for "vectorized" computation, so you want to pass off the work you'd typically do in for loops to special numpy indexing and functions like where.
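For instance, a minimal vectorized replacement for the loop above might look like this (the array here is just a stand-in with the shape used in the test below):

import numpy as np

results = np.random.rand(300000, 600)                    # stand-in for the array in the question
count = int(np.count_nonzero(results[:, 0] >= 0.7))      # count first-column values >= 0.7 without a Python loop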
I did a quick test on a 300,000 x 600 array of random values from 0 to 1 and found the following.
Your code, non-vectorized with one for loop:
226 ms per run
%%timeit
count = 0
for i in range(np.size(results, 0)):
    if results[i][0] >= 0.7:
        count += 1
emilaz Solution:
8.36 ms per run
%%timeit
first_col = results[:,0]
x = len(first_col[first_col>.7])
Ethan's Solution:
7.84 ms per run
%%timeit
np.bincount(results[:,0]>=.7)[1]
Best I came up with
6.92 ms per run
%%timeit
len(np.where(results[:,0] > 0.7)[0])
All 4 methods yielded the same answer, which for my data was 90,134. Hope this helps!
Try
first_col = results[:,0]
res = len(first_col[first_col > .7])
Depending on the shape of your matrix, this can be 2-10 times faster than your approach.
You could give the following a try:
np.bincount(results[:,0]>=.7)[1]
Not sure it’s faster, but should produce the correct answer
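One caveat worth noting: if no value clears the threshold, the boolean array is all False, bincount returns an array of length 1, and indexing [1] raises an IndexError. Passing minlength=2 guards against that; a small sketch (the array is a stand-in):

import numpy as np

results = np.random.rand(300000, 600)                       # stand-in array
count = np.bincount(results[:, 0] >= 0.7, minlength=2)[1]   # safe even if nothing matches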
I'm following the exercises from "Doing Bayesian Data Analysis" in both R and Python.
I would like to find a fast method of doing Monte-Carlo simulation that uses constant space.
The problem below is trivial, but serves as a good test for different methods:
ex 4.3
Determine the exact probability of drawing a 10 from a shuffled pinochle deck. (In a pinochle deck, there are 48 cards. There are six values: 9, 10, Jack, Queen, King, Ace. There are two copies of each value in each of the standard four suits: hearts, diamonds, clubs, spades.)
(A) What is the probability of getting a 10?
Of course, the answer is 1/6.
The fastest solution I could find (comparable to the speed of R) is generating a large array of card draws using np.random.choice, then applying a Counter. I don't like the idea of creating arrays unnecessarily, so I tried using a dictionary and a for loop, drawing one card at a time and incrementing the count for that type of card. To my surprise, it was much slower!
The full code for the 3 methods I tested is below. Is there a way of doing this that is as performant as method1(), but uses constant space?
Python code: (Google Colab link)
import random
from collections import Counter, defaultdict

import numpy as np
import pandas as pd

deck = [c for c in ['9', '10', 'Jack', 'Queen', 'King', 'Ace'] for _ in range(8)]
num_draws = 1000000

def method1():
    draws = np.random.choice(deck, size=num_draws, replace=True)
    df = pd.DataFrame([Counter(draws)]) / num_draws
    print(df)

def method2():
    card_counts = defaultdict(int)
    for _ in range(num_draws):
        card_counts[np.random.choice(deck, replace=True)] += 1
    df = pd.DataFrame([card_counts]) / num_draws
    print(df)

def method3():
    card_counts = defaultdict(int)
    for _ in range(num_draws):
        card_counts[deck[random.randint(0, len(deck) - 1)]] += 1
    df = pd.DataFrame([card_counts]) / num_draws
    print(df)
Python timeit() results:
method1: 1.2997
method2: 23.0626
method3: 5.5859
R code:
card = sample(deck, numDraws, replace=TRUE)
print(as.data.frame(table(card)/numDraws))
Here's one with np.unique+np.bincount -
def unique():
    unq, ids = np.unique(deck, return_inverse=True)
    all_ids = np.random.choice(ids, size=num_draws, replace=True)
    ar = np.bincount(all_ids)/num_draws
    return pd.DataFrame(ar[None], columns=unq)
How does NumPy help here?
There are two major improvements that are helping us here:
We convert the string data to numeric. NumPy works well with such data. To achieve this, we are using np.unique.
We use np.bincount to replace the counting step. Again, it works well with numeric data and we do have that from the numeric conversion done at the start of this method.
NumPy in general works well with large data, which is the case here.
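To see what the numeric conversion looks like, here is a small demo of np.unique with return_inverse, with the deck trimmed to a handful of cards for readability:

import numpy as np

small_deck = ['9', '10', 'Jack', '9', 'Ace', '10']
unq, ids = np.unique(small_deck, return_inverse=True)
print(unq)   # ['10' '9' 'Ace' 'Jack']  (sorted unique labels)
print(ids)   # [1 0 3 1 2 0]            (each card replaced by its index into unq)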
Timings with the given sample dataset, comparing against the fastest of the original methods, method1 -
In [177]: %timeit method1()
328 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [178]: %timeit unique()
12.4 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy achieves efficiency by running C code in its numerical engine. Python is convenient, but it is orders of magnitude slower than C.
In Numpy and other high-performance Python libraries, the Python code consists mostly of glue code, preparing the task to be dispatched. Since there is overhead, it is much faster to draw a lot of samples at once.
Remember that providing a buffer of 1 million elements for Numpy to work is still constant space. Then you can sample 1 billion times by looping it.
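As a sketch of that batched approach (reusing the deck from the question; the chunk size and total draw count here are arbitrary):

import numpy as np
from collections import Counter

deck = [c for c in ['9', '10', 'Jack', 'Queen', 'King', 'Ace'] for _ in range(8)]
chunk = 1_000_000                  # fixed buffer size: memory stays bounded at one chunk
total_draws = 10 * chunk           # total number of samples, drawn in batches
counts = Counter()
for _ in range(total_draws // chunk):
    draws = np.random.choice(deck, size=chunk, replace=True)   # one big vectorized draw
    counts.update(draws)           # accumulate counts; the previous buffer is freed each pass
print({card: n / total_draws for card, n in counts.items()})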
This extra memory allocation is usually not a problem. If you must avoid using memory at all costs while still getting performance benefits from Numpy, you can try using Numba or Cython to accelerate it.
import numpy as np
from numba import jit

@jit(nopython=True)
def method4():
    card_counts = np.zeros(6)
    for _ in range(num_draws):
        card_counts[np.random.randint(0, 6)] += 1
    return card_counts / num_draws
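A possible way to use it (the labels and the call below are my own illustration; num_draws is the global from the question, and the label order is arbitrary since every bin is drawn uniformly):

num_draws = 1000000
values = ['9', '10', 'Jack', 'Queen', 'King', 'Ace']
freqs = method4()                  # first call also triggers JIT compilation
print(dict(zip(values, freqs)))    # each frequency should be close to 1/6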
I am working with huge numbers, such as 150!. Calculating the result is not a problem; for example,
f = factorial(150) is
57133839564458545904789328652610540031895535786011264182548375833179829124845398393126574488675311145377107878746854204162666250198684504466355949195922066574942592095735778929325357290444962472405416790722118445437122269675520000000000000000000000000000000000000.
But I also need to store an array with N of those huge numbers, in full precision. A Python list can store them, but it is slow. A numpy array is fast, but cannot handle the full precision, which is required for some operations I perform later; as I have tested, a number in scientific notation (float) does not produce an accurate result.
Edit:
150! is just an example of a huge number; it does not mean I am working only with factorials. Also, the full set of numbers (NOT always the result of a factorial) changes over time, and I need to keep updating them and re-evaluating a function for which those numbers are a parameter, and yes, full precision is required.
numpy arrays are very fast when they can internally work with a simple data type that can be directly manipulated by the processor. Since there is no simple, native data type that can store huge numbers, they are converted to a float. numpy can be told to work with Python objects but then it will be slower.
Here are some times on my computer. First the setup.
a is a Python list containing the first 50 factorials. b is a numpy array with all the values converted to float64. c is a numpy array storing Python objects.
import numpy as np
import math
a=[math.factorial(n) for n in range(50)]
b=np.array(a, dtype=np.float64)
c=np.array(a, dtype=object)
a[30]
265252859812191058636308480000000L
b[30]
2.6525285981219107e+32
c[30]
265252859812191058636308480000000L
Now to measure indexing.
%timeit a[30]
10000000 loops, best of 3: 34.9 ns per loop
%timeit b[30]
1000000 loops, best of 3: 111 ns per loop
%timeit c[30]
10000000 loops, best of 3: 51.4 ns per loop
Indexing into a Python list is fastest, followed by extracting a Python object from a numpy array, and slowest is extracting a 64-bit float from an optimized numpy array.
Now let's measure multiplying each element by 2.
%timeit [n*2 for n in a]
100000 loops, best of 3: 4.73 µs per loop
%timeit b*2
100000 loops, best of 3: 2.76 µs per loop
%timeit c*2
100000 loops, best of 3: 7.24 µs per loop
Since b*2 can take advantage of numpy's optimized array, it is the fastest. The Python list takes second place. And a numpy array using Python objects is the slowest.
At least with the tests I ran, indexing into a Python list doesn't seem slow. What is slow for you?
Store each number as a tuple of prime factors and their powers. The factorization of a factorial (of, let's say, N) will contain ALL primes up to N, so the k'th entry in each tuple is the power of the k'th prime, and you'll want to keep a separate list of all the primes you've found. You can easily store factorials as high as a few hundred thousand in this notation. If you really need the digits, you can easily restore them from this representation (just ignore the power of 5 and subtract it from the power of 2 when you multiply the factors back to get the factorial, since 5*2 = 10 and those pairs only contribute trailing zeros).
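A minimal sketch of this representation, using Legendre's formula for the prime exponents of N!; the helper names are mine, not from the answer:

import math

def primes_up_to(n):
    """Simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return [p for p, is_prime in enumerate(sieve) if is_prime]

def factorial_prime_powers(n):
    """Exponent of each prime in n!, via Legendre's formula."""
    primes = primes_up_to(n)
    powers = []
    for p in primes:
        e, q = 0, p
        while q <= n:
            e += n // q
            q *= p
        powers.append(e)
    return primes, powers

def restore(primes, powers):
    """Multiply the prime powers back into the exact integer."""
    result = 1
    for p, e in zip(primes, powers):
        result *= p ** e
    return result

primes, powers = factorial_prime_powers(150)
assert restore(primes, powers) == math.factorial(150)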
If you need the exact value of a factorial later, why don't you store in an array not the result, but the number you want to 'factorialize'?
E.G.
You have f = factorial(150)
and you have the result 57133839564458545904789328652610540031895535786011264182548375833179829124845398393126574488675311145377107878746854204162666250198684504466355949195922066574942592095735778929325357290444962472405416790722118445437122269675520000000000000000000000000000000000000
But you can simply:
from math import factorial

def values():
    to_factorial_list = []
    ...
    to_factorial_list.append(values_you_want_to_factorialize)
    return to_factorial_list

def setToFactorial(number):
    return factorial(number)

print(setToFactorial(values()[302]))
EDIT:
Fair enough; then my advice is to combine the logic I suggested with sys.getsizeof(number): you can merge the ideas or work with two arrays, one to store the small factorialized numbers and another to store the big ones, e.g. moving a number to the second array when getsizeof(number) exceeds some threshold.
I am trying to write some nested loops in my algorithm, and I am running into the problem that the whole algorithm takes far too long because of these nested loops. I am quite new to Python (as you may see from my unprofessional code below :( ), and hopefully someone can show me a way to speed up my code!
The whole algorithm is for fire detection in multiple 1500*6400 arrays. A small contextual analysis is applied while going through the whole array. The contextual analysis uses a dynamically assigned window size: the window grows from 11*11 up to 31*31 until there are enough valid values inside the sampling window for the next round of calculation, for example like below:
def ContextualWindows (arrb4,arrb5,pfire):
    ####arrb4,arrb5,pfire are 31*31 sampling windows from large 1500*6400 numpy array
    i=5
    while i in range (5,16):
        arrb4back=arrb4[15-i:16+i,15-i:16+i]
        ## only output the array data when it is 'large' enough
        ## to have enough good quality data to do calculation
        if np.ma.count(arrb4back)>=min(10,0.25*i*i):
            arrb5back=arrb5[15-i:16+i,15-i:16+i]
            pfireback=pfire[15-i:16+i,15-i:16+i]
            canfire=0
            i=20
        else:
            i=i+1
    ###unknown pixel: background condition could not be characterized
    if i!=20:
        canfire=1
        arrb5back=arrb5
        pfireback=pfire
        arrb4back=arrb4
    return (arrb4back,arrb5back,pfireback,canfire)
Then this dynamic window will be fed into the next round of tests, for example:
b4backave = np.mean(arrb4Windows)
b4backdev = np.std(arrb4Windows)
if b4 > b4backave + 3.5*b4backdev:
    firetest = True
Running the whole code on my multiple 1500*6400 numpy arrays takes over half an hour, or even longer. I'm just wondering if anyone has an idea how to deal with it? Even a general idea of which part I should put my effort into would be greatly helpful!
Many thanks!
Avoid while loops if speed is a concern. The loop lends itself to a for loop as start and end are fixed. Additionally, your code does a lot of copying which isn't really necessary. The rewritten function:
def ContextualWindows (arrb4,arrb5,pfire):
    ''' arrb4,arrb5,pfire are 31*31 sampling windows from
        large 1500*6400 numpy array '''
    for i in range (5, 16):
        lo = 15 - i  # 10..0
        hi = 16 + i  # 21..31
        # only output the array data when it is 'large' enough
        # to have enough good quality data to do calculation
        if np.ma.count(arrb4[lo:hi, lo:hi]) >= min(10, 0.25*i*i):
            return (arrb4[lo:hi, lo:hi], arrb5[lo:hi, lo:hi], pfire[lo:hi, lo:hi], 0)
    else:  # unknown pixel: background condition could not be characterized
        return (arrb4, arrb5, pfire, 1)
For clarity I've used style guidelines from PEP 8 (like extended comments, the number of comment characters, spaces around operators, etc.). Copying of a windowed arrb4 occurs twice here, but only if the condition is fulfilled, and this will happen only once per function call. The else clause is executed only if the for loop has run to its end. We don't even need a break from the loop as we exit the function altogether.
Let us know if that speeds up the code a bit. I don't think it'll be much but then again there isn't much code anyway.
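If it helps, here is a hypothetical smoke test of the rewritten function on random masked 31*31 windows (the data and the mask threshold are made up):

import numpy as np

rng = np.random.default_rng(0)
arrb4 = np.ma.masked_less(rng.random((31, 31)), 0.5)   # mask roughly half the pixels
arrb5 = np.ma.array(rng.random((31, 31)))
pfire = np.ma.array(rng.random((31, 31)))

b4_win, b5_win, pfire_win, canfire = ContextualWindows(arrb4, arrb5, pfire)
print(b4_win.shape, canfire)   # typically (11, 11) 0: the smallest window already has enough unmasked data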
I've run some time tests with ContextualWindows and variants. One i step takes about 50 us, all ten about 500 us.
This simple iteration takes about the same time:
[np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
The iteration mechanism and the 'copying' of arrays are minor parts of the time. Where possible numpy makes views, not copies.
I'd focus on either minimizing the number of these count steps, or speeding up the count.
Comparing times for various operations on these windows:
First time for 1 step:
In [167]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,6)]
10000 loops, best of 3: 43.9 us per loop
now for the 10 steps:
In [139]: timeit [arrb4[15-i:16+i,15-i:16+i].shape for i in range(5,16)]
10000 loops, best of 3: 33.7 us per loop
In [140]: timeit [np.sum(arrb4[15-i:16+i,15-i:16+i]>500) for i in range(5,16)]
1000 loops, best of 3: 390 us per loop
In [141]: timeit [np.ma.count(arrb4[15-i:16+i,15-i:16+i]) for i in range(5,16)]
1000 loops, best of 3: 464 us per loop
Simply indexing does not take much time, but testing for conditions takes substantially more.
cumsum is sometimes used to speed up sums over sliding windows. Instead of taking sum (or mean) over each window, you calculate the cumsum and then use the differences between the front and end of window.
Trying something like that, but in 2d - cumsum in both dimensions, followed by differences between diagonally opposite corners:
In [164]: %%timeit
.....: cA4=np.cumsum(np.cumsum(arrb4,0),1)
.....: [cA4[15-i,15-i]-cA4[15+i,15+i] for i in range(5,16)]
.....:
10000 loops, best of 3: 43.1 us per loop
This is almost 10x faster than the (nearly) equivalent sum. The values don't quite match, but the timing suggests that this may be worth refining.
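One way the corner arithmetic could be refined into an exact window sum is the classic summed-area table with a one-cell zero pad; a sketch under those assumptions (window_sums is an illustrative name, not from the original post):

import numpy as np

def window_sums(arr, center=(15, 15), sizes=range(5, 16)):
    """Exact sums over the (2*i+1) x (2*i+1) windows around `center`,
    using a zero-padded 2-D cumulative sum (summed-area table)."""
    table = np.zeros((arr.shape[0] + 1, arr.shape[1] + 1))
    table[1:, 1:] = np.cumsum(np.cumsum(arr, 0), 1)
    r, c = center
    sums = []
    for i in sizes:
        r0, r1 = r - i, r + i + 1          # window covers rows r0 .. r1-1 inclusive
        c0, c1 = c - i, c + i + 1
        s = table[r1, c1] - table[r0, c1] - table[r1, c0] + table[r0, c0]
        sums.append(s)
    return sums

arrb4 = np.random.rand(31, 31)
# the i = 15 window covers the whole 31x31 array, so its sum must equal arrb4.sum()
assert np.allclose(window_sums(arrb4)[-1], arrb4.sum())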
I'm using numpy to do linear algebra. I want to do fast subset-indexed dot and other linear operations.
When dealing with big matrices, a slicing solution like A[:,subset].dot(x[subset]) may take longer than doing the multiplication on the full matrix.
A = np.random.randn(1000,10000)
x = np.random.randn(10000,1)
subset = np.sort(np.random.randint(0,10000,500))
Timings show that sub-indexing can be faster when columns are in one block.
%timeit A.dot(x)
100 loops, best of 3: 4.19 ms per loop
%timeit A[:,subset].dot(x[subset])
100 loops, best of 3: 7.36 ms per loop
%timeit A[:,:500].dot(x[:500])
1000 loops, best of 3: 1.75 ms per loop
Still, the speed-up is not what I would expect (20x, since only 500 of the 10,000 columns are used!).
Does anyone know of a library/module that allows this kind of fast operation through numpy or scipy?
For now I'm using Cython to code a fast column-indexed dot product through the cblas library. But for more complex operations (pseudo-inverse, or sub-indexed least squares solving) I'm not sure I can reach a good speed-up.
Thanks!
Well, this is faster.
%timeit A.dot(x)
#4.67 ms
%%timeit
y = numpy.zeros_like(x)
y[subset]=x[subset]
d = A.dot(y)
#4.77ms
%timeit c = A[:,subset].dot(x[subset])
#7.21ms
And you have all(d-ravel(c)==0) == True.
Notice that how fast this is depends on the input. With subset = array([1,2,3]), the time of my solution stays pretty much the same, while the last solution takes only 46 microseconds.
Basically, this will be faster if the size of subset is not much smaller than the size of x.
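For reuse, the zero-fill trick could be wrapped up like this (dot_subset is my name for it, not from the answer):

import numpy as np

def dot_subset(A, x, subset):
    """Compute A[:, subset] @ x[subset] without slicing A,
    by zeroing the unused entries of x."""
    y = np.zeros_like(x)
    y[subset] = x[subset]
    return A.dot(y)

A = np.random.randn(1000, 10000)
x = np.random.randn(10000, 1)
subset = np.sort(np.random.randint(0, 10000, 500))
assert np.allclose(dot_subset(A, x, subset), A[:, subset].dot(x[subset]))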
I have a function in Python that basically takes the sign of an array (of shape (75, 150), for example).
I'm coming from Matlab, and the execution time looks more or less the same except for this function.
I'm wondering whether sign() is just slow, and whether you know of an alternative that does the same thing.
Thanks!
I can't tell you if this is faster or slower than Matlab, since I have no idea what numbers you're seeing there (you provided no quantitative data at all). However, as far as alternatives go:
import numpy as np
a = np.random.randn(75, 150)
aSign = np.sign(a)
Testing using %timeit in IPython:
In [15]: %timeit np.sign(a)
10000 loops, best of 3: 180 µs per loop
Because the loop over the array (and what happens inside it) is implemented in optimized C code rather than generic Python code, it tends to be about an order of magnitude faster—in the same ballpark as Matlab.
Comparing the exact same code as a numpy vectorized operation vs. a Python loop:
In [276]: %timeit [np.sign(x) for x in a]
1000 loops, best of 3: 276 us per loop
In [277]: %timeit np.sign(a)
10000 loops, best of 3: 63.1 us per loop
So, only 4x as fast in this case. (But then a is pretty small here.)
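If you want an alternative to np.sign to compare against (not necessarily faster, and only for finite values, since NaN handling differs), a vectorized comparison gives the same result:

import numpy as np

a = np.random.randn(75, 150)
alt_sign = (a > 0).astype(a.dtype) - (a < 0)   # +1, -1 or 0, elementwise
assert np.array_equal(alt_sign, np.sign(a))    # matches np.sign for finite inputs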