I understand that Python's array, provided by the array module, stores the actual values consecutively (not pointers to them). Hence I would expect that, when elements of such an array are read in order, the CPU cache would play a role.
Thus I would expect Code A below to be faster than Code B (the difference between the two is the order in which the elements are read).
Code A:
import array
import time
arr = array.array('l', range(100000000))
sum = 0
begin = time.time()
for i in range(10000):
    for j in range(10000):
        sum += arr[i * 10000 + j]
print(sum)
print(time.time() - begin)
Code B:
import array
import time
arr = array.array('l', range(100000000))
sum = 0
begin = time.time()
for i in range(10000):
    for j in range(10000):
        sum += arr[j * 10000 + i]
print(sum)
print(time.time() - begin)
The two versions' timings are almost identical (a difference of only ~3%). Am I missing something about the workings of the array?
Both code snippets are completely dominated by CPython's interpreter overhead (by a very large margin). Let's try to understand why.
First of all, CPython is an interpreter, so it optimizes (nearly) nothing. This means operations like i * 10000 are recomputed over and over even though they could be hoisted into the outer loop. It also means instructions are fetched and decoded from bytecode, which is pretty slow (and causes many memory accesses and branches).
Additionally, access to global variables is significantly slower in CPython because the interpreter needs to fetch the variable from a global dictionary, which is much slower than an access served from the CPU cache.
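As a rough illustration (a minimal sketch; the function names are only for the example and exact timings depend on the machine and CPython version), the cost of global versus local lookups can be compared with timeit:
import timeit

g = 0  # global variable

def use_global():
    total = 0
    for _ in range(1_000_000):
        total += g          # LOAD_GLOBAL: dictionary lookup on every iteration
    return total

def use_local():
    x = 0                   # local copy of the same value
    total = 0
    for _ in range(1_000_000):
        total += x          # LOAD_FAST: indexed access into the frame's local array
    return total

print(timeit.timeit(use_global, number=10))
print(timeit.timeit(use_local, number=10))   # typically noticeably faster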
Moreover, most CPython operations allocate/free objects, and this is expensive (again, far more than a cache access). Indeed, allocating an object requires fetching a bucket data structure and finding some available space in it. Note that small integers are cached, so they are not allocated; this means looping over small ranges is actually a bit faster. The caching checks are always performed, though, so they add some overhead even when an object cannot be cached. Such operations require several memory accesses (twice as many, since the matching free is also needed), not to mention the reference counting of each object, which also requires memory operations (plus the global interpreter lock operations).
In addition, CPython integer operations are pretty slow because CPython deals with variable-sized integers rather than native ones. This means CPython does additional checks when integers are large. Bad news: sum is a large integer.
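As a quick illustration (a minimal sketch; the exact caching range and object sizes depend on the CPython build), you can observe both the small-integer cache and the variable-sized integer objects from the interpreter:
import sys

# Small integers (roughly -5 to 256 in CPython) are pre-allocated and reused.
print(int("100") is int("100"))   # True: both results are the same cached object
print(int("300") is int("300"))   # False: two freshly allocated objects

# Integers are variable-sized heap objects, not native machine words.
print(sys.getsizeof(1))           # even a small int takes a few dozen bytes
print(sys.getsizeof(10**30))      # larger values occupy even more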
The following code is actually about 2.5 times faster than the original one, and it still spends a lot of time in CPython overheads (lots of object allocations/frees, reference counting, C calls, etc.):
import array
import time
arr = array.array('l', range(100000000))
def compute(arr):
    sum = 0
    for i in range(10000):
        tmp = i * 10000
        for j in range(100):
            tmp2 = tmp + j * 100
            for k in range(100):
                sum += arr[tmp2 + k]
    print(sum)
begin = time.time()
compute(arr)
print(time.time() - begin)
Pure-Python code running on the CPython interpreter is so slow that you often cannot see the impact of caches. Thus, using pure Python to benchmark such effects is a terrible idea. The only way to see such an impact is to use vectorized functions, that is, C functions doing the job far more efficiently than interpreted Python code. Numpy is able to do that. Here is equivalent code for the two original snippets:
import numpy as np
import time
arr = np.arange(100_000_000).astype(np.int64)
begin = time.time()
sum = 0
for i in range(10000):
    sum += arr[i*10000:i*10000+10000].sum()
print(sum)
print(time.time() - begin)
begin = time.time()
sum = 0
for i in range(10000):
    sum += arr[i:100_000_000+i:10000].sum()
print(sum)
print(time.time() - begin)
The above code gives the following timings:
Original first code: 19.581 s
First Numpy code: 0.064 s
Second Numpy code: 0.725 s
The first Numpy code is about 300 times faster than the original one, showing how inefficient the pure-Python code was. Indeed, this shows that ≥99.7% of the original run time was pure overhead. We can also see that the second Numpy code is slower than the first due to the strided access pattern (but it is still 27 times faster than the first original code).
Nearly all the time is spent in the same section of the same internal function in Numpy for both variants. That being said, the second one is much slower because of the strided access. Here is the executed assembly code:
Block 6:
0x180198970 add r9, qword ptr [rcx]
0x180198973 add rcx, r11
0x180198976 add r10, qword ptr [rcx]
0x180198979 add rcx, r11
0x18019897c sub rdx, 0x1
0x180198980 jnz 0x180198970 <Block 6>
This code is not optimal when the array slice is contiguous: the compiler could have generated significantly faster SIMD code for that case. Not to mention, the dependency chain prevents the processor from executing more instructions in parallel (on the same core). That being said, it enables us to see the impact of the strided access using the exact same assembly code, so this is a pretty good benchmark unless you want to include the benefit of SIMD instructions. SIMD instructions can make this code about 2-3 times faster on my machine, but they can only speed up the non-strided use-case on mainstream platforms.
If you want to measure cache effects, it is generally better to use natively compiled code. This can be done with Numba in Python (a JIT compiler using LLVM) or simply with natively compiled languages like C or C++.
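As an illustration of the Numba approach (a minimal sketch assuming Numba is installed; the function names are only for the example), the two access patterns from the question can be compared once compiled:
import numpy as np
from numba import njit

@njit
def sum_rows(arr, n):
    total = 0
    for i in range(n):
        for j in range(n):
            total += arr[i * n + j]   # sequential accesses, cache-friendly
    return total

@njit
def sum_cols(arr, n):
    total = 0
    for i in range(n):
        for j in range(n):
            total += arr[j * n + i]   # large-stride accesses, cache-unfriendly
    return total

arr = np.arange(100_000_000, dtype=np.int64)
sum_rows(arr, 10)   # warm-up calls so the JIT compilation is not included in the timing
sum_cols(arr, 10)
# Timing sum_rows(arr, 10000) against sum_cols(arr, 10000) should now expose the cache effect.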
Related
This is a basic example.
@jax.jit
def block(arg1, arg2):
    for x1 in range(cons1):
        for x2 in range(cons2):
            for x3 in range(cons3):
                --do something--
    return result
When the cons values are small, the compile time is around a minute. With larger cons, compile time is much higher, on the order of tens of minutes. And I need even higher cons. What can be done?
From what I am reading, the loops are the cause. They are unrolled at compile time.
Are there any workarounds? There is also jax.lax.fori_loop, but I don't understand how to use it. There is the jax.experimental.loops module, but again I'm not able to understand it.
I am very new to all this. Hence, all help is appreciated.
If you can provide some examples of how to use jax loops, that will be much appreciated.
Also, what is an ok compile time? Is it ok for it to be in minutes?
In one of the examples, compile time is 262 seconds and remaining runs are ~0.1-0.2 seconds.
Any gain in runtime is overshadowed by the compile time.
JAX's JIT compiler flattens all Python loops. To see what I mean, take a look at this simple function run through jax.make_jaxpr, which is a way to examine how JAX's tracer interprets python code (see Understanding Jaxprs for more):
import jax
def f(x):
    for i in range(5):
        x += i
    return x
print(jax.make_jaxpr(f)(0))
# { lambda ; a.
# let b = add a 0
# c = add b 1
# d = add c 2
# e = add d 3
# f = add e 4
# in (f,) }
Notice that the loop is flattened: every step becomes an explicit operation sent to the XLA compiler. The XLA compile time increases as you increase the number of operations in the function, so it makes sense that a triply-nested for-loop would lead to long compile times.
So, how to address this? Well, unfortunately the answer depends on what your --do something-- is doing, so I can't guess that.
In general, the best option is to use vectorized array operations rather than loops over the values in those vectors; for example, here is a very slow way of adding two vectors:
import jax.numpy as jnp
def f_slow(x, y):
    z = []
    for xi, yi in zip(x, y):
        z.append(xi + yi)
    return jnp.array(z)
and here is a much faster way to do the same thing:
def f_fast(x, y):
    return x + y
If your operations don't lend themselves to vectorization, another option is to use lax control flow operators in place of the for loops: this will push the loop down into XLA. This can have quite good performance on CPU, but is slower on accelerators when compared to equivalent vectorized array operations.
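For instance, here is a rough sketch (the reduction in the loop body is only illustrative, since I can't know what your --do something-- does) of how a loop can be expressed with jax.lax.fori_loop so it is compiled as a single XLA loop primitive instead of being unrolled:
import jax
from jax import lax

def summed(n):
    # body takes (loop index, carried value) and returns the updated carry
    def body(i, acc):
        return acc + i
    return lax.fori_loop(0, n, body, 0)

print(jax.jit(summed)(5))          # 0 + 1 + 2 + 3 + 4 = 10
print(jax.make_jaxpr(summed)(5))   # one loop primitive instead of five separate adds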
For more discussion on JAX and Python control flow statements (such as for, if, while, etc.), see 🔪 JAX - The Sharp Bits 🔪: Control Flow.
I am not sure if this will be the same as with Numba, but this might be a similar case.
When I use the numba.jit compiler and have a big data input, I first compile the function on some small example data, then use it.
Pseudo-code:
func_being_compiled(small_amount_of_data) # compile-only purpose
func_being_compiled(large_amount_of_data)
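As a concrete illustration of the same pattern (a minimal sketch assuming Numba is installed; the summation function is only an example):
import numpy as np
from numba import njit

@njit
def total(values):
    s = 0.0
    for v in values:
        s += v
    return s

total(np.zeros(10))                        # first call: triggers JIT compilation on tiny input
print(total(np.random.rand(10_000_000)))   # subsequent calls reuse the compiled code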
I am learning the ways of Numba and have not figured out how to use or whether I need to use multiprocessing.queue to combine all my loop data from separate processes.
Do I even want to use the multiprocessing module to break up big loops into multiple smaller ones to run in separate processes or does Numba do this automatically?
The code below is run through the multiprocessing module, which opens multiple processes divided up according to your system's core count. So there are many instances of the code running, each looping through a different segment of the overall calculation, and then the result 0 or 1 is sent back to the parent function.
My guess is Numba does this differently on its own and I don't want to use queue or the multiprocessing module?
@jit(nopython=True)
def prime_multiprocess(n, c, q):
    a, b, c = n[0], n[1], c
    for i in range(a, b):
        if c % i == 0:
            return q.put(0)
    return q.put(1)
This error may have been caused by the following argument(s):
- argument 2: cannot determine Numba type of <class 'multiprocessing.queues.Queue'>
I appreciate any explanation or link that explains using numba with parallel loops that speed things up.
I did some testing and it appears that a nested function solved the problem:
I rewrote it to:
def prime_multiprocess(n, c, q):
    a, b, c = n[0], n[1], c
    @jit(nopython=True)
    def speed_comp():
        for i in range(a, b):
            if c % i == 0:
                return 0
        return 1
    q.put(speed_comp())
It is faster!
edit:
It appears there is a downside: I am limited in the size of the integers I can use. Sigh, why is there always a trade-off? :(
I wonder if it's possible to work around this with numpy and whether it would slow things down. The answer might be here: Numba support for big integers?
The way Numba works is that it converts integers into machine-level integers, which are limited to your machine's word size, such as 64 bits. This is what makes it run faster: there is no overhead on top of the calculations. Unfortunately, without that overhead slowing things down, you cannot compute bigger integers.
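To see the difference in practice (a minimal sketch; NumPy's fixed-width integers are used here as a stand-in for the machine integers Numba works with):
import numpy as np

x = np.array([2**62], dtype=np.int64)
print(x * 4)     # exceeds the 64-bit range: wraps around (NumPy may also emit an overflow warning)

y = 2**62        # plain Python int: arbitrary precision
print(y * 4)     # exact result, no overflow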
I defined two correct ways of calculating averages in python.
def avg_regular(values):
    total = 0
    for value in values:
        total += value
    return total/len(values)
def avg_concurrent(values):
    mean = 0
    num_of_values = len(values)
    for value in values:
        # calculate a small portion of the average for each num and add to the total
        mean += value/num_of_values
    return mean
The first function is the regular way of calculating averages, but I wrote the second one because each run of the loop doesn't depend on previous runs. So theoretically the average can be computed in parallel.
However, the "parallel" one (without running in parallel) takes about 30% more time than the regular one.
Are my assumptions correct and worth the speed loss?
If yes, how can I make the second function run in parallel?
If not, where did I go wrong?
The code you implemented is basically the difference between (a1+a2+ ... + an) / n and (a1/n + a2/n + ... + an/n). The result is the same, but in the second version there are more operations (namely (n-1) more divisions) which slows the calculation down. You claimed that in the second version each loop run is independent of the others. In the first loop we need the following information to finish one loop run: total before the run and the current value. In the second version we need the following information to finish one loop run: mean before the run, the current value and num_of_values. As you see in the second version we even depend on more values!
But how could we divide the work between cores (which is the goal of multiprocessing)? We could just give one core the first half of the values and the other core the second half, i.e. compute ((a1 + a2 + ... + a(n//2)) + (a(n//2+1) + ... + an)) / n. Yes, the work of dividing by n is not split between the cores, but it's a single instruction, so we don't really care. Also, we need to add the left total and the right total, which we can't split, but again it's only a single operation.
So the code we want to run:
def my_sum(values):
    total = 0
    for value in values:
        total += value
    return total
There's still a problem with Python: normally one could use threads to do the computations, because each thread can use one core. But then one has to take care that the program does not run into race conditions, and the Python interpreter itself also needs to take care of that. CPython decided it's not worth it and basically only runs one thread at a time. A basic solution is to use multiple processes via the multiprocessing module.
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(5) as p:
        results = p.map(my_sum, [long_list[0:len(long_list)//2], long_list[len(long_list)//2:]])
    print(sum(results) / len(long_list))  # add subresults and divide by n
But of course multiple processes do not come for free: you need to fork, copy data, etc., so you will not gain the speedup of 2 one might expect. Also, the biggest slowdown is actually using Python itself; it's not really optimized for fast numerical computations. There are various ways around that, but using numpy is probably the simplest. Just use:
import numpy
print(numpy.mean(long_list))
This is probably much faster than the pure-Python version. I don't think numpy uses multiprocessing internally, so one could gain a further boost by combining multiple processes with a fast implementation (numpy or something else written in C), but normally numpy is fast enough.
I have been playing with memory_profiler for some time and got these interesting but confusing results from the small program below:
import pandas as pd
import numpy as np
@profile
def f(p):
    tmp = []
    for _, frame in p.iteritems():
        tmp.append([list(record) for record in frame.to_records(index=False)])

# initialize a list of pandas panels
lp = []
for j in xrange(50):
    d = {}
    for i in xrange(50):
        df = pd.DataFrame(np.random.randn(200, 50))
        d[i] = df
    lp.append(pd.Panel(d))

# execution (iteration)
for panel in lp:
    f(panel)
Then if I use memory_profiler's mprof to analyze the memory usage during runtime, mprof run test.py without any other parameters, I get this:
[mprof memory usage plot]
There seems to be memory unreleased after each function call f().
tmp is just a local list, so it should be reassigned and its memory reallocated each time f() is called. Obviously there is some discrepancy here in the attached graph. I know that Python has its own memory management blocks and also keeps free lists for int and some other types, and that gc.collect() should do the magic. It turns out that an explicit gc.collect() doesn't work. (Maybe because we are working with pandas objects, panels and frames? I don't know.)
The most confusing part is, I don't change or modify any variable in f(). All it does is just put some list representation copies in a local list. Therefore python doesn't need to make a copy of anything. Then why and how does this happen?
=================
Some other observations:
1) If I call f() with f(panel.copy()) (last line of code), passing a copy instead of the original object reference, I get a totally different memory usage result (memory_profiling_results_2.png): the curve stays flat. Is Python smart enough to tell that the value passed is a copy, so that it can do some internal tricks to release the memory after each function call?
2) I think it might be because of df.to_records(). If I change it to frame.values, I get a similarly flat memory curve, just like memory_profiling_results_2.png shown above, during the iteration (although I do need to_records() because it maintains the column dtypes, while .values messes the dtypes up). But I looked into frame.py's implementation of to_records(), and I don't see why it would hold on to the memory there, while .values works just fine.
I am running the program on Windows, with python 2.7.8, memory_profiler 0.43 and psutil 5.0.1.
This is not a memory leak. What you are seeing is a side effect of pandas.core.NDFrame caching some results. This allows it to return the same information the second time you ask for it without running the calculations again. Change the end of your sample code to look like the following code and run it. You should find that the second time through the memory increase will not happen, and the execution time will be less.
import time
# execution (iteration)
start_time = time.time()
for panel in lp:
    f(panel)
print(time.time() - start_time)
print('-------------------------------------')
start_time = time.time()
for panel in lp:
    f(panel)
print(time.time() - start_time)
I find that Python's VM seems to be very unfriendly to the CPU cache.
for example:
import time
a = range(500)
sum(a)
for i in range(1000000): #just to create a time interval, seems this disturb cpu cache?
    pass
st = time.time()
sum(a)
print (time.time() - st)*1e6
time:
> 100us
another case:
import time
a = range(500)
for i in range(100000):
    st = time.time()
    sum(a)
    print (time.time() - st)*1e6
time:
~ 20us
We can see that when it runs frequently, the code becomes much faster.
Is there a solution?
we can refer to this question:(should be the same problem)
python function(or a code block) runs much slower with a time interval in a loop
I feel this question is very difficult; one must have an in-depth understanding of the mechanisms of the Python virtual machine, C, and the CPU cache.
Do you have any suggestion about where to post this question for a possible answer?
Your hypothesis that "when it runs frequently, the code becomes much faster" is not quite correct.
This loop
#just to create a time interval, seems this disturb cpu cache?
for i in range(1000000):
    pass
isn't merely a simple time delay. In Python 2 it creates a list of 1,000,000 integer objects, loops over it, and then destroys it. In Python 3, range() returns a lazy sequence object, so it's much less wasteful of memory; the Python 2 equivalent is xrange().
In the standard CPython implementation, a list is an array of pointers to the objects it contains. On a 32 bit system, an integer object occupies 12 bytes, an empty list object occupies 32 bytes, and of course a pointer occupies 4 bytes. So range(1000000) consumes a total of 32 + 1000000 * (4 + 12) = 16000032 bytes.
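Those figures can be checked directly (a minimal sketch; the exact sizes depend on the platform, bitness, and Python version):
import sys

print(sys.getsizeof(1))      # size of one small int object (12 bytes on 32-bit Python 2)
print(sys.getsizeof([]))     # size of an empty list object
lst = range(1000000)         # under Python 2 this builds a real list
print(sys.getsizeof(lst))    # size of the pointer array only, not the int objects it references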
Once the loop finishes, the reference count on the range() list drops to zero so it and its contents are eligible for garbage collection. It takes a little bit of time to garbage collect a million integers, so it's not surprising that you notice a time delay after the loop is finished. Also, the process of creating & deallocating all those objects will generally cause some memory fragmentation, and that will impact the efficiency of subsequent code.
FWIW, on my 2GHz Pentium 4 Linux system, your 1st code block prints values in the range 38 - 41µs, but if I change the loop to use xrange() then the time drops to 31 - 34µs.
If the code is modified to prevent garbage collection of the range() until after sum(a) is calculated, then it makes little difference in the timing whether we use range() or xrange().
import time
a = range(500)
sum(a)
bigrange = xrange(1000000)
for i in bigrange:
    pass
st = time.time()
sum(a)
print (time.time() - st) * 1e6
#Maintain a reference to bigrange to prevent it
#being garbage collected earlier
print sum(bigrange)