I currently have a small side project in which I want to sort a 20GB file on my machine as fast as possible. The idea is to chunk the file, sort the chunks, and merge the chunks. I used pyenv to time the radixsort code with different Python versions and saw that 2.7.18 is much faster than 3.6.10, 3.7.7, 3.8.3 and 3.9.0a. Can anybody explain why Python 3.x is slower than 2.7.18 in this simple example? Were there new features added?
import os

def chunk_data(filepath, prefixes):
    """
    Pre-sort and chunk the content of filepath according to the prefixes.

    Parameters
    ----------
    filepath : str
        Path to a text file which should get sorted. Each line contains
        a string which has at least 2 characters and the first two
        characters are guaranteed to be in prefixes
    prefixes : List[str]
    """
    prefix2file = {}
    for prefix in prefixes:
        chunk = os.path.abspath("radixsort_tmp/{:}.txt".format(prefix))
        prefix2file[prefix] = open(chunk, "w")

    # This is where most of the execution time is spent:
    with open(filepath) as fp:
        for line in fp:
            prefix2file[line[:2]].write(line)
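For readers who want to see the whole chunk / sort / merge idea end to end, here is a minimal sketch. It is not the code from the repository: the function name `radix_sort_file` and its parameters are made up for illustration, and it assumes that every line ends with a newline, that every line starts with one of the given two-character prefixes, and that each bucket fits in memory.

```python
import os
import tempfile
from contextlib import ExitStack


def radix_sort_file(src, dst, prefixes, tmp_dir):
    """Sort the lines of src into dst by bucketing on the first two characters.

    Hypothetical helper for illustration only.
    """
    prefixes = sorted(prefixes)
    paths = {p: os.path.join(tmp_dir, "{}.txt".format(p)) for p in prefixes}

    # Pass 1: distribute the lines into one bucket file per prefix.
    # ExitStack makes sure every bucket file is closed (and flushed) afterwards.
    with ExitStack() as stack:
        buckets = {p: stack.enter_context(open(path, "w"))
                   for p, path in paths.items()}
        with open(src) as fin:
            for line in fin:
                buckets[line[:2]].write(line)

    # Pass 2: sort each bucket in memory and append it to the output.
    # Visiting the buckets in sorted prefix order yields a globally sorted file.
    with open(dst, "w") as fout:
        for p in prefixes:
            with open(paths[p]) as fin:
                fout.writelines(sorted(fin))


if __name__ == "__main__":
    work_dir = tempfile.mkdtemp()
    src = os.path.join(work_dir, "in.txt")
    dst = os.path.join(work_dir, "out.txt")
    with open(src, "w") as f:
        f.writelines(["21b\n", "10z\n", "21a\n", "10a\n"])
    radix_sort_file(src, dst, ["10", "21"], work_dir)
    with open(dst) as f:
        print(f.read())
```

Because the buckets are concatenated in prefix order, no separate merge step is needed in this variant; a merge would only be required if the chunks were cut by size instead of by prefix.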
Execution times (multiple runs):
2.7.18: 192.2s, 220.3s, 225.8s
3.6.10: 302.5s
3.7.7: 308.5s
3.8.3: 279.8s, 279.7s (binary mode), 295.3s (binary mode), 307.7s, 380.6s (wtf?)
3.9.0a: 292.6s
The complete code is on Github, along with a minimal complete version
Unicode
Yes, I know that Python 3 and Python 2 deal with strings differently. I tried opening the files in binary mode (rb / wb), see the "binary mode" comments. They are a tiny bit faster on a couple of runs. Still, Python 2.7 is WAY faster on all runs.
Try 1: Dictionary access
When I phrased this question, I thought that dictionary access might be a reason for this difference. However, I think the total execution time is way less for dictionary access than for I/O. Also, timeit did not show anything important:
import timeit
import numpy as np

durations = timeit.repeat(
    'a["b"]',
    repeat=10 ** 6,
    number=1,
    setup="a = {'b': 3, 'c': 4, 'd': 5}"
)

mul = 10 ** -7

print(
    "mean = {:0.1f} * 10^-7, std={:0.1f} * 10^-7".format(
        np.mean(durations) / mul,
        np.std(durations) / mul
    )
)

print("min = {:0.1f} * 10^-7".format(np.min(durations) / mul))
print("max = {:0.1f} * 10^-7".format(np.max(durations) / mul))
Try 2: Copy time
As a simplified experiment, I tried to copy the 20GB file:
cp via shell: 230s
Python 2.7.18: 237s, 249s
Python 3.8.3: 233s, 267s, 272s
The Python stuff is generated by the following code.
My first thought was that the variance is quite high. So this could be the reason. But then, the variance of chunk_data execution time is also high, but the mean is noticeably lower for Python 2.7 than for Python 3.x. So it seems not to be an I/O scenario as simple as I tried here.
import time
import sys
import os

version = sys.version_info
version = "{}.{}.{}".format(version.major, version.minor, version.micro)

# Remove the output of a previous run, if any.
if os.path.isfile("numbers-tmp.txt"):
    os.remove("numbers-tmp.txt")

t0 = time.time()
with open("numbers-large.txt") as fin, open("numbers-tmp.txt", "w") as fout:
    for line in fin:
        fout.write(line)
t1 = time.time()

print("Python {}: {:0.0f}s".format(version, t1 - t0))
My System
Ubuntu 20.04
Thinkpad T460p
Python through pyenv
This is a combination of multiple effects. Mostly, Python 3 needs to perform unicode decoding/encoding when working in text mode, and when working in binary mode it sends the data through dedicated buffered IO implementations.
First of all, using time.time to measure execution time uses the wall time and hence includes all sorts of things unrelated to Python, such as OS-level caching and buffering, as well as buffering of the storage medium. It also reflects any interference from other processes that use the storage medium. That's why you are seeing these wild variations in timing results. Here are the results for my system, from seven consecutive runs for each version:
py3 = [660.9, 659.9, 644.5, 639.5, 752.4, 648.7, 626.6] # 661.79 +/- 38.58
py2 = [635.3, 623.4, 612.4, 589.6, 633.1, 613.7, 603.4] # 615.84 +/- 15.09
Despite the large variation, these results do seem to indicate different timings, as can be confirmed for example by a statistical test:
>>> from scipy.stats import ttest_ind
>>> ttest_ind(py2, py3)[1]
0.018729004515179636
i.e. if both sets of timings came from the same distribution, a difference at least this large would be expected only about 2% of the time.
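If SciPy is not at hand, the test statistic behind that p-value can be reproduced with the standard library. The sketch below assumes the default behaviour of ttest_ind (Student's t-test with pooled, equal variance); the helper name `pooled_t_statistic` is made up for illustration, and turning the statistic into a p-value still requires the t distribution with 12 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

py3 = [660.9, 659.9, 644.5, 639.5, 752.4, 648.7, 626.6]
py2 = [635.3, 623.4, 612.4, 589.6, 633.1, 613.7, 603.4]


def pooled_t_statistic(a, b):
    """Two-sample Student t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1.0 / na + 1.0 / nb))


t = pooled_t_statistic(py3, py2)
dof = len(py3) + len(py2) - 2
print("t = {:.2f} with {} degrees of freedom".format(t, dof))
```

A statistic of roughly 2.7 at 12 degrees of freedom is consistent with the two-sided p-value of about 0.019 reported above.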
We can get a more precise picture by measuring the process time rather than the wall time. In Python 2 this can be done via time.clock while Python 3.3+ offers time.process_time. These two functions report the following timings:
py3_process_time = [224.4, 226.2, 224.0, 226.0, 226.2, 223.7, 223.8] # 224.90 +/- 1.09
py2_process_time = [171.0, 171.1, 171.2, 171.3, 170.9, 171.2, 171.4] # 171.16 +/- 0.16
Now there's much less spread in the data since the timings reflect the Python process only.
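To make the difference between the two clocks concrete, here is a small Python 3 sketch. The helper name `timed` is hypothetical; time.perf_counter measures wall time, while time.process_time only counts CPU time spent by the current process, so waiting (for a sleep, a disk, or another process) shows up in the first but not in the second.

```python
import time


def timed(fn, *args):
    """Return (result, wall_seconds, cpu_seconds) for a single call of fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


# Sleeping costs wall time but (almost) no CPU time.
_, wall, cpu = timed(time.sleep, 0.2)
print("wall = {:.3f}s, cpu = {:.3f}s".format(wall, cpu))
```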
This data suggests that Python 3 takes about 53.7 seconds longer to execute. Given the large number of lines in the input file (550_000_000), this amounts to about 97.7 nanoseconds per iteration.
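The arithmetic behind these two numbers, spelled out as a small script (the variable names are only for illustration; the underscore in the integer literal needs Python 3.6+):

```python
from statistics import mean

py3_process_time = [224.4, 226.2, 224.0, 226.0, 226.2, 223.7, 223.8]
py2_process_time = [171.0, 171.1, 171.2, 171.3, 170.9, 171.2, 171.4]
n_lines = 550_000_000  # number of lines in the input file

extra_seconds = mean(py3_process_time) - mean(py2_process_time)
extra_ns_per_line = extra_seconds / n_lines * 1e9

print("extra time: {:.1f} s".format(extra_seconds))
print("per line:   {:.1f} ns".format(extra_ns_per_line))
```

The same conversion (seconds divided by 550 million, times 1e9) is used for the other per-iteration figures in this answer.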
The first effect causing increased execution time is the use of unicode strings in Python 3. The binary data is read from the file, decoded, and then encoded again when it is written back. In Python 2 all strings are stored as binary strings right away, so this doesn't introduce any encoding/decoding overhead. You don't see this effect clearly in your tests because it disappears in the large variation introduced by various external resources, which is reflected in the wall time difference. For example, we can measure the time it takes for a roundtrip from binary to unicode to binary:
In [1]: %timeit b'000000000000000000000000000000000000'.decode().encode()
162 ns ± 2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
This does include two attribute lookups as well as two function calls, so the actual time needed is smaller than the value reported above. To see the effect on execution time, we can change the test script to use binary modes "rb" and "wb" instead of text modes "r" and "w". This reduces the timing results for Python 3 as follows:
py3_binary_mode = [200.6, 203.0, 207.2] # 203.60 +/- 2.73
That reduces the process time by about 21.3 seconds or 38.7 nanoseconds per iteration. This is in agreement with timing results for the roundtrip benchmark minus timing results for name lookups and function calls:
In [2]: class C:
...: def f(self): pass
...:
In [3]: x = C()
In [4]: %timeit x.f()
82.2 ns ± 0.882 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [5]: %timeit x
17.8 ns ± 0.0564 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
Here %timeit x measures the additional overhead of resolving the global name x, and hence the attribute lookup and function call take 82.2 - 17.8 == 64.4 nanoseconds. Subtracting this overhead twice from the above roundtrip data gives 162 - 2*64.4 == 33.2 nanoseconds.
Now there's still a difference of 32.4 seconds between Python 3 using binary mode and Python 2. This comes from the fact that all the IO in Python 3 goes through the (quite complex) implementation of io.BufferedWriter .write, while in Python 2 the file.write method proceeds fairly straightforwardly to fwrite.
We can check the types of the file objects in both implementations:
$ python3.8
>>> type(open('/tmp/test', 'wb'))
<class '_io.BufferedWriter'>
$ python2.7
>>> type(open('/tmp/test', 'wb'))
<type 'file'>
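On Python 3 the layering can be inspected directly. The following sketch (Python 3 only; the temporary file name is arbitrary) shows that a text-mode file is an io.TextIOWrapper stacked on top of the same io.BufferedWriter that binary mode hands out, which is why text mode pays for both the encoding step and the buffered writer:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "layers.txt")

with open(path, "wb") as binary_file:
    binary_type = type(binary_file)

with open(path, "w") as text_file:
    text_type = type(text_file)
    # .buffer exposes the binary layer underneath the text wrapper
    inner_type = type(text_file.buffer)

print(binary_type)  # the buffered binary writer
print(text_type)    # the text layer that encodes str to bytes
print(inner_type)   # the binary writer wrapped by the text layer
```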
Here we also need to note that the above timing results for Python 2 have been obtained by using text mode, not binary mode. Binary mode aims to support all objects implementing the buffer protocol which results in additional work being performed also for strings (see also this question). If we switch to binary mode also for Python 2 then we obtain:
py2_binary_mode = [212.9, 213.9, 214.3] # 213.70 +/- 0.59
which is actually a bit slower than the Python 3 binary-mode results (by about 10.1 seconds, or 18.4 ns / iteration).
The two implementations also differ in other details such as the dict implementation. To measure this effect we can create a corresponding setup:
from __future__ import print_function
import timeit
N = 10**6
R = 7
results = timeit.repeat(
"d[b'10'].write",
setup="d = dict.fromkeys((str(i).encode() for i in range(10, 100)), open('test', 'rb'))", # requires file 'test' to exist
repeat=R, number=N
)
results = [x/N for x in results]
print(['{:.3e}'.format(x) for x in results])
print(sum(results) / R)
This gives the following results for Python 2 and Python 3:
Python 2: ~ 56.9 nanoseconds
Python 3: ~ 78.1 nanoseconds
This additional difference of about 21.2 nanoseconds amounts to about 12 seconds for the full 550M iterations.
The above timing code checks the dict lookup for only one key, so we also need to verify that there are no hash collisions:
$ python3.8 -c "print(len({str(i).encode() for i in range(10, 100)}))"
90
$ python2.7 -c "print len({str(i).encode() for i in range(10, 100)})"
90
Related
Using a list's insert function is much slower than achieving the same effect using slice assignment:
> python -m timeit -n 100000 -s "a=[]" "a.insert(0,0)"
100000 loops, best of 5: 19.2 usec per loop
> python -m timeit -n 100000 -s "a=[]" "a[0:0]=[0]"
100000 loops, best of 5: 6.78 usec per loop
(Note that a=[] is only the setup, so a starts empty but then grows to 100,000 elements.)
At first I thought maybe it's the attribute lookup or function call overhead or so, but inserting near the end shows that that's negligible:
> python -m timeit -n 100000 -s "a=[]" "a.insert(-1,0)"
100000 loops, best of 5: 79.1 nsec per loop
Why is the presumably simpler dedicated "insert single element" function so much slower?
I can also reproduce it at repl.it:
from timeit import repeat

for _ in range(3):
    for stmt in 'a.insert(0,0)', 'a[0:0]=[0]', 'a.insert(-1,0)':
        t = min(repeat(stmt, 'a=[]', number=10**5))
        print('%.6f' % t, stmt)
    print()
# Example output:
#
# 4.803514 a.insert(0,0)
# 1.807832 a[0:0]=[0]
# 0.012533 a.insert(-1,0)
#
# 4.967313 a.insert(0,0)
# 1.821665 a[0:0]=[0]
# 0.012738 a.insert(-1,0)
#
# 5.694100 a.insert(0,0)
# 1.899940 a[0:0]=[0]
# 0.012664 a.insert(-1,0)
I use Python 3.8.1 32-bit on Windows 10 64-bit.
repl.it uses Python 3.8.1 64-bit on Linux 64-bit.
I think it's probably just that they forgot to use memmove in list.insert. If you take a look at the code list.insert uses to shift elements, you can see it's just a manual loop:
    for (i = n; --i >= where; )
        items[i+1] = items[i];
while list.__setitem__ on the slice assignment path uses memmove:
    memmove(&item[ihigh+d], &item[ihigh],
        (k - ihigh)*sizeof(PyObject *));
memmove typically has a lot of optimization put into it, such as taking advantage of SSE/AVX instructions.
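As a quick sanity check that the two spellings really do the same work, here is a small sketch. The function names and the list size of 20000 are arbitrary choices for illustration, and the measured ratio will depend on the CPython version and platform:

```python
from timeit import timeit


def front_insert(n):
    """Build a list by repeatedly inserting at the front with list.insert."""
    a = []
    for i in range(n):
        a.insert(0, i)
    return a


def front_slice_assign(n):
    """Build the same list with an empty-slice assignment at the front."""
    a = []
    for i in range(n):
        a[0:0] = [i]
    return a


# Both approaches must produce identical lists.
assert front_insert(1000) == front_slice_assign(1000)

for fn in (front_insert, front_slice_assign):
    seconds = timeit(lambda: fn(20000), number=3)
    print("{:20s} {:.3f}s".format(fn.__name__, seconds))
```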
I have an awkward CSV file which has multiple delimiters: the delimiter for the non-numeric part is ',', for the numeric part ';'. I want to construct a dataframe only out of the numeric part as efficiently as possible.
I have made 5 attempts: among them, utilising the converters argument of pd.read_csv, using regex with engine='python', using str.replace. They are all more than 2x slower than reading the entire CSV file with no conversions. This is prohibitively slow for my use case.
I understand the comparison isn't like-for-like, but it does demonstrate the overall poor performance is not driven by I/O. Is there a more efficient way to read in the data into a numeric Pandas dataframe? Or the equivalent NumPy array?
The below string can be used for benchmarking purposes.
# Python 3.7.0, Pandas 0.23.4

from io import StringIO
import numpy as np  # needed for the equality check below
import pandas as pd
import csv

# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6

def csv_reader_1(x):
    df = pd.read_csv(x, usecols=[3], header=None, delimiter=',',
                     converters={3: lambda x: x.split(';')})
    return df.join(pd.DataFrame(df.pop(3).values.tolist(), dtype=float))

def csv_reader_2(x):
    df = pd.read_csv(x, header=None, delimiter=';',
                     converters={0: lambda x: x.rsplit(',')[-1]}, dtype=float)
    return df.astype(float)

def csv_reader_3(x):
    return pd.read_csv(x, usecols=[3, 4, 5], header=None, sep=',|;', engine='python')

def csv_reader_4(x):
    with x as fin:
        reader = csv.reader(fin, delimiter=',')
        L = [i[-1].split(';') for i in reader]
        return pd.DataFrame(L, dtype=float)

def csv_reader_5(x):
    with x as fin:
        return pd.read_csv(StringIO(fin.getvalue().replace(';',',')),
                           sep=',', header=None, usecols=[3, 4, 5])
Checks:
res1 = csv_reader_1(StringIO(x))
res2 = csv_reader_2(StringIO(x))
res3 = csv_reader_3(StringIO(x))
res4 = csv_reader_4(StringIO(x))
res5 = csv_reader_5(StringIO(x))
print(res1.head(3))
# 0 1 2
# 0 34.23 562.45 213.5432
# 1 56.23 63.45 625.2340
# 2 34.23 562.45 213.5432
assert all(np.array_equal(res1.values, i.values) for i in (res2, res3, res4, res5))
Benchmarking results:
%timeit csv_reader_1(StringIO(x)) # 5.31 s per loop
%timeit csv_reader_2(StringIO(x)) # 6.69 s per loop
%timeit csv_reader_3(StringIO(x)) # 18.6 s per loop
%timeit csv_reader_4(StringIO(x)) # 5.68 s per loop
%timeit csv_reader_5(StringIO(x)) # 7.01 s per loop
%timeit pd.read_csv(StringIO(x)) # 1.65 s per loop
Update
I'm open to using command-line tools as a last resort. To that end, I have included such an answer. My hope is there is a pure-Python or Pandas solution with comparable efficiency.
Use a command-line tool
By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read into Pandas. Pandas or pure Python solutions do not come close in terms of efficiency.
Essentially, using CPython or a tool written in C / C++ is likely to outperform Python-level manipulations.
For example, using Find And Replace Text:
import os
os.chdir(r'C:\temp') # change directory location
os.system('fart.exe -c file.csv ";" ","') # run FART with character to replace
df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None) # read file into Pandas
How about using a generator to do the replacement, and combining it with an appropriate decorator to get a file-like object suitable for pandas?
import io
import pandas as pd

# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None

        def readable(self):
            return True

        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0  # indicate EOF

    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def replacementgenerator(haystack, needle, replace):
    # Yields the input one character at a time, as bytes
    for s in haystack:
        if s == needle:
            yield str.encode(replace)
        else:
            yield str.encode(s)

# header=None so that the first data row is not consumed as column names
df = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")),
                 header=None, usecols=[3, 4, 5])
Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.
This approach is functionally identical to the answer by Daniele except for the fact that we replace values "on-the-fly", as they are requested instead of all in one go.
If this is an option, substituting the character ; with , in the string is faster.
I have written the string x to a file test.dat.
def csv_reader_4(x):
    with open(x, 'r') as f:
        a = f.read()
    return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])
The unicode() function was necessary to avoid a TypeError in Python 2.
Benchmarking:
%timeit csv_reader_2('test.dat') # 1.6 s per loop
%timeit csv_reader_4('test.dat') # 1.2 s per loop
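The same "unify the delimiter first, then parse" idea can be sketched without Pandas, using only the standard library. This is an illustration, not a benchmarked replacement: the helper name `numeric_rows` is made up, and it assumes that the text columns never contain a semicolon.

```python
import csv
from io import StringIO

sample = (
    "ABCD,EFGH,IJKL,34.23;562.45;213.5432\n"
    "MNOP,QRST,UVWX,56.23;63.45;625.234\n"
)


def numeric_rows(text, n_text_cols=3):
    """Return the numeric columns of each row as a list of floats."""
    # A single str.replace call unifies the delimiter in one pass over the text.
    unified = text.replace(";", ",")
    reader = csv.reader(StringIO(unified))
    # Skip the leading text columns and convert the remainder to float.
    return [[float(value) for value in row[n_text_cols:]] for row in reader]


rows = numeric_rows(sample)
print(rows)
```

The resulting list of lists can be handed to pd.DataFrame or np.array if a table is needed afterwards.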
A very fast variant, which takes 3.51 s, is to change csv_reader_4 to the version below. It converts the StringIO to a str, replaces ; with , and reads the dataframe with sep=',':
def csv_reader_4(x):
    with x as fin:
        reader = pd.read_csv(StringIO(fin.getvalue().replace(';',',')), sep=',',header=None)
    return reader
The benchmark:
%timeit csv_reader_4(StringIO(x)) # 3.51 s per loop
Python has powerful features for manipulating data, but don't expect high performance from pure Python. When performance is needed, C and C++ are your friends.
Any fast library in Python is written in C/C++. It is quite easy to use C/C++ code from Python; have a look at the swig utility (http://www.swig.org/tutorial.html). You can write a C++ class containing some fast utilities that you then use in your Python code when needed.
In my environment (Ubuntu 16.04, 4GB RAM, Python 3.5.2) the fastest method was the prototypical (see note 1) csv_reader_5 (taken from U9-Forward's answer), which ran less than 25% slower than reading the entire CSV file with no conversions. I improved that approach by implementing a filter/wrapper that replaces the char in the read() method:
class SingleCharReplacingFilter:

    def __init__(self, reader, oldchar, newchar):
        def proxy(obj, attr):
            a = getattr(obj, attr)
            # Only read() gets the replacing wrapper. The original test
            # `attr in ('read')` was a substring check on the string 'read'.
            if attr == 'read':
                def f(*args):
                    return a(*args).replace(oldchar, newchar)
                return f
            else:
                return a

        for a in dir(reader):
            if not a.startswith("_") or a == '__iter__':
                setattr(self, a, proxy(reader, a))

def csv_reader_6(x):
    with x as fin:
        return pd.read_csv(SingleCharReplacingFilter(fin, ";", ","),
                           sep=',', header=None, usecols=[3, 4, 5])
The result is slightly better performance than reading the entire CSV file with no conversions:
In [3]: %timeit pd.read_csv(StringIO(x))
605 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit csv_reader_5(StringIO(x))
733 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit csv_reader_6(StringIO(x))
568 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1 I call it prototypical because it assumes that the input stream is of StringIO type (since it calls .getvalue() on it).
I watched Brandon Rhodes' talk about Cython - "The Day of the EXE Is Upon Us".
Brandon mentions at 09:30 that for a specific short piece of code, skipping interpretation gave 40% speedup, while skipping the allocation and dispatch gave 574% speedup (10:10).
My question is - how is this measured for a specific piece of code? Does one need to manually extract the underlying c commands and then somehow make the runtime run them?
This is a very interesting observation, but how do I recreate the experiment?
Let's take a look at this python function:
def py_fun(i,N,step):
    res=0.0
    while i<N:
        res+=i
        i+=step
    return res
and use ipython-magic to time it:
In [11]: %timeit py_fun(0.0,1.0e5,1.0)
10 loops, best of 3: 25.4 ms per loop
The interpreter runs through the resulting bytecode and interprets it. However, we can cut out the interpreter by cythonizing the very same code:
%load_ext Cython
%%cython
def cy_fun(i,N,step):
    res=0.0
    while i<N:
        res+=i
        i+=step
    return res
This more than halves the run time (25.4 ms down to 10.9 ms):
In [13]: %timeit cy_fun(0.0,1.0e5,1.0)
100 loops, best of 3: 10.9 ms per loop
When we look into the produced C code, we see that the right functions are called directly, without being interpreted (i.e. without going through ceval). Here it is after stripping the boilerplate code:
static PyObject *__pyx_pf_4test_cy_fun(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_i, PyObject *__pyx_v_N, PyObject *__pyx_v_step) {
  ...
  while (1) {
    __pyx_t_1 = PyObject_RichCompare(__pyx_v_i, __pyx_v_N, Py_LT);
    ...
    __pyx_t_2 = __Pyx_PyObject_IsTrue(__pyx_t_1);
    ...
    if (!__pyx_t_2) break;
    ...
    __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_res, __pyx_v_i);
    ...
    __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_i, __pyx_v_step);
  }
  ...
  return __pyx_r;
}
However, this Cython function handles Python objects and not C-style floats, so inside PyNumber_InPlaceAdd it is necessary to figure out what these objects (integer, float, something else?) really are and to dispatch the call to the right function which does the job.
With the help of Cython we can also eliminate the need for this dispatch and directly call the addition for floats:
%%cython
def c_fun(double i,double N, double step):
    cdef double res=0.0
    while i<N:
        res+=i
        i+=step
    return res
In this version, i, N, step and res are C-style doubles and no longer Python objects. So there is no longer any need to call dispatch functions like PyNumber_InPlaceAdd; we can directly use the + operator for double:
static PyObject *__pyx_pf_4test_c_fun(CYTHON_UNUSED PyObject *__pyx_self, double __pyx_v_i, double __pyx_v_N, double __pyx_v_step) {
  ...
  __pyx_v_res = 0.0;
  ...
  while (1) {
    __pyx_t_1 = ((__pyx_v_i < __pyx_v_N) != 0);
    if (!__pyx_t_1) break;
    __pyx_v_res = (__pyx_v_res + __pyx_v_i);
    __pyx_v_i = (__pyx_v_i + __pyx_v_step);
  }
  ...
  return __pyx_r;
}
And the result is:
In [15]: %timeit c_fun(0.0,1.0e5,1.0)
10000 loops, best of 3: 148 µs per loop
Now, this is a speed-up of more than a factor of 70 (10.9 ms vs. 148 µs) compared to the version without the interpreter but with dispatch.
Actually, saying that dispatch+allocation is the bottleneck here (because eliminating it caused such a large speed-up) is a fallacy: the interpreter is responsible for more than 50% of the running time (about 15 ms) and dispatch and allocation "only" for about 10 ms.
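One way to recreate this kind of breakdown is plain subtraction between the three measured variants. The sketch below uses the timings reported above; the variable names are only for illustration:

```python
# Measured run times in milliseconds, taken from the %timeit results above
t_python = 25.4    # py_fun: interpreter + dispatch/allocation + arithmetic
t_cython = 10.9    # cy_fun: dispatch/allocation + arithmetic
t_typed = 0.148    # c_fun:  arithmetic only

interpreter = t_python - t_cython   # cost removed by skipping interpretation
dispatch = t_cython - t_typed       # cost removed by using C doubles

for name, ms in [("interpreter", interpreter),
                 ("dispatch + allocation", dispatch),
                 ("arithmetic", t_typed)]:
    print("{:22s} {:6.2f} ms  {:5.1f}%".format(name, ms, 100.0 * ms / t_python))
```

The percentages are relative to the pure-Python run time, which is why a huge speed-up factor in the last step still corresponds to less than half of the original running time.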
However, the interpreter and dynamic dispatch are not the only performance problems: a float is immutable, so every time its value changes a new object must be created and the old one destroyed, with the associated memory-management bookkeeping.
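The effect of immutability can be observed from plain Python. In this small sketch (the variable names are arbitrary), += on a float rebinds the name to a different object, whereas += on a list modifies the existing object in place:

```python
x = 1.0
old = x
x += 1.0                       # creates a new float and rebinds x to it
float_rebound = x is not old   # old still refers to the original 1.0 object

items = [1.0]
alias = items
items += [2.0]                 # lists are mutable: extended in place
list_in_place = items is alias

print(float_rebound, list_in_place)
```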
We can introduce mutable floats, which are changed in place and don't need registering/unregistering:
%%cython
cdef class MutableFloat:
    cdef double x

    def __cinit__(self, x):
        self.x=x

    def __iadd__(self, MutableFloat other):
        self.x=self.x+other.x
        return self

    def __lt__(MutableFloat self, MutableFloat other):
        return self.x<other.x

    def __gt__(MutableFloat self, MutableFloat other):
        return self.x>other.x

    def __repr__(self):
        return str(self.x)
The timings (I'm now using a different machine, so the timings are a little bit different):
def py_fun(i,N,step,acc):
    while i<N:
        acc+=i
        i+=step
    return acc

%timeit py_fun(1.0, 5e5,1.0,0.0)
30.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit cy_fun(1.0, 5e5,1.0,0.0)
16.9 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1.0),MutableFloat(0.0); py_fun(i,N,step,acc)
23 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1.0),MutableFloat(0.0); cy_fun(i,N,step,acc)
11 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Don't forget to reinitialize i because it is mutable! The results:
          immutable   mutable
py_fun    30ms        23ms
cy_fun    17ms        11ms
So up to 7ms (about 20%) are needed for registering/unregistering floats (I'm not sure that nothing else plays a role) in the version with the interpreter, and more than 33% in the version without the interpreter.
As it looks now:
40% (13/30) of the time is used by interpreter
up to 33% of the time is used for the dynamic dispatch
up to 20% of the time is used for creating/deleting temporary objects
about 1% for the arithmetical operations
Another problem is the locality of the data, which becomes obvious for memory-bandwidth-bound problems: modern caches work well if the data is processed linearly, one consecutive memory address after another. This is true for looping over std::vector<> (or array.array), but not for looping over Python lists, because such a list consists of pointers which can point to any place in memory.
Consider the following python scripts:
#list.py
N=int(1e7)
lst=[0]*int(N)
for i in range(N):
    lst[i]=i
print(sum(lst))
and
#bytearray.py
N=int(1e7)
b=bytearray(8*N)
m=memoryview(b).cast('L') #reinterpret as an array of unsigned longs
for i in range(N):
    m[i]=i
print(sum(m))
Both create 1e7 integers: the first version Python integers, the second the lowly C integers, which are placed contiguously in memory.
The interesting part is, how many cache misses (D) these scripts produce:
valgrind --tool=cachegrind python list.py
...
D1 misses: 33,964,276 ( 27,473,138 rd + 6,491,138 wr)
versus
valgrind --tool=cachegrind python bytearray.py
...
D1 misses: 4,796,626 ( 2,140,357 rd + 2,656,269 wr)
That means about 7 times more cache misses for the Python integers. Part of this is due to the fact that Python integers need more than 8 bytes of memory (probably 32 bytes, i.e. a factor of 4). Some of it may also be due to the fact that they aren't aligned in memory the way the C integers of the bytearray are. I'm not 100% sure about the latter: neighboring integers are created one after another, so chances are high that they are also stored one after another somewhere in memory. Further investigation is needed.
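The per-element memory footprint can be checked with the standard library. This sketch uses array.array with type code "q" (a signed 64-bit C integer) as a stand-in for the memoryview in the script above; the exact size reported for a Python integer depends on the CPython build and on the value, so it is printed rather than assumed:

```python
import sys
from array import array

n = 1000
# Values well above CPython's cache of small integers, so each is its own object
values = list(range(10**6, 10**6 + n))
packed = array("q", values)   # "q": 8-byte C integers stored contiguously

bytes_per_python_int = sys.getsizeof(values[0])   # object header plus digits
bytes_per_packed_int = packed.itemsize            # raw payload only

print("Python int object: {} bytes".format(bytes_per_python_int))
print("packed C integer:  {} bytes".format(bytes_per_packed_int))
```

Note that a list additionally stores one pointer per element, so the list version touches both the pointer array and the separately allocated integer objects.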
I'm doing a simple Monte Carlo simulation exercise, using ipcluster engines of IPython. I've noticed a huge difference in execution time based on how I define my function, and I'm asking the reason for this. Here are the details:
When I define the task as below, it is fast:
def sample(n):
    return (rand(n)**2 + rand(n)**2 <= 1).sum()
When run in parallel:
from IPython.parallel import Client
rc = Client()
v = rc[:]
with v.sync_imports():
    from numpy.random import rand
n = 1000000
timeit -r 1 -n 1 print 4.* sum(v.map_sync(sample, [n]*len(v))) / (n*len(v))
3.141712
1 loops, best of 1: 53.4 ms per loop
But if I change the function to:
def sample(n):
    return sum(rand(n)**2 + rand(n)**2 <= 1)
I get:
3.141232
1 loops, best of 1: 3.81 s per loop
...which is about 71 times slower. What can be the reason for this?
I can't go too in-depth, but the reason it is slower is that sum(<array>) is Python's built-in sum function, which iterates over the array element by element, whereas <numpy array>.sum() uses the numpy sum function, which is substantially faster than the built-in Python version.
I imagine you would get similarly fast results if you replaced sum(<array>) with numpy.sum(<array>).
See the numpy sum docs here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
I am baffled by this
def main():
    for i in xrange(2560000):
        a = [0.0, 0.0, 0.0]

main()
$ time python test.py
real 0m0.793s
Let's now see with numpy:
import numpy

def main():
    for i in xrange(2560000):
        a = numpy.array([0.0, 0.0, 0.0])

main()
$ time python test.py
real 0m39.338s
Holy CPU cycles batman!
Using numpy.zeros(3) improves things, but still not enough IMHO
$ time python test.py
real 0m5.610s
user 0m5.449s
sys 0m0.070s
numpy.version.version = '1.5.1'
If you are wondering if the list creation is skipped for optimization in the first example, it is not:
5 19 LOAD_CONST 2 (0.0)
22 LOAD_CONST 2 (0.0)
25 LOAD_CONST 2 (0.0)
28 BUILD_LIST 3
31 STORE_FAST 1 (a)
Numpy is optimised for large amounts of data. Give it a tiny 3 length array and, unsurprisingly, it performs poorly.
Consider a separate test
import timeit
reps = 100
pythonTest = timeit.Timer('a = [0.] * 1000000')
numpyTest = timeit.Timer('a = numpy.zeros(1000000)', setup='import numpy')
uninitialised = timeit.Timer('a = numpy.empty(1000000)', setup='import numpy')
# empty simply allocates the memory. Thus the initial contents of the array
# is random noise
print 'python list:', pythonTest.timeit(reps), 'seconds'
print 'numpy array:', numpyTest.timeit(reps), 'seconds'
print 'uninitialised array:', uninitialised.timeit(reps), 'seconds'
And the output is
python list: 1.22042918205 seconds
numpy array: 1.05412316322 seconds
uninitialised array: 0.0016028881073 seconds
It would seem that it is the zeroing of the array that is taking all the time for numpy. So unless you need the array to be initialised then try using empty.
Holy CPU cycles batman!, indeed.
But please consider instead something very fundamental about numpy: its sophisticated linear-algebra-based functionality (like random numbers or singular value decomposition). Now, consider these seemingly simple calculations:
In []: A= rand(2560000, 3)
In []: %timeit rand(2560000, 3)
1 loops, best of 3: 296 ms per loop
In []: %timeit u, s, v= svd(A, full_matrices= False)
1 loops, best of 3: 571 ms per loop
and please trust me that this kind of performance will not be beaten significantly by any package currently available.
So, please describe your real problem, and I'll try to figure out a decent numpy-based solution for it.
Update:
Here is some simple code for ray-sphere intersection:
import numpy as np

def mag(X):
    # magnitude
    return (X** 2).sum(0)** .5

def closest(R, c):
    # closest point on ray to center and its distance
    P= np.dot(c.T, R)* R
    return P, mag(P- c)

def intersect(R, P, h, r):
    # intersection of rays and sphere
    return P- (h* (2* r- h))** .5* R

# set up
c, r= np.array([10, 10, 10])[:, None], 2. # center, radius
n= int(5e5) # number of rays; rand needs an integer size
R= np.random.rand(3, n) # some random rays in first octant
R= R/ mag(R) # normalized to unit length

# find rays which will intersect sphere
P, b= closest(R, c)
wi= b<= r

# and for those which will, find the intersection
X= intersect(R[:, wi], P[:, wi], r- b[wi], r)
Apparently we calculated correctly:
In []: allclose(mag(X- c), r)
Out[]: True
And some timings:
In []: %timeit P, b= closest(R, c)
10 loops, best of 3: 93.4 ms per loop
In []: n/ 0.0934
Out[]: 5353319 #=> more than 5 million detections of possible intersections/ s
In []: %timeit X= intersect(R[:, wi], P[:, wi], r- b[wi], r)
10 loops, best of 3: 32.7 ms per loop
In []: X.shape[1]/ 0.0327
Out[]: 874037 #=> almost 1 million actual intersections/ s
These timings were done on a very modest machine. On a modern machine, a significant speed-up can still be expected.
Anyway, this is only a short demonstration of how to code with numpy.
Late answer, but it could be important for other viewers.
This problem has been considered in the kwant project as well.
Indeed, small arrays are not optimized in numpy, and quite frequently small arrays are exactly what you need.
For this reason they created a substitute for small arrays which behaves like and co-exists with numpy arrays (any operation not implemented in the new data type is processed by numpy).
You should look into this project:
https://pypi.python.org/pypi/tinyarray/1.0.5
whose main purpose is to behave nicely for small arrays. Of course, some of the fancier things you can do with numpy are not supported by it. But numerics seems to be your request.
I have made some small tests:
python
I have added numpy import to get the load time correct
import numpy

def main():
    for i in xrange(2560000):
        a = [0.0, 0.0, 0.0]

main()
numpy
import numpy

def main():
    for i in xrange(2560000):
        a = numpy.array([0.0, 0.0, 0.0])

main()
numpy-zero
import numpy

def main():
    for i in xrange(2560000):
        a = numpy.zeros((3,1))

main()
tinyarray
import numpy,tinyarray

def main():
    for i in xrange(2560000):
        a = tinyarray.array([0.0, 0.0, 0.0])

main()
tinyarray-zero
import numpy,tinyarray

def main():
    for i in xrange(2560000):
        a = tinyarray.zeros((3,1))

main()
I ran this:
for f in python numpy numpy_zero tiny tiny_zero ; do
    echo $f
    for i in `seq 5` ; do
        time python ${f}_test.py
    done
done
And got:
python
python ${f}_test.py 0.31s user 0.02s system 99% cpu 0.339 total
python ${f}_test.py 0.29s user 0.03s system 98% cpu 0.328 total
python ${f}_test.py 0.33s user 0.01s system 98% cpu 0.345 total
python ${f}_test.py 0.31s user 0.01s system 98% cpu 0.325 total
python ${f}_test.py 0.32s user 0.00s system 98% cpu 0.326 total
numpy
python ${f}_test.py 2.79s user 0.01s system 99% cpu 2.812 total
python ${f}_test.py 2.80s user 0.02s system 99% cpu 2.832 total
python ${f}_test.py 3.01s user 0.02s system 99% cpu 3.033 total
python ${f}_test.py 2.99s user 0.01s system 99% cpu 3.012 total
python ${f}_test.py 3.20s user 0.01s system 99% cpu 3.221 total
numpy_zero
python ${f}_test.py 1.04s user 0.02s system 99% cpu 1.075 total
python ${f}_test.py 1.08s user 0.02s system 99% cpu 1.106 total
python ${f}_test.py 1.04s user 0.02s system 99% cpu 1.065 total
python ${f}_test.py 1.03s user 0.02s system 99% cpu 1.059 total
python ${f}_test.py 1.05s user 0.01s system 99% cpu 1.064 total
tiny
python ${f}_test.py 0.93s user 0.02s system 99% cpu 0.955 total
python ${f}_test.py 0.98s user 0.01s system 99% cpu 0.993 total
python ${f}_test.py 0.93s user 0.02s system 99% cpu 0.953 total
python ${f}_test.py 0.92s user 0.02s system 99% cpu 0.944 total
python ${f}_test.py 0.96s user 0.01s system 99% cpu 0.978 total
tiny_zero
python ${f}_test.py 0.71s user 0.03s system 99% cpu 0.739 total
python ${f}_test.py 0.68s user 0.02s system 99% cpu 0.711 total
python ${f}_test.py 0.70s user 0.01s system 99% cpu 0.721 total
python ${f}_test.py 0.70s user 0.02s system 99% cpu 0.721 total
python ${f}_test.py 0.67s user 0.01s system 99% cpu 0.687 total
Now, these tests are (as already pointed out) not the best tests. However, they still show that tinyarray is better suited for small arrays.
Another fact is that the most common operations should be faster with tinyarray, so its benefits may go beyond just data creation.
I have never tried it in a fully fledged project, but the kwant project is using it.
Of course numpy consumes more time in this case, since a = np.array([0.0, 0.0, 0.0]) is roughly equivalent to a = [0.0, 0.0, 0.0]; a = np.array(a), i.e. it takes two steps. But numpy arrays have many good qualities; their high speed shows in the operations on them, not in their creation. Just my personal thoughts :).