How to store the result from %%timeit cell magic? - python

I can't figure out how to store the result from the %%timeit cell magic. I've read:
Can you capture the output of ipython's magic methods?
Capture the result of an IPython magic function
but the answers to these questions only cover line magic. In line mode (%) this works:
In[1]: res = %timeit -o np.linalg.inv(A)
But in cell mode (%%) it does not:
In[2]: res = %%timeit -o
A = np.mat('1 2 3; 7 4 9; 5 6 1')
np.linalg.inv(A)
It simply executes the cell without applying the magic. Is this a bug, or am I doing something wrong?

You can use the _ variable (stores the last result) after the %%timeit -o cell and assign it to some reusable variable:
In[2]: %%timeit -o
A = np.mat('1 2 3; 7 4 9; 5 6 1')
np.linalg.inv(A)
Out[2]: blabla
<TimeitResult : 1 loop, best of 3: 588 µs per loop>
In[3]: res = _
In[4]: res
Out[4]: <TimeitResult : 1 loop, best of 3: 588 µs per loop>
I don't think it's a bug because cell mode commands must be the first command in that cell so you can't put anything (not even res = ...) in front of that command.
However you still need the -o because otherwise the _ variable contains None.

If you just care about the output of the cell magic, e.g. for recording purposes - and you don't need the extra metadata included in the TimeitResult object, you could also just combine it with %%capture:
%%capture result
%%timeit
A = np.mat('1 2 3; 7 4 9; 5 6 1')
np.linalg.inv(A)
Then you can grab the output from result.stdout, which will yield whatever the output of the cell is - including the timing result.
print(result.stdout)
'26.4 us +- 329 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)\n'
This works for arbitrary cell magic, and can work as a fallback if the underscore solution isn't working.
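Since result.stdout is plain text, the number has to be parsed back out if it is needed for further computation. A minimal sketch, assuming the output format shown above; parse_timeit_line and the UNITS table are hypothetical helpers invented for this illustration, not part of IPython:

```python
import re

# Hypothetical helper table, not part of IPython: unit suffix -> seconds.
UNITS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "μs": 1e-6, "ns": 1e-9}

def parse_timeit_line(text):
    """Return the leading '<number> <unit>' of a %%timeit line in seconds."""
    match = re.match(r"\s*([\d.]+)\s*(\S+)", text)
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"unrecognised timing output: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]

captured = "26.4 us +- 329 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)\n"
seconds = parse_timeit_line(captured)
```

Only the mean is extracted here; anything the sketch does not recognise (for example a "1min 1s" style line) raises ValueError rather than being guessed at.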

Related

iPython timeit - only time part of the operation

I was attempting to determine, via iPython's %%timeit mechanism, whether set.remove is faster than list.remove when a conundrum came up.
I could do
In [1]: %%timeit
a_list = list(range(100))
a_list.remove(50)
and then do the same thing but with a set. However, this would include the overhead from the list/set construction. Is there a way to re-build the list/set each iteration but only time the remove method?
Put your setup code on the same line to create any names or precursor operations you need!
https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-timeit
In cell mode, the statement in the first line is used as setup code (executed but not timed) and the body of the cell is timed. The cell body has access to any variables created in the setup code.
%%timeit setup_code
...
Unfortunately, only a single loop per run can be timed (-n1), as the setup code is not re-run between loops:
%%timeit -n1 x = list(range(100))
x.remove(50)
Surprisingly, this doesn't accept a string like the timeit module does. Combined with the single-run requirement, I'd still defer to timeit with a string setup= and repeat it if lots of setup or statistically higher precision is needed.
See @Kelly Bundy's much more precise answer for more!
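That the timeit module runs its setup once per repetition (and not once per loop) can be checked directly. A small sketch; CountingFactory and make_data are names made up for this illustration:

```python
from timeit import repeat

class CountingFactory:
    """Builds a fresh list and counts how often it was asked to."""
    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        return list(range(100))

make_data = CountingFactory()

# number=1: one timed statement per repetition, so every remove()
# sees a freshly built list that still contains 50.
timings = repeat("data.remove(50)", setup="data = make_data()",
                 number=1, repeat=5, globals={"make_data": make_data})

print(make_data.calls)  # 5
```

With number greater than 1 the second remove(50) would fail with ValueError, because the setup is not repeated between loops of the same repetition.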
Alternatively, using the timeit module with more repetitions and some statistics:
list: 814 ns ± 3.7 ns
set: 152 ns ± 1.6 ns
list: 815 ns ± 4.3 ns
set: 154 ns ± 1.6 ns
list: 817 ns ± 4.3 ns
set: 153 ns ± 1.6 ns
Code (Try it online!):
from timeit import repeat
from statistics import mean, stdev
for _ in range(3):
    for kind in 'list', 'set':
        # setup rebuilds the container before each single timed remove
        ts = repeat('data.remove(50)', f'data = {kind}(range(100))', number=1, repeat=10**5)
        # keep the 1000 fastest runs, converted to nanoseconds
        ts = [t * 1e9 for t in sorted(ts)[:1000]]
        print('%4s: %3d ns ± %.1f ns' % (kind, mean(ts), stdev(ts)))

Why is a.insert(0,0) much slower than a[0:0]=[0]?

Using a list's insert function is much slower than achieving the same effect using slice assignment:
> python -m timeit -n 100000 -s "a=[]" "a.insert(0,0)"
100000 loops, best of 5: 19.2 usec per loop
> python -m timeit -n 100000 -s "a=[]" "a[0:0]=[0]"
100000 loops, best of 5: 6.78 usec per loop
(Note that a=[] is only the setup, so a starts empty but then grows to 100,000 elements.)
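The growth is easy to confirm by timing a callable instead of a string, so that the list stays reachable afterwards. A small sketch (the name a mirrors the command above; the smaller loop count is only there to keep it quick):

```python
from functools import partial
from timeit import Timer

a = []  # plays the role of the -s "a=[]" setup: created once, never reset

# Timer also accepts a zero-argument callable as the statement.
elapsed = Timer(partial(a.insert, 0, 0)).timeit(number=10_000)

# Every timed call inserted into the same list, so it has grown
# by one element per loop iteration.
print(len(a))  # 10000
```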
At first I thought maybe it's the attribute lookup or function call overhead or so, but inserting near the end shows that that's negligible:
> python -m timeit -n 100000 -s "a=[]" "a.insert(-1,0)"
100000 loops, best of 5: 79.1 nsec per loop
Why is the presumably simpler dedicated "insert single element" function so much slower?
I can also reproduce it at repl.it:
from timeit import repeat
for _ in range(3):
    for stmt in 'a.insert(0,0)', 'a[0:0]=[0]', 'a.insert(-1,0)':
        t = min(repeat(stmt, 'a=[]', number=10**5))
        print('%.6f' % t, stmt)
    print()
# Example output:
#
# 4.803514 a.insert(0,0)
# 1.807832 a[0:0]=[0]
# 0.012533 a.insert(-1,0)
#
# 4.967313 a.insert(0,0)
# 1.821665 a[0:0]=[0]
# 0.012738 a.insert(-1,0)
#
# 5.694100 a.insert(0,0)
# 1.899940 a[0:0]=[0]
# 0.012664 a.insert(-1,0)
I use Python 3.8.1 32-bit on Windows 10 64-bit.
repl.it uses Python 3.8.1 64-bit on Linux 64-bit.
I think it's probably just that they forgot to use memmove in list.insert. If you take a look at the code list.insert uses to shift elements, you can see it's just a manual loop:
for (i = n; --i >= where; )
    items[i+1] = items[i];
while list.__setitem__ on the slice assignment path uses memmove:
memmove(&item[ihigh+d], &item[ihigh],
    (k - ihigh)*sizeof(PyObject *));
memmove typically has a lot of optimization put into it, such as taking advantage of SSE/AVX instructions.

Efficient double for loop over large matrices

I have the following code, which I need to run more than once. Currently, it takes too long. Is there an efficient way to write these two for loops?
ErrorEst=[]
for i in range(len(embedingFea)):#17000
    temp=[]
    for j in range(len(emedingEnt)):#15000
        if cooccurrenceCount[i][j]>0:
            #print(coaccuranceCount[i][j]/ count_max)
            weighting_factor = np.min(
                [1.0,
                 math.pow(np.float32(cooccurrenceCount[i][j]/ count_max), scaling_factor)])
            embedding_product = (np.multiply(emedingEnt[j], embedingFea[i]), 1)
            #tf.log(tf.to_float(self.__cooccurrence_count))
            log_cooccurrences =np.log (np.float32(cooccurrenceCount[i][j]))
            distance_expr = np.square(([
                embedding_product+
                focal_bias[i],
                context_bias[j],
                -(log_cooccurrences)]))
            single_losses =(weighting_factor* distance_expr)
            temp.append(single_losses)
    ErrorEst.append(np.sum(temp))
You can use Numba or Cython.
First, make sure to avoid lists wherever possible and write simple, readable code with explicit loops, as you would in C. All inputs and outputs should be NumPy arrays or scalars.
Your Code
import numpy as np
import numba as nb
import math
def your_func(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias):
    ErrorEst=[]
    for i in range(len(embedingFea)):#17000
        temp=[]
        for j in range(len(emedingEnt)):#15000
            if cooccurrenceCount[i][j]>0:
                weighting_factor = np.min([1.0,math.pow(np.float32(cooccurrenceCount[i][j]/ count_max), scaling_factor)])
                embedding_product = (np.multiply(emedingEnt[j], embedingFea[i]), 1)
                log_cooccurrences =np.log (np.float32(cooccurrenceCount[i][j]))
                distance_expr = np.square(([embedding_product+focal_bias[i],context_bias[j],-(log_cooccurrences)]))
                single_losses =(weighting_factor* distance_expr)
                temp.append(single_losses)
        ErrorEst.append(np.sum(temp))
    return ErrorEst
Numba Code
@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def your_func_2(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias):
    ErrorEst=np.empty((embedingFea.shape[0],2))
    # prange lets Numba run the outer loop in parallel
    for i in nb.prange(embedingFea.shape[0]):
        # scalar accumulators replace the temporary lists
        temp_1=0.
        temp_2=0.
        for j in range(emedingEnt.shape[0]):
            if cooccurrenceCount[i,j]>0:
                weighting_factor=(cooccurrenceCount[i,j]/ count_max)**scaling_factor
                if weighting_factor>1.:
                    weighting_factor=1.
                embedding_product = emedingEnt[j]*embedingFea[i]
                log_cooccurrences =np.log(cooccurrenceCount[i,j])
                temp_1+=weighting_factor*(embedding_product+focal_bias[i])**2
                temp_1+=weighting_factor*(context_bias[j])**2
                temp_1+=weighting_factor*(log_cooccurrences)**2
                temp_2+=weighting_factor*(1.+focal_bias[i])**2
                temp_2+=weighting_factor*(context_bias[j])**2
                temp_2+=weighting_factor*(log_cooccurrences)**2
        ErrorEst[i,0]=temp_1
        ErrorEst[i,1]=temp_2
    return ErrorEst
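The restructuring in your_func_2 (two scalar accumulators per row instead of temporary lists) can be sanity-checked on a tiny input without Numba. The following pure-Python mirror is only a sketch for verifying the arithmetic by hand; row_losses is a name chosen here, it works on plain lists, and it is not meant to be fast:

```python
from math import log

def row_losses(fea, ent, counts, count_max, scaling_factor, focal_bias, context_bias):
    """Pure-Python mirror of your_func_2: two running sums per row."""
    out = []
    for i in range(len(fea)):
        temp_1 = 0.0
        temp_2 = 0.0
        for j in range(len(ent)):
            c = counts[i][j]
            if c > 0:
                # weighting factor, capped at 1.0
                w = min(1.0, (c / count_max) ** scaling_factor)
                product = ent[j] * fea[i]
                # terms shared by both accumulators
                shared = context_bias[j] ** 2 + log(c) ** 2
                temp_1 += w * ((product + focal_bias[i]) ** 2 + shared)
                temp_2 += w * ((1.0 + focal_bias[i]) ** 2 + shared)
        out.append((temp_1, temp_2))
    return out

# One row, one column: w = 1, log(1) = 0, product = 6
print(row_losses([2.0], [3.0], [[1.0]], 1.0, 1.0, [1.0], [2.0]))  # [(53.0, 8.0)]
```

Here 53 = (6 + 1)² + 2² + 0² and 8 = (1 + 1)² + 2² + 0², which matches what the Numba loop accumulates for the same entry.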
Timings
embedingFea=np.random.rand(1700)+1
emedingEnt=np.random.rand(1500)+1
cooccurrenceCount=np.random.rand(1700,1500)+1
focal_bias=np.random.rand(1700)
context_bias=np.random.rand(1500)
count_max=100
scaling_factor=2.5
%timeit res_1=your_func(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias)
1min 1s ± 346 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=your_func_2(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias)
17.6 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you need to increase the performance of your code, you should write it in a low-level language like C and try to avoid the use of floating-point numbers.
Possible solution: Can we use C code in Python?
You could try using numba and wrapping your code with the @jit decorator. Usually the first execution needs to compile some stuff, and will thus not see much speedup, but subsequent calls will be much faster.
You may need to put your loop in a function for this to work.
from numba import jit
@jit(nopython=True)
def my_double_loop(some, arguments):
    for i in range(len(embedingFea)):#17000
        temp=[]
        for j in range(len(emedingEnt)):#15000
            # ...

Reading data from CSV into dataframe with multiple delimiters efficiently

I have an awkward CSV file which has multiple delimiters: the delimiter for the non-numeric part is ',', for the numeric part ';'. I want to construct a dataframe only out of the numeric part as efficiently as possible.
I have made 5 attempts: among them, utilising the converters argument of pd.read_csv, using regex with engine='python', using str.replace. They are all more than 2x slower than reading the entire CSV file with no conversions. This is prohibitively slow for my use case.
I understand the comparison isn't like-for-like, but it does demonstrate the overall poor performance is not driven by I/O. Is there a more efficient way to read in the data into a numeric Pandas dataframe? Or the equivalent NumPy array?
The below string can be used for benchmarking purposes.
# Python 3.7.0, Pandas 0.23.4
from io import StringIO
import numpy as np  # needed for the equality check below
import pandas as pd
import csv
# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6
def csv_reader_1(x):
    df = pd.read_csv(x, usecols=[3], header=None, delimiter=',',
                     converters={3: lambda x: x.split(';')})
    return df.join(pd.DataFrame(df.pop(3).values.tolist(), dtype=float))
def csv_reader_2(x):
    df = pd.read_csv(x, header=None, delimiter=';',
                     converters={0: lambda x: x.rsplit(',')[-1]}, dtype=float)
    return df.astype(float)
def csv_reader_3(x):
    return pd.read_csv(x, usecols=[3, 4, 5], header=None, sep=',|;', engine='python')
def csv_reader_4(x):
    with x as fin:
        reader = csv.reader(fin, delimiter=',')
        L = [i[-1].split(';') for i in reader]
        return pd.DataFrame(L, dtype=float)
def csv_reader_5(x):
    with x as fin:
        return pd.read_csv(StringIO(fin.getvalue().replace(';',',')),
                           sep=',', header=None, usecols=[3, 4, 5])
Checks:
res1 = csv_reader_1(StringIO(x))
res2 = csv_reader_2(StringIO(x))
res3 = csv_reader_3(StringIO(x))
res4 = csv_reader_4(StringIO(x))
res5 = csv_reader_5(StringIO(x))
print(res1.head(3))
#        0       1         2
# 0  34.23  562.45  213.5432
# 1  56.23   63.45  625.2340
# 2  34.23  562.45  213.5432
assert all(np.array_equal(res1.values, i.values) for i in (res2, res3, res4, res5))
Benchmarking results:
%timeit csv_reader_1(StringIO(x)) # 5.31 s per loop
%timeit csv_reader_2(StringIO(x)) # 6.69 s per loop
%timeit csv_reader_3(StringIO(x)) # 18.6 s per loop
%timeit csv_reader_4(StringIO(x)) # 5.68 s per loop
%timeit csv_reader_5(StringIO(x)) # 7.01 s per loop
%timeit pd.read_csv(StringIO(x)) # 1.65 s per loop
Update
I'm open to using command-line tools as a last resort. To that extent, I have included such an answer. My hope is there is a pure-Python or Pandas solution with comparable efficiency.
Use a command-line tool
By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read into Pandas. Pandas or pure Python solutions do not come close in terms of efficiency.
Essentially, using CPython or a tool written in C / C++ is likely to outperform Python-level manipulations.
For example, using Find And Replace Text:
import os
os.chdir(r'C:\temp') # change directory location
os.system('fart.exe -c file.csv ";" ","') # run FART with character to replace
df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None) # read file into Pandas
How about using a generator to do the replacement, and combining it with an appropriate decorator to get a file-like object suitable for pandas?
import io
import pandas as pd
# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6
def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.
    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).
    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0  # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)
def replacementgenerator(haystack, needle, replace):
    for s in haystack:
        if s == needle:
            yield str.encode(replace)
        else:
            yield str.encode(s)
csv = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")), usecols=[3, 4, 5])
Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.
This approach is functionally identical to the answer by Daniele except for the fact that we replace values "on-the-fly", as they are requested instead of all in one go.
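The same on-the-fly idea can also be applied one line at a time rather than one character at a time. A sketch using only the standard library's csv module (replaced_lines is a name invented here; this is not the pandas-based approach of the answer, and no timing claim is made for it):

```python
import csv
from io import StringIO

def replaced_lines(lines, old=";", new=","):
    """Yield each input line with `old` swapped for `new`, lazily."""
    for line in lines:
        yield line.replace(old, new)

sample = ("ABCD,EFGH,IJKL,34.23;562.45;213.5432\n"
          "MNOP,QRST,UVWX,56.23;63.45;625.234\n")

# csv.reader accepts any iterable of strings, so the generator can be
# passed directly; only the three numeric columns are kept.
rows = [[float(v) for v in row[3:6]]
        for row in csv.reader(replaced_lines(StringIO(sample)))]
print(rows)  # [[34.23, 562.45, 213.5432], [56.23, 63.45, 625.234]]
```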
If this is an option, substituting the character ; with , in the string is faster.
I have written the string x to a file test.dat.
def csv_reader_4(x):
    with open(x, 'r') as f:
        a = f.read()
    return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])
The unicode() function was necessary to avoid a TypeError in Python 2.
Benchmarking:
%timeit csv_reader_2('test.dat') # 1.6 s per loop
%timeit csv_reader_4('test.dat') # 1.2 s per loop
A very fast one, at 3.51 s per loop: simply change csv_reader_4 to the version below. It converts the StringIO to a str, replaces ; with ,, and reads the dataframe with sep=',':
def csv_reader_4(x):
    with x as fin:
        reader = pd.read_csv(StringIO(fin.getvalue().replace(';',',')), sep=',',header=None)
    return reader
The benchmark:
%timeit csv_reader_4(StringIO(x)) # 3.51 s per loop
Python has powerful features for manipulating data, but don't expect performance from Python. When performance is needed, C and C++ are your friends.
Any fast library in Python is written in C/C++. It is quite easy to use C/C++ code in Python; have a look at the swig utility (http://www.swig.org/tutorial.html). You can write a C++ class containing some fast utilities that you use in your Python code when needed.
In my environment (Ubuntu 16.04, 4GB RAM, Python 3.5.2) the fastest method was the prototypical¹ csv_reader_5 (taken from U9-Forward's answer), which ran less than 25% slower than reading the entire CSV file with no conversions. I improved that approach by implementing a filter/wrapper that replaces the character in the read() method:
class SingleCharReplacingFilter:
    def __init__(self, reader, oldchar, newchar):
        def proxy(obj, attr):
            a = getattr(obj, attr)
            # one-element tuple: ('read') would be a plain string and match substrings
            if attr in ('read',):
                def f(*args):
                    return a(*args).replace(oldchar, newchar)
                return f
            else:
                return a
        # forward every public attribute (and __iter__) to the wrapped reader
        for a in dir(reader):
            if not a.startswith("_") or a == '__iter__':
                setattr(self, a, proxy(reader, a))
def csv_reader_6(x):
    with x as fin:
        return pd.read_csv(SingleCharReplacingFilter(fin, ";", ","),
                           sep=',', header=None, usecols=[3, 4, 5])
The result is slightly better performance than reading the entire CSV file with no conversions:
In [3]: %timeit pd.read_csv(StringIO(x))
605 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit csv_reader_5(StringIO(x))
733 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit csv_reader_6(StringIO(x))
568 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
¹ I call it prototypical because it assumes that the input stream is of StringIO type (since it calls .getvalue() on it).

Interpretation vs dynamic dispatch penalty in Python

I watched Brandon Rhodes' talk about Cython - "The Day of the EXE Is Upon Us".
Brandon mentions at 09:30 that for a specific short piece of code, skipping interpretation gave 40% speedup, while skipping the allocation and dispatch gave 574% speedup (10:10).
My question is - how is this measured for a specific piece of code? Does one need to manually extract the underlying c commands and then somehow make the runtime run them?
This is a very interesting observation, but how do I recreate the experiment?
Let's take a look at this python function:
def py_fun(i,N,step):
    res=0.0
    while i<N:
        res+=i
        i+=step
    return res
and use ipython-magic to time it:
In [11]: %timeit py_fun(0.0,1.0e5,1.0)
10 loops, best of 3: 25.4 ms per loop
The interpreter runs through the resulting bytecode and interprets it. However, we can cut out the interpreter by cythonizing the very same code:
%load_ext Cython
%%cython
def cy_fun(i,N,step):
    res=0.0
    while i<N:
        res+=i
        i+=step
    return res
This cuts the running time by more than 50%:
In [13]: %timeit cy_fun(0.0,1.0e5,1.0)
100 loops, best of 3: 10.9 ms per loop
When we look into the produced c-code, we see that the right functions are called directly without the need of being interpreted/calling ceval, here after stripping down the boilerplate code:
static PyObject *__pyx_pf_4test_cy_fun(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_i, PyObject *__pyx_v_N, PyObject *__pyx_v_step) {
    ...
    while (1) {
        __pyx_t_1 = PyObject_RichCompare(__pyx_v_i, __pyx_v_N, Py_LT);
        ...
        __pyx_t_2 = __Pyx_PyObject_IsTrue(__pyx_t_1);
        ...
        if (!__pyx_t_2) break;
        ...
        __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_res, __pyx_v_i);
        ...
        __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_i, __pyx_v_step);
    }
    ...
    return __pyx_r;
}
However, this Cython function handles Python objects and not C-style floats, so the function PyNumber_InPlaceAdd has to figure out what these objects really are (integer, float, something else?) and dispatch the call to the right function to do the job.
With the help of Cython we can also eliminate the need for this dispatch and directly call the addition for floats:
%%cython
def c_fun(double i,double N, double step):
    cdef double res=0.0
    while i<N:
        res+=i
        i+=step
    return res
In this version, i, N, step and res are c-style doubles and no longer python objects. So there is no longer need to call dispatch-functions like PyNumber_InPlaceAdd but we can directly call +-operator for double:
static PyObject *__pyx_pf_4test_c_fun(CYTHON_UNUSED PyObject *__pyx_self, double __pyx_v_i, double __pyx_v_N, double __pyx_v_step) {
    ...
    __pyx_v_res = 0.0;
    ...
    while (1) {
        __pyx_t_1 = ((__pyx_v_i < __pyx_v_N) != 0);
        if (!__pyx_t_1) break;
        __pyx_v_res = (__pyx_v_res + __pyx_v_i);
        __pyx_v_i = (__pyx_v_i + __pyx_v_step);
    }
    ...
    return __pyx_r;
}
And the result is:
In [15]: %timeit c_fun(0.0,1.0e5,1.0)
10000 loops, best of 3: 148 µs per loop
Now, this is a speed-up by a factor of more than 70 (10.9 ms vs. 148 µs) compared to the version without interpreter but with dispatch.
Actually, to say that dispatch+allocation is the bottleneck here (because eliminating it caused such a large speed-up) is a fallacy: the interpreter is responsible for more than 50% of the running time (15 ms) and dispatch and allocation "only" for 10 ms.
However, there are more performance problems than the "interpreter" and dynamic dispatch: a float is immutable, so every time it changes, a new object must be created and registered/unregistered in the garbage collector.
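The immutability is visible from plain Python. A minimal sketch (variable names are arbitrary) showing that += on a float rebinds the name to a new object, whereas += on a mutable object such as a list changes it in place:

```python
x = 1.5
before = x
x += 1.0          # builds a new float object and rebinds x

items = [1.5]
same_list = items
items += [1.0]    # lists are mutable: extended in place

print(before, x, x is before)   # 1.5 2.5 False
print(items is same_list)       # True
```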
We can introduce mutable floats, which are changed in place and don't need registering/unregistering:
%%cython
cdef class MutableFloat:
    cdef double x
    def __cinit__(self, x):
        self.x=x
    def __iadd__(self, MutableFloat other):
        self.x=self.x+other.x
        return self
    def __lt__(MutableFloat self, MutableFloat other):
        return self.x<other.x
    def __gt__(MutableFloat self, MutableFloat other):
        return self.x>other.x
    def __repr__(self):
        return str(self.x)
The timings (I am now using a different machine, so the timings are a little different; cy_fun is the cythonized version of the same four-argument function):
def py_fun(i,N,step,acc):
    while i<N:
        acc+=i
        i+=step
    return acc
%timeit py_fun(1.0, 5e5,1.0,0.0)
30.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit cy_fun(1.0, 5e5,1.0,0.0)
16.9 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1.0),MutableFloat(0.0); py_fun(i,N,step,acc)
23 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1.0),MutableFloat(0.0); cy_fun(i,N,step,acc)
11 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Don't forget to reinitialize i because it is mutable! The results:
          immutable   mutable
py_fun    30ms        23ms
cy_fun    17ms        11ms
So up to 7ms (about 20%) are needed for registering/unregistering floats in the version with the interpreter (I'm not sure whether something else is also playing a role), and more than 33% in the version without the interpreter.
As it looks now:
40% (13/30) of the time is used by interpreter
up to 33% of the time is used for the dynamic dispatch
up to 20% of the time is used for creating/deleting temporary objects
about 1% for the arithmetical operations
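The shares that follow directly from the rounded timings in the table can be recomputed in a few lines. This is only the arithmetic behind those numbers (variable names are chosen here); it does not reproduce the dispatch or arithmetic estimates, which rely on other measurements:

```python
# Rounded timings in ms, taken from the table above.
py_immutable, py_mutable = 30.0, 23.0
cy_immutable, cy_mutable = 17.0, 11.0

# Time saved by removing the interpreter, relative to the pure-Python run.
interpreter_share = (py_immutable - cy_immutable) / py_immutable
# Time saved by avoiding temporary floats, with and without the interpreter.
temporaries_share_py = (py_immutable - py_mutable) / py_immutable
temporaries_share_cy = (cy_immutable - cy_mutable) / cy_immutable

print(f"{interpreter_share:.0%}")     # 43%
print(f"{temporaries_share_py:.0%}")  # 23%
print(f"{temporaries_share_cy:.0%}")  # 35%
```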
Another problem is data locality, which becomes obvious for memory-bandwidth-bound problems: modern caches work well if data is processed linearly, one consecutive memory address after another. This is true for looping over std::vector<> (or array.array), but not for looping over Python lists, because such a list consists of pointers which can point to any place in memory.
Consider the following python scripts:
#list.py
N=int(1e7)
lst=[0]*int(N)
for i in range(N):
    lst[i]=i
print(sum(lst))
and
#bytearray.py
N=int(1e7)
b=bytearray(8*N)
m=memoryview(b).cast('L') #reinterpret as an array of unsigned longs
for i in range(N):
    m[i]=i
print(sum(m))
They both create 1e7 integers: the first version Python integers, and the second lowly C integers which are placed contiguously in memory.
The interesting part is, how many cache misses (D) these scripts produce:
valgrind --tool=cachegrind python list.py
...
D1 misses: 33,964,276 ( 27,473,138 rd + 6,491,138 wr)
versus
valgrind --tool=cachegrind python bytearray.py
...
D1 misses: 4,796,626 ( 2,140,357 rd + 2,656,269 wr)
That means about 7 times more cache misses for the Python integers. Part of it is due to the fact that Python integers need more than 8 bytes of memory (probably 32 bytes, i.e. a factor of 4), and some may be due to the fact that they aren't arranged contiguously in memory, as is the case for the C integers of the bytearray. (I am not 100% sure about the latter: neighboring integers are created one after another, so chances are high that they are stored next to each other somewhere in memory; further investigation is needed.)
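The per-element memory difference can be inspected from Python itself. A rough sketch, assuming CPython (sys.getsizeof reports implementation-specific sizes, so the exact number for an int object varies by build and platform):

```python
import sys
from array import array

n = 1000
as_list = list(range(1000, 1000 + n))   # n separate int objects plus n pointers
as_array = array("q", as_list)          # n signed 64-bit values in one buffer

per_int_object = sys.getsizeof(as_list[0])   # CPython detail, platform dependent
per_array_item = as_array.itemsize           # 8 bytes for typecode "q"
print(per_int_object, per_array_item)
```

On top of the object size, the list also stores one pointer per element, which is what makes the access pattern indirect.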
