Cython: size attribute of memoryviews

I'm using a lot of 3D memoryviews in Cython, e.g.
cython.declare(a='double[:, :, ::1]')
a = np.empty((10, 20, 30), dtype='double')
I often want to loop over all elements of a. I can do this using a triple loop like
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        for k in range(a.shape[2]):
            a[i, j, k] = ...
If I do not care about the indices i, j and k, it is more efficient to do a flat loop, like
cython.declare(a_ptr='double*')
a_ptr = cython.address(a[0, 0, 0])
for i in range(size):
    a_ptr[i] = ...
Here I need to know the number of elements (size) in the array. This is given by the product of the elements in the shape attribute, i.e. size = a.shape[0]*a.shape[1]*a.shape[2], or more generally size = np.prod(np.asarray(a).shape). I find both of these ugly to write, and the (albeit small) computational overhead bothers me. The nice way to do it is to use the builtin size attribute of memoryviews, size = a.size. However, for reasons I cannot fathom, this leads to unoptimized C code, as evident from the annotations html file generated by Cython. Specifically, the C code generated by size = a.shape[0]*a.shape[1]*a.shape[2] is simply
__pyx_v_size = (((__pyx_v_a.shape[0]) * (__pyx_v_a.shape[1])) * (__pyx_v_a.shape[2]));
whereas the C code generated from size = a.size is
__pyx_t_10 = __pyx_memoryview_fromslice(__pyx_v_a, 3, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_10)) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_10);
__pyx_t_14 = __Pyx_PyObject_GetAttrStr(__pyx_t_10, __pyx_n_s_size); if (unlikely(!__pyx_t_14)) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_14);
__Pyx_DECREF(__pyx_t_10); __pyx_t_10 = 0;
__pyx_t_7 = __Pyx_PyIndex_AsSsize_t(__pyx_t_14); if (unlikely((__pyx_t_7 == (Py_ssize_t)-1) && PyErr_Occurred())) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_DECREF(__pyx_t_14); __pyx_t_14 = 0;
__pyx_v_size = __pyx_t_7;
To generate the above code, I have enabled all possible optimizations through compiler directives, meaning that the unwieldy C code generated by a.size cannot be optimized away. It looks to me as though the size "attribute" is not really a pre-computed attribute, but actually carries out a computation upon lookup. Furthermore, this computation is quite a bit more involved than simply taking the product over the shape attribute. I cannot find any hint of an explanation in the docs.
What is the explanation of this behavior, and do I have a better choice than writing out a.shape[0]*a.shape[1]*a.shape[2], if I really care about this micro optimization?

By looking at the produced C code, you can already see that size is a property and not a simple C member. Here is the original Cython code for memoryviews:
@cname('__pyx_memoryview')
cdef class memoryview(object):
    ...
    cdef object _size
    ...
    @property
    def size(self):
        if self._size is None:
            result = 1
            for length in self.view.shape[:self.view.ndim]:
                result *= length
            self._size = result
        return self._size
It is easy to see that the product is calculated only once and then cached. Clearly, this doesn't play a big role for 3-dimensional arrays, but for a higher number of dimensions caching could become pretty important (as we will see, there are at most 8 dimensions, so it is not that clear-cut whether this caching is really worth it).
One can understand the decision to calculate the size lazily - after all, size is not always needed/used and one doesn't want to pay for it. Clearly, there is a price to pay for this laziness if you use the size a lot - that is the trade-off Cython makes.
I would not dwell too long on the overhead of calling a.size - it is nothing compared to the overhead of calling a Cython function from Python.
For example, the measurements of @danny measure only this Python-call overhead and not the actual performance of the different approaches. To show this, I throw a third function into the mix:
%%cython
...
def both():
    a.size + a.shape[0]*a.shape[1]*a.shape[2]
which does twice the amount of work, but
>>> %timeit mv_size
22.5 ns ± 0.0864 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit mv_product
20.7 ns ± 0.087 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit both
21 ns ± 0.39 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
is just as fast. On the other hand:
%%cython
...
def nothing():
    pass
isn't faster:
%timeit nothing
24.3 ns ± 0.854 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In a nutshell: I would use a.size because of its readability, assuming that optimizing it would not speed up my application - unless profiling proves otherwise.
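If the lookup really does show up in a profile, a minimal sketch of the obvious workaround (variable names illustrative, assuming a C-contiguous view a as in the question) is to pay for a.size exactly once, outside the hot loop:
cdef Py_ssize_t i, size
cdef double* a_ptr
size = a.size                 # one Python attribute lookup, done once
a_ptr = &a[0, 0, 0]           # valid because the view is C-contiguous
for i in range(size):
    a_ptr[i] = 0.0            # flat loop, no further Python interaction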
The whole story: the variable a is of type __Pyx_memviewslice and not of type __pyx_memoryview as one might think. The struct __Pyx_memviewslice has the following definition:
struct __pyx_memoryview_obj;
typedef struct {
    struct __pyx_memoryview_obj *memview;
    char *data;
    Py_ssize_t shape[8];
    Py_ssize_t strides[8];
    Py_ssize_t suboffsets[8];
} __Pyx_memviewslice;
That means shape can be accessed very efficiently by the Cython code, as it is a simple C array (by the way, I asked myself what happens if there are more than 8 dimensions - the answer is: you cannot have more than 8 dimensions).
The member memview is where the memory is held, and __pyx_memoryview_obj is the C extension type which is produced from the Cython code we saw above and looks as follows:
/* "View.MemoryView":328
*
* #cname('__pyx_memoryview')
* cdef class memoryview(object): # <<<<<<<<<<<<<<
*
* cdef object obj
*/
struct __pyx_memoryview_obj {
PyObject_HEAD
struct __pyx_vtabstruct_memoryview *__pyx_vtab;
PyObject *obj;
PyObject *_size;
PyObject *_array_interface;
PyThread_type_lock lock;
__pyx_atomic_int acquisition_count[2];
__pyx_atomic_int *acquisition_count_aligned_p;
Py_buffer view;
int flags;
int dtype_is_object;
__Pyx_TypeInfo *typeinfo;
};
So, __Pyx_memviewslice is not really a Python object - it is a kind of convenience wrapper which caches important data, like shape and strides, so this information can be accessed fast and cheaply.
What happens when we call a.size? First, __pyx_memoryview_fromslice is called, which does some additional reference counting and further bookkeeping and returns the member memview from the __Pyx_memviewslice object.
Then the property size is called on this returned memoryview, which accesses the cached value in _size, as was shown in the Cython code above.
It looks as if the Cython developers introduced a shortcut for such important information as shape, strides and suboffsets, but not for size, which is probably not as important - this is the reason for the cleaner C code in the case of shape.

The generated C code for a.size looks fine.
It has to interface with Python because memory views are Python extension types. size on the memory view is a Python attribute and gets converted to ssize_t. That is all the C code does. The conversion can be avoided by typing the size variable as Py_ssize_t rather than ssize_t.
So there is not anything in the C code that looks unoptimised - it's just looking up an attribute on a Python object, size on a memory view in this case.
Here are the results of a micro-benchmark for the two methods.
Setup:
cimport numpy as np
import numpy as np
cimport cython

cython.declare(a='double[:, :, ::1]')
a = np.empty((10, 20, 30), dtype='double')

def mv_size():
    return a.size

def mv_product():
    return a.shape[0]*a.shape[1]*a.shape[2]
Results:
%timeit mv_size
10000000 loops, best of 3: 23.4 ns per loop
%timeit mv_product
10000000 loops, best of 3: 23.4 ns per loop
Performance is pretty much identical.
The product method is purely C code, which matters if it needs to be executed in parallel, but otherwise there is no performance benefit over the memory view size.
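To make that concrete, here is a minimal sketch (function name illustrative, and it assumes the module is compiled with OpenMP): inside a nogil/prange block the shape product is usable because it is pure C, while a.size is a Python attribute lookup and therefore needs the GIL.
from cython.parallel import prange

def fill_parallel(double[:, :, ::1] a):
    cdef Py_ssize_t i
    cdef Py_ssize_t size = a.shape[0] * a.shape[1] * a.shape[2]   # pure C - would also be legal inside a nogil block, unlike a.size
    cdef double* a_ptr = &a[0, 0, 0]
    for i in prange(size, nogil=True):
        a_ptr[i] = 0.0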

Related

Cython loop over array of indexes

I would like to do a series of operations on particular elements of matrices. I need to define the indices of these elements in an external object (self.indices in the example below).
Here is a stupid example of an implementation in Cython:
%%cython -f -c=-O2 -I./
import numpy as np
cimport numpy as np
cimport cython

cdef class Test:

    cdef double[:, ::1] a, b
    cdef Py_ssize_t[:, ::1] indices

    def __cinit__(self, a, b, indices):
        self.a = a
        self.b = b
        self.indices = indices

    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run1(self):
        """ Use of external structure of indices. """
        cdef Py_ssize_t idx, ix, iy
        cdef int n = self.indices.shape[0]
        for idx in range(n):
            ix = self.indices[idx, 0]
            iy = self.indices[idx, 1]
            self.b[ix, iy] = ix * iy * self.a[ix, iy]

    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run2(self):
        """ Direct formulation """
        cdef Py_ssize_t idx, ix, iy
        cdef int nx = self.a.shape[0]
        cdef int ny = self.a.shape[1]
        for ix in range(nx):
            for iy in range(ny):
                self.b[ix, iy] = ix * iy * self.a[ix, iy]
with this on the Python side:
import itertools
import numpy as np
N = 256
a = np.random.rand(N, N)
b = np.zeros_like(a)
indices = np.array([[i, j] for i, j in itertools.product(range(N), range(N))], dtype=int)
test = Test(a, b, indices)
and the results:
%timeit test.run1()
75.6 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit test.run2()
41.4 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Why does the Test.run1() method run much slower than the Test.run2() method?
What are the possibilities to keep a similar level of performance as with Test.run2() by using an external list, array, or any other kind of structure of indices?
Because run1 is significantly more complicated...
run1 is having to read from two separate bits of memory which almost certainly makes the CPU cache less efficient.
It's fairly trivial for the compiler to work out exactly what order it's accessing the array elements in run2. In contrast in run1 it could be accessing them in any order. That likely allows for significant optimizations.
Your current performance is probably as good as it gets.
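One way to convince yourself that the memory-access pattern is the main culprit is to shuffle the index array and time run1 again - same amount of work, but the accesses are now in random order (a hedged experiment sketch reusing the objects from the question):
rng = np.random.default_rng(0)
indices_shuffled = indices.copy()
rng.shuffle(indices_shuffled)          # same indices, random order: locality is destroyed
test_shuffled = Test(a, b, indices_shuffled)
%timeit test_shuffled.run1()           # expect this to be noticeably slower than the ordered run1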
In addition to the good @DavidW answer, note that run2 is SIMD-friendly as opposed to run1. This means a compiler can easily generate SIMD instructions for run2, so as to read multiple packed items from memory, multiply multiple items in a row thanks to packed SIMD instructions, and write the packed results back to memory. If the array is small enough to fit in CPU caches, which is the case here, the SIMD computation can be very fast. Indeed, nearly all modern x86-64 processors support the 256-bit wide AVX/AVX-2 instruction set, which can operate on 8 32-bit integers or 4 double-precision floating-point numbers at once. Additionally, such code can easily be unrolled and well pipelined by modern processors. Hardware prefetchers are also optimized for this kind of use case.
Meanwhile, run1 does indirect memory accesses. Compilers can hardly assume they are actually contiguous and generate packed loads/stores (this is very unlikely in most codes, and it is up to developers to write this kind of optimization). The indirection requires multiple load instructions that saturate the load ports and make the overall computation at least twice slower. AVX-2 has a gather instruction that can theoretically help in such a case. That being said, the instruction is currently not implemented very efficiently on current Intel/AMD processors (it basically does scalar loads internally, saturating the load ports). Still, it should certainly make run1 run as fast as run2 if the latter is not vectorized (otherwise run2 should sharply outperform run1, even with gather instructions). Compilers unfortunately have a hard time using such instructions yet.
In fact, regarding the code and the timings, run2 should be even faster if SIMD instructions were used. I think this is not the case, certainly because the -O2 optimization level is currently set in your code and compilers like GCC do not yet automatically vectorize the code at that level (except in the very latest versions, AFAIK). Please consider using -O3. Please also consider enabling -mavx and -mavx2 if possible (this assumes the target processors are not too old), as it should make the code faster.
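As a concrete (hedged) example of how those flags can be passed with the %%cython cell magic used in the question; -march=native is an alternative that assumes the build machine is also the target machine:
%%cython -f -c=-O3 -c=-mavx -c=-mavx2 -I./
# ... same Cython code as in the question ...

# or, if the build machine is also the target machine:
# %%cython -f -c=-O3 -c=-march=native -I./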

Reasons for differences in memory consumption and performance of np.zeros and np.full

When measuring memory consumption of np.zeros:
import psutil
import numpy as np
process = psutil.Process()
N=10**8
start_rss = process.memory_info().rss
a = np.zeros(N, dtype=np.float64)
print("memory for a", process.memory_info().rss - start_rss)
the result is an unexpected 8192 bytes, i.e. almost 0, while 1e8 doubles would need 8e8 bytes.
When replacing np.zeros(N, dtype=np.float64) with np.full(N, 0.0, dtype=np.float64), the memory needed for a is 800002048 bytes.
There are similar discrepancies in running times:
import numpy as np
N=10**8
%timeit np.zeros(N, dtype=np.float64)
# 11.8 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.full(N, 0.0, dtype=np.float64)
# 419 ms ± 7.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I.e. np.zeros is up to 40 times faster for big sizes.
I'm not sure these differences exist on all architectures/operating systems, but I've observed them at least on x86-64 Windows and Linux.
Which differences between np.zeros and np.full can explain different memory consumption and different running times?
I don't trust psutil for these memory benchmarks, and rss (Resident Set Size) may not be the right metric in the first place.
Using the stdlib tracemalloc you can get correct-looking numbers for memory allocation - it should be approximately an 800000000-byte delta for this N and float64 dtype:
>>> import numpy as np
>>> import tracemalloc
>>> N = 10**8
>>> tracemalloc.start()
>>> tracemalloc.get_traced_memory() # current, peak
(159008, 1874350)
>>> a = np.zeros(N, dtype=np.float64)
>>> tracemalloc.get_traced_memory()
(800336637, 802014880)
For the timing difference between np.full and np.zeros, compare the man pages for malloc and calloc: np.zeros is able to go through an allocation routine which gets zeroed pages. See PyArray_Zeros, which calls PyArray_NewFromDescr_int passing in 1 for the zeroed argument, which then has a special case for allocating zeros faster:
if (zeroed || PyDataType_FLAGCHK(descr, NPY_NEEDS_INIT)) {
    data = npy_alloc_cache_zero(nbytes);
}
else {
    data = npy_alloc_cache(nbytes);
}
It looks like np.full does not have this fast path. There the performance will be similar to first doing an init and then an O(n) copy:
a = np.empty(N, dtype=np.float64)
a[:] = np.float64(0.0)
numpy devs could presumably have added a fast path to np.full if the fill value was zero, but why bother to add another way to do the same thing - users could just use np.zeros in the first place.
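A quick way to check this explanation is to time an empty-then-fill equivalent yourself; on most machines it lands close to np.full and far from np.zeros (exact numbers will of course vary):
import numpy as np
from timeit import timeit

N = 10**8

def empty_then_fill(n):
    a = np.empty(n, dtype=np.float64)   # uninitialized allocation
    a[:] = 0.0                          # explicit O(n) write of zeros
    return a

print(timeit(lambda: np.zeros(N, dtype=np.float64), number=10))      # fast: zeroed pages from the OS
print(timeit(lambda: np.full(N, 0.0, dtype=np.float64), number=10))  # slow: allocate, then write zeros
print(timeit(lambda: empty_then_fill(N), number=10))                 # about as slow as np.full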
The numpy.zeros function goes straight to the C layer of the NumPy library, while ones and full work by first creating an array and then copying the desired value into it.
So zeros does not need the extra fill step, while ones and full pay for an additional pass over the data.
Have a look on the source code to figure it out by yourself: https://github.com/numpy/numpy/blob/master/numpy/core/numeric.py

Can idle threads in OpenMP/Cython be used to parallelize remaining section of the work block?

I am new to OpenMP and using it to parallelize a for-loop (to be accurate, I am using prange in Cython).
However, the operations are very uneven, and, as a result, there are quite a few idle threads till one block of the for-loop is completed.
I wanted to know whether there is a way to access the idle threads so that I can use them to parallelize the bottleneck operations.
This question boils down to the question of perfect scheduling of tasks, which is quite hard in the general case, so usually one falls back to heuristics.
OpenMP offers different heuristics for scheduling, which can be chosen via the schedule argument to prange (documentation).
Let's look at the following example:
%%cython -c=/openmp --link-args=/openmp

cdef double calc(int n) nogil:
    cdef double d = 0.0
    cdef int i
    for i in range(n):
        d += 0.1*i*n
    return d

def single_sum(int n):
    cdef int i
    cdef double sum = 0.0
    for i in range(n):
        sum += calc(i)
    return sum
The evaluation of calc takes O(n), because an IEEE 754-compliant compiler is not able to optimize the for-loop away.
Now let's replace range with prange:
...
from cython.parallel import prange

def default_psum(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2):
        sum += calc(i)
    return sum
I have chosen to limit the number of threads to 2, to make the effect more dramatic. Now, comparing the running times we see:
N=4*10**4
%timeit single_sum(N) #no parallelization
# 991 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit default_psum(N) #parallelization
# 751 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
not as much improvement as we would like (i.e. we would like a speed-up of 2)!
Which schedule is chosen when it is not explicitly set is an implementation detail of the OpenMP provider - but most probably it will be "static" without a defined chunksize. In this case, the range is halved and one thread gets the first, fast half, while the other gets the second half, where almost all of the work must be done - so a big part of the work isn't parallelized in the end.
A better strategy to achieve a better balance is to give i=0 to the first thread, i=1 to the second, i=2 again to the first and so on. This can be achieved for the "static" schedule by setting chunksize to 1:
def static_psum1(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2, schedule="static", chunksize=1):
        sum += calc(i)
    return sum
we almost reach the maximal possible speed-up of 2:
%timeit static_psum1(N)
# 511 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Choosing the best schedule is a trade-off between scheduling overhead (not very high in the example above) and the best work balance - and the best trade-off can only be achieved after analyzing the problem (and the hardware!) at hand.
Here are some timings for the above example for different scheduling strategies and different numbers of threads:
(schedule, chunksize)    num_threads=2    num_threads=8
single-threaded          991 ms           991 ms
(default)                751 ms           265 ms
static                   757 ms           274 ms
static, 1                511 ms           197 ms
static, 10               512 ms           166 ms
dynamic, 1               509 ms           158 ms
dynamic, 10              509 ms           156 ms
guided                   508 ms           158 ms
Trying different schedules only makes sense when there is at least a theoretical possibility of achieving a good balance.
If there is a task which takes 90% of the running time, then no matter which scheduling strategy is used, it will not be possible to improve the performance. In this case the big task itself should be parallelized; sadly, Cython's support for OpenMP is somewhat lacking (see for example this SO post), so possibly it is better to code it in pure C and then wrap the resulting functionality with Cython.

Why python broadcasting in the example below is slower than a simple loop?

I have an array of vectors and compute the norm of their diffs vs the first one.
When using python broadcasting, the calculation is significantly slower than doing it via a simple loop. Why?
import numpy as np

def norm_loop(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    for i in range(n):
        d[i] = np.sum((M[i] - v)**2)
    return d

def norm_bcast(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    d = np.sum((M - v)**2, axis=1)
    return d
M = np.random.random_sample((1000, 10000))
v = M[0]
%timeit norm_loop(M, v)
25.9 ms
%timeit norm_bcast(M, v)
38.5 ms
I have Python 3.6.3 and Numpy 1.14.2
To run the example in google colab:
https://drive.google.com/file/d/1GKzpLGSqz9eScHYFAuT8wJt4UIZ3ZTru/view?usp=sharing
Memory access.
First off, the broadcast version can be simplified to
def norm_bcast(M, v):
    return np.sum((M - v)**2, axis=1)
This still runs slightly slower than the looped version.
Now, conventional wisdom says that vectorized code using broadcasting should always be faster, which in many cases isn't true (I'll shamelessly plug another of my answers here). So what's happening?
As I said, it comes down to memory access.
In the broadcast version, v is subtracted from every row of M in one go. By the time the last row of M is processed, the results of processing the first row have been evicted from cache, so for the second step these differences are again loaded into cache memory and squared. Finally, they are loaded and processed a third time for the summation. Since M is quite large, parts of the cache are cleared on each step to accommodate all of the data.
In the looped version each row is processed completely in one smaller step, leading to fewer cache misses and overall faster code.
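If you want to keep the broadcasting style but tame the cache behaviour, one hedged option is to process the rows in blocks, so the temporaries stay roughly cache-sized (norm_bcast_blocked and the block size are illustrative, not tuned):
import numpy as np

def norm_bcast_blocked(M, v, block=128):
    # process a slab of rows at a time so (M[s] - v)**2 stays in cache
    n = M.shape[0]
    d = np.empty(n)
    for start in range(0, n, block):
        s = slice(start, min(start + block, n))
        d[s] = np.sum((M[s] - v)**2, axis=1)
    return d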
Lastly, it is possible to avoid this with some array operations by using einsum.
This function allows mixing matrix multiplications and summations.
First, I'll point out it's a function that has rather unintuitive syntax compared to the rest of numpy, and potential improvements often aren't worth the extra effort to understand it.
The answer may also be slightly different due to rounding errors.
In this case it can be written as
def norm_einsum(M, v):
    tmp = M - v
    return np.einsum('ij,ij->i', tmp, tmp)
This reduces it to two operations over the entire array - a subtraction, and calling einsum, which performs the squaring and summation.
This gives a slight improvement:
%timeit norm_bcast(M, v)
30.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_loop(M, v)
25.1 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_einsum(M, v)
21.7 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Squeezing out maximum performance
On vectorized operations you clearly have bad cache behaviour. But the calculation itself is also slow due to not exploiting modern SIMD instructions (AVX2, FMA). Fortunately, it isn't really complicated to overcome these issues.
Example
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def norm_loop_improved(M, v):
    n = M.shape[0]
    d = np.empty(n, dtype=M.dtype)

    # enables SIMD-vectorization
    # if the arrays are not aligned
    M = np.ascontiguousarray(M)
    v = np.ascontiguousarray(v)

    for i in nb.prange(n):
        dT = 0.
        for j in range(v.shape[0]):
            dT += (M[i, j] - v[j]) * (M[i, j] - v[j])
        d[i] = dT
    return d
Performance
M = np.random.random_sample((1000, 1000))
norm_loop_improved: 0.11 ms**, 0.28ms
norm_loop: 6.56 ms
norm_einsum: 3.84 ms
M = np.random.random_sample((10000, 10000))
norm_loop_improved:34 ms
norm_loop: 223 ms
norm_einsum: 379 ms
** Be careful when measuring performance
The first result (0.11 ms) comes from calling the function repeatedly with the same data. This would need 77 GB/s read throughput from RAM, which is far more than my DDR3 dual-channel RAM is capable of. Because calling a function with the same input parameters successively isn't realistic at all, we have to modify the measurement.
To avoid this issue we have to call the same function with different data at least twice (8 MB L3 cache, 8 MB data), so that the caches are cleared between calls, and then divide the result by two.
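A hedged sketch of that measurement idea, reusing norm_loop_improved from above: allocate two independent data sets of about the cache size, time both calls back to back, and halve the result.
# two independent 8 MB data sets (1000 x 1000 float64), matching the 8 MB L3-cache scenario
M1 = np.random.random_sample((1000, 1000))
M2 = np.random.random_sample((1000, 1000))
v1, v2 = M1[0], M2[0]

%timeit norm_loop_improved(M1, v1); norm_loop_improved(M2, v2)
# divide the reported time by two: each call now starts with the caches filled by the other data set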
The relative performance of these methods also differs with array size (have a look at the einsum results).

Why are Python's arrays slow?

I expected array.array to be faster than lists, as arrays seem to be unboxed.
However, I get the following result:
In [1]: import array
In [2]: L = list(range(100000000))
In [3]: A = array.array('l', range(100000000))
In [4]: %timeit sum(L)
1 loop, best of 3: 667 ms per loop
In [5]: %timeit sum(A)
1 loop, best of 3: 1.41 s per loop
In [6]: %timeit sum(L)
1 loop, best of 3: 627 ms per loop
In [7]: %timeit sum(A)
1 loop, best of 3: 1.39 s per loop
What could be the cause of such a difference?
The storage is "unboxed", but every time you access an element Python has to "box" it (embed it in a regular Python object) in order to do anything with it. For example, your sum(A) iterates over the array, and boxes each integer, one at a time, in a regular Python int object. That costs time. In your sum(L), all the boxing was done at the time the list was created.
So, in the end, an array is generally slower, but requires substantially less memory.
Here's the relevant code from a recent version of Python 3, but the same basic ideas apply to all CPython implementations since Python was first released.
Here's the code to access a list item:
PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
/* error checking omitted */
return ((PyListObject *)op) -> ob_item[i];
}
There's very little to it: somelist[i] just returns the i'th object in the list (and all Python objects in CPython are pointers to a struct whose initial segment conforms to the layout of a struct PyObject).
And here's the __getitem__ implementation for an array with type code l:
static PyObject *
l_getitem(arrayobject *ap, Py_ssize_t i)
{
    return PyLong_FromLong(((long *)ap->ob_item)[i]);
}
The raw memory is treated as a vector of platform-native C long integers; the i'th C long is read up; and then PyLong_FromLong() is called to wrap ("box") the native C long in a Python long object (which, in Python 3, which eliminates Python 2's distinction between int and long, is actually shown as type int).
This boxing has to allocate new memory for a Python int object, and spray the native C long's bits into it. In the context of the original example, this object's lifetime is very brief (just long enough for sum() to add the contents into a running total), and then more time is required to deallocate the new int object.
This is where the speed difference comes from, always has come from, and always will come from in the CPython implementation.
To add to Tim Peters' excellent answer, arrays implement the buffer protocol, while lists do not. This means that, if you are writing a C extension (or the moral equivalent, such as writing a Cython module), then you can access and work with the elements of an array much faster than anything Python can do (see the sketch after this list of downsides). This will give you considerable speed improvements, possibly well over an order of magnitude. However, it has a number of downsides:
You are now in the business of writing C instead of Python. Cython is one way to ameliorate this, but it does not eliminate many fundamental differences between the languages; you need to be familiar with C semantics and understand what it is doing.
PyPy's C API works to some extent, but isn't very fast. If you are targeting PyPy, you should probably just write simple code with regular lists, and then let the JITter optimize it for you.
C extensions are harder to distribute than pure Python code because they need to be compiled. Compilation tends to be architecture and operating-system dependent, so you will need to ensure you are compiling for your target platform.
Going straight to C extensions may be using a sledgehammer to swat a fly, depending on your use case. You should first investigate NumPy and see if it is powerful enough to do whatever math you're trying to do. It will also be much faster than native Python, if used correctly.
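As a small illustration of the buffer-protocol point above, here is a hedged sketch of a Cython function that sums an array.array('l', ...) through a typed memoryview, so the elements are read as raw C longs and never boxed (the function name is illustrative):
def unboxed_sum(long[:] data):
    cdef Py_ssize_t i
    cdef long total = 0
    for i in range(data.shape[0]):
        total += data[i]      # reads the raw C long, no Python int is created
    return total              # boxed exactly once, at the very end
Calling unboxed_sum(A) with the array from the question works because array.array exposes its buffer; passing a plain list would fail, since lists do not implement the buffer protocol.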
Tim Peters answered why this is slow, but let's see how to improve it.
Sticking to your example of sum(range(...)) (factor 10 smaller than your example to fit into memory here):
import numpy
import array
L = list(range(10**7))
A = array.array('l', L)
N = numpy.array(L)
%timeit sum(L)
10 loops, best of 3: 101 ms per loop
%timeit sum(A)
1 loop, best of 3: 237 ms per loop
%timeit sum(N)
1 loop, best of 3: 743 ms per loop
This way NumPy also needs to box/unbox, which has additional overhead. To make it fast one has to stay within the NumPy C code:
%timeit N.sum()
100 loops, best of 3: 6.27 ms per loop
So from the list solution to the numpy version this is a factor 16 in runtime.
Let's also check how long creating those data structures takes
%timeit list(range(10**7))
1 loop, best of 3: 283 ms per loop
%timeit array.array('l', range(10**7))
1 loop, best of 3: 884 ms per loop
%timeit numpy.array(range(10**7))
1 loop, best of 3: 1.49 s per loop
%timeit numpy.arange(10**7)
10 loops, best of 3: 21.7 ms per loop
Clear winner: Numpy
Also note that creating the data structure takes about as much time as summing, if not more. Allocating memory is slow.
Memory usage of those:
sys.getsizeof(L)
90000112
sys.getsizeof(A)
81940352
sys.getsizeof(N)
80000096
So these take 8 bytes per number with varying overhead. For the range we use, 32-bit ints are sufficient, so we can save some memory.
N=numpy.arange(10**7, dtype=numpy.int32)
sys.getsizeof(N)
40000096
%timeit N.sum()
100 loops, best of 3: 8.35 ms per loop
But it turns out that adding 64bit ints is faster than 32bit ints on my machine, so this is only worth it if you are limited by memory/bandwidth.
I noticed that typecode 'L' is faster than 'l', and the same holds for 'I' and 'Q'.
Python 3.8.5
Here is the code of the test; check it out.
#!/usr/bin/python3
import inspect
import time
from tqdm import tqdm
from array import array

def get_var_name(var):
    """
    Gets the name of var. Does it from the outermost frame inner-wards.
    :param var: variable to get name from.
    :return: string
    """
    for fi in reversed(inspect.stack()):
        names = [var_name for var_name, var_val in fi.frame.f_locals.items() if var_val is var]
        if len(names) > 0:
            return names[0]

def performtest(func, n, *args, **kwargs):
    times = array('f')
    times_append = times.append
    for i in tqdm(range(n)):
        st = time.time()
        func(*args, **kwargs)
        times_append(time.time() - st)
    print(
        f"Func {func.__name__} with {[get_var_name(i) for i in args]} run {n} rounds consuming |"
        f" Mean: {sum(times)/len(times)}s | Max: {max(times)}s | Min: {min(times)}s"
    )

def list_int(start, end, step=1):
    return [i for i in range(start, end, step)]

def list_float(start, end, step=1):
    return [i + 1e-1 for i in range(start, end, step)]

def array_int(start, end, step=1):
    return array("I", range(start, end, step))  # speed I > i, H > h, Q > q, I~=H~=Q

def array_float(start, end, step=1):
    return array("f", [i + 1e-1 for i in range(start, end, step)])  # speed f > d

if __name__ == "__main__":
    performtest(list_int, 1000, 0, 10000)
    performtest(array_int, 1000, 0, 10000)
    performtest(list_float, 1000, 0, 10000)
    performtest(array_float, 1000, 0, 10000)
Results
Result of the test
Please note that 100000000 equals 10**8, not 10**7, and my results are as follows:
100000000 == 10**8
# my test results on a Linux virtual machine:
#<L = list(range(100000000))> Time: 0:00:03.263585
#<A = array.array('l', range(100000000))> Time: 0:00:16.728709
#<L = list(range(10**8))> Time: 0:00:03.119379
#<A = array.array('l', range(10**8))> Time: 0:00:18.042187
#<A = array.array('l', L)> Time: 0:00:07.524478
#<sum(L)> Time: 0:00:01.640671
#<np.sum(L)> Time: 0:00:20.762153
