Why is `np.sum(range(N))` very slow?

Why is `np.sum(range(N))` very slow? - python

I saw a video about speed of loops in python, where it was explained that doing sum(range(N)) is much faster than manually looping through range and adding the variables together, since the former runs in C due to built-in functions being used, while in the latter the summation is done in (slow) python. I was curious what happens when adding numpy to the mix. As I expected np.sum(np.arange(N)) is the fastest, but sum(np.arange(N)) and np.sum(range(N)) are even slower than doing the naive for loop.
Why is this?
Here's the script I used to test, some comments about the supposed cause of slowing done where I know (taken mostly from the video) and the results I got on my machine (python 3.10.0, numpy 1.21.2):
updated script:
import numpy as np
from timeit import timeit
N = 10_000_000
repetition = 10
def sum0(N = N):
s = 0
i = 0
while i < N: # condition is checked in python
s += i
i += 1 # both additions are done in python
return s
def sum1(N = N):
s = 0
for i in range(N): # increment in C
s += i # addition in python
return s
def sum2(N = N):
return sum(range(N)) # everything in C
def sum3(N = N):
return sum(list(range(N)))
def sum4(N = N):
return np.sum(range(N)) # very slow np.array conversion
def sum5(N = N):
# much faster np.array conversion
return np.sum(np.fromiter(range(N),dtype = int))
def sum5v2_(N = N):
# much faster np.array conversion
return np.sum(np.fromiter(range(N),dtype = np.int_))
def sum6(N = N):
# possibly slow conversion to Py_long from np.int
return sum(np.arange(N))
def sum7(N = N):
# list returns a list of np.int-s
return sum(list(np.arange(N)))
def sum7v2(N = N):
# tolist conversion to python int seems faster than the implicit conversion
# in sum(list()) (tolist returns a list of python int-s)
return sum(np.arange(N).tolist())
def sum8(N = N):
return np.sum(np.arange(N)) # everything in numpy (fortran libblas?)
def sum9(N = N):
return np.arange(N).sum() # remove dispatch overhead
def array_basic(N = N):
return np.array(range(N))
def array_dtype(N = N):
return np.array(range(N),dtype = np.int_)
def array_iter(N = N):
# np.sum's source code mentions to use fromiter to convert from generators
return np.fromiter(range(N),dtype = np.int_)
print(f"while loop: {timeit(sum0, number = repetition)}")
print(f"for loop: {timeit(sum1, number = repetition)}")
print(f"sum_range: {timeit(sum2, number = repetition)}")
print(f"sum_rangelist: {timeit(sum3, number = repetition)}")
print(f"npsum_range: {timeit(sum4, number = repetition)}")
print(f"npsum_iterrange: {timeit(sum5, number = repetition)}")
print(f"npsum_iterrangev2: {timeit(sum5, number = repetition)}")
print(f"sum_arange: {timeit(sum6, number = repetition)}")
print(f"sum_list_arange: {timeit(sum7, number = repetition)}")
print(f"sum_arange_tolist: {timeit(sum7v2, number = repetition)}")
print(f"npsum_arange: {timeit(sum8, number = repetition)}")
print(f"nparangenpsum: {timeit(sum9, number = repetition)}")
print(f"array_basic: {timeit(array_basic, number = repetition)}")
print(f"array_dtype: {timeit(array_dtype, number = repetition)}")
print(f"array_iter: {timeit(array_iter, number = repetition)}")
print(f"npsumarangeREP: {timeit(lambda : sum8(N/1000), number = 100000*repetition)}")
print(f"npsumarangeREP: {timeit(lambda : sum9(N/1000), number = 100000*repetition)}")
# Example output:
#
# while loop: 11.493371912998555
# for loop: 7.385945574002108
# sum_range: 2.4605720699983067
# sum_rangelist: 4.509678105998319
# npsum_range: 11.85120212900074
# npsum_iterrange: 4.464334709002287
# npsum_iterrangev2: 4.498494338993623
# sum_arange: 9.537815956995473
# sum_list_arange: 13.290120724996086
# sum_arange_tolist: 5.231948580003518
# npsum_arange: 0.241889145996538
# nparangenpsum: 0.21876695199898677
# array_basic: 11.736577274998126
# array_dtype: 8.71628468400013
# array_iter: 4.303306431000237
# npsumarangeREP: 21.240833958996518
# npsumarangeREP: 16.690092379001726

np.sum(range(N)) is slow mostly because the current Numpy implementation do not use enough informations about the exact type/content of the values provided by the generator range(N). The heart of the general problem is inherently due to dynamic typing of Python and big integers although Numpy could optimize this specific case.
First of all, range(N) returns a dynamically-typed Python object which is a (special kind of) Python generator. The object provided by this generator are also dynamically-typed. It is in practice a pure-Python integer.
The thing is Numpy is written in the statically-typed language C and so it cannot efficiently work on dynamically-typed pure-Python objects. The strategy of Numpy is to convert such objects into C types when it can. One big problem in this case is that the integers provided by the generator can theorically be huge: Numpy do not know if the values can overflow a np.int32 or even a np.int64 type. Thus, Numpy first detect the good type to use and then compute the result using this type.
This translation process can be quite expensive and appear not to be needed here since all the values provided by range(10_000_000). However, range(5_000_000_000) returns the same object type with pure-Python integers overflowing np.int32 and Numpy needs to automatically detect this case not to return wrong results. The thing is also the input type can be correctly identified (np.int32 on my machine), it does not means that the output result will be correct because overflows can appear in during the computation of the sum. This is sadly the case on my machine.
Numpy developers decided to deprecate such a use and put in the documentation that np.fromiter should be used instead. np.fromiter has a dtype required parameter to let the user define what is the good type to use.
One way to check this behaviour in practice is to simply use create a temporary list:
tmp = list(range(10_000_000))
# Numpy implicitly convert the list in a Numpy array but
# still automatically detect the input type to use
np.sum(tmp)
A faster implementation is the following:
tmp = list(range(10_000_000))
# The array is explicitly converted using a well-defined type and
# thus there is no need to perform an automatic detection
# (note that the result is still wrong since it does not fit in a np.int32)
tmp2 = np.array(tmp, dtype=np.int32)
result = np.sum(tmp2)
The first case takes 476 ms on my machine while the second takes 289 ms. Note that np.sum takes only 4 ms. Thus, a large part of the time is spend in the conversion of pure-Python integer objects to internal int32 types (more specifically the management of pure-Python integers). list(range(10_000_000)) is expensive too as it takes 205 ms. This is again due to the overhead of pure-Python integers (ie. allocations, deallocations, reference counting, increment of variable-sized integers, memory indirections and conditions due to the dynamic typing) as well as the overhead of the generator.
sum(np.arange(N)) is slow because sum is a pure-Python function working on a Numpy-defined object. The CPython interpreter needs to call Numpy functions to perform basic additions. Moreover, Numpy-defined integer object are still Python object and so they are subject to reference counting, allocation, deallocation, etc. Not to mention Numpy and CPython add many checks in the functions aiming to finally just add two native numbers together. A Numpy-aware just-in-time compiler such as Numba can solve this issue. Indeed, Numba takes 23 ms on my machine to compute the sum of np.arange(10_000_000) (with code still written in Python) while the CPython interpreter takes 556 ms.

Let's see if I can summarize the results.
sum can work with any iterable, repeatedly asking for the next value and adding it. range is a generator, that's happy to supply the next value
# sum_range: 1.4830789409988938
Making a list from a range takes time:
# sum_rangelist: 3.6745876889999636
Summing a pregenerated list is actually faster than summing the range:
%%timeit x = list(range(N))
...: sum(x)
np.sum is designed to sum arrays. It's a wrapper to np.add.reduce.
np.sum has a deprecation warning for np.sum(generator), recommending the use of fromiter or Python sum:
# npsum_range: 16.216972655000063
fromiter is the best way of making an array from a generator. Using np.array on range is legacy code and may go away in the future. I think it's the only generator that np.array will accept.
np.array is a general purpose function that can handle many cases, including nested arrays, and conversion to various dtypes. As such it has to process the whole input argument, deducing both shape and dtype.
# npsum_fromiterrange:3.47655400199983
Iteration on a numpy array is slower than a list, since it has to "unbox" each element.
# sum_arange: 16.656015603000924
Similarly making a list from an array is slow; same sort of python level iteration.
# sum_list_arange: 19.500842117000502
arr.tolist() is relatively fast, creating a pure python list in compiled code. So speed is similar to making a list from range.
# sum_arange_tolist: 4.004777374000696
np.sum of an array is pure numpy and quite fast. np.sum(x) where x=np.arange(N) is even faster (by about 4x)
# npsum_arange: 0.2332638230000157
np.sum from range or list is dominated by the cost of creating the array first:
# array_basic: 16.1631146109994
# array_dtype: 16.550737804000164
# array_iter: 3.9803170430004684

From the cpython source code for sum sum initially seems to attempt a fast path that assumes all inputs are the same type. If that fails it will just iterate:
/* Fast addition by keeping temporary sums in C instead of new Python objects.
Assumes all inputs are the same type. If the assumption fails, default
to the more general routine.
*/
I'm not entirely certain what is happening under the hood, but it is likely the repeated creation/conversion of C types to Python objects that is causing these slow-downs. It's worth noting that both sum and range are implemented in C.
This next bit is not really an answer to the question, but I wondered if we could speed up sum for python ranges as range is quite a smart object.
To do this I've used functools.singledispatch to override the built-in sum function specifically for the range type; then implemented a small function to calculate the sum of an arithmetic progression.
from functools import singledispatch
def sum_range(range_, /, start=0):
"""Overloaded `sum` for range, compute arithmetic sum"""
n = len(range_)
if not n:
return start
return int(start + (n * (range_[0] + range_[-1]) / 2))
sum = singledispatch(sum)
sum.register(range, sum_range)
def test():
"""
>>> sum(range(0, 100))
4950
>>> sum(range(0, 10, 2))
20
>>> sum(range(0, 9, 2))
20
>>> sum(range(0, -10, -1))
-45
>>> sum(range(-10, 10))
-10
>>> sum(range(-1, -100, -2))
-2500
>>> sum(range(0, 10, 100))
0
>>> sum(range(0, 0))
0
>>> sum(range(0, 100), 50)
5000
>>> sum(range(0, 0), 10)
10
"""
if __name__ == "__main__":
import doctest
doctest.testmod()
I'm not sure if this is complete, but it's definitely faster than looping.

Related

The fastest way possible to split a long string

I have a very long string in Python:
x = "12;14;14;14;18;12;17;19" # I only show a small part of it : there are 10 millions of ;
The goal is to transform it into:
y = array([12, 14, 14, 14, 18, 12, 17, 19], dtype=int)
One way to do this is to use array(x.split(";")) or numpy.fromtostring.
But both are extremely slow.
Is there quicker way to do it in python?
Thank you very much and have a nice day.

String parsing is often slow. Unicode decoding often make things slower (especially when there are non-ASCII character) unless it is carefully optimized (hard). CPython is slow, especially loops. Numpy is not really design to (efficiently) deal with strings. I do not think Numpy can do this faster than fromstring yet. The only solutions I can come up with are using Numba, Cython or even basic C extensions. The simplest solution is to use Numba, the fastest is to use Cython/C-extensions.
Unfortunately Numba is very slow for strings/bytes so far (this is an open issue that is not planed to be solved any time soon). Some tricks are needed so that Numba can compute this efficiently: the string needs to be converted to a Numpy array. This means it must be first encoded to a byte-array first to avoid any variable-sized encoding (like UTF-8). np.frombuffer seems the fastest solution to convert the buffer to a Numpy array. Since the input is a read-only array (unusual, but efficient), the Numba signature is not very easy to read.
Here is the final solution:
import numpy as np
import numba as nb
#nb.njit(nb.int32[::1](nb.types.Array(nb.uint8, 1, 'C', readonly=True,)))
def compute(arr):
sep = ord(';')
base = ord('0')
minus = ord('-')
count = 1
for c in arr:
count += c == sep
res = np.empty(count, np.int32)
val = 0
positive = True
cur = 0
for c in arr:
if c != sep and c != minus:
val = (val * 10) + c - base
elif c == minus:
positive = False
else:
res[cur] = val if positive else -val
cur += 1
val = 0
positive = True
if cur < count:
res[cur] = val if positive else -val
return res
x = ';'.join(np.random.randint(0, 200, 10_000_000).astype(str))
result = compute(np.frombuffer(x.encode('ascii'), np.uint8))
Note that the Numba solution performs no checks for sake of performance. It also assume the numbers are positive ones. Thus, you must ensure the input is valid. Alternatively, you can perform additional checks in it (at the expense of a slower code).
Here are performance results on my machine with a i5-9600KF processor (with Numpy 1.22.4 on Windows):
np.fromstring(x, dtype=np.int32, sep=';'): 8927 ms
np.array(re.split(";", x), dtype=np.int32): 1419 ms
np.array(x.split(";"), dtype=np.int32): 1196 ms
Numba implementation: 78 ms
Numba implementation (without negative numbers): 67 ms
This solution is 114 times faster than np.fromstring and 15 times faster than the fastest solution (based on split). Note that removing the support for negative numbers makes the Numba function 18% faster. Also, note that 10~12% of the time is spent in encode. The rest of the time comes from the main loop in the Numba function. More specifically, the conditionals in the loop are the main source of the slowdown because they can hardly predicted by the processor and they prevent the use of fast (SIMD) instructions. This is often why string parsing is slow.
A possible improvement is to use a branchless implementation operating on chunks. Another possible improvement is to compute the chunks using multiple threads. However, both optimizations are tricky to do and they both make the resulting code significantly harder to read (and so to maintain).

summing over a list of int overflow(?) python

Let's consider a list of large integers, for example one given by:
def primesfrom2to(n):
# http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
""" Input n>=6, Returns a array of primes, 2 <= p < n """
sieve = np.ones(n/3 + (n%6==2), dtype=np.bool)
sieve[0] = False
for i in xrange(int(n**0.5)/3+1):
if sieve[i]:
k=3*i+1|1
sieve[ ((k*k)/3) ::2*k] = False
sieve[(k*k+4*k-2*k*(i&1))/3::2*k] = False
return np.r_[2,3,((3*np.nonzero(sieve)[0]+1)|1)]
primesfrom2to(2000000)
I want to calculate the sum of that, and the expected result is 142913828922.
But if I do:
sum(primesfrom2to(2000000))
I get 1179908154, which is clearly wrong. The problem is that I have an int overflow, but I don't understand why. Let's me explain.Consider this testing code:
a=primesfrom2to(2000000)
b=[float(i) for i in a]
c=[long(i) for i in a]
sumI=0
sumF=0
sumL=0
m=0
for i,j,k in zip(a,b,c):
m=m+1
sumI=sumI+i
sumF=sumF+j
sumL=sumL+k
print sumI,sumF,sumL
if sumI<0:
print i,m
break
I found out that the first integer overflow is happening at a[i=20444]=225289
If I do:
>>> sum(a[:20043])+225289
-2147310677
But if I do:
>>> sum(a[:20043])
2147431330
>>> 2147431330+225289
2147656619L
What's happening? Why such a different behaviour? Why can't sum switch automatically to long type and give the correct result?

Look at the types of your results. You are summing a numpy array, which is using numpy datatypes, which can overflow. When you do sum(a[:20043]), you get a numpy object back (some sort of int32 or the like), which overflows when added to another number. When you manually type in the same number, you're creating a Python builtin int, which can auto-promote to long. Numpy arrays cannot autopromote like Python builtin types, because the array type (and its memory layout) have to be fixed when the array is created. This makes operations much faster at the expense of type flexibility.
You may be able to get around the problem by using a different datatype (like np.int64) instead of np.bool. However, it depends how big your numbers are. A simple example:
# Python types ok
>>> 2**62
4611686018427387904L
>>> 2**63
9223372036854775808L
# numpy types overflow
>>> np.int64(2)**62
4611686018427387904
>>> np.int64(2)**63
-9223372036854775808
Your example works correctly for me on 64-bit Python, so I guess you're using 32-bit Python. If you can use 64-bit types you will be able to get past the limit you found, but as my example shows you will eventually overflow 64-bit ints too if your numbers get super huge.

Python: Number ranges that are extremely large?

val = long(raw_input("Please enter the maximum value of the range:")) + 1
start_time = time.time()
numbers = range(0, val)
shuffle(numbers)
I cannot find a simple way to make this work with extremely large inputs - can anyone help?
I saw a question like this - but I could not implement the range function they described in a way that works with shuffle. Thanks.

To get a random permutation of the range [0, n) in a memory efficient manner; you could use numpy.random.permutation():
import numpy as np
numbers = np.random.permutation(n)
If you need only small fraction of values from the range e.g., to get k random values from [0, n) range:
import random
from functools import partial
def sample(n, k):
# assume n is much larger than k
randbelow = partial(random.randrange, n)
# from random.py
result = [None] * k
selected = set()
selected_add = selected.add
for i in range(k):
j = randbelow()
while j in selected:
j = randbelow()
selected_add(j)
result[i] = j
return result
print(sample(10**100, 10))

If you don't need the full list of numbers (and if you are getting billions, its hard to imagine why you would need them all), you might be better off taking a random.sample of your number range, rather than shuffling them all. In Python 3, random.sample can work on a range object too, so your memory use can be quite modest.
For example, here's code that will sample ten thousand random numbers from a range up to whatever maximum value you specify. It should require only a relatively small amount of memory beyond the 10000 result values, even if your maximum is 100 billion (or whatever enormous number you want):
import random
def get10kRandomNumbers(maximum):
pop = range(1, maximum+1) # this is memory efficient in Python 3
sample = random.sample(pop, 10000)
return sample
Alas, this doesn't work as nicely in Python 2, since xrange objects don't allow maximum values greater than the system's integer type can hold.

An important point to note is that it will be impossible for a computer to have the list of numbers in memory if it is larger than a few billion elements: its memory footprint becomes larger than the typical RAM size (as it takes about 4 GB for 1 billion 32-bit numbers).
In the question, val is a long integer, which seems to indicate that you are indeed using more than a billion integer, so this cannot be done conveniently in memory (i.e., shuffling will be slow, as the operating system will swap).
That said, if the number of elements is small enough (let's say smaller than 0.5 billion), then a list of elements can fit in memory thanks to the compact representation offered by the array module, and be shuffled. This can be done with the standard module array:
import array, random
numbers = array.array('I', xrange(10**8)) # or 'L', if the number of bytes per item (numbers.itemsize) is too small with 'I'
random.shuffle(numbers)

Performance of small sets in Python

I am looking for the most efficient way to represent small sets of integers in a given range (say 0-10) in Python. In this case, efficiency means fast construction (from an unsorted list), fast query (a couple of queries on each set), and reasonably fast construction of a sorted version (perhaps once per ten sets or so). A priori the candidates are using Python's builtin set type (fast query), using a sorted array (perhaps faster to constrct?), or using a bit-array (fast everything if I was in C... but I doubt Python will be that efficient (?)). Any advice of which one to choose?
Thanks.

I'd use a bitmapping and store the members of a "set" in an int...which might actually be faster than the built-in set type in this case -- although I haven't tested that. It would definitely require less storage.
Update
I don't have the time right now to do a full set-like implementation and benchmark it against Python's built-in class, but here's what I believe is a working example illustrating my suggestion. As I think you'd agree, the code looks fairly fast as well as memory efficient.
Given Python's almost transparent "unlimited" long integer capabilities, what is written will automatically work with integer values in a much larger range than you need, although doing so would likely slow things down a bit. ;)
class BitSet(object):
def __init__(self, *bitlist):
self._bitmap = 0
for bitnum in bitlist:
self._bitmap |= (1 << bitnum)
def add(self, bitnum):
self._bitmap |= (1 << bitnum)
def remove(self, bitnum):
if self._bitmap & (1 << bitnum):
self._bitmap &= ~(1 << bitnum)
else:
raise KeyError
def discard(self, bitnum):
self._bitmap &= ~(1 << bitnum)
def clear(self):
self._bitmap = 0
def __contains__(self, bitnum):
return bool(self._bitmap & (1 << bitnum))
def __int__(self):
return self._bitmap
if __name__ == '__main__':
bs = BitSet()
print '28 in bs:', 28 in bs
print 'bs.add(28)'
bs.add(28)
print '28 in bs:', 28 in bs
print
print '5 in bs:', 5 in bs
print 'bs.add(5)'
bs.add(5)
print '5 in bs:', 5 in bs
print
print 'bs.remove(28)'
bs.remove(28)
print '28 in bs:', 28 in bs

In this case you might just use a list of True/False values. The hash table used by set will be doing the same thing, but it will include overhead for hashing, bucket assignment, and collision detection.
myset = [False] * 11
for i in values:
myset[i] = True
mysorted = [i for i in range(11) if myset[i]]
As always you need to time it yourself to know how it works in your circumstances.

My advice is to stick with the built-in set(). It will be very difficult to write Python code that beats the built-in C code for performance. Speed of construction and speed of lookup will be fastest if you are relying on the built-in C code.
For a sorted list, your best bet is to use the built-in sort feature:
x = set(seq) # build set from some sequence
lst = sorted(x) # get sorted list from set
In general, in Python, the less code you write, the faster it is. The more you can rely on the built-in C underpinnings of Python, the faster. Interpreted Python is 20x to 100x slower than C code in many cases, and it is extremely hard to be so clever that you come out ahead vs. just using the built-in features as intended.
If your sets are guaranteed to always be integers in the range of [0, 10], and you want to make sure the memory footprint is as small as possible, then bit-flags inside an integer would be the way to go.
pow2 = [2**i for i in range(32)]
x = 0 # set with no values
def add_to_int_set(x, n):
return x | pow2[n]
def in_int_set(x, n):
return x & pow2[n]
def list_from_int_set(x):
return [i for i in range(32) if x & pow2[i]]
I'll bet this is actually slower than using the built-in set() functions, but you know that each set will just be an int object: 4 bytes, plus the overhead of a Python object.
If you literally needed billions of them, you could save space by using a NumPy array instead of a Python list; the NumPy array will just store bare integers. In fact, NumPy has a 16-bit integer type, so if your sets are really only in the range of [0, 10] you could get the storage size down to two bytes each using a NumPy array.
http://www.scipy.org/FAQ#head-16a621f03792969969e44df8a9eb360918ce9613

Even for small collections, 'contains' checks turn out quite a bit faster with sets.
>>> Timer("3 in values", 'values = [range(10)]').timeit(number = 10**7)
0.5200109481811523
>>> Timer("3 in values", 'values = set(range(10))').timeit(number = 10**7)
0.2755239009857178
On the other hand, as you've indicated, constructing a set takes a little bit longer.
>>> Timer("set(range(10))").timeit(number = 10**7)
5.87517786026001
>>> Timer("list(range(10))").timeit(number = 10**7)
4.129410028457642
There are also some differences when sorting:
>>> Timer("sorted(values)", 'values = set(range(10, 0, -1))').timeit(number = 10**7)
5.277467966079712
>>> Timer("sorted(values)", 'values = list(range(10, 0, -1))').timeit(number = 10**7)
4.3836448192596436
>>> Timer("values.sort()", 'values = list(range(10, 0, -1))').timeit(number = 10**7)
2.073429822921753
Sorting in-place is significantly faster and is only available for lists.
So if you're only doing a small amount of queries per collection, lists are more performant. When doing a lot of queries, I'd go with sets.
In either case, the difference between small collections is small.
Building your own collection type in Python for better performance is not recommended.

What is the fastest way in python to build a c array from a list of tuples of floats?

The context: my Python code pass arrays of 2D vertices to OpenGL.
I tested 2 approaches, one with ctypes, the other with struct, the latter being more than twice faster.
from random import random
points = [(random(), random()) for _ in xrange(1000)]
from ctypes import c_float
def array_ctypes(points):
n = len(points)
return n, (c_float*(2*n))(*[u for point in points for u in point])
from struct import pack
def array_struct(points):
n = len(points)
return n, pack("f"*2*n, *[u for point in points for u in point])
Any other alternative?
Any hint on how to accelerate such code (and yes, this is one bottleneck of my code)?

You can pass numpy arrays to PyOpenGL without incurring any overhead. (The data attribute of the numpy array is a buffer that points to the underlying C data structure that contains the same information as the array you're building)
import numpy as np
def array_numpy(points):
n = len(points)
return n, np.array(points, dtype=np.float32)
On my computer, this is about 40% faster than the struct-based approach.

You could try Cython. For me, this gives:
function usec per loop:
Python Cython
array_ctypes 1370 1220
array_struct 384 249
array_numpy 336 339
So Numpy only gives 15% benefit on my hardware (old laptop running WindowsXP), whereas Cython gives about 35% (without any extra dependency in your distributed code).
If you can loosen your requirement that each point is a tuple of floats, and simply make 'points' a flattened list of floats:
def array_struct_flat(points):
n = len(points)
return pack(
"f"*n,
*[
coord
for coord in points
]
)
points = [random() for _ in xrange(1000 * 2)]
then the resulting output is the same, but the timing goes down further:
function usec per loop:
Python Cython
array_struct_flat 157
Cython might be capable of substantially better than this too, if someone smarter than me wanted to add static type declarations to the code. (Running 'cython -a test.pyx' is invaluable for this, it produces an html file showing you where the slowest (yellow) plain Python is in your code, versus python that has been converted to pure C (white). That's why I spread the code above out onto so many lines, because the coloring is done per-line, so it helps to spread it out like that.)
Full Cython instructions are here:
http://docs.cython.org/src/quickstart/build.html
Cython might produce similar performance benefits across your whole codebase, and in ideal conditions, with proper static typing applied, can improve speed by factors of ten or a hundred.

There's another idea I stumbled across. I don't have time to profile it right now, but in case someone else does:
# untested, but I'm fairly confident it runs
# using 'flattened points' list, i.e. a list of n*2 floats
points = [random() for _ in xrange(1000 * 2)]
c_array = c_float * len(points * 2)
c_array[:] = points
That is, first we create the ctypes array but don't populate it. Then we populate it using the slice notation. People smarter than I tell me that assigning to a slice like this may help performance. It allows us to pass a list or iterable directly on the RHS of the assignment, without having to use the *iterable syntax, which would perform some intermediate wrangling of the iterable. I suspect that this is what happens in the depths of creating pyglet's Batches.
Presumably you could just create c_array once, then just reassign to it (the final line in the above code) every time the points list changes.
There is probably an alternative formulation which accepts the original definition of points (a list of (x,y) tuples.) Something like this:
# very untested, likely contains errors
# using a list of n tuples of two floats
points = [(random(), random()) for _ in xrange(1000)]
c_array = c_float * len(points * 2)
c_array[:] = chain(p for p in points)

If performance is an issue, you do not want to use ctypes arrays with the star operation (e.g., (ctypes.c_float * size)(*t)).
In my test packis fastest followed by the use of the array module with a cast of the address (or using the from_buffer function).
import timeit
repeat = 100
setup="from struct import pack; from random import random; import numpy; from array import array; import ctypes; t = [random() for _ in range(2* 1000)];"
print(timeit.timeit(stmt="v = array('f',t); addr, count = v.buffer_info();x = ctypes.cast(addr,ctypes.POINTER(ctypes.c_float))",setup=setup,number=repeat))
print(timeit.timeit(stmt="v = array('f',t);a = (ctypes.c_float * len(v)).from_buffer(v)",setup=setup,number=repeat))
print(timeit.timeit(stmt='x = (ctypes.c_float * len(t))(*t)',setup=setup,number=repeat))
print(timeit.timeit(stmt="x = pack('f'*len(t), *t);",setup=setup,number=repeat))
print(timeit.timeit(stmt='x = (ctypes.c_float * len(t))(); x[:] = t',setup=setup,number=repeat))
print(timeit.timeit(stmt='x = numpy.array(t,numpy.float32).data',setup=setup,number=repeat))
The array.array approach is slightly faster than Jonathan Hartley's approach in my test while the numpy approach has about half the speed:
python3 convert.py
0.004665990360081196
0.004661010578274727
0.026358536444604397
0.0028003649786114693
0.005843495950102806
0.009067213162779808
The net winner is pack.

You can use array (notice also the generator expression instead of the list comprehension):
array("f", (u for point in points for u in point)).tostring()
Another optimization would be to keep the points flattened from the beginning.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.