What is the rationale behind the advocated use of the for i in xrange(...)-style looping constructs in Python? For simple integer looping, the difference in overheads is substantial. I conducted a simple test using two pieces of code:
File idiomatic.py:
#!/usr/bin/env python
M = 10000
N = 10000
if __name__ == "__main__":
    x, y = 0, 0
    for x in xrange(N):
        for y in xrange(M):
            pass
File cstyle.py:
#!/usr/bin/env python
M = 10000
N = 10000
if __name__ == "__main__":
    x, y = 0, 0
    while x < N:
        while y < M:
            y += 1
        x += 1
Profiling results were as follows:
bash-3.1$ time python cstyle.py
real 0m0.109s
user 0m0.015s
sys 0m0.000s
bash-3.1$ time python idiomatic.py
real 0m4.492s
user 0m0.000s
sys 0m0.031s
I can understand why the Pythonic version is slower -- I imagine it has a lot to do with calling xrange N times; perhaps this could be eliminated if there were a way to rewind a generator. However, with this large a difference in execution time, why would one prefer to use the Pythonic version?
Edit: I conducted the tests again using the code Mr. Martelli provided, and the results were indeed better now:
I thought I'd enumerate the conclusions from the thread here:
1) Lots of code at the module scope is a bad idea, even if the code is enclosed in an if __name__ == "__main__": block.
2) Curiously enough, changing thebadone back to my incorrect version (letting y grow without resetting) produced little difference in performance, even for larger values of M and N.
Here's the proper comparison, e.g. in loop.py:
M = 10000
N = 10000

def thegoodone():
    for x in xrange(N):
        for y in xrange(M):
            pass

def thebadone():
    x = 0
    while x < N:
        y = 0
        while y < M:
            y += 1
        x += 1
All substantial code should always be in functions -- putting a hundred million loops at a module's top level shows reckless disregard for performance and makes a mockery of any attempts at measuring said performance.
Once you've done that, you see:
$ python -mtimeit -s'import loop' 'loop.thegoodone()'
10 loops, best of 3: 3.45 sec per loop
$ python -mtimeit -s'import loop' 'loop.thebadone()'
10 loops, best of 3: 10.6 sec per loop
So, properly measured, the bad way that you advocate is about 3 times slower than the good way which Python promotes. I hope this makes you reconsider your erroneous advocacy.
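For readers on Python 3, where xrange() no longer exists and range() is lazy, a rough equivalent of this measurement can be sketched with the timeit module. The function names mirror the ones above; the much smaller N and M, the repeat counts, and the added count variable (so the two loops can be checked against each other) are assumptions made for this sketch. Absolute timings depend on machine and interpreter, so none are quoted.

```python
import timeit

M = 300
N = 300

def thegoodone():
    count = 0
    for x in range(N):
        for y in range(M):
            count += 1
    return count

def thebadone():
    count = 0
    x = 0
    while x < N:
        y = 0  # reset the inner counter on every outer pass
        while y < M:
            count += 1
            y += 1
        x += 1
    return count

if __name__ == "__main__":
    for func in (thegoodone, thebadone):
        # best of 3 runs, each run calling the function 10 times
        best = min(timeit.repeat(func, number=10, repeat=3))
        print("%s: %.4f s for 10 calls" % (func.__name__, best))
```

Keeping both loops inside functions matters here for the same reason given above: local variable access is what makes the comparison fair.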
You forgot to reset y to 0 after the inner loop.
#!/usr/bin/env python
M = 10000
N = 10000
if __name__ == "__main__":
    x, y = 0, 0
    while x < N:
        while y < M:
            y += 1
        x += 1
        y = 0
ed: 20.63s after fix vs. 6.97s using xrange
good for iterating over data structures
The for i in ... syntax is great for iterating over data structures. In a lower-level language, you would generally be iterating over an array indexed by an int, but with the python syntax you can eliminate the indexing step.
this is not a direct answer to the question, but i want to open the dialog a bit more on xrange(). two things:
A. there is something wrong with one of the OP statements that no one has corrected yet (yes, in addition to the bug in the code of not resetting y):
"I imagine it has a lot to do with calling xrange N times...."
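unlike traditional counting for loops, Python's is more like a shell's foreach... looping over an iterable. therefore, each for statement calls xrange() exactly once, when the loop starts, not once per iteration. (in the nested example the inner xrange(M) does get called once per outer pass, i.e. N times, but that is negligible next to the N*M iterations.)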
B. xrange() is the name of this function in Python 2. in Python 3 it replaces the old range() and takes over its name, so keep this in mind when porting. if you didn't know already, in Python 2 xrange() returns a lazy, iterator-like object while range() returns a list. since the latter is less memory-efficient, it was dropped in favor of xrange()'s behavior. the workaround in Python 3 for anyone who needs an actual list is list(range(N)).
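to make point A concrete, here is a minimal sketch in Python 3 syntax (where range() plays the role of xrange()). it records how often the iterable expression of each for statement is evaluated. counted_range and calls are hypothetical names introduced only for this demonstration.

```python
calls = []  # one entry per call to counted_range (hypothetical helper)

def counted_range(n):
    """Wrap range() so that every call is recorded."""
    calls.append(n)
    return range(n)

N, M = 3, 4
iterations = 0
for x in counted_range(N):      # evaluated once, when the outer loop starts
    for y in counted_range(M):  # evaluated once each time the inner loop starts
        iterations += 1

print(len(calls), iterations)
```

the number of calls grows with the number of times a loop is entered (1 + N here), not with the number of iterations (N * M).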
I've repeated the test from @Alex Martelli's answer. The idiomatic for loop is 5 times faster than the while loop:
python -mtimeit -s'from while_vs_for import while_loop as loop' 'loop(10000)'
10 loops, best of 3: 9.6 sec per loop
python -mtimeit -s'from while_vs_for import for_loop as loop' 'loop(10000)'
10 loops, best of 3: 1.83 sec per loop
while_vs_for.py:
def while_loop(N):
    x = 0
    while x < N:
        y = 0
        while y < N:
            pass
            y += 1
        x += 1

def for_loop(N):
    for x in xrange(N):
        for y in xrange(N):
            pass
At module level:
$ time -p python for.py
real 4.38
user 4.37
sys 0.01
$ time -p python while.py
real 14.28
user 14.28
sys 0.01
I'm trying to solve a Rosalind basic problem of counting nucleotides in a given sequence, and returning the results in a list. For those ones not familiar with bioinformatics it's just counting the number of occurrences of 4 different characters ('A','C','G','T') inside a string.
I expected collections.Counter to be the fastest method (first because they claim to be high-performance, and second because I saw a lot of people using it for this specific problem).
But to my surprise this method is the slowest!
I compared three different methods, using timeit and running two types of experiments:
Running a long sequence few times
Running a short sequence a lot of times.
Here is my code:
import timeit
from collections import Counter

# Method1: using count
def method1(seq):
    return [seq.count('A'), seq.count('C'), seq.count('G'), seq.count('T')]

# method 2: using a loop
def method2(seq):
    r = [0, 0, 0, 0]
    for i in seq:
        if i == 'A':
            r[0] += 1
        elif i == 'C':
            r[1] += 1
        elif i == 'G':
            r[2] += 1
        else:
            r[3] += 1
    return r

# method 3: using Collections.counter
def method3(seq):
    counter = Counter(seq)
    return [counter['A'], counter['C'], counter['G'], counter['T']]

if __name__ == '__main__':
    # Long dummy sequence
    long_seq = 'ACAGCATGCA' * 10000000
    # Short dummy sequence
    short_seq = 'ACAGCATGCA' * 1000
    # Test 1: Running a long sequence once
    print timeit.timeit("method1(long_seq)", setup='from __main__ import method1, long_seq', number=1)
    print timeit.timeit("method2(long_seq)", setup='from __main__ import method2, long_seq', number=1)
    print timeit.timeit("method3(long_seq)", setup='from __main__ import method3, long_seq', number=1)
    # Test2: Running a short sequence lots of times
    print timeit.timeit("method1(short_seq)", setup='from __main__ import method1, short_seq', number=10000)
    print timeit.timeit("method2(short_seq)", setup='from __main__ import method2, short_seq', number=10000)
    print timeit.timeit("method3(short_seq)", setup='from __main__ import method3, short_seq', number=10000)
Results:
Test1:
Method1: 0.224009990692
Method2: 13.7929501534
Method3: 18.9483819008
Test2:
Method1: 0.224207878113
Method2: 13.8520510197
Method3: 18.9861831665
Method 1 is way faster than method 2 and 3 for both experiments!!
So I have a set of questions:
Am I doing something wrong or it is indeed slower than the other two approaches? Could someone run the same code and share the results?
In case my results are correct, (and maybe this should be another question) is there a faster method to solve this problem than using method 1?
If count is faster, then what's the deal with collections.Counter?
It's not because collections.Counter is slow, it's actually quite fast, but it's a general purpose tool, counting characters is just one of many applications.
On the other hand str.count just counts characters in strings and it's heavily optimized for its one and only task.
That means that str.count can work on the underlying C-char array while it can avoid creating new (or looking up existing) length-1-python-strings during the iteration (which is what for and Counter do).
Just to add some more context to this statement.
A string is stored as a C array wrapped in a Python object. str.count knows that the string is a contiguous array, so it converts the character you want to count to a C "character", iterates over the array in native C code checking for equality, and finally wraps and returns the number of occurrences found.
On the other hand for and Counter use the python-iteration-protocol. Each character of your string will be wrapped as python-object and then it (hashes and) compares them within python.
So the slowdown is because:
Each character has to be converted to a Python object (this is the major reason for the performance loss)
The loop is done in Python (not applicable to Counter in python 3.x because it was rewritten in C)
Each comparison has to be done in Python (instead of just comparing numbers in C - characters are represented by numbers)
The counter needs to hash the values and your loop needs to index your list.
Note the reason for the slowdown is similar to the question about Why are Python's arrays slow?.
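As a quick sanity check of the explanation above, the following sketch (Python 3; the sequence length and repetition count are arbitrary assumptions, and with_count / with_counter are hypothetical names) confirms that both approaches agree on the result and times them with the standard timeit module. No timings are quoted because they vary by machine and Python version.

```python
import timeit
from collections import Counter

seq = "ACAGCATGCA" * 1000

def with_count(s):
    # four passes over the string, each handled by str.count
    return [s.count(ch) for ch in "ACGT"]

def with_counter(s):
    # one pass, but every character is handled as a separate Python object
    c = Counter(s)
    return [c[ch] for ch in "ACGT"]

# both approaches must produce identical counts
assert with_count(seq) == with_counter(seq)

if __name__ == "__main__":
    for func in (with_count, with_counter):
        t = timeit.timeit(lambda: func(seq), number=100)
        print("%s: %.4f s per 100 calls" % (func.__name__, t))
```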
I did some additional benchmarks to find out at which point collections.Counter is to be preferred over str.count. To this end I created random strings containing differing numbers of unique characters and plotted the performance:
from collections import Counter
import random
import string
characters = string.printable # 100 different printable characters
results_counter = []
results_count = []
nchars = []
for i in range(1, 110, 10):
    chars = characters[:i]
    # named text rather than string so the imported string module is not shadowed
    text = ''.join(random.choice(chars) for _ in range(10000))
    res1 = %timeit -o Counter(text)
    res2 = %timeit -o {char: text.count(char) for char in chars}
    nchars.append(len(chars))
    results_counter.append(res1)
    results_count.append(res2)
and the result was plotted using matplotlib:
import matplotlib.pyplot as plt
plt.figure()
plt.plot(nchars, [i.best * 1000 for i in results_counter], label="Counter", c='black')
plt.plot(nchars, [i.best * 1000 for i in results_count], label="str.count", c='red')
plt.xlabel('number of different characters')
plt.ylabel('time to count the chars in a string of length 10000 [ms]')
plt.legend()
Results for Python 3.5
The results for Python 3.6 are very similar so I didn't list them explicitly.
So if you want to count 80 different characters Counter becomes faster/comparable because it traverses the string only once and not multiple times like str.count. This will be weakly dependent on the length of the string (but testing showed only a very weak difference +/-2%).
Results for Python 2.7
In Python-2.7 collections.Counter was implemented using python (instead of C) and is much slower. The break-even point for str.count and Counter can only be estimated by extrapolation because even with 100 different characters the str.count is still 6 times faster.
The time difference here is pretty simple to explain. It all comes down to what runs within Python and what runs as native code. The latter will always be faster since it does not come with lots of evaluation overhead.
Now that’s already the reason why calling str.count() four times is faster than anything else. Although this iterates the string four times, these loops run in native code. str.count is implemented in C, so this has very little overhead, making this very fast. It’s really difficult to beat this, especially when the task is that simple (looking only for simple character equality).
Your second method, of collecting the counts in an array is actually a less performant version of the following:
def method4(seq):
    a, c, g, t = 0, 0, 0, 0
    for i in seq:
        if i == 'A':
            a += 1
        elif i == 'C':
            c += 1
        elif i == 'G':
            g += 1
        else:
            t += 1
    return [a, c, g, t]
Here, all four values are individual variables, so updating them is very fast. This is actually a bit faster than mutating list items.
The overall performance “problem” here is however that this iterates the string within Python. So this creates a string iterator and then produces every character individually as an actual string object. That’s a lot overhead and the main reason why every solution that works by iterating the string in Python will be slower.
The same problem applies to collections.Counter. In Python 2 it’s implemented entirely in Python, so even though it’s very efficient and flexible, it suffers from the same issue: it’s just never near native in terms of speed.
As others have already noted, you are comparing fairly specific code against fairly general one.
Consider that something as trivial as spelling out the loop over the characters you are interested in is already costing you a factor of 2, i.e.
def char_counter(text, chars='ACGT'):
    return [text.count(char) for char in chars]
%timeit method1(short_seq)
# 100000 loops, best of 3: 18.8 µs per loop
%timeit char_counter(short_seq)
# 10000 loops, best of 3: 40.8 µs per loop
%timeit method1(long_seq)
# 10 loops, best of 3: 172 ms per loop
%timeit char_counter(long_seq)
# 1 loop, best of 3: 374 ms per loop
Your method1() is the fastest but not the most efficient, as the input is looped through entirely for each char you are inspecting, thereby not taking advantage of the fact that you could easily short-circuit your looping as soon as a character gets assigned to one of the character classes.
Unfortunately, Python does not offer a fast method to take advantage of the specific conditions of your problem.
However, you could use Cython for this, and you would then be able to outperform your method1():
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
cdef void _count_acgt(
        const unsigned char[::1] text,
        unsigned long len_text,
        unsigned long[::1] counts):
    # single pass over the bytes; anything that is not A, C or G is counted as T
    for i in range(len_text):
        if text[i] == b'A':
            counts[0] += 1
        elif text[i] == b'C':
            counts[1] += 1
        elif text[i] == b'G':
            counts[2] += 1
        else:
            counts[3] += 1

cpdef ascii_count_acgt(text):
    counts = np.zeros(4, dtype=np.uint64)
    bin_text = text.encode()
    # _count_acgt fills counts in place and returns nothing
    _count_acgt(bin_text, len(bin_text), counts)
    return counts
%timeit ascii_count_acgt(short_seq)
# 100000 loops, best of 3: 12.6 µs per loop
%timeit ascii_count_acgt(long_seq)
# 10 loops, best of 3: 140 ms per loop
Given a number of players n, I need to find H, the list of all tuples where each tuple is a combination of coalitions (of the players, e.g. (1,2,3) is the coalition of players 1, 2 and 3. ((1,2,3),(4,5),(6,)) is a combination of coalitions - which are also tuples) that respects this rule: each player appears only and exactly once (i.e. in only one coalition).
P.S. Each combination of coalitions is called layout in the code.
At the beginning I wrote a snippet in which I computed all combinations of all coalitions and for each combination I checked the rule. Problem is that for 5-6 players the number of combinations of coalitions was already so big that my computer went phut.
In order to avoid a big part of the computation (all possible combinations, the loop and the ifs) I wrote the following (which I tested, and it is equivalent to the previous snippet):
from itertools import combinations, combinations_with_replacement, product, permutations
players = range(1,n+1)
coalitions = [[coal for coal in list(combinations(players,length))] for length in players]
H = [tuple(coalitions[0]),(coalitions[-1][0],)]
combs = [comb for length in xrange(2,n) for comb in combinations_with_replacement(players,length) if sum(comb) == n]
perms = list(permutations(players))
layouts = set(frozenset(frozenset(perm[i:i+x]) for (i,x) in zip([0]+[sum(comb[:y]) for y in xrange(1,len(comb))],comb)) for comb in combs for perm in perms)
H.extend(tuple(tuple(tuple(coal) for coal in layout) for layout in layouts))
print H
EXPLANATION: say n = 3
First I create all possible coalitions:
coalitions = [[(1,),(2,),(3,)],[(1,2),(1,3),(2,3)],[(1,2,3)]]
Then I initialize H with the obvious combinations: each player in his own coalition and every player in the biggest coalition.
H = [((1,),(2,),(3,)),((1,2,3),)]
Then I compute all the possible forms of the layouts:
combs = [(1,2)] #(1,2) represents a layout in which there is
#one 1-player coalition and one 2-player coalition.
I compute the permutations (perms).
Finally, for each perm and for each comb I calculate the different possible layouts. I convert the result (layouts) to a set in order to delete duplicates, and add it to H.
H = [((1,),(2,),(3,)),((1,2,3),),((1,2),(3,)),((1,3),(2,)),((2,3),(1,))]
Here's the comparison:
python script.py
4: 0.000520944595337 seconds
5: 0.0038321018219 seconds
6: 0.0408189296722 seconds
7: 0.431486845016 seconds
8: 6.05224680901 seconds
9: 76.4520540237 seconds
pypy script.py
4: 0.00342392921448 seconds
5: 0.0668039321899 seconds
6: 0.311077833176 seconds
7: 1.13124799728 seconds
8: 11.5973010063 seconds
9: went phut
Why is PyPy that much slower? What should I change?
First, I want to point out that you are studying the Bell numbers, which might ease the next part of your work, after you're done generating all the subsets. For example, it's easy to know how large each Bell set will be; OEIS has the sequence of Bell numbers already.
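Since the number of layouts for n players is the n-th Bell number, the expected size of H can be computed up front and used to check any generator. Below is a minimal sketch using the Bell triangle; bell_numbers is a hypothetical helper name, not something from the code in this thread.

```python
def bell_numbers(count):
    """Return the first `count` Bell numbers B(0), B(1), ... via the Bell triangle."""
    numbers = []
    row = [1]
    for _ in range(count):
        # the first entry of each triangle row is the next Bell number
        numbers.append(row[0])
        # the next row starts with the last entry of the current row
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return numbers

print(bell_numbers(6))
```

For n = 3 this gives 5, which matches the five layouts listed in the question's explanation.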
I hand-wrote the loops to generate the Bell sets; here is my code:
cache = {0: (), 1: ((set([1]),),)}

def bell(x):
    # Change these lines to alter memoization.
    if x in cache:
        return cache[x]
    previous = bell(x - 1)
    new = []
    for sets in previous:
        r = []
        # add x to each existing block in turn...
        for mark in range(len(sets)):
            l = [s | set([x]) if i == mark else s for i, s in enumerate(sets)]
            r.append(tuple(l))
        new.extend(r)
        # ...and also put x in a new block of its own
        new.append(sets + (set([x]),))
    cache[x] = tuple(new)
    # return the cached tuple so cached and uncached calls return the same type
    return cache[x]
I included some memoization here for practical purposes. However, by commenting out some code, and writing some other code, you can obtain the following un-memoized version, which I used for benchmarks:
def bell(x):
    if x == 0:
        return ()
    if x == 1:
        return ((set([1]),),)
    previous = bell(x - 1)
    new = []
    for sets in previous:
        r = []
        for mark in range(len(sets)):
            l = [s | set([x]) if i == mark else s for i, s in enumerate(sets)]
            r.append(tuple(l))
        new.extend(r)
        new.append(sets + (set([x]),))
    # no cache write here: this version is deliberately un-memoized
    return new
My numbers are based on a several-year-old Thinkpad that I do most of my work on. Most of the smaller cases are way too fast to measure reliably (not even a single millisecond per trial for the first few), so my benchmarks are testing bell(9) through bell(11).
Benchmarks for CPython 2.7.11, using the standard timeit module:
$ python -mtimeit -s 'from derp import bell' 'bell(9)'
10 loops, best of 3: 31.5 msec per loop
$ python -mtimeit -s 'from derp import bell' 'bell(10)'
10 loops, best of 3: 176 msec per loop
$ python -mtimeit -s 'from derp import bell' 'bell(11)'
10 loops, best of 3: 1.07 sec per loop
And on PyPy 4.0.1, also using timeit:
$ pypy -mtimeit -s 'from derp import bell' 'bell(9)'
100 loops, best of 3: 14.3 msec per loop
$ pypy -mtimeit -s 'from derp import bell' 'bell(10)'
10 loops, best of 3: 90.8 msec per loop
$ pypy -mtimeit -s 'from derp import bell' 'bell(11)'
10 loops, best of 3: 675 msec per loop
So, the conclusion that I've come to is that itertools is not very fast when you try to use it outside of its intended idioms. Bell numbers are interesting combinatorically but they do not naturally arise from any simple composition of itertools widgets that I can find.
In response to your original query of what to do to make it faster: Just open-code it. Hope this helps!
~ C.
Here's a Pypy issue on itertools.product.
https://bitbucket.org/pypy/pypy/issues/1677/itertoolsproduct-slower-than-nested-fors
Note that our goal is to ensure that itertools is not massively slower than
plain Python, but we don't really care about making it exactly as fast (or
faster) as plain Python. As long as it's not massively slower, it's fine. (At
least I don't agree with you about whether a) or b) is easier to read :-)
Without studying your code in detail, it looks like it makes heavy use of the itertools combinations, permutations and product functions. In regular CPython those are written in compiled C code, with the intention of making them fast. Pypy does not implement the C code, so it shouldn't be surprising that these functions are slower.
I am implementing a reverse(s) function in Python 2.7 and I made a code like this:
# iterative version 1
def reverse(s):
    r = ""
    for c in range(len(s)-1, -1, -1):
        r += s[c]
    return r

print reverse("Be sure to drink your Ovaltine")
But for each iteration, it gets the length of the string even though it's been deducted.
I made another version that stores the length first:
# iterative version 2
def reverse(s):
    r = ""
    l = len(s)-1
    for c in range(l, -1, -1):
        r += s[c]
    return r

print reverse("Be sure to drink your Ovaltine")
This version remembers the length of the string and doesn't ask for it every iteration, is this faster for longer strings (like a string that has the length of 1024) than the first version or does it have no effect at all?
In [12]: %timeit reverse("Be sure to drink your Ovaltine")
100000 loops, best of 3: 2.53 µs per loop
In [13]: %timeit reverse1("Be sure to drink your Ovaltine")
100000 loops, best of 3: 2.55 µs per loop
reverse is your first method, reverse1 is the second.
As you can see from timing there is very little difference in the performance.
You can use IPython to time your code with the above syntax: just def your functions, then use %timeit followed by your function and whatever parameters.
In the line
for c in range(len(s)-1, -1, -1):
len(s) is evaluated only once, and the result (minus one) passed as an argument to range. Therefore the two versions are almost identical - if anything, the latter may be (very) slightly slower, as it creates a new name to assign the result of the subtraction.
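One way to check this directly is to count the len() calls. The sketch below (Python 3 syntax) uses a hypothetical str subclass, CountingStr, that exists only for this demonstration; reverse is the first version from the question.

```python
class CountingStr(str):
    """str subclass that counts how often len() is called on it (demo only)."""
    len_calls = 0

    def __len__(self):
        CountingStr.len_calls += 1
        return str.__len__(self)

def reverse(s):
    r = ""
    for c in range(len(s) - 1, -1, -1):
        r += s[c]
    return r

s = CountingStr("Be sure to drink your Ovaltine")
CountingStr.len_calls = 0  # only count calls made while reversing
result = reverse(s)
print(CountingStr.len_calls, result)
```

On CPython the counter should end at 1 regardless of the string's length, because the arguments to range() are evaluated once, before the loop starts.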
Prelude
I have two implementations for a particular problem, one recursive and one iterative, and I want to know what causes the iterative solution to be ~30% slower than the recursive one.
Given the recursive solution, I write an iterative solution making the stack explicit.
Clearly, I simply mimic what the recursion is doing, so of course the Python engine is better optimized to handle the bookkeeping. But can we write an iterative method with similar performance?
My case study is Problem #14 on Project Euler.
Find the longest Collatz chain with a starting number below one million.
Code
Here is a parsimonious recursive solution (credit due to veritas in the problem thread plus an optimization from jJjjJ):
def solve_PE14_recursive(ub=10**6):
    def collatz_r(n):
        if not n in table:
            if n % 2 == 0:
                table[n] = collatz_r(n // 2) + 1
            elif n % 4 == 3:
                table[n] = collatz_r((3 * n + 1) // 2) + 2
            else:
                table[n] = collatz_r((3 * n + 1) // 4) + 3
        return table[n]
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_r)
Here's my iterative version:
def solve_PE14_iterative(ub=10**6):
    def collatz_i(n):
        stack = []
        while not n in table:
            if n % 2 == 0:
                x, y = n // 2, 1
            elif n % 4 == 3:
                x, y = (3 * n + 1) // 2, 2
            else:
                x, y = (3 * n + 1) // 4, 3
            stack.append((n, y))
            n = x
        ysum = table[n]
        for x, y in reversed(stack):
            ysum += y
            table[x] = ysum
        return ysum
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_i)
And the timings on my machine (i7 machine with lots of memory) using IPython:
In [3]: %timeit solve_PE14_recursive()
1 loops, best of 3: 942 ms per loop
In [4]: %timeit solve_PE14_iterative()
1 loops, best of 3: 1.35 s per loop
Comments
The recursive solution is awesome:
Optimized to skip a step or two depending on the two least significant bits.
My original solution didn't skip any Collatz steps and took ~1.86 s
It is difficult to hit Python's default recursion limit of 1000.
collatz_r(9780657630) returns 1133 but requires less than 1000 recursive calls.
Memoization avoids retracing
collatz_r length calculated on-demand for max
Playing around with it, timings seem to be precise to +/- 5 ms.
Languages with static typing like C and Haskell can get timings below 100 ms.
I put the initialization of the memoization table in the method by design for this question, so that timings would reflect the "re-discovery" of the table values on each invocation.
collatz_r(2**1002) raises RuntimeError: maximum recursion depth exceeded.
collatz_i(2**1002) happily returns with 1003.
I am familiar with generators, coroutines, and decorators.
I am using Python 2.7. I am also happy to use Numpy (1.8 on my machine).
What I am looking for
an iterative solution that closes the performance gap
discussion on how Python handles recursion
the finer details of the performance penalties associated with an explicit stack
I'm looking mostly for the first, though the second and third are very important to this problem and would increase my understanding of Python.
Here's my shot at a (partial) explanation after running some benchmarks, which confirm your figures.
While recursive function calls are expensive in CPython, they aren't nearly as expensive as emulating a call stack using lists. The stack for a recursive call is a compact structure implemented in C (see Eli Bendersky's explanation and the file Python/ceval.c in the source code).
By contrast, your emulated stack is a Python list object, i.e. a heap-allocated, dynamically growing array of pointers to tuple objects, which in turn point to the actual values; goodbye, locality of reference, hello cache misses. You then use Python's notoriously slow iteration on these objects. A line-by-line profiling with kernprof confirms that iteration and list handling are taking a lot of time:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                           #profile
    17                                           def collatz_i(n):
    18    750000       339195      0.5      2.4      stack = []
    19   3702825      1996913      0.5     14.2      while not n in table:
    20   2952825      1329819      0.5      9.5          if n % 2 == 0:
    21    864633       416307      0.5      3.0              x, y = n // 2, 1
    22   2088192       906202      0.4      6.4          elif n % 4 == 3:
    23   1043583       617536      0.6      4.4              x, y = (3 * n + 1) // 2, 2
    24                                                   else:
    25   1044609       601008      0.6      4.3              x, y = (3 * n + 1) // 4, 3
    26   2952825      1543300      0.5     11.0          stack.append((n, y))
    27   2952825      1150867      0.4      8.2          n = x
    28    750000       352395      0.5      2.5      ysum = table[n]
    29   3702825      1693252      0.5     12.0      for x, y in reversed(stack):
    30   2952825      1254553      0.4      8.9          ysum += y
    31   2952825      1560177      0.5     11.1          table[x] = ysum
    32    750000       305911      0.4      2.2      return ysum
Interestingly, even n = x takes around 8% of the total running time.
(Unfortunately, I couldn't get kernprof to produce something similar for the recursive version.)
Iterative code is sometimes faster than recursive because it avoids function call overhead. However, stack.append is also a function call (and an attribute lookup on top) and adds similar overhead. Counting the append calls, the iterative version makes just as many function calls as the recursive version.
Comparing the first two and the last two timings here...
$ python -m timeit pass
10000000 loops, best of 3: 0.0242 usec per loop
$ python -m timeit -s "def f(n): pass" "f(1)"
10000000 loops, best of 3: 0.188 usec per loop
$ python -m timeit -s "def f(n): x=[]" "f(1)"
1000000 loops, best of 3: 0.234 usec per loop
$ python -m timeit -s "def f(n): x=[]; x.append" "f(1)"
1000000 loops, best of 3: 0.336 usec per loop
$ python -m timeit -s "def f(n): x=[]; x.append(1)" "f(1)"
1000000 loops, best of 3: 0.499 usec per loop
...confirms that the append call excluding attribute lookup takes approximately the same time as calling a minimal pure Python function, ~170 ns.
From the above I conclude that the iterative version does not enjoy an inherent advantage. The next question to consider is which one does more work. To get a (very) rough estimate, we can look at the number of lines executed in each version. I did a quick experiment to find out that:
collatz_r is called 1234275 times, and the body of the if block executes 984275 times.
collatz_i is called 250000 times, and the while loop runs 984275 times
Now, let's say collatz_r has 2 lines outside the if and 4 lines inside (that are executed in the worst case, when the else is hit). That adds up to 6.4 million lines to execute. Comparable figures for collatz_i could be 5 and 9, which add up to about 10.1 million.
Even if that was just a rough estimate, it is well enough in line with the actual timings.
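The estimate can be reproduced with a few lines of arithmetic (Python 3 syntax). The call and iteration counts are the ones quoted above; the lines-per-call figures are the same rough assumptions, not measurements.

```python
# Call and iteration counts quoted above.
r_calls, r_if_bodies = 1234275, 984275
i_calls, i_loop_runs = 250000, 984275

# Assumed line counts: lines outside the hot block per call, lines inside it per pass.
r_lines = r_calls * 2 + r_if_bodies * 4
i_lines = i_calls * 5 + i_loop_runs * 9

print(r_lines, i_lines, round(i_lines / r_lines, 2))
```

The resulting ratio of executed lines is in the same ballpark as the ratio of the measured timings (1.35 s versus 942 ms), which is all this kind of estimate can be expected to show.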
Python's len() and padding functions like string.ljust() are not tabstop-aware, i.e. they treat '\t' like any other single-width character, and don't round len() up to the nearest multiple of tabstop. Example:
len('Bear\tnecessities\t')
is 17 instead of 24 ( i.e. 4+(8-4)+11+(8-3) )
and say I also want a function pad_with_tabs(s) such that
pad_with_tabs('Bear', 15) = 'Bear\t\t'
Looking for simple implementations of these - compactness and readability first, efficiency second.
This is a basic but irritating question.
@gnibbler - can you show a purely Pythonic solution, even if it's say 20x less efficient?
Sure you could convert back and forth using str.expandtabs(TABWIDTH), but that's clunky.
Importing math to get TABWIDTH * int( math.ceil(len(s)*1.0/TABWIDTH) ) also seems like massive overkill.
I couldn't manage anything more elegant than the following:
TABWIDTH = 8
def pad_with_tabs(s,maxlen):
    s_len = len(s)
    while s_len < maxlen:
        s += '\t'
        s_len += TABWIDTH - (s_len % TABWIDTH)
    return s
and since Python strings are immutable, and unless we want to monkey-patch our function into the string module as a method, we must also assign the result of the function back to the variable:
s = pad_with_tabs(s, ...)
In particular I couldn't get clean approaches using list-comprehension or string.join(...):
''.join([s, '\t' * ntabs])
without special-casing the cases where len(s) is less than an integer multiple of TABWIDTH, or where len(s) >= maxlen already.
Can anyone show better len() and pad_with_tabs() functions?
TABWIDTH=8

def my_len(s):
    return len(s.expandtabs(TABWIDTH))

def pad_with_tabs(s,maxlen):
    # // keeps this an integer division on Python 3 as well
    return s+"\t"*((maxlen-len(s)-1)//TABWIDTH+1)
Why did I use expandtabs()?
Well it's fast
$ python -m timeit '"Bear\tnecessities\t".expandtabs()'
1000000 loops, best of 3: 0.602 usec per loop
$ python -m timeit 'for c in "Bear\tnecessities\t":pass'
100000 loops, best of 3: 2.32 usec per loop
$ python -m timeit '[c for c in "Bear\tnecessities\t"]'
100000 loops, best of 3: 4.17 usec per loop
$ python -m timeit 'map(None,"Bear\tnecessities\t")'
100000 loops, best of 3: 2.25 usec per loop
Anything that iterates over your string is going to be slower, because just the iteration is ~4 times slower than expandtabs even when you do nothing in the loop.
$ python -m timeit '"Bear\tnecessities\t".split("\t")'
1000000 loops, best of 3: 0.868 usec per loop
Even just splitting on tabs takes longer. You'd still need to iterate over the split and pad each item to the tabstop
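For strings that may already contain tabs, the expandtabs() idea can be combined with a tab count based on tab stops rather than raw length. The following is a minimal sketch under that assumption: tab_len is a hypothetical name, and the tab-count expression follows the floor-division formula used in another answer in this thread, so it behaves the same on Python 2 and 3.

```python
TABWIDTH = 8

def tab_len(s, tabwidth=TABWIDTH):
    """Display width of s once tabs are expanded to the next tab stop."""
    return len(s.expandtabs(tabwidth))

def pad_with_tabs(s, maxlen, tabwidth=TABWIDTH):
    """Append tabs until the display width of s reaches at least maxlen."""
    width = tab_len(s, tabwidth)
    if width >= maxlen:
        return s
    # number of tab stops between the current width and the stop covering maxlen
    ntabs = (maxlen - 1) // tabwidth + 1 - width // tabwidth
    return s + "\t" * ntabs

print(tab_len('Bear\tnecessities\t'))
print(repr(pad_with_tabs('Bear', 15)))
```

With the examples from the question this yields a width of 24 and the padded string 'Bear\t\t'.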
I believe gnibbler's is the best for most practical cases. But anyway, here is a naive solution (it does not account for CR, LF etc.) that computes the length of a string without creating an expanded copy:
def tab_aware_len(s, tabstop=8):
    pos = -1
    extra_length = 0
    while True:
        pos = s.find('\t', pos+1)
        if pos<0:
            return len(s) + extra_length
        extra_length += tabstop - (pos+extra_length) % tabstop - 1
It could be useful for huge strings or even memory-mapped files. And here is a slightly optimized padding function:
def pad_with_tabs(s, max_len, tabstop=8):
    length = tab_aware_len(s, tabstop)
    if length<max_len:
        s += '\t' * ((max_len-1)//tabstop + 1 - length//tabstop)
    return s
TABWIDTH * int( math.ceil(len(s)*1.0/TABWIDTH) ) is indeed a massive over-kill; you can get the same result much more simply. For positive i and n, use:
def round_up_positive_int(i, n):
    return ((i + n - 1) // n) * n
This procedure works in just about any language I've ever used, after appropriate translation.
Then you can do next_pos = round_up_positive_int(len(s), TABWIDTH)
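A short usage sketch of the rounding helper, with a few lengths chosen arbitrarily for illustration:

```python
def round_up_positive_int(i, n):
    # round i up to the nearest multiple of n (both positive integers)
    return ((i + n - 1) // n) * n

TABWIDTH = 8
for length in (1, 8, 17):
    print(length, round_up_positive_int(length, TABWIDTH))
```

A length that is already a multiple of TABWIDTH is left unchanged; anything else moves up to the next tab stop.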
For a slight increase in the elegance of your code, instead of
while(s_len < maxlen):
use this:
while s_len < maxlen:
Unfortunately I was unable to use the accepted answer as is, so here is a slightly modified version in case someone runs into the same problem and finds this post via search:
from decimal import Decimal, ROUND_HALF_UP
TABWIDTH = 4
def pad_with_tabs(src, max_len):
    return src + "\t" * int(
        Decimal((max_len - len(src.expandtabs(TABWIDTH))) / TABWIDTH + 1).quantize(0, ROUND_HALF_UP))

def pad_fields(input):
    result = []
    longest = max(len(x) for x in input)
    for row in input:
        result.append(pad_with_tabs(row, longest))
    return result
The output list contains properly padded rows, with the tab count rounded so that the resulting data has the same indentation level, including the corner .5 cases where the original answer adds no tab.