I benchmarked these two functions (they unzip pairs back into the source lists; they came from here):
n = 10**7
a = list(range(n))
b = list(range(n))
pairs = list(zip(a, b))
def f1(a, b, pairs):
    a[:], b[:] = zip(*pairs)

def f2(a, b, pairs):
    for i, (a[i], b[i]) in enumerate(pairs):
        pass
Results with timeit.timeit (five rounds, numbers are seconds):
f1 1.06 f2 1.57
f1 0.96 f2 1.69
f1 1.00 f2 1.85
f1 1.11 f2 1.64
f1 0.95 f2 1.63
So clearly f1 is a lot faster than f2, right?
But then I also measured with timeit.default_timer and got a completely different picture:
f1 7.28 f2 1.92
f1 5.34 f2 1.66
f1 6.46 f2 1.70
f1 6.82 f2 1.59
f1 5.88 f2 1.63
So clearly f2 is a lot faster, right?
Sigh. Why do the timings totally differ like that, and which timing method should I believe?
Full benchmark code:
from timeit import timeit, default_timer
n = 10**7
a = list(range(n))
b = list(range(n))
pairs = list(zip(a, b))
def f1(a, b, pairs):
    a[:], b[:] = zip(*pairs)

def f2(a, b, pairs):
    for i, (a[i], b[i]) in enumerate(pairs):
        pass

print('timeit')
for _ in range(5):
    for f in f1, f2:
        t = timeit(lambda: f(a, b, pairs), number=1)
        print(f.__name__, '%.2f' % t, end=' ')
    print()

print('default_timer')
for _ in range(5):
    for f in f1, f2:
        t0 = default_timer()
        f(a, b, pairs)
        t = default_timer() - t0
        print(f.__name__, '%.2f' % t, end=' ')
    print()
As Martijn commented, the difference is Python's garbage collection, which timeit.timeit disables during its run. And zip creates 10 million iterator objects, one for each of the 10 million iterables it's given.
So, garbage-collecting 10 million objects simply takes a lot of time, right? Mystery solved!
Well... no. That's not really what happens, and it's way more interesting than that. And there's a lesson to be learned to make such code faster in real life.
Python's main way to discard objects no longer needed is reference counting. The garbage collector, which is being disabled here, is for reference cycles, which the reference counting won't catch. And there aren't any cycles here, so it's all discarded by reference counting and the garbage collector doesn't actually collect any garbage.
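To make the distinction concrete, here is a tiny demo of my own (not part of the benchmark; the Node class is just an illustration): an acyclic object is freed by reference counting alone, even with the collector disabled, while a cyclic one needs the collector.
import gc
import weakref

class Node:
    pass

gc.disable()

# Acyclic: reference counting frees it immediately, no collector needed.
n = Node()
alive = weakref.ref(n)
del n
print(alive())          # None: already reclaimed

# Cyclic: the reference counts never drop to zero, only the collector helps.
a, b = Node(), Node()
a.other, b.other = b, a
alive = weakref.ref(a)
del a, b
print(alive())          # still a Node object: the cycle keeps it alive
print(gc.collect())     # > 0: the collector finds and frees the cycle
print(alive())          # None now

gc.enable()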
Let's look at a few things. First, let's reproduce the much faster time by disabling the garbage collector ourselves.
Common setup code (all further blocks of code should be run directly after this in a fresh run, don't combine them):
import gc
from timeit import default_timer as timer
n = 10**7
a = list(range(n))
b = list(range(n))
pairs = list(zip(a, b))
Timing with garbage collection enabled (the default):
t0 = timer()
a[:], b[:] = zip(*pairs)
t1 = timer()
print(t1 - t0)
I ran it three times, took 7.09, 7.03 and 7.09 seconds.
Timing with garbage collection disabled:
t0 = timer()
gc.disable()
a[:], b[:] = zip(*pairs)
gc.enable()
t1 = timer()
print(t1 - t0)
Took 0.96, 1.02 and 0.99 seconds.
So now we know it's indeed the garbage collection that somehow takes most of the time, even though it's not collecting anything.
Here's something interesting: Already just the creation of the zip iterator is responsible for most of the time:
t0 = timer()
z = zip(*pairs)
t1 = timer()
print(t1 - t0)
That took 6.52, 6.51 and 6.50 seconds.
Note that I kept the zip iterator in a variable, so there isn't even anything to discard yet, neither by reference counting nor by garbage collecting!
What?! Where does the time go, then?
Well... as I said, there are no reference cycles, so the garbage collector won't actually collect any garbage. But the garbage collector doesn't know that! In order to figure that out, it needs to check!
Since the iterators could become part of a reference cycle, they're registered for garbage collection tracking. Let's see how many more objects get tracked due to the zip creation (doing this just after the common setup code):
gc.collect()
tracked_before = len(gc.get_objects())
z = zip(*pairs)
print(len(gc.get_objects()) - tracked_before)
The output: 10000003 new objects tracked. I believe that's the zip object itself, its internal tuple to hold the iterators, its internal result holder tuple, and the 10 million iterators.
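As a quick sanity check of my own that individual iterators really are tracked (tuple and list iterators hold references to container objects, so they carry the GC flag):
import gc

print(gc.is_tracked(iter((1, 2))))   # True: a tuple iterator is tracked
print(gc.is_tracked(iter([1, 2])))   # True: so is a list iterator
print(gc.is_tracked(42))             # False: a plain int is not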
Ok, so the garbage collector tracks all these objects. But what does that mean? Well, every now and then, after a certain number of new object creations, the collector goes through the tracked objects to see whether some are garbage and can be discarded. The collector keeps three "generations" of tracked objects. New objects go into generation 0. If they survive a collection run there, they're moved into generation 1. If they survive a collection there, they're moved into generation 2. If they survive further collection runs there, they remain in generation 2. Let's check the generations before and after:
gc.collect()
print('collections:', [stats['collections'] for stats in gc.get_stats()])
print('objects:', [len(gc.get_objects(i)) for i in range(3)])
z = zip(*pairs)
print('collections:', [stats['collections'] for stats in gc.get_stats()])
print('objects:', [len(gc.get_objects(i)) for i in range(3)])
Output (each line shows values for the three generations):
collections: [13111, 1191, 2]
objects: [17, 0, 13540]
collections: [26171, 2378, 20]
objects: [317, 2103, 10011140]
The 10011140 shows that most of the 10 million iterators were not just registered for tracking, but are already in generation 2. So they were part of at least two garbage collection runs. And the number of generation 2 collections went up from 2 to 20, so our millions of iterators were part of up to 20 garbage collection runs (two to get into generation 2, and up to 18 more while already in generation 2). We can also register a callback to count more precisely:
checks = 0
def count(phase, info):
    if phase == 'start':
        global checks
        checks += len(gc.get_objects(info['generation']))
gc.callbacks.append(count)
z = zip(*pairs)
gc.callbacks.remove(count)
print(checks)
That told me 63,891,314 checks total (i.e., on average, each iterator was part of over 6 garbage collection runs). That's a lot of work. And all this just to create the zip iterator, before even using it.
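Why does the collector run so often during that one call? Because collections are triggered by allocation counters. As a small aside of my own, you can look at the thresholds (the exact values can differ between CPython versions and builds):
import gc

# Typically (700, 10, 10): a generation-0 collection roughly every 700 net new
# tracked allocations, a gen-1 collection every 10 gen-0 runs, and a gen-2
# collection every 10 gen-1 runs.
print(gc.get_threshold())
print(gc.get_count())   # current counters for the three generations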
Meanwhile, the loop
for i, (a[i], b[i]) in enumerate(pairs):
    pass
creates almost no new objects at all. Let's check how much tracking enumerate causes:
gc.collect()
tracked_before = len(gc.get_objects())
e = enumerate(pairs)
print(len(gc.get_objects()) - tracked_before)
Output: 3 new objects tracked (the enumerate iterator object itself, the single iterator it creates for iterating over pairs, and the result tuple it'll use (code here)).
I'd say that answers the question "Why do the timings totally differ like that?". The zip solution creates millions of objects that go through multiple garbage collection runs, while the loop solution doesn't. So disabling the garbage collector helps the zip solution tremendously, while the loop solution doesn't care.
Now about the second question: "Which timing method should I believe?". Here's what the documentation has to say about it (emphasis mine):
By default, timeit() temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. The disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string. For example:
timeit.Timer('for i in range(10): oct(i)', 'gc.enable()').timeit()
In our case here, the cost of garbage collection doesn't stem from some other unrelated code. It's directly caused by the zip call. And you do pay this price in reality, when you run that. So in this case, I do consider it an "important component of the performance of the function being measured". To directly answer the question as asked: Here I'd believe the default_timer method, not the timeit method. Or put differently: Here the timeit method should be used with garbage collection enabled, as suggested in the documentation.
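For completeness, here is roughly what that looks like for this benchmark (my own sketch, reusing f1, a, b and pairs from above): pass a setup string that re-enables the collector, so timeit measures the same thing default_timer did.
from timeit import timeit

# f1, a, b, pairs as defined in the benchmark code above
t = timeit(lambda: f1(a, b, pairs), setup='import gc; gc.enable()', number=1)
print('f1 with GC enabled: %.2f' % t)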
Or... alternatively, we could actually disable garbage collection as part of the solution (not just for benchmarking):
def f1(a, b, pairs):
    gc.disable()
    a[:], b[:] = zip(*pairs)
    gc.enable()
But is that a good idea? Here's what the gc documentation says:
Since the collector supplements the reference counting already used in Python, you can disable the collector if you are sure your program does not create reference cycles.
Sounds like it's an ok thing to do. But I'm not sure I don't create reference cycles elsewhere in my program, so I finish with gc.enable() to turn garbage collection back on after I'm done. At that point, all those temporary objects have already been discarded thanks to reference counting. So all I'm doing is avoiding lots of pointless garbage collection checks. I find this a valuable lesson and I might actually do that in the future, if I know I only temporarily create a lot of objects.
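If you go that route, a slightly more defensive variant (my own sketch, not from the original answer) restores the collector even if the code in between raises, and only re-enables it if it was enabled to begin with:
import gc
from contextlib import contextmanager

@contextmanager
def gc_paused():
    # remember the previous state so nested or pre-disabled uses behave sanely
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

def f1(a, b, pairs):
    with gc_paused():
        a[:], b[:] = zip(*pairs)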
Finally, I highly recommend reading the gc module documentation and the Design of CPython’s Garbage Collector in Python's developer guide. Most of it is easy to understand, and I found it quite interesting and enlightening.
Related
I want to parallelise the operation of a function on each element of a list using ray. A simplified snippet is below
import numpy as np
import time
import ray
import psutil
num_cpus = psutil.cpu_count(logical=False)
ray.init(num_cpus=num_cpus)
@ray.remote
def f(a, b, c):
    return a * b - c

def g(a, b, c):
    return a * b - c

def my_func_par(large_list):
    # arguments a and b are constant, just to illustrate
    # argument c is each element of the list large_list
    [f.remote(1.5, 2, i) for i in large_list]

def my_func_seq(large_list):
    # arguments a and b are constant, just to illustrate
    # argument c is each element of the list large_list
    [g(1.5, 2, i) for i in large_list]
my_list = np.arange(1, 10000)
s = time.time()
my_func_par(my_list)
print(time.time() - s)
>>> 2.007
s = time.time()
my_func_seq(my_list)
print(time.time() - s)
>>> 0.0372
The problem is, when I time my_func_par, it is much slower (~54x, as can be seen above) than my_func_seq. One of the authors of ray answered a comment on this blog, which seems to explain that what I am doing is setting up len(large_list) different tasks, which is incorrect.
How do I use ray and modify the code above to run it in parallel? (maybe by splitting large_list into chunks with the number of chunks being equal to the number of cpus)
EDIT: There are two important criteria in this question
The function f needs to accept multiple arguments
It may be necessary to use ray.put(large_list) so that the large_list variable can be stored in shared memory rather than copied to each processor
To add to what Sang said above:
Ray Distributed multiprocessing.Pool supports a fixed-size pool of Ray Actors for easier parallelization.
import numpy as np
import time
import ray
from ray.util.multiprocessing import Pool
pool = Pool()
def f(x):
    # time.sleep(1)
    return 1.5 * 2 - x

def my_func_par(large_list):
    pool.map(f, large_list)

def my_func_seq(large_list):
    [f(i) for i in large_list]
my_list = np.arange(1, 10000)
s = time.time()
my_func_par(my_list)
print('Parallel time: ' + str(time.time() - s))
s = time.time()
my_func_seq(my_list)
print('Sequential time: ' + str(time.time() - s))
With the above code, my_func_par runs much faster (about 0.1 sec). If you play with the code and make f(x) slower by something like time.sleep, you can see the clear advantage of multiprocessing.
The reason the parallelized version is slower is that running a Ray task unavoidably has some overhead (although a lot of effort goes into optimizing it), because running things in parallel requires inter-process communication, serialization, and so on.
That being said, your function is really fast: running it takes far less time than the other overhead involved in distributed computation (I assume the function f takes less than a microsecond to run), so the overhead dominates.
This means you should make the function f computationally heavier in order to benefit from parallelization. Your proposed solution might not work because, even after that, the function f might still be lightweight enough, depending on your list size.
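If you do want to try the chunking idea from the question, here is a rough sketch of my own (the names f_chunk and my_func_par_chunked are made up for illustration): each Ray task processes a whole chunk, so the fixed per-task overhead is paid once per chunk instead of once per element.
import numpy as np
import psutil
import ray

num_cpus = psutil.cpu_count(logical=False)
ray.init(num_cpus=num_cpus, ignore_reinit_error=True)

@ray.remote
def f_chunk(a, b, chunk):
    # do the whole chunk's work inside a single task
    return [a * b - c for c in chunk]

def my_func_par_chunked(large_list, num_chunks=num_cpus):
    chunks = np.array_split(large_list, num_chunks)
    futures = [f_chunk.remote(1.5, 2, chunk) for chunk in chunks]
    # flatten the per-chunk results back into one list
    return [x for part in ray.get(futures) for x in part]

result = my_func_par_chunked(np.arange(1, 10000))
Each task now does work proportional to its chunk size, which is exactly the "make the function heavier" advice above, just applied by batching rather than by changing f itself.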
Say we add a group of long strings to a hashset, and then test if some string already exists in this hashset. Is the time complexity going to be constant for adding and retrieving operations? Or does it depend on the length of the strings?
For example, if we have three strings.
s1 = 'abcdefghijklmn'
s2 = 'dalkfdboijaskjd'
s3 = 'abcdefghijklmn'
Then we do:
pool = set()
pool.add(s1)
pool.add(s2)
print s3 in pool # => True
print 'zzzzzzzzzz' in pool # => False
Would time complexity of the above operations be a factor of the string length?
Another question: what if we are hashing a tuple? Something like (1,2,3,4,5,6,7,8,9)?
I appreciate your help!
==================================
I understand that there are resources around like this one that is talking about why hashing is constant time and collision issues. However, they usually assumed that the length of the key can be neglected. This question asks if hashing still has constant time when the key has a length that cannot be neglected. For example, if we are to judge N times if a key of length K is in the set, is the time complexity O(N) or O(N*K).
One of the best ways to answer something like this is to dig into the implementation :)
Notwithstanding some of the optimization magic described in the header of setobject.c, adding an object to a set reuses a string's hash if hash() has already been called on it once (recall, strings are immutable), or otherwise calls the type's hash implementation.
For Unicode/bytes objects, we end up via here at _Py_HashBytes, which seems to have an optimization for small strings; otherwise it uses the compile-time configured hash function, all of which naturally are somewhat O(n)-ish. But again, this seems to only happen once per string object.
For tuples, the hash implementation can be found here – apparently a simplified, non-cached xxHash.
However, once the hash has been computed, the time complexity for sets should be around O(1).
EDIT: A quick, not very scientific benchmark:
import time
def make_string(c, n):
    return c * n

def make_tuple(el, n):
    return (el,) * n

def hashtest(gen, n):
    # First compute how long generation alone takes
    gen_time = time.perf_counter()
    for x in range(n):
        gen()
    gen_time = time.perf_counter() - gen_time
    # Then compute how long hashing and generation takes
    hash_and_gen_time = time.perf_counter()
    for x in range(n):
        hash(gen())
    hash_and_gen_time = time.perf_counter() - hash_and_gen_time
    # Return the two
    return (hash_and_gen_time, gen_time)

for gen in (make_string, make_tuple):
    for obj_length in (10000, 20000, 40000):
        t = f"{gen.__name__} x {obj_length}"
        # Using `b'hello'.decode()` here to avoid any cached hash shenanigans
        hash_and_gen_time, gen_time = hashtest(
            lambda: gen(b"hello".decode(), obj_length), 10000
        )
        hash_time = hash_and_gen_time - gen_time
        print(t, hash_time, obj_length / hash_time)
outputs
make_string x 10000 0.23490356100000004 42570.66158311665
make_string x 20000 0.47143921999999994 42423.284172241765
make_string x 40000 0.942087403 42458.905482254915
make_tuple x 10000 0.45578034300000025 21940.393335480014
make_tuple x 20000 0.9328520900000008 21439.62608263008
make_tuple x 40000 1.8562772150000004 21548.505620158674
which basically says hashing sequences, be they strings or tuples, is linear time, yet hashing strings is a lot faster than hashing tuples.
EDIT 2: this proves strings and bytestrings cache their hashes:
import time
s = ('x' * 500_000_000)
t0 = time.perf_counter()
a = hash(s)
t1 = time.perf_counter()
print(t1 - t0)
t0 = time.perf_counter()
b = hash(s)
t2 = time.perf_counter()
assert a == b
print(t2 - t0)
outputs
0.26157095399999997
1.201999999977943e-06
Strictly speaking it depends on the implementation of the hash set and the way you're using it (there may be cleverness that will optimize some of the time away in specialized circumstances), but in general, yes, you should expect that it will take O(n) time to hash a key to do an insert or lookup where n is the size of the key. Usually hash sets are assumed to be O(1), but there's an implicit assumption there that the keys are of fixed size and that hashing them is a O(1) operation (in other words, there's an assumption that the key size is negligible compared to the number of items in the set).
Optimizing the storage and retrieval of really big chunks of data is why databases are a thing. :)
Average case is O(1).
However, the worst case is O(n), with n being the number of elements in the set. This case is caused by hashing collisions.
You can read more about it here:
https://www.geeksforgeeks.org/internal-working-of-set-in-python/
Wiki is your friend
https://wiki.python.org/moin/TimeComplexity
For the operations above, it seems that they are all O(1) for a set.
So the question regarding the speed of for loops vs while loops has been asked many times before. The for loop is supposed to be faster.
However, when I tested it in Python 3.5.1 the results were as follows:
timeit.timeit('for i in range(10000): True', number=10000)
>>> 12.697646026868842
timeit.timeit('while i<10000: True; i+=1',setup='i=0', number=10000)
>>> 0.0032265179766799434
The while loop runs >3000 times faster than the for loop! I've also tried pre-generating a list for the for loop:
timeit.timeit('for i in lis: True',setup='lis = [x for x in range(10000)]', number=10000)
>>> 3.638794646750142
timeit.timeit('while i<10000: True; i+=1',setup='i=0', number=10000)
>>> 0.0032454974941904524
Which made the for loop 3 times faster, but the difference is still 3 orders of magnitude.
Why does this happen?
You are creating 10k range() objects. These take some time to materialise. You then have to create iterator objects for those 10k objects too (for the for loop to iterate over the values). Next, the for loop uses the iterator protocol by calling the __next__ method on the resulting iterator. Those latter two steps also apply to the for loop over a list.
But most of all, you are cheating on the while loop test. The while loop only has to run once, because you never reset i back to 0 (thanks to Jim Fasarakis Hilliard pointing that out). You are in effect running a while loop through a total of 19999 comparisons; the first test runs 10k comparisons, the remaining 9999 tests run one comparison. And that comparison is fast:
>>> import timeit
>>> timeit.timeit('while i<10000: True; i+=1',setup='i=0', number=10000)
0.0008302750065922737
>>> (
... timeit.timeit('while i<10000: True; i+=1', setup='i=0', number=1) +
... timeit.timeit('10000 < 10000', number=9999)
... )
0.0008467709994874895
See how close those numbers are?
My machine is a little faster, so let's create a baseline to compare against; this is using 3.6.1 on a Macbook Pro (Retina, 15-inch, Mid 2015) running on OS X 10.12.5. And let's also fix the while loop to set i = 0 in the test, not the setup (which is run just once):
>>> import timeit
>>> timeit.timeit('for i in range(10000): pass', number=10000)
1.9789885189966299
>>> timeit.timeit('i=0\nwhile i<10000: True; i+=1', number=10000)
5.172155902953818
Oops, so a correctly running while loop is actually slower here; there goes your premise (and mine!).
I used pass to avoid having to answer questions about how fast referencing that object is (it's fast, but beside the point). My timings are going to be about 6x faster than on your machine.
If you wanted to explore why the iteration is faster, you could time the various components of the for loop in Python, starting with creating the range() object:
>>> timeit.timeit('range(10000)', number=10000)
0.0036197409499436617
So creating 10000 range() objects takes more time than running a single while loop that iterates 10k times. range() objects are more expensive to create than integers.
This does involve a global name lookup, which is slower; you could make it faster by using setup='_range = range' and then using _range(10000). This shaves off about a third of the timings.
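You can check that local-alias claim yourself; a quick sketch of mine (absolute numbers will vary by machine):
from timeit import timeit

# `range` is looked up as a global/builtin name on every iteration
print(timeit('range(10000)', number=10000))
# `_range` becomes a local inside timeit's generated timing function,
# so the lookup is cheaper
print(timeit('_range(10000)', setup='_range = range', number=10000))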
Next, create an iterator for this; here I'll use a local name for the iter() function, as the for loop doesn't have to do a hash-table lookup and just reaches for the C function instead. Hard-coded references to a memory location in a binary are a lot faster, of course:
>>> timeit.timeit('_iter(r)', setup='_iter = iter; r = range(10000)', number=10000)
0.0009729859884828329
Fairly fast; it takes about the same amount of time as your single while loop iterating 10k times. So creating iterable objects is cheap. The C implementation is faster still. We haven't iterated yet.
Last, we call __next__ on the iterator object, 10k times. This is again done in C code, with cached references to internal C implementations, but with a functools.partial() object we can at least attempt to get a ball-park figure:
>>> timeit.timeit('n()', setup='from functools import partial; i = iter(range(10000)); n = partial(i.__next__)', number=10000) * 10000
7.759470026940107
Boy, 10k times 10k calls to iter(range(10000)).__next__ takes almost 4x more time than the for loop managed; this goes to show how efficient the actual C implementation really is.
However, it does illustrate that looping in C code is a lot faster, and this is why the while loop is actually slower when executed correctly; summing integers and making boolean comparisons in bytecode takes more time than iterating over range() in C code (where the CPU does the incrementing and comparisons directly in CPU registers):
>>> (
... timeit.timeit('9999 + 1', number=10000 ** 2) +
... timeit.timeit('9999 < 10000', number=10000 ** 2)
... )
3.695550534990616
It is those operations that make the while loop about 3 seconds slower.
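If you want to see that extra bytecode work, here is a small illustration of my own: disassemble the two loop shapes and compare. The for loop advances via a single iterator instruction per pass, while the while loop loads i, compares it and increments it in bytecode on every pass.
import dis

def with_for():
    for i in range(10000):
        pass

def with_while():
    i = 0
    while i < 10000:
        True
        i += 1

# Compare the loop bodies in the two disassemblies.
dis.dis(with_for)
dis.dis(with_while)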
TLDR: You didn't actually test a while loop correctly. I should have noticed this earlier too.
You are timing things incorrectly; setup is only executed once, and then the value of i is 10000 for all subsequent runs. See the documentation on timeit:
Time number executions of the main statement. This executes the setup statement once, and then returns the time it takes to execute the main statement a number of times, measured in seconds as a float.
Additionally verify it by printing i for each repetition:
>>> timeit('print(i)\nwhile i<10000: True; i+=1',setup='i=0', number=5)
0
10000
10000
10000
10000
As a result, all subsequent runs merely perform a single comparison (which immediately fails) and finish early.
Time correctly and see how the for loop is actually faster:
>>> timeit('i=0\nwhile i<10000: True; i+=1', number=10000)
8.416439056396484
>>> timeit('for i in range(10000): True', number=10000)
5.589155912399292
So I am not a CS major and have a hard time answering questions about a program's big-O complexity.
I wrote the following routine to output the pairs of numbers in an array which sum to 0:
asd=[-3,-2,-3,2,3,2,4,5,8,-8,9,10,-4]
def sum_zero(asd):
    for i in range(len(asd)):
        for j in range(i,len(asd)):
            if asd[i]+asd[j]==0:
                print asd[i],asd[j]
Now if someone asks the complexity of this method, I know that since the first loop goes through all the n items it will be more than O(n) (unless I am wrong), but can someone explain how to find the correct complexity?
Is there a better, more efficient way of solving this?
I won't give you a full solution, but will try to guide you.
You should get a pencil and a paper, and ask yourself:
How many times does the statement print asd[i], asd[j] execute? (in worst case, meaning that you shouldn't really care about the condition there)
You'll find that it really depends on the loop above it, which gets executed len(asd) (denote it by n) times.
The only thing you need to know is: how many times is the inner loop executed, given that the outer loop has n iterations? (i runs from 0 up to n)
If you're still not sure about the result, just take a real example, say n=20, and calculate how many times the innermost statement is executed; this will give you a very good indication of the answer.
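If you'd like to check your pencil-and-paper count with code, here is a small helper of my own for a concrete n:
def count_inner(n):
    # counts how many times the innermost statement would run for input size n
    count = 0
    for i in range(n):
        for j in range(i, n):
            count += 1
    return count

print(count_inner(20))   # 210, i.e. n*(n+1)/2 for n = 20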
def sum_zero(asd):
    for i in range(len(asd)):        # len called once = O(1), range called once = O(1)
        for j in range(i,len(asd)):  # len called once per i = O(n), range called once per i = O(n)
            if asd[i]+asd[j]==0:     # indexing asd is done twice per j = O(2n²)
                                     # adding is done once per j = O(n²)
                                     # comparing with 0 is done once per j = O(n²)
                print asd[i],asd[j]  # indexing asd is done twice per j = O(2n²)

sum_zero(asd)                        # called once, O(1)
Assuming the worst case scenario (the if-condition always being true):
Total:
O(1) * 3
O(n) * 2
O(n²) * 6
O(6n² + 2n + 3)
A simple program to demonstrate the complexity:
import matplotlib.pyplot as plt

target = []
quadratic = []
linear = []
for x in xrange(1, 100):
    linear.append(x)
    target.append(6*(x**2) + 2*x + 3)
    quadratic.append(x**2)

plt.plot(linear, label="Linear")
plt.plot(target, label="Target Function")
plt.plot(quadratic, label="Quadratic")
plt.ylabel('Complexity')
plt.xlabel('Time')
plt.legend(loc=2)
plt.show()
EDIT:
As pointed out by @Micah Smith, the above counts the worst-case operations; the Big-O is actually O(n²), since the constants and lower-order terms are omitted.
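Since the question also asked for a more efficient approach, here is a set-based sketch of my own that finds the pairs in expected O(n) by looking up each value's negation in a set of values seen so far. Note it reports pairs by value as it scans, so duplicates and zeros are handled a little differently than the nested loop.
def sum_zero_fast(asd):
    seen = set()
    for x in asd:
        if -x in seen:
            print(-x, x)   # a previously seen value cancels the current one
        seen.add(x)

sum_zero_fast([-3, -2, -3, 2, 3, 2, 4, 5, 8, -8, 9, 10, -4])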
I did this test
import time
def test1():
    a = 100
    b = 200
    start = time.time()
    if (a > b):
        c = a
    else:
        c = b
    end = time.time()
    print(end - start)

def test2():
    a = "amisetertzatzaz1111reaet"
    b = "avieatzfzatzr333333ts"
    start = time.time()
    if (a > b):
        c = a
    else:
        c = b
    end = time.time()
    print(end - start)

def test3():
    a = "100"
    b = "200"
    start = time.time()
    if (a > b):
        c = a
    else:
        c = b
    end = time.time()
    print(end - start)
And obtain as result
1.9073486328125e-06 #test1()
9.5367431640625e-07 #test2()
1.9073486328125e-06 #test3()
Execution times are similar. It's true that using integers instead of strings reduces the storage space, but what about the execution time?
Timing a single execution of a short piece of code doesn't tell you very much at all. In particular, if you look at the timing numbers from your test1 and test3, you'll see that the numbers are identical. That ought to be a warning sign that, in fact, all that you're seeing here is the resolution of the timer:
>>> 2.0 / 2 ** 20
1.9073486328125e-06
>>> 1.0 / 2 ** 20
9.5367431640625e-07
For better results, you need to run the code many times, and measure and subtract the timing overhead. Python has a built-in module timeit for doing exactly this. Let's time 100 million executions of each kind of comparison:
>>> from timeit import timeit
>>> timeit('100 > 200', number=10**8)
5.98881983757019
>>> timeit('"100" > "200"', number=10**8)
7.528342008590698
so you can see that the difference is not really all that much (string comparison only about 25% slower in this case). So why is string comparison slower? Well, the way to find out is to look at the implementation of the comparison operation.
In Python 2.7, comparison is implemented by the do_cmp function in object.c. (Please open this code in a new window to follow the rest of my analysis.) On line 817, you'll see that if the objects being compared are the same type and if they have a tp_compare function in their class structure, then that function is called. In the case of integer objects, this is what happens, the function being int_compare in intobject.c, which you'll see is very simple.
But strings don't have a tp_compare function, so do_cmp proceeds to call try_rich_to_3way_compare which then calls try_rich_compare_bool up to three times (trying the three comparison operators EQ, LT and GT in turn). This calls try_rich_compare which calls string_richcompare in stringobject.c.
So string comparison is slower because it has to use the complicated "rich comparison" infrastructure, whereas integer comparison is more direct. But even so, it doesn't make all that much difference.
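One side observation of my own, beyond the dispatch machinery: string comparison also has to walk the characters until the first difference, so long strings sharing a long prefix cost noticeably more to compare than short ones:
from timeit import timeit

setup = 'a = "x" * 1000 + "1"; b = "x" * 1000 + "2"'
print(timeit('"100" < "200"', number=10**6))        # short strings, differ at the first char
print(timeit('a < b', setup=setup, number=10**6))   # 1000-char common prefix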
Huh? Since the storage space is reduced, the number of bits that need to be compared is also reduced. Comparing bits is work, doing less work means it goes faster.