Why is string comparison not much slower than integer comparison in Python?

The difference in C++ is huge, but not in Python. I used similar code in C++, and the result is very different: integer comparison is 20-30 times faster than string comparison.
Here is my example code:
import random, time
rand_nums = []
rand_strs = []
total_num = 1000000
for i in range(total_num):
    randint = random.randint(0,total_num*10)
    randstr = str(randint)
    rand_nums.append(randint)
    rand_strs.append(randstr)
start = time.time()
for i in range(total_num-1):
    b = rand_nums[i+1]>rand_nums[i]
end = time.time()
print("integer compare:",end-start) # 0.14269232749938965 seconds
start = time.time()
for i in range(total_num-1):
    b = rand_strs[i+1]>rand_strs[i]
end = time.time() # 0.15730643272399902 seconds
print("string compare:",end-start)

I can't explain why it's so much slower in C++, but in Python the reason is clear from your test code: the random strings usually differ in the first byte, so most comparisons finish after a single character and take roughly the same time regardless of string length.
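For example (a quick sketch, not from the original post), you can see the effect of where the strings first differ:
import timeit

early = ("a" + "x" * 10000, "b" + "x" * 10000)  # differ at the first character
late = ("x" * 10000 + "a", "x" * 10000 + "b")   # differ only at the last character

# Comparing the "late" pair has to scan the whole common prefix,
# so it should take noticeably longer than the "early" pair.
print(timeit.timeit(lambda: early[0] > early[1], number=100000))
print(timeit.timeit(lambda: late[0] > late[1], number=100000))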
Also, note that much of your overhead will be in the loop control and list accesses. You'd get a much more accurate measure if you removed those factors by zipping the lists:
for s1, s2 in zip(rand_strs, rand_strs[1:]):
    b = s1 > s2

The difference in C++ is huge, but not in Python.
In Python, the time spent in the comparison itself is minimal compared to the rest of the loop. The actual comparison operation is implemented in C inside the interpreter, while the loop executes through the bytecode interpreter.
As a test, you can run this code that performs all the same operations as the string comparison loop, except without the comparison:
start = time.time()
for i in range(total_num-1):
    b = rand_strs[i+1], rand_strs[i]
end = time.time()
print("no compare:",end-start)
The times are pretty close to each other, though for me string comparison is always the slowest of the three loops:
integer compare: 1.2947499752044678
string compare: 1.3821675777435303
no compare: 1.3093421459197998

Related

What is the time complexity of adding and retrieving strings from hashset [duplicate]

Say we add a group of long strings to a hashset, and then test if some string already exists in this hashset. Is the time complexity going to be constant for adding and retrieving operations? Or does it depend on the length of the strings?
For example, say we have three strings:
s1 = 'abcdefghijklmn'
s2 = 'dalkfdboijaskjd'
s3 = 'abcdefghijklmn'
Then we do:
pool = set()
pool.add(s1)
pool.add(s2)
print s3 in pool # => True
print 'zzzzzzzzzz' in pool # => False
Would the time complexity of the above operations depend on the length of the strings?
Another question: what if we are hashing a tuple, something like (1,2,3,4,5,6,7,8,9)?
I appreciate your help!
==================================
I understand that there are resources around, like this one, that talk about why hashing is constant time and about collision issues. However, they usually assume that the length of the key can be neglected. This question asks whether hashing still takes constant time when the key has a length that cannot be neglected. For example, if we test N times whether a key of length K is in the set, is the time complexity O(N) or O(N*K)?
One of the best ways to answer something like this is to dig into the implementation :)
Notwithstanding the optimization magic described in the header of setobject.c, adding an object to a set reuses the string's hash if hash() has already been called on it once (recall that strings are immutable), or calls the type's hash implementation.
For Unicode/bytes objects, we end up, via here, at _Py_HashBytes, which seems to have an optimization for small strings and otherwise uses the compile-time configured hash function, all of which are naturally somewhat O(n)-ish. But again, this seems to happen only once per string object.
For tuples, the hash implementation can be found here – apparently a simplified, non-cached xxHash.
However, once the hash has been computed, the time complexity for sets should be around O(1).
EDIT: A quick, not very scientific benchmark:
import time
def make_string(c, n):
    return c * n

def make_tuple(el, n):
    return (el,) * n

def hashtest(gen, n):
    # First compute how long generation alone takes
    gen_time = time.perf_counter()
    for x in range(n):
        gen()
    gen_time = time.perf_counter() - gen_time
    # Then compute how long hashing and generation takes
    hash_and_gen_time = time.perf_counter()
    for x in range(n):
        hash(gen())
    hash_and_gen_time = time.perf_counter() - hash_and_gen_time
    # Return the two
    return (hash_and_gen_time, gen_time)

for gen in (make_string, make_tuple):
    for obj_length in (10000, 20000, 40000):
        t = f"{gen.__name__} x {obj_length}"
        # Using `b'hello'.decode()` here to avoid any cached hash shenanigans
        hash_and_gen_time, gen_time = hashtest(
            lambda: gen(b"hello".decode(), obj_length), 10000
        )
        hash_time = hash_and_gen_time - gen_time
        print(t, hash_time, obj_length / hash_time)
outputs
make_string x 10000 0.23490356100000004 42570.66158311665
make_string x 20000 0.47143921999999994 42423.284172241765
make_string x 40000 0.942087403 42458.905482254915
make_tuple x 10000 0.45578034300000025 21940.393335480014
make_tuple x 20000 0.9328520900000008 21439.62608263008
make_tuple x 40000 1.8562772150000004 21548.505620158674
which basically says hashing sequences, be they strings or tuples, is linear time, yet hashing strings is a lot faster than hashing tuples.
EDIT 2: this proves strings and bytestrings cache their hashes:
import time
s = ('x' * 500_000_000)
t0 = time.perf_counter()
a = hash(s)
t1 = time.perf_counter()
print(t1 - t0)
t0 = time.perf_counter()
b = hash(s)
t2 = time.perf_counter()
assert a == b
print(t2 - t0)
outputs
0.26157095399999997
1.201999999977943e-06
Strictly speaking it depends on the implementation of the hash set and the way you're using it (there may be cleverness that will optimize some of the time away in specialized circumstances), but in general, yes, you should expect that it will take O(n) time to hash a key to do an insert or lookup where n is the size of the key. Usually hash sets are assumed to be O(1), but there's an implicit assumption there that the keys are of fixed size and that hashing them is a O(1) operation (in other words, there's an assumption that the key size is negligible compared to the number of items in the set).
Optimizing the storage and retrieval of really big chunks of data is why databases are a thing. :)
Average case is O(1).
However, the worst case is O(n), with n being the number of elements in the set. This case is caused by hashing collisions.
You can read more about it here:
https://www.geeksforgeeks.org/internal-working-of-set-in-python/
The wiki is your friend:
https://wiki.python.org/moin/TimeComplexity
For the operations above, it seems that they are all O(1) for a set.

Optimizing Python code for converting list of strings to integers and floats

I'm trying to optimize my Python 2.7.x code. I'm going to perform one operation inside a for loop, possibly millions of times, so I want it to be as quick as possible.
My operation is taking a list of 10 strings and converting them to 2 integers followed by 8 floats.
Here is an MWE of my attempts:
import timeit
words = ["1"] * 10
start_time = timeit.default_timer()
for ii in range(1000000):
    values = map(float, words)
    values[0] = int(values[0])
    values[1] = int(values[1])
print "1", timeit.default_timer() - start_time

start_time = timeit.default_timer()
for ii in range(1000000):
    values = map(int, words[:2]) + map(float, words[2:])
print "2", timeit.default_timer() - start_time

start_time = timeit.default_timer()
local_map = map
for ii in range(1000000):
    values = local_map(float, words)
    values[0] = int(values[0])
    values[1] = int(values[1])
print "3", timeit.default_timer() - start_time
1 2.86574220657
2 3.83825802803
3 2.86320781708
The first block of code is the fastest I've managed. The map function seems much quicker than using a list comprehension. But there's still some redundancy, because I map everything to a float and then change the first two items to integers.
Is there anything quicker than my code?
Why doesn't making the map function local, local_map = map, improve the speed in the third block of code?
I haven't found anything faster, but your fastest code is actually going to be wrong in some cases. The problem is that a Python float (which is a C double) has limited precision: for values beyond 2 ** 53 (IIRC; might be off by one on the bit count), it can't represent all integer values. By contrast, Python's int is arbitrary precision; if you have the memory, it can represent effectively unbounded values.
You'd want to change:
values[0] = int(values[0])
values[1] = int(values[1])
to:
values[0] = int(words[0])
values[1] = int(words[1])
to avoid that. The reparsing would make this more dependent on the length of the string being parsed (because converting multiple times costs more for longer inputs).
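To illustrate the precision issue (a quick sketch, not from the original answer):
n = 2 ** 53 + 1                # 9007199254740993
print(int(float(str(n))))      # 9007199254740992 -- rounded, off by one
print(int(str(n)))             # 9007199254740993 -- exact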
An alternative that at least on my Python (3.5) works fairly fast is to preconstruct the set of converters so you can call the correct function directly. For example:
words = ["1"] * 10
converters = (int,) * 2 + (float,) * 8
values = [f(v) for f, v in zip(converters, words)]
On Python 2 you'd want to test with both the list-generating zip and the generator-based itertools.izip to see which is faster (for short inputs like these, I really can't say). In Python 3.5 (where zip is always a generator, like Py2's itertools.izip), this took about 10% longer than your fastest solution for the same inputs (I used min() of a timeit.repeat run rather than the hand-rolled timing you used); it might do better if the inputs are larger (and therefore parsing twice would be more expensive).
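If you want to reproduce that kind of measurement, here is a rough sketch along those lines (the statement strings are illustrative, not the exact benchmark used above):
import timeit

setup = 'words = ["1"] * 10; converters = (int,) * 2 + (float,) * 8'

map_then_fix = ('values = list(map(float, words)); '
                'values[0] = int(words[0]); values[1] = int(words[1])')
per_column = 'values = [f(v) for f, v in zip(converters, words)]'

# min() of several repeats is less noisy than a single hand-rolled loop
print(min(timeit.repeat(map_then_fix, setup, number=100000)))
print(min(timeit.repeat(per_column, setup, number=100000)))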

Python string comparison performance is inconsistent

I wanted to check how string comparison works (I wanted to see whether it compares char by char, and whether it checks the lengths of the strings before comparing), so I used this code:
s1 = 'abc'
s2 = 'abcd'
s3 = 'dbc'
s4 = 'abd'
t1 = time.clock()
s1==s2
print time.clock() - t1
t2 = time.clock()
s1==s3
print time.clock() - t2
t3 = time.clock()
s1==s4
print time.clock() - t3
When I tried the same thing on very long strings (~30 MB text files), it worked well, and I found that it does perform a length check and that it compares char by char.
But when I tried it on short strings (such as the strings in the code above), the timing results were very inconsistent.
Does anyone have any idea why they are inconsistent, or what I did wrong? (Perhaps I was wrong and the comparison doesn't work the way I thought?)
Edit: As an example of something else I tried, I compared strings of different lengths against one specific string. I thought the comparison that takes the longest would be against the string of exactly the same length, because all the others would be rejected by the length check, but it was inconsistent as well.
Let's say the string I'm checking against is 'hello': I compared 'a', 'aa', 'aaa', and so on.
I was expecting the longest check to be for 'aaaaa', but it was for 'a', and I have no idea why.
You are correct that strings compare lengths before comparing contents (at least in 2.7). Here is the relevant portion of string_richcompare:
if (op == Py_EQ) {
    /* Supporting Py_NE here as well does not save
       much time, since Py_NE is rarely used.  */
    if (Py_SIZE(a) == Py_SIZE(b)
        && (a->ob_sval[0] == b->ob_sval[0]
            && memcmp(a->ob_sval, b->ob_sval, Py_SIZE(a)) == 0)) {
        result = Py_True;
    } else {
        result = Py_False;
    }
    goto out;
}
In simple terms, the checks appear to be, in order:
if the strings have the same memory address, they are equal. (not pictured in the above code)
if the strings have a different size, they are not equal.
if the strings have a different first character, they are not equal.
if the strings have identical character arrays, they are equal.
The third check doesn't appear to be strictly necessary, but is probably an optimization if manually checking array contents is faster than calling memcmp.
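As a rough Python-level sketch of that order of checks (an illustration, not the actual CPython code):
def str_eq_sketch(a, b):
    if a is b:                       # same object: trivially equal
        return True
    if len(a) != len(b):             # different sizes: not equal
        return False
    if a and a[0] != b[0]:           # different first character: not equal
        return False
    # stand-in for memcmp over the character arrays
    return all(x == y for x, y in zip(a, b))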
If your benchmarking suggested to you that comparing strings of different length is slower than comparing strings of the same length, this is probably a false alarm caused by the not entirely dependable behavior of clock, as covered in other answers and comments.
You are liable to get inconsistent results when measuring very small times.
You'll get better results by repeating the operation a great many times so that the difference is substantial:
t1 = time.clock()
for i in range(10**6):
    s1 == s2
t2 = time.clock()
Better yet, use the timeit module to handle the repetition (and other details
like turning off garbage collection) for you:
import timeit
s1 = 'abc'
s2 = 'abcd'
s3 = 'dbc'
s4 = 'abd'
t1 = timeit.timeit('s1==s2', 'from __main__ import s1, s2', number=10**8)
t2 = timeit.timeit('s1==s3', 'from __main__ import s1, s3', number=10**8)
t3 = timeit.timeit('s1==s4', 'from __main__ import s1, s4', number=10**8)
for t in (t1, t2, t3):
    print(t)
yields
2.82305312157
2.83096408844
3.15551590919
Thus s1==s2 and s1==s3 take essentially the same amount of time. s1==s4 requires a bit more time because more characters have to be compared before the equality can return False.
By the way, while time.clock is used by timeit.default_timer for measuring
time on Windows, time.time is used by timeit.default_timer for measuring
time on Unix. Use timeit.default_timer instead of time.clock or time.time
to make your code more cross-platform compatible.
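For instance, a minimal usage sketch:
import timeit

s1, s2 = 'abc', 'abcd'
start = timeit.default_timer()
for _ in range(10**6):
    s1 == s2
elapsed = timeit.default_timer() - start
print(elapsed)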

How to speed up for loop in python using Cython

I am trying to make a sensor using a BeagleBone Black (BBB) and Python. I need to get as much data as possible per second from the sensor. The code below allows me to collect about 100,000 data points per second.
import Adafruit_BBIO_GPIO as GPIO
import time
GPIO.setup("P8_13", GPIO.IN)
def get_data(n):
    my_list = []
    start_time = time.time()
    for i in range(n):
        my_list.append(GPIO.input("P8_13"))
    end_time = time.time() - start_time
    print "Time: {}".format(end_time)
    return my_list

n = 100000
get_data(n)
If n = 1,000,000, it takes around 10 seconds to populate my_list, which is the same rate as n = 100,000 in 1 second.
I decided to try Cython to get better results, since I've heard it can significantly speed up Python code. I followed the basic Cython tutorial: I created a data.pyx file with the Python code above, then created a setup.py and, finally, built the Cython extension.
Unfortunately, that did not help at all. So I am wondering whether I am using Cython incorrectly, or whether, in a case like this with no "heavy math computations", Cython cannot help much. Any suggestions on how to speed up my code are highly appreciated!
You can start by adding a static type declaration:
import Adafruit_BBIO_GPIO as GPIO
import time
GPIO.setup("P8_13", GPIO.IN)
def get_data(int n):  # n declared as a C int
    my_list = []
    start_time = time.time()
    for i in range(n):
        my_list.append(GPIO.input("P8_13"))
    end_time = time.time() - start_time
    print "Time: {}".format(end_time)
    return my_list

n = 100000
get_data(n)
This allows the loop itself to be converted into a pure C loop, with the disadvantage that n is no longer arbitrary precision (so if you try to pass a value larger than ~2 billion, you'll get undefined behavior). This issue can be mitigated by changing int to unsigned long long, which allows values up to 2**64 - 1, or around 18 quintillion. The unsigned qualifier means you won't be able to pass a negative value.
You'll get a much more substantial speed boost if you can eliminate the list. Try replacing it with an array. Cython can work more efficiently with arrays than with lists.
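For illustration, here is a rough pure-Python sketch of that idea using the array module, with a preallocated buffer filled by index instead of list appends (untested on actual hardware, and it assumes the corrected Adafruit_BBIO.GPIO module name mentioned in the answer below; the real gains come once Cython can type the loop index and the buffer):
import array
import time
import Adafruit_BBIO.GPIO as GPIO

GPIO.setup("P8_13", GPIO.IN)

def get_data(n):
    buf = array.array('i', [0]) * n      # preallocated buffer of n ints
    start_time = time.time()
    for i in range(n):
        buf[i] = GPIO.input("P8_13")     # write by index, no append
    print("Time: {}".format(time.time() - start_time))
    return buf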
I tried your same code, but with a different build of Adafruit_BBIO, and the one-million count takes only about 3 seconds to run on my Rev C board.
I thought that the main change from Rev B to Rev C was that the eMMC was increased from 2 GB to 4 GB.
If you go and get the current Adafruit_BBIO, all you have to change in your code above is the first import statement: it should be import Adafruit_BBIO.GPIO as GPIO.
What have you tried next?
Ron

In practice, why is comparing integers better than comparing strings?

I did this test:
import time
def test1():
    a = 100
    b = 200
    start = time.time()
    if (a > b):
        c = a
    else:
        c = b
    end = time.time()
    print(end - start)

def test2():
    a = "amisetertzatzaz1111reaet"
    b = "avieatzfzatzr333333ts"
    start = time.time()
    if (a > b):
        c = a
    else:
        c = b
    end = time.time()
    print(end - start)

def test3():
    a = "100"
    b = "200"
    start = time.time()
    if (a > b):
        c = a
    else:
        c = b
    end = time.time()
    print(end - start)
And obtained these results:
1.9073486328125e-06 #test1()
9.5367431640625e-07 #test2()
1.9073486328125e-06 #test3()
The execution times are similar. It's true that using integers instead of strings reduces the storage space, but what about the execution time?
Timing a single execution of a short piece of code doesn't tell you very much at all. In particular, if you look at the timing numbers from your test1 and test3, you'll see that the numbers are identical. That ought to be a warning sign that, in fact, all that you're seeing here is the resolution of the timer:
>>> 2.0 / 2 ** 20
1.9073486328125e-06
>>> 1.0 / 2 ** 20
9.5367431640625e-07
For better results, you need to run the code many times, and measure and subtract the timing overhead. Python has a built-in module timeit for doing exactly this. Let's time 100 million executions of each kind of comparison:
>>> from timeit import timeit
>>> timeit('100 > 200', number=10**8)
5.98881983757019
>>> timeit('"100" > "200"', number=10**8)
7.528342008590698
so you can see that the difference is not really all that much (string comparison only about 25% slower in this case). So why is string comparison slower? Well, the way to find out is to look at the implementation of the comparison operation.
In Python 2.7, comparison is implemented by the do_cmp function in object.c. (Please open this code in a new window to follow the rest of my analysis.) On line 817, you'll see that if the objects being compared are the same type and if they have a tp_compare function in their class structure, then that function is called. In the case of integer objects, this is what happens, the function being int_compare in intobject.c, which you'll see is very simple.
But strings don't have a tp_compare function, so do_cmp proceeds to call try_rich_to_3way_compare which then calls try_rich_compare_bool up to three times (trying the three comparison operators EQ, LT and GT in turn). This calls try_rich_compare which calls string_richcompare in stringobject.c.
So string comparison is slower because it has to use the complicated "rich comparison" infrastructure, whereas integer comparison is more direct. But even so, it doesn't make all that much difference.
Huh? Since the storage space is reduced, the number of bits that need to be compared is also reduced. Comparing bits is work; doing less work means it goes faster.
