What is the fastest way to compare beginning of a string? - python

Imagine a list of strings like this one: ('{hello world} is awesome', 'Hello world is less awesome', '{hello world} is {awesome} too'). I want to check the starting character of each string in a for loop. I think I have 4 options:
if re.search(r'^\{', i):
if re.match(r'\{', i):
if i.startswith('{'):
if i[:1] == '{':
Which is the fastest one? Is there anything even faster than these 4 options?
Note: the prefix to compare could be longer than a single character, e.g. {hello

The fastest is i[0] == value, since it directly uses a pointer to the underlying array. A regex needs to (at least) parse the pattern, while startswith has the overhead of a method call, and the i[:1] slice variant additionally creates a one-character string before the actual comparison. (Note that i[0] raises IndexError on an empty string, whereas i[:1] safely yields ''.)

As #dsqdfg said in the comments, Python has a timing module (timeit) I had never known about until now. I tried measuring the four options, with these results:
python -m timeit -s 'text="{hello world}"' 'text[:6] == "{hello"'
1000000 loops, best of 3: 0.224 usec per loop
python -m timeit -s 'text="{hello world}"' 'text.startswith("{hello")'
1000000 loops, best of 3: 0.291 usec per loop
python -m timeit -s 'import re; text="{hello world}"' 're.match(r"\{hello", text)'
100000 loops, best of 3: 2.53 usec per loop
python -m timeit -s 'import re; text="{hello world}"' 're.search(r"^\{hello", text)'
100000 loops, best of 3: 2.86 usec per loop
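The same comparison can be scripted with the timeit module directly instead of the command line. A minimal sketch (the candidate names and iteration count are my own; absolute numbers depend on the machine, but the ranking should hold):

```python
import timeit

# Reproduce the command-line benchmarks in code. The regex pattern is
# compiled-and-cached by the re module, but match/search still pay the
# lookup and call overhead on every iteration.
setup = "import re; text = '{hello world}'"
candidates = {
    "slice":      'text[:6] == "{hello"',
    "startswith": 'text.startswith("{hello")',
    "re.match":   're.match(r"\\{hello", text)',
    "re.search":  're.search(r"^\\{hello", text)',
}

results = {name: timeit.timeit(stmt, setup=setup, number=50_000)
           for name, stmt in candidates.items()}
for name, t in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {t:.4f}s")
```

On a typical run the slice and startswith variants come out roughly an order of magnitude ahead of the regex ones, matching the shell timings above.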

Related

Assign known values instead of calculating them?

If I am creating a program that does some complex calculations on a data set, and I already know what some of the values should be, should I still calculate them? For example, if I know that 0 and 1 always map to themselves, should I just check whether the value is 0 or 1, or actually do the calculation?
Edit:
I don't have code because I was asking about the concept. I was creating a program to return the base-10 log of each number in a data set, and I wondered whether it would be more efficient to return values I already knew: 0 for 1, "undefined" for 0, and the exponent for exact powers of 10. I wasn't sure whether it would stay efficient at larger scale.
Let's try this simple example
$ python3 -m timeit -s "from math import log; mylog=lambda x: log(x)" "mylog(1)"
10000000 loops, best of 3: 0.152 usec per loop
$ python3 -m timeit -s "from math import log; mylog=lambda x: 0.0 if x==1 else log(x)" "mylog(1)"
10000000 loops, best of 3: 0.0976 usec per loop
So there is some speedup. However, all the non-special cases run slower:
$ python3 -m timeit -s "from math import log; mylog=lambda x: log(x)" "mylog(2)"
10000000 loops, best of 3: 0.164 usec per loop
$ python3 -m timeit -s "from math import log; mylog=lambda x: 0.0 if x==1 else log(x)" "mylog(2)"
1000000 loops, best of 3: 0.176 usec per loop
And in this case, it's better just to leave the wrapper function out altogether
$ python3 -m timeit -s "from math import log" "log(2)"
10000000 loops, best of 3: 0.0804 usec per loop
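One variant of the idea that can pay off is precomputing answers for a small, known set of inputs, so the "special case" cost moves out of the hot path instead of adding a branch to every call. A minimal sketch (the table and function names here are illustrative, not from the question):

```python
import math

# Precompute exact results for inputs we know in advance: powers of 10.
LOG_TABLE = {10 ** k: float(k) for k in range(10)}

def log10_cached(x):
    # One dict lookup; fall back to the real computation on a miss.
    hit = LOG_TABLE.get(x)
    return hit if hit is not None else math.log10(x)
```

Whether this beats calling math.log10 unconditionally still depends on the hit rate, so it is worth timing against the real workload, as the answer above demonstrates.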

Why does Python's set difference method take time with an empty set?

Here is what I mean:
> python -m timeit "set().difference(xrange(0,10))"
1000000 loops, best of 3: 0.624 usec per loop
> python -m timeit "set().difference(xrange(0,10**4))"
10000 loops, best of 3: 170 usec per loop
Apparently python iterates through the whole argument, even if the result is known to be the empty set beforehand. Is there any good reason for this? The code was run in python 2.7.6.
(Even for nonempty sets, if you find that you've removed all of the first set's elements midway through the iteration, it makes sense to stop right away.)
"Is there any good reason for this?"
Having a special path for the empty set had not come up before.
"Even for nonempty sets, if you find that you've removed all of the first set's elements midway through the iteration, it makes sense to stop right away."
This is a reasonable optimization request. I've made a patch and will apply it shortly. Here are the new timings with the patch applied:
$ py -m timeit -s "r = range(10 ** 4); s = set()" "s.difference(r)"
10000000 loops, best of 3: 0.104 usec per loop
$ py -m timeit -s "r = set(range(10 ** 4)); s = set()" "s.difference(r)"
10000000 loops, best of 3: 0.105 usec per loop
$ py -m timeit -s "r = range(10 ** 4); s = set()" "s.difference_update(r)"
10000000 loops, best of 3: 0.0659 usec per loop
$ py -m timeit -s "r = set(range(10 ** 4)); s = set()" "s.difference_update(r)"
10000000 loops, best of 3: 0.0684 usec per loop
IMO it's a matter of specialisation, consider:
In [18]: r = range(10 ** 4)
In [19]: s = set(range(10 ** 4))
In [20]: %time set().difference(r)
CPU times: user 387 µs, sys: 0 ns, total: 387 µs
Wall time: 394 µs
Out[20]: set()
In [21]: %time set().difference(s)
CPU times: user 10 µs, sys: 8 µs, total: 18 µs
Wall time: 16.2 µs
Out[21]: set()
Apparently .difference() has a specialised implementation for the set - set case.
Note that the - operator requires the right-hand argument to be a set, while .difference() accepts any iterable.
Per #wim, the implementation is at https://github.com/python/cpython/blob/master/Objects/setobject.c#L1553-L1555
When Python core developers add new features, the first priority is correct code with thorough test coverage. That is hard enough in itself. Speedups often come later as someone has the idea and inclination. I opened a tracker issue 28071 summarizing the proposal and counter-reasons discussed here. I will try to summarize its disposition here.
UPDATE: An early-out for sets that start empty has been added for 3.6.0b1, due in about a day.
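To make the operator-versus-method distinction above concrete, a quick sketch:

```python
s = {1, 2, 3}

# The method accepts any iterable as its argument.
print(s.difference(range(2)))   # the elements 0 and 1 are removed

# The binary operator requires both operands to be sets.
print(s - {0, 1})

try:
    s - range(2)                # not a set on the right-hand side
except TypeError as e:
    print("operator rejects non-set argument:", e)
```

This is also why the set - set path can be specialised: the operator knows both operands support O(1) containment checks, while .difference() must handle arbitrary iterables.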

python string multiplication implementation

I wonder what the best practice for string repetition is.
I have always heard that I should not join strings with the for i in range(len(x)): string += x[i] pattern, and should use string = ''.join(x) instead, due to the inefficiency of the addition operator's implementation for Python strings.
But speedtests are:
$ python -m timeit "100*'string'"
1000000 loops, best of 3: 0.23 usec per loop
$ python -m timeit "''.join(['string' for i in xrange(100)])"
100000 loops, best of 3: 6.45 usec per loop
What about the implementation details of string multiplication? I know that str * n is equivalent to str.__mul__(n), but how that is implemented I don't know.
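For what it's worth, CPython handles repetition in unicode_repeat (in Objects/unicodeobject.c): it allocates the final buffer once, then fills it by copying ever-larger chunks of itself, so the work is a handful of fast bulk copies rather than n separate appends. Here is a rough Python model of that copy-doubling strategy (a sketch for intuition only, not the real implementation, which works on raw character buffers in C):

```python
def repeat(s: str, n: int) -> str:
    """Model str * n: allocate once, fill by doubling copies."""
    if n <= 0:
        return ""
    src = s.encode()
    buf = bytearray(len(src) * n)   # one allocation of the final size
    buf[: len(src)] = src
    filled = len(src)
    while filled < len(buf):
        # Copy as much as we already have, so the filled region
        # (at most) doubles on each pass: O(log n) bulk copies.
        chunk = min(filled, len(buf) - filled)
        buf[filled : filled + chunk] = buf[:chunk]
        filled += chunk
    return buf.decode()
```

This one-allocation, bulk-copy approach is why 100*'string' is so much faster than building the same result through a join over 100 pieces, which has to iterate a Python-level list first.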

Performance of any()

This is mostly an exercise in learning Python. I wrote this function to test if a number is prime:
import math

def p1(n):
    for d in xrange(2, int(math.sqrt(n)) + 1):
        if n % d == 0:
            return False
    return True
Then I realized I can easily rewrite it using any():
def p2(n):
return not any((n % d == 0) for d in xrange(2, int(math.sqrt(n)) + 1))
Performance-wise, I was expecting p2 to be faster than, or at the very least as fast as, p1 because any() is builtin, but for a large-ish prime, p2 is quite a bit slower:
$ python -m timeit -n 100000 -s "import test" "test.p1(999983)"
100000 loops, best of 3: 60.2 usec per loop
$ python -m timeit -n 100000 -s "import test" "test.p2(999983)"
100000 loops, best of 3: 88.1 usec per loop
Am I using any() incorrectly here? Is there a way to write this function using any() so that it's as fast as iterating myself?
Update: Numbers for an even larger prime
$ python -m timeit -n 1000 -s "import test" "test.p1(9999999999971)"
1000 loops, best of 3: 181 msec per loop
$ python -m timeit -n 1000 -s "import test" "test.p2(9999999999971)"
1000 loops, best of 3: 261 msec per loop
The performance difference is minimal, but the reason it exists is that any incurs building a generator expression and an extra function call, compared to the for loop. Both have identical behavior, though (short-circuit evaluation).
As the size of your input grows, the difference won't diminish (I was wrong) because you're using a generator expression, and iterating over it requires calling a method (.next()) on it and an extra stack frame. any does that under the hood, of course.
The for loop is iterating over an xrange object. any is iterating over a generator expression, which itself is iterating over an xrange object.
Either way, use whichever produces the most readable/maintainable code. Choosing one over the other will have little, if any, performance impact on whatever program you're writing.
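To make the equivalence concrete, here are the two versions side by side, plus an all() spelling that avoids the double negative (Python 3 names, so range instead of xrange; all three short-circuit at the first divisor found):

```python
import math

def p1(n):
    # Explicit loop: iterates the range directly, no extra frame.
    for d in range(2, int(math.sqrt(n)) + 1):
        if n % d == 0:
            return False
    return True

def p2(n):
    # any() over a generator expression: one extra stack frame per step.
    return not any(n % d == 0 for d in range(2, int(math.sqrt(n)) + 1))

def p3(n):
    # all() reads slightly cleaner (no double negative) but carries the
    # same generator-expression overhead as p2.
    return all(n % d for d in range(2, int(math.sqrt(n)) + 1))
```

All three agree on every input; the choice is purely about readability versus the small constant-factor cost discussed above.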

What is faster for searching items in list, in operator or index()?

From this site, it says that list.index() is a linear search through the list.
And it seems that in is also linear.
Is there any advantage to using one over the other?
If you want to compare different python approaches, such as the in operator versus .index(), use the timeit module to test the speed differences. Python data type complexities are documented on http://wiki.python.org/moin/TimeComplexity.
Do note that there is a big difference between in and .index(); the first returns a boolean, while the latter returns the index of the found item (an integer) or raises an exception. It is thus (slightly) slower in the average case:
$ python -mtimeit -s 'a = list(range(10000))' '5000 in a'
10000 loops, best of 3: 107 usec per loop
$ python -mtimeit -s 'a = list(range(10000))' 'a.index(5000)'
10000 loops, best of 3: 111 usec per loop
If you need to optimize for membership testing, use a set() instead:
$ python -mtimeit -s 'a = set(range(10000))' '5000 in a'
10000000 loops, best of 3: 0.108 usec per loop
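In practice the choice often comes down to whether you need the position at all. A small sketch combining the idioms discussed above (the -1 sentinel is my own convention, not something .index() provides):

```python
a = list(range(10000))

# Need the position? One .index() call in try/except scans the list
# once, instead of an `in` check followed by a second scan for .index().
try:
    pos = a.index(5000)
except ValueError:
    pos = -1  # sentinel for "not found"

# Need only repeated membership tests? Build a set once for O(1) lookups.
members = set(a)
print(pos, 5000 in members)
```

The set conversion is itself a linear pass, so it only pays off when you test membership more than once against the same data.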
