Python string multiplication implementation

I wonder what the best practice for string repetition is.
I have always heard that I should not build a string with the for i in range(len(x)): string += x[i] pattern and should use string = ''.join(x) instead, due to the inefficiency of the addition operator for Python strings.
But my speed tests show:
$ python -m timeit "100*'string'"
1000000 loops, best of 3: 0.23 usec per loop
$ python -m timeit "''.join(['string' for i in xrange(100)])"
100000 loops, best of 3: 6.45 usec per loop
What about the implementation details of string multiplication? I know that str * n is equivalent to str.__mul__(n), but I don't know how that is implemented.
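Roughly, CPython implements str * n in C with a single allocation for the whole result, then fills it by copying ever-larger prefixes of the already-written data; there is no per-iteration object creation, which is why it beats building a list and joining it. A minimal Python-level sketch of that doubling-copy idea (the helper name repeat is mine, and the sketch assumes ASCII input):

```python
def repeat(s, n):
    # Sketch of the strategy: allocate the result buffer once, then
    # copy the filled prefix onto the unfilled tail, doubling the
    # filled length each pass instead of appending piece by piece.
    if n <= 0:
        return ''
    chunk = s.encode()            # assume ASCII for this sketch
    buf = bytearray(len(chunk) * n)
    buf[:len(chunk)] = chunk
    filled, total = len(chunk), len(buf)
    while filled < total:
        copy = min(filled, total - filled)
        buf[filled:filled + copy] = buf[:copy]
        filled += copy
    return buf.decode()

print(repeat('string', 3))  # stringstringstring
```

The real C code works on the string's internal buffer directly, but the copy-doubling shape is the same.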

Related

What is the fastest way to compare beginning of a string?

Imagine a list of strings like this one: ('{hello world} is awesome', 'Hello world is less awesome', '{hello world} is {awesome} too'). I want to check each string in a for loop for its starting characters. I think I have 4 options:
if re.search(r'^\{', i):
if re.match(r'\{', i):
if i.startswith('{'):
if i[:1] == '{':
Which is the fastest one? Is there something even faster than these 4 options?
Note: the starting string to compare could be longer than one character, e.g. {hello
The fastest is i[0] == value, since it directly uses a pointer to the underlying array. A regex needs to (at least) parse the pattern, startswith has the overhead of a method call, and the slicing approach must create a slice of that size before the actual comparison. (Note that i[0] raises IndexError on an empty string, whereas i[:1] safely returns ''.)
As @dsqdfg said in the comments, there is a timing function in Python I had never known about until now. I tried to measure the options, with these results:
python -m timeit -s 'text="{hello world}"' 'text[:6] == "{hello"'
1000000 loops, best of 3: 0.224 usec per loop
python -m timeit -s 'text="{hello world}"' 'text.startswith("{hello")'
1000000 loops, best of 3: 0.291 usec per loop
python -m timeit -s 'import re; text="{hello world}"' 're.match(r"\{hello", text)'
100000 loops, best of 3: 2.53 usec per loop
python -m timeit -s 'import re; text="{hello world}"' 're.search(r"^\{hello", text)'
100000 loops, best of 3: 2.86 usec per loop
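If the regex forms are still wanted (say, for prefixes more complex than a literal string), precompiling the pattern once outside the loop removes most of their per-call overhead. A small sketch using the strings from the question (variable names are mine):

```python
import re

strings = ('{hello world} is awesome',
           'Hello world is less awesome',
           '{hello world} is {awesome} too')

# Compiled once, reused for every string in the loop.
pattern = re.compile(r'\{hello')

slice_hits = [s for s in strings if s[:6] == '{hello']
regex_hits = [s for s in strings if pattern.match(s)]

print(slice_hits == regex_hits)  # True: the two checks agree here
```

The slice comparison is still the cheapest per call, but a precompiled pattern.match avoids re-parsing the pattern on every iteration, which is most of the regex cost measured above.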

Converting str numbers in list to int and find out the sum of the list

I have a list, but the numbers in it are strings, so I can't find the sum of the list. I need help converting the numbers in the list to int.
This is my code
def convertStr(cals):
    ret = float(cals)
    return ret

TotalCal = sum(cals)
So basically there is a list called cals
and it looks like this
(20,45,...etc)
But the numbers in it are strings, so when I try finding the sum like this
TotalCal = sum(cals)
and then run it, it shows an error saying that the list needs to be in int format.
So the question is: how do I convert all the numbers in the list to int format?
If you have a different way of finding the sum of the list, that would be good too.
You can use either the Python builtin map or a list comprehension for this:
def convertStr(cals):
    ret = [float(i) for i in cals]
    return ret
or
def convertStr(cals):
    return map(float, cals)
Here are the timeit results for both approaches:
$ python -m timeit "cals = ['1','2','3','4'];[float(i) for i in (cals)]"
1000000 loops, best of 3: 0.804 usec per loop
$ python -m timeit "cals = ['1','2','3','4'];map(float,cals)"
1000000 loops, best of 3: 0.787 usec per loop
As you can see, map came out marginally faster and is arguably more pythonic than the list comprehension here. This is discussed at length here:
map may be microscopically faster in some cases (when you're NOT making a lambda for the purpose, but using the same function in map and a listcomp). List comprehensions may be faster in other cases
Another way is itertools.imap (Python 2; in Python 3, map itself is lazy). This is the fastest for long lists:
from itertools import imap
TotalCal = sum(imap(float, cals))
And using timeit for a list with 1000 entries.
$ python -m timeit "import random;cals = [str(random.randint(0,100)) for r in range(1000)];sum(map(float,cals))"
1000 loops, best of 3: 1.38 msec per loop
$ python -m timeit "import random;cals = [str(random.randint(0,100)) for r in range(1000)];[float(i) for i in (cals)]"
1000 loops, best of 3: 1.39 msec per loop
$ python -m timeit "from itertools import imap;import random;cals = [str(random.randint(0,100)) for r in range(1000)];imap(float,cals)"
1000 loops, best of 3: 1.24 msec per loop
As Padraic mentions below, the imap way is the best way to go! It is fast1 and looks great! The cost of pulling in a library function only has a bearing on small lists, not on large ones, so for large lists imap is better suited. (One caveat about the timing above: imap is lazy, so the third benchmark only builds the iterator without consuming it; wrap it in sum() to time the full work.)
1 The list comprehension is still slower than map by 1 microsecond! Thank god
sum(map(float,cals))
or
sum(float(i) for i in cals)
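Putting the pieces together, a minimal sketch (the list contents are made up for illustration):

```python
cals = ['20', '45', '15.5']  # calorie counts stored as strings

# Convert lazily and sum in one pass; no intermediate list is built.
total = sum(float(c) for c in cals)

print(total)  # 80.5
```

float handles both integer-looking and decimal strings, which is why the answers above prefer it over int for this data.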

In Python, will converting a variable to its own type waste CPU power?

I'm trying to fetch an int id number from a database, and some ids are mistakenly stored as strings. I'm wondering which of the following ways is better:
# the first method
new_id = int(old_id) + 1

# the second method
if isinstance(old_id, str):
    new_id = int(old_id) + 1
else:
    new_id = old_id + 1
So the question is: does it cost anything to convert a variable to its own type in Python?
Let's check!
~/Coding > python -m timeit -s "id1=1;id2='1'" "new_id = int(id1)" "new_id = int(id2)"
1000000 loops, best of 3: 0.755 usec per loop
~/Coding > python -m timeit -s "id1=1;id2='1';f=lambda x: int(x) if isinstance(x, str) else x" "new_id=f(id1)" "new_id=f(id2)"
1000000 loops, best of 3: 1.15 usec per loop
Looks like the most efficient way is simply doing the int conversion without checking.
I'm open to being corrected that the issue here is the lambda or something else I did.
Update:
This may actually not be a fair answer, because the if check itself is much quicker than the type conversion.
~/Coding > python -m timeit "int('3')"
1000000 loops, best of 3: 0.562 usec per loop
~/Coding > python -m timeit "int(3)"
10000000 loops, best of 3: 0.136 usec per loop
~/Coding > python -m timeit "if isinstance('3', str): pass"
10000000 loops, best of 3: 0.0966 usec per loop
This means that which approach is worth it depends on how many of your ids you expect to be strings.
Update 2:
I've gone a bit overboard here, but we can determine exactly when it's right to switch over using the above timings depending on how many strings you expect to have.
Where z is the total number of ids and s is the percentage of them that are strings, and all values in microseconds,
Always check type: (assuming returning int costs 0 time)
.0966*z + .562*z*s
Always convert without checking:
.136*z*(1-s) + .562*z*s
When we do the math, the z's and string conversions cancel out (since you have to convert the string regardless), and we end up with the following:
s ~= 0.289706
So it looks like 29% strings or so is about the time when you'd cross over from one method to the other.
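The crossover can be computed directly from the measured per-call costs (the numbers below are the timings quoted above, in microseconds):

```python
check = 0.0966    # isinstance('3', str) check
conv_int = 0.136  # int(3): converting an int to its own type

# Per-id cost, with s = fraction of ids that are strings:
#   always check:    check + conv_str * s
#   always convert:  conv_int * (1 - s) + conv_str * s
# The conv_str terms appear on both sides and cancel, leaving:
#   check = conv_int * (1 - s)
s = 1 - check / conv_int

print(round(s, 6))  # ~0.289706
```

Below roughly 29% strings, checking first wins; above it, converting unconditionally wins.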

What is faster for searching items in list, in operator or index()?

From this site, it says that list.index() is a linear search through the list.
And it seems that in is linear as well.
Is there any advantage to using one over the other?
If you want to compare different python approaches, such as the in operator versus .index(), use the timeit module to test the speed differences. Python data type complexities are documented on http://wiki.python.org/moin/TimeComplexity.
Do note that there is a big difference between in and .index(): the first returns a boolean, while the latter returns the index of the found item (an integer) or raises ValueError if it is absent. .index() is thus (slightly) slower in the average case:
$ python -mtimeit -s 'a = list(range(10000))' '5000 in a'
10000 loops, best of 3: 107 usec per loop
$ python -mtimeit -s 'a = list(range(10000))' 'a.index(5000)'
10000 loops, best of 3: 111 usec per loop
If you need to optimize for membership testing, use a set() instead:
$ python -mtimeit -s 'a = set(range(10000))' '5000 in a'
10000000 loops, best of 3: 0.108 usec per loop
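To make the semantic difference concrete, a small sketch:

```python
a = list(range(10000))

print(5000 in a)      # True: membership test returns a bool
print(a.index(5000))  # 5000: position of the first match

# .index raises ValueError when the item is absent, so guard it
# (or test membership first) when the item may be missing:
try:
    a.index(20000)
except ValueError:
    print('not found')

fast = set(a)          # O(1) average-case membership
print(20000 in fast)   # False
```

So the choice is really about what you need back (a flag versus a position); only pure membership testing benefits from switching to a set.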

Efficient way to convert delimiter separated string to numpy array

I have a string as follows:
1|234|4456|789
I have to convert it into a numpy array. I would like to know the most efficient way, since I will be calling this function more than 50 million times!
The fastest way is to use the numpy.fromstring method:
>>> import numpy
>>> data = "1|234|4456|789"
>>> numpy.fromstring(data, dtype=int, sep="|")
array([ 1, 234, 4456, 789])
@jterrace wins one (1) internet.
In the measurements below the example code has been shortened to allow the tests to fit on one line without scrolling where possible.
For those not familiar with timeit the -s flag allows you to specify a bit of code which will only be executed once.
The fastest and least-cluttered way is to use numpy.fromstring as jterrace suggested:
python -mtimeit -s"import numpy;s='1|2'" "numpy.fromstring(s,dtype=int,sep='|')"
100000 loops, best of 3: 1.85 usec per loop
The following three examples use string.split in combination with another tool.
string.split with numpy.fromiter
python -mtimeit -s"import numpy;s='1|2'" "numpy.fromiter(s.split('|'),dtype=int)"
100000 loops, best of 3: 2.24 usec per loop
string.split with int() cast via generator expression (caveat: numpy.array over a bare generator does not build a numeric array; it produces a 0-d object array wrapping the generator, so in practice the expression needs brackets: numpy.array([int(x) for x in s.split('|')]))
python -mtimeit -s"import numpy;s='1|2'" "numpy.array(int(x) for x in s.split('|'))"
100000 loops, best of 3: 3.12 usec per loop
string.split with NumPy array of type int
python -mtimeit -s"import numpy;s='1|2'" "numpy.array(s.split('|'),dtype=int)"
100000 loops, best of 3: 9.22 usec per loop
Try this:
import numpy as np
s = '1|234|4456|789'
array = np.array([int(x) for x in s.split('|')])
Assuming that the numbers are all ints; if not, replace int with float in the above snippet of code.
EDIT 1:
Alternatively, you can do this, it will only create one intermediate list (the one generated by split()):
array = np.array(s.split('|'), dtype=int)
EDIT 2:
And yet another way, possibly faster (thanks for all the comments, guys!):
array = np.fromiter(s.split("|"), dtype=int)
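As a quick correctness check across the variants (note that numpy.fromstring with a sep argument is deprecated in newer NumPy releases, so only the two split-based forms are shown here; the sample string is the one from the question):

```python
import numpy as np

s = '1|234|4456|789'

# Build the array two ways and confirm they agree.
via_array = np.array(s.split('|'), dtype=int)
# Passing count lets fromiter preallocate the output buffer
# instead of growing it, which helps in a hot loop.
via_fromiter = np.fromiter(s.split('|'), dtype=int, count=4)

print(via_array.tolist())                       # [1, 234, 4456, 789]
print(np.array_equal(via_array, via_fromiter))  # True
```

For 50 million calls, shaving the buffer growth with count (when the field count is known) is a cheap win on top of choosing the right constructor.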
