Efficient way to convert delimiter separated string to numpy array - python

I have a String as follows :
1|234|4456|789
I have to convert it into numpy array.I would like to know the most efficient way.Since I will be calling this function for more than 50 million times!

The fastest way is to use the numpy.fromstring method:
>>> import numpy
>>> data = "1|234|4456|789"
>>> numpy.fromstring(data, dtype=int, sep="|")
array([ 1, 234, 4456, 789])

#jterrace wins one (1) internet.
In the measurements below the example code has been shortened to allow the tests to fit on one line without scrolling where possible.
For those not familiar with timeit the -s flag allows you to specify a bit of code which will only be executed once.
The fastest and least-cluttered way is to use numpy.fromstring as jterrace suggested:
python -mtimeit -s"import numpy;s='1|2'" "numpy.fromstring(s,dtype=int,sep='|')"
100000 loops, best of 3: 1.85 usec per loop
The following three examples use string.split in combination with another tool.
string.split with numpy.fromiter
python -mtimeit -s"import numpy;s='1|2'" "numpy.fromiter(s.split('|'),dtype=int)"
100000 loops, best of 3: 2.24 usec per loop
string.split with int() cast via generator-expression
python -mtimeit -s"import numpy;s='1|2'" "numpy.array(int(x) for x in s.split('|'))"
100000 loops, best of 3: 3.12 usec per loop
string.split with NumPy array of type int
python -mtimeit -s"import numpy;s='1|2'" "numpy.array(s.split('|'),dtype=int)"
100000 loops, best of 3: 9.22 usec per loop

Try this:
import numpy as np
s = '1|234|4456|789'
array = np.array([int(x) for x in s.split('|')])
... Assuming that the numbers are all ints. if not, replace int with float in the above snippet of code.
EDIT 1:
Alternatively, you can do this, it will only create one intermediate list (the one generated by split()):
array = np.array(s.split('|'), dtype=int)
EDIT 2:
And yet another way, possibly faster (thanks for all the comments, guys!):
array = np.fromiter(s.split("|"), dtype=int)

Related

What is the fastest way to compare beginning of a string?

Imagine list of strings like this one: ('{hello world} is awesome', 'Hello world is less awesome', '{hello world} is {awesome} too'). I want to check each string in for cycle for starting character, I think I have got 4 options:
if re.search(r'^\{', i):
if re.match(r'\{', i):
if i.startswith('{'):
if i[:1] == '{':
Which is the fastest one? Is there some even more faster than these 4 options?
Note: The starting string to compare could be longer, not only one letter, e.g. {hello
The fastest is i[0] == value, since it directly uses a pointer to the underlying array. Regex needs to (at least) parse the pattern, while startsWith has the overhead of a method call and creating a slice of that size before the actual comparison.
As #dsqdfg said in the comments, there is a timing function in python I've never known until now. I tried to measure them and there are some results:
python -m timeit -s 'text="{hello world}"' 'text[:6] == "{hello"'
1000000 loops, best of 3: 0.224 usec per loop
python -m timeit -s 'text="{hello world}"' 'text.startswith("{hello")'
1000000 loops, best of 3: 0.291 usec per loop
python -m timeit -s 'text="{hello world}"' 'import re' 're.match(r"\{hello", text)'
100000 loops, best of 3: 2.53 usec per loop
python -m timeit -s 'text="{hello world}"' 'import re' 're.search(r"^\{hello", text)'
100000 loops, best of 3: 2.86 usec per loop

python string multiplication implementation

I wonder best practice for string repetition.
I always heard I should not use for i in range(len(x)): string += x[i] pattern for strings joining and should use string = ''.join(x) instead due to inefficiency of addition operator implementation for python strings.
But speedtests are:
$ python -m timeit "100*'string'"
1000000 loops, best of 3: 0.23 usec per loop
$ python -m timeit "''.join(['string' for i in xrange(100)])"
100000 loops, best of 3: 6.45 usec per loop
What about implementation details for string multiplication? I know that str * n equals str.__imul__(n), but how is it implemented I don't know.

Converting str numbers in list to int and find out the sum of the list

I have a list, but the numbers in it are strings so I can't find the sum of the list, so I need help in converting the numbers in the list to int.
This is my code
def convertStr(cals):
ret = float(cals)
return ret
TotalCal = sum(cals)
So basically there is list called cals
and it looks like this
(20,45,...etc)
But the numbers in it are strings so when I try finding the sum like this
TotalCal = sum(cals)
And then run it shows an error saying that the list needs to be an int format
so the question is how do I convert all numbers in the list to int format?
If you have a different way of finding the sum of lists it will be good too.
You can use either the python builtin map or a list comprehension for this
def convertStr(cals):
ret = [float(i) for i in (cals)]
return ret
or
def convertStr(cals):
return map(float,cals)
Here are the timeit results for both the approaches
$ python -m timeit "cals = ['1','2','3','4'];[float(i) for i in (cals)]"
1000000 loops, best of 3: 0.804 usec per loop
$ python -m timeit "cals = ['1','2','3','4'];map(float,cals)"
1000000 loops, best of 3: 0.787 usec per loop
As you can see map is faster and more pythonic as compared to the list comprehension. This is discussed in full length here
map may be microscopically faster in some cases (when you're NOT making a lambda for the purpose, but using the same function in map and a listcomp). List comprehensions may be faster in other cases
Another way using itertools.imap. This is the fastest for long lists
from itertools import imap
TotalCal = sum(imap(float,cals)
And using timeit for a list with 1000 entries.
$ python -m timeit "import random;cals = [str(random.randint(0,100)) for r in range(1000)];sum(map(float,cals))"
1000 loops, best of 3: 1.38 msec per loop
$ python -m timeit "import random;cals = [str(random.randint(0,100)) for r in range(1000)];[float(i) for i in (cals)]"
1000 loops, best of 3: 1.39 msec per loop
$ python -m timeit "from itertools import imap;import random;cals = [str(random.randint(0,100)) for r in range(1000)];imap(float,cals)"
1000 loops, best of 3: 1.24 msec per loop
As Padraic mentions below, The imap way is the best way to go! It is fast1 and looks great! Inclusion of a library function has it's bearing on small lists only and not on large lists. Thus for large lists, imap is better suited.
1 List comprehension is still slower than map by 1 micro second!!! Thank god
sum(map(float,cals))
or
sum(float(i) for i in cals)

Are Numpy functions slow?

Numpy is supposed to be fast. However, when comparing Numpy ufuncs with standard Python functions I find that the latter are much faster.
For example,
aa = np.arange(1000000, dtype = float)
%timeit np.mean(aa) # 1000 loops, best of 3: 1.15 ms per loop
%timeit aa.mean # 10000000 loops, best of 3: 69.5 ns per loop
I got similar results with other Numpy functions like max, power. I was under the impression that Numpy has an overhead that makes it slower for small arrays but would be faster for large arrays. In the code above aa is not small: it has 1 million elements. Am I missing something?
Of course, Numpy is fast, only the functions seem to be slow:
bb = range(1000000)
%timeit mean(bb) # 1 loops, best of 3: 551 ms per loop
%timeit mean(list(bb)) # 10 loops, best of 3: 136 ms per loop
Others already pointed out that your comparison is not a real comparison (you are not calling the function + both are numpy).
But to give an answer to the question "Are numpy function slow?": generally speaking, no, numpy function are not slow (or not slower than plain python function). Off course there are some side notes to make:
'Slow' depends off course on what you compare with, and it can always faster. With things like cython, numexpr, numba, calling C-code, ... and others it is in many cases certainly possible to get faster results.
Numpy has a certain overhead, which can be significant in some cases. For example, as you already mentioned, numpy can be slower on small arrays and scalar math. For a comparison on this, see eg Are NumPy's math functions faster than Python's?
To make the comparison you wanted to make:
In [1]: import numpy as np
In [2]: aa = np.arange(1000000)
In [3]: bb = range(1000000)
For the mean (note, there is no mean function in python standard library: Calculating arithmetic mean (average) in Python):
In [4]: %timeit np.mean(aa)
100 loops, best of 3: 2.07 ms per loop
In [5]: %timeit float(sum(bb))/len(bb)
10 loops, best of 3: 69.5 ms per loop
For max, numpy vs plain python:
In [6]: %timeit np.max(aa)
1000 loops, best of 3: 1.52 ms per loop
In [7]: %timeit max(bb)
10 loops, best of 3: 31.2 ms per loop
As a final note, in the above comparison I used a numpy array (aa) for the numpy functions and a list (bb) for the plain python functions. If you would use a list with numpy functions, in this case it would again be slower:
In [10]: %timeit np.max(bb)
10 loops, best of 3: 115 ms per loop
because the list is first converted to an array (which consumes most of the time). So, if you want to rely on numpy in your application, it is important to make use of numpy arrays to store you data (or if you have a list, convert it to an array so this conversion has to be done only once).
You're not calling aa.mean. Put the function call parentheses on the end, to actually call it, and the speed difference will nearly vanish. (Both np.mean(aa) and aa.mean() are NumPy; neither uses Python builtins to do the math.)

What is faster for searching items in list, in operator or index()?

From this site, it says that list.index() is a linear search through the list.
And it also seems like in is also linear.
Is there any advantage to using one over the other?
If you want to compare different python approaches, such as the in operator versus .index(), use the timeit module to test the speed differences. Python data type complexities are documented on http://wiki.python.org/moin/TimeComplexity.
Do note that there is a big difference between in and .index(); the first one returns a boolean, the latter the index of the found item (an integer) or it'll raise an exception. It thus is (slightly) slower for the average case:
$ python -mtimeit -s 'a = list(range(10000))' '5000 in a'
10000 loops, best of 3: 107 usec per loop
$ python -mtimeit -s 'a = list(range(10000))' 'a.index(5000)'
10000 loops, best of 3: 111 usec per loop
If you need to optimize for membership testing, use a set() instead:
$ python -mtimeit -s 'a = set(range(10000))' '5000 in a'
10000000 loops, best of 3: 0.108 usec per loop

Categories