Which pairwise() implementation? - python

The documentation for itertools provides a recipe for a pairwise() function, which I've slightly modified below so that it returns (last_item, None) as the final pair:
from itertools import tee, izip_longest

def pairwise_tee(iterable):
    a, b = tee(iterable)
    next(b, None)
    return izip_longest(a, b)
However, it seemed to me that using tee() might be overkill (given that it's only being used to provide one step of look-ahead), so I tried writing an alternative that avoids it:
def pairwise_zed(iterator):
    a = next(iterator)
    for b in iterator:
        yield a, b
        a = b
    yield a, None
Note: it so happens that I know my input will be an iterator for my use case; I'm aware that the function above won't work with a plain (non-iterator) iterable such as a list or string. The requirement to accept an iterator is also why I'm not using something like izip_longest(iterable, iterable[1:]), by the way.
Testing both functions for speed gave the following results in Python 2.7.3:
>>> import random, string, timeit
>>> for length in range(0, 61, 10):
...     text = "".join(random.choice(string.ascii_letters) for n in range(length))
...     for variant in "tee", "zed":
...         test_case = "list(pairwise_%s(iter('%s')))" % (variant, text)
...         setup = "from __main__ import pairwise_%s" % variant
...         result = timeit.repeat(test_case, setup=setup, number=100000)
...         print "%2d %s %r" % (length, variant, result)
...     print
...
0 tee [0.4337780475616455, 0.42563915252685547, 0.42760396003723145]
0 zed [0.21209311485290527, 0.21059393882751465, 0.21039700508117676]
10 tee [0.4933490753173828, 0.4958930015563965, 0.4938509464263916]
10 zed [0.32074403762817383, 0.32239794731140137, 0.32340312004089355]
20 tee [0.6139161586761475, 0.6109561920166016, 0.6153261661529541]
20 zed [0.49281787872314453, 0.49651598930358887, 0.4942781925201416]
30 tee [0.7470319271087646, 0.7446520328521729, 0.7463529109954834]
30 zed [0.7085139751434326, 0.7165200710296631, 0.7171430587768555]
40 tee [0.8083810806274414, 0.8031280040740967, 0.8049719333648682]
40 zed [0.8273730278015137, 0.8248250484466553, 0.8298079967498779]
50 tee [0.8745720386505127, 0.9205660820007324, 0.878741979598999]
50 zed [0.9760301113128662, 0.9776301383972168, 0.978381872177124]
60 tee [0.9913749694824219, 0.9922418594360352, 0.9938201904296875]
60 zed [1.1071209907531738, 1.1063809394836426, 1.1069209575653076]
... so, it turns out that pairwise_tee() starts to outperform pairwise_zed() when there are about forty items. That's fine, as far as I'm concerned - on average, my input is likely to be under that threshold.
My question is: which should I use? pairwise_zed() looks like it'll be a little faster (and to my eyes is slightly easier to follow), but pairwise_tee() could be considered the "canonical" implementation by virtue of being taken from the official docs (to which I could link in a comment), and will work for any iterable - which isn't a consideration at this point, but I suppose could be later.
I was also wondering about potential gotchas if the iterator is interfered with outside the function, e.g.
for a, b in pairwise(iterator):
    # do something
    q = next(iterator)
... but as far as I can tell, pairwise_zed() and pairwise_tee() behave identically in that situation (and of course it would be a damn fool thing to do in the first place).

The itertools tee implementation is idiomatic for those experienced with itertools, though I'd be tempted to use islice instead of next to advance the leading iterator.
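For what it's worth, here is one hedged sketch of what that islice variant might look like (my reading of the suggestion, not code from the answer; pairwise_tee_islice is a made-up name):

from itertools import tee, islice, izip_longest

def pairwise_tee_islice(iterable):
    # islice(b, 1, None) skips b's first element, like the recipe's next(b, None)
    a, b = tee(iterable)
    return izip_longest(a, islice(b, 1, None))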
A disadvantage of your version is that it's less easy to extend it to n-wise iteration as your state is stored in local variables; I'd be tempted to use a deque:
from itertools import chain, islice, repeat
import collections

def pairwise_deque(iterator, n=2):
    it = chain(iterator, repeat(None, n - 1))
    d = collections.deque(islice(it, n - 1), maxlen=n)
    for a in it:
        d.append(a)
        yield tuple(d)
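A quick sanity check of the padding behaviour (my addition, not part of the answer, relying on the imports above):

print(list(pairwise_deque(iter('abc'))))        # [('a', 'b'), ('b', 'c'), ('c', None)]
print(list(pairwise_deque(iter('abcd'), n=3)))  # [('a', 'b', 'c'), ('b', 'c', 'd'),
                                                #  ('c', 'd', None), ('d', None, None)]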
A useful idiom is calling iter on the iterator parameter; this is an easy way to ensure your function works on any iterable.
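As a minimal sketch of that idiom applied to the question's function (my illustration; pairwise_zed_any is an invented name), note that iter() is a no-op on something that is already an iterator:

def pairwise_zed_any(iterable):
    iterator = iter(iterable)   # works for lists, strings, files, and iterators alike
    a = next(iterator)
    for b in iterator:
        yield a, b
        a = b
    yield a, None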

This is a subjective question; both versions are fine.
I would use tee, because it looks simpler to me: I know what tee does, so the first is immediately obvious, whereas with the second I have to think a little about the order in which you overwrite a at the end of each loop. The timings are small enough as to be probably irrelevant, but you're the judge of that.
Regarding your other question, from the tee docs:
Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.
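A small illustration of that warning (my example, relying only on the behaviour described in the quoted docs):

from itertools import tee

it = iter("abc")
a, b = tee(it)
next(it)         # advances the shared underlying iterator behind tee's back
print(next(a))   # prints 'b' -- the 'a' consumed above was never seen by the tee objects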

Related

Concatenate Big List Elements Efficiently

I want to make a list of elements where each element starts with 4 numbers and ends with 4 letters, covering every possible combination. This is my code:
import itertools

def char_range(c1, c2):
    """Generates the characters from `c1` to `c2`"""
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)

chars = list()
nums = list()

for combination in itertools.product(char_range('a','b'), repeat=4):
    chars.append(''.join(map(str, combination)))

for combination in itertools.product(range(10), repeat=4):
    nums.append(''.join(map(str, combination)))

c = [str(x)+y for x,y in itertools.product(nums,chars)]

for dd in c:
    print(dd)
This runs fine but when I use a bigger range of characters, such as (a-z) the program hogs the CPU and memory, and the PC becomes unresponsive. So how can I do this in a more efficient way?
The documentation of itertools says that "it is roughly equivalent to nested for-loops in a generator expression". So itertools.product is never an enemy of memory, but if you store its results in a list, that list is. Therefore:
for element in itertools.product(...):
    print element
is okay, but
myList = [element for element in itertools.product(...)]
or the equivalent loop of
for element in itertools.product(...):
    myList.append(element)
is not! So you want itertools to generate results for you, but you don't want to store them, rather use them as they are generated. Think about this line of your code:
c = [str(x)+y for x,y in itertools.product(nums,chars)]
Given that nums and chars can be huge lists, building another gigantic list of all combinations on top of them is definitely going to choke your system.
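For example, the smallest change that already helps is to turn that list comprehension into a generator expression (a sketch of mine, not part of the original answer; nums and chars here are still the lists from the question):

# Each combined string is now produced only when the final loop asks for it,
# so the gigantic combined list is never built.
c = (str(x) + y for x, y in itertools.product(nums, chars))
for dd in c:
    print(dd)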
Now, as mentioned in the comments, if you replace all the lists that are too fat to fit into the memory with generators (functions that just yield), memory is not going to be a concern anymore.
Here is my full code. I basically changed your lists of chars and nums to generators, and got rid of the final list of c.
import itertools

def char_range(c1, c2):
    """Generates the characters from `c1` to `c2`"""
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)

def char(a):
    for combination in itertools.product(char_range(str(a[0]), str(a[1])), repeat=4):
        yield ''.join(map(str, combination))

def num(n):
    for combination in itertools.product(range(n), repeat=4):
        yield ''.join(map(str, combination))

def final(one, two):
    for foo in char(one):
        for bar in num(two):
            print str(bar)+str(foo)
Now let's ask what every combination of ['a','b'] and range(2) is:
final(['a','b'],2)
Produces this:
0000aaaa
0001aaaa
0010aaaa
0011aaaa
0100aaaa
0101aaaa
0110aaaa
0111aaaa
1000aaaa
1001aaaa
1010aaaa
1011aaaa
1100aaaa
1101aaaa
1110aaaa
1111aaaa
0000aaab
0001aaab
0010aaab
0011aaab
0100aaab
0101aaab
0110aaab
0111aaab
1000aaab
1001aaab
1010aaab
1011aaab
1100aaab
1101aaab
1110aaab
1111aaab
0000aaba
0001aaba
0010aaba
0011aaba
0100aaba
0101aaba
0110aaba
0111aaba
1000aaba
1001aaba
1010aaba
1011aaba
1100aaba
1101aaba
1110aaba
1111aaba
0000aabb
0001aabb
0010aabb
0011aabb
0100aabb
0101aabb
0110aabb
0111aabb
1000aabb
1001aabb
1010aabb
1011aabb
1100aabb
1101aabb
1110aabb
1111aabb
0000abaa
0001abaa
0010abaa
0011abaa
0100abaa
0101abaa
0110abaa
0111abaa
1000abaa
1001abaa
1010abaa
1011abaa
1100abaa
1101abaa
1110abaa
1111abaa
0000abab
0001abab
0010abab
0011abab
0100abab
0101abab
0110abab
0111abab
1000abab
1001abab
1010abab
1011abab
1100abab
1101abab
1110abab
1111abab
0000abba
0001abba
0010abba
0011abba
0100abba
0101abba
0110abba
0111abba
1000abba
1001abba
1010abba
1011abba
1100abba
1101abba
1110abba
1111abba
0000abbb
0001abbb
0010abbb
0011abbb
0100abbb
0101abbb
0110abbb
0111abbb
1000abbb
1001abbb
1010abbb
1011abbb
1100abbb
1101abbb
1110abbb
1111abbb
0000baaa
0001baaa
0010baaa
0011baaa
0100baaa
0101baaa
0110baaa
0111baaa
1000baaa
1001baaa
1010baaa
1011baaa
1100baaa
1101baaa
1110baaa
1111baaa
0000baab
0001baab
0010baab
0011baab
0100baab
0101baab
0110baab
0111baab
1000baab
1001baab
1010baab
1011baab
1100baab
1101baab
1110baab
1111baab
0000baba
0001baba
0010baba
0011baba
0100baba
0101baba
0110baba
0111baba
1000baba
1001baba
1010baba
1011baba
1100baba
1101baba
1110baba
1111baba
0000babb
0001babb
0010babb
0011babb
0100babb
0101babb
0110babb
0111babb
1000babb
1001babb
1010babb
1011babb
1100babb
1101babb
1110babb
1111babb
0000bbaa
0001bbaa
0010bbaa
0011bbaa
0100bbaa
0101bbaa
0110bbaa
0111bbaa
1000bbaa
1001bbaa
1010bbaa
1011bbaa
1100bbaa
1101bbaa
1110bbaa
1111bbaa
0000bbab
0001bbab
0010bbab
0011bbab
0100bbab
0101bbab
0110bbab
0111bbab
1000bbab
1001bbab
1010bbab
1011bbab
1100bbab
1101bbab
1110bbab
1111bbab
0000bbba
0001bbba
0010bbba
0011bbba
0100bbba
0101bbba
0110bbba
0111bbba
1000bbba
1001bbba
1010bbba
1011bbba
1100bbba
1101bbba
1110bbba
1111bbba
0000bbbb
0001bbbb
0010bbbb
0011bbbb
0100bbbb
0101bbbb
0110bbbb
0111bbbb
1000bbbb
1001bbbb
1010bbbb
1011bbbb
1100bbbb
1101bbbb
1110bbbb
1111bbbb
Which is exactly the result you are looking for. Each element of this result is generated on the fly and never stored in a list, so memory is no longer a problem. You can now try much bigger operations, such as final(['a','z'], 10); they will still take a long time to run, but they won't freeze your machine by exhausting memory.
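As a rough scale check (my addition, not part of the original answer), you can count how many lines final(['a','z'], 10) would print without generating any of them:

n_chars = 26 ** 4          # four-letter suffixes drawn from 'a'..'z'
n_nums = 10 ** 4           # four-digit prefixes drawn from range(10)
print(n_chars * n_nums)    # 4569760000 lines -- fine for memory, but slow to print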

Python: itertools.product consuming too much resources

I've created a Python script that generates a list of words by permutation of characters. I'm using itertools.product to generate my permutations. My char list is composed of letters and numbers: 01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ. Here is my code:
#!/usr/bin/python
import itertools, hashlib, math

class Words:

    chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'

    def __init__(self, size):
        self.make(size)

    def getLenght(self, size):
        res = []
        for i in range(1, size+1):
            res.append(math.pow(len(self.chars), i))
        return sum(res)

    def getMD5(self, text):
        m = hashlib.md5()
        m.update(text.encode('utf-8'))
        return m.hexdigest()

    def make(self, size):
        file = open('res.txt', 'w+')
        res = []
        i = 1
        for i in range(1, size+1):
            prod = list(itertools.product(self.chars, repeat=i))
            res = res + prod
        j = 1
        for r in res:
            text = ''.join(r)
            md5 = self.getMD5(text)
            res = text+'\t'+md5
            print(res + ' %.3f%%' % (j/float(self.getLenght(size))*100))
            file.write(res+'\n')
            j = j + 1
        file.close()

Words(3)
This script works fine for words of at most 4 characters. If I try 5 or 6 characters, my computer consumes 100% of the CPU and RAM, and freezes.
Is there a way to restrict the use of those resources or optimize this heavy processing?
Does this do what you want?
I've made all the changes in the make method:
def make(self, size):
    # `file` is a builtin in Python 2, so use a different name.
    # Also, use a with statement for files used in only a small block:
    # it handles closing the file even if an error is raised.
    with open('res.txt', 'w+') as file_:
        for i in range(1, size+1):
            prod = itertools.product(self.chars, repeat=i)
            for j, r in enumerate(prod):
                text = ''.join(r)
                md5 = self.getMD5(text)
                res = text+'\t'+md5
                print(res + ' %.3f%%' % ((j+1)/float(self.getLenght(size))*100))
                file_.write(res+'\n')
Be warned: this will still write gigabytes of output to res.txt, but it will no longer exhaust your RAM.
EDIT: As noted by Padraic, there is no file builtin in Python 3, and since it is considered a "bad builtin" anyway, shadowing it isn't too worrying. Still, I'll name it file_ here.
EDIT2:
To explain why this works so much faster and better than the previous, original version, you need to know how lazy evaluation works.
Say we have a simple list comprehension like the following (Python 3 syntax; use xrange instead of range in Python 2):
a = [i for i in range(10**12)]
This immediately tries to build a trillion elements in memory, overflowing it.
So we can use a generator to solve this:
a = (i for i in range(10**12))
Here, none of the values have been evaluated yet; we have just given the interpreter instructions on how to produce them. We can then iterate through the items one by one and work on each separately, so almost nothing is in memory at any given time (only one integer at a time). This makes the seemingly impossible task very manageable.
The same is true of itertools: it lets you do memory-efficient, fast operations by working with iterators rather than lists or arrays.
In your example, you have 62 characters and want to do the cartesian product with 5 repeats, or 62**5 (nearly a billion elements, or over 30 gigabytes of RAM as a list). This is prohibitively large.
In order to solve this, we can use iterators.
chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'

for i in itertools.product(chars, repeat=5):
    print(i)
Here, only a single item from the cartesian product is in memory at a given time, meaning it is very memory efficient.
However, if you evaluate the full iterator using list(), it exhausts the iterator and stores everything in a list, meaning the nearly one billion combinations are suddenly in memory again. We don't need all the elements in memory at once, just one at a time; that is the power of iterators.
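If you do need a handful of concrete values, say for testing, islice lets you peek at the front of the product without materialising it; a small sketch of the idea (my addition):

from itertools import islice, product

chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'

# Only the first five 5-tuples are ever built; the remaining ~billion are not.
first_five = list(islice(product(chars, repeat=5), 5))
print(first_five)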
See the documentation for the itertools module, and any good explanation of iterators in Python 2 (mostly still true for Python 3), for more detail.

Recommended usage of Python dictionary, functions as values

I'm looking for some help understanding best practices regarding dictionaries in Python.
I have an example below:
def convert_to_celsius(temp, source):
    conversion_dict = {
        'kelvin': temp - 273.15,
        'romer': (temp - 7.5) * 40 / 21
    }
    return conversion_dict[source]

def convert_to_celsius_lambda(temp, source):
    conversion_dict = {
        'kelvin': lambda x: x - 273.15,
        'romer': lambda x: (x - 7.5) * 40 / 21
    }
    return conversion_dict[source](temp)
Obviously, the two functions achieve the same goal, but via different means. Could someone help me understand the subtle difference between the two, and what the 'best' way to go on about this would be?
If you have both dictionaries being created inside the function, then the former will be more efficient - although the former performs two calculations when only one is needed, there is more overhead in the latter version for creating the lambdas each time it's called:
>>> import timeit
>>> setup = "from __main__ import convert_to_celsius, convert_to_celsius_lambda, convert_to_celsius_lambda_once"
>>> timeit.timeit("convert_to_celsius(100, 'kelvin')", setup=setup)
0.5716437913429102
>>> timeit.timeit("convert_to_celsius_lambda(100, 'kelvin')", setup=setup)
0.6484164544288618
However, if you move the dictionary of lambdas outside the function:
CONVERSION_DICT = {
    'kelvin': lambda x: x - 273.15,
    'romer': lambda x: (x - 7.5) * 40 / 21
}

def convert_to_celsius_lambda_once(temp, source):
    return CONVERSION_DICT[source](temp)
then the latter is more efficient, as the lambda objects are only created once, and the function only does the necessary calculation on each call:
>>> timeit.timeit("convert_to_celsius_lambda_once(100, 'kelvin')", setup=setup)
0.3904035060131186
Note that this will only be a benefit where the function is being called a lot (in this case, 1,000,000 times), so that the overhead of creating the two lambda function objects is less than the time wasted in calculating two results when only one is needed.
The dictionary is totally pointless, since you need to re-create it on each call but all you ever do is a single look-up. Just use an if:
def convert_to_celsius(temp, source):
    if source == "kelvin": return temp - 273.15
    elif source == "romer": return (temp - 7.5) * 40 / 21
    raise KeyError("unknown temperature source '%s'" % source)
Even though both achieve the same thing, the first version is more readable and faster.
In your first example the arithmetic is carried out as soon as convert_to_celsius is called: both dictionary values are computed up front, whether they are needed or not.
In the second example only the required conversion is calculated.
If the conversions involved an expensive calculation, it would probably make sense to use functions as the values, but for this particular example that isn't required.
As others have pointed out, neither of your options is ideal. The first one does both calculations every time and has an unnecessary dict. The second one has to create the lambdas every time through. If this example is the whole goal, then I agree with unwind: just use an if statement. If the goal is to learn something that can be extended to other uses, I like this approach:
convert_to_celsius = {'kelvin': lambda temp: temp - 273.15,
                      'romer': lambda temp: (temp - 7.5) * 40 / 21}

newtemp = convert_to_celsius[source](temp)
Your calculation definitions are all stored together, and your function call is uncluttered and meaningful.
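A hypothetical usage sketch of that dict (my addition; to_celsius is an invented helper), wrapping the lookup so an unknown source gives a clearer error than a bare KeyError:

def to_celsius(temp, source):
    try:
        return convert_to_celsius[source](temp)
    except KeyError:
        raise ValueError("unknown temperature source %r" % (source,))

print(to_celsius(300, 'kelvin'))   # -> roughly 26.85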

Interpreting Hamming Distance speed in python

I've been working on making my Python more Pythonic and toying with the runtimes of short snippets of code. My goal is to improve readability, but additionally to speed up execution.
This example conflicts with the best practices I've been reading about, and I'm interested in finding where the flaw in my thought process is.
The problem is to compute the hamming distance on two equal length strings. For example the hamming distance of strings 'aaab' and 'aaaa' is 1.
The most straightforward implementation I could think of is as follows:
def hamming_distance_1(s_1, s_2):
    dist = 0
    for x in range(len(s_1)):
        if s_1[x] != s_2[x]: dist += 1
    return dist
Next I wrote two "pythonic" implementations:
import operator
import itertools as i   # the snippets below assume itertools is imported as `i`

def hamming_distance_2(s_1, s_2):
    return sum(i.imap(operator.countOf, s_1, s_2))
and
def hamming_distance_3(s_1, s_2):
    return sum(i.imap(lambda s: int(s[0] != s[1]), i.izip(s_1, s_2)))
In execution:
s_1 = (''.join(random.choice('ABCDEFG') for i in range(10000)))
s_2 = (''.join(random.choice('ABCDEFG') for i in range(10000)))
print 'ham_1 ', timeit.timeit('hamming_distance_1(s_1, s_2)', "from __main__ import s_1,s_2, hamming_distance_1",number=1000)
print 'ham_2 ', timeit.timeit('hamming_distance_2(s_1, s_2)', "from __main__ import s_1,s_2, hamming_distance_2",number=1000)
print 'ham_3 ', timeit.timeit('hamming_distance_3(s_1, s_2)', "from __main__ import s_1,s_2, hamming_distance_3",number=1000)
returning:
ham_1 1.84980392456
ham_2 3.26420593262
ham_3 3.98718094826
I expected that ham_3 would run slower than ham_2, due to the fact that calling a lambda is treated as a function call, which is slower than calling the built-in operator.countOf.
I was surprised I couldn't find a way to get a more pythonic version to run faster than ham_1, however. I have trouble believing that ham_1 is the lower bound for pure Python.
Thoughts anyone?
The key is making fewer method lookups and function calls:
def hamming_distance_4(s_1, s_2):
    return sum(i != j for i, j in i.izip(s_1, s_2))
runs at ham_4 1.10134792328 on my system.
ham_2 and ham_3 make lookups inside the loop, so they are slower.
I wonder if this might be a bit more Pythonic, in some broader sense: what about using scipy.spatial.distance.hamming (http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html), a function that already implements what you're looking for?
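For reference, a rough sketch of how that could look (my reading of the scipy docs, so treat the details as an assumption: distance.hamming expects array-like inputs and returns the fraction of mismatching positions, so you multiply by the length to get a count):

import numpy as np
from scipy.spatial.distance import hamming

s_1, s_2 = 'aaab', 'aaaa'
# hamming() returns the proportion of positions that differ, so scale by the length.
dist = hamming(np.array(list(s_1)), np.array(list(s_2))) * len(s_1)
print(dist)   # 1.0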

Reprioritizing priority queue (efficient manner)

I'm looking for a more efficient way to reprioritize items in a priority queue. I have a (quite naive) priority queue implementation based on heapq. The relevant parts look like this:
from heapq import heapify, heappop

class pq(object):
    def __init__(self, init=None):
        self.inner, self.item_f = [], {}
        if not None is init:
            self.inner = [[priority, item] for item, priority in enumerate(init)]
            heapify(self.inner)
            self.item_f = {pi[1]: pi for pi in self.inner}

    def top_one(self):
        if not len(self.inner): return None
        priority, item = heappop(self.inner)
        del self.item_f[item]
        return item, priority

    def re_prioritize(self, items, prioritizer=lambda x: x + 1):
        for item in items:
            if not item in self.item_f: continue
            entry = self.item_f[item]
            entry[0] = prioritizer(entry[0])
        heapify(self.inner)
And here is a simple coroutine, just to demonstrate the reprioritization characteristics of my real application.
def fecther(priorities, prioritizer=lambda x: x + 1):
    q = pq(priorities)
    for k in xrange(len(priorities) + 1):
        items = (yield k, q.top_one())
        if not None is items:
            q.re_prioritize(items, prioritizer)
With testing
if __name__ == '__main__':
    def gen_tst(n=3):
        priorities = range(n)
        priorities.reverse()
        priorities = priorities + range(n)

        def tst():
            result, f = range(2 * n), fecther(priorities)
            k, item_t = f.next()
            while not None is item_t:
                result[k] = item_t[0]
                k, item_t = f.send(range(item_t[0]))
            return result

        return tst
producing:
In []: gen_tst()()
Out[]: [2, 3, 4, 5, 1, 0]
In []: t= gen_tst(123)
In []: %timeit t()
10 loops, best of 3: 26 ms per loop
Now, my question is: does there exist any data structure which would avoid the calls to heapify(.) when reprioritizing the priority queue? I'm willing to trade memory for speed here, but it should be possible to implement it in pure Python (obviously with much better timings than my naive implementation).
Update:
To help you understand the specific case better, let's assume that no items are added to the queue after the initial (batch) pushes, and that every fetch (pop) from the queue generates a number of reprioritizations roughly following this scheme:
0* n, very seldom
0.05* n, typically
n, very seldom
where n is the current number of items in the queue. Thus, in any round, there are only relatively few items to reprioritize. I'm hoping there could be a data structure able to exploit this pattern and therefore outperform the cost of doing the mandatory heapify(.) in every round (in order to satisfy the heap invariant).
Update 2:
So far it seems that the heapify(.) approach is indeed quite efficient (relatively speaking). All the alternatives I have been able to figure out need to use heappush(.), and that seems to be more expensive than I originally anticipated. (Anyway, if the situation stays like this, I'll be forced to find a better solution outside the Python realm.)
Since the new prioritization function may have no relationship to the previous one, you have to pay the cost of getting the new ordering (and it's at minimum O(n) just to find the minimum element in the new ordering). If you have a small, fixed number of prioritization functions and switch frequently between them, then you could benefit from keeping a separate heap going for each function (although not with heapq, because it doesn't support cheaply locating and removing an object from the middle of a heap).
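One pure-Python direction worth sketching, not from the answer above and essentially the lazy-deletion recipe from the heapq documentation: instead of re-heapifying the whole queue, invalidate the old entry and heappush a fresh one, so each reprioritization costs O(log n) and stale entries are skipped when popped. Whether this beats the per-round heapify depends on how many items are touched per round, which is exactly the concern raised in Update 2. LazyPQ and its method names are my own, illustrative choices.

import heapq
import itertools

REMOVED = object()                          # sentinel marking an invalidated entry

class LazyPQ(object):
    def __init__(self):
        self.heap = []                      # entries are mutable lists: [priority, count, item]
        self.entry_for = {}                 # item -> its live entry
        self.counter = itertools.count()    # tie-breaker so items themselves never get compared

    def push(self, item, priority):
        # Also serves as "reprioritize": invalidate the old entry and push a new one.
        if item in self.entry_for:
            self.entry_for[item][2] = REMOVED
        entry = [priority, next(self.counter), item]
        self.entry_for[item] = entry
        heapq.heappush(self.heap, entry)    # O(log n), no full heapify

    def pop(self):
        # Discard stale entries lazily until a live one (or nothing) is found.
        while self.heap:
            priority, _, item = heapq.heappop(self.heap)
            if item is not REMOVED:
                del self.entry_for[item]
                return item, priority
        return None

The trade-off is the extra memory held by stale entries until they are popped, which matches the stated willingness to trade memory for speed.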
