Finding symmetric difference of 2 sets separating them by origin - python

I have two sets:
>>> a = {1,2,3}
>>> b = {2,3,4,5,6}
And I would like to get two new sets with non common elements, first set containing elements from a and second from b, like ({1}, {4,5,6}), or like:
>>> c = a&b # Common elements
>>> d = a^b # Symmetric difference
>>> (a-b, b-a)
({1}, {4, 5, 6})
>>> (a-c, b-c)
({1}, {4, 5, 6})
>>> (a&d, b&d)
({1}, {4, 5, 6})
My problem is that I'm going to use this on a large number of SHA-1 hashes and I'm worried about performance. What is the proper way to do this efficiently?
Note: around 95% of the elements will be common to a and b, 1% will be only in a and 4% only in b.

The methods I mentioned in the question have the following performance:
>>> timeit.timeit('a-b; b-a', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
135.45828826893307
>>> timeit.timeit('c=a&b; a-c; b-c', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
189.98522938665735
>>> timeit.timeit('d=a^b; a&d; b&d', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
238.35084129583106
So the most efficient way seems to be the (a-b, b-a) method.
I'm posting this as a reference so that other answers can add new methods rather than compare the ones I've found.
Python implemented function
Just out of curiosity, I tried implementing my own Python function to do this (it works on pre-sorted iterables):
def symmetric_diff(l1, l2):
    # l1 and l2 have to be sorted and contain comparable elements
    r1 = []
    r2 = []
    i1 = iter(l1)
    i2 = iter(l2)
    try:
        e1 = next(i1)
    except StopIteration:
        return ([], list(i2))
    try:
        e2 = next(i2)
    except StopIteration:
        return ([e1] + list(i1), [])
    try:
        while True:
            if e1 == e2:
                e1 = next(i1)
                e2 = next(i2)
            elif e1 > e2:
                r2.append(e2)
                e2 = next(i2)
            else:
                r1.append(e1)
                e1 = next(i1)
    except StopIteration:
        if e1 == e2:
            return (r1 + list(i1), r2 + list(i2))
        elif e1 > e2:
            return (r1 + [e1] + list(i1), r2 + list(i2))
        else:
            return (r1 + list(i1), r2 + [e2] + list(i2))
Compared to the other methods, this one is quite slow:
>>> t = timeit.Timer(lambda: symmetric_diff(a,b))
>>> t.timeit(1000)
542.3225249653769
So unless some other method is implemented somewhere (for example, in a library for working with sets), I think using two set differences is the most efficient way of doing this.
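As a minimal sketch of how the (a-b, b-a) approach might look on actual SHA-1 digests: the helper names sha1_hex and split_by_origin are made up for this illustration, and the inputs are toy data rather than the asker's real hashes.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 hex digest of some bytes (hypothetical helper)."""
    return hashlib.sha1(data).hexdigest()


def split_by_origin(a, b):
    """Return (elements only in a, elements only in b) for two sets."""
    return a - b, b - a


# Toy data: digests of the decimal strings "0".."999" and "50".."1199".
a = {sha1_hex(str(i).encode()) for i in range(0, 1000)}
b = {sha1_hex(str(i).encode()) for i in range(50, 1200)}

only_a, only_b = split_by_origin(a, b)
print(len(only_a), len(only_b))  # 50 200
```

Both differences are computed in C by the set type, which is probably why the pure-Python merge above cannot compete with them.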

Related

Best practice to split a list based on a condition [duplicate]

I have some code like:
good = [x for x in mylist if x in goodvals]
bad = [x for x in mylist if x not in goodvals]
The goal is to split up the contents of mylist into two other lists, based on whether or not they meet a condition.
How can I do this more elegantly? Can I avoid doing two separate iterations over mylist? Can I improve performance by doing so?
Iterate manually, using the condition to select a list to which each element will be appended:
good, bad = [], []
for x in mylist:
    (bad, good)[x in goodvals].append(x)
good = [x for x in mylist if x in goodvals]
bad = [x for x in mylist if x not in goodvals]
How can I do this more elegantly?
That code is already perfectly elegant.
There might be slight performance improvements using sets, but the difference is trivial. set based approaches will also discard duplicates and will not preserve the order of elements. I find the list comprehension far easier to read, too.
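A small illustration of that caveat, using made-up sample values: the list comprehension keeps order and duplicates, while the set-based version keeps neither.

```python
mylist = [3, 1, 3, 2, 1]
goodvals = [1, 3]

# The list comprehension keeps input order and repeated elements.
good_list = [x for x in mylist if x in goodvals]

# The set intersection collapses duplicates and has no defined order.
good_set = set(mylist) & set(goodvals)

print(good_list)           # [3, 1, 3, 1]
print(good_set == {1, 3})  # True
```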
In fact, we could even more simply just use a for loop:
good, bad = [], []
for x in mylist:
    if x in goodvals:
        good.append(x)
    else:
        bad.append(x)
This approach makes it easier to add additional logic. For example, the code is easily modified to discard None values:
good, bad = [], []
for x in mylist:
    if x is None:
        continue
    if x in goodvals:
        good.append(x)
    else:
        bad.append(x)
Here's the lazy iterator approach:
from itertools import tee
def split_on_condition(seq, condition):
    l1, l2 = tee((condition(item), item) for item in seq)
    return (i for p, i in l1 if p), (i for p, i in l2 if not p)
It evaluates the condition once per item and returns two generators, first yielding values from the sequence where the condition is true, the other where it's false.
Because it's lazy you can use it on any iterator, even an infinite one:
from itertools import count, islice
def is_prime(n):
    return n > 1 and all(n % i for i in range(2, n))
primes, not_primes = split_on_condition(count(), is_prime)
print("First 10 primes", list(islice(primes, 10)))
print("First 10 non-primes", list(islice(not_primes, 10)))
Usually though the non-lazy list returning approach is better:
def split_on_condition(seq, condition):
    a, b = [], []
    for item in seq:
        (a if condition(item) else b).append(item)
    return a, b
Edit: For your more specific use case of splitting items into different lists by some key, here's a generic function that does that:
DROP_VALUE = lambda _:_
def split_by_key(seq, resultmapping, keyfunc, default=DROP_VALUE):
    """Split a sequence into lists based on a key function.
    seq - input sequence
    resultmapping - a dictionary that maps from target lists to keys that go to that list
    keyfunc - function to calculate the key of an input value
    default - the target where items that don't have a corresponding key go, by default they are dropped
    """
    result_lists = dict((key, []) for key in resultmapping)
    appenders = dict((key, result_lists[target].append) for target, keys in resultmapping.items() for key in keys)
    if default is not DROP_VALUE:
        result_lists.setdefault(default, [])
        default_action = result_lists[default].append
    else:
        default_action = DROP_VALUE
    for item in seq:
        appenders.get(keyfunc(item), default_action)(item)
    return result_lists
Usage:
def file_extension(f):
    return f[2].lower()
split_files = split_by_key(files, {'images': IMAGE_TYPES}, keyfunc=file_extension, default='anims')
print(split_files['images'])
print(split_files['anims'])
The problem with all the proposed solutions is that they scan the list and apply the filtering function twice. I'd make a simple small function like this:
def split_into_two_lists(lst, f):
    a = []
    b = []
    for elem in lst:
        if f(elem):
            a.append(elem)
        else:
            b.append(elem)
    return a, b
That way you are not processing anything twice and also are not repeating code.
My take on it. I propose a lazy, single-pass, partition function,
which preserves relative order in the output subsequences.
1. Requirements
I assume that the requirements are:
maintain elements' relative order (hence, no sets and dictionaries)
evaluate the condition only once for every element (hence not using (i)filter or groupby)
allow for lazy consumption of either sequence (if we can afford to precompute them, then the naïve implementation is likely to be acceptable too)
2. split library
My partition function (introduced below) and other similar functions
have made it into a small library:
python-split
It's installable normally via PyPI:
pip install --user split
To split a list based on a condition, use the partition function:
>>> from split import partition
>>> files = [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi') ]
>>> image_types = ('.jpg','.jpeg','.gif','.bmp','.png')
>>> images, other = partition(lambda f: f[-1] in image_types, files)
>>> list(images)
[('file1.jpg', 33L, '.jpg')]
>>> list(other)
[('file2.avi', 999L, '.avi')]
3. partition function explained
Internally we need to build two subsequences at once, so consuming
only one output sequence will force the other one to be computed
too. And we need to keep state between user requests (store processed
but not yet requested elements). To keep state, I use two double-ended
queues (deques):
from collections import deque
The SplitSeq class takes care of the housekeeping:
class SplitSeq:
    def __init__(self, condition, sequence):
        self.cond = condition
        self.goods = deque([])
        self.bads = deque([])
        self.seq = iter(sequence)
The magic happens in its .getNext() method. It is almost like next()
on an iterator, but lets us specify which kind of element we want
this time. Behind the scenes it doesn't discard the rejected elements,
but instead puts them in one of the two queues:
    def getNext(self, getGood=True):
        if getGood:
            these, those, cond = self.goods, self.bads, self.cond
        else:
            these, those, cond = self.bads, self.goods, lambda x: not self.cond(x)
        if these:
            return these.popleft()
        else:
            while 1:  # exit on StopIteration
                n = next(self.seq)  # works on Python 2.6+ and Python 3
                if cond(n):
                    return n
                else:
                    those.append(n)
The end user is supposed to use partition function. It takes a
condition function and a sequence (just like map or filter), and
returns two generators. The first generator builds a subsequence of
elements for which the condition holds, the second one builds the
complementary subsequence. Iterators and generators allow for lazy
splitting of even long or infinite sequences.
def partition(condition, sequence):
    cond = condition if condition else bool  # evaluate as bool if condition == None
    ss = SplitSeq(cond, sequence)
    def goods():
        while 1:
            try:
                yield ss.getNext(getGood=True)
            except StopIteration:
                return  # on Python 3.7+ a leaked StopIteration would become RuntimeError
    def bads():
        while 1:
            try:
                yield ss.getNext(getGood=False)
            except StopIteration:
                return
    return goods(), bads()
I chose the test function to be the first argument to facilitate
partial application in the future (similar to how map and filter
have the test function as the first argument).
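For readers on Python 3, the same idea (one shared source iterator plus a queue per output) can be sketched with plain generators. This is not the code of the split library; lazy_partition and its internals are names invented for this illustration.

```python
from collections import deque


def lazy_partition(condition, iterable):
    """Hypothetical sketch: lazily split iterable into (goods, bads)."""
    source = iter(iterable)
    # One buffer per outcome, holding items seen but not yet requested.
    queues = {True: deque(), False: deque()}

    def pull(want):
        mine = queues[want]
        while True:
            if mine:
                yield mine.popleft()
                continue
            for item in source:
                verdict = bool(condition(item))  # evaluated once per item
                if verdict == want:
                    yield item
                    break
                queues[verdict].append(item)  # park it for the other side
            else:
                return  # source exhausted and nothing buffered for us

    return pull(True), pull(False)


evens, odds = lazy_partition(lambda n: n % 2 == 0, range(10))
print(next(odds))   # 1 (0 was buffered for evens along the way)
print(list(evens))  # [0, 2, 4, 6, 8]
print(list(odds))   # [3, 5, 7, 9]
```

Note that consuming only one side of an infinite source can make the other side's buffer grow without bound, which is the price of evaluating the condition only once.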
I basically like Anders' approach as it is very general. Here's a version that puts the categorizer first (to match filter syntax) and uses a defaultdict (assumed imported).
from collections import defaultdict
def categorize(func, seq):
    """Return mapping from categories to lists
    of categorized items.
    """
    d = defaultdict(list)
    for item in seq:
        d[func(item)].append(item)
    return d
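A possible way to use it for the two-way split from the question; the sample files list and the IMAGE_TYPES set below are illustrative values, not taken from this answer.

```python
from collections import defaultdict


def categorize(func, seq):
    """Return mapping from categories to lists of categorized items."""
    d = defaultdict(list)
    for item in seq:
        d[func(item)].append(item)
    return d


# Illustrative sample data.
IMAGE_TYPES = {'.jpg', '.jpeg', '.gif', '.bmp', '.png'}
files = [('file1.jpg', 33, '.jpg'), ('file2.avi', 999, '.avi'), ('file3.gif', 123, '.gif')]

by_kind = categorize(lambda f: f[2].lower() in IMAGE_TYPES, files)
images, anims = by_kind[True], by_kind[False]
print(images)  # [('file1.jpg', 33, '.jpg'), ('file3.gif', 123, '.gif')]
print(anims)   # [('file2.avi', 999, '.avi')]
```

Because the result is a defaultdict, asking for a category that never occurred simply yields an empty list.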
First go (pre-OP-edit): Use sets:
mylist = [1,2,3,4,5,6,7]
goodvals = [1,3,7,8,9]
myset = set(mylist)
goodset = set(goodvals)
print(list(myset.intersection(goodset)))  # [1, 3, 7]
print(list(myset.difference(goodset)))  # [2, 4, 5, 6]
That's good for both readability (IMHO) and performance.
Second go (post-OP-edit):
Create your list of good extensions as a set:
IMAGE_TYPES = set(['.jpg','.jpeg','.gif','.bmp','.png'])
and that will increase performance. Otherwise, what you have looks fine to me.
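A rough way to check that claim on your own machine is to time membership tests against a tuple and a set. The container size and the probe value below are arbitrary assumptions, and the actual numbers will vary; for a handful of extensions the gap may be small.

```python
import timeit

# Arbitrary test data: 50 fake extensions and a value that is not among them.
good_tuple = tuple(f".ext{i}" for i in range(50))
good_set = set(good_tuple)
probe = ".missing"

# A miss is the worst case for the tuple, which must be scanned to the end.
t_tuple = timeit.timeit(lambda: probe in good_tuple, number=100_000)
t_set = timeit.timeit(lambda: probe in good_set, number=100_000)

print(f"tuple: {t_tuple:.4f}s  set: {t_set:.4f}s")
```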
itertools.groupby almost does what you want, except it requires the items to be sorted to ensure that you get a single contiguous range, so you need to sort by your key first (otherwise you'll get multiple interleaved groups for each type), e.g.
import itertools
def is_good(f):
    return f[2].lower() in IMAGE_TYPES
files = [('file1.jpg', 33, '.jpg'), ('file2.avi', 999, '.avi'), ('file3.gif', 123, '.gif')]
for key, group in itertools.groupby(sorted(files, key=is_good), key=is_good):
    print(key, list(group))
gives:
False [('file2.avi', 999, '.avi')]
True [('file1.jpg', 33, '.jpg'), ('file3.gif', 123, '.gif')]
Similar to the other solutions, the key func can be defined to divide into any number of groups you want.
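To see why the sort matters, here is a small sketch on made-up numbers comparing groupby with and without it.

```python
from itertools import groupby


def is_odd(n):
    return n % 2 == 1


values = [1, 3, 2, 5, 4]

# Without sorting, each run of equal keys becomes its own group.
unsorted_groups = [(k, list(g)) for k, g in groupby(values, key=is_odd)]
print(unsorted_groups)
# [(True, [1, 3]), (False, [2]), (True, [5]), (False, [4])]

# sorted() is stable, so relative order inside each group is preserved.
sorted_groups = [(k, list(g)) for k, g in groupby(sorted(values, key=is_odd), key=is_odd)]
print(sorted_groups)
# [(False, [2, 4]), (True, [1, 3, 5])]
```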
Elegant and Fast
Inspired by DanSalmo's comment, here is a solution that is concise, elegant, and at the same time is one of the fastest solutions.
good_set = set(goodvals)
good, bad = [], []
for item in my_list:
    good.append(item) if item in good_set else bad.append(item)
Tip: Turning goodvals into a set gives us an easy speed boost.
Fastest
For maximum speed, we take the fastest answer and turbocharge it by turning good_list into a set. That alone gives us a 40%+ speed boost, and we end up with a solution that is more than 5.5x as fast as the slowest solution, even while it remains readable.
good_list_set = set(good_list) # 40%+ faster than a tuple.
good, bad = [], []
for item in my_origin_list:
    if item in good_list_set:
        good.append(item)
    else:
        bad.append(item)
A little shorter
This is a more concise version of the previous answer.
good_list_set = set(good_list) # 40%+ faster than a tuple.
good, bad = [], []
for item in my_origin_list:
    out = good if item in good_list_set else bad
    out.append(item)
Elegance can be somewhat subjective, but some of the Rube Goldberg style solutions that are cute and ingenious are quite concerning and should not be used in production code in any language, let alone python which is elegant at heart.
Benchmark results:
filter_BJHomer 80/s -- -3265% -5312% -5900% -6262% -7273% -7363% -8051% -8162% -8244%
zip_Funky 118/s 4848% -- -3040% -3913% -4450% -5951% -6085% -7106% -7271% -7393%
two_lst_tuple_JohnLaRoy 170/s 11332% 4367% -- -1254% -2026% -4182% -4375% -5842% -6079% -6254%
if_else_DBR 195/s 14392% 6428% 1434% -- -882% -3348% -3568% -5246% -5516% -5717%
two_lst_compr_Parand 213/s 16750% 8016% 2540% 967% -- -2705% -2946% -4786% -5083% -5303%
if_else_1_line_DanSalmo 292/s 26668% 14696% 7189% 5033% 3707% -- -331% -2853% -3260% -3562%
tuple_if_else 302/s 27923% 15542% 7778% 5548% 4177% 343% -- -2609% -3029% -3341%
set_1_line 409/s 41308% 24556% 14053% 11035% 9181% 3993% 3529% -- -569% -991%
set_shorter 434/s 44401% 26640% 15503% 12303% 10337% 4836% 4345% 603% -- -448%
set_if_else 454/s 46952% 28358% 16699% 13349% 11290% 5532% 5018% 1100% 469% --
The full benchmark code for Python 3.7 (modified from FunkySayu):
good_list = ['.jpg','.jpeg','.gif','.bmp','.png']
import random
import string
my_origin_list = []
for i in range(10000):
    fname = ''.join(random.choice(string.ascii_lowercase) for i in range(random.randrange(10)))
    if random.getrandbits(1):
        fext = random.choice(list(good_list))
    else:
        fext = "." + ''.join(random.choice(string.ascii_lowercase) for i in range(3))
    my_origin_list.append((fname + fext, random.randrange(1000), fext))
# Parand
def two_lst_compr_Parand(*_):
    return [e for e in my_origin_list if e[2] in good_list], [e for e in my_origin_list if not e[2] in good_list]
# dbr
def if_else_DBR(*_):
    a, b = list(), list()
    for e in my_origin_list:
        if e[2] in good_list:
            a.append(e)
        else:
            b.append(e)
    return a, b
# John La Rooy
def two_lst_tuple_JohnLaRoy(*_):
    a, b = list(), list()
    for e in my_origin_list:
        (b, a)[e[2] in good_list].append(e)
    return a, b
# # Ants Aasma
# def f4():
#     l1, l2 = tee((e[2] in good_list, e) for e in my_origin_list)
#     return [i for p, i in l1 if p], [i for p, i in l2 if not p]
# My personal way to do
def zip_Funky(*_):
    a, b = zip(*[(e, None) if e[2] in good_list else (None, e) for e in my_origin_list])
    return list(filter(None, a)), list(filter(None, b))
# BJ Homer
def filter_BJHomer(*_):
    return list(filter(lambda e: e[2] in good_list, my_origin_list)), list(filter(lambda e: not e[2] in good_list, my_origin_list))
# ChaimG's answer; as a list.
def if_else_1_line_DanSalmo(*_):
    good, bad = [], []
    for e in my_origin_list:
        _ = good.append(e) if e[2] in good_list else bad.append(e)
    return good, bad
# ChaimG's answer; as a set.
def set_1_line(*_):
    good_list_set = set(good_list)
    good, bad = [], []
    for e in my_origin_list:
        _ = good.append(e) if e[2] in good_list_set else bad.append(e)
    return good, bad
# ChaimG set and if else list.
def set_shorter(*_):
    good_list_set = set(good_list)
    good, bad = [], []
    for e in my_origin_list:
        out = good if e[2] in good_list_set else bad
        out.append(e)
    return good, bad
# ChaimG's best answer; if else as a set.
def set_if_else(*_):
    good_list_set = set(good_list)
    good, bad = [], []
    for e in my_origin_list:
        if e[2] in good_list_set:
            good.append(e)
        else:
            bad.append(e)
    return good, bad
# ChaimG's best answer; if else as a tuple.
def tuple_if_else(*_):
    good_list_tuple = tuple(good_list)
    good, bad = [], []
    for e in my_origin_list:
        if e[2] in good_list_tuple:
            good.append(e)
        else:
            bad.append(e)
    return good, bad
def cmpthese(n=0, functions=None):
    results = {}
    for func_name in functions:
        args = ['%s(range(256))' % func_name, 'from __main__ import %s' % func_name]
        t = Timer(*args)
        results[func_name] = 1 / (t.timeit(number=n) / n)  # passes/sec
    functions_sorted = sorted(functions, key=results.__getitem__)
    for f in functions_sorted:
        diff = []
        for func in functions_sorted:
            if func == f:
                diff.append("--")
            else:
                diff.append(f"{results[f]/results[func]*100 - 100:5.0%}")
        diffs = " ".join(f'{x:>8s}' for x in diff)
        print(f"{f:27s} \t{results[f]:,.0f}/s {diffs}")
if __name__=='__main__':
    from timeit import Timer
    cmpthese(1000, 'two_lst_compr_Parand if_else_DBR two_lst_tuple_JohnLaRoy zip_Funky filter_BJHomer if_else_1_line_DanSalmo set_1_line set_if_else tuple_if_else set_shorter'.split(" "))
good.append(x) if x in goodvals else bad.append(x)
This elegant and concise answer by @dansalmo showed up buried in the comments, so I'm just reposting it here as an answer so it can get the prominence it deserves, especially for new readers.
Complete example:
good, bad = [], []
for x in my_list:
    good.append(x) if x in goodvals else bad.append(x)
bad = []
good = [x for x in mylist if x in goodvals or bad.append(x)]
append returns None, so it works.
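Spelled out on sample values (chosen here purely for illustration): when x is not in goodvals, the right-hand side of the or runs, appends x to bad, and evaluates to None, so the comprehension skips x.

```python
mylist = [1, 2, 3, 4, 5, 6, 7]
goodvals = {1, 3, 7, 8, 9}

bad = []
# "x in goodvals" short-circuits for good items; otherwise bad.append(x)
# runs as a side effect and its None result filters x out of good.
good = [x for x in mylist if x in goodvals or bad.append(x)]

print(good)  # [1, 3, 7]
print(bad)   # [2, 4, 5, 6]
print([].append(0) is None)  # True
```

The trick relies on a side effect inside a comprehension, which many readers will find surprising, so a plain loop may be the safer choice in shared code.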
Personally, I like the version you cited, assuming you already have a list of goodvals hanging around. If not, something like:
good = filter(lambda x: is_good(x), mylist)
bad = filter(lambda x: not is_good(x), mylist)
Of course, that's really very similar to using a list comprehension like you originally did, but with a function instead of a lookup:
good = [x for x in mylist if is_good(x)]
bad = [x for x in mylist if not is_good(x)]
In general, I find the aesthetics of list comprehensions to be very pleasing. Of course, if you don't actually need to preserve ordering and don't need duplicates, using the intersection and difference methods on sets would work well too.
If you want to make it in FP style:
good, bad = [ sum(x, []) for x in zip(*(([y], []) if y in goodvals else ([], [y])
for y in mylist)) ]
Not the most readable solution, but at least iterates through mylist only once.
Sometimes it looks like a list comprehension is not the best thing to use!
I made a little test based on the answers people gave to this topic, tested on a randomly generated list. Here is the generation of the list (there's probably a better way to do it, but that's not the point):
good_list = ('.jpg','.jpeg','.gif','.bmp','.png')
import random
import string
my_origin_list = []
for i in xrange(10000):
    fname = ''.join(random.choice(string.lowercase) for i in range(random.randrange(10)))
    if random.getrandbits(1):
        fext = random.choice(good_list)
    else:
        fext = "." + ''.join(random.choice(string.lowercase) for i in range(3))
    my_origin_list.append((fname + fext, random.randrange(1000), fext))
And here we go
from itertools import tee  # needed by f4
# Parand
def f1():
    return [e for e in my_origin_list if e[2] in good_list], [e for e in my_origin_list if not e[2] in good_list]
# dbr
def f2():
    a, b = list(), list()
    for e in my_origin_list:
        if e[2] in good_list:
            a.append(e)
        else:
            b.append(e)
    return a, b
# John La Rooy
def f3():
    a, b = list(), list()
    for e in my_origin_list:
        (b, a)[e[2] in good_list].append(e)
    return a, b
# Ants Aasma
def f4():
    l1, l2 = tee((e[2] in good_list, e) for e in my_origin_list)
    return [i for p, i in l1 if p], [i for p, i in l2 if not p]
# My personal way to do
def f5():
    a, b = zip(*[(e, None) if e[2] in good_list else (None, e) for e in my_origin_list])
    return list(filter(None, a)), list(filter(None, b))
# BJ Homer
def f6():
    return filter(lambda e: e[2] in good_list, my_origin_list), filter(lambda e: not e[2] in good_list, my_origin_list)
Using the cmpthese function, the best result is the dbr answer:
f1 204/s -- -5% -14% -15% -20% -26%
f6 215/s 6% -- -9% -11% -16% -22%
f3 237/s 16% 10% -- -2% -7% -14%
f4 240/s 18% 12% 2% -- -6% -13%
f5 255/s 25% 18% 8% 6% -- -8%
f2 277/s 36% 29% 17% 15% 9% --
from itertools import filterfalse, tee
def partition(pred, iterable):
    'Use a predicate to partition entries into false entries and true entries'
    # partition(is_odd, range(10)) --> 0 2 4 6 8 and 1 3 5 7 9
    t1, t2 = tee(iterable)
    return filterfalse(pred, t1), filter(pred, t2)
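A short usage sketch of this recipe. The calls list is an addition made here only to show that the predicate runs twice per element, once in each of the two filters.

```python
from itertools import filterfalse, tee


def partition(pred, iterable):
    """Partition entries into false entries and true entries."""
    t1, t2 = tee(iterable)
    return filterfalse(pred, t1), filter(pred, t2)


calls = []  # records every predicate call, for illustration only


def is_odd(n):
    calls.append(n)
    return n % 2 == 1


evens, odds = partition(is_odd, range(10))
print(list(evens))  # [0, 2, 4, 6, 8]
print(list(odds))   # [1, 3, 5, 7, 9]
print(len(calls))   # 20
```

So this version is lazy and short, but it is not a fit when the predicate is expensive or has side effects.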
Check this
I think a generalization of splitting an iterable based on N conditions is handy:
from collections import OrderedDict
def partition(iterable, *conditions):
    '''Returns one list per condition, holding the elements that satisfy it.
    Conditions are assumed to be exclusive'''
    d = OrderedDict((i, list()) for i in range(len(conditions)))
    for e in iterable:
        for i, condition in enumerate(conditions):
            if condition(e):
                d[i].append(e)
                break
    return d.values()
For instance:
ints, floats, other = partition([2, 3.14, 1, 1.69, [], None],
                                lambda x: isinstance(x, int),
                                lambda x: isinstance(x, float),
                                lambda x: True)
print(" ints: {}\n floats:{}\n other:{}".format(ints, floats, other))
ints: [2, 1]
floats:[3.14, 1.69]
other:[[], None]
If the element may satisfy multiple conditions, remove the break.
Yet another solution to this problem. I needed a solution that is as fast as possible. That means only one iteration over the list and preferably O(1) for adding data to one of the resulting lists. This is very similar to the solution provided by sastanin, except much shorter:
from collections import deque
def split(iterable, function):
    dq_true = deque()
    dq_false = deque()
    # deque - the fastest way to consume an iterator and append items
    deque((
        (dq_true if function(item) else dq_false).append(item) for item in iterable
    ), maxlen=0)
    return dq_true, dq_false
Then, you can use the function in the following way:
lower, higher = split([0,1,2,3,4,5,6,7,8,9], lambda x: x < 5)
selected, other = split([0,1,2,3,4,5,6,7,8,9], lambda x: x in {0,4,9})
If you're not fine with the resulting deque objects, you can easily convert them to a list, a set, or whatever you like (for example list(lower)). The conversion is much faster than constructing the lists directly.
This method keeps the order of the items, as well as any duplicates.
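The deque(..., maxlen=0) call in that function is only there to drive the generator to exhaustion: a deque with maxlen=0 discards every item it receives. A tiny sketch of the idiom in isolation, with made-up values:

```python
from collections import deque

seen = []

# The generator expression runs seen.append(...) for each n as a side effect;
# the zero-length deque pulls every item and keeps none of them.
sink = deque((seen.append(n * n) for n in range(5)), maxlen=0)

print(seen)       # [0, 1, 4, 9, 16]
print(len(sink))  # 0
```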
If you don't mind using an external library, there are two I know of that natively implement this operation:
>>> files = [ ('file1.jpg', 33, '.jpg'), ('file2.avi', 999, '.avi')]
>>> IMAGE_TYPES = ('.jpg','.jpeg','.gif','.bmp','.png')
iteration_utilities.partition:
>>> from iteration_utilities import partition
>>> notimages, images = partition(files, lambda x: x[2].lower() in IMAGE_TYPES)
>>> notimages
[('file2.avi', 999, '.avi')]
>>> images
[('file1.jpg', 33, '.jpg')]
more_itertools.partition
>>> from more_itertools import partition
>>> notimages, images = partition(lambda x: x[2].lower() in IMAGE_TYPES, files)
>>> list(notimages) # returns a generator so you need to explicitly convert to list.
[('file2.avi', 999, '.avi')]
>>> list(images)
[('file1.jpg', 33, '.jpg')]
For example, splitting a list into even and odd numbers:
from functools import reduce  # reduce is a built-in only on Python 2
arr = range(20)
even, odd = reduce(lambda res, n: res[n % 2].append(n) or res, arr, ([], []))
Or in general:
def split(predicate, iterable):
    return reduce(lambda res, e: res[predicate(e)].append(e) or res, iterable, ([], []))
Advantages:
Shortest possible way
Predicate is applied only once for each element
Disadvantages:
Requires knowledge of the functional programming paradigm
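A runnable Python 3 sketch of the general version, under two assumptions added here: reduce is imported from functools, and the predicate result is passed through bool() so that any truthy value maps to index 1.

```python
from functools import reduce


def split(predicate, iterable):
    """Return (items where predicate is falsy, items where it is truthy)."""
    # append() returns None, so "or res" hands the accumulator on to the next step.
    return reduce(
        lambda res, e: res[bool(predicate(e))].append(e) or res,
        iterable,
        ([], []),
    )


not_small, small = split(lambda n: n < 3, range(6))
print(not_small)  # [3, 4, 5]
print(small)      # [0, 1, 2]
```

Note the order of the result: index 0 (False) comes first, so the items that fail the predicate are returned before the ones that pass.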
Inspired by @gnibbler's great (but terse!) answer, we can apply that approach to map to multiple partitions:
from collections import defaultdict
def splitter(l, mapper):
    """Split an iterable into multiple partitions generated by a callable mapper."""
    results = defaultdict(list)
    for x in l:
        results[mapper(x)] += [x]
    return results
splitter can then be used as follows:
>>> l = [1, 2, 3, 4, 2, 3, 4, 5, 6, 4, 3, 2, 3]
>>> split = splitter(l, lambda x: x % 2 == 0) # partition l into odds and evens
>>> split.items()
[(False, [1, 3, 3, 5, 3, 3]), (True, [2, 4, 2, 4, 6, 4, 2])]
This works for more than two partitions with a more complicated mapping (and on iterators, too):
>>> import math
>>> l = xrange(1, 23)
>>> split = splitter(l, lambda x: int(math.log10(x) * 5))
>>> split.items()
[(0, [1]),
(1, [2]),
(2, [3]),
(3, [4, 5, 6]),
(4, [7, 8, 9]),
(5, [10, 11, 12, 13, 14, 15]),
(6, [16, 17, 18, 19, 20, 21, 22])]
Or using a dictionary to map:
>>> map = {'A': 1, 'X': 2, 'B': 3, 'Y': 1, 'C': 2, 'Z': 3}
>>> l = ['A', 'B', 'C', 'C', 'X', 'Y', 'Z', 'A', 'Z']
>>> split = splitter(l, map.get)
>>> split.items()
[(1, ['A', 'Y', 'A']), (2, ['C', 'C', 'X']), (3, ['B', 'Z', 'Z'])]
solution
from itertools import tee
def unpack_args(fn):
    return lambda t: fn(*t)
def separate(fn, lx):
    return map(
        unpack_args(
            lambda i, ly: filter(
                lambda el: bool(i) == fn(el),
                ly)),
        enumerate(tee(lx, 2)))
test
[even, odd] = separate(
    lambda x: bool(x % 2),
    [1, 2, 3, 4, 5])
print(list(even) == [2, 4])
print(list(odd) == [1, 3, 5])
If the list is made of groups and intermittent separators, you can use:
def split(items, p):
    groups = [[]]
    for i in items:
        if p(i):
            groups.append([])
        groups[-1].append(i)
    return groups
Usage:
split(range(1,11), lambda x: x % 3 == 0)
# gives [[1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]
Use Boolean logic to assign data to two arrays
>>> images, anims = [[i for i in files if t ^ (i[2].lower() in IMAGE_TYPES) ] for t in (0, 1)]
>>> images
[('file1.jpg', 33, '.jpg')]
>>> anims
[('file2.avi', 999, '.avi')]
For performance, try itertools.
The itertools module standardizes a core set of fast, memory efficient tools that are useful by themselves or in combination. Together, they form an “iterator algebra” making it possible to construct specialized tools succinctly and efficiently in pure Python.
See itertools.ifilter or imap (these are Python 2 names; in Python 3 the built-in filter and map are already lazy).
itertools.ifilter(predicate, iterable)
Make an iterator that filters elements from iterable returning only those for which the predicate is True
If you insist on clever, you could take Winden's solution and add just a bit of spurious cleverness:
def splay(l, f, d=None):
    d = d or {}
    for x in l:
        d.setdefault(f(x), []).append(x)
    return d
Sometimes you won't need that other half of the list.
For example:
import sys
trustedPeople = sys.argv[1].split(',')
newName = sys.argv[2]
# Python 3: the built-in filter is lazy, like itertools.ifilter was on Python 2
myFriends = filter(lambda x: x.startswith('Shi'), trustedPeople)
print('%s is %smy friend.' % (newName, 'not ' if newName not in myFriends else ''))
Already quite a few solutions here, but yet another way of doing that would be -
anims = []
images = [f for f in files if (lambda t: True if f[2].lower() in IMAGE_TYPES else anims.append(t) and False)(f)]
Iterates over the list only once, and looks a bit more pythonic and hence readable to me.
>>> files = [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi'), ('file1.bmp', 33L, '.bmp')]
>>> IMAGE_TYPES = ('.jpg','.jpeg','.gif','.bmp','.png')
>>> anims = []
>>> images = [f for f in files if (lambda t: True if f[2].lower() in IMAGE_TYPES else anims.append(t) and False)(f)]
>>> print '\n'.join([str(anims), str(images)])
[('file2.avi', 999L, '.avi')]
[('file1.jpg', 33L, '.jpg'), ('file1.bmp', 33L, '.bmp')]
>>>
I'd take a 2-pass approach, separating evaluation of the predicate from filtering the list:
def partition(pred, iterable):
    xs = list(zip(map(pred, iterable), iterable))
    return [x[1] for x in xs if x[0]], [x[1] for x in xs if not x[0]]
What's nice about this, performance-wise (in addition to evaluating pred only once on each member of iterable), is that it moves a lot of logic out of the interpreter and into highly-optimized iteration and mapping code. This can speed up iteration over long iterables, as described in this answer.
Expressivity-wise, it takes advantage of expressive idioms like comprehensions and mapping.
Not sure if this is a good approach, but it can be done this way as well:
from functools import reduce
IMAGE_TYPES = ('.jpg','.jpeg','.gif','.bmp','.png')
files = [('file1.jpg', 33, '.jpg'), ('file2.avi', 999, '.avi')]
# Python 3 no longer allows tuple parameters in a lambda, so the accumulator is indexed instead
images, anims = reduce(lambda acc, f: (acc[0] + [f], acc[1]) if f[2] in IMAGE_TYPES else (acc[0], acc[1] + [f]), files, ([], []))
Yet another answer, short but "evil" (for list-comprehension side effects).
digits = list(range(10))
odd = [digits.pop(i) for i, x in enumerate(digits) if x % 2]
>>> odd
[1, 3, 5, 7, 9]
>>> digits
[0, 2, 4, 6, 8]

Is there way to sort cases in list comprehension to produce two outputs? [duplicate]

I have some code like:
good = [x for x in mylist if x in goodvals]
bad = [x for x in mylist if x not in goodvals]
The goal is to split up the contents of mylist into two other lists, based on whether or not they meet a condition.
How can I do this more elegantly? Can I avoid doing two separate iterations over mylist? Can I improve performance by doing so?
Iterate manually, using the condition to select a list to which each element will be appended:
good, bad = [], []
for x in mylist:
(bad, good)[x in goodvals].append(x)
good = [x for x in mylist if x in goodvals]
bad = [x for x in mylist if x not in goodvals]
How can I do this more elegantly?
That code is already perfectly elegant.
There might be slight performance improvements using sets, but the difference is trivial. set based approaches will also discard duplicates and will not preserve the order of elements. I find the list comprehension far easier to read, too.
In fact, we could even more simply just use a for loop:
good, bad = [], []
for x in mylist:
if x in goodvals:
good.append(f)
else:
bad.append(f)
This approach makes it easier to add additional logic. For example, the code is easily modified to discard None values:
good, bad = [], []
for x in mylist:
if x is None:
continue
if x in goodvals:
good.append(f)
else:
bad.append(f)
Here's the lazy iterator approach:
from itertools import tee
def split_on_condition(seq, condition):
l1, l2 = tee((condition(item), item) for item in seq)
return (i for p, i in l1 if p), (i for p, i in l2 if not p)
It evaluates the condition once per item and returns two generators, first yielding values from the sequence where the condition is true, the other where it's false.
Because it's lazy you can use it on any iterator, even an infinite one:
from itertools import count, islice
def is_prime(n):
return n > 1 and all(n % i for i in xrange(2, n))
primes, not_primes = split_on_condition(count(), is_prime)
print("First 10 primes", list(islice(primes, 10)))
print("First 10 non-primes", list(islice(not_primes, 10)))
Usually though the non-lazy list returning approach is better:
def split_on_condition(seq, condition):
a, b = [], []
for item in seq:
(a if condition(item) else b).append(item)
return a, b
Edit: For your more specific usecase of splitting items into different lists by some key, heres a generic function that does that:
DROP_VALUE = lambda _:_
def split_by_key(seq, resultmapping, keyfunc, default=DROP_VALUE):
"""Split a sequence into lists based on a key function.
seq - input sequence
resultmapping - a dictionary that maps from target lists to keys that go to that list
keyfunc - function to calculate the key of an input value
default - the target where items that don't have a corresponding key go, by default they are dropped
"""
result_lists = dict((key, []) for key in resultmapping)
appenders = dict((key, result_lists[target].append) for target, keys in resultmapping.items() for key in keys)
if default is not DROP_VALUE:
result_lists.setdefault(default, [])
default_action = result_lists[default].append
else:
default_action = DROP_VALUE
for item in seq:
appenders.get(keyfunc(item), default_action)(item)
return result_lists
Usage:
def file_extension(f):
return f[2].lower()
split_files = split_by_key(files, {'images': IMAGE_TYPES}, keyfunc=file_extension, default='anims')
print split_files['images']
print split_files['anims']
Problem with all proposed solutions is that it will scan and apply the filtering function twice. I'd make a simple small function like this:
def split_into_two_lists(lst, f):
a = []
b = []
for elem in lst:
if f(elem):
a.append(elem)
else:
b.append(elem)
return a, b
That way you are not processing anything twice and also are not repeating code.
My take on it. I propose a lazy, single-pass, partition function,
which preserves relative order in the output subsequences.
1. Requirements
I assume that the requirements are:
maintain elements' relative order (hence, no sets and
dictionaries)
evaluate condition only once for every element (hence not using
(i)filter or groupby)
allow for lazy consumption of either sequence (if we can afford to
precompute them, then the naïve implementation is likely to be
acceptable too)
2. split library
My partition function (introduced below) and other similar functions
have made it into a small library:
python-split
It's installable normally via PyPI:
pip install --user split
To split a list base on condition, use partition function:
>>> from split import partition
>>> files = [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi') ]
>>> image_types = ('.jpg','.jpeg','.gif','.bmp','.png')
>>> images, other = partition(lambda f: f[-1] in image_types, files)
>>> list(images)
[('file1.jpg', 33L, '.jpg')]
>>> list(other)
[('file2.avi', 999L, '.avi')]
3. partition function explained
Internally we need to build two subsequences at once, so consuming
only one output sequence will force the other one to be computed
too. And we need to keep state between user requests (store processed
but not yet requested elements). To keep state, I use two double-ended
queues (deques):
from collections import deque
SplitSeq class takes care of the housekeeping:
class SplitSeq:
def __init__(self, condition, sequence):
self.cond = condition
self.goods = deque([])
self.bads = deque([])
self.seq = iter(sequence)
The magic happens in its .getNext() method. It is almost like the .next()
method of iterators, but lets us specify which kind of element we want
this time. Behind the scenes it doesn't discard the rejected elements,
but instead puts them in one of the two queues:
    def getNext(self, getGood=True):
        if getGood:
            these, those, cond = self.goods, self.bads, self.cond
        else:
            these, those, cond = self.bads, self.goods, lambda x: not self.cond(x)
        if these:
            return these.popleft()
        else:
            while 1: # exit on StopIteration
                n = self.seq.next()
                if cond(n):
                    return n
                else:
                    those.append(n)
The end user is supposed to use partition function. It takes a
condition function and a sequence (just like map or filter), and
returns two generators. The first generator builds a subsequence of
elements for which the condition holds, the second one builds the
complementary subsequence. Iterators and generators allow for lazy
splitting of even long or infinite sequences.
def partition(condition, sequence):
    cond = condition if condition else bool  # evaluate as bool if condition == None
    ss = SplitSeq(cond, sequence)
    def goods():
        while 1:
            yield ss.getNext(getGood=True)
    def bads():
        while 1:
            yield ss.getNext(getGood=False)
    return goods(), bads()
I chose the test function to be the first argument to facilitate
partial application in the future (similar to how map and filter
have the test function as the first argument).
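Note that the code above is Python 2 (self.seq.next()), and it ends the generators by letting StopIteration escape from inside them; since Python 3.7 (PEP 479) that surfaces as a RuntimeError instead. Below is a minimal Python 3 sketch of the same idea: one shared source iterator and two deques. The name lazy_partition and its internals are hypothetical and are not the API of the split library.

```python
from collections import deque


def lazy_partition(condition, iterable):
    """Return two lazy iterators: (items passing condition, items failing it).

    Each item is tested exactly once and relative order is preserved.
    """
    source = iter(iterable)
    # Items already pulled from the source but not yet requested by their side.
    queues = {True: deque(), False: deque()}

    def pull(want):
        mine = queues[want]
        while True:
            if mine:
                # The other side queued something for us while it was searching.
                yield mine.popleft()
                continue
            for item in source:
                verdict = bool(condition(item))
                if verdict == want:
                    yield item
                    break
                # Not ours: park it for the other iterator.
                queues[verdict].append(item)
            else:
                # Source exhausted and nothing is queued for this side.
                return

    return pull(True), pull(False)


evens, odds = lazy_partition(lambda n: n % 2 == 0, range(10))
print(next(odds), next(evens), list(odds), list(evens))
```

Consuming only one of the two iterators still buffers the elements that belong to the other, so memory use depends on how unevenly the two sides are consumed.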
I basically like Anders' approach as it is very general. Here's a version that puts the categorizer first (to match filter syntax) and uses a defaultdict (assumed imported).
def categorize(func, seq):
"""Return mapping from categories to lists
of categorized items.
"""
d = defaultdict(list)
for item in seq:
d[func(item)].append(item)
return d
First go (pre-OP-edit): Use sets:
mylist = [1,2,3,4,5,6,7]
goodvals = [1,3,7,8,9]
myset = set(mylist)
goodset = set(goodvals)
print list(myset.intersection(goodset)) # [1, 3, 7]
print list(myset.difference(goodset)) # [2, 4, 5, 6]
That's good for both readability (IMHO) and performance.
Second go (post-OP-edit):
Create your list of good extensions as a set:
IMAGE_TYPES = set(['.jpg','.jpeg','.gif','.bmp','.png'])
and that will increase performance. Otherwise, what you have looks fine to me.
itertools.groupby almost does what you want, except it requires the items to be sorted to ensure that you get a single contiguous range, so you need to sort by your key first (otherwise you'll get multiple interleaved groups for each type). eg.
def is_good(f):
return f[2].lower() in IMAGE_TYPES
files = [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi'), ('file3.gif', 123L, '.gif')]
for key, group in itertools.groupby(sorted(files, key=is_good), key=is_good):
print key, list(group)
gives:
False [('file2.avi', 999L, '.avi')]
True [('file1.jpg', 33L, '.jpg'), ('file3.gif', 123L, '.gif')]
Similar to the other solutions, the key func can be defined to divide into any number of groups you want.
Elegant and Fast
Inspired by DanSalmo's comment, here is a solution that is concise, elegant, and at the same time is one of the fastest solutions.
good_set = set(goodvals)
good, bad = [], []
for item in my_list:
good.append(item) if item in good_set else bad.append(item)
Tip: Turning goodvals into a set gives us an easy speed boost.
Fastest
For maximum speed, we take the fastest answer and turbocharge it by turning good_list into a set. That alone gives us a 40%+ speed boost, and we end up with a solution that is more than 5.5x as fast as the slowest solution, even while it remains readable.
good_list_set = set(good_list) # 40%+ faster than a tuple.
good, bad = [], []
for item in my_origin_list:
if item in good_list_set:
good.append(item)
else:
bad.append(item)
A little shorter
This is a more concise version of the previous answer.
good_list_set = set(good_list) # 40%+ faster than a tuple.
good, bad = [], []
for item in my_origin_list:
out = good if item in good_list_set else bad
out.append(item)
Elegance can be somewhat subjective, but some of the Rube Goldberg style solutions that are cute and ingenious are quite concerning and should not be used in production code in any language, let alone python which is elegant at heart.
Benchmark results:
filter_BJHomer 80/s -- -3265% -5312% -5900% -6262% -7273% -7363% -8051% -8162% -8244%
zip_Funky 118/s 4848% -- -3040% -3913% -4450% -5951% -6085% -7106% -7271% -7393%
two_lst_tuple_JohnLaRoy 170/s 11332% 4367% -- -1254% -2026% -4182% -4375% -5842% -6079% -6254%
if_else_DBR 195/s 14392% 6428% 1434% -- -882% -3348% -3568% -5246% -5516% -5717%
two_lst_compr_Parand 213/s 16750% 8016% 2540% 967% -- -2705% -2946% -4786% -5083% -5303%
if_else_1_line_DanSalmo 292/s 26668% 14696% 7189% 5033% 3707% -- -331% -2853% -3260% -3562%
tuple_if_else 302/s 27923% 15542% 7778% 5548% 4177% 343% -- -2609% -3029% -3341%
set_1_line 409/s 41308% 24556% 14053% 11035% 9181% 3993% 3529% -- -569% -991%
set_shorter 434/s 44401% 26640% 15503% 12303% 10337% 4836% 4345% 603% -- -448%
set_if_else 454/s 46952% 28358% 16699% 13349% 11290% 5532% 5018% 1100% 469% --
The full benchmark code for Python 3.7 (modified from FunkySayu):
good_list = ['.jpg','.jpeg','.gif','.bmp','.png']
import random
import string
my_origin_list = []
for i in range(10000):
fname = ''.join(random.choice(string.ascii_lowercase) for i in range(random.randrange(10)))
if random.getrandbits(1):
fext = random.choice(list(good_list))
else:
fext = "." + ''.join(random.choice(string.ascii_lowercase) for i in range(3))
my_origin_list.append((fname + fext, random.randrange(1000), fext))
# Parand
def two_lst_compr_Parand(*_):
return [e for e in my_origin_list if e[2] in good_list], [e for e in my_origin_list if not e[2] in good_list]
# dbr
def if_else_DBR(*_):
a, b = list(), list()
for e in my_origin_list:
if e[2] in good_list:
a.append(e)
else:
b.append(e)
return a, b
# John La Rooy
def two_lst_tuple_JohnLaRoy(*_):
a, b = list(), list()
for e in my_origin_list:
(b, a)[e[2] in good_list].append(e)
return a, b
# # Ants Aasma
# def f4():
# l1, l2 = tee((e[2] in good_list, e) for e in my_origin_list)
# return [i for p, i in l1 if p], [i for p, i in l2 if not p]
# My personal way to do
def zip_Funky(*_):
a, b = zip(*[(e, None) if e[2] in good_list else (None, e) for e in my_origin_list])
return list(filter(None, a)), list(filter(None, b))
# BJ Homer
def filter_BJHomer(*_):
return list(filter(lambda e: e[2] in good_list, my_origin_list)), list(filter(lambda e: not e[2] in good_list, my_origin_list))
# ChaimG's answer; as a list.
def if_else_1_line_DanSalmo(*_):
good, bad = [], []
for e in my_origin_list:
_ = good.append(e) if e[2] in good_list else bad.append(e)
return good, bad
# ChaimG's answer; as a set.
def set_1_line(*_):
good_list_set = set(good_list)
good, bad = [], []
for e in my_origin_list:
_ = good.append(e) if e[2] in good_list_set else bad.append(e)
return good, bad
# ChaimG set and if else list.
def set_shorter(*_):
good_list_set = set(good_list)
good, bad = [], []
for e in my_origin_list:
out = good if e[2] in good_list_set else bad
out.append(e)
return good, bad
# ChaimG's best answer; if else as a set.
def set_if_else(*_):
good_list_set = set(good_list)
good, bad = [], []
for e in my_origin_list:
if e[2] in good_list_set:
good.append(e)
else:
bad.append(e)
return good, bad
# ChaimG's answer; if else as a tuple.
def tuple_if_else(*_):
good_list_tuple = tuple(good_list)
good, bad = [], []
for e in my_origin_list:
if e[2] in good_list_tuple:
good.append(e)
else:
bad.append(e)
return good, bad
def cmpthese(n=0, functions=None):
results = {}
for func_name in functions:
args = ['%s(range(256))' % func_name, 'from __main__ import %s' % func_name]
t = Timer(*args)
results[func_name] = 1 / (t.timeit(number=n) / n) # passes/sec
functions_sorted = sorted(functions, key=results.__getitem__)
for f in functions_sorted:
diff = []
for func in functions_sorted:
if func == f:
diff.append("--")
else:
diff.append(f"{results[f]/results[func]*100 - 100:5.0%}")
diffs = " ".join(f'{x:>8s}' for x in diff)
print(f"{f:27s} \t{results[f]:,.0f}/s {diffs}")
if __name__=='__main__':
from timeit import Timer
cmpthese(1000, 'two_lst_compr_Parand if_else_DBR two_lst_tuple_JohnLaRoy zip_Funky filter_BJHomer if_else_1_line_DanSalmo set_1_line set_if_else tuple_if_else set_shorter'.split(" "))
good.append(x) if x in goodvals else bad.append(x)
This elegant and concise answer by @dansalmo was buried in the comments, so I'm reposting it here as an answer so it can get the prominence it deserves, especially for new readers.
Complete example:
good, bad = [], []
for x in my_list:
good.append(x) if x in goodvals else bad.append(x)
bad = []
good = [x for x in mylist if x in goodvals or bad.append(x)]
append returns None, so it works.
Personally, I like the version you cited, assuming you already have a list of goodvals hanging around. If not, something like:
good = filter(lambda x: is_good(x), mylist)
bad = filter(lambda x: not is_good(x), mylist)
Of course, that's really very similar to using a list comprehension like you originally did, but with a function instead of a lookup:
good = [x for x in mylist if is_good(x)]
bad = [x for x in mylist if not is_good(x)]
In general, I find the aesthetics of list comprehensions to be very pleasing. Of course, if you don't actually need to preserve ordering and don't need duplicates, using the intersection and difference methods on sets would work well too.
If you want to make it in FP style:
good, bad = [ sum(x, []) for x in zip(*(([y], []) if y in goodvals else ([], [y])
for y in mylist)) ]
Not the most readable solution, but at least iterates through mylist only once.
Sometimes it looks like a list comprehension is not the best thing to use!
I made a little test based on the answers people gave in this topic, run on a randomly generated list. Here is the generation of the list (there's probably a better way to do it, but that's not the point):
good_list = ('.jpg','.jpeg','.gif','.bmp','.png')
import random
import string
my_origin_list = []
for i in xrange(10000):
fname = ''.join(random.choice(string.lowercase) for i in range(random.randrange(10)))
if random.getrandbits(1):
fext = random.choice(good_list)
else:
fext = "." + ''.join(random.choice(string.lowercase) for i in range(3))
my_origin_list.append((fname + fext, random.randrange(1000), fext))
And here we go
# Parand
def f1():
return [e for e in my_origin_list if e[2] in good_list], [e for e in my_origin_list if not e[2] in good_list]
# dbr
def f2():
a, b = list(), list()
for e in my_origin_list:
if e[2] in good_list:
a.append(e)
else:
b.append(e)
return a, b
# John La Rooy
def f3():
a, b = list(), list()
for e in my_origin_list:
(b, a)[e[2] in good_list].append(e)
return a, b
# Ants Aasma
def f4():
l1, l2 = tee((e[2] in good_list, e) for e in my_origin_list)
return [i for p, i in l1 if p], [i for p, i in l2 if not p]
# My personal way to do
def f5():
a, b = zip(*[(e, None) if e[2] in good_list else (None, e) for e in my_origin_list])
return list(filter(None, a)), list(filter(None, b))
# BJ Homer
def f6():
return filter(lambda e: e[2] in good_list, my_origin_list), filter(lambda e: not e[2] in good_list, my_origin_list)
Using the cmpthese function, the best result is dbr's answer:
f1 204/s -- -5% -14% -15% -20% -26%
f6 215/s 6% -- -9% -11% -16% -22%
f3 237/s 16% 10% -- -2% -7% -14%
f4 240/s 18% 12% 2% -- -6% -13%
f5 255/s 25% 18% 8% 6% -- -8%
f2 277/s 36% 29% 17% 15% 9% --
def partition(pred, iterable):
'Use a predicate to partition entries into false entries and true entries'
# partition(is_odd, range(10)) --> 0 2 4 6 8 and 1 3 5 7 9
t1, t2 = tee(iterable)
return filterfalse(pred, t1), filter(pred, t2)
Check this
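On Python 3 the recipe above needs tee and filterfalse from itertools (ifilterfalse and ifilter on Python 2). Here is a self-contained sketch with a usage example; is_odd is a hypothetical helper. Note that the false entries come first in the result, and that pred is called twice per element, once in each tee branch:

```python
from itertools import filterfalse, tee


def partition(pred, iterable):
    """Use a predicate to split entries into (false entries, true entries)."""
    t1, t2 = tee(iterable)
    return filterfalse(pred, t1), filter(pred, t2)


def is_odd(n):
    return n % 2 == 1


evens, odds = partition(is_odd, range(10))
print(list(evens))  # [0, 2, 4, 6, 8]
print(list(odds))   # [1, 3, 5, 7, 9]
```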
I think a generalization that splits an iterable based on N conditions is handy
from collections import OrderedDict
def partition(iterable,*conditions):
'''Returns a list with the elements that satisfy each of condition.
Conditions are assumed to be exclusive'''
d= OrderedDict((i,list())for i in range(len(conditions)))
for e in iterable:
for i,condition in enumerate(conditions):
if condition(e):
d[i].append(e)
break
return d.values()
For instance:
ints,floats,other = partition([2, 3.14, 1, 1.69, [], None],
lambda x: isinstance(x, int),
lambda x: isinstance(x, float),
lambda x: True)
print " ints: {}\n floats:{}\n other:{}".format(ints,floats,other)
ints: [2, 1]
floats:[3.14, 1.69]
other:[[], None]
If the element may satisfy multiple conditions, remove the break.
Yet another solution to this problem. I needed a solution that is as fast as possible. That means only one iteration over the list and preferably O(1) for adding data to one of the resulting lists. This is very similar to the solution provided by sastanin, except much shorter:
from collections import deque
def split(iterable, function):
dq_true = deque()
dq_false = deque()
# deque - the fastest way to consume an iterator and append items
deque((
(dq_true if function(item) else dq_false).append(item) for item in iterable
), maxlen=0)
return dq_true, dq_false
Then, you can use the function in the following way:
lower, higher = split([0,1,2,3,4,5,6,7,8,9], lambda x: x < 5)
selected, other = split([0,1,2,3,4,5,6,7,8,9], lambda x: x in {0,4,9})
If you're not happy with the resulting deque objects, you can easily convert them to a list, set, or whatever you like (for example list(lower)). The conversion is much faster than constructing the lists directly.
This method keeps the order of the items, as well as any duplicates.
If you don't mind using an external library, there are two I know of that natively implement this operation:
>>> files = [ ('file1.jpg', 33, '.jpg'), ('file2.avi', 999, '.avi')]
>>> IMAGE_TYPES = ('.jpg','.jpeg','.gif','.bmp','.png')
iteration_utilities.partition:
>>> from iteration_utilities import partition
>>> notimages, images = partition(files, lambda x: x[2].lower() in IMAGE_TYPES)
>>> notimages
[('file2.avi', 999, '.avi')]
>>> images
[('file1.jpg', 33, '.jpg')]
more_itertools.partition
>>> from more_itertools import partition
>>> notimages, images = partition(lambda x: x[2].lower() in IMAGE_TYPES, files)
>>> list(notimages) # returns a generator so you need to explicitly convert to list.
[('file2.avi', 999, '.avi')]
>>> list(images)
[('file1.jpg', 33, '.jpg')]
For example, splitting list by even and odd
from functools import reduce  # needed on Python 3; reduce is a builtin on Python 2
arr = range(20)
even, odd = reduce(lambda res, next: res[next % 2].append(next) or res, arr, ([], []))
Or in general:
def split(predicate, iterable):
return reduce(lambda res, e: res[predicate(e)].append(e) or res, iterable, ([], []))
Advantages:
Shortest possible way
Predicate is applied only once for each element
Disadvantages:
Requires knowledge of the functional programming paradigm
Inspired by @gnibbler's great (but terse!) answer, we can apply that approach to map to multiple partitions:
from collections import defaultdict
def splitter(l, mapper):
"""Split an iterable into multiple partitions generated by a callable mapper."""
results = defaultdict(list)
for x in l:
results[mapper(x)] += [x]
return results
splitter can then be used as follows:
>>> l = [1, 2, 3, 4, 2, 3, 4, 5, 6, 4, 3, 2, 3]
>>> split = splitter(l, lambda x: x % 2 == 0) # partition l into odds and evens
>>> split.items()
[(False, [1, 3, 3, 5, 3, 3]), (True, [2, 4, 2, 4, 6, 4, 2])]
This works for more than two partitions with a more complicated mapping (and on iterators, too):
>>> import math
>>> l = xrange(1, 23)
>>> split = splitter(l, lambda x: int(math.log10(x) * 5))
>>> split.items()
[(0, [1]),
(1, [2]),
(2, [3]),
(3, [4, 5, 6]),
(4, [7, 8, 9]),
(5, [10, 11, 12, 13, 14, 15]),
(6, [16, 17, 18, 19, 20, 21, 22])]
Or using a dictionary to map:
>>> map = {'A': 1, 'X': 2, 'B': 3, 'Y': 1, 'C': 2, 'Z': 3}
>>> l = ['A', 'B', 'C', 'C', 'X', 'Y', 'Z', 'A', 'Z']
>>> split = splitter(l, map.get)
>>> split.items()
[(1, ['A', 'Y', 'A']), (2, ['C', 'C', 'X']), (3, ['B', 'Z', 'Z'])]
solution
from itertools import tee
def unpack_args(fn):
return lambda t: fn(*t)
def separate(fn, lx):
return map(
unpack_args(
lambda i, ly: filter(
lambda el: bool(i) == fn(el),
ly)),
enumerate(tee(lx, 2)))
test
[even, odd] = separate(
lambda x: bool(x % 2),
[1, 2, 3, 4, 5])
print(list(even) == [2, 4])
print(list(odd) == [1, 3, 5])
If the list is made of groups and intermittent separators, you can use:
def split(items, p):
groups = [[]]
for i in items:
if p(i):
groups.append([])
groups[-1].append(i)
return groups
Usage:
split(range(1,11), lambda x: x % 3 == 0)
# gives [[1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]
Use Boolean logic to assign data to two arrays
>>> images, anims = [[i for i in files if t ^ (i[2].lower() in IMAGE_TYPES) ] for t in (0, 1)]
>>> images
[('file1.jpg', 33, '.jpg')]
>>> anims
[('file2.avi', 999, '.avi')]
For performance, try itertools.
The itertools module standardizes a core set of fast, memory efficient tools that are useful by themselves or in combination. Together, they form an “iterator algebra” making it possible to construct specialized tools succinctly and efficiently in pure Python.
See itertools.ifilter or imap.
itertools.ifilter(predicate, iterable)
Make an iterator that filters elements from iterable returning only those for which the predicate is True
If you insist on clever, you could take Winden's solution and add just a bit of spurious cleverness:
def splay(l, f, d=None):
d = d or {}
for x in l: d.setdefault(f(x), []).append(x)
return d
Sometimes you won't need that other half of the list.
For example:
import sys
from itertools import ifilter
trustedPeople = sys.argv[1].split(',')
newName = sys.argv[2]
myFriends = ifilter(lambda x: x.startswith('Shi'), trustedPeople)
print '%s is %smy friend.' % (newName, newName not in myFriends and 'not ' or '')
Already quite a few solutions here, but yet another way of doing that would be -
anims = []
images = [f for f in files if (lambda t: True if f[2].lower() in IMAGE_TYPES else anims.append(t) and False)(f)]
Iterates over the list only once, and looks a bit more pythonic and hence readable to me.
>>> files = [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi'), ('file1.bmp', 33L, '.bmp')]
>>> IMAGE_TYPES = ('.jpg','.jpeg','.gif','.bmp','.png')
>>> anims = []
>>> images = [f for f in files if (lambda t: True if f[2].lower() in IMAGE_TYPES else anims.append(t) and False)(f)]
>>> print '\n'.join([str(anims), str(images)])
[('file2.avi', 999L, '.avi')]
[('file1.jpg', 33L, '.jpg'), ('file1.bmp', 33L, '.bmp')]
>>>
I'd take a 2-pass approach, separating evaluation of the predicate from filtering the list:
def partition(pred, iterable):
xs = list(zip(map(pred, iterable), iterable))
return [x[1] for x in xs if x[0]], [x[1] for x in xs if not x[0]]
What's nice about this, performance-wise (in addition to evaluating pred only once on each member of iterable), is that it moves a lot of logic out of the interpreter and into highly-optimized iteration and mapping code. This can speed up iteration over long iterables, as described in this answer.
Expressivity-wise, it takes advantage of expressive idioms like comprehensions and mapping.
Not sure if this is a good approach but it can be done in this way as well
IMAGE_TYPES = ('.jpg','.jpeg','.gif','.bmp','.png')
files = [ ('file1.jpg', 33L, '.jpg'), ('file2.avi', 999L, '.avi')]
images, anims = reduce(lambda (i, a), f: (i + [f], a) if f[2] in IMAGE_TYPES else (i, a + [f]), files, ([], []))
Yet another answer, short but "evil" (for list-comprehension side effects).
digits = list(range(10))
odd = [digits.pop(i) for i, x in enumerate(digits) if x % 2]
>>> odd
[1, 3, 5, 7, 9]
>>> digits
[0, 2, 4, 6, 8]

Grouping lists by a common element

Assume we have a list of sets as follows:
S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]
S2 = []
I want to go over this list and, for each set, check whether a property holds between that set and each of the other sets in the list. If the property holds, join the two sets together and compare the new set to the remaining sets of S1. At the end, add the new set to S2.
Now, as an example, assume we say the property holds between two sets if all elements of those two sets begin with the same letter.
For the list S1 described above, I want S2 to be:
S2 = [{'A_1', 'A_3', 'A_2'}, {'B_1', 'B_3', 'B_2'}, {'C_1','C_2'}]
How should we write code for this?
This is my code. It works fine, but I think it is inefficient because it tries to add set(['A_3', 'A_2', 'A_1']) several times. Assume the Checker function is given and that it checks the property between two lists. The property I mentioned above is just an example and we may want to change it later, so Checker should remain a function.
def Checker(list1, list2):
flag = 1
for item1 in list1:
for item2 in list2:
if item1[0] != item2[0]:
flag =0
if flag ==1:
return 1
else:
return 0
S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]
S2 = []
for i in range(0,len(S1)):
Temp = S1[i]
for j in range(0,i-1) + range(i+1,len(S1)):
if Checker(Temp,S1[j]) == 1:
Temp = Temp.union(S1[j])
if Temp not in S2:
S2.append(Temp)
print S2
Output:
[set(['A_3', 'A_2', 'A_1']), set(['B_1', 'B_2', 'B_3']), set(['C_1', 'C_2'])]
def Checker(list1, list2):
    for item1 in list1:
        for item2 in list2:
            if item1[0] != item2[0]:
                return 0
    return 1
I have tried to reduce the complexity of the Checker() function.
You can flatten the list (there are many ways to do this, but a simple one is it.chain(*nested_list)), sort it using only the property as the key, and then use it.groupby() with the same key to create the new list:
In []:
import operator as op
import itertools as it
prop = op.itemgetter(0)
[set(v) for _, v in it.groupby(sorted(it.chain(*S1), key=prop), key=prop)]
Out[]:
[{'A_1', 'A_2', 'A_3'}, {'B_1', 'B_2', 'B_3'}, {'C_1', 'C_2'}]
If performance is a consideration, I suggest the canonical grouping approach in Python: using a defaultdict:
>>> S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]
>>> from collections import defaultdict
>>> grouper = defaultdict(set)
>>> from itertools import chain
>>> for item in chain.from_iterable(S1):
... grouper[item[0]].add(item)
...
>>> grouper
defaultdict(<class 'set'>, {'C': {'C_1', 'C_2'}, 'B': {'B_1', 'B_2', 'B_3'}, 'A': {'A_1', 'A_2', 'A_3'}})
Edit
Note, the following applies to Python 3. In Python 2, .values returns a list.
Note that you probably just want this dict; it is likely much more useful to you than a list of the groups. You can also use the .values() method, which returns a view on the values:
>>> grouper.values()
dict_values([{'C_1', 'C_2'}, {'B_1', 'B_2', 'B_3'}, {'A_1', 'A_2', 'A_3'}])
If you really want a list, you can always get it in a straight-forward way:
>>> S2 = list(grouper.values())
>>> S2
[{'C_1', 'C_2'}, {'B_1', 'B_2', 'B_3'}, {'A_1', 'A_2', 'A_3'}]
Given that N is the number of items in all the nested sets, then this solution is O(N).
S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]
from itertools import chain
l = list( chain.from_iterable(S1) )
s = {i[0] for i in l}
t = []
for k in s:
t.append([i for i in l if i[0]==k])
print (t)
Output:
[['B_1', 'B_3', 'B_2'], ['A_1', 'A_3', 'A_2'], ['C_1', 'C_2']]
Is your property 1. symmetric and 2. transitive? i.e. 1. prop(a,b) if and only if prop(b,a) and 2. prop(a,b) and prop(b,c) implies prop(a,c)? If so, you can write a function that takes a set and gives some code for the corresponding equivalence class. E.g.
S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]

def eq_class(s):
    fs = set(w[0] for w in s)
    if len(fs) != 1:
        return None
    return fs.pop()

S2 = dict()
for s in S1:
    cls = eq_class(s)
    S2[cls] = S2.get(cls,set()).union(s)

S2 = list(S2.values())
This has an advantage of being amortized O(len(S1)). Also note that your final output may depend on the order of S1 if 1 or 2 fails.
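If only a pairwise Checker is available rather than a key function, and assuming the property really is symmetric and transitive, each incoming set can be compared against the groups built so far. This is only a sketch: group_by_checker and same_first_letter are hypothetical names, and in the worst case it makes one checker call per (set, existing group) pair, so it is not linear.

```python
def same_first_letter(s1, s2):
    """Example property: every element of both sets starts with the same letter."""
    return all(x[0] == y[0] for x in s1 for y in s2)


def group_by_checker(sets, checker):
    """Merge sets into groups, assuming checker is an equivalence relation."""
    groups = []
    for s in sets:
        for g in groups:
            if checker(g, s):
                g |= s  # merge into the first compatible group
                break
        else:
            # No existing group matched: start a new one (copy, so the input is untouched).
            groups.append(set(s))
    return groups


S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'}, {'B_2'}, {'A_2'}]
S2 = group_by_checker(S1, same_first_letter)
print(S2)
```

Because each set is merged at most once, no group is built or appended repeatedly, which was the inefficiency in the original loop.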
A bit more verbose version using itertools.groupby
from itertools import groupby
S1 = [['A_1'], ['B_1', 'B_3'], ['C_1'], ['A_3'], ['C_2'],['B_2'], ['A_2']]
def group(data):
# Flatten the data
l = list((d for sub in data for d in sub))
# Sort it
l.sort()
groups = []
keys = []
# Iterates for each group found only
for k, g in groupby(l, lambda x: x[0]):
groups.append(list(g))
keys.append(k)
# Return keys group data
return keys, [set(x) for x in groups]
keys, S2 = group(S1)
print "Found the following keys", keys
print "S2 = ", S2
The main thought here was to reduce the number of appends, as this really cripples performance. We flatten the data using a generator and sort it. Then we use groupby to group the data. The loop only iterates once per group. There is still a fair bit of data copying here that could potentially be removed.
A bonus is that the function also returns the groups keys detected in the data.

Finding the minimum value for different variables

Suppose I am computing some math functions for different variables, for example:
a = x - y
b = x**2 - y**2
c = (x-y)**2
d = x + y
How can I find the minimum value out of all the variables? For example:
a = 4
b = 7
c = 3
d = 10
So the minimum value is 3, for c. How can I make my program do this?
What have i thought so far:
make a list
append a,b,c,d in the list
sort the list
print list[0] as it will be the smallest value.
The problem is if i append a,b,c,d to a list i have to do something like:
lst.append((a,b,c,d))
This makes the list to be -
[(4,7,3,10)]
making all the values relating to one index only ( lst[0] )
Is there an alternative way to do this, or any other way to find the minimum?
LNG - PYTHON
Thank you
You can find the index of the smallest item like this
>>> L = [4,7,3,10]
>>> min(range(len(L)), key=L.__getitem__)
2
Now you know the index, you can get the actual item too. eg: L[2]
Another way which finds the answer in the form(index, item)
>>> min(enumerate(L), key=lambda x:x[1])
(2, 3)
I think you may be going the wrong way to solving your problem, but it's possible to pull values of variable from the local namespace if you know their names. eg.
>>> a = 4
>>> b = 7
>>> c = 3
>>> d = 10
>>> min(enumerate(['a', 'b', 'c', 'd']), key=lambda x, ns=locals(): ns[x[1]])
(2, 'c')
a better way is to use a dict, so you are not filling your working namespace with these "junk" variables
>>> D = {}
>>> D['a'] = 4
>>> D['b'] = 7
>>> D['c'] = 3
>>> D['d'] = 10
>>> min(D, key=D.get)
'c'
>>> min(D.items(), key=lambda x:x[1])
('c', 3)
You can see that when the correct data structure is used, the amount of code required is much less.
If you store the numbers in a list, you can use reduce, which has O(n) complexity because the list is not sorted.
numbers = [999, 1111, 222, -1111]
minimum = reduce(lambda mn, candidate: candidate if candidate < mn else mn, numbers[1:], numbers[0])
pack as dictionary, find min value and then find keys that have matching values (possibly more than one minimum)
D = dict(a = 4, b = 7, c = 3, d = 10)
min_val = min(D.values())
for k,v in D.items():
if v == min_val: print(k)
The built-in function min will do the trick. In your example, min(a,b,c,d) will yield 3.
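The list attempt in the question nests the values because append adds its argument as a single element, so appending the tuple (a, b, c, d) yields a one-element list. A short sketch of the alternatives, using the example values from the question:

```python
a, b, c, d = 4, 7, 3, 10

# min accepts the values directly as separate arguments.
smallest = min(a, b, c, d)

# extend adds each value as its own element.
lst = []
lst.extend((a, b, c, d))

# append adds the whole tuple as one element, which is what the question ran into.
nested = []
nested.append((a, b, c, d))

print(smallest)  # 3
print(lst)       # [4, 7, 3, 10]
print(min(lst))  # 3
print(nested)    # [(4, 7, 3, 10)]
```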

using FOR statement on 2 elements at once python

I have the following list of variables and a mastervariable
a = (1,5,7)
b = (1,3,5)
c = (2,2,2)
d = (5,2,8)
e = (5,5,8)
mastervariable = (3,2,5)
I'm trying to check whether 2 elements of each variable exist in the master variable, so that the above would show B (3,5) and D (5,2) as having at least 2 elements matching in the mastervariable. Also note that using sets would result in C showing up as matching, but I don't want to count C because only 'one' of the elements in C is in mastervariable (i.e. 2 only shows up once in mastervariable, not twice).
I currently have the very inefficient:
if current_variable[0] == mastervariable[0]:
    if current_variable[1] == mastervariable[1]:
        True
    elif current_variable[2] == mastervariable[1]:
        True
#### I don't use OR here because I need to know which variables match.
elif current_variable[1] == mastervariable[0]:  ##<-- I'm now checking 2nd element
    etc. etc.
I then continue to iterate like the above by checking each one at a time which is extremely inefficient. I did the above because using a FOR statement resulted in me checking the first element twice which was incorrect:
for i in a:
    for j in a:
        ### this checked if 1 was in the master variable and not 1,5 or 1,7
Is there a way to use 2 FOR statement that allows me to check 2 elements in a list at once while skipping any element that has been used already? Alternatively, can you suggest an efficient way to do what I'm trying?
Edit: Mastervariable can have duplicates in it.
For the case where matching elements can be duplicated so that set breaks, use Counter as a multiset - the duplicates between a and master are found by:
count_a = Counter(a)
count_master = Counter(master)
count_both = count_a + count_master
dups = Counter({e : min((count_a[e], count_master[e])) for e in count_a if count_both[e] > count_a[e]})
The logic is reasonably intuitive: if there's more of an item in the combined count of a and master, then it is duplicated, and the multiplicity is however many of that item are in whichever of a and master has less of them.
It gives a Counter of all the duplicates, where the count is their multiplicity. If you want it back as a tuple, you can do tuple(dups.elements()):
>>> a
(2, 2, 2)
>>> master
(1, 2, 2)
>>> dups = Counter({e : min((count_a[e], count_master[e])) for e in count_a if count_both[e] > count_a[e]})
>>> tuple(dups.elements())
(2, 2)
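Counter also supports multiset intersection directly through the & operator, which keeps the minimum count of each element, so the dict comprehension above can be written more compactly. A sketch using the tuples from the question; count_matches is a hypothetical helper name:

```python
from collections import Counter


def count_matches(candidate, master):
    """Number of elements shared by both tuples, respecting multiplicity."""
    common = Counter(candidate) & Counter(master)  # per-element minimum of counts
    return sum(common.values())


mastervariable = (3, 2, 5)
variables = {
    'a': (1, 5, 7),
    'b': (1, 3, 5),
    'c': (2, 2, 2),
    'd': (5, 2, 8),
    'e': (5, 5, 8),
}

# Keep the names whose tuples share at least two elements with mastervariable.
matching = [name for name, value in variables.items()
            if count_matches(value, mastervariable) >= 2]
print(matching)  # ['b', 'd']
```

Here c is excluded because 2 occurs only once in mastervariable, so the intersection counts it once.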
Seems like a good job for sets. Edit: sets aren't suitable since mastervariable can contain duplicates. Here is a version using Counters.
>>> a = (1,5,7)
>>>
>>> b = (1,3,5)
>>>
>>> c = (2,2,2)
>>>
>>> d = (5,2,8)
>>>
>>> e = (5,5,8)
>>> D=dict(a=a, b=b, c=c, d=d, e=e)
>>>
>>> from collections import Counter
>>> mastervariable = (5,5,3)
>>> mvc = Counter(mastervariable)
>>> for k,v in D.items():
... vc = Counter(v)
... if sum(min(count, vc[item]) for item, count in mvc.items())==2:
... print k
...
b
e
