In my Python 2.7.5 code I have the following data structures:
A simple list...
>>> data["parts"]
['com', 'google', 'www']
...and a list of tuples...
>>> data["glue"]
[(1L, 'com'), (3L, 'google')]
When entering the code where these structures exist, I will always know what is in data["parts"]; data["glue"], at best, will contain tuples "matching" what is in data["parts"], and in the worst case data["glue"] can be empty. What I need to know is which parts are missing from glue. So with the example data above, I need to know that 'www' is missing, meaning it is not in any of the tuples that may exist in data["glue"].
I first tried to produce a list of the missing pieces with various for loops coupled with if statements, but it got messy fast. I also tried list comprehensions and failed; maybe a list comprehension is not the way to handle this either.
Your help is much appreciated, thanks.
You can use set difference operations.
print set(data['parts'])-set(i[1] for i in data['glue'])
>>> set(['www'])
or, more simply, using a list comprehension:
print [i for i in data['parts'] if i not in (j[1] for j in data['glue'])]
>>> ['www']
The set operation wins in the speed department: running each version 10,000,000 times, we can see that the list comprehension takes over 16 seconds longer:
import timeit
print timeit.timeit(lambda : set(data['parts'])-set(i[1] for i in data['glue']), number=10000000)
>>> 16.8089739356
print timeit.timeit(lambda : [i for i in data['parts'] if i not in (j[1] for j in data['glue'])], number=10000000)
>>> 33.5426096522
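Incidentally, much of the list comprehension's extra cost comes from rebuilding the generator of glue names for every element of data['parts']. A middle ground (a small sketch of my own, not part of the timings above) is to build the set of glue names once and test membership against it:
glue_names = set(i[1] for i in data['glue'])
print [p for p in data['parts'] if p not in glue_names]
>>> ['www']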
You can use list comprehensions here. Maybe the simplest approach is to build a set of all glue indices, then return the parts at the missing indices. Note that this answer gives you all the missing components, even if there are duplicates in the parts array (for example, if "www" appeared twice in parts); that would not be the case with a set-based approach.
# set of 0-based indices extracted from the 1-based tuples
indices = set(glue_tuple[0] - 1 for glue_tuple in data['glue'])
# array of missing parts, in order
missing_parts = [part for i, part in enumerate(data["parts"]) if i not in indices]
Related
I'm trying to figure out what is the pythonic way to unpack an iterator inside of a list.
For example:
my_iterator = zip([1, 2, 3, 4], [1, 2, 3, 4])
I have come with the following ways to unpack my iterator inside of a list:
1)
my_list = [*my_iterator]
2)
my_list = [e for e in my_iterator]
3)
my_list = list(my_iterator)
No. 1) is my favorite way to do it since it's less code, but I'm wondering if it is also the pythonic way. Or maybe there is another way to achieve this, besides those 3, which is the pythonic way?
This might be a repeat of Fastest way to convert an iterator to a list, but your question is a bit different since you ask which is the most Pythonic. The accepted answer there is list(my_iterator) over [e for e in my_iterator], because the former runs in C under the hood. One commenter suggests [*my_iterator] is faster than list(my_iterator), so you might want to test that. My general vote is that they are all equally Pythonic, so I'd go with the faster of the two for your use case. It's also possible that the older answer is out of date.
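If you want to test that yourself, a quick timeit sketch (rough numbers only; results will vary by machine and Python version):
import timeit

# recreate the iterator inside each statement so every run consumes a fresh one
print(timeit.timeit("list(iter(range(1000)))", number=10000))
print(timeit.timeit("[*iter(range(1000))]", number=10000))
print(timeit.timeit("[e for e in iter(range(1000))]", number=10000))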
After exploring the subject more, I've come to some conclusions.
There should be one-- and preferably only one --obvious way to do it
(zen of python)
Deciding which option is the "pythonic" one should take into consideration some criteria:
how explicit,
simple,
and readable it is.
And the obvious "pythonic" option winning in all criteria is option number 3):
my_list = list(my_iterator)
Here is why it is "obvious" that no 3) is the pythonic one:
Option 3) is close to natural language, making you 'instantly' think of what the output is.
Option 2) (using a list comprehension), if you see that line of code for the first time, takes a little longer to read and a bit more attention. For example, I use a list comprehension when I want to add some extra steps (calling a function on the iterated elements, or filtering with an if statement), so when I see a list comprehension I check for any possible function call inside, or for any if statement.
Option 1) (unpacking using *): the asterisk operator can be a bit confusing if you don't use it regularly. There are 4 cases for using the asterisk in Python (each is sketched in the code after this list):
For multiplication and power operations.
For repeatedly extending list-type containers.
For variadic arguments (so-called "packing").
For unpacking containers.
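A minimal sketch of those four cases (illustrative values of my own):
print(3 * 4, 2 ** 10)             # 1. multiplication and power

print([0, 1] * 3)                 # 2. repeatedly extending a list-type container

def show(*args, **kwargs):        # 3. variadic arguments ("packing")
    print(args, kwargs)

show(1, 2, a=3)                   # prints: (1, 2) {'a': 3}

print([*range(3)], {**{"a": 1}})  # 4. unpacking containers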
Another good argument is the Python docs themselves. I did some statistics to check which options the docs choose; for this I picked 4 built-in iterators and everything from the itertools module (i.e., everything used via the itertools. prefix) to see how they are unpacked into a list:
map
range
filter
enumerate
everything in itertools
After exploring the docs I found 0 iterators unpacked into a list using options 1) or 2), and 35 using option 3).
Conclusion:
The pythonic way to unpack an iterator inside of a list is: my_list = list(my_iterator)
While the unpacking operator * is not often used for unpacking a single iterable into a list (therefore [*it] is a bit less readable than list(it)), it is handy and more Pythonic in several other cases:
1. Unpacking an iterable into a single list / tuple / set, adding other values:
mixed_list = [a, *it, b]
This is more concise and efficient than
mixed_list = [a]
mixed_list.extend(it)
mixed_list.append(b)
2. Unpacking multiple iterables + values into a list / tuple / set
mixed_list = [*it1, *it2, a, b, ... ]
This is similar to the first case.
3. Unpacking an iterable into a list, excluding elements
first, *rest = it
This extracts the first element of it into first and unpacks the rest into a list. One can even do
_, *mid, last = it
This dumps the first element of it into a don't-care variable _, saves the last element into last, and unpacks the rest into a list mid.
4. Nested unpacking of multiple levels of an iterable in one statement
it = (0, range(5), 3)
a1, (*a2,), a3 = it # Unpack the second element of it into a list a2
e1, (first, *rest), e3 = it # Separate the first element from the rest while unpacking it[1]
This can also be used in for statements:
from itertools import groupby
s = "Axyz123Bcba345D"
for k, (first, *rest) in groupby(s, key=str.isalpha):
    ...
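Filling in the loop body, just to illustrate what each group yields (the print statement is mine, not part of the original example):
from itertools import groupby

s = "Axyz123Bcba345D"
for k, (first, *rest) in groupby(s, key=str.isalpha):
    # k is True for alphabetic runs and False for digit runs
    print(k, first, rest)
# True A ['x', 'y', 'z']
# False 1 ['2', '3']
# True B ['c', 'b', 'a']
# False 3 ['4', '5']
# True D []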
If you're interested in the least amount of typing possible, you can actually do one character better than my_list = [*my_iterator] with iterable unpacking:
*my_list, = my_iterator
or (although this only equals my_list = [*my_iterator] in the number of characters):
[*my_list] = my_iterator
(Funny how it has the same effect as my_list = [*my_iterator].)
For the most Pythonic solution, however, my_list = list(my_iterator) is the clearest and most readable of all, and should therefore be considered the most Pythonic.
I tend to use zip when I need to convert a list to a dictionary, or to build key-value pairs in a loop or list comprehension.
However, if zip is used here only as an illustration of creating an iterator, I will definitely vote for #3 for clarity.
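For what it's worth, that zip-to-dictionary pattern looks like this (a small sketch with made-up data):
keys = ["a", "b", "c"]
values = [1, 2, 3]

print(dict(zip(keys, values)))  # {'a': 1, 'b': 2, 'c': 3}

# or as key-value pairs in a loop:
for k, v in zip(keys, values):
    print(k, v)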
I have a couple of long lists of lists of related objects that I'd like to group to reduce redundancy. Pseudocode:
>>>list_of_lists = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]...]
>>>remove_redundancy(list_of_lists)
[[1,2,3,4,8,9,10],[5,6,7]...]
So lists that contain the same elements would be collapsed into single lists. Collapsing them is easy: once I find lists to combine, I can make them into sets and take their union. But I'm not sure how to compare the lists. Do I need to do a series of for loops?
My first thought was that I should loop through and check whether each item in a sublist is in any of the other lists, and if so, merge the lists and start over, but that seems terribly inefficient. I did some searching and found this: Python - dividing a list-of-lists to groups, but my data isn't structured that way. Also, my actual data is a series of strings and thus not sortable in any meaningful sense.
I can write some gnarly looping code to make this work, but I was wondering if there are any built-in functions that would make this sort of comparison easier. Maybe something in list comprehensions?
I think this is a reasonably efficient way of doing it, if I understand your question correctly. The result here will be a list of sets.
Maybe the missing bit of knowledge was d & g (also written d.intersection(g)) for finding the set intersection, along with the fact that an empty set is "falsey" in Python.
data = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]]
result = []
for d in data:
    d = set(d)
    matched = [d]
    unmatched = []
    # first divide into matching and non-matching groups
    for g in result:
        if d & g:
            matched.append(g)
        else:
            unmatched.append(g)
    # then combine all matching groups into one group
    # while leaving unmatched groups intact
    result = unmatched + [set().union(*matched)]
print(result)
# [set([5, 6, 7]), set([1, 2, 3, 4, 8, 9, 10])]
We start with no groups at all (result = []). Then we take the first list from the data. We then check which of the existing groups intersect this list and which don't. Then we merge all of these matching groups along with the list (achieved by starting with matched = [d]). We don't touch the non-matching groups (though maybe some of these will end up being merged in a later iteration). If you add a line print(result) in each loop you should be able to see how it's built up.
The union of all the sets in matched is computed by set().union(*matched). For reference:
Pythonic Way to Create Union of All Values Contained in Multiple Lists
What does the Star operator mean?
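A quick illustration of both building blocks (example values of my own):
matched = [{1, 2}, {2, 3}, {3, 4}]
print(set().union(*matched))  # the union of all groups: {1, 2, 3, 4}

# an empty intersection is "falsey", which is what drives the if d & g test above
print(bool({1, 2} & {3, 4}))  # False: no common elements
print(bool({1, 2} & {2, 3}))  # True: they share 2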
I assume that you want to merge lists that contain any common element.
Here is a function that checks efficiently (to the best of my knowledge) whether two lists contain at least one common element (according to the == operator):
import functools  # python 2.5+

def seematch(X, Y):
    return functools.reduce(lambda x, y: x | y,
                            functools.reduce(lambda x, y: x + y,
                                             [[k == l for k in X] for l in Y]))
It would be even faster if you used a reduce that can be interrupted on finding "true", as described here:
Stopping a Reduce() operation mid way. Functional way of doing partial running sum
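For comparison, the built-in any() short-circuits on the first match, which gives you that interruptible behaviour without a custom reduce (a sketch of mine, not the answer's original code):
def seematch_any(X, Y):
    # stops as soon as one common element is found
    return any(k == l for l in Y for k in X)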
I was trying to find an elegant way to iterate quickly once that is in place, but I think a good approach would simply be to loop once, creating another container that holds the "merged" lists: you loop once over the lists contained in the original list, and for each of them over the new lists created in the proxy container.
Having said that - it seems there might be a much better option - see if you can do away with that redundancy by some sort of book-keeping on the previous steps.
I know this is an incomplete answer - hope that helped anyway!
I have a list of fruits:
fruits = ["apple","banana"]
I also have a nested list of baskets, in which each list contains a string (the name of the basket) and a list of fruits.
baskets = [["basket1",["apple","banana","pear","strawberry"]],["basket2",["strawberry","pear","peach"]],["basket3",["peach","apple","banana"]]]
I would like to know which baskets contain every fruit in the list fruits: the result I expect is a list with two elements, "basket1" and "basket3".
I figured that intersections would be the cleanest way of achieving that, and I tried the following:
myset = set(fruits).intersection(*map(set, set(baskets)))
But I'm getting a TypeError "unhashable type: 'list'". I understand that I can't map lists, but I thought that using the function "set" on both lists would convert them to sets... is there any other way I can find the intersection of a list and a list of lists?
You can loop over baskets and check if the fruits set is a subset of the fruits in the current basket; if yes, store the current basket's name.
>>> fruits = {"apple", "banana"} #notice the {}, or `set(["apple","banana"])` in Python 2.6 or earlier
>>> [b for b, f in baskets if fruits.issubset(f)]
['basket1', 'basket3']
You can't hash sets any more than you can hash lists. They both have the same problem: because they're mutable, a value can change its contents, making any set that contains it as a member or any dictionary that contains it as a key suddenly invalid.
You can hash the immutable equivalents of both, tuple and frozenset.
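For example:
s = set()
s.add((1, 2, 3))             # fine: tuples are hashable
s.add(frozenset([1, 2, 3]))  # fine: frozensets are hashable
# s.add([1, 2, 3])           # TypeError: unhashable type: 'list'
# s.add({1, 2, 3})           # TypeError: unhashable type: 'set'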
Meanwhile, your immediate problem is ironically created by your attempt to solve this problem. Break this line down into pieces:
myset = set(fruits).intersection(*map(set, set(baskets)))
The first piece is this:
baskets_set = set(baskets)
You've got a list of lists, so set(baskets) is trying to make a set of lists, which you can't do, because lists aren't hashable.
If you just removed that, and used map(set, baskets), you would then have an iterator of sets, which is a perfectly valid thing.
Of course as soon as you try to iterate it, it will try to make a set out of the first element of baskets, which is a list, so you'll run into the error again.
Plus, even if you solve this, the logic still doesn't make any sense. What's the intersection of a set of, say, 3 strings with a set of, say, 3 (frozen)sets of strings? It's empty. The two sets don't have any elements in common. The fact that some elements of the second one may contain elements of the first doesn't mean that the second one itself contains any elements of the first.
You could do it this way using your approach:
fruits = ["apple","banana"]
baskets = [["basket1",["apple","banana","pear","strawberry"]],
["basket2",["strawberry","pear","peach"]],
["basket3",["peach","apple","banana"]]]
fruitset = set(fruits)
res = set(b for b, s in ((b, set(c)) for b, c in baskets) if s >= fruitset)  # superset test: the basket must contain every fruit
print res # --> set(['basket1', 'basket3'])
What is the most pythonic way to truncate a list to N indices when you can not guarantee the list is even N length? Something like this:
l = range(6)
if len(l) > 4:
    l = l[:4]
I'm fairly new to Python and am trying to learn to think pythonically. The reason I want to truncate the list at all is that I'm going to enumerate over it with an expected length, and I only care about the first 4 elements.
Python handles out-of-range list slices gracefully, automatically.
In [5]: k = range(2)
In [6]: k[:4]
Out[6]: [0, 1]
In [7]: k = range(6)
In [8]: k[:4]
Out[8]: [0, 1, 2, 3]
BTW, the degenerate slice behavior is explained in The Python tutorial. That is a good place to start because it covers a lot of concepts very quickly.
You've got it here:
lst = lst[:4]
This works regardless of the number of items in the list, even if it's less than 4.
If you want it to always have 4 elements, padding it with (say) zeroes or None if it's too short, try this:
lst = (lst + [0] * 4)[:4]
When you have a question like this, it's usually feasible to try it and see what happens, or look it up in the documentation.
It's a bad idea to name a variable list, by the way; it will shadow the built-in list type so you can no longer refer to it.
All of the answers so far don't truncate the list. They follow your example in assigning the name to a new list which contains the first up to 4 elements of the old list. To truncate the existing list, delete elements whose index is 4 or higher. This is done very simply:
del lst[4:]
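For example:
lst = [0, 1, 2, 3, 4, 5]
del lst[4:]   # truncates lst in place
print(lst)    # [0, 1, 2, 3]

lst = [0, 1]
del lst[4:]   # safe on short lists too: there is nothing to delete
print(lst)    # [0, 1]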
Moving on to what you really want to do, one possibility is:
for i, value in enumerate(lst):
    if i >= 4:
        break
    do_something_with(lst, i, value)
I think the most pythonic solution that answers your question is this:
for x in l[:4]:
    do_something_with(x)
The advantage:
It's the most succinct and descriptive way of doing what you're after.
I suggest that the pythonic way to do this is to slice the source array rather than truncate it. Once you remove elements from the source array they are gone forever, and while in your trivial example that's fine, I think in general it's not a great practice.
Slicing the array to the first four is almost certainly more efficient than removing a possibly large number of elements from its tail.
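A small sketch of that difference (slicing leaves the source list intact, truncation does not):
src = list(range(6))

head = src[:4]    # a new list; src is untouched
print(head, src)  # [0, 1, 2, 3] [0, 1, 2, 3, 4, 5]

del src[4:]       # truncates src itself; the tail is gone for good
print(src)        # [0, 1, 2, 3]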
I have a Python function, defined as follows, which I use to delete from list1 the items which are already in list2. I am using Python 2.6.2 on Windows XP.
def compareLists(list1, list2):
    curIndex = 0
    while curIndex < len(list1):
        if list1[curIndex] in list2:
            list1.pop(curIndex)
        else:
            curIndex += 1
Here, list1 and list2 are a list of lists
list1 = [ ['a', 11221, '2232'], ['b', 1321, '22342'] .. ]
# list2 has a similar format.
I tried this function with list1 containing 38,000 elements and list2 containing 150,000 elements. If I put in a print statement to print the current iteration, I find that the function slows down with each iteration. At first, it processes around 1,000 or more items per second, and after a while it drops to around 20-50 per second. Why could that be happening?
EDIT: In the case with my data, the curIndex remains 0 or very close to 0 so the pop operation on list1 is almost always on the first item.
If possible, can someone also suggest a better way of doing the same thing in a different way?
Try a more pythonic approach to the filtering, something like this (building the lookup set once, and converting the unhashable sub-lists to tuples):
list2_set = set(map(tuple, list2))
[x for x in list1 if tuple(x) not in list2_set]
Converting both lists to sets is unnecessary, and will be very slow and memory hungry on large amounts of data.
Since your data is a list of lists, you need to do something in order to hash it.
Try out
list2_set = set([tuple(x) for x in list2])
diff = [x for x in list1 if tuple(x) not in list2_set]
I tested out your original function, and my approach, using the following test data:
list1 = [[x+1, x*2] for x in range(38000)]
list2 = [[x+1, x*2] for x in range(10000, 160000)]
Timings - not scientific, but still:
#Original function
real 2m16.780s
user 2m16.744s
sys 0m0.017s
#My function
real 0m0.433s
user 0m0.423s
sys 0m0.007s
There are 2 issues that cause your algorithm to scale poorly:
x in list is an O(n) operation.
pop(n) where n is in the middle of the array is an O(n) operation.
Both situations cause it to scale poorly, O(n^2), for large amounts of data. gnud's implementation is probably the best solution, since it fixes both problems without changing the order of elements or removing potential duplicates.
If we rule the data structure itself out, look at your memory usage next. If you end up asking the OS to swap in for you (i.e., the list takes up more memory than you have), Python's going to sit in iowait waiting on the OS to get the pages from disk, which makes sense given your description.
Is Python sitting in a jacuzzi of iowait when this slowdown happens? Anything else going on in the environment?
(If you're not sure, update with your platform and one of us will tell you how to tell.)
The only reason the code can become slower is that you have big elements in both lists which share a lot of common elements (so list1[curIndex] in list2 takes more time).
Here are a couple of ways to fix this:
If you don't care about the order, convert both lists into sets and use set1.difference(set2)
If the order in list1 is important, then at least convert list2 into a set because in is much faster with a set.
Lastly, try a filter: filter(lambda x: x not in set2, list1)
[EDIT] Since set() doesn't work on nested lists (didn't expect that), try:
result = filter(lambda x: x not in list2, list1)
It should still be much faster than your version. If it isn't, then your last option is to make sure that there can't be duplicate elements in either list. That would allow you to remove items from both lists (and therefore make each comparison ever cheaper as you find elements from list2).
EDIT: I've updated my answer to account for lists being unhashable, as well as some other feedback. This one is even tested.
It probably relates to the cost of popping an item out of the middle of a list.
Alternatively, have you tried using sets to handle this?
def difference(list1, list2):
    # build the lookup set once; tuples make the sub-lists hashable
    list2_set = set(tuple(y) for y in list2)
    return [x for x in list1 if tuple(x) not in list2_set]
You can then set list1 to the resulting list, if that is your intention, by doing:
list1 = difference(list1, list2)
The often-suggested set won't work here, because the two lists contain lists, which are unhashable. You need to change your data structure first.
You can either:
convert the sublists into tuples or class instances to make them hashable, then use sets; or
keep both lists sorted, and then you only have to compare the lists' heads (a sketch of this follows below).
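A rough sketch of the second option (the helper name and tuple conversion are mine; it assumes both inputs are kept sorted):
def sorted_difference(list1, list2):
    # keep items of sorted list1 that are absent from sorted list2; runs in O(n + m)
    result = []
    j = 0
    for item in list1:
        # advance list2's head past anything smaller than the current item
        while j < len(list2) and list2[j] < item:
            j += 1
        if j >= len(list2) or list2[j] != item:
            result.append(item)
    return result

# the sublists are converted to tuples first so they sort and compare cleanly
a = sorted(tuple(x) for x in [['b', 2], ['a', 1], ['c', 3]])
b = sorted(tuple(x) for x in [['a', 1], ['c', 3]])
print(sorted_difference(a, b))  # [('b', 2)]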