I am looking to program a small bit of code in Python. I have a list of lists called "keep_list", from which I want to remove any sublist containing a specific value found in another list called "deleteNODE".
for example:
deleteNODE=[0,4]
keep_list=[[0,1,2],[0,2,3],[1,2,3],[4,5,6]]
After running the code the result should be (removing any list containing 0 or 4):
keep_list=[[1,2,3]]
Is there any efficient way of doing this?
I did it like this:
[x for x in keep_list if not set(x).intersection(deleteNODE)]
Since I thought the other answers were better, I also ran timeit on all the 3 answers and surprisingly this one was faster.
Python 3.8.2
>>> import timeit
>>>
>>> deleteNODE=[0,4]
>>> keep_list=[[0,1,2],[0,2,3],[1,2,3],[4,5,6]]
>>>
>>>
>>> def v1(keep, delete):
... return [l for l in keep_list if not any(n in l for n in deleteNODE)]
...
>>> def v2(keep, delete):
... return [i for i in keep_list if len(set(i)&set(deleteNODE)) == 0]
...
>>> def v3(keep, delete):
... return [x for x in keep_list if not set(x).intersection(deleteNODE)]
...
>>>
>>> timeit.timeit(lambda: v1(keep_list, deleteNODE), number=3000000)
7.2224646
>>> timeit.timeit(lambda: v2(keep_list, deleteNODE), number=3000000)
7.1723587
>>> timeit.timeit(lambda: v3(keep_list, deleteNODE), number=3000000)
5.640403499999998
I'm no Python expert so can anyone understand why mine was faster since it appears to be creating a new set for every evaluation?
You can solve this using a list comprehension to cycle through each list within the larger list and by using sets. The & operator between sets returns the intersection (the elements common between both sets). Therefore, if the intersection between the list you're evaluating and deleteNODE is not zero, that means there is a common element and it gets excluded.
keep_list = [i for i in keep_list if len(set(i)&set(deleteNODE)) == 0]
This could be done using list comprehension
deleteNODE=[0,4]
keep_list=[[0,1,2],[0,2,3],[1,2,3],[4,5,6]]
...
keep_list = [l for l in keep_list if not any(n in l for n in deleteNODE)]
Related
I have a list of 50 numbers, [0,1,2,...49] and I would like to create a list of tuples without duplicates, where i define (a,b) to be a duplicate of (b,a). Similarly, I do not want tuples of the form (a,a).
I have this:
pairs = set([])
mylist = range(0,50)
for i in mylist:
for j in mylist:
pairs.update([(i,j)])
set((a,b) if a<=b else (b,a) for a,b in pairs)
print len(pairs)
>>> 2500
I get 2500 whereas I expect to get, i believe, 1225 (n(n-1)/2).
What is wrong?
You want all combinations. Python provides a module, itertools, with all sorts of combinatorial utilities like this. Where you can, I would stick with using itertool, it almost certainly faster and more memory efficient than anything you would cook up yourself. It is also battle-tested. You should not reinvent the wheel.
>>> import itertools
>>> combs = list(itertools.combinations(range(50),2))
>>> len(combs)
1225
>>>
However, as others have noted, in the case where you have a sequence (i.e. something indexable) such as a list, and you want N choose k, where k=2 the above could simply be implemented by a nested for-loop over the indices, taking care to generate your indices intelligently:
>>> result = []
>>> for i in range(len(numbers)):
... for j in range(i + 1, len(numbers)):
... result.append((numbers[i], numbers[j]))
...
>>> len(result)
1225
However, itertool.combinations takes any iterable, and also takes a second argument, r which deals with cases where k can be something like 7, (and you don't want to write a staircase).
Your approach essentially takes the cartesian product, and then filters. This is inefficient, but if you wanted to do that, the best way is to use frozensets:
>>> combinations = set()
>>> for i in numbers:
... for j in numbers:
... if i != j:
... combinations.add(frozenset([i,j]))
...
>>> len(combinations)
1225
And one more pass to make things tuples:
>>> combinations = [tuple(fz) for fz in combinations]
Try This,
pairs = set([])
mylist = range(0,50)
for i in mylist:
for j in mylist:
if (i < j):
pairs.append([(i,j)])
print len(pairs)
problem in your code snippet is that you filter out unwanted values but you don't assign back to pairs so the length is the same... also: this formula yields the wrong result because it considers (20,20) as valid for instance.
But you should just create the proper list at once:
pairs = set()
for i in range(0,50):
for j in range(i+1,50):
pairs.add((i,j))
print (len(pairs))
result:
1225
With that method you don't even need a set since it's guaranteed that you don't have duplicates in the first place:
pairs = []
for i in range(0,50):
for j in range(i+1,50):
pairs.append((i,j))
or using list comprehension:
pairs = [(i,j) for i in range(0,50) for j in range(i+1,50)]
I have two lists shown below. I'm trying to use the list.remove(x) function to remove the files that are in both list1 and list2, but one of my lists has file extensions while the other does not! What should be my approach!?
List1 = ['myfile.v', 'myfile2.sv', 'myfile3.vhd', 'etcfile.v', 'randfile.sv']
List2 = ['myfile', 'myfile2', 'myfile3']
#This is in short what I would like to do, but the file extensions throw off
#the tool!
for x in List2:
List1.remove(x)
Thanks!
It's really dangerous to loop over a list as you are removing items from it. You'll nearly always end up skipping over some elements.
>>> L = [1, 1, 2, 2, 3, 3]
>>> for x in L:
... print x
... if x == 2:
... L.remove(2)
...
1
1
2
3
3
It's also inefficient, since each .remove is O(n) complexity
Better to create a new list and bind it back to list1
import os
list1 = ['myfile.v', 'myfile2.sv', 'myfile3.vhd', 'etcfile.v', 'randfile.sv']
list2 = ['myfile', 'myfile2', 'myfile3']
set2 = set(list2) # Use a set for O(1) lookups
list1 = [x for x in list1 if os.path.splitext(x)[0] not in set2]
or for an "inplace" version
list1[:] = [x for x in list1 if os.path.splitext(x)[0] not in set2]
for a truly inplace version as discussed in the comments - doesn't use extra O(n) memory. And runs in O(n) time
>>> list1 = ['myfile.v', 'myfile2.sv', 'myfile3.vhd', 'etcfile.v', 'randfile.sv']
>>> p = 0
>>> for x in list1:
... if os.path.splitext(x)[0] not in set2:
... list1[p] = x
... p += 1
...
>>> del(list1[p:])
>>> list1
['etcfile.v', 'randfile.sv']
For the sake of it, if you want to use the list.remove(element), as it is very easy to read for others, you can try the following. If you have a function f that returns true if the value is correct/passes certain tests as required,
Since this will NOT work:
def rem_vals(L):
for x in L:
if not f(x):
L.remove(x)
for more than one value to be removed in the list L, we can use recursion as follows:
def rem_vals_rec(L):
for x in L:
if not f(x):
L.remove(x)
rem_vals_rec(L)
Not the fastest, but the easiest to read.
I'm trying to figure out how to compare an n number of lists to find the common elements.
For example:
p=[ [1,2,3],
[1,9,9],
..
..
[1,2,4]
>> print common(p)
>> [1]
Now if I know the number of elements I can do comparions like:
for a in b:
for c in d:
for x in y:
...
but that wont work if I don't know how many elements p has. I've looked at this solution that compares two lists
https://stackoverflow.com/a/1388864/1320800
but after spending 4 hrs trying to figure a way to make that recursive, a solution still eludes me so any help would be highly appreciated!
You are looking for the set intersection of all the sublists, and the data type you should use for set operations is a set:
result = set(p[0])
for s in p[1:]:
result.intersection_update(s)
print result
A simple solution (one-line) is:
set.intersection(*[set(list) for list in p])
The set.intersection() method supports intersecting multiple inputs at a time. Use argument unpacking to pull the sublists out of the outer list and pass them into set.intersection() as separate arguments:
>>> p=[ [1,2,3],
[1,9,9],
[1,2,4]]
>>> set(p[0]).intersection(*p)
set([1])
Why not just:
set.intersection(*map(set, p))
Result:
set([1])
Or like this:
ip = iter(p)
s = set(next(ip))
s.intersection(*ip)
Result:
set([1])
edit:
copied from console:
>>> p = [[1,2,3], [1,9,9], [1,2,4]]
>>> set.intersection(*map(set, p))
set([1])
>>> ip = iter(p)
>>> s = set(next(ip))
>>> s.intersection(*ip)
set([1])
p=[ [1,2,3],
[1,9,9],
[1,2,4]]
ans = [ele[0] for ele in zip(*p) if len(set(ele)) == 1]
Result:
>>> ans
[1]
reduce(lambda x, y: x & y, (set(i) for i in p))
You are looking for the set intersection of all the sublists, and the data type you should use for set operations is a set:
result = set(p[0])
for s in p[1:]:
result.intersection_update(s)
print result
However, there is a limitation of 10 lists in a list. Anything bigger causes 'result' list to be out of order. Assuming you've made 'result' into a list by list(result).
Make sure you result.sort() to ensure it's ordered if you depend on it to be that way.
Is there any builtins to check if a list is contained inside another list without doing any loop?
I looked for that in dir(list) but found nothing useful.
Depends on what you mean by "contained". Maybe this:
if set(a) <= set(b):
print("a is in b")
Assuming that you want to see if all elements of sublist are also elements of superlist:
all(x in superlist for x in sublist)
You might want to use a set
if set(a).issubset(b):
print('a is contained in b')
the solution depends on what values you expect from your lists.
if there is the possiblity of a repetition of a value, and you need to check that there is enough values in the tested container, then here is a time-inefficient solution:
def contained(candidate, container):
temp = container[:]
try:
for v in candidate:
temp.remove(v)
return True
except ValueError:
return False
test this function with:
>>> a = [1,1,2,3]
>>> b = [1,2,3,4,5]
>>> contained(a,b)
False
>>> a = [1,2,3]
>>> contained(a,b)
True
>>> a = [1,1,2,4,4]
>>> b = [1,1,2,2,2,3,4,4,5]
>>> contained(a,b)
True
of course this solution can be greatly improved: list.remove() is potentially time consuming and can be avoided using clever sorting and indexing. but i don't see how to avoid a loop here...
(anyway, any other solution will be implemented using sets or list-comprehensions, which are using loops internally...)
If you want to validate that all the items from the list1 are on list2 you can do the following list comprehension:
all(elem in list1 for elem in list2)
You can also replace list1 and list2 directly with the code that will return that list
all([snack in ["banana", "apple", "lemon", "chocolate", "chips"] for snack in ["chips","chocolate"])
That any + list comprehension can be translated into this for a better understanding of the code
return_value = False
for snack in snacks:
if snack in groceries:
return_value = True
else:
return_value = False
I want to remove an element from list, such that the element contains 'X' or 'N'. I have to apply for a large genome. Here is an example:
input:
codon=['AAT','XAC','ANT','TTA']
expected output:
codon=['AAT','TTA']
For basis purpose
>>> [x for x in ['AAT','XAC','ANT','TTA'] if "X" not in x and "N" not in x]
['AAT', 'TTA']
But if you have huge amount of data, I suggest you to use dict or set
And If you have many characters other than X and N, you may do like this
>>> [x for x in ['AAT','XAC','ANT','TTA'] if not any(ch for ch in list(x) if ch in ["X","N","Y","Z","K","J"])]
['AAT', 'TTA']
NOTE: list(x) can be just x, and ["X","N","Y","Z","K","J"] can be just "XNYZKJ", and refer gnibbler answer, He did the best one.
Another not fastest way but I think it reads nicely
>>> [x for x in ['AAT','XAC','ANT','TTA'] if not any(y in x for y in "XN")]
['AAT', 'TTA']
>>> [x for x in ['AAT','XAC','ANT','TTA'] if not set("XN")&set(x)]
['AAT', 'TTA']
This way will be faster for long codons (assuming there is some repetition)
codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
if s not in memo:
memo[s]=not any(y in s for y in "XN")
return memo[s]
print filter(pred,codon)
Here is the method suggested by James Brooks, you'd have to test to see which is faster for your data
codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
if s not in memo:
memo[s]= not set("XN")&set(s)
return memo[s]
print filter(pred,codon)
For this sample codon, the version using sets is about 10% slower
There is also the method of doing it using filter
lst = filter(lambda x: 'X' not in x and 'N' not in x, list)
filter(lambda x: 'N' not in x or 'X' not in x, your_list)
your_list = [x for x in your_list if 'N' not in x or 'X' not in x]
I like gnibbler’s memoization approach a lot. Either method using memoization should be identically fast in the big picture on large data sets, as the memo dictionary should quickly be filled and the actual test should be rarely performed. With this in mind, we should be able to improve the performance even more for large data sets. (This comes at some cost for very small ones, but who cares about those?) The following code only has to look up an item in the memo dict once when it is present, instead of twice (once to determine membership, another to extract the value).
codon = ['AAT', 'XAC', 'ANT', 'TTA']
def pred(s,memo={}):
try:
return memo[s]
except KeyError:
memo[s] = not any(y in s for y in "XN")
return memo[s]
filtered = filter(pred, codon)
As I said, this should be noticeably faster when the genome is large (or at least not extremely small).
If you don’t want to duplicate the list, but just iterate over the filtered list, do something like:
for item in (item for item in codon if pred):
do_something(item)
If you're dealing with extremely large lists, you want to use methods that don't involve traversing the entire list any more than you absolutely need to.
Your best bet is likely to be creating a filter function, and using itertools.ifilter, e.g.:
new_seq = itertools.ifilter(lambda x: 'X' in x or 'N' in x, seq)
This defers actually testing every element in the list until you actually iterate over it. Note that you can filter a filtered sequence just as you can the original sequence:
new_seq1 = itertools.ifilter(some_other_predicate, new_seq)
Edit:
Also, a little testing shows that memoizing found entries in a set is likely to provide enough of an improvement to be worth doing, and using a regular expression is probably not the way to go:
seq = ['AAT','XAC','ANT','TTA']
>>> p = re.compile('[X|N]')
>>> timeit.timeit('[x for x in seq if not p.search(x)]', 'from __main__ import p, seq')
3.4722548536196314
>>> timeit.timeit('[x for x in seq if "X" not in x and "N" not in x]', 'from __main__ import seq')
1.0560532134670666
>>> s = set(('XAC', 'ANT'))
>>> timeit.timeit('[x for x in seq if x not in s]', 'from __main__ import s, seq')
0.87923730529996647
Any reason for duplicating the entire list? How about:
>>> def pred(item, haystack="XN"):
... return any(needle in item for needle in haystack)
...
>>> lst = ['AAT', 'XAC', 'ANT', 'TTA']
>>> idx = 0
>>> while idx < len(lst):
... if pred(lst[idx]):
... del lst[idx]
... else:
... idx = idx + 1
...
>>> lst
['AAT', 'TTA']
I know that list comprehensions are all the rage these days, but if the list is long we don't want to duplicate it without any reason right? You can take this to the next step and create a nice utility function:
>>> def remove_if(coll, predicate):
... idx = len(coll) - 1
... while idx >= 0:
... if predicate(coll[idx]):
... del coll[idx]
... idx = idx - 1
... return coll
...
>>> lst = ['AAT', 'XAC', 'ANT', 'TTA']
>>> remove_if(lst, pred)
['AAT', 'TTA']
>>> lst
['AAT', 'TTA']
As S.Mark requested here is my version. It's probably slower but does make it easier to change what gets removed.
def filter_genome(genome, killlist = set("X N".split()):
return [codon for codon in genome if 0 == len(set(codon) | killlist)]
It is (asympotically) faster to use a regular expression than searching many times in the same string for a certain character: in fact, with a regular expression the sequences is only be read at most once (instead of twice when the letters are not found, in gnibbler's original answer, for instance). With gnibbler's memoization, the regular expression approach reads:
import re
remove = re.compile('[XN]').search
codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
if s not in memo:
memo[s]= not remove(s)
return memo[s]
print filter(pred,codon)
This should be (asymptotically) faster than using the "in s" or the "set" checks (i.e., the code above should be faster for long enough strings s).
I originally thought that gnibbler's answer could be written in a faster and more compact way with dict.setdefault():
codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
return memo.setdefault(s, not any(y in s for y in "XN"))
print filter(pred,codon)
However, as gnibbler noted, the value in setdefault is always evaluated (even though, in principle, it could be evaluated only when the dictionary key is not found).
If you want to modify the actual list instead of creating a new one here is a simple set of functions that you can use:
from typing import TypeVar, Callable, List
T = TypeVar("T")
def list_remove_first(lst: List[T], accept: Callable[[T], bool]) -> None:
for i, v in enumerate(lst):
if accept(v):
del lst[i]
return
def list_remove_all(lst: List[T], accept: Callable[[T], bool]) -> None:
for i in reversed(range(len(lst))):
if accept(lst[i]):
del lst[i]