Removing an element from a list based on a predicate - python

I want to remove every element from a list that contains 'X' or 'N'. I have to apply this to a large genome. Here is an example:
input:
codon=['AAT','XAC','ANT','TTA']
expected output:
codon=['AAT','TTA']

For basic purposes:
>>> [x for x in ['AAT','XAC','ANT','TTA'] if "X" not in x and "N" not in x]
['AAT', 'TTA']
But if you have a huge amount of data, I suggest using a dict or set.
And if you have many characters other than X and N, you can do it like this:
>>> [x for x in ['AAT','XAC','ANT','TTA'] if not any(ch for ch in list(x) if ch in ["X","N","Y","Z","K","J"])]
['AAT', 'TTA']
NOTE: list(x) can be just x, and ["X","N","Y","Z","K","J"] can be just "XNYZKJ". Also refer to gnibbler's answer, which is the best one.
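The simplification described in the note can be sketched as follows. This is only an illustration: the names UNWANTED and is_clean are made up here, and set.isdisjoint is swapped in for the any(...) test.

```python
# Hypothetical set of unwanted characters; extend as needed.
UNWANTED = set("XNYZKJ")

def is_clean(codon):
    """Return True when the codon shares no character with UNWANTED."""
    # isdisjoint accepts any iterable, so the string can be passed directly.
    return UNWANTED.isdisjoint(codon)

codons = ['AAT', 'XAC', 'ANT', 'TTA']
clean = [c for c in codons if is_clean(c)]
print(clean)
```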

Another way, not the fastest, but I think it reads nicely:
>>> [x for x in ['AAT','XAC','ANT','TTA'] if not any(y in x for y in "XN")]
['AAT', 'TTA']
>>> [x for x in ['AAT','XAC','ANT','TTA'] if not set("XN")&set(x)]
['AAT', 'TTA']
This way will be faster for long codons (assuming there is some repetition)
codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
    if s not in memo:
        memo[s]=not any(y in s for y in "XN")
    return memo[s]
print filter(pred,codon)
Here is the method suggested by James Brooks, you'd have to test to see which is faster for your data
codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
    if s not in memo:
        memo[s]= not set("XN")&set(s)
    return memo[s]
print filter(pred,codon)
For this sample codon, the version using sets is about 10% slower

There is also the method of doing it using filter
lst = filter(lambda x: 'X' not in x and 'N' not in x, list)

filter(lambda x: 'N' not in x and 'X' not in x, your_list)
your_list = [x for x in your_list if 'N' not in x and 'X' not in x]

I like gnibbler’s memoization approach a lot. Either method using memoization should be identically fast in the big picture on large data sets, as the memo dictionary should quickly be filled and the actual test should be rarely performed. With this in mind, we should be able to improve the performance even more for large data sets. (This comes at some cost for very small ones, but who cares about those?) The following code only has to look up an item in the memo dict once when it is present, instead of twice (once to determine membership, another to extract the value).
codon = ['AAT', 'XAC', 'ANT', 'TTA']
def pred(s,memo={}):
    try:
        return memo[s]
    except KeyError:
        memo[s] = not any(y in s for y in "XN")
    return memo[s]
filtered = filter(pred, codon)
As I said, this should be noticeably faster when the genome is large (or at least not extremely small).
If you don’t want to duplicate the list, but just iterate over the filtered list, do something like:
for item in (item for item in codon if pred(item)):
    do_something(item)
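For comparison, here is a sketch of the same idea in Python 3 using functools.lru_cache in place of the hand-written memo dictionary. This is a swapped-in standard-library technique, not the approach from the answers above, and appending to kept is just a stand-in for do_something.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def pred(s):
    """True when the codon contains neither 'X' nor 'N' (cached per string)."""
    return not any(y in s for y in "XN")

codon = ['AAT', 'XAC', 'ANT', 'TTA', 'AAT']

# Generator expression: the list is not copied, items are tested as consumed.
kept = []
for item in (c for c in codon if pred(c)):
    kept.append(item)  # stand-in for do_something(item)

print(kept)
print(pred.cache_info())
```

The repeated 'AAT' is answered from the cache, which is the effect the memo dictionary is after.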

If you're dealing with extremely large lists, you want to use methods that don't involve traversing the entire list any more than you absolutely need to.
Your best bet is likely to be creating a filter function, and using itertools.ifilter, e.g.:
new_seq = itertools.ifilter(lambda x: 'X' not in x and 'N' not in x, seq)
This defers actually testing every element in the list until you actually iterate over it. Note that you can filter a filtered sequence just as you can the original sequence:
new_seq1 = itertools.ifilter(some_other_predicate, new_seq)
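In Python 3, itertools.ifilter no longer exists because the built-in filter is already lazy. A rough equivalent, assuming Python 3, might look like this; has_unknown and the second predicate are made up for illustration.

```python
from itertools import filterfalse

seq = ['AAT', 'XAC', 'ANT', 'TTA']

def has_unknown(codon):
    """Hypothetical helper: True if the codon contains 'X' or 'N'."""
    return 'X' in codon or 'N' in codon

# filterfalse keeps the items for which the predicate is false; nothing is
# evaluated until the iterator is consumed.
new_seq = filterfalse(has_unknown, seq)

# Lazy filters can be stacked; this second predicate is only an illustration.
new_seq1 = filter(lambda c: c.startswith('A'), new_seq)

result = list(new_seq1)
print(result)
```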
Edit:
Also, a little testing shows that memoizing found entries in a set is likely to provide enough of an improvement to be worth doing, and using a regular expression is probably not the way to go:
>>> import re, timeit
>>> seq = ['AAT','XAC','ANT','TTA']
>>> p = re.compile('[XN]')
>>> timeit.timeit('[x for x in seq if not p.search(x)]', 'from __main__ import p, seq')
3.4722548536196314
>>> timeit.timeit('[x for x in seq if "X" not in x and "N" not in x]', 'from __main__ import seq')
1.0560532134670666
>>> s = set(('XAC', 'ANT'))
>>> timeit.timeit('[x for x in seq if x not in s]', 'from __main__ import s, seq')
0.87923730529996647

Any reason for duplicating the entire list? How about:
>>> def pred(item, haystack="XN"):
...     return any(needle in item for needle in haystack)
...
>>> lst = ['AAT', 'XAC', 'ANT', 'TTA']
>>> idx = 0
>>> while idx < len(lst):
...     if pred(lst[idx]):
...         del lst[idx]
...     else:
...         idx = idx + 1
...
>>> lst
['AAT', 'TTA']
I know that list comprehensions are all the rage these days, but if the list is long we don't want to duplicate it without any reason, right? You can take this to the next step and create a nice utility function:
>>> def remove_if(coll, predicate):
...     idx = len(coll) - 1
...     while idx >= 0:
...         if predicate(coll[idx]):
...             del coll[idx]
...         idx = idx - 1
...     return coll
...
>>> lst = ['AAT', 'XAC', 'ANT', 'TTA']
>>> remove_if(lst, pred)
['AAT', 'TTA']
>>> lst
['AAT', 'TTA']

As S.Mark requested here is my version. It's probably slower but does make it easier to change what gets removed.
def filter_genome(genome, killlist=set("X N".split())):
    # keep a codon only if it shares no character with killlist
    return [codon for codon in genome if 0 == len(set(codon) & killlist)]

It is (asymptotically) faster to use a regular expression than to search the same string several times for individual characters: with a regular expression, the sequence is read at most once (instead of twice when the letters are not found, as in gnibbler's original answer, for instance). With gnibbler's memoization, the regular expression approach reads:
import re
remove = re.compile('[XN]').search
codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
    if s not in memo:
        memo[s]= not remove(s)
    return memo[s]
print filter(pred,codon)
This should be (asymptotically) faster than using the "in s" or the "set" checks (i.e., the code above should be faster for long enough strings s).
I originally thought that gnibbler's answer could be written in a faster and more compact way with dict.setdefault():
codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
    return memo.setdefault(s, not any(y in s for y in "XN"))
print filter(pred,codon)
However, as gnibbler noted, the value in setdefault is always evaluated (even though, in principle, it could be evaluated only when the dictionary key is not found).
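The eager evaluation can be seen in a small experiment. Here expensive_test is a hypothetical stand-in for the real check that records every call:

```python
calls = []

def expensive_test(s):
    """Hypothetical stand-in for the real check; records each invocation."""
    calls.append(s)
    return not any(y in s for y in "XN")

memo = {}
memo.setdefault('AAT', expensive_test('AAT'))  # key missing: value is stored
memo.setdefault('AAT', expensive_test('AAT'))  # key present: still evaluated

print(len(calls))
print(memo)
```

The argument is computed before setdefault is called, so the second call pays for the test even though its result is discarded.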

If you want to modify the actual list instead of creating a new one, here is a simple set of functions that you can use:
from typing import TypeVar, Callable, List
T = TypeVar("T")
def list_remove_first(lst: List[T], accept: Callable[[T], bool]) -> None:
    for i, v in enumerate(lst):
        if accept(v):
            del lst[i]
            return
def list_remove_all(lst: List[T], accept: Callable[[T], bool]) -> None:
    for i in reversed(range(len(lst))):
        if accept(lst[i]):
            del lst[i]
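A different way to keep the same list object is slice assignment, sketched below under the assumption that a temporary list is acceptable. The name remove_all_in_place is made up here. It avoids shifting the tail of the list on every individual del, at the cost of briefly holding the filtered copy.

```python
def remove_all_in_place(lst, accept):
    """Remove every item for which accept(item) is true, keeping the same list object."""
    # Slice assignment replaces the contents of the existing list in one step.
    lst[:] = [item for item in lst if not accept(item)]

codon = ['AAT', 'XAC', 'ANT', 'TTA']
alias = codon  # a second reference to the same list object
remove_all_in_place(codon, lambda s: 'X' in s or 'N' in s)
print(alias)
```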

Related

How to delete an element of a list and save the original index of the deleted element?

I want to delete the elements of a list that are equal to a given value.
I can do it like this:
List =[1,2,3.....]
List = [x for x in List if x != 2]
How can I save the indices of the deleted elements?
I want to use these indices to delete elements of another list.
The simplest solution is to build a list of booleans marking which positions to keep, then use that to strip the elements from both of your lists. itertools provides a handy compress utility to apply that mask quickly:
from itertools import compress
tokeep = [x != 2 for x in List]
List = list(compress(List, tokeep))
otherlist = list(compress(otherlist, tokeep))
Alternatively (and frankly more clearly) you can just use one loop to strip both inputs; listcomps are fun, but sometimes they're not the way to go.
newlist = []
newotherlist = []
for x, y in zip(List, otherlist):
    if x != 2:
        newlist.append(x)
        newotherlist.append(y)
which gets the same effect in a single pass. Even if it does feel less overtly clever, it's very clear, which is a good thing; brevity for the sake of brevity that creates complexity is not a win.
And now, to contradict that last paragraph, the amusingly overtly clever and brief solution to one-line this:
List, otherlist = map(list, zip(*[(x, y) for x, y in zip(List, otherlist) if x != 2]))
For the love of sanity, please don't actually use this, I just had to write it for funsies.
You can also leverage enumerate
for index, val in enumerate(List):
    if val == value:
        del List[index]
        break
print(index)
Based on documentation
list_first = ['d', 'a']
list_second = ['x', 'z']
def remove_from_lists(element):
    index_deleted = list_first.index(element)
    list_first.remove(element)
    list_second.pop(index_deleted)
remove_from_lists('d')
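If the indices themselves are needed, as the question asks, one possible sketch is to collect them first and then filter both lists by position. The sample data and variable names below are made up for illustration.

```python
values = [1, 2, 3, 2, 5]           # hypothetical sample data
other = ['a', 'b', 'c', 'd', 'e']  # parallel list of the same length

# Positions of the elements that will be removed.
removed = [i for i, x in enumerate(values) if x == 2]
removed_set = set(removed)  # a set gives fast membership tests

values = [x for i, x in enumerate(values) if i not in removed_set]
other = [y for i, y in enumerate(other) if i not in removed_set]

print(removed)
print(values)
print(other)
```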

Uniqueify returning an empty list

I'm new to python and trying to make a function Uniqueify(L) that will be given either a list of numbers or a list of strings (non-empty), and will return a list of the unique elements of that list.
So far I have:
def Uniquefy(x):
    a = []
    for i in range(len(x)):
        if x[i] in a == False:
            a.append(x[i])
    return a
It looks like the if str(x[i]) in a == False: is failing, and that's causing the function to return an empty list.
Any help you guys can provide?
Relational operators all have exactly the same precedence and are chained. This means that this line:
if x[i] in a == False:
is evaluated as follows:
if (x[i] in a) and (a == False):
This is obviously not what you want.
The solution is to remove the second relational operator:
if x[i] not in a:
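The chaining can be checked directly. The following sketch compares the three readings of the test on an empty list; the variable names are only for illustration.

```python
a = []
item = 'spam'

chained = item in a == False             # how Python parses the original test
expanded = (item in a) and (a == False)  # the equivalent chained form
intended = (item in a) == False          # what the question meant
idiomatic = item not in a                # the recommended spelling

print(chained, expanded, intended, idiomatic)
```

Since item in a is already false for an item that is not in the list, the chained form short-circuits to False, so nothing is ever appended.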
You can just create a set based on the list which will only contain unique values:
>>> s = ["a", "b", "a"]
>>> print set(s)
set(['a', 'b'])
The best option here is to use a set instead! By definition, sets only contain unique items and putting the same item in twice will not result in two copies.
If you need to create it from a list and need a list back, try this. However, if there's not a specific reason you NEED a list, then just pass around a set instead (that would be the duck-typing way anyway).
def uniquefy(x):
    return list(set(x))
You can use the built in set type to get unique elements from a collection:
x = [1,2,3,3]
unique_elements = set(x)
You should use set() here. It reduces the in operation time:
def Uniquefy(x):
    a = set()
    for item in x:
        if item not in a:
            a.add(item)
    return list(a)
Or equivalently:
def Uniquefy(x):
    return list(set(x))
If order matters:
def uniquefy(x):
    s = set()
    return [i for i in x if i not in s and s.add(i) is None]
Else:
def uniquefy(x):
    return list(set(x))
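On Python 3.7 or later, where dicts preserve insertion order, the order-preserving variant can also be sketched with dict.fromkeys. This is an alternative to the set-based comprehension above, not something taken from the answers; uniquefy_ordered is a made-up name.

```python
def uniquefy_ordered(items):
    """Return the unique items in first-seen order (items must be hashable)."""
    # dict keys are unique and, from Python 3.7 on, keep insertion order.
    return list(dict.fromkeys(items))

letters = uniquefy_ordered(["a", "b", "a", "c", "b"])
numbers = uniquefy_ordered([3, 1, 3, 2, 1])
print(letters)
print(numbers)
```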

Counting elements in a list with given property values

I'm using Python 2.7. I have a list of objects that each include a property called vote that can have values of either yes or no, and I want to count the number of yes and no votes. One way to do this is:
list = [ {vote:'yes'}, {vote:'no'}, {vote:'yes'} ] #...
numYesVotes = len([x for x in list if x.vote=='yes'])
numNoVotes = len([x for x in list if x.vote=='no'])
but it seems horribly wasteful/inefficient to me to build these lists only to get their lengths and
First question: Am I right about that? Wouldn't it be a good bit more efficient to simply loop through the list once and increment counter variables? i.e.:
numYesVotes = numNoVotes = 0
for x in list:
    if x.vote == 'yes':
        numYesVotes += 1
    else:
        numNoVotes += 1
or is there something about list comprehensions that would make them more efficient here?
Second question: I'm just learning to use lambdas, and I have a feeling this is a perfect use case for one, but I can't quite figure out how to use one here. How might I do that?
See Counter
Counter(x.vote for x in mylst)
Edit:
Example:
yn = Counter("yes" if x%2 else "no" for x in range(10))
print(yn["yes"],yn["no"])
Note that it is faster to do:
sum(1 for x in L if x.vote=='yes')
Than:
len([x for x in L if x.vote=='yes'])
As no list has to be created.
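Both suggestions can be tried on sample data shaped like the question's. In this sketch plain dicts stand in for the vote objects, so the lookup is v["vote"] rather than x.vote.

```python
from collections import Counter

# Hypothetical data: plain dicts stand in for the question's vote objects.
votes = [{"vote": "yes"}, {"vote": "no"}, {"vote": "yes"}]

# One pass over the data, counting every distinct value.
tally = Counter(v["vote"] for v in votes)

# Generator-based count: no intermediate list is built.
num_yes = sum(1 for v in votes if v["vote"] == "yes")

print(tally["yes"], tally["no"], num_yes)
```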
The lambda form;
t =lambda lst: len([ v for x in lst for k,v in x.items() if v=='yes'])
#print (t(mylst))
Here mylst is the list in which you want to count the yes or no votes. The lambda uses the dict items() method.
Demo;
mylst = [ {"vote":'yes'}, {"vote":'no'}, {"vote":'yes'} ]
t =lambda lst: len([ v for x in lst for k,v in x.items() if v=='yes'])
print (t(mylst))
>>>
2
>>>

How to find a duplicate in a list without using set in python?

I know that we can use the set in python to find if there is any duplicate in a list. I was just wondering, if we can find a duplicate in a list without using set.
Say, my list is
a=['1545','1254','1545']
then how to find a duplicate?
a=['1545','1254','1545']
from collections import Counter
print [item for item, count in Counter(a).items() if count != 1]
Output
['1545']
This solution runs in O(N). This will be a huge advantage if the list used has a lot of elements.
If you just want to find if the list has duplicates, you can simply do
a=['1545','1254','1545']
from collections import Counter
print any(count != 1 for count in Counter(a).values())
As @gnibbler suggested, this would in practice be the fastest solution, since it stops at the first duplicate:
from collections import defaultdict
def has_dup(a):
    result = defaultdict(int)
    for item in a:
        result[item] += 1
        if result[item] > 1:
            return True
    else:
        return False
a=['1545','1254','1545']
print has_dup(a)
>>> lis = []
>>> a=['1545','1254','1545']
>>> for i in a:
...     if i not in lis:
...         lis.append(i)
...
>>> lis
['1545', '1254']
>>> set(a)
set(['1254', '1545'])
use list.count:
In [309]: a=['1545','1254','1545']
...: a.count('1545')>1
Out[309]: True
Using list.count:
>>> a = ['1545','1254','1545']
>>> any(a.count(x) > 1 for x in a) # To check whether there's any duplicate
True
>>> # To retrieve any single element that is duplicated
>>> next((x for x in a if a.count(x) > 1), None)
'1545'
>>> # To get the duplicate elements (using a set comprehension)
>>> {x for x in a if a.count(x) > 1}
set(['1545'])
Sort the list and check whether each value is equal to the previous one:
a.sort()
last_x = None
for x in a:
    if x == last_x:
        print "duplicate: %s" % x
        break # existence of duplicates is enough
    last_x = x
This should be O(n log n) which is slower for big n than the Counter solution (but counter uses a dict under the hood.. which is not too dissimilar from a set really).
An alternative is to insert the elements and keep the list sorted.. see the bisect module. It makes your inserts slower but your check for duplicates fast.
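The bisect idea might look roughly like the sketch below. It is written for Python 3, and has_duplicate is a made-up name. Each insert shifts list elements, so this mainly illustrates the technique rather than promising a speedup.

```python
import bisect

def has_duplicate(items):
    """Insert items into a sorted list one at a time; stop at the first repeat."""
    seen = []  # kept sorted at all times
    for item in items:
        pos = bisect.bisect_left(seen, item)  # binary search for the slot
        if pos < len(seen) and seen[pos] == item:
            return True  # item is already present
        seen.insert(pos, item)
    return False

print(has_duplicate(['1545', '1254', '1545']))
print(has_duplicate(['1545', '1254']))
```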
If this is homework, your teacher is probably asking for the hideously inefficient .count() style answer.
In practice using a dict is your next best bet if set is disallowed.
>>> a = ['1545','1254','1545']
>>> D = {}
>>> for i in a:
...     if i in D:
...         print "duplicate", i
...         break
...     D[i] = i
... else:
...     print "no duplicate"
...
duplicate 1545
Here is a version using groupby, which is still much better than the .count() method:
>>> from itertools import groupby
>>> a = ['1545','1254','1545']
>>> next(k for k, g in groupby(sorted(a)) if sum(1 for i in g) > 1)
'1545'
Thanks, all, for working on this problem. I learned a lot from the different answers. This is how I solved it:
a=['1545','1254','1545']
d=[]
duplicates=False
for i in a:
    if i not in d:
        d.append(i)
if len(d)<len(a):
    duplicates=True
else:
    duplicates=False
print(duplicates)

python list that matches everything

I probably didn't ask correctly: I would like a list value that can match any list: the "inverse" of (None,)
but even with (None,) it will match item as None (which I don't want)
The point is I have a function working with: [x for x in my_list if x[field] not in filter_list]
and I would like to filter everything or nothing without making tests like:
if filter_list==(None,): return [] and if filter_list==('*',): return my_list
PS: I wanted to simplify my question leading to some errors (list identifier) or stupid thing [x for x in x] ;)
Hi,
I need to do some filtering with list comprehension in python.
if I do something like that:
[x for x in list if x in (None,)]
I get rid of all values, which is fine
but I would like to have the same thing to match everything
I can do something like:
[x for x in list if x not in (None,)]
but it won't be homogeneous with the rest
I tried some things but for example (True,) matches only 1
Note that the values to filter are numeric, but if you have something generic (like (None,) to match nothing), it would be great
Thanks
Louis
__contains__ is the magic method that checks if something is in a sequence:
class everything(object):
    def __contains__(self, _):
        return True
for x in (1,2,3):
    print x in everything()
The better syntax would be:
[x for x in lst if x is None]
[x for x in lst if x is not None]
What do you mean by
I would like to have the same thing to match everything
Just do
[x for x in list]
and every item in list is matched.
You could change your program to accept a filter object, instead of a list.
The abstract base filter would have a matches method that returns true if x "matches".
Your general case filters would be constructed with a list argument, and would filter on membership of the list - the matches function would search the list and return true if the argument was in the list.
You could also have two special subclasses of the filter object : none and all.
These would have special match functions which either always return true (all) or false (none).
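A minimal sketch of that design, written for Python 3, could look like this. All class and function names here are hypothetical, and apply_filter drops matched items to mirror the question's x not in filter_list.

```python
class ListFilter:
    """Matches values that are members of the given collection."""
    def __init__(self, values):
        self.values = set(values)  # set for fast membership tests

    def matches(self, x):
        return x in self.values

class MatchAll:
    """Special case: matches every value."""
    def matches(self, x):
        return True

class MatchNone:
    """Special case: matches no value."""
    def matches(self, x):
        return False

def apply_filter(items, flt):
    """Drop every item the filter matches (mirrors `x not in filter_list`)."""
    return [x for x in items if not flt.matches(x)]

print(apply_filter([1, 2, 3], ListFilter([2])))
print(apply_filter([1, 2, 3], MatchAll()))
print(apply_filter([1, 2, 3], MatchNone()))
```

The calling code then stays the same for all three cases, which is the homogeneity the question asks for.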
You don't need an if, you can just say
[x for x in list]
but I would like to have the same
thing to match everything
To match everything, you don't need if statement
[x for x in list1]
or If you really like to do
[x for x in list1 if x in [x]]
Answering your revised question: the list that "matches" all possible values is effectively of infinite length. So you can't do what you want to do without an if test. I suggest that your arg should be either a list or one of two values representing the "all" and "none" cases:
FILTER_NONE = object() # or []
FILTER_ALL = object()
def filter_func(alist, filter_list):
    if filter_list is FILTER_ALL:
        return []
    elif filter_list is FILTER_NONE:
        return alist
        # or maybe alist[:] # copy the list
    return [x for x in alist if x not in filter_list]
If filter_list is large, you may wish to replace the last line with:
    filter_set = set(filter_list)
    return [x for x in alist if x not in filter_set]
Alternatively, don't bother; just document that filter_list (renamed as filter_collection) can be anything that supports __contains__() and remind readers that sets will be faster than lists.
