I'm working on scientific data and using a module called pysam in order to get reference position for each unique "object" in my file.
In the end, I obtain a "list of lists" that looks like that (here I provide an example with only two objects in the file):
pos = [[1,2,3,6,7,8,15,16,17,20],[1,5,6,7,8,20]]
and, for each list in pos, I would like to iterate over the values and compare value[i] with value[i+1]. When the difference is greater than 2 (for example) I want to store both values (value[i] and value[i+1]) into a new list.
If we call it final_pos then I would like to obtain:
final_pos = [[3,6,8,15,17,20],[1,5,8,20]]
It seemed rather easy to do at first, but I must be lacking some basic knowledge of how lists work, and I can't manage to iterate over each value of each list and then compare consecutive values together.
If anyone has an idea, I'm more than willing to hear about it !
Thanks in advance for your time !
EDIT: Here's what I tried:
pos = [[1,2,3,6,7,8,15,16,17,20],[1,5,6,7,8,20]]
final_pos = []
for list in pos:
    for value in list:
        for i in range(len(list)-1):
            if value[i+1]-value[i] > 2:
                final_pos.append(value[i])
                final_pos.append(value[i+1])
You can iterate over each individual list in pos and then compare the consecutive values. When you need to insert the values, use a temporary set, because you wouldn't want to insert the same element twice into your final list. Then convert the temporary set to a list and append it to your final list (after sorting it, to preserve order). Note that the sorting only produces the right order if the elements in the original list are themselves sorted.
pos = [[1,2,3,6,7,8,15,16,17,20],[1,5,6,7,8,20]]
final_pos = []
for l in pos:
    temp_set = set()
    for i in range(len(l)-1):
        if l[i+1] - l[i] > 2:
            temp_set.add(l[i])
            temp_set.add(l[i+1])
    final_pos.append(sorted(list(temp_set)))
print(final_pos)
Output
[[3, 6, 8, 15, 17, 20], [1, 5, 8, 20]]
Edit: About what you tried:
for list in pos:
This line will give us list = [1,2,3,6,7,8,15,16,17,20] (in the first iteration)
for value in list:
This line will give us value = 1 (in the first iteration)
Now, value is just a number, not a list, and hence value[i] and value[i+1] don't make sense.
Your code has an obvious "too many loops" issue. It also stores the result as a flat list, whereas you need a list of lists.
It also has a more subtle bug: the same element can be added more than once if two matching intervals occur in a row. I've registered the added indices in a set to avoid this.
The bug doesn't show up with your original data (which tripped up a lot of experienced users, including me), so I've changed it:
pos = [[1,2,3,6,7,8,11,15,16,17,20],[1,5,6,7,8,20]]
final_pos = []
for value in pos:
    sublist = []
    added_indexes = set()
    for i in range(len(value)-1):
        if value[i+1]-value[i] > 2:
            if i not in added_indexes:
                sublist.append(value[i])
                ## added_indexes.add(i)  # we don't need to add it, we won't go back
            # no need to test for i+1, it's new
            sublist.append(value[i+1])
            # registering it for later
            added_indexes.add(i+1)
    final_pos.append(sublist)
print(final_pos)
result:
[[3, 6, 8, 11, 15, 17, 20], [1, 5, 8, 20]]
Storing the indexes in a set rather than the values (which would also work here, with a post-processing sort; see the other answer) has the advantage that it still works when the objects aren't hashable (like custom objects which have a custom distance implemented between them) or when the lists are only partially sorted (waves), e.g. pos = [[1,2,3,6,15,16,17,20,1,6,10,11],[1,5,6,7,8,20,1,5,6,7,8,20]]
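For illustration, here is the same index-based loop as above, run as a self-contained snippet on that partially sorted example:

```python
# Same index-based approach, applied to a partially sorted ("waves") input.
pos = [[1,2,3,6,15,16,17,20,1,6,10,11],[1,5,6,7,8,20,1,5,6,7,8,20]]
final_pos = []
for value in pos:
    sublist = []
    added_indexes = set()
    for i in range(len(value)-1):
        if value[i+1] - value[i] > 2:
            if i not in added_indexes:     # don't re-add a value appended by the previous match
                sublist.append(value[i])
            sublist.append(value[i+1])
            added_indexes.add(i+1)
    final_pos.append(sublist)
print(final_pos)
# [[3, 6, 15, 17, 20, 1, 6, 10], [1, 5, 8, 20, 1, 5, 8, 20]]
```

Each "wave" of the input is processed independently of the others, since only consecutive positions are ever compared.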
I am a newbie to Python, just learning things as I work on my project. I have a list of lists, and I need to compare the second and last columns and, for each group, get the entry with the largest distance. Moreover, I am currently replicating the list. If someone could help me do this within a single list of lists, ag, it would be really helpful.
Thanks in advance
If this is the INPUT, then the output should be:
ag = [['chr12','XX',1,5,4],
['chr12','XX',2,5,3],
['chr13','ZZ',6,10,4],
['chr13','ZZ',8,9,1],
['ch14','YY',12,15,3],['ch14','YY',12,15,3]]
EXPECTED OUTPUT:
['chr12','XX',1,5,4]
['chr13','ZZ',6,10,4]
['ch14','YY',12,15,3]
#However I tried of replicating the list like
#INPUT
ag =
[['chr12','XX',1,5,4],
['chr12','XX',2,5,3],
['chr13','ZZ',6,10,4],
['chr13','ZZ',8,9,1],
['ch14','YY',12,15,3],
['ch14','YY',12,15,3]]
bg =
[['chr12','XX',1,5,4],
['chr12','XX',2,5,3],
['chr13','ZZ',6,10,4],
['chr13','ZZ',8,9,1],
['ch14','YY',12,15,3],
['ch14','YY',12,15,3]]
#The code which I tried was
c = []
for i in ag:
    for j in bg:
        if i[0]==j[0] and i[1]==j[1] and i[4]>j[4]:
            c.append(i)
The output which I get is:
[['chr12', 'XX', 1, 5, 4], ['chr13', 'ZZ', 6, 10, 4]]
In short: to compare items in an iterable (e.g. a list) of lists, use the keyword argument key of the max/min functions. It takes a function or lambda expression, and the values are compared by the result of the key function applied to each value.
Assuming that what you really want is to reduce a list of lists so that the entries' second elements are unique, and that you want the last elements to determine which entry to keep in case of redundant 2nd values:
If there is any problem regarding iteration, itertools has the answer. In this case, we just need the groupby function and the built-in max function with the keyword argument key.
from itertools import groupby

def filter(matrix):
    filtered = []  # create a result list to hold the rows (lists) we want
    for key, rows in groupby(matrix, lambda row: row[1]):  # iterate over the rows grouped by their 2nd element
        filtered.append(max(rows, key=lambda row: row[-1]))  # add the row that has the largest last value in its group
    return filtered  # return the result
We could squeeze this into a single generator expression or list comprehension, but for a beginner, the above code should be complex enough.
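As a usage sketch, here is the function applied to the ag list from the question (renamed filter_rows here to avoid shadowing the built-in filter). Note that groupby only groups adjacent equal keys, so the input must already be arranged so that rows with the same 2nd element are next to each other, as in the example:

```python
from itertools import groupby

def filter_rows(matrix):
    # for each run of rows sharing the same 2nd element, keep the row
    # whose last value is largest
    filtered = []
    for key, rows in groupby(matrix, lambda row: row[1]):
        filtered.append(max(rows, key=lambda row: row[-1]))
    return filtered

ag = [['chr12', 'XX', 1, 5, 4],
      ['chr12', 'XX', 2, 5, 3],
      ['chr13', 'ZZ', 6, 10, 4],
      ['chr13', 'ZZ', 8, 9, 1],
      ['ch14', 'YY', 12, 15, 3],
      ['ch14', 'YY', 12, 15, 3]]

print(filter_rows(ag))
# [['chr12', 'XX', 1, 5, 4], ['chr13', 'ZZ', 6, 10, 4], ['ch14', 'YY', 12, 15, 3]]
```

On ties (the two identical 'YY' rows), max returns the first row of the group.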
Please be sure to follow Stack Overflow's guidelines for future questions to prevent low ratings and ensure prompt and high quality answers.
Your lists ag and bg are complete duplicates of each other, so I give this example for your issue. Hope it helps.
>>> ag = [['chr12','XX',1,5,4],['chr12','XX',2,5,3],['chr13','ZZ',6,10,4],['chr13','ZZ',8,9,1],['ch14','YY',12,15,3]]
>>> bg = [['chr12','XX',1,5,4],['chr12','XX',2,5,3],['chr13','ZZ',6,10,4],['chr13','ZZ',8,9,1]]
>>> [i for i in ag + bg if i not in ag or i not in bg]
[['ch14', 'YY', 12, 15, 3]]
I was doing one of the course exercises on Codecademy for Python and I had a few questions I couldn't seem to find an answer to:
For this block of code, how exactly does python check whether something is "in" or "not in" a list? Does it run through each item in the list to check or does it use a quicker process?
Also, how would this code be affected if it were running with a massive list of numbers (thousands or millions)? Would it slow down as the list size increases, and are there better alternatives?
numbers = [1, 1, 2, 3, 5, 8, 13]
def remove_duplicates(list):
    new_list = []
    for i in list:
        if i not in new_list:
            new_list.append(i)
    return new_list
remove_duplicates(numbers)
Thanks!
P.S. Why does this code not function the same?
numbers = [1, 1, 2, 3, 5, 8, 13]
def remove_duplicates(list):
    new_list = []
    new_list.append(i for i in list if i not in new_list)
    return new_list
In order to execute i not in new_list Python has to do a linear scan of the list. The scanning loop breaks as soon as the result of the test is known, but if i is actually not in the list the whole list must be scanned to determine that. It does that at C speed, so it's faster than doing a Python loop to explicitly check each item. Doing the occasional in some_list test is ok, but if you need to do a lot of such membership tests it's much better to use a set.
On average, with random data, testing membership has to scan through half the list items, and in general the time taken to perform the scan is proportional to the length of the list. In the usual notation the size of the list is denoted by n, and the time complexity of this task is written as O(n).
In contrast, determining membership of a set (or a dict) can be done (on average) in constant time, so its time complexity is O(1). Please see TimeComplexity in the Python Wiki for further details on this topic. Thanks, Serge, for that link.
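A quick sketch of that difference using timeit (the collection size and the element searched for are arbitrary choices; the list case is the worst case, where the element sits at the very end):

```python
import timeit

# Membership test on a list (linear scan) vs. a set (hash lookup).
data_list = list(range(100_000))
data_set = set(data_list)

# 99_999 is the last element, so the list scan must traverse everything.
list_time = timeit.timeit(lambda: 99_999 in data_list, number=100)
set_time = timeit.timeit(lambda: 99_999 in data_set, number=100)

print(f"list: {list_time:.5f}s  set: {set_time:.6f}s")
```

The set lookup is typically several orders of magnitude faster, and the gap grows linearly with the size of the list.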
Of course, if you're using a set then you get de-duplication for free, since it's impossible to add duplicate items to a set.
One problem with sets is that they generally don't preserve order. But you can use a set as an auxiliary collection to speed up de-duping. Here is an illustration of one common technique to de-dupe a list, or other ordered collection, which does preserve order. I'll use a string as the data source because I'm too lazy to type out a list. ;)
new_list = []
seen = set()
for c in "this is a test":
    if c not in seen:
        new_list.append(c)
        seen.add(c)
print(new_list)
output
['t', 'h', 'i', 's', ' ', 'a', 'e']
Please see How do you remove duplicates from a list whilst preserving order? for more examples. Thanks, Jean-François Fabre, for the link.
As for your PS, that code appends a single generator object to new_list; it doesn't append what the generator would produce.
I assume you already tried to do it with a list comprehension:
new_list = [i for i in list if i not in new_list]
That doesn't work, because the new_list doesn't exist until the list comp finishes running, so doing in new_list would raise a NameError. And even if you did new_list = [] before the list comp, it won't be modified by the list comp, and the result of the list comp would simply replace that empty list object with a new one.
BTW, please don't use list as a variable name (even in example code) since that shadows the built-in list type, which can lead to mysterious error messages.
You are asking multiple questions and one of them asking if you can do this more efficiently. I'll answer that.
Ok, let's say you had thousands or millions of numbers. From where exactly? Say they were stored in some kind of text file; then you would probably want to use numpy (if you are sticking with Python, that is). Example:
import numpy as np
numbers = np.array([1, 1, 2, 3, 5, 8, 13], dtype=np.int32)
numbers = np.unique(numbers).tolist()
This will be more efficient (above all, more memory-efficient) than reading the file with plain Python and performing a list(set(...)):
numbers = [1, 1, 2, 3, 5, 8, 13]
numbers = list(set(numbers))
You are asking for the algorithmic complexity of this function. To find that you need to see what is happening at each step.
You are scanning the list one at a time, which takes 1 unit of work. This is because retrieving something from a list is O(1). If you know the index, it can be retrieved in 1 operation.
The list to which you are going to add it increases at worst case 1 at a time. So at any point in time, the unique items list is going to be of size n.
Now, to add the item you picked to the unique items list is going to take n work in the worst case. Because we have to scan each item to decide that.
So if you sum up the total work in each step, it would be 1 + 2 + 3 + 4 + 5 + ... n which is n (n + 1) / 2. So if you have a million items, you can just find that by applying n = million in the formula.
This is not entirely accurate because of how Python lists are implemented, but theoretically, it helps to visualize it this way.
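To make the 1 + 2 + ... + n = n(n + 1)/2 estimate concrete, here is an instrumented sketch of the same algorithm that counts how much scanning work the membership tests do (the function name and sizes are illustrative):

```python
def remove_duplicates_counting(items):
    # Same algorithm as in the question, but counting the work done by
    # each 'i not in new_list' scan (worst case: the whole list is scanned).
    new_list = []
    comparisons = 0
    for i in items:
        comparisons += len(new_list)  # upper bound on this scan's comparisons
        if i not in new_list:
            new_list.append(i)
    return new_list, comparisons

# With n distinct items the scans cost 0 + 1 + ... + (n-1) = n*(n-1)/2.
unique, work = remove_duplicates_counting(list(range(100)))
print(work)  # 4950 == 100*99//2
```

Doubling n quadruples the work, which is exactly the O(n^2) behavior described above.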
To answer the question in the title: Python has more efficient data types, but the list object is just a plain array. If you want a more efficient way to search values, you can use a dict, which hashes each stored object to place it in a hash table; I assume that is what you were thinking of when you mentioned "a quicker process".
As to the second code snippet:
list.append() inserts whatever value you give it at the end of the list, and i for i in list if i not in new_list is a generator object, so it inserts that generator as a single object into the list. list.extend() does what you want: it takes an iterable and appends all of its elements to the list.
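A minimal sketch of that difference:

```python
source = [1, 2, 2, 3]

a = []
a.append(i for i in source)   # appends the generator object itself, not its elements
print(a)                      # [<generator object ...>]

b = []
b.extend(i for i in source)   # consumes the generator, appending each element
print(b)                      # [1, 2, 2, 3]
```

After this runs, a contains exactly one item (the generator), while b contains the four numbers.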
My problem is about managing insert/append methods within loops.
I have two lists of length N: the first one (let's call it s) indicates the subset to which each element belongs, while the second one represents a quantity x that I want to evaluate. For the sake of simplicity, let's say that every subset has T elements.
cont = 0;
for i in range(NSUBSETS):
    for j in range(T):
        subcont = 0;
        if (x[(i*T)+j] < 100):
            s.insert(((i+1)*T)+cont, s[(i*T)+j+cont]);
            x.insert(((i+1)*T)+cont, x[(i*T)+j+cont]);
            subcont += 1;
        cont += subcont;
While cycling over all the elements of the two lists, I'd like that, when a certain condition is fulfilled (e.g. x[i] < 100), a copy of that element is put at the end of the subset, and then going on with the loop till completing the analysis of all the original members of the subset. It would be important to maintain the "order", i.e. inserting the elements next to the last element of the subset it comes from.
I thought a way could have been to store within 2 counter variables the number of copies made within the subset and globally, respectively (see code): this way, I could shift the index of the element I was looking at according to that. I wonder whether there exists some simpler way to do that, maybe using some Python magic.
If the idea is to interpolate your extra copies into the lists without making a complete copy of the whole list, you can try this with a generator expression. As you loop through your lists, collect the matches you want to append. Yield each item as you process it, then yield each collected item too.
This is a simplified example with only one list, but hopefully it illustrates the idea. You only get a copy if you do as I've done and expand the generator with a comprehension. If you just wanted to store or further analyze the processed list (e.g., to write it to disk), you would never need to have it in memory all at once.
def append_matches(input_list, start, end, predicate):
    # where predicate is a filter function or lambda
    for item in input_list[start:end]:
        yield item
    for item in filter(predicate, input_list[start:end]):
        yield item
example = lambda p: p < 100
data = [1,2,3,101,102,103,4,5,6,104,105,106]
print [k for k in append_matches (data, 0, 6, example)]
print [k for k in append_matches (data, 5, 11, example)]
[1, 2, 3, 101, 102, 103, 1, 2, 3]
[103, 4, 5, 6, 104, 105, 4, 5, 6]
I'm guessing that your desire not to copy the lists is based on your C background: an assumption that copying would be more expensive. In Python, lists are not linked lists; they are more like vectors (dynamic arrays), so an insert takes O(n) time because every element after the insertion point has to be shifted.
Building a new copy with the extra elements would be more efficient than trying to update in-place. If you really want to go that way, you would need to write a LinkedList class holding prev/next references so that your Python code really was a copy of the C approach.
The most Pythonic approach would not try to do an in-place update, as it is simpler to express what you want using values rather than references:
def expand(origLs):
    subsets = [origLs[i*T:(i+1)*T] for i in range(NSUBSETS)]
    result = []
    for s in subsets:
        copies = [e for e in s if e < 100]
        result += s + copies
    return result
The main thing to keep in mind is that the underlying cost model for an interpreted garbage-collected language is very different to C. Not all copy operations actually cause data movement, and there are no guarantees that trying to reuse the same memory will be successful or more efficient. The only real answer is to try both techniques on your real problem and profile the results.
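As a sketch of such a profile (sizes here are arbitrary), compare duplicating every element of a list via repeated insert against simply building a new list:

```python
import timeit

n = 10_000
data = list(range(n))

def duplicate_with_inserts():
    ls = data[:]
    # walk backwards so earlier inserts don't shift the indices we still need
    for i in range(n - 1, -1, -1):
        ls.insert(i, ls[i])   # each insert shifts the tail: O(len(ls)) per call
    return ls

def duplicate_by_rebuilding():
    out = []
    for e in data:
        out.append(e)   # appends are amortized O(1)
        out.append(e)
    return out

assert duplicate_with_inserts() == duplicate_by_rebuilding()

t_insert = timeit.timeit(duplicate_with_inserts, number=3)
t_rebuild = timeit.timeit(duplicate_by_rebuilding, number=3)
print(f"insert: {t_insert:.3f}s  rebuild: {t_rebuild:.3f}s")
```

The insert version does O(n^2) element moves in total, the rebuild version O(n) appends, so rebuilding wins decisively as n grows; exact numbers will vary by machine, which is why profiling your real workload matters.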
I'd be inclined to make a copy of your lists and then, while looping across the originals, insert into the copy at the required place whenever the criterion is met. You can then output the copied, updated lists.
I think I have found a simple solution.
I cycle over the subsets backwards, from the last one, putting the copies at the end of each subset. This way, I avoid encountering the "new" elements and get rid of the counters and the like.
for i in range(NSUBSETS-1, -1, -1):
    for j in range(T-1, -1, -1):
        if (x[(i*T)+j] < 100):
            s.insert(((i+1)*T), s[(i*T)+j])
            x.insert(((i+1)*T), x[(i*T)+j])
One possibility would be using numpy's advanced indexing to provide the illusion of copying elements to the ends of the subsets. You build a list of "copy" indices into the original list, and add it to an index/slice list that represents each subset. Then you combine all the index/slice lists at the end, and use the final index list to access all your items (advanced indexing/slicing returns a copy rather than a view; I believe there's support for doing this generator-style too, which you may find useful). Depending on how many elements meet the criteria to be copied, this should be decently efficient, as each subset keeps its base indices as a slice object, reducing the number of indices to keep track of.
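A minimal sketch of that indexing idea for a single subset, assuming NumPy is available (the array contents and subset length are illustrative):

```python
import numpy as np

# One subset of length T; elements < 100 should appear again at the subset's end.
T = 6
x = np.array([1, 2, 3, 101, 102, 103, 4, 5, 6, 104, 105, 106])

copies = np.flatnonzero(x[:T] < 100)            # indices (within the subset) to duplicate
order = np.concatenate([np.arange(T), copies])  # subset indices followed by the copy indices
print(x[order])                                 # [  1   2   3 101 102 103   1   2   3]
```

The original array is never modified; the "expanded" subset only materializes when the fancy index is applied.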
I promise I've tried searching, but every single question I find ends up having some criteria unstated or violated that makes the answer insufficient for me.
I'm sending a list to a Python script. That list will be stored somewhere, but I want to minimize writes (this is on a remote service and I get charged for each write).
listNew = ["some", "list", "sent", "in", "that", "may", "be", "different", "later", "some"]
listPrevious = ["some", "some", "list", "that", "was", "saved", "previously"]
(Please don't get distracted by their being strings; my list actually contains ints.)
The simple, basic algorithm is to iterate both lists on an index-by-index basis. If the items are the same, I don't need to write; boom, money saved. The data ultimately saved, however, should be listNew.
In other languages, I could directly reference elements by index.
for (int i = 0; i < listNew.length; i++) {
    // Have we exceeded the previous list's length? Time to just write data in.
    if (listPrevious[i] == null) {
        listPrevious.append(listNew[i]);
        continue;
    }
    if (listNew[i] != listPrevious[i])
        listPrevious[i] = listNew[i];
}
Unfortunately, what I've found in looping techniques and list methods doesn't provide:
the means to get elements by index without removing them (unlike the pop method), nor
the means to get the index of an element by exact value and positioning, since I have duplicates (in the above code, list.index("some") would return the first index in listPrevious though I'm actually looking at the last element in listNew), nor
the means to iterate through my lists beyond the length of one of the lists (zip() doesn't iterate beyond the length of the smaller list, it seems).
Any ideas on how I should handle this? One of those three criteria was always violated in some way in the previous questions I searched through.
I'm trying to avoid a solution like the following, by the way, which is also among the marked solutions in other questions.
for newitem in listNew
    for olditem in listPrevious
        if newitem != olditem
            # save the newitem
That compares the element from listNew with every single element in listPrevious, which is inefficient. I just need to know if it matches at the same index in the other list.
------- By Comment Request
Input: 2 lists, listNew and listPrevious. Another example
listNew = [100, 500, 200, 200, 100, 50, 700]
listPrevious = [100, 500, 200, 400, 400, 50]
Output: listPrevious is now listNew without having to overwrite elements that were the same.
listPrevious = [100, 500, 200, 200, 100, 50, 700]
did not require writes: [100, 500, 200, _, _, 50, _] <- 4 writes saved
did require writes:     [_, _, _, 200, 100, _, 700] <- 3 writes executed, not .length writes executed!
From your C code I have created the following. Hopefully it does what you want:
for i in range(len(listNew)):
    # Have we exceeded the previous list's length? Time to just write data in.
    if i >= len(listPrevious):
        listPrevious.append(listNew[i])
        continue
    if listNew[i] != listPrevious[i]:
        listPrevious[i] = listNew[i]
If you want to iterate in order with indexes you need enumerate:
for idx, item in enumerate(mylist):
    # idx is the 0-indexed value where item resides in mylist.
If you want to iterate over pairs of things in python you use zip:
for a, b in zip(newlist, oldlist):
    # items a and b reside at the same index in their respective parent lists.
You can combine the approaches:
for idx, (a, b) in enumerate(zip(newlist, oldlist)):
    # here you have everything you probably need, based on what I can
    # tell from your question.
Depending on your data sets, you may also look at the additional functions in the itertools module, specifically izip_longest.
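In Python 3 that function is named zip_longest; here is a sketch of applying it to the write-minimizing problem from the question (the write counter is illustrative, standing in for whatever remote-write call you actually pay for):

```python
from itertools import zip_longest  # named izip_longest in Python 2

listNew = [100, 500, 200, 200, 100, 50, 700]
listPrevious = [100, 500, 200, 400, 400, 50]

writes = 0
result = []
for new, old in zip_longest(listNew, listPrevious):
    if new is None:       # listNew exhausted; nothing left to store
        break
    if new != old:        # value changed, or old is None (past the end): a write
        writes += 1
    result.append(new)

print(result)  # [100, 500, 200, 200, 100, 50, 700]
print(writes)  # 3
```

zip_longest pads the shorter list with None, so the loop also covers the case where listNew is longer than listPrevious; if None can be a legitimate list value, use the fillvalue parameter with a dedicated sentinel instead.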
Python's list methods actually do provide all of the capabilities you think they don't (the last code sample is equivalent to your example code):
the means to get elements by index without removing it (pop method)
>>> data = ['a', 'b', 'c']
>>> data[1] # accessing an element by index
'b'
the means to get the index of an element by exact value and positioning, since I have duplicates (in the above code, using list.index("some") would return the first index in listPrevious though I'm actually looking at the last element in listNew)
>>> data = ['a', 'b', 'c', 'b', 'a']
>>> data.index('a') # without a start arg, call finds the first index
0
>>> data.index('a', 1) # you can find later indices by giving a start index
4
the means to iterate through my lists beyond the length of one of the lists (zip() doesn't iterate beyond the length of the smaller list, it seems).
for i, item in enumerate(listNew):  # loops over indices and values
    if i >= len(listPrevious):
        listPrevious.append(item)
        continue
    if item != listPrevious[i]:
        listPrevious[i] = item
Is the item's position important?
If not, simply do this:
for n in NewList:
    if n not in OldList:
        OldList.append(n)
        process(n)
I have two lists of strings that are passed into a function. They are more or less the same, except that one has been run through a regex filter to remove certain boilerplate substrings (e.g. removing 'LLC' from 'Blues Brothers LLC').
This function is meant to internally deduplicate the modified list and remove the associated item in the non-modified list. You can assume that these lists were sorted alphabetically before being run through the regex filter, and remain in the same order (i.e. original[x] and modified[x] refer to the same entity, even if original[x] != modified[x]). Relative order must be maintained between the two lists in the output.
This is what I have so far. It works 99% of the time, except for very rare combinations of inputs and boilerplate strings (1 in 1000s) where some output strings will be mismatched by a single list position. Input lists are 'original' and 'modified'.
# record positions of duplicates so we're not trying to modify the same lists we're iterating
dellist_modified = []
dellist_original = []

# probably not necessary, extra precaution against modifying lists being iterated.
# fwiw the problem still exists if I remove these and change their references in the last two lines directly to the input lists
modified_copy = modified
original_copy = original

for i in range(0, len(modified)-1):
    if modified[i] == modified[i+1]:
        dellist_modified.append(modified[i+1])
        dellist_original.append(original[i+1])

for j in dellist_modified:
    if j in modified:
        del modified_copy[agg_match.index(j)]
        del original_copy[agg_match.index(j)]

# return modified_copy and original_copy
It's ugly, but it's all I got. My testing indicates the problem is created by the last chunk of code.
Modifications or entirely new approaches would be greatly appreciated. My next step is to try using dictionaries.
Here is a clean way of doing this:
original = list(range(10))
modified = list(original)
modified[5] = "a"
modified[6] = "a"

def without_repeated(original, modified):
    seen = set()
    for (o, m) in zip(original, modified):
        if m not in seen:
            seen.add(m)
            yield o, m

original, modified = zip(*without_repeated(original, modified))
print(original)
print(modified)
Giving us:
(0, 1, 2, 3, 4, 5, 7, 8, 9)
(0, 1, 2, 3, 4, 'a', 7, 8, 9)
We iterate through both lists at the same time. We keep a set of the modified items we have seen (sets have very fast membership checks) and yield any pairs whose modified value we haven't already seen.
We can then use zip again to give us two lists back.
Note we could actually do this like so:
seen = set()
original, modified = zip(*((o, m) for (o, m) in zip(original, modified) if m not in seen and not seen.add(m)))
This works the same way, except it uses a single generator expression, with the addition of the item to the set hacked into the conditional (since add always returns None, which is falsy, not seen.add(m) is always true and just performs the insertion). However, this method is considerably harder to read, so I'd advise against it; it's just an example for the sake of it.
A set in Python is a collection of distinct elements. Is the order of these elements critical? If not, something like this may work:
distinct = list(set(original))
Why use parallel lists? Why not a single list of class instances? That keeps things grouped easily, and reduces your list lookups.
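A sketch of that single-list idea (the Entry class and sample data are hypothetical, not from the question): each instance keeps the original and filtered strings together, so deduplicating on the filtered name automatically keeps both in sync and relative order is preserved for free:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    original: str   # the raw string
    modified: str   # the string after the regex boilerplate filter

entries = [
    Entry("Blues Brothers LLC", "Blues Brothers"),
    Entry("Blues Brothers Inc", "Blues Brothers"),  # duplicate after filtering
    Entry("Stax Records", "Stax Records"),
]

seen = set()
deduped = []
for e in entries:
    if e.modified not in seen:   # dedupe on the filtered name only
        seen.add(e.modified)
        deduped.append(e)        # the matching original travels along automatically

print([e.original for e in deduped])  # ['Blues Brothers LLC', 'Stax Records']
```

Because each Entry is one object, there is no way for the two strings to drift out of alignment, which is exactly the failure mode the question describes.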