numbers in a list, find average of surroundings - python

Question:
Given a list listA of numbers, write a program that generates a new list listB with the same number of elements as listA, such that each element in the new list is the average of its neighbors and itself in the original list.
For example, if listA = [5, 1, 3, 8, 4], listB = [3.0, 3.0, 4.0, 5.0, 6.0], where:
(5 + 1)/2 = 3.0
(5 + 1 + 3)/3 = 3.0
(1 + 3 + 8)/3 = 4.0
(3 + 8 + 4)/3 = 5.0
(8 + 4)/2 = 6.0
I can get the first part and the last part, since those only involve 2 numbers, but I can't get the middle part. My loop is wrong, but I don't know exactly where. This is what I have so far:
listA = [5,1,3,8,4]
N = len(listA)
print(listA)
listB = []
listB.append((listA[0]+listA[1])/2)
y = 0
x = 1
while x in listA:
    y = ((listA[x-1] + list[x] + list[x+1])/3)
    listB.append(y)
    y = y+1
listB.append((listA[-1]+listA[-2])/2)
print(listB)

You can do this using iterators without having to resort to looping through indices:
import itertools
def neighbours(items, fill=None):
    before = itertools.chain([fill], items)
    after = itertools.chain(items, [fill])  # You could use itertools.zip_longest() later instead.
    next(after)
    for a, b, c in zip(before, items, after):
        yield [value for value in (a, b, c) if value is not fill]
Used like so:
>>> items = [5, 1, 3, 8, 4]
>>> [sum(values)/len(values) for values in neighbours(items)]
[3.0, 3.0, 4.0, 5.0, 6.0]
So how does this work? We create some extra iterators - for the before and after values. We use itertools.chain to add an extra value to the beginning and end respectively, in order to allow us to get the right values at the right time (and not run out of items). We then advance the later item on one, to put it in the right position, then loop through, returning the values that are not None. This means we can just loop through in a very natural way.
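The comment in the code above mentions itertools.zip_longest(); a minimal sketch of that variant (neighbours_zl is a name chosen here, not from the original answer):
import itertools

def neighbours_zl(items, fill=None):
    # pair each item with its successor; zip_longest pads the final pair with fill
    prev = fill
    for b, c in itertools.zip_longest(items, items[1:], fillvalue=fill):
        yield [v for v in (prev, b, c) if v is not fill]
        prev = b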
Note that this requires a list, as an iterator will be exhausted. If you need it to work lazily on an iterator, the following example uses itertools.tee() to do the job:
def neighbours(items, fill=None):
    b, i, a = itertools.tee(items, 3)
    before = itertools.chain([fill], b)
    after = itertools.chain(a, [fill])
    next(a)
    for a, b, c in zip(before, i, after):
        yield [value for value in (a, b, c) if value is not fill]

list_a = [5, 1, 3, 8, 4]
# Start list_b with the special case first element
list_b = [sum(list_a[:2]) / 2.0]
# Iterate over each remaining index in the list
for i in range(1, len(list_a)):
    # Get the slice of the element and its neighbors; the slice is
    # truncated at the end of the list, which handles the
    # last-element special case
    neighborhood = list_a[i-1:i+2]
    # Add the average of the element and its neighbors; we compute
    # len(neighborhood) because the last slice has only 2 items
    list_b.append(sum(neighborhood) / float(len(neighborhood)))
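Running this on the sample list should reproduce the expected output:
print(list_b)  # expected: [3.0, 3.0, 4.0, 5.0, 6.0]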

Could use arcpy.AverageNearestNeighbor_stats
Otherwise, if you like loops:
import numpy as np
listA = [5,1,3,8,4]
b = []
for r in range(len(listA)):
    if r == 0:
        b.append(np.mean(listA[r:r+2]))
    else:
        # the slice truncates at the end of the list, so the last
        # element is averaged with only its left neighbour
        b.append(np.mean(listA[r-1:r+2]))
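If you would rather avoid the loop entirely, the same neighborhood averages can be computed with windowed sums; a sketch using np.convolve (this variant is not from the original answer):
import numpy as np

a = np.array([5, 1, 3, 8, 4], dtype=float)
# mode='same' keeps the output length; the edge windows cover only 2 elements
sums = np.convolve(a, np.ones(3), mode='same')
counts = np.convolve(np.ones_like(a), np.ones(3), mode='same')
print(sums / counts)  # [3. 3. 4. 5. 6.]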

It looks like you've got the right idea. Your code is a little tough to follow, though; try using more descriptive variable names in the future :) It makes it easier on everyone.
Here's my quick and dirty solution:
def calcAverages(listOfNums):
    outputList = []
    for i in range(len(listOfNums)):
        if i == 0:
            outputList.append((listOfNums[0] + listOfNums[1]) / 2.)
        elif i == len(listOfNums)-1:
            outputList.append((listOfNums[i] + listOfNums[i-1]) / 2.)
        else:
            outputList.append((listOfNums[i-1] +
                               listOfNums[i] +
                               listOfNums[i+1]) / 3.)
    return outputList

if __name__ == '__main__':
    listOne = [5, 1, 3, 8, 4, 7, 20, 12]
    print(calcAverages(listOne))
I opted for a for loop instead of a while. This doesn't make a big difference, but I feel the syntax is easier to follow.
for i in range(len(listOfNums)):
We create a loop which will iterate over the length of the input list.
Next we handle the two "special" cases: the beginning and end of the list.
if i == 0:
    outputList.append((listOfNums[0] + listOfNums[1]) / 2.)
elif i == len(listOfNums)-1:
    outputList.append((listOfNums[i] + listOfNums[i-1]) / 2.)
So, if our index is 0, we're at the beginning, so we take the value at the current index, 0, and the next one, 1, average them, and append the result to our output list.
If our index is equal to the length of our list - 1 (we use -1 because lists are indexed starting at 0, while length is not; without the -1, we would get an IndexError), we know we're on the last element. And thus, we take the value at that position, add it to the value at the previous position in the list, and finally append the average of those numbers to the output list.
else:
    outputList.append((listOfNums[i-1] +
                       listOfNums[i] +
                       listOfNums[i+1]) / 3.)
Finally, for all of the other cases, we simply grab the value at the current index, and those immediately above and below it, and then append the averaged result to our output list.

In [31]: lis=[5, 1, 3, 8, 4]
In [32]: new_lis=[lis[:2]]+[lis[i:i+3] for i in range(len(lis)-1)]
In [33]: new_lis
Out[33]: [[5, 1], [5, 1, 3], [1, 3, 8], [3, 8, 4], [8, 4]]
#now using sum() and len() on the above list in a list comprehension
In [35]: [sum(x)/float(len(x)) for x in new_lis]
Out[35]: [3.0, 3.0, 4.0, 5.0, 6.0]
or using zip():
In [36]: list1=[lis[:2]] + zip(lis,lis[1:],lis[2:]) + [lis[-2:]]
In [37]: list1
Out[37]: [[5, 1], (5, 1, 3), (1, 3, 8), (3, 8, 4), [8, 4]]
In [38]: [sum(x)/float(len(x)) for x in list1]
Out[38]: [3.0, 3.0, 4.0, 5.0, 6.0]
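Note that in Python 3, zip() returns an iterator, so it must be wrapped in list() before concatenating (a small adjustment to the session above):
lis = [5, 1, 3, 8, 4]
list1 = [lis[:2]] + list(zip(lis, lis[1:], lis[2:])) + [lis[-2:]]
print([sum(x) / len(x) for x in list1])  # [3.0, 3.0, 4.0, 5.0, 6.0]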

Related

Sorting a list depending on multiple elements within the list

Is it possible to implement a Python key for sorting that depends on multiple list elements?
For example:
list = [1, 2, 3, 4]
And I want to sort the list depending on the difference between two elements, so that the delta is maximized between them.
Expected result:
list = [1, 4, 2, 3] # delta = 4-1 + 4-2 + 3-2 = 6
Other result would also be possible, but 1 is before 4 in the origin array so 1 should be taken first:
list = [4, 1, 3, 2] # delta = 4-1 + 3-1 + 3-2 = 6
I want to use python sorted like:
sorted(list, key=lambda e1, e2: abs(e1-e2))
Is there any possibility to do it this way? Maybe there is another library which could be used.
Since (as you showed us) there can be multiple different results, this ordering is not deterministic, and hence you can't express it with a key function.
That said, it's easy to implement the sorting by yourself:
def my_sort(col):
    res = []
    while col:
        _max = max(col)
        col.remove(_max)
        res.append(_max)
        if col:
            _min = min(col)
            col.remove(_min)
            res.append(_min)
    return res

print(my_sort([1,2,3,4]))  # [4, 1, 3, 2]
This solution runs in O(n^2), but it can be improved by sorting col at the beginning and then, instead of searching for the max and min each time, taking items from the two ends of the sorted list. Doing that reduces the time complexity to O(n log(n)), as sketched below.
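A minimal sketch of that improvement (my_sort_fast is a name chosen here):
def my_sort_fast(col):
    # sort once, then take items alternately from the top and the bottom: O(n log n)
    s = sorted(col)
    res = []
    lo, hi = 0, len(s) - 1
    while lo <= hi:
        res.append(s[hi])
        hi -= 1
        if lo <= hi:
            res.append(s[lo])
            lo += 1
    return res

print(my_sort_fast([1, 2, 3, 4]))  # [4, 1, 3, 2]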
EDIT
Per your comment below: if the index plays a role, again, it's not a "real" sorting :) That said, this solution can be engineered to keep the smaller index first, and so on:
def my_sort(col):
    res = []
    while col:
        _max = max(col)
        max_index = col.index(_max)
        col.remove(_max)
        if col:
            _min = min(col)
            min_index = col.index(_min)
            col.remove(_min)
            if max_index < min_index:
                res.extend([_max, _min])
            else:
                res.extend([_min, _max])
            continue
        res.append(_max)
    return res

print(my_sort([1,2,3,4]))  # [1, 4, 2, 3]
This solution is quite brute force; however, it is still a possibility:
from itertools import permutations
list = [1, 2, 3, 4]
final_list = ((i, sum(abs(i[b]-i[b+1]) for b in range(len(i)-1))) for i in permutations(list, len(list)))
final_lists = max(final_list, key=lambda x: x[-1])
print(final_lists)
Output:
((2, 4, 1, 3), 7)
Note that the output is in the form (permutation, total_sum).

Picking the most common element from a bunch of lists

I have a list l of lists [l1, ..., ln] of equal length
I want to compare the l1[k], l2[k], ..., ln[k] for all k in len(l1) and make another list l0 by picking the element that appears most frequently.
So, if l1 = [1, 2, 3], l2 = [1, 4, 4] and l3 = [0, 2, 4], then l0 = [1, 2, 4]. If there is a tie, I will look at the lists that make up the tie and choose the one in the list with higher priority. Priority is given a priori; each list is given a priority.
Ex. if you have value 1 in lists l1 and l3, and value 2 in lists l2 and l4, and 3 in l5, and lists are ordered according to priority, say l5>l2>l3>l1>l4, then I will pick 2, because 2 is in l2 that contains an element with highest occurrence and its priority is higher than l1 and l3.
How do I do this in python without creating a for loop with lots of if/else conditions?
You can use the Counter class from the collections module. Using the map function will reduce your list looping. You will need an if/else statement for the case that there is no single most frequent value, but only for that:
import collections
list0 = []
list_length = len(your_lists[0])
for k in range(list_length):
    k_vals = list(map(lambda x: x[k], your_lists))  # collect all values at position k
    counts = collections.Counter(k_vals).most_common()  # tuples (val, ct) sorted by count
    if len(counts) == 1 or counts[0][1] > counts[1][1]:  # is there a single most common value?
        list0.append(counts[0][0])  # take the value with the highest count
    else:
        list0.append(k_vals[0])  # take the element from the first list
list0 is the answer you are looking for. I just hate using l because it's easy to confuse with the number 1
Edit (based on comments):
Incorporating your comments, instead of the if/else statement, use a while loop:
i = len(your_lists)
while len(counts) > 1 and counts[0][1] == counts[1][1]:
    counts = collections.Counter(k_vals[:i]).most_common()  # ignore the lowest priority element
    i -= 1  # go back farther if there's still a tie
list0.append(counts[0][0])  # take the value with the highest count once there's no tie
So the whole thing is now:
import collections
list0 = []
list_length = len(your_lists[0])
for k in range(list_length):
    k_vals = list(map(lambda x: x[k], your_lists))  # collect all values at position k
    counts = collections.Counter(k_vals).most_common()  # tuples (val, ct) sorted by count
    i = len(your_lists)
    while len(counts) > 1 and counts[0][1] == counts[1][1]:  # in case of a tie
        counts = collections.Counter(k_vals[:i]).most_common()  # ignore the lowest priority element
        i -= 1  # go back farther if there's still a tie
    list0.append(counts[0][0])  # take the value with the highest count
You throw in one more tiny loop but on the bright side there's no if/else statements at all!
Just transpose the sublists and get the Counter.most_common element key from each group:
from collections import Counter
lists = [[1, 2, 3],[1, 4, 4],[0, 2, 4]]
print([Counter(sub).most_common(1)[0][0] for sub in zip(*lists)])
If they are individual lists just zip those:
l1, l2, l3 = [1, 2, 3], [1, 4, 4], [0, 2, 4]
print([Counter(sub).most_common(1)[0][0] for sub in zip(l1,l2,l3)])
Not sure how taking the first element from the grouping makes sense if there is a tie, as it may not be the one that tied, but that is trivial to implement: just get the two most_common and check whether their counts are equal:
def most_cm(lists):
    for sub in zip(*lists):
        # get two most frequent
        comm = Counter(sub).most_common(2)
        # if their counts are equal just return the element from l1
        yield comm[0][0] if len(comm) == 1 or comm[0][1] != comm[1][1] else sub[0]
We also need if len(comm) == 1 in case all the elements are the same or we will get an IndexError.
If you are talking about taking the element that comes from the earlier list in the event of a tie i.e l2 comes before l5 then that is just the same as taking any of the elements that tie.
For a decent number of sublists:
In [61]: lis = [[randint(1,10000) for _ in range(10)] for _ in range(100000)]
In [62]: list(most_cm(lis))
Out[62]: [5856, 9104, 1245, 4304, 829, 8214, 9496, 9182, 8233, 7482]
In [63]: timeit list(most_cm(lis))
1 loops, best of 3: 249 ms per loop
Solution is:
a = [1, 2, 3]
b = [1, 4, 4]
c = [0, 2, 4]
print([max(set(element), key=element.count) for element in zip(a, b, c)])
That's what you're looking for:
from collections import Counter
from operator import itemgetter
l0 = [max(Counter(li).items(), key=itemgetter(1))[0] for li in zip(*l)]
If you are OK taking any one of a set of elements that are tied as most common, and you can guarantee that you won't hit an empty list within your list of lists, then here is a way using Counter (so, from collections import Counter):
l = [[1, 0, 2, 3, 4, 7, 8],
     [2, 0, 2, 1, 0, 7, 1],
     [2, 0, 1, 4, 0, 1, 8]]
res = []
for k in range(len(l[0])):
    res.append(Counter(lst[k] for lst in l).most_common()[0][0])
Doing this in IPython and printing the result:
In [86]: res
Out[86]: [2, 0, 2, 1, 0, 7, 8]
Try this:
l1 = [1,2,3]
l2 = [1,4,4]
l3 = [0,2,4]
lists = [l1, l2, l3]
print([max(set(x), key=x.count) for x in zip(*lists)])

Pythonic way to merge two overlapping lists, preserving order

Alright, so I have two lists, as such:
They can and will have overlapping items, for example, [1, 2, 3, 4, 5], [4, 5, 6, 7].
There will not be additional items in the overlap, for example, this will not happen: [1, 2, 3, 4, 5], [3.5, 4, 5, 6, 7]
The lists are not necessarily ordered nor unique. [9, 1, 1, 8, 7], [8, 6, 7].
I want to merge the lists such that existing order is preserved, and to merge at the last possible valid position, and such that no data is lost. Additionally, the first list might be huge. My current working code is as such:
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]

def merge(master, addition):
    n = 1
    while n < len(master):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
        n += 1
    return master + addition
What I would like to know is - is there a more efficient way of doing this? It works, but I'm slightly leery of this, because it can run into large runtimes in my application - I'm merging large lists of strings.
EDIT: I'd expect the merge of [1,3,9,8,3,4,5], [3,4,5,7,8] to be: [1,3,9,8,3,4,5,7,8]. For clarity, the overlapping portion is 3, 4, 5.
[9, 1, 1, 8, 7], [8, 6, 7] should merge to [9, 1, 1, 8, 7, 8, 6, 7]
You can try the following:
>>> a = [1, 3, 9, 8, 3, 4, 5]
>>> b = [3, 4, 5, 7, 8]
>>> matches = (i for i in xrange(len(b), 0, -1) if b[:i] == a[-i:])
>>> i = next(matches, 0)
>>> a + b[i:]
[1, 3, 9, 8, 3, 4, 5, 7, 8]
The idea is we check the first i elements of b (b[:i]) with the last i elements of a (a[-i:]). We take i in decreasing order, starting from the length of b until 1 (xrange(len(b), 0, -1)) because we want to match as much as possible. We take the first such i by using next and if we don't find it we use the zero value (next(..., 0)). From the moment we found the i, we add to a the elements of b from index i.
There are a couple of easy optimizations that are possible.
You don't need to start at master[1], since the longest overlap starts at master[-len(addition)]
If you add a call to list.index you can avoid creating sub-lists and comparing lists for each index:
This approach keeps the code pretty understandable too (and easier to optimize by using cython or pypy):
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]

def merge(master, addition):
    first = addition[0]
    n = max(len(master) - len(addition), 1)  # (1)
    while True:
        try:
            n = master.index(first, n)  # (2)
        except ValueError:
            return master + addition
        if master[n:] == addition[:len(master)-n]:
            return master + addition[len(master)-n:]
        n += 1
This actually isn't too terribly difficult. After all, essentially all you're doing is checking what substring at the end of A lines up with what substring of B.
def merge(a, b):
    max_offset = len(b)  # can't overlap with greater size than len(b)
    for i in reversed(range(max_offset+1)):
        # checks for equivalence of decreasing sized slices
        if a[-i:] == b[:i]:
            break
    return a + b[i:]
We can test with your test data by doing:
test_data = [{'a': [1,3,9,8,3,4,5], 'b': [3,4,5,7,8], 'result': [1,3,9,8,3,4,5,7,8]},
             {'a': [9, 1, 1, 8, 7], 'b': [8, 6, 7], 'result': [9, 1, 1, 8, 7, 8, 6, 7]}]

all(merge(test['a'], test['b']) == test['result'] for test in test_data)
This runs through every possible combination of slices that could result in an overlap and remembers the result of the overlap if one is found. If nothing is found, it uses the last result of i which will always be 0. Either way, it returns all of a plus everything past b[i] (in the overlap case, that's the non overlapping portion. In the non-overlap case, it's everything)
Note that we can make a couple of optimizations in corner cases. For instance, the worst case here is that it runs through the whole list without finding any solution. You could add a quick check at the beginning that might short-circuit that worst case:
def merge(a, b):
    if a[-1] not in b:
        return a + b
    ...
In fact you could take that solution one step further and probably make your algorithm much faster
def merge(a, b):
    idx = 0
    while True:
        try:
            idx = b.index(a[-1], idx) + 1  # next occurrence of a[-1] in b, leftmost first
        except ValueError:  # a[-1] not in the rest of b
            return a + b
        if a[-idx:] == b[:idx]:
            return a + b[idx:]
However this might not find the longest overlap in cases like:
a = [1,2,3,4,1,2,3,4]
b = [3,4,1,2,3,4,5,6]
# result should be [1,2,3,4,1,2,3,4,5,6], but
# this algo produces [1,2,3,4,1,2,3,4,1,2,3,4,5,6]
You could fix that by using rindex instead of index to match the longest slice instead of the shortest, but I'm not sure what that does to your speed. It's certainly slower, but it might be inconsequential. You could also memoize the results and return the shortest result, which might be a better idea.
def merge(a, b):
    results = []
    idx = 0
    while True:
        try:
            idx = b.index(a[-1], idx) + 1  # next occurrence of a[-1] in b
        except ValueError:  # no more occurrences of a[-1] in b
            results.append(a + b)
            break
        if a[-idx:] == b[:idx]:
            results.append(a + b[idx:])
    return min(results, key=len)
Which should work since merging the longest overlap should produce the shortest result in all cases.
One trivial optimization is not iterating over the whole master list. I.e., replace while n < len(master) with for n in range(1, min(len(addition), len(master)) + 1) (and don't increment n in the loop). If there is no match, your current code will iterate over the entire master list, even if the slices being compared aren't even of the same length.
Another concern is that you're taking slices of master and addition in order to compare them, which creates two new lists every time, and isn't really necessary. This solution (inspired by Boyer-Moore) doesn't use slicing:
def merge(master, addition):
    overlap_lens = (i + 1 for i, e in enumerate(addition) if e == master[-1])
    for overlap_len in overlap_lens:
        for i in range(overlap_len):
            if master[-overlap_len + i] != addition[i]:
                break
        else:
            return master + addition[overlap_len:]
    return master + addition
The idea here is to generate all the indices of the last element of master in addition, and add 1 to each. Since a valid overlap must end with the last element of master, only those values are lengths of possible overlaps. Then we can check for each of them if the elements before it also line up.
The function currently assumes that master is longer than addition (you'll probably get an IndexError at master[-overlap_len + i] if it isn't). Add a condition to the overlap_lens generator if you can't guarantee it.
It's also non-greedy, i.e. it looks for the smallest non-empty overlap (merge([1, 2, 2], [2, 2, 3]) will return [1, 2, 2, 2, 3]). I think that's what you meant by "to merge at the last possible valid position". If you want a greedy version, reverse the overlap_lens generator.
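For reference, the greedy variant would only change one line (a sketch, materializing the generator so it can be reversed):
overlap_lens = reversed([i + 1 for i, e in enumerate(addition) if e == master[-1]])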
I don't offer optimizations but another way of looking at the problem. To me, this seems like a particular case of http://en.wikipedia.org/wiki/Longest_common_substring_problem where the substring would always be at the end of the list/string. The following algorithm is the dynamic programming version.
def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return x_longest - longest, x_longest

master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print(master[:s] + addition)
else:
    print(master + addition)

master = [9, 1, 1, 8, 7]
addition = [8, 6, 7]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print(master[:s] + addition)
else:
    print(master + addition)
[1, 3, 9, 8, 3, 4, 5, 7, 8]
[9, 1, 1, 8, 7, 8, 6, 7]
First of all and for clarity, you can replace your while loop with a for loop:
def merge(master, addition):
    for n in range(1, len(master)):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition
Then, you don't have to compare all possible slices, but only those for which master's slice starts with the first element of addition:
def merge(master, addition):
    indices = [len(master) - i for i, x in enumerate(master) if x == addition[0]]
    for n in indices:
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition
So instead of comparing slices like this:
1234123141234
            3579
           3579
          3579
         3579
        3579
       3579
      3579
     3579
    3579
   3579
  3579
 3579
3579
you are only doing these comparisons:
1234123141234
  |   |    |
  |   |    3579
  |   3579
  3579
How much this will speed up your program depends on the nature of your data: the fewer repeated elements your lists have, the better.
You could also generate a list of indices for addition so its own slices always end with master's last element, further restricting the number of comparisons.
Based on https://stackoverflow.com/a/30056066/541208:
def join_two_lists(a, b):
    index = 0
    for i in range(len(b), 0, -1):
        # if the first i items of b equal the last i items of a,
        # merge at that overlap
        if b[:i] == a[-i:]:
            index = i
            break
    return a + b[index:]
All the above solutions are similar in terms of using a for / while loop for the merging task. I first tried the solutions by @JuniorCompressor and @TankorSmash, but these solutions are way too slow for merging two large-scale lists (e.g. lists with millions of elements).
I found using pandas to concatenate lists with large size is much more time-efficient:
import pandas as pd, numpy as np

trainCompIdMaps = pd.DataFrame({"compoundId": np.random.permutation(range(800))[0:80], "partition": np.repeat("train", 80).tolist()})
testCompIdMaps = pd.DataFrame({"compoundId": np.random.permutation(range(800))[0:20], "partition": np.repeat("test", 20).tolist()})
# row-wise concatenation of the two DataFrames
compoundIdMaps = pd.concat([trainCompIdMaps, testCompIdMaps], axis=0)
mergedCompIds = np.array(compoundIdMaps["compoundId"])
What you need is a sequence alignment algorithm like Needleman-Wunsch.
Needleman-Wunsch is a global sequence alignment algorithm based on dynamic programming.
I found this nice implementation to merge arbitrary object sequences in python:
https://github.com/ajnisbet/paired
import paired
seq_1 = 'The quick brown fox jumped over the lazy dog'.split(' ')
seq_2 = 'The brown fox leaped over the lazy dog'.split(' ')
alignment = paired.align(seq_1, seq_2)
print(alignment)
# [(0, 0), (1, None), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 7)]
for i_1, i_2 in alignment:
print((seq_1[i_1] if i_1 is not None else '').ljust(15), end='')
print(seq_2[i_2] if i_2 is not None else '')
# The            The
# quick
# brown          brown
# fox            fox
# jumped         leaped
# over           over
# the            the
# lazy           lazy
# dog            dog

How do I subtract one list from another?

I want to take the difference between lists x and y:
>>> x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> y = [1, 3, 5, 7, 9]
>>> x - y
# should return [0, 2, 4, 6, 8]
Use a list comprehension to compute the difference while maintaining the original order from x:
[item for item in x if item not in y]
If you don't need list properties (e.g. ordering), use a set difference, as the other answers suggest:
list(set(x) - set(y))
To allow x - y infix syntax, override __sub__ on a class inheriting from list:
class MyList(list):
    def __init__(self, *args):
        super(MyList, self).__init__(args)

    def __sub__(self, other):
        return self.__class__(*[item for item in self if item not in other])
Usage:
x = MyList(1, 2, 3, 4)
y = MyList(2, 5, 2)
z = x - y
Use set difference
>>> z = list(set(x) - set(y))
>>> z
[0, 8, 2, 4, 6]
Or you might just have x and y be sets so you don't have to do any conversions.
If duplicates and ordering are a concern, this removes exactly one occurrence per element of b (note that it mutates b):
a = [1,2,3,3,3,3,4]
b = [1,3]
[i for i in a if i not in b or b.remove(i)]
# result: [2, 3, 3, 3, 4]
That is a "set subtraction" operation. Use the set data structure for that.
In Python 2.7:
x = {1,2,3,4,5,6,7,8,9,0}
y = {1,3,5,7,9}
print x - y
Output:
>>> print x - y
set([0, 8, 2, 4, 6])
For many use cases, the answer you want is:
ys = set(y)
[item for item in x if item not in ys]
This is a hybrid between aaronasterling's answer and quantumSoup's answer.
aaronasterling's version does len(y) item comparisons for each element in x, so it takes quadratic time. quantumSoup's version uses sets, so it does a single constant-time set lookup for each element in x—but, because it converts both x and y into sets, it loses the order of your elements.
By converting only y into a set, and iterating x in order, you get the best of both worlds—linear time, and order preservation.*
However, this still has a problem from quantumSoup's version: It requires your elements to be hashable. That's pretty much built into the nature of sets.** If you're trying to, e.g., subtract a list of dicts from another list of dicts, but the list to subtract is large, what do you do?
If you can decorate your values in some way that they're hashable, that solves the problem. For example, with a flat dictionary whose values are themselves hashable:
ys = {tuple(item.items()) for item in y}
[item for item in x if tuple(item.items()) not in ys]
If your types are a bit more complicated (e.g., often you're dealing with JSON-compatible values, which are hashable, or lists or dicts whose values are recursively the same type), you can still use this solution. But some types just can't be converted into anything hashable.
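For example, a recursive helper for JSON-like values (a sketch; freeze is a hypothetical name, and it assumes values are built only from dicts, lists, and hashable scalars with sortable keys):
def freeze(obj):
    # recursively convert dicts and lists into hashable tuples
    if isinstance(obj, dict):
        return tuple(sorted((k, freeze(v)) for k, v in obj.items()))
    if isinstance(obj, list):
        return tuple(freeze(v) for v in obj)
    return obj

ys = {freeze(item) for item in y}
result = [item for item in x if freeze(item) not in ys]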
If your items aren't, and can't be made, hashable, but they are comparable, you can at least get log-linear time (O(N*log M), which is a lot better than the O(N*M) time of the list solution, but not as good as the O(N+M) time of the set solution) by sorting and using bisect:
import bisect
ys = sorted(y)
def bisect_contains(seq, item):
    index = bisect.bisect_left(seq, item)
    return index < len(seq) and seq[index] == item
[item for item in x if not bisect_contains(ys, item)]
If your items are neither hashable nor comparable, then you're stuck with the quadratic solution.
* Note that you could also do this by using a pair of OrderedSet objects, for which you can find recipes and third-party modules. But I think this is simpler.
** The reason set lookups are constant time is that all it has to do is hash the value and see if there's an entry for that hash. If it can't hash the value, this won't work.
If the lists allow duplicate elements, you can use Counter from collections:
from collections import Counter
result = list((Counter(x)-Counter(y)).elements())
If you need to preserve the order of elements from x:
result = [ v for c in [Counter(y)] for v in x if not c[v] or c.subtract([v]) ]
The other solutions have one of a few problems:
They don't preserve order, or
They don't remove a precise count of elements, e.g. for x = [1, 2, 2, 2] and y = [2, 2] they convert y to a set, and either remove all matching elements (leaving [1] only) or remove one of each unique element (leaving [1, 2, 2]), when the proper behavior would be to remove 2 twice, leaving [1, 2], or
They do O(m * n) work, where an optimal solution can do O(m + n) work
Alain was on the right track with Counter to solve #2 and #3, but that solution will lose ordering. The solution that preserves order (removing the first n copies of each value for n repetitions in the list of values to remove) is:
from collections import Counter
x = [1,2,3,4,3,2,1]
y = [1,2,2]
from collections import Counter

x = [1,2,3,4,3,2,1]
y = [1,2,2]
remaining = Counter(y)

out = []
for val in x:
    if remaining[val]:
        remaining[val] -= 1
    else:
        out.append(val)
# out is now [3, 4, 3, 1], having removed the first 1 and both 2s.
To make it remove the last copies of each element, just change the for loop to for val in reversed(x): and add out.reverse() immediately after exiting the for loop.
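For instance (a sketch based on the loop above):
from collections import Counter

x = [1,2,3,4,3,2,1]
y = [1,2,2]
remaining = Counter(y)
out = []
for val in reversed(x):
    if remaining[val]:
        remaining[val] -= 1
    else:
        out.append(val)
out.reverse()
# out is now [1, 3, 4, 3], having removed the last 1 and the last two 2s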
Constructing the Counter is O(n) in terms of y's length, iterating x is O(n) in terms of x's length, and Counter membership testing and mutation are O(1), while list.append is amortized O(1) (a given append can be O(n), but for many appends, the overall big-O averages O(1) since fewer and fewer of them require a reallocation), so the overall work done is O(m + n).
You can also determine whether any elements in y were not removed from x by testing:
remaining = +remaining  # removes all keys with zero counts from the Counter
if remaining:
    # remaining contains the elements with non-zero counts
Looking up values in a set is faster than looking them up in a list (build the set once, outside the comprehension, so it isn't rebuilt for every item):
ys = set(y)
[item for item in x if item not in ys]
I believe this will scale better than:
[item for item in x if item not in y]
Both preserve the order of the lists.
We can use set methods as well to find the difference between two lists:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
y = [1, 3, 5, 7, 9]
list(set(x).difference(y))
[0, 2, 4, 6, 8]
Try this.
def subtract_lists(a, b):
    """ Subtracts two lists. Throws ValueError if b contains items not in a """
    # Terminate if b is empty, otherwise remove b[0] from a and recurse
    return a if len(b) == 0 else [a[:i] + subtract_lists(a[i+1:], b[1:])
                                  for i in [a.index(b[0])]][0]

>>> x = [1,2,3,4,5,6,7,8,9,0]
>>> y = [1,3,5,7,9]
>>> subtract_lists(x,y)
[2, 4, 6, 8, 0]
>>> x = [1,2,3,4,5,6,7,8,9,0,9]
>>> subtract_lists(x,y)
[2, 4, 6, 8, 0, 9]    # 9 is only deleted once
The answer provided by @aaronasterling looks good; however, it is not compatible with the default interface of list: x = MyList(1, 2, 3, 4) vs x = MyList([1, 2, 3, 4]). Thus, the code below can be used as a more list-friendly alternative:
class MyList(list):
    def __init__(self, *args):
        super(MyList, self).__init__(*args)

    def __sub__(self, other):
        return self.__class__([item for item in self if item not in other])
Example:
x = MyList([1, 2, 3, 4])
y = MyList([2, 5, 2])
z = x - y
from collections import Counter
y = Counter(y)
x = Counter(x)
print(list(x-y))
Let:
>>> xs = [1, 2, 3, 4, 3, 2, 1]
>>> ys = [1, 3, 3]
Keep each unique item only once   xs - ys == {2, 4}
Take the set difference:
>>> set(xs) - set(ys)
{2, 4}
Remove all occurrences   xs - ys == [2, 4, 2]
>>> [x for x in xs if x not in ys]
[2, 4, 2]
If ys is large, convert only1 ys into a set for better performance:
>>> ys_set = set(ys)
>>> [x for x in xs if x not in ys_set]
[2, 4, 2]
Only remove same number of occurrences   xs - ys == [2, 4, 2, 1]
from collections import Counter

def diff(xs, ys):
    counter = Counter(ys)
    for x in xs:
        if counter[x] > 0:
            counter[x] -= 1
            continue
        yield x
>>> list(diff(xs, ys))
[2, 4, 2, 1]
1 Converting xs to set and taking the set difference is unnecessary (and slower, as well as order-destroying) since we only need to iterate once over xs.
This example subtracts two lists:
# List of pairs of points
list = []
list.append([(602, 336), (624, 365)])
list.append([(635, 336), (654, 365)])
list.append([(642, 342), (648, 358)])
list.append([(644, 344), (646, 356)])
list.append([(653, 337), (671, 365)])
list.append([(728, 13), (739, 32)])
list.append([(756, 59), (767, 79)])
itens_to_remove = []
itens_to_remove.append([(642, 342), (648, 358)])
itens_to_remove.append([(644, 344), (646, 356)])
print("Initial List Size: ", len(list))
for a in itens_to_remove:
    for b in list:
        if a == b:
            list.remove(b)
print("Final List Size: ", len(list))
list1 = ['a', 'c', 'a', 'b', 'k']
list2 = ['a', 'a', 'a', 'a', 'b', 'c', 'c', 'd', 'e', 'f']
for e in list1:
    try:
        list2.remove(e)
    except ValueError:
        print(f'{e} not in list')
list2
# ['a', 'a', 'c', 'd', 'e', 'f']
This will change list2. If you want to preserve list2, copy it first and use the copy in this code.
def listsubtraction(parent, child):
    answer = []
    for element in parent:
        if element not in child:
            answer.append(element)
    return answer
I think this should work. I'm a beginner, so pardon any mistakes.

Efficient method to calculate the rank vector of a list in Python

I'm looking for an efficient way to calculate the rank vector of a list in Python, similar to R's rank function. In a simple list with no ties between the elements, element i of the rank vector of a list l should be x if and only if l[i] is the x-th element in the sorted list. This is simple so far, the following code snippet does the trick:
def rank_simple(vector):
    return sorted(range(len(vector)), key=vector.__getitem__)
Things get complicated, however, if the original list has ties (i.e. multiple elements with the same value). In that case, all the elements having the same value should have the same rank, which is the average of their ranks obtained using the naive method above. So, for instance, if I have [1, 2, 3, 3, 3, 4, 5], the naive ranking gives me [0, 1, 2, 3, 4, 5, 6], but what I would like to have is [0, 1, 3, 3, 3, 5, 6]. What would be the most efficient way to do this in Python?
Footnote: I don't know if NumPy already has a method to achieve this or not; if it does, please let me know, but I would be interested in a pure Python solution anyway as I'm developing a tool which should work without NumPy as well.
Using scipy, the function you are looking for is scipy.stats.rankdata:
In [13]: import scipy.stats as ss
In [19]: ss.rankdata([3, 1, 4, 15, 92])
Out[19]: array([ 2., 1., 3., 4., 5.])
In [20]: ss.rankdata([1, 2, 3, 3, 3, 4, 5])
Out[20]: array([ 1., 2., 4., 4., 4., 6., 7.])
The ranks start at 1, rather than 0 (as in your example), but then again, that's the way R's rank function works as well.
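If you want average ranks with NumPy alone (per the question's footnote), here is one possible sketch (average_rank is a name chosen here; the per-value loop favors simplicity over speed):
import numpy as np

def average_rank(a):
    a = np.asarray(a)
    order = np.argsort(a, kind='stable')
    ranks = np.empty(len(a), dtype=float)
    ranks[order] = np.arange(1, len(a) + 1)
    # average the ranks within each group of equal values
    for val in np.unique(a):
        mask = a == val
        ranks[mask] = ranks[mask].mean()
    return ranks

print(average_rank([1, 2, 3, 3, 3, 4, 5]))  # [1. 2. 4. 4. 4. 6. 7.]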
Here is a pure-python equivalent of scipy's rankdata function:
def rank_simple(vector):
    return sorted(range(len(vector)), key=vector.__getitem__)

def rankdata(a):
    n = len(a)
    ivec = rank_simple(a)
    svec = [a[rank] for rank in ivec]
    sumranks = 0
    dupcount = 0
    newarray = [0]*n
    for i in range(n):
        sumranks += i
        dupcount += 1
        if i == n-1 or svec[i] != svec[i+1]:
            averank = sumranks / float(dupcount) + 1
            for j in range(i-dupcount+1, i+1):
                newarray[ivec[j]] = averank
            sumranks = 0
            dupcount = 0
    return newarray
print(rankdata([3, 1, 4, 15, 92]))
# [2.0, 1.0, 3.0, 4.0, 5.0]
print(rankdata([1, 2, 3, 3, 3, 4, 5]))
# [1.0, 2.0, 4.0, 4.0, 4.0, 6.0, 7.0]
[sorted(l).index(x) for x in l]
sorted(l) will give the sorted version
index(x) will give the index in the sorted array
for example :
l = [-1, 3, 2, 0,0]
>>> [sorted(l).index(x) for x in l]
[0, 4, 3, 1, 1]
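Note that sorted(l) is re-evaluated for every element; hoisting it out of the comprehension avoids the repeated sorting (the index() lookups still make this quadratic overall):
s = sorted(l)
[s.index(x) for x in l]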
This is one of the functions that I wrote to calculate rank.
def calculate_rank(vector):
    a = {}
    rank = 1
    for num in sorted(vector):
        if num not in a:
            a[num] = rank
            rank = rank + 1
    return [a[i] for i in vector]
Input:
calculate_rank([1,3,4,8,7,5,4,6])
Output:
[1, 2, 3, 7, 6, 4, 3, 5]
(Note this is a dense ranking: tied values share a rank and no ranks are skipped.)
This doesn't give the exact result you specify, but perhaps it would be useful anyways. The following snippet gives the first index for each element, yielding a final rank vector of [0, 1, 2, 2, 2, 5, 6]
def rank_index(vector):
    return [sorted(vector).index(x) for x in vector]
Your own testing would have to prove the efficiency of this.
Here is a small variation of unutbu's code, including an optional 'method' argument for the type of value of tied ranks.
def rank_simple(vector):
    return sorted(range(len(vector)), key=vector.__getitem__)

def rankdata(a, method='average'):
    n = len(a)
    ivec = rank_simple(a)
    svec = [a[rank] for rank in ivec]
    sumranks = 0
    dupcount = 0
    newarray = [0]*n
    for i in range(n):
        sumranks += i
        dupcount += 1
        if i == n-1 or svec[i] != svec[i+1]:
            for j in range(i-dupcount+1, i+1):
                if method == 'average':
                    averank = sumranks / float(dupcount) + 1
                    newarray[ivec[j]] = averank
                elif method == 'max':
                    newarray[ivec[j]] = i+1
                elif method == 'min':
                    newarray[ivec[j]] = i+1 - dupcount+1
                else:
                    raise NameError('Unsupported method')
            sumranks = 0
            dupcount = 0
    return newarray
There is a really nice module called Ranking http://pythonhosted.org/ranking/ with an easy to follow instruction page. To download, simply use easy_install ranking
I really don't get why all the existing solutions are so complex. This can be done just like this:
[index for element, index in sorted(zip(sequence, range(len(sequence))))]
You build tuples which contain the elements and a running index. Then you sort the whole thing: tuples sort by their first element and, during ties, by their second element. This way one has a sorted list of these tuples and just needs to pick out the indices from that afterwards. Also this removes the need to look up elements in the sequence afterwards, which likely makes it an O(N²) operation, whereas this is O(N log(N)).
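Strictly speaking, this one-liner yields the sorted order of the indices (an argsort, the same as the question's rank_simple), not the rank vector itself; inverting the permutation gives the ranks (a sketch, with ties broken by position):
seq = [5, 1, 3, 8, 4]
order = [index for element, index in sorted(zip(seq, range(len(seq))))]
ranks = [0] * len(seq)
for r, i in enumerate(order):
    ranks[i] = r
print(ranks)  # [3, 0, 1, 4, 2]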
So.. this is 2019, and I have no idea why nobody suggested the following:
# Python-only
def rank_list(x, break_ties=False):
    n = len(x)
    t = list(range(n))
    s = sorted(t, key=x.__getitem__)
    if not break_ties:
        for k in range(n-1):
            t[k+1] = t[k] + (x[s[k+1]] != x[s[k]])
    r = s.copy()
    for i, k in enumerate(s):
        r[k] = t[i]
    return r

# Using NumPy, see also: np.argsort (assumes x is a NumPy array)
import numpy as np

def rank_vec(x, break_ties=False):
    n = len(x)
    t = np.arange(n)
    s = sorted(t, key=x.__getitem__)
    if not break_ties:
        t[1:] = np.cumsum(x[s[1:]] != x[s[:-1]])
    r = t.copy()
    np.put(r, s, t)
    return r
This approach has linear runtime complexity after the initial sort, it only stores 2 arrays of indices, and does not require values to be hashable (only pairwise comparison needed).
AFAICT, this is better than other approaches suggested so far:
@unutbu's approach is essentially similar, but (I would argue) too complicated for what the OP asked;
All suggestions using .index() are terrible, with a runtime complexity of N^2;
@Yuvraj Singh improves slightly upon the .index() search using a dictionary, however with search and insert operations at each iteration, this is still highly inefficient both in time (NlogN) and space, and it also requires the values to be hashable.
A compact one-liner to find the rank of an array (note it is O(n^2), since sorted() and .index() run for every element):
a = [10.0, 9.8, 8.0, 7.8, 7.7, 7.0, 6.0, 5.0, 4.0, 2.0]
rank = lambda arr: list(map(lambda i: sorted(arr).index(i)+1, arr))
rank(a)
These answers gave me a lot of inspiration, especially unutbu's code. However, my needs were simpler, so I changed the code a little. Hoping this helps anyone with the same needs.
Here is the class to record the players' scores and ranks.
class Player():
    def __init__(self, s, r):
        self.score = s
        self.rank = r
Some data.
l = [Player(90,0),Player(95,0),Player(85,0), Player(90,0),Player(95,0)]
Here is the code for calculation:
l.sort(key=lambda x: x.score, reverse=True)
l[0].rank = 1
dupcount = 0
prev = l[0]
for e in l[1:]:
    if e.score == prev.score:
        e.rank = prev.rank
        dupcount += 1
    else:
        e.rank = prev.rank + dupcount + 1
        dupcount = 0
    prev = e
import numpy as np
from collections import defaultdict

def rankVec(arg):
    p = np.unique(arg)  # unique values, sorted ascending
    k = (-p).argsort().argsort()  # rank of each unique value, 0 = largest
    dd = defaultdict(int)
    for i in range(np.shape(p)[0]):
        dd[p[i]] = k[i]
    return np.array([dd[x] for x in arg])
Measured runtime is about 46.2 µs.
This works for the Spearman correlation coefficient:
def get_rank(X):
    x_rank = dict((x, i+1) for i, x in enumerate(sorted(set(X))))
    return [x_rank[x] for x in X]
The rank function could be implemented in O(n log n) time and O(n) additional space using the following approach.
import bisect

def rank_list(lst: list[int]) -> list[int]:
    sorted_vals = sorted(set(lst))
    return [bisect.bisect_left(sorted_vals, val) for val in lst]
I use the bisect library here, but for pure dependency-free code it is enough to implement a binary search over the sorted array of unique values.
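For completeness, a dependency-free binary search that could stand in for bisect.bisect_left here (a sketch; index_of is a hypothetical name, and it assumes val is present in the array):
def index_of(sorted_unique, val):
    # classic lower-bound binary search over sorted unique values
    lo, hi = 0, len(sorted_unique) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_unique[mid] < val:
            lo = mid + 1
        else:
            hi = mid
    return lo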
