Calculating the similarity of two lists - python

I have two lists, e.g.:
a = [1,8,3,9,4,9,3,8,1,2,3]
and
b = [1,8,1,3,9,4,9,3,8,1,2,3]
Both contain ints. There is no meaning behind the ints (e.g. 1 is not 'closer' to 3 than it is to 8).
I'm trying to devise an algorithm to calculate the similarity between two ORDERED lists. Ordered is the keyword here (so I can't just take the set of both lists and calculate their set_difference percentage). Sometimes numbers do repeat (for example 3, 8, and 9 above), and I cannot ignore the repeats.
In the example above, the function I would call would tell me that a and b are, say, ~90% similar. How can I do that? Edit distance is something that came to mind; I know how to use it with strings, but I'm not sure how to use it with a list of ints. Thanks!

You can use the difflib module:
ratio()
    Return a measure of the sequences' similarity as a float in the range [0, 1].
Which gives:
>>> import difflib
>>> s1 = [1,8,3,9,4,9,3,8,1,2,3]
>>> s2 = [1,8,1,3,9,4,9,3,8,1,2,3]
>>> sm = difflib.SequenceMatcher(None, s1, s2)
>>> sm.ratio()
0.9565217391304348
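For reference, the difflib documentation defines ratio() as 2.0*M / T, where T is the total number of elements in both sequences and M is the number of matches. Here s2 is just s1 with one extra element inserted, so M = 11 and T = 11 + 12 = 23, which reproduces the value above:
>>> 2.0 * 11 / 23
0.9565217391304348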

It sounds like edit (or Levenshtein) distance is precisely the right tool for the job.
Here is one Python implementation that can be used on lists of integers: http://hetland.org/coding/python/levenshtein.py
Using that code, levenshtein([1,8,3,9,4,9,3,8,1,2,3], [1,8,1,3,9,4,9,3,8,1,2,3]) returns 1, which is the edit distance.
Given the edit distance and the lengths of the two arrays, computing a "percentage similarity" metric should be pretty trivial.
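In case that link rots, here is a minimal sketch of a Levenshtein distance that works on arbitrary sequences (lists of ints included), plus one common way, among several, to normalize it into a percentage:
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, kept to two rows of the
    # matrix; it only compares elements for equality, so lists work fine.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # One convention: 1 - distance / length of the longer sequence.
    return 1 - levenshtein(a, b) / max(len(a), len(b))
For the question's lists, levenshtein(a, b) is 1 (one insertion), so similarity(a, b) is 1 - 1/12 ≈ 0.92.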

One way to tackle this is to use a histogram. As an example (demonstrated with numpy):
In []: from numpy import array, histogram, arange, dot
In []: from numpy.linalg import norm
In []: a = array([1,8,3,9,4,9,3,8,1,2,3])
In []: b = array([1,8,1,3,9,4,9,3,8,1,2,3])
In []: a_c, _ = histogram(a, arange(9) + 1)
In []: a_c
Out[]: array([2, 1, 3, 1, 0, 0, 0, 4])
In []: b_c, _ = histogram(b, arange(9) + 1)
In []: b_c
Out[]: array([3, 1, 3, 1, 0, 0, 0, 4])
In []: (a_c - b_c).sum()
Out[]: -1
There now exists a plethora of ways to harness a_c and b_c, of which the (seemingly) simplest similarity measure is:
In []: 1 - abs(-1 / 9.)
Out[]: 0.8888888888888888
Followed by:
In []: norm(a_c) / norm(b_c)
Out[]: 0.92796072713833688
and:
In []: a_n = (a_c / norm(a_c))[:, None]
In []: 1 - norm(b_c - dot(dot(a_n, a_n.T), b_c)) / norm(b_c)
Out[]: 0.84445724579043624
Thus, you need to be much more specific to find the similarity measure most suitable for your purposes.
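Another standard option in the same spirit (not from the original answer) is the cosine similarity between the two count vectors, which is scale-invariant:
import numpy as np

def cosine_similarity(a, b, bins):
    # Cosine of the angle between the two histogram (count) vectors;
    # 1.0 means identical distributions up to a scale factor.
    a_c, _ = np.histogram(a, bins)
    b_c, _ = np.histogram(b, bins)
    return np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))
For the histograms above this gives about 0.988. Note that any histogram-based measure ignores order, which the question explicitly cares about, so it can only complement an edit-distance-style approach.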

Just use the same algorithm for calculating edit distance on strings if the values don't have any particular meaning.

I implemented something for a similar task a long time ago; now I have only a blog entry for it. The idea was simple: compute the pdf (normalized frequency distribution) of both sequences, then measure the area common to the graphical representations of the two pdfs.
Sorry for the broken images at the link; the external server I used back then is dead now.
Right now, for your problem the code translates to:
def overlap(pdf1, pdf2):
    s = 0
    for k in pdf1:
        if k in pdf2:
            s += min(pdf1[k], pdf2[k])
    return s

def pdf(l):
    d = {}
    for i in l:
        d[i] = d.get(i, 0) + 1
    for k in d:
        d[k] /= float(len(l))
    return d

def solve():
    a = [1, 8, 3, 9, 4, 9, 3, 8, 1, 2, 3]
    b = [1, 8, 1, 3, 9, 4, 9, 3, 8, 1, 2, 3]
    pdf_a = pdf(a)
    pdf_b = pdf(b)
    print(pdf_a)
    print(pdf_b)
    print(overlap(pdf_a, pdf_b))
    print(overlap(pdf_b, pdf_a))

if __name__ == '__main__':
    solve()
Note: as originally posted, pdf normalized each count by the sum of the values (s += i) rather than by the number of elements, which is why it gave the unexpectedly low answer 0.212292609351. With the count-based normalization above, both overlap calls return about 0.93, a far more plausible similarity for these two lists.

The solution proposed by @kraymer does not work in the case of
s1 = [1,2,3,4,5,6,7,8,9,10]
s2 = [2,1,3,4,5,6,7,8,9,9]
since it returns 0.8 even though there are 3 differing elements, not 2.
A workaround could be:
def find_percentage_agreement(s1, s2):
    assert len(s1) == len(s2), "Lists must have the same shape"
    nb_agreements = 0  # initialize counter to 0
    for idx, value in enumerate(s1):
        if s2[idx] == value:
            nb_agreements += 1
    percentage_agreement = nb_agreements / len(s1)
    return percentage_agreement
Which returns the expected result:
>>> s1=[1,2,3,4,5,6,7,8,9,10]
>>> s2=[2,1,3,4,5,6,7,8,9,9]
>>> find_percentage_agreement(s1, s2)
0.7

Unless I'm missing the point:
def similar(x, y):
    si = 0
    for a, b in zip(x, y):  # note: zip stops at the shorter list
        if a == b:
            si += 1
    return (si / len(x)) * 100

if __name__ == '__main__':
    a = [1,8,3,9,4,9,3,8,1,2,3]
    b = [1,8,1,3,9,4,9,3,8,1,2,3]
    result = similar(a, b)
    print("%s%% Similar!" % result)

Related

Identify least common number in a list python

With a list of numbers in which each number can appear multiple times, I need to find the least common number in the list. If different numbers share the same lowest frequency, the result is the one occurring last in the list. For example, the least common integer in [1, 7, 2, 1, 2] is 7. And the list needs to stay unsorted.
I have the following but it always sets the last entry to leastCommon
def least_common_in_unsorted(integers):
    leastCommon = integers[0]
    check = 1
    appears = 1
    for currentPosition in integers:
        if currentPosition == leastCommon:
            appears + 1
        elif currentPosition != leastCommon:
            if check <= appears:
                check = 1
                appears = 1
                leastCommon = currentPosition
    return leastCommon
Any help would be greatly appreciated
This is the simplest way that comes to my mind right now:
a = [1, 7, 2, 1, 2]
c, least = len(a), 0
for x in a:
    if a.count(x) <= c:
        c = a.count(x)
        least = x
least  # 7
and when two items are tied for least common, it will return the one whose occurrence comes last:
a = [1, 7, 2, 1, 2, 7] # least = 7
Using the Counter:
from collections import Counter
lst = [1, 7, 2, 1, 2]
cnt = Counter(lst)
mincnt = min(cnt.values())
minval = next(n for n in reversed(lst) if cnt[n] == mincnt)
print(minval) #7
This answer builds on @offtoffel's to handle multiple items with the same number of occurrences while choosing the last-occurring one:
def least_common(lst):
    return min(lst, key=lambda x: (lst.count(x), lst[::-1].index(x)))
print(least_common([1,2,1,2]))
# 2
print(least_common([1,2,7,1,2]))
# 7
Edit: I noticed that there's an even simpler solution that is efficient and effective (just reverse the list at the start, and min will keep the last value that has the minimum count):
def least_common(lst):
    lst = lst[::-1]
    return min(lst, key=lst.count)
print(least_common([1,2,1,2]))
# 2
print(least_common([1,2,7,1,2]))
# 7
Short but inefficient:
>>> min(a[::-1], key=a.count)
7
Efficient version using collections.Counter:
>>> from collections import Counter
>>> min(a[::-1], key=Counter(a).get)
7
def least_common(lst):
    return min(set(lst), key=lst.count)
Edit: sorry, this does not always take the last list item with the least occurrences, as the question demands... it works on the example, but not for every instance.

How to get values in list at incremental indexes in Python?

I'm looking at getting values in a list with an increment.
l = [0,1,2,3,4,5,6,7]
and I want something like:
[0,4,6,7]
At the moment I am using l[0::2] but I would like sampling to be sparse at the beginning and increase towards the end of the list.
The reason I want this is because the list represents the points along a line from the center of a circle to a point on its circumference. At the moment I iterate every 10 points along the lines and draw a circle with a small radius on each. Therefore, my circles close to the center tend to overlap and I have gaps as I get close to the circle edge. I hope this provides a bit of context.
Thank you!
This can be more complicated than it sounds... You need a list of indices starting at zero and ending at the final element position in your list, presumably with no duplication (i.e. you don't want to get the same points twice). A generic way to do this would be to define the number of points you want first and then use a generator (scaled_series) that produces the required number of indices based on a function. We need a second generator (unique_ints) to ensure we get integer indices and no duplication.
def scaled_series(length, end, func):
    """ Generate a scaled series based on y = func(i), for an increasing
    function func, starting at 0, of the specified length, and ending at end
    """
    scale = float(end) / (func(float(length)) - func(1.0))
    intercept = -scale * func(1.0)
    print('scale', scale, 'intercept', intercept)
    for i in range(1, length + 1):
        yield scale * func(float(i)) + intercept

def unique_ints(iterable):
    last_n = None
    for n in iterable:
        if last_n is None or round(n) != round(last_n):
            yield int(round(n))
        last_n = n

L = [0, 1, 2, 3, 4, 5, 6, 7]
print([L[i] for i in unique_ints(scaled_series(4, 7, lambda x: 1 - 1 / (2 ** x)))])
In this case, the function is 1 - 1/2^x, which gives the series you want: [0, 4, 6, 7]. You can play with the length (4) and the function to get the kind of spacing between the circles you are looking for.
I am not sure what exact algorithm you want to use, but if it is non-constant, as your example appears to be, then you should consider creating a generator function to yield values:
https://wiki.python.org/moin/Generators
Depending on what your desire here is, you may want to consider a built in interpolator like scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html#scipy.interpolate.interp1d
Basically, given your question, you can't do it with the basic slice operator. Without more information this is the best answer I can give you :-)
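As a concrete illustration of the generator idea (my own sketch, not from the pages linked above): map evenly spaced samples in [0, 1] through a concave function such as the square root, so the gaps are wide near the start and shrink toward the end. For this input it reproduces the desired [0, 4, 6, 7]:
def sparse_indices(n, count):
    # Evenly spaced t in [0, 1], warped by sqrt: big steps first, small steps last.
    return sorted({round((i / (count - 1)) ** 0.5 * (n - 1)) for i in range(count)})

l = [0, 1, 2, 3, 4, 5, 6, 7]
print([l[i] for i in sparse_indices(len(l), 4)])  # [0, 4, 6, 7]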
Use the slice function to create a range of indices. You can then extend your sliced list with other slices.
k = [0,1,2,3,4,5,6,7]
r = slice(0, len(k)//2, 4)
t = slice(r.stop, None, 1)
j = k[r]
j.extend(k[t])
print(j)  # outputs: [0, 4, 5, 6, 7]
What I would do is just use list comprehension to retrieve the values. It is not possible to do it just by indexing. This is what I came up with:
l = [0, 1, 2, 3, 4, 5, 6, 7]
m = [l[0]] + [l[1 + sum(range(3, s - 1, -1))] for s in range(3, 0, -1)]
and here is a breakdown of the code into loops:
# Start the list with the first value of l (the loop does not include it)
m = [l[0]]
# Descend from 3 to 1 ([3, 2, 1])
for s in range(3, 0, -1):
    # append l at index 1 + sum of [3], then [3, 2], then [3, 2, 1]
    m.append(l[1 + sum(range(3, s - 1, -1))])
Both will give you the same answer:
>>> m
[0, 4, 6, 7]
I made a graphic that I hope helps to understand the process (image not reproduced here).

Pythonic way to merge two overlapping lists, preserving order

Alright, so I have two lists, as such:
They can and will have overlapping items, for example, [1, 2, 3, 4, 5], [4, 5, 6, 7].
There will not be additional items in the overlap, for example, this will not happen: [1, 2, 3, 4, 5], [3.5, 4, 5, 6, 7]
The lists are not necessarily ordered nor unique. [9, 1, 1, 8, 7], [8, 6, 7].
I want to merge the lists such that existing order is preserved, and to merge at the last possible valid position, and such that no data is lost. Additionally, the first list might be huge. My current working code is as such:
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]

def merge(master, addition):
    n = 1
    while n < len(master):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
        n += 1
    return master + addition
What I would like to know is - is there a more efficient way of doing this? It works, but I'm slightly leery of this, because it can run into large runtimes in my application - I'm merging large lists of strings.
EDIT: I'd expect the merge of [1,3,9,8,3,4,5] and [3,4,5,7,8] to be [1,3,9,8,3,4,5,7,8]; the overlapping portion is the 3,4,5 at the join.
[9, 1, 1, 8, 7], [8, 6, 7] should merge to [9, 1, 1, 8, 7, 8, 6, 7]
You can try the following:
>>> a = [1, 3, 9, 8, 3, 4, 5]
>>> b = [3, 4, 5, 7, 8]
>>> matches = (i for i in range(len(b), 0, -1) if b[:i] == a[-i:])
>>> i = next(matches, 0)
>>> a + b[i:]
[1, 3, 9, 8, 3, 4, 5, 7, 8]
The idea is that we check the first i elements of b (b[:i]) against the last i elements of a (a[-i:]). We take i in decreasing order, from the length of b down to 1 (range(len(b), 0, -1)), because we want to match as much as possible. We take the first such i using next, and if there is none we fall back to zero (next(..., 0)). Once we have found i, we append to a the elements of b from index i onward.
There are a couple of easy optimizations that are possible.
You don't need to start at master[1], since the longest overlap starts at master[-len(addition)]
If you add a call to list.index you can avoid creating sub-lists and comparing lists for each index:
This approach keeps the code pretty understandable too (and easier to optimize by using cython or pypy):
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]

def merge(master, addition):
    first = addition[0]
    n = max(len(master) - len(addition), 0)  # (1)
    while True:
        try:
            n = master.index(first, n)  # (2)
        except ValueError:
            return master + addition
        overlap = len(master) - n
        if master[n:] == addition[:overlap]:
            return master + addition[overlap:]
        n += 1
This actually isn't too terribly difficult. After all, essentially all you're doing is checking what substring at the end of A lines up with what substring of B.
def merge(a, b):
    max_offset = len(b)  # can't overlap with greater size than len(b)
    for i in reversed(range(max_offset + 1)):
        # checks for equivalence of decreasing sized slices
        if a[-i:] == b[:i]:
            break
    return a + b[i:]
We can test with your test data by doing:
test_data = [{'a': [1,3,9,8,3,4,5], 'b': [3,4,5,7,8], 'result': [1,3,9,8,3,4,5,7,8]},
             {'a': [9, 1, 1, 8, 7], 'b': [8, 6, 7], 'result': [9, 1, 1, 8, 7, 8, 6, 7]}]

all(merge(test['a'], test['b']) == test['result'] for test in test_data)
This runs through every slice length that could produce an overlap, longest first, and stops at the first match. If nothing matches, the loop leaves i at 0. Either way, it returns all of a plus everything of b from index i onward (in the overlap case that's the non-overlapping portion; in the non-overlap case it's all of b).
Note that a couple of optimizations are possible in corner cases. For instance, the worst case is running through the whole loop without finding any overlap; a quick check at the beginning can short-circuit it:
def merge(a, b):
    if a[-1] not in b:
        return a + b
    ...
In fact, you could take that solution one step further and probably make your algorithm much faster:
def merge(a, b):
    start = 0
    while True:
        try:
            idx = b.index(a[-1], start) + 1  # leftmost remaining occurrence of a[-1] in b
        except ValueError:  # a[-1] not in the rest of b
            return a + b
        if a[-idx:] == b[:idx]:
            return a + b[idx:]
        start = idx
However this might not find the longest overlap in cases like:
a = [1,2,3,4,1,2,3,4]
b = [3,4,1,2,3,4,5,6]
# result should be [1,2,3,4,1,2,3,4,5,6], but
# this algo produces [1,2,3,4,1,2,3,4,1,2,3,4,5,6]
You could fix that by matching the longest slice instead of the shortest, i.e. scanning for the last occurrence of a[-1] in b rather than the first (lists have no rindex, but you can search the reversed list), though I'm not sure what that does to your speed. It's certainly slower, but it might be inconsequential. You could also memoize the results and return the shortest result, which might be a better idea.
def merge(a, b):
    results = []
    start = 0
    while True:
        try:
            idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
        except ValueError:  # a[-1] not in the rest of b
            results.append(a + b)
            break
        if a[-idx:] == b[:idx]:
            results.append(a + b[idx:])
        start = idx
    return min(results, key=len)
Which should work since merging the longest overlap should produce the shortest result in all cases.
One trivial optimization is not iterating over the whole master list. I.e., replace while n < len(master) with for n in range(min(len(addition), len(master))) (and don't increment n in the loop). If there is no match, your current code will iterate over the entire master list, even if the slices being compared aren't even of the same length.
Another concern is that you're taking slices of master and addition in order to compare them, which creates two new lists every time, and isn't really necessary. This solution (inspired by Boyer-Moore) doesn't use slicing:
def merge(master, addition):
    overlap_lens = (i + 1 for i, e in enumerate(addition) if e == master[-1])
    for overlap_len in overlap_lens:
        for i in range(overlap_len):
            if master[-overlap_len + i] != addition[i]:
                break
        else:
            return master + addition[overlap_len:]
    return master + addition
The idea here is to generate all the indices of the last element of master in addition, and add 1 to each. Since a valid overlap must end with the last element of master, only those values are lengths of possible overlaps. Then we can check for each of them if the elements before it also line up.
The function currently assumes that master is longer than addition (you'll probably get an IndexError at master[-overlap_len + i] if it isn't). Add a condition to the overlap_lens generator if you can't guarantee it.
It's also non-greedy, i.e. it looks for the smallest non-empty overlap (merge([1, 2, 2], [2, 2, 3]) will return [1, 2, 2, 2, 3]). I think that's what you meant by "to merge at the last possible valid position". If you want a greedy version, reverse the overlap_lens generator.
I don't offer optimizations but another way of looking at the problem. To me, this seems like a particular case of http://en.wikipedia.org/wiki/Longest_common_substring_problem where the substring would always be at the end of the list/string. The following algorithm is the dynamic programming version.
def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return x_longest - longest, x_longest
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print(master[:s] + addition)
else:
    print(master + addition)

master = [9, 1, 1, 8, 7]
addition = [8, 6, 7]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print(master[:s] + addition)
else:
    print(master + addition)
[1, 3, 9, 8, 3, 4, 5, 7, 8]
[9, 1, 1, 8, 7, 8, 6, 7]
First of all and for clarity, you can replace your while loop with a for loop:
def merge(master, addition):
    for n in range(1, len(master)):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition
Then, you don't have to compare all possible slices, but only those for which master's slice starts with the first element of addition:
def merge(master, addition):
    indices = [len(master) - i for i, x in enumerate(master) if x == addition[0]]
    for n in indices:
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition
So instead of comparing slices like this:
1234123141234
            3579
           3579
          3579
         3579
        3579
       3579
      3579
     3579
    3579
   3579
  3579
 3579
3579
you are only doing these comparisons:
1234123141234
  |   |    |
  |   |    3579
  |   3579
  3579
How much this will speed up your program depends on the nature of your data: the fewer repeated elements your lists have, the better.
You could also generate a list of indices for addition so its own slices always end with master's last element, further restricting the number of comparisons; a sketch of that refinement follows.
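Here is one possible sketch of that refinement (my own illustration, not part of the original answer): intersect the candidate overlap lengths derived from both ends, and only verify those:
def merge(master, addition):
    # A valid overlap of length n must start with addition[0] in master
    # and end with master[-1] in addition, so only lengths present in
    # both candidate sets need to be checked.
    starts = {len(master) - i for i, x in enumerate(master) if x == addition[0]}
    ends = {i + 1 for i, x in enumerate(addition) if x == master[-1]}
    for n in sorted(starts & ends, reverse=True):  # longest candidate first
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition

print(merge([1,3,9,8,3,4,5], [3,4,5,7,8]))  # [1, 3, 9, 8, 3, 4, 5, 7, 8]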
Based on https://stackoverflow.com/a/30056066/541208:
def join_two_lists(a, b):
    index = 0
    for i in range(len(b), 0, -1):
        # if the first i elements of b equal
        # the last i elements of a, we found the overlap
        if b[:i] == a[-i:]:
            index = i
            break
    return a + b[index:]
All the above solutions are similar in that they use a for/while loop for the merging task. I first tried the solutions by @JuniorCompressor and @TankorSmash, but they are way too slow for merging two large lists (e.g. lists with millions of elements).
I found that using pandas to concatenate large lists is much more time-efficient:
import pandas as pd, numpy as np

trainCompIdMaps = pd.DataFrame({"compoundId": np.random.permutation(range(800))[0:80],
                                "partition": np.repeat("train", 80).tolist()})
testCompIdMaps = pd.DataFrame({"compoundId": np.random.permutation(range(800))[0:20],
                               "partition": np.repeat("test", 20).tolist()})
# row-wise concatenation of the two DataFrames
compoundIdMaps = pd.concat([trainCompIdMaps, testCompIdMaps], axis=0)
mergedCompIds = np.array(compoundIdMaps["compoundId"])
What you need is a sequence alignment algorithm like Needleman-Wunsch.
Needleman-Wunsch is a global sequence alignment algorithm based on dynamic programming.
I found this nice implementation to merge arbitrary object sequences in python:
https://github.com/ajnisbet/paired
import paired
seq_1 = 'The quick brown fox jumped over the lazy dog'.split(' ')
seq_2 = 'The brown fox leaped over the lazy dog'.split(' ')
alignment = paired.align(seq_1, seq_2)
print(alignment)
# [(0, 0), (1, None), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 7)]
for i_1, i_2 in alignment:
    print((seq_1[i_1] if i_1 is not None else '').ljust(15), end='')
    print(seq_2[i_2] if i_2 is not None else '')
# The            The
# quick
# brown          brown
# fox            fox
# jumped         leaped
# over           over
# the            the
# lazy           lazy
# dog            dog

Pattern Matching in Python

This question might go closer to pattern matching in image processing.
Is there any way to get a cost function value, applied on different lists, which will return the inter-list proximity? For example,
a = [4, 7, 9]
b = [5, 8, 10]
c = [2, 3]
Now the cost function value for a pair, say (a, b), should be higher than for (a, c) or (b, c). This can be a huge computational task since there can be many more lists, and all permutations would blow up the complexity of the problem, so evaluating only the set of 2-tuples would work as well.
EDIT:
The list names indicate the type of actions, and elements in them are the time at which corresponding actions occur. What I'm trying to do is to come up with set(s) of actions which have similar occurrence pattern. Since two actions cannot occur at the same time, it's the combination of intra- and inter-list distance.
Thanks in advance!
You're asking a very difficult question. Without allowing the sizes to change, there are already several distance measures you could use (Euclidean, Manhattan, etc.; a sketch of both follows below). The one you need depends on what you think a good measure of proximity is for whatever these lists represent.
Without knowing what you're trying to do with these lists no-one can define what a good answer would be, let alone how to compute it efficiently.
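For equal-length lists, the two measures named above are one-liners; a minimal sketch (my illustration, with the caveat that they require equal lengths, which is exactly the problem with c):
import math

def euclidean(p, q):
    # Straight-line distance between two equal-length sequences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(p, q))

print(euclidean([4, 7, 9], [5, 8, 10]))  # 1.732...
print(manhattan([4, 7, 9], [5, 8, 10]))  # 3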
For comparing two strings or lists you can use the Levenshtein distance (Python implementation from here):
def levenshtein(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    matrix = [list(range(l1 + 1))] * (l2 + 1)
    for zz in range(l2 + 1):
        matrix[zz] = list(range(zz, zz + l1 + 1))
    for zz in range(0, l2):
        for sz in range(0, l1):
            if s1[sz] == s2[zz]:
                matrix[zz + 1][sz + 1] = min(matrix[zz + 1][sz] + 1,
                                             matrix[zz][sz + 1] + 1,
                                             matrix[zz][sz])
            else:
                matrix[zz + 1][sz + 1] = min(matrix[zz + 1][sz] + 1,
                                             matrix[zz][sz + 1] + 1,
                                             matrix[zz][sz] + 1)
    return matrix[l2][l1]
Using that on your lists:
>>> a = [4, 7, 9]
>>> b = [5, 8, 10]
>>> c = [2, 3]
>>> levenshtein(a,b)
3
>>> levenshtein(b,c)
3
>>> levenshtein(a,c)
3
EDIT: with the added explanation in the comments, you could use sets instead of lists. Since every element of a set is unique, adding an existing element again is a no-op. And you can use the set's isdisjoint method to check that two sets do not contain the same elements, or the intersection method to see which elements they have in common:
In [1]: a = {1,3,5}
In [2]: a.add(3)
In [3]: a
Out[3]: set([1, 3, 5])
In [4]: a.add(4)
In [5]: a
Out[5]: set([1, 3, 4, 5])
In [6]: b = {2,3,7}
In [7]: a.isdisjoint(b)
Out[7]: False
In [8]: a.intersection(b)
Out[8]: set([3])
N.B.: this syntax of creating sets requires at least Python 2.7.
Given the answer you gave to Michael's clarification, you should probably look up "Dynamic Time Warping".
I haven't used http://mlpy.sourceforge.net/ but its blurb says it provides DTW. (Might be a hammer to crack a nut; depends on your use case.)
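If you don't want to pull in a library, the core of DTW is a short dynamic program; a minimal sketch (using absolute difference as the local cost, which seems appropriate here since the values are timestamps):
def dtw(s, t):
    # D[i][j] = minimal accumulated cost of aligning s[:i] with t[:j]
    INF = float('inf')
    n, m = len(s), len(t)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # advance in s only
                                 D[i][j - 1],      # advance in t only
                                 D[i - 1][j - 1])  # advance in both
    return D[n][m]

a = [4, 7, 9]
b = [5, 8, 10]
c = [2, 3]
print(dtw(a, b), dtw(a, c))  # 3 12
With the question's lists this gives dtw(a, b) == 3 but dtw(a, c) == 12, so (a, b) indeed comes out closer than (a, c), as desired.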

Efficient method to calculate the rank vector of a list in Python

I'm looking for an efficient way to calculate the rank vector of a list in Python, similar to R's rank function. In a simple list with no ties between the elements, element i of the rank vector of a list l should be x if and only if l[i] is the x-th element in the sorted list. This is simple so far, the following code snippet does the trick:
def rank_simple(vector):
return sorted(range(len(vector)), key=vector.__getitem__)
Things get complicated, however, if the original list has ties (i.e. multiple elements with the same value). In that case, all the elements having the same value should have the same rank, which is the average of their ranks obtained using the naive method above. So, for instance, if I have [1, 2, 3, 3, 3, 4, 5], the naive ranking gives me [0, 1, 2, 3, 4, 5, 6], but what I would like to have is [0, 1, 3, 3, 3, 5, 6]. What would be the most efficient way to do this in Python?
Footnote: I don't know if NumPy already has a method to achieve this or not; if it does, please let me know, but I would be interested in a pure Python solution anyway as I'm developing a tool which should work without NumPy as well.
Using scipy, the function you are looking for is scipy.stats.rankdata:
In [13]: import scipy.stats as ss
In [19]: ss.rankdata([3, 1, 4, 15, 92])
Out[19]: array([ 2., 1., 3., 4., 5.])
In [20]: ss.rankdata([1, 2, 3, 3, 3, 4, 5])
Out[20]: array([ 1., 2., 4., 4., 4., 6., 7.])
The ranks start at 1, rather than 0 (as in your example), but then again, that's the way R's rank function works as well.
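Recent SciPy versions also expose a method argument on rankdata ('average', 'min', 'max', 'dense', 'ordinal'), so the other tie conventions are available directly, e.g.:
In [21]: ss.rankdata([1, 2, 3, 3, 3, 4, 5], method='min')
Out[21]: array([1, 2, 3, 3, 3, 6, 7])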
Here is a pure-python equivalent of scipy's rankdata function:
def rank_simple(vector):
    return sorted(range(len(vector)), key=vector.__getitem__)

def rankdata(a):
    n = len(a)
    ivec = rank_simple(a)
    svec = [a[rank] for rank in ivec]
    sumranks = 0
    dupcount = 0
    newarray = [0] * n
    for i in range(n):
        sumranks += i
        dupcount += 1
        if i == n - 1 or svec[i] != svec[i + 1]:
            averank = sumranks / float(dupcount) + 1
            for j in range(i - dupcount + 1, i + 1):
                newarray[ivec[j]] = averank
            sumranks = 0
            dupcount = 0
    return newarray
print(rankdata([3, 1, 4, 15, 92]))
# [2.0, 1.0, 3.0, 4.0, 5.0]
print(rankdata([1, 2, 3, 3, 3, 4, 5]))
# [1.0, 2.0, 4.0, 4.0, 4.0, 6.0, 7.0]
[sorted(l).index(x) for x in l]
sorted(l) will give the sorted version
index(x) will give the index in the sorted array
for example:
l = [-1, 3, 2, 0, 0]
>>> [sorted(l).index(x) for x in l]
[0, 4, 3, 1, 1]
This is one of the functions that I wrote to calculate rank.
def calculate_rank(vector):
    a = {}
    rank = 1
    for num in sorted(vector):
        if num not in a:
            a[num] = rank
            rank = rank + 1
    return [a[i] for i in vector]
input:
calculate_rank([1,3,4,8,7,5,4,6])
output:
[1, 2, 3, 7, 6, 4, 3, 5]
This doesn't give the exact result you specify, but perhaps it would be useful anyway. The following snippet gives the first index for each element, yielding a final rank vector of [0, 1, 2, 2, 2, 5, 6]:
def rank_index(vector):
    return [vector.index(x) for x in vector]
Your own testing would have to prove the efficiency of this.
Here is a small variation of unutbu's code, including an optional 'method' argument for the type of value of tied ranks.
def rank_simple(vector):
    return sorted(range(len(vector)), key=vector.__getitem__)

def rankdata(a, method='average'):
    n = len(a)
    ivec = rank_simple(a)
    svec = [a[rank] for rank in ivec]
    sumranks = 0
    dupcount = 0
    newarray = [0] * n
    for i in range(n):
        sumranks += i
        dupcount += 1
        if i == n - 1 or svec[i] != svec[i + 1]:
            for j in range(i - dupcount + 1, i + 1):
                if method == 'average':
                    averank = sumranks / float(dupcount) + 1
                    newarray[ivec[j]] = averank
                elif method == 'max':
                    newarray[ivec[j]] = i + 1
                elif method == 'min':
                    newarray[ivec[j]] = i + 1 - dupcount + 1
                else:
                    raise NameError('Unsupported method')
            sumranks = 0
            dupcount = 0
    return newarray
There is a really nice module called Ranking http://pythonhosted.org/ranking/ with an easy to follow instruction page. To download, simply use easy_install ranking
I really don't get why all the existing solutions are so complex. This can be done just like this:
[index for element, index in sorted(zip(sequence, range(len(sequence))))]
You build tuples that contain the elements and a running index. Then you sort the whole thing: tuples sort by their first element, with ties broken by the second. This yields a sorted list of tuples from which you just pick out the indices afterwards. It also removes the need to look up elements in the sequence afterwards, which would likely make it an O(N²) operation, whereas this is O(N log N).
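For the question's example this reproduces the naive ranking (ties are broken by position rather than averaged):
>>> sequence = [1, 2, 3, 3, 3, 4, 5]
>>> [index for element, index in sorted(zip(sequence, range(len(sequence))))]
[0, 1, 2, 3, 4, 5, 6]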
So.. this is 2019, and I have no idea why nobody suggested the following:
# Python-only
def rank_list(x, break_ties=False):
    n = len(x)
    t = list(range(n))
    s = sorted(t, key=x.__getitem__)
    if not break_ties:
        for k in range(n - 1):
            t[k + 1] = t[k] + (x[s[k + 1]] != x[s[k]])
    r = s.copy()
    for i, k in enumerate(s):
        r[k] = t[i]
    return r

# Using Numpy (here x must be a numpy array); see also: np.argsort
import numpy as np

def rank_vec(x, break_ties=False):
    n = len(x)
    t = np.arange(n)
    s = sorted(t, key=x.__getitem__)
    if not break_ties:
        t[1:] = np.cumsum(x[s[1:]] != x[s[:-1]])
    r = t.copy()
    np.put(r, s, t)
    return r
This approach has linear runtime complexity after the initial sort, it only stores 2 arrays of indices, and does not require values to be hashable (only pairwise comparison needed).
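A quick check of the pure-Python version on the question's example: the default (break_ties=False) yields dense, zero-based ranks rather than the tie-averaged ranks the question asked for:
>>> rank_list([1, 2, 3, 3, 3, 4, 5])
[0, 1, 2, 2, 2, 3, 4]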
AFAICT, this is better than the other approaches suggested so far:
@unutbu's approach is essentially similar, but (I would argue) too complicated for what the OP asked;
all suggestions using .index() are terrible, with a runtime complexity of O(N²);
@Yuvraj Singh improves slightly upon the .index() search using a dictionary, but with search and insert operations at each iteration this is still highly inefficient in both time (O(N log N)) and space, and it also requires the values to be hashable.
The most pythonic style to find rank of an array:
a = [10.0, 9.8, 8.0, 7.8, 7.7, 7.0, 6.0, 5.0, 4.0, 2.0]
rank = lambda arr: list(map(lambda i: sorted(arr).index(i) + 1, arr))
rank(a)
These snippets gave me a lot of inspiration, especially unutbu's code.
However my needs are simpler, so I changed the code a little.
Hoping to help the guys with the same needs.
Here is the class to record the players' scores and ranks.
class Player():
    def __init__(self, s, r):
        self.score = s
        self.rank = r
Some data.
l = [Player(90,0),Player(95,0),Player(85,0), Player(90,0),Player(95,0)]
Here is the code for calculation:
l.sort(key=lambda x: x.score, reverse=True)
l[0].rank = 1
dupcount = 0
prev = l[0]
for e in l[1:]:
    if e.score == prev.score:
        e.rank = prev.rank
        dupcount += 1
    else:
        e.rank = prev.rank + dupcount + 1
        dupcount = 0
    prev = e
from collections import defaultdict
import numpy as np

def rankVec(arg):
    p = np.unique(arg)            # unique values, in ascending order
    k = (-p).argsort().argsort()  # rank of each unique value, largest value first
    dd = defaultdict(int)
    for i in range(np.shape(p)[0]):
        dd[p[i]] = k[i]
    return np.array([dd[x] for x in arg])
Measured runtime: 46.2 µs.
This works for the Spearman correlation coefficient:
def get_rank(X):
    x_rank = dict((x, i + 1) for i, x in enumerate(sorted(set(X))))
    return [x_rank[x] for x in X]
The rank function could be implemented in O(n log n) time and O(n) additional space using the following approach.
import bisect

def rank_list(lst: list[int]) -> list[int]:
    sorted_vals = sorted(set(lst))
    return [bisect.bisect_left(sorted_vals, val) for val in lst]
I use the bisect library here, but for fully standalone code it is enough to implement a binary search over the sorted array of unique values.
