Pattern Matching in Python - python

This question might be closer to pattern matching in image processing.
Is there any way to get a cost function value, applied on different lists, which will return the inter-list proximity? For example,
a = [4, 7, 9]
b = [5, 8, 10]
c = [2, 3]
Now the cost function value for a 2-tuple such as (a, b) should be higher than for (a, c) and (b, c). This can be a huge computational task, since there can be many more lists and considering all permutations would blow up the complexity of the problem. So a solution that only handles the set of 2-tuples would work as well.
EDIT:
The list names indicate the type of actions, and elements in them are the time at which corresponding actions occur. What I'm trying to do is to come up with set(s) of actions which have similar occurrence pattern. Since two actions cannot occur at the same time, it's the combination of intra- and inter-list distance.
Thanks in advance!

You're asking a very difficult question. Without allowing the sizes to change there are already several distance measures you could use (Euclidean, Manhattan, etc, check the See Also section for more). The one you need depends on what you think a good measure of the proximity is for whatever these lists represent.
Without knowing what you're trying to do with these lists no-one can define what a good answer would be, let alone how to compute it efficiently.
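For lists of equal length, the two measures named above can be sketched in a few lines. This is only an illustration: the function names euclidean and manhattan are made up here, and neither handles lists of different sizes such as c.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute element-wise differences."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(u, v))

a = [4, 7, 9]
b = [5, 8, 10]
print(manhattan(a, b))  # 3: every element differs by exactly 1
print(euclidean(a, b))  # the square root of 3
```

Note that both return a distance, so a smaller value means the lists are closer.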

For comparing two strings or lists you can use the Levenshtein distance (Python implementation from here):
def levenshtein(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    # Row zz starts as zz, zz+1, ..., zz+l1; list() keeps the rows mutable on Python 3 too
    matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
    for zz in range(0, l2):
        for sz in range(0, l1):
            if s1[sz] == s2[zz]:
                matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
                                         matrix[zz][sz+1] + 1,
                                         matrix[zz][sz])
            else:
                matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
                                         matrix[zz][sz+1] + 1,
                                         matrix[zz][sz] + 1)
    return matrix[l2][l1]
Using that on your lists:
>>> a = [4, 7, 9]
>>> b = [5, 8, 10]
>>> c = [2, 3]
>>> levenshtein(a,b)
3
>>> levenshtein(b,c)
3
>>> levenshtein(a,c)
3
EDIT: with the added explanation in the comments, you could use sets instead of lists. Since every element of a set is unique, adding an existing element again is a no-op. And you can use the set's isdisjoint method to check that two sets do not contain the same elements, or the intersection method to see which elements they have in common:
In [1]: a = {1,3,5}
In [2]: a.add(3)
In [3]: a
Out[3]: set([1, 3, 5])
In [4]: a.add(4)
In [5]: a
Out[5]: set([1, 3, 4, 5])
In [6]: b = {2,3,7}
In [7]: a.isdisjoint(b)
Out[7]: False
In [8]: a.intersection(b)
Out[8]: set([3])
N.B.: this syntax of creating sets requires at least Python 2.7.

Given the answer you gave to Michael's clarification, you should probably look up "Dynamic Time Warping".
I haven't used http://mlpy.sourceforge.net/ but its blurb says it provides DTW. (Might be a hammer to crack a nut; depends on your use case.)
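To give an idea of what Dynamic Time Warping computes, here is a minimal pure-Python sketch of the textbook recurrence. It is not mlpy's API; the name dtw_distance and the choice of abs() as the per-element cost are assumptions made for illustration.

```python
def dtw_distance(s, t):
    """Classic O(len(s) * len(t)) dynamic time warping distance."""
    n, m = len(s), len(t)
    inf = float("inf")
    # D[i][j] = cheapest cost of aligning s[:i] with t[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of: step in s only, step in t only, step in both
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

a = [4, 7, 9]
b = [5, 8, 10]
c = [2, 3]
print(dtw_distance(a, b))  # 3
print(dtw_distance(a, c))  # 12
```

Unlike Euclidean distance, this works for lists of different lengths, and on the lists from the question it ranks (a, b) as closer than (a, c), which matches the ordering asked for.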

Related

Compare Sequences Python

Is there a way in Python to compare 2 sequences in lists even if they are not normalized (I think this is the right word)? For example:
a = [1,1,2,3,3,1,5]
b = [2,3,3,1,5,1,1]
c = [1,1,1,2,3,3,5]
a == b should return True as they contain the same sequence just from a different starting point.
c == a should return False as although they contain the same elements, they do not contain the same sequence
The only thing I can think of is rather inelegant. I would compare the 2 lists and, if they are not equal, shift the last element of the list to the front and compare again. Repeat this until I have shifted the entire list once. However, I will be working with some very large lists, so this will be very inefficient.
This might be more efficient than shifting elements:
>>> a = [1, 1, 2, 3, 3, 1, 5]
>>> b = [2, 3, 3, 1, 5, 1, 1]
>>> c = [1, 1, 1, 2, 3, 3, 5]
>>> astr, bstr, cstr = ["".join(map(str, x)) for x in (a, b, c)]
>>> astr in bstr*2
True
>>> cstr in astr*2
False
What it does is basically join the lists to strings and check if the first string is contained in the other 'doubled'.
Using strings is probably the fastest and should work for simple cases like in the OP. As a more general approach, you can apply the same idea to list slices, e.g.:
>>> any((b*2)[idx:idx+len(a)] == a for idx in range(len(b)))
True
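The slice-based idea can be wrapped in a small function. This is a sketch under two assumptions that are not spelled out above: lists of different lengths never count as equal, and the helper name is_rotation is made up. Working on the list elements directly also avoids a pitfall of the string version, where joining without a separator makes [1, 12] and [11, 2] look identical ("112").

```python
def is_rotation(a, b):
    """Return True if list a equals list b rotated by some offset."""
    if len(a) != len(b):
        return False
    if not a:
        return True  # two empty lists are trivially rotations of each other
    doubled = b + b
    n = len(a)
    # Every rotation of b appears as a length-n window of b + b
    return any(doubled[i:i + n] == a for i in range(n))

a = [1, 1, 2, 3, 3, 1, 5]
b = [2, 3, 3, 1, 5, 1, 1]
c = [1, 1, 1, 2, 3, 3, 5]
print(is_rotation(a, b))  # True
print(is_rotation(c, a))  # False
```

This is still quadratic in the worst case, so for very large lists a linear-time string-matching algorithm run over the doubled list may be worth the extra effort.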

How to refer to next element in a loop?

I've been looking around for this but can't find what I would like to. I'm sure I've seen this done before but I can't seem to find it. Here's an example:
In this case I would like to take the difference of each element in an array,
#Generate sample list
B = [a**2 for a in range(10)]
#Take difference of each element
C = [(b+1)-b for b in B]
the (b+1) is to denote the next element in the array which I don't know how to do and obviously doesn't work, giving the result:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
the result I would like is:
[1, 3, 5, 7, 9, 11, 13, 15, 17]
I understand that this result is shorter than the original array however the reason for this would be to replace ugly expressions such as:
C = [B[i+1]-B[i] for i in range(len(B)-1)]
In this case it really isn't that bad at all, but there are cases that I need to iterate through multiple variables with long expressions and it gets annoying to keep having to write the index in each time. Right now I'm hoping that there is an easy pythonic way to do this that I don't know about...
EDIT: An example of what I mean about having to do this with multiple variables would be:
X = [a for a in range(10)]
Y = [a**2 for a in range(10)]
Z = [a**3 for a in range(10)]
for x,y,z in zip(X,Y,Z):
    x + (x+1) + (y-(y+1))/(z-(z+1))
where (x+1),(y+1),(z+1) denote the next element rather than:
for i in range(len(X) - 1):
    X[i] + X[i+1] + (Y[i]-Y[i+1])/(Z[i]-Z[i+1])
I am using python 2.7.5 btw
re: your edit
zip is still the right solution. You just need to zip together two iterators over the lists, the second of which should be advanced one tick.
from itertools import izip,tee
cur,nxt = tee(izip(X,Y,Z))
next(nxt,None) #advance nxt iterator
for (x1,y1,z1),(x2,y2,z2) in izip(cur,nxt):
    print x1 + x2 + (y1-y2)/(z1-z2)
If you don't like the inline next call, you can use islice like @FelixKling mentioned: izip(cur,islice(nxt, 1, None)).
Alternatively, you can use zip to create tuples of (current value, next value):
C = [b - a for a, b in zip(B, B[1:])]
zip returns a lazy iterator in Python 3. In Python 2 it returns a list, so you might want to use izip there. And instead of B[1:], you could use islice: islice(B, 1, None).
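Putting the zip and islice suggestions together, a sketch for Python 3 (where zip is already lazy, so izip is not needed) might look like the following. The helper names pairwise_diffs and combined are made up for illustration, and both expect real sequences such as lists rather than one-shot iterators.

```python
from itertools import islice

def pairwise_diffs(values):
    """Differences between each element and its successor."""
    # islice skips the first element without copying the list
    return [nxt - cur for cur, nxt in zip(values, islice(values, 1, None))]

def combined(X, Y, Z):
    """The multi-variable expression from the question, without indices."""
    rows = list(zip(X, Y, Z))
    return [x1 + x2 + (y1 - y2) / (z1 - z2)
            for (x1, y1, z1), (x2, y2, z2) in zip(rows, islice(rows, 1, None))]

B = [a**2 for a in range(10)]
print(pairwise_diffs(B))  # [1, 3, 5, 7, 9, 11, 13, 15, 17]

X = list(range(10))
Y = [a**2 for a in range(10)]
Z = [a**3 for a in range(10)]
print(combined(X, Y, Z))
```

Under Python 3 the division in combined is true division; on Python 2.7 the same expression would use integer division unless `from __future__ import division` is added.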
Maybe you want enumerate. As follows:
C = [B[b+1]-item for b,item in enumerate(B[:-1])]
or simply:
C = [B[b+1]-B[b] for b in range(len(B[:-1]))]
They both work.
Examples
>>> B = [a**2 for a in range(10)]
>>> C = [B[b+1]-item for b,item in enumerate(B[:-1])]
>>> print C
[1, 3, 5, 7, 9, 11, 13, 15, 17]
This is a pretty weird way, but it works!
b = [a**2 for a in range(10)]
print b
print reduce(lambda x, y:len(x) and x[:-1]+[y-x[-1], y] or [y], b, [])
I have created a bunk on CodeBunk so you can run it too
http://codebunk.com/b/-JJzLIA-KZgASR_3a-I8

good practice for string.partition in python

Sometimes I write code like this:
a,temp,b = s.partition('-')
I just need to pick the first and 3rd elements. temp would never be used. Is there a better way to do this?
In other terms, is there a better way to pick distinct elements to make a new list?
For example, I want to make a new list using the elements 0,1,3,7 from the old list. The code would be like this:
newlist = [oldlist[0],oldlist[1],oldlist[3],oldlist[7]]
It's pretty ugly, isn't it?
Be careful using
a, _, b = s.partition('-')
Sometimes _ is used for internationalization (gettext), so you wouldn't want to accidentally overwrite it.
Usually I would do this for partition rather than creating a variable I don't need
a, b = s.partition('-')[::2]
and this in the general case
from operator import itemgetter
ig0137 = itemgetter(0, 1, 3, 7)
newlist = ig0137(oldlist)
The itemgetter is more efficient than a list comprehension if you are using it in a loop
For the first there's also this alternative:
a, b = s.partition('-')[::2]
For the latter, since there's no clear interval, there is no really clean way to do it. But this might suit your needs:
newlist = [oldlist[k] for k in (0, 1, 3, 7)]
You can use Python's extended slicing feature to access a list periodically:
>>> a = range(10)
>>> # Pick every other element in a starting from a[1]
>>> b = a[1::2]
>>> print b
[1, 3, 5, 7, 9]
Negative indexing works as you'd expect:
>>> c = a[-1::-2]
>>> print c
[9, 7, 5, 3, 1]
For your case,
>>> a, b = s.partition('-')[::2]
The common practice in Python to pick the 1st and 3rd values is:
a, _, b = s.partition('-')
And to pick specified elements in a list you can do :
newlist = [oldlist[k] for k in (0, 1, 3, 7)]
If you don't need to retain the middle field you can use split (and similarly rsplit) with the optional maxsplit parameter to limit the splits to the first (or last) match of the separator:
a, b = s.split('-', 1)
This avoids a throwaway temporary or additional slicing.
The only caveat is that with split, unlike partition, a one-element list containing the original string is returned if the separator is not found, so the attempt to unpack into two names fails with a ValueError. The partition method always returns a 3-tuple.
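The difference in behavior when the separator is missing is easy to see side by side. The helper name split_once is hypothetical, and returning None for a missing tail is just one possible convention.

```python
def split_once(s, sep="-"):
    """Split s at the first sep; the tail is None when sep is absent."""
    head, found, tail = s.partition(sep)
    # partition returns an empty separator field when nothing matched
    return (head, tail) if found else (head, None)

print("a-b-c".partition("-")[::2])  # ('a', 'b-c')
print("abc".partition("-")[::2])    # ('abc', '')
print("abc".split("-", 1))          # ['abc'], so "a, b = ..." would raise ValueError
print(split_once("abc"))            # ('abc', None)
```

Checking the middle field of partition is the usual way to tell "separator absent" apart from "separator present but followed by an empty string".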

Intersection complexity

In Python you can get the intersection of two sets doing:
>>> s1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> s2 = {0, 3, 5, 6, 10}
>>> s1 & s2
set([3, 5, 6])
>>> s1.intersection(s2)
set([3, 5, 6])
Does anybody know the complexity of this intersection (&) algorithm?
EDIT: In addition, does anyone know what is the data structure behind a Python set?
The data structure behind the set is a hash table where the typical performance is an amortized O(1) lookup and insertion.
The intersection algorithm loops exactly min(len(s1), len(s2)) times. It performs one lookup per loop and if there is a match performs an insertion. In pure Python, it looks like this:
def intersection(self, other):
    if len(self) <= len(other):
        little, big = self, other
    else:
        little, big = other, self
    result = set()
    for elem in little:
        if elem in big:
            result.add(elem)
    return result
The answer appears to be a search engine query away. You can also use this direct link to the Time Complexity page at python.org. Quick summary:
Average: O(min(len(s), len(t)))
Worst case: O(len(s) * len(t))
EDIT: As Raymond points out below, the "worst case" scenario isn't likely to occur. I included it originally to be thorough, and I'm leaving it to provide context for the discussion below, but I think Raymond's right.
Set intersection of two sets of sizes m,n can be achieved with O(max{m,n} * log(min{m,n})) in the following way:
Assume m << n
1. Represent the two sets as list/array(something sortable)
2. Sort the **smaller** list/array (cost: m*logm)
3. Do until all elements in the bigger list have been checked:
3.1 Sort the next **m** items on the bigger list(cost: m*logm)
3.2 With a single pass compare the smaller list and the m items you just sorted and take the ones that appear in both of them(cost: m)
4. Return the new set
The loop in step 3 will run for n/m iterations and each iteration will take O(m*logm), so you will have time complexity of O(nlogm) for m << n.
I think that's the best lower bound that exists
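The steps above can be sketched roughly as follows. This is an illustrative reading of the procedure, not a tuned implementation; the function name sorted_chunk_intersection is made up, and it assumes the elements can be ordered with <.

```python
def sorted_chunk_intersection(s, t):
    """Intersect two collections by sorting the smaller one and m-sized chunks of the bigger one."""
    small, big = (s, t) if len(s) <= len(t) else (t, s)
    small_sorted = sorted(set(small))  # step 2: sort the smaller collection once
    m = len(small_sorted)
    result = set()
    if m == 0:
        return result
    big = list(big)
    for start in range(0, len(big), m):          # step 3: walk the bigger list in chunks
        chunk = sorted(big[start:start + m])     # step 3.1: sort the next m items
        i = j = 0
        while i < m and j < len(chunk):          # step 3.2: single merge-style pass
            if small_sorted[i] == chunk[j]:
                result.add(chunk[j])
                i += 1
                j += 1
            elif small_sorted[i] < chunk[j]:
                i += 1
            else:
                j += 1
    return result                                # step 4

s1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
s2 = {0, 3, 5, 6, 10}
print(sorted_chunk_intersection(s1, s2) == (s1 & s2))  # True
```

For hashable elements the built-in hash-based intersection is normally the better choice; a sort-based variant like this is mainly of interest when elements are comparable but not hashable.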

Calculating the similarity of two lists

I have two lists:
eg.
a = [1,8,3,9,4,9,3,8,1,2,3]
and
b = [1,8,1,3,9,4,9,3,8,1,2,3]
Both contain ints. There is no meaning behind the ints (eg. 1 is not 'closer' to 3 than it is to 8).
I'm trying to devise an algorithm to calculate the similarity between two ORDERED lists. Ordered is the keyword here (so I can't just take the set of both lists and calculate their set_difference percentage). Sometimes numbers do repeat (for example 3, 8, and 9 above), and I cannot ignore the repeats.
In the example above, the function I would call would tell me that a and b are ~90% similar for example. How can I do that? Edit distance was something which came to mind. I know how to use it with strings but I'm not sure how to use it with a list of ints. Thanks!
You can use the difflib module
ratio()
Return a measure of the sequences’ similarity as a float in the range [0, 1].
Which gives :
>>> import difflib
>>> s1=[1,8,3,9,4,9,3,8,1,2,3]
>>> s2=[1,8,1,3,9,4,9,3,8,1,2,3]
>>> sm=difflib.SequenceMatcher(None,s1,s2)
>>> sm.ratio()
0.9565217391304348
It sounds like edit (or Levenshtein) distance is precisely the right tool for the job.
Here is one Python implementation that can be used on lists of integers: http://hetland.org/coding/python/levenshtein.py
Using that code, levenshtein([1,8,3,9,4,9,3,8,1,2,3], [1,8,1,3,9,4,9,3,8,1,2,3]) returns 1, which is the edit distance.
Given the edit distance and the lengths of the two arrays, computing a "percentage similarity" metric should be pretty trivial.
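As a rough sketch of that last step, here is a self-contained edit distance that works on any sequences whose elements support ==, together with one possible way to turn it into a similarity score. This is not the code from the linked page; the names edit_distance and similarity, and the choice to normalize by the longer list's length, are assumptions.

```python
def edit_distance(s, t):
    """Levenshtein distance using two rows of the dynamic-programming table."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute, free if equal
        prev = cur
    return prev[-1]

def similarity(s, t):
    """1.0 for identical sequences, 0.0 when every position must be edited."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s, t) / float(longest)

a = [1, 8, 3, 9, 4, 9, 3, 8, 1, 2, 3]
b = [1, 8, 1, 3, 9, 4, 9, 3, 8, 1, 2, 3]
print(edit_distance(a, b))  # 1
print(similarity(a, b))     # 11/12, roughly 0.917
```

Other normalizations (for example dividing by the sum of both lengths) are equally defensible; which one is right depends on what "90% similar" is supposed to mean.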
One way to tackle this is to utilize histogram. As an example (demonstration with numpy):
In []: a= array([1,8,3,9,4,9,3,8,1,2,3])
In []: b= array([1,8,1,3,9,4,9,3,8,1,2,3])
In []: a_c, _= histogram(a, arange(9)+ 1)
In []: a_c
Out[]: array([2, 1, 3, 1, 0, 0, 0, 4])
In []: b_c, _= histogram(b, arange(9)+ 1)
In []: b_c
Out[]: array([3, 1, 3, 1, 0, 0, 0, 4])
In []: (a_c- b_c).sum()
Out[]: -1
There are now plenty of ways to make use of a_c and b_c.
The (seemingly) simplest similarity measure is:
In []: 1- abs(-1/ 9.)
Out[]: 0.8888888888888888
Followed by:
In []: norm(a_c)/ norm(b_c)
Out[]: 0.92796072713833688
and:
In []: a_n= (a_c/ norm(a_c))[:, None]
In []: 1- norm(b_c- dot(dot(a_n, a_n.T), b_c))/ norm(b_c)
Out[]: 0.84445724579043624
Thus, you need to be much more specific to find the similarity measure most suitable for your purposes.
Just use the same algorithm for calculating edit distance on strings if the values don't have any particular meaning.
I've implemented something for a similar task a long time ago. Now, I have only a blog entry for that. It was simple: you had to compute the pdf of both sequences then it would find the common area covered by the graphical representation of pdf.
Sorry for the broken images on link, the external server that I've used back then is dead now.
Right now, for your problem the code translates to
def overlap(pdf1, pdf2):
    s = 0
    for k in pdf1:
        if pdf2.has_key(k):
            s += min(pdf1[k], pdf2[k])
    return s
def pdf(l):
    d = {}
    s = 0.0
    for i in l:
        s += i
        if d.has_key(i):
            d[i] += 1
        else:
            d[i] = 1
    for k in d:
        d[k] /= s
    return d
def solve():
    a = [1, 8, 3, 9, 4, 9, 3, 8, 1, 2, 3]
    b = [1, 8, 1, 3, 9, 4, 9, 3, 8, 1, 2, 3]
    pdf_a = pdf(a)
    pdf_b = pdf(b)
    print pdf_a
    print pdf_b
    print overlap(pdf_a, pdf_b)
    print overlap(pdf_b, pdf_a)
if __name__ == '__main__':
    solve()
Unfortunately, it gives an unexpected answer, only 0.212292609351
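The low figure appears to come from pdf() dividing each count by the sum of the element values (s += i) rather than by the number of elements, so the resulting "probabilities" do not add up to 1. A sketch of the same idea with that normalization changed is shown below; using collections.Counter is a substitution made here, not part of the original code.

```python
from collections import Counter

def pdf(values):
    """Relative frequency of each distinct value; the frequencies sum to 1."""
    total = float(len(values))
    return {k: count / total for k, count in Counter(values).items()}

def overlap(pdf1, pdf2):
    """Shared area of two discrete distributions, between 0 and 1."""
    return sum(min(p, pdf2[k]) for k, p in pdf1.items() if k in pdf2)

a = [1, 8, 3, 9, 4, 9, 3, 8, 1, 2, 3]
b = [1, 8, 1, 3, 9, 4, 9, 3, 8, 1, 2, 3]
print(overlap(pdf(a), pdf(b)))  # about 0.93
```

Keep in mind that this measure ignores the order of the elements entirely, which the question says matters, so it is at best a coarse first filter.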
The solution proposed by @kraymer does not work in the case of
s1=[1,2,3,4,5,6,7,8,9,10]
s2=[2,1,3,4,5,6,7,8,9,9]
since it returns 0.8 even though there are 3 different elements and not 2.
A workaround could be:
def find_percentage_agreement(s1, s2):
    assert len(s1)==len(s2), "Lists must have the same shape"
    nb_agreements = 0 # initialize counter to 0
    for idx, value in enumerate(s1):
        if s2[idx] == value:
            nb_agreements += 1
    # float() avoids integer division on Python 2
    percentage_agreement = nb_agreements / float(len(s1))
    return percentage_agreement
Which returns the expected result:
>>> s1=[1,2,3,4,5,6,7,8,9,10]
>>> s2=[2,1,3,4,5,6,7,8,9,9]
>>> find_percentage_agreement(s1, s2)
0.7
Unless I'm missing the point:
from __future__ import division
def similar(x,y):
    si = 0
    for a,b in zip(x, y):
        if a == b:
            si += 1
    return (si/len(x)) * 100
if __name__ == '__main__':
    a = [1,8,3,9,4,9,3,8,1,2,3]
    b = [1,8,1,3,9,4,9,3,8,1,2,3]
    result = similar(a,b)
    if result is not None:
        print "%s%s Similar!" % (result,'%')
