Python: comparing a few thousand strings. Any fast alternatives for comparison?

I have a set of around 6,000 packets which, for comparison purposes, I represent as strings (their first 28 bytes), to compare against just as many packets, also represented as 28-byte strings.
I have to match each packet of one set against all packets of the other. Matches are always unique.
I found that comparing strings takes a bit of time. Is there any way to speed up the process?
EDIT1: I don't want to permute the string elements, because I always make sure that the ordering between each packet list and its corresponding string list is preserved.
EDIT2: Here's my implementation:
list1, list2  # lists of packets (no duplicates within each list!)
listOfStrings1, listOfStrings2  # corresponding lists of strings; ordering is preserved
tmpMatched, tmpUnmatched = [], []  # accumulators for matched index pairs / unmatched indices
alreadyMatchedlist2Indices = []
for list1Index in xrange(len(listOfStrings1)):
    stringToMatch = listOfStrings1[list1Index]
    matchinglist2Indices = [i for i, list2Str in enumerate(listOfStrings2)
                            if list2Str == stringToMatch and i not in alreadyMatchedlist2Indices]
    if not matchinglist2Indices:
        tmpUnmatched.append(list1Index)
    elif len(matchinglist2Indices) == 1:
        tmpMatched.append([list1Index, matchinglist2Indices[0]])
        alreadyMatchedlist2Indices.append(matchinglist2Indices[0])
    else:
        list2Index = matchinglist2Indices[0]  # taking the first matching element anyway
        tmpMatched.append([list1Index, list2Index])
        alreadyMatchedlist2Indices.append(list2Index)

Here I'm assuming you're taking every string one by one and comparing it to all the others.
I suggest sorting your lists of strings and comparing neighboring strings. This should run in O(n log n).
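A minimal sketch of that sort-and-scan idea (the function name is illustrative, and it assumes matches are unique, as stated in the question): sort both lists together with their original indices, then walk the two sorted sequences in parallel, merge-style, pairing up equal strings. Sorting dominates at O(n log n) string comparisons.
def match_by_sorting(strings1, strings2):
    # pair each string with its original index, then sort by the string itself
    s1 = sorted(enumerate(strings1), key=lambda p: p[1])
    s2 = sorted(enumerate(strings2), key=lambda p: p[1])
    matches, i, j = [], 0, 0
    while i < len(s1) and j < len(s2):
        if s1[i][1] == s2[j][1]:
            matches.append((s1[i][0], s2[j][0]))  # pair of original indices
            i += 1
            j += 1
        elif s1[i][1] < s2[j][1]:
            i += 1
        else:
            j += 1
    return matches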

Here's a simple linear time approach -- at least if I understand your question correctly:
>>> def get_matches(a, b):
...     reverse_map = {x: i for i, x in enumerate(b)}
...     return [(i, reverse_map[x]) for i, x in enumerate(a) if x in reverse_map]
...
>>> get_matches(['a', 'b', 'c'], ['c', 'd', 'e'])
[(2, 0)]
This accepts two sequences of strings, a and b, and returns a list of matches represented as tuples of indices into a and b. This is O(n + m) where m and n are the lengths of a and b.

What's wrong with:
matches = [packet for packet in list1 if packet in list2]

Related

Efficiently search if part of a tuple exists in a list of tuples

I have a list of tuples, each containing 6 numbers ranging from 01 to 99. For example:
tuple_list = [(01,02,03,04,05,06), (20,22,24,26,28,30), (02,03,04,05,06,99)]
For every tuple in this list I need to efficiently find all other tuples that have at least 5 numbers in common with it (excluding the searched tuple itself). So for the above example, the result will be:
(01,02,03,04,05,06) -> (02,03,04,05,06,99)
(20,22,24,26,28,30) -> []
(02,03,04,05,06,99) -> (01,02,03,04,05,06)
The list itself is big and can hold up to 1,000,000 records.
I tried the naive approach of scanning the list one-by-one, but this has an O(n^2) complexity and takes a lot of time.
I thought about maybe using a dict but I can't find a way to search for part of a key (it would have worked fine if I needed to search for the exact key).
Maybe some sort of a suffix/prefix tree variation is needed, but I can't seem to figure it out.
Any help will be appreciated.
The code below builds a dict where the key is a string formed from 5 elements of a tuple and the value is a list of all the tuples that contain those 5 elements.
It runs in O(nm), where n is the size of the tuple list and m is the size of each tuple; for 6-tuples, that is O(6n). See the test results below.
def getCombos(tup):
    """
    Produces all combinations of the tuple with 1 missing
    element from the original
    """
    combos = []
    # sort the input tuple here if it's not already sorted
    for i in range(0, len(tup)):
        tupAsList = list(tup)
        del tupAsList[i]
        combos.append(tupAsList)
    return combos

def getKey(combo):
    """
    Creates a string key for a given combination
    """
    strCombo = [str(i) for i in combo]
    return ",".join(strCombo)

def findMatches(tuple_list):
    """
    Returns dict of tuples that match
    """
    matches = {}
    for tup in tuple_list:
        combos = getCombos(tup)
        for combo in combos:
            key = getKey(combo)
            if key in matches:
                matches[key].append(tup)
            else:
                matches[key] = [tup]
    # filter out matches with fewer than 2 elements (optional)
    matches = {k: v for k, v in matches.items() if len(v) > 1}
    return matches

tuple_list = [(1, 2, 3, 4, 5, 6), (20, 22, 24, 26, 28, 30), (2, 3, 4, 5, 6, 99)]
print(findMatches(tuple_list))
# output: {'2,3,4,5,6': [(1, 2, 3, 4, 5, 6), (2, 3, 4, 5, 6, 99)]}
I tested this code against the brute-force solution. For 1000 tuples, the brute-force version took 5.5 s whereas this solution took 0.03 s.
You can rearrange the output by iterating through the values, but that may be unnecessary depending on how you're using it.
This process is inherently O(N^2): you're comparing each of the N items to each of the other N-1 items. This is a distance metric, and the same theoretical results apply (you can look up all-to-all distance algorithms on Stack Overflow and elsewhere). In most cases, there is not enough information to gather from f(A, B) and f(B, C) to predict whether f(A, C) is greater or less than 5.
First of all, quit using tuples: they don't match your use case. Tuples are indexed, and you don't care about the ordering. (01, 02, 03, 04, 05, 06) and (05, 99, 03, 02, 01, 06) match in five numbers, despite having only two positional matches.
Use the natural data types: sets. Your comparison operation is len(A.intersection(B)). Note that you can flip the logic to a straight distance metric: mismatch = len(A-B) and have a little triangle logic, given that all the sets are the same size (see "triangle inequality").
For instance, if len(A-B) is 1, then 5 numbers match. If you also get len(A-C) is 5, then you know that C differs from B in either 4 or 5 numbers, depending on which number did match.
Given the sparsity of your sets (6 number from at least 99), you can gain a small amount of performance here ... but the overhead and extra checking will likely consume your savings, and the resulting algorithm is still O(N^2).
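A rough sketch of that set-based, pairwise comparison (still O(N^2); the function name is made up for illustration, and it assumes the tuples in the list are distinct):
from itertools import combinations

def pairs_with_five_common(tuple_list):
    # convert each tuple to a set once, then count shared numbers pairwise
    sets = [set(t) for t in tuple_list]
    result = {t: [] for t in tuple_list}
    for i, j in combinations(range(len(sets)), 2):
        if len(sets[i] & sets[j]) >= 5:
            result[tuple_list[i]].append(tuple_list[j])
            result[tuple_list[j]].append(tuple_list[i])
    return result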

Is it possible to extract an intersection list that contains duplicate values?

I want to get an intersection of lists in which duplicates are not eliminated.
I am also hoping for a fast method that does not use explicit loops.
Below was my attempt, but this approach failed because duplicates were removed.
a = ['a','b','c','f']
b = ['a','b','b','o','k']
tmp = list(set(a) & set(b))
>>> tmp
['b','a']
I want the result to be ['a', 'b', 'b'].
In this method, a is a fixed list and b is a variable list; the idea is to extract the values of a from b.
Is there a way to extract an intersection list that does not remove duplicate values?
A solution could be
good = set(a)
result = [x for x in b if x in good]
There are two loops here: one is the set-building loop of set (which is implemented in C, hundreds of times faster than whatever you can do in Python); the other is the comprehension, which runs in the interpreter.
The first loop is done to avoid a linear search in a for each element of b (if a becomes big this can be a serious problem).
Note that using filter instead is probably not going to gain much (if anything) because despite the filter loop being in C, for each element it will have to get back to the interpreter to call the filtering function.
Note that if you care about speed then Python is probably not a good choice... for example, PyPy may be better here, and in that case just writing an optimal algorithm explicitly should be fine (avoiding re-searching a for duplicates when they are consecutive in b, as happens in your example):
good = set(a)
res = []
i = 0
while i < len(b):
    x = b[i]
    if x in good:
        while i < len(b) and b[i] == x:  # is?
            res.append(x)
            i += 1
    else:
        i += 1
Of course in performance optimization the only real way is try and measure with real data on the real system... guessing works less and less as technology advances and becomes more complicated.
If you insist on not using for explicitly then this will work:
>>> list(filter(a.__contains__, b))
['a', 'b', 'b']
But directly calling magic methods like __contains__ is not a recommended practice to the best of my knowledge, so consider this instead:
>>> list(filter(lambda x: x in a, b))
['a', 'b', 'b']
And if you want to improve the lookup in a from O(n) to O(1) then create a set of it first:
>>> a_set = set(a)
>>> list(filter(lambda x: x in a_set, b))
['a', 'b', 'b']
>>> a = ['a','b','c','f']
>>> b = ['a','b','b','o','k']
>>> items = set(a)
>>> found = [i for i in b if i in items]
>>> items
{'f', 'a', 'c', 'b'}
>>> found
['a', 'b', 'b']
This should do the job.
I guess it's not faster than a loop, and in the end you probably still need a loop to extract the result. Anyway...
from collections import Counter
a = ['a','a','b','c','f']
b = ['a','b','b','o','k']
count_b = Counter(b)
count_ab = Counter(set(b)-set(a))
count_b - count_ab
#=> Counter({'a': 1, 'b': 2})
I mean, if res holds the result, you need to:
[ val for sublist in [ [s] * n for s, n in res.items() ] for val in sublist ]
#=> ['a', 'b', 'b']
It isn't clear how duplicates should be handled when intersecting lists that contain duplicate elements, since you have given only one test case with its expected result and did not explain duplicate handling.
Based on that expected output, the common elements are 'a' and 'b', and the intersection list contains 'a' with multiplicity 1 and 'b' with multiplicity 2. Note that 'a' occurs once in both a and b, but 'b' occurs twice in b. So the intersection list keeps each common element with multiplicity equal to its maximum multiplicity in either list.
The answer is yes, although a loop may still be invoked implicitly; you only asked that your code contain no explicit loop statements. The algorithm itself will always be iterative.
Step 1: Create the intersection set, intersect, which contains no duplicates (you have already done that). Convert it to a list to keep indexing.
Step 2: Build a second list, intersectD. For each common element, compute freq, the maximum number of occurrences of that element across the input lists, using count. Then append intersect[k] to intersectD freq times.
Example code with 3 lists:
a = ['a','b','c','1','1','1','1','2','3','o']
b = ['a','b','b','o','1','o','1']
c = ['a','a','a','b','1','2']
intersect = list(set(a) & set(b) & set(c))  # 3-set case
intersectD = []
for k in range(len(intersect)):
    cmn = intersect[k]
    freq = max(a.count(cmn), b.count(cmn), c.count(cmn))  # 3-set case
    for i in range(freq):  # Can be done with itertools
        intersectD.append(cmn)

>>> intersectD
['b', 'b', 'a', 'a', 'a', '1', '1', '1', '1']
For cases involving more than two lists, freq for a common element can be computed with a more complex set-intersection and max expression; if the lists are kept in a list of lists, freq can be computed with an inner loop. You can also replace the inner i-loop with an itertools expression, as in How can I count the occurrences of a list item?.
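As a small sketch of that itertools variant (reusing the a, b and c lists from the example above; the result is the same as with the inner i-loop):
from itertools import repeat

intersect = list(set(a) & set(b) & set(c))
intersectD = []
for cmn in intersect:
    freq = max(a.count(cmn), b.count(cmn), c.count(cmn))
    intersectD.extend(repeat(cmn, freq))  # replaces the inner i-loop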

Sorting list of string with constraints of complexity

I need to sort a list of strings, where each string is of length k, and each string consists only of the characters {'a', 'b', 'c', 'd', 'e'}.
I do have the limitations of:
I cannot use any sorting methods, including the ones that come with Python.
Comparing 2 strings is of complexity O(k).
Sorting is not in-place, a new sorted list needs to be returned.
Space complexity is O(k) [w/o considering the return list of length n].
Time complexity is O(n*k*(5^k)).
The function receives lst, k, which are the list to sort and the length of each string in the list. I have 2 methods, string_to_int() and int_to_string(), that convert a string of length k to a number in the range [0, 5^k) and vice versa; the methods are bijections. Each of these methods has time complexity O(k).
My best attempt was:
def sort_strings(lst, k):
    int_lst = []
    for item in lst:
        int_lst.append(string_to_int(item))
    sorted_lst = []
    for i in range(5**k):
        if i in int_lst:
            sorted_lst.append(int_to_string(k, i))
    return sorted_lst
But here I create int_lst, which is O(n) space rather than O(k).
Any hints on how to approach this?
You have a really long time limit. That's enough time to go through every possible length-k string, in order, and compare those strings against every string in your input list.
What can you do with that?
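A minimal sketch of one way to use that hint, assuming int_to_string(k, i) returns the i-th length-k string over 'a'..'e' in sorted order, as in the question's own code:
def sort_strings(lst, k):
    result = []
    for i in range(5 ** k):              # every possible length-k string, in order
        candidate = int_to_string(k, i)  # O(k)
        for s in lst:                    # compare against every input string
            if s == candidate:           # O(k) comparison
                result.append(candidate)
    return result
Aside from the returned list, this holds only one candidate string at a time, so the extra space is O(k), and the running time is O(n*k*5^k), which fits the stated bound.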
Try to reuse int_list as the output, along the lines of:
replace = 0
for i in range(5 ** k):
    if i in int_list[replace:]:
        int_list[replace] = int_to_string(i, k)
        replace += 1
return int_list
Edit: in case the input list contains duplicates, consider while i in int_list[replace:]: instead of if ....

Making a list comprehension for a dict within a list within a list

I need to create a list comprehension that extracts values from a dict within a list within a list, and my attempts so far are failing me. The object looks like this:
MyList=[[{'animal':'A','color':'blue'},{'animal':'B','color':'red'}],[{'animal':'C','color':'blue'},{'animal':'D','color':'Y'}]]
I want to extract the values for each element in the dict/list/list so that I get two new lists:
Animals=[[A,B],[C,D]]
Colors=[[blue,red],[blue,Y]]
Any suggestions? Doesn't necessarily need to use a list comprehension; that's just been my starting point so far. Thanks!
Animals = [[d['animal'] for d in sub] for sub in MyList]
Colors = [[d['color'] for d in sub] for sub in MyList]
Gives the desired result:
[['A', 'B'], ['C', 'D']]
[['blue', 'red'], ['blue', 'Y']] # No second 'red'.
What I have done here is take each sub-list, then each dictionary, and then access the correct key.
In a single assignment (with a single list comprehension, and the help of map and zip):
Colors, Animals = map(list,
                      zip(*[map(list,
                                zip(*[(d['color'], d['animal']) for d in a]))
                            for a in MyList]))
If you are fine with tuples, you can avoid the two calls to map => list
EDIT:
Let's see it in some details, by decomposing the nested comprehension.
Let's also assume MyList has m elements, for a total of n objects (dictionaries).
[[d for d in sub] for sub in MyList]
This would iterate through every dictionary in the sublists. For each of them, we create a couple with its color property in the first element and its animal property in the second one:
(d['color'], d['animal'])
So far, this will take time proportional to O(n) - exactly n elements will be processed.
print [[(d['color'], d['animal']) for d in sub] for sub in MyList]
Now, for each of the m sublists of the original list, we have one list of couples that we need to unzip, i.e. transform it into two lists of singletons. In Python, unzip is performed using the zip function by passing a variable number of tuples as arguments (the arity of the first tuple determines the number of tuples it outputs). For instance, passing 3 couples, we get two lists of 3 elements each
>>> zip((1,2), (3,4), (5,6)) #Prints [(1, 3, 5), (2, 4, 6)]
To apply this to our case, we need to pass array of couples to zip as a variable number of arguments: that's done using the splat operator, i.e. *
[zip(*[(d['color'], d['animal']) for d in sub]) for sub in MyList]
This operation requires going through each sublist once, and in turn through each one of the couples we created in the previous step. The total running time is therefore O(n + n + m) = O(n), with approximately 2*n + 2*m operations.
So far we have m sublists, each one containing two tuples (the first one will gather all the colors for the sublist, the second one all the animals). To obtain two lists with m tuples each, we apply unzip again
zip(*[zip(*[(d['color'], d['animal']) for d in sub]) for sub in MyList])
This will require an additional m steps - the running time will therefore stay O(n), with approximately 2*n + 4*m operations.
For the sake of simplicity we left out mapping tuples to lists in this analysis - which is fine if you are OK with tuples instead.
Tuples are immutable, however, so it might not be the case.
If you need lists of lists, you need to apply the list function to each tuple: once for each of the m sublists (with a total of 2*n elements), and once for each of the 2 first level lists, i.e. Animals and Colors, (which have a total of m elements each). Assuming list requires time proportional to the length of the sequence it is applied to, this extra step requires 2*n + 2*m operations, which is still O(n).
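For reference, running the one-liner on the MyList from the question (under Python 2, where map and zip return lists) gives the same result as the simpler comprehensions above:
>>> Animals
[['A', 'B'], ['C', 'D']]
>>> Colors
[['blue', 'red'], ['blue', 'Y']]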

What is the fastest way to generate a list of the lengths of sub-strings in a string, given a separator?

I have a string and I need to generate a list of the lengths of all the sub-strings terminating in a given separator.
For example: string = 'a0ddb0gf0', separator = '0', so I need to generate: lengths = [2,4,3], since len('a0')==2, len('ddb0')==4, and len('gf0')==3.
I am aware that it can be done by the following (for example):
separators = [index for index in range(len(string)) if string[index]==separator]
lengths = [separators[index+1] - separators[index] for index in range(len(separators)-1)]
But I need it to be done extremely fast (on large amounts of data). Generating an intermediate list for large amounts of data is time consuming.
Is there a solution that does this neatly and fast (py2.7)?
Fastest? Don't know. You might like to profile it.
>>> print [len(s) for s in 'a0ddb0gf0'.split('0')]
[1, 3, 2, 0]
And, if you really don't want to include zero length strings:
>>> print [len(s) for s in 'a0ddb0gf0'.split('0') if s]
[1, 3, 2]
Personally, I love itertools.groupby()
>>> from itertools import groupby
>>> sep = '0'
>>> data = 'a0ddb0gf0'
>>> [sum(1 for i in g) for (k, g) in groupby(data, sep.__ne__) if k]
[1, 3, 2]
This groups the data according to whether each element is equal to the separator, then gets the length of each group for which the element was not equal (by summing 1's for each item in the group).
itertools functions are generally quite fast, though I don't know for sure how much better than split() this is. The one point that I think is strongly in its favor is that this can seamlessly handle multiple consecutive occurrences of the separator character. It will also handle any iterable for data, not just strings.
I don't know how fast this will go, but here's another way:
def len_pieces(s, sep):
    i = 0
    while True:
        f = s.find(sep, i)
        if f == -1:
            yield len(s) - i
            return
        yield f - i + 1
        i = f + 1
>>> import re
>>> [len(i) for i in re.findall('.+?0', 'a0ddb0gf0')]
[2, 4, 3]
You may use re.finditer to avoid an intermediary list, but it may not be much different in performance:
[len(i.group(0)) for i in re.finditer('.+?0', 'a0ddb0gf0')]
Maybe using a regex:
[len(m.group()) for m in re.finditer('(.*?)0', s)]
