Partial Match Binary Search of Complex Strings - python

Using python, I'm looking to iterate through a list which contains a few thousand entries. For each item in the list it needs to compare against items in other lists (which contain tens of thousands of entries), and do a partial comparison check. Once it finds a match above a set ratio, it will stop and move onto the next item.
One challenge: I am unable to install any additional python packages to complete this and limited to a python 3.4.2 distribution.
Below is some sample code which I am using. It works very well if the lists are small but once I apply it on very large lists, the runtime could take multiple hours to complete.
from difflib import SequenceMatcher
ref_list = [] #(contains 4k sorted entries - long complex strings)
list1 = [] #(contains 60k sorted entries - long complex strings)
list2 = [] #(contains 30k sorted entries - long complex strings)
all_lists = [list1,list2]
min_ratio = 0.93
partMatch = ''
for ref in ref_list:
    for x in range(len(all_lists)):
        for str1 in all_lists[x]:
            check_ratio = SequenceMatcher(None, ref, str1).quick_ratio()
            if check_ratio > min_ratio:
                partMatch = str1 #do stuff with partMatch later
                break
I'm thinking a binary search on all_lists[x] would fix the issue. If my calculations are correct, a 60k list would only take 16 attempts to find the partial match.
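As a quick sanity check on that estimate, here is a minimal sketch that computes the worst-case number of midpoints a binary search inspects. The helper name max_probes is made up for illustration; it relies on the fact that a binary search over n sorted items needs at most floor(log2(n)) + 1 probes, which equals n.bit_length() for n >= 1.

```python
def max_probes(n):
    """Worst-case number of midpoints inspected by a binary search over n sorted items."""
    # floor(log2(n)) + 1 is exactly the bit length of n (for n >= 1)
    return n.bit_length()

for size in (30000, 60000):
    print(size, 'entries ->', max_probes(size), 'probes at most')
```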
However, the issue is with the type of strings. A typical string could be anywhere from 80 to 500 characters long e.g.
lorem/ipsum/dolor/sit/amet/consectetur/adipiscing/elit/sed/do/eiusmod/tempor/incididunt/ut/labore/et/dolore/magna/aliqua/Ut/enim/ad/minim/veniam/quis/nostrud/exercitation
and although the lists are sorted, I'm not sure how I can validate a midpoint. As an example, if I shorten the strings to make it easier to read and provide the following lists:
ref_list = ['past/pre/dest[5]']
list1 = ['abc/def/ghi','xry/dos/zanth']
list2 = ['a/bat/cat', 'ortho/coli', 'past/pre/dest[6]', 'past/tar/lot', 'rif/six/1', 'tenta[17]', 'ufra/cos/xx']
We can see that the partial match for the string in ref_list is list2[2]. However, with a binary search, how do I determine that the partial match is definitely within the first half of list2?
I'd really appreciate any help with this. Efficiency is the most important factor here considering that I need to work on lists with tens of thousands of entries.

So I did more research into the background of string comparisons and it turns out the initial problem isn't as difficult as I originally thought.
To get the midpoint for a binary search, I can simply use the < and > operators. Since every ASCII character has a value, it seems that python will check the strings on a character-by-character basis. In this case, it doesn't matter how complex the string is.
However, one caveat is that some strings in the lists may have a rare naming difference of an uppercase character. To combat this, I've added str().lower() when generating the high/low/midpoints.
Working code is below. I've lowered the min_ratio value here, to cater to the short test strings but I will increase it in my main program.
#!/usr/bin/env python
# Copyright 2009-2017 BHG http://bw.org/
from difflib import SequenceMatcher
def binary_search_partmatch(arr, x):
    low = 0
    high = len(arr) - 1
    mid = 0
    min_ratio = 0.85
    partMatch = ''
    while low <= high:
        mid = (high + low) // 2
        # If midpoint is lower, ignore the left half of array
        if str(arr[mid]).lower() < str(x).lower():
            low = mid + 1
        # If midpoint is higher, ignore the right half of array
        elif str(arr[mid]).lower() > str(x).lower():
            high = mid - 1
        # x is present at the midpoint
        else:
            return -1
    # If we reach here, then the exact element was not present. Check for a close match.
    check_ratio = SequenceMatcher(None, x, str(arr[mid])).ratio()
    if check_ratio > min_ratio:
        partMatch = str(arr[mid])
        return partMatch
    else:
        return -2
def main():
    ref_list = ['past/pre/dest[5]', 'rif/six/1', 'testcase_no_match']
    list1 = ['abc/def/ghi','xry/dos/zanth']
    list2 = ['a/bat/cat', 'ortho/coli', 'past/Pre/dest[6]', 'past/tar/lot', 'rif/six/1', 'tenta[17]', 'ufra/cos/xx']
    all_lists = [list1,list2]
    for ref in ref_list:
        for x in range(len(all_lists)):
            result = binary_search_partmatch(all_lists[x], ref)
            if result == -1:
                print('Exact match found for "' + ref + '"' )
                break
            elif result == -2:
                if x == (len(all_lists)-1):
                    print('No match or partial match found for "' + ref + '"')
            else:
                print('Partial match found for "' + ref + '": "' + str(result)+ '"')
                break
if __name__ == '__main__':
    main()
Output:
>>> Partial match found for "past/pre/dest[5]": "past/Pre/dest[6]"
>>> Exact match found for "rif/six/1"
>>> No match or partial match found for "testcase_no_match"
I'd still welcome any recommendations or unforeseen bugs with my test scenario here. I'm not a programmer by trade, so I may be overlooking something important.
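One thing that may be worth testing: the search above only compares against the last midpoint it visited, and it assumes the lists are already sorted case-insensitively. Below is a minimal sketch of a variant built on the standard-library bisect module (available in Python 3.4). It sorts by the lowercased string first and then checks the entries on both sides of the insertion point. The names prepare, find_match and window are made up for this example, and like any sorted-order lookup it can still miss a near match that differs close to the start of the string, because such strings sort far apart.

```python
from bisect import bisect_left
from difflib import SequenceMatcher

def prepare(strings):
    """Return (lowercased keys, originals), both ordered by the lowercased key."""
    pairs = sorted((s.lower(), s) for s in strings)
    return [k for k, _ in pairs], [s for _, s in pairs]

def find_match(keys, originals, ref, min_ratio=0.85, window=1):
    """Return ('exact', s), ('partial', s) or (None, None)."""
    key = ref.lower()
    pos = bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return 'exact', originals[pos]
    # No exact hit: the closest strings in sort order sit around the insertion point.
    best_ratio, best = 0.0, None
    for i in range(max(0, pos - window), min(len(keys), pos + window)):
        ratio = SequenceMatcher(None, key, keys[i]).ratio()
        if ratio > best_ratio:
            best_ratio, best = ratio, originals[i]
    if best_ratio > min_ratio:
        return 'partial', best
    return None, None

list2 = ['a/bat/cat', 'ortho/coli', 'past/Pre/dest[6]', 'past/tar/lot',
         'rif/six/1', 'tenta[17]', 'ufra/cos/xx']
keys, originals = prepare(list2)
for ref in ['past/pre/dest[5]', 'rif/six/1', 'testcase_no_match']:
    print(ref, find_match(keys, originals, ref))
```

Preparing each big list once and reusing the keys for every reference string keeps the per-lookup cost at one bisect call plus a handful of SequenceMatcher comparisons.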

Related

Extract words from random strings

Below I have some strings in a list:
some_list = ['a','l','p','p','l','l','i','i','r','i','r','a','a']
Now I want to extract the word april from this list. The list contains enough letters for only two copies of april, so I want to take those two copies and append them to another list called extract.
So the extract list should look something like this:
extract = ['aprilapril']
or
extract = ['a','p','r','i','l','a','p','r','i','l']
I tried many times to get everything in extract in the right order, but I still can't seem to get it.
But I know I can just do this
a_count = some_list.count('a')
p_count = some_list.count('p')
r_count = some_list.count('r')
i_count = some_list.count('i')
l_count = some_list.count('l')
total_count = [a_count,p_count,r_count,i_count,l_count]
smallest_count = min(total_count)
extract = ['april' * smallest_count]
But I wouldn't be here if I could just use the code above, because I made some rules for solving this problem:
Each of the characters (a, p, r, i and l) is a magical code element. These code elements can't be created out of thin air; each one is unique and has a unique identifier, like a secret number associated with it. You don't know how to create these magical code elements, so the only way to get them is to extract them into a list.
The characters (a, p, r, i and l) must be in order. Imagine they are links in a chain: they only work when they are together. That means p has to go directly after a, and l must come last.
These code elements are top secret, so the only way to get them is to extract them into a list.
Below are some examples of an incorrect way to do this (breaking the rules):
import re
word = 'april'
some_list = ['aaaaaaappppppprrrrrriiiiiilll']
text = some_list[0]
regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, text)
if match:
    lowest_amount = min(len(g) for g in match.groups())
    print(word * lowest_amount)
else:
    print("no match")
from collections import Counter
def count_recurrence(kernel, string):
    # we need to count both strings
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    effective_counter = {
        k: int(string_counter.get(k, 0)/v)
        for k, v in kernel_counter.items()
    }
    min_recurring_count = min(effective_counter.values())
    return kernel * min_recurring_count
This might sound really stupid, but this is actually a hard problem (well, for me). I originally designed this problem for myself to practice python, but it turned out to be way harder than I thought. I just want to see how other people solve it.
If anyone out there knows how to solve this ridiculous problem, please help me out. I am just a fourteen-year-old trying to learn python. Thank you very much.
I'm not sure what you mean by "cannot copy nor delete the magical codes" - if you want to put them in your output list you will need to "copy" them somehow.
And btw your example code (a_count = some_list.count('a') etc) won't work on a list that holds a single long string, such as ['aaaaaaappppppprrrrrriiiiiilll'], since count will always return zero.
That said, a possible solution is
magics = 'april' # the characters to extract, in order
worklist = [c for c in some_list[0]]
extract = []
fail = False
while not fail:
    lastpos = -1
    tempextract = []
    for magic in magics:
        try:
            # index() raises ValueError when magic does not occur after lastpos
            pos = worklist.index(magic, lastpos+1)
        except ValueError:
            fail = True
            break
        tempextract.append(worklist.pop(pos))
        lastpos = pos-1
    else:
        extract.append(tempextract)
Alternatively, if you don't want to pop the elements when you find them, you can compute the positions of all the occurrences of the first element (the "a") and set lastpos to each of those positions at the beginning of each iteration.
May not be the most efficient way, although code works and is more explicit to understand the program logic:
some_list = ['aaaaaaappppppprrrrrriiiiiilll']
word = 'april'
extract = []
remove = []
string = some_list[0]
for x in range(len(some_list[0])//len(word)): #maximum number of times `word` can appear in `some_list[0]`
    pointer = i = 0
    while i<len(word):
        j=0
        while j<(len(string)-pointer):
            if string[pointer:][j] == word[i]:
                extract.append(word[i])
                remove.append(pointer+j)
                i+=1
                pointer = j+1
                break
            j+=1
        if i==len(word):
            for r_i,r in enumerate(remove):
                string = string[:r-r_i] + string[r-r_i+1:]
            remove = []
        elif j==(len(string)-pointer):
            break
print(extract,string)

Extracting multiple data from a single list

I'm working on a text file that contains several kinds of information. I converted it into a list in python, and right now I'm trying to separate the different data into different lists. The data is presented as follows:
CODE/ DESCRIPTION/ Unity/ Value1/ Value2/ Value3/ Value4 and then repeat, an example would be:
P03133 Auxiliar helper un 203.02 417.54 437.22 675.80
My approach to it until now has been:
Creating lists to store each piece of information:
codes = []
description = []
unity = []
cost = []
Through loops finding a code, based on the code's structure, and using the code's index as base to find the remaining values.
Finding a code's easy, it's a distinct type of information amongst the other data.
For the remaining values I made a loop to find the next numeric value after a code. That way I can delimit the rest of the indexes:
The unity would be the code's index + index until isnumeric - 1, hence it's the first information prior to the first numeric value in each line.
The cost would be the code's index + index until isnumeric + 2, the third value is the only one I need to store.
The description is a little harder, the number of elements that compose it varies across the list. So I used slicing starting at code's index + 1 and ending at index until isnumeric - 2.
for i, carc in enumerate(txtl):
    if carc[0] == "P" and carc[1].isnumeric():
        codes.append(carc)
        j = 0
        while not txtl[i+j].isnumeric():
            j = j + 1
        description.append(" ".join(txtl[i+1:i+j-2]))
        unity.append(txtl[i+j-1])
        cost.append(txtl[i+j])
I'm facing a problem with this approach: although there will always be more elements in the list after a code, I'm getting this error:
while not txtl[i+j].isnumeric():
IndexError: list index out of range
I'd accept any solution that debugs my code, or even a new approach to the problem.
Note: I'm also going to have to do this for a very similar data source, but there the code would be just a sequence of 7 digits, and thus harder to find amongst the other data. Any solution that covers this case is also appreciated!
A slight addition to your code should resolve this:
while i+j < len(txtl) and not txtl[i+j].isnumeric():
    j += 1
The first condition fails when out of bounds, so the second one doesn't get checked.
Also, please use a list of dict items instead of 4 different lists, for example:
thelist = []
thelist.append({'codes': 69, 'description': 'random text', 'unity': 'whatever', 'cost': 'your life'})
In this way you always have the correct values together in the list, and you don't need to keep track of where you are with indexes or other black magic...
EDIT after comment interactions:
Ok, so in this case you split the line you are processing on the space character, and then process the words in the line.
from pprint import pprint # just for pretty printing
textl = 'P03133 Auxiliar helper un 203.02 417.54 437.22 675.80'
the_list = []
def handle_line(textl: str):
    words = [] # non-numeric words: the description followed by the unity
    values = []
    for word in textl.split()[1:]:
        # it splits on whitespace by default
        # you can ignore the first item in the list, as this will always be the code
        # str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
        if not word.replace(',', '').replace('.', '').isnumeric():
            words.append(word)
        else:
            values.append(word)
    # the unity is the last non-numeric word before the values
    unity = words.pop() if words else None
    description = ' '.join(words)
    return {'code': textl.split()[0], 'description': description, 'unity': unity, 'values': values}
the_list.append(handle_line(textl))
pprint(the_list)
str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
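To make that point concrete, here is a small sketch of a numeric check that also accepts decimals. The helper name is_number is made up for this example, and it assumes that a comma in the data is a thousands separator; note that float() also accepts strings such as 'nan' and 'inf', which may or may not matter for your file.

```python
def is_number(word):
    """Return True if word parses as a float once thousands separators are dropped."""
    try:
        float(word.replace(',', ''))
    except ValueError:
        return False
    return True

for word in ['P03133', 'un', '203.02', '1,203.50', '417']:
    print(word, word.isnumeric(), is_number(word))
```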

Create longestPossible(longest_possible in python) helper function that takes 1 integer argument which is a maximum length of a song in seconds

I'm kind of new to coding, so please help me out with this one, with explanations:
songs is an array of objects which are formatted as follows:
{artist: 'Artist', title: 'Title String', playback: '04:30'}
You can expect playback value to be formatted exactly like above.
The output should be the title of the longest song in the database that is not longer than the specified time. If there are no songs matching the criteria in the database, return false.
Either you could change playback, so that instead of a string, it's an integer (for instance, the length of the song in seconds) which you convert to a string for display, and test from there, or, during the test, you could take playback and convert it to its length in seconds, like so:
def songLength(playback):
    seconds = playback.split(':')
    lengthOfSong = int(seconds[0]) * 60 + int(seconds[1])
    return lengthOfSong
This will give the following result:
>>> playback = '04:30'
>>> songLength(playback)
270
I'm not as familiar with the particular data structure you're using, but if you can iterate over these, you could do something like this:
def longestPossible(array, maxLength):
    longest = 0
    songName = ''
    for song in array:
        lenSong = songLength(song.playback) # I'm formatting song's playback like this because I'm not sure how you're going to be accessing it.
        if maxLength >= lenSong and (maxLength - lenSong) < (maxLength - longest):
            longest = lenSong
            songName = song.title
    if longest != 0:
        return songName
    else:
        return '' # Empty strings will evaluate to False.
I haven't tested this, but I think this should at least get you on the right track. There are more Pythonic ways of doing this, so never stop improving your code. Good luck!
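For what it's worth, here is one possible shape of a more Pythonic version. It is only a sketch under the assumption that each song is a plain dict with 'title' and 'playback' keys; the sample songs and the names playback_seconds and longest_possible are made up for illustration.

```python
def playback_seconds(playback):
    """Convert a 'MM:SS' string to a number of seconds."""
    minutes, seconds = playback.split(':')
    return int(minutes) * 60 + int(seconds)

def longest_possible(songs, max_length):
    """Return the title of the longest song not exceeding max_length seconds, else False."""
    candidates = [s for s in songs if playback_seconds(s['playback']) <= max_length]
    if not candidates:
        return False
    longest = max(candidates, key=lambda s: playback_seconds(s['playback']))
    return longest['title']

# Hypothetical sample data in the format described in the question
songs = [
    {'artist': 'Artist A', 'title': 'Short One', 'playback': '02:10'},
    {'artist': 'Artist B', 'title': 'Medium One', 'playback': '04:30'},
    {'artist': 'Artist C', 'title': 'Long One', 'playback': '07:45'},
]
print(longest_possible(songs, 300))
print(longest_possible(songs, 60))
```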

make a global condition break

allow me to preface this by saying that i am learning python on my own as part of my own curiosity, and i was recommended a free online computer science course that is publicly available, so i apologize if i am using terms incorrectly.
i have seen questions regarding this particular problem on here before - but i have a separate question from them and did not want to hijack those threads. the question:
"a substring is any consecutive sequence of characters inside another string. The same substring may occur several times inside the same string: for example "assesses" has the substring "sses" 2 times, and "trans-Panamanian banana" has the substring "an" 6 times. Write a program that takes two lines of input, we call the first needle and the second haystack. Print the number of times that needle occurs as a substring of haystack."
my solution (which works) is:
first = str(input())
second = str(input())
count = 0
location = 0
while location < len(second):
    if location == 0:
        location = str.find(second,first,0)
        if location < 0:
            break
        count = count + 1
    location = str.find(second,first,location +1)
    if location < 0:
        break
    count = count + 1
print(count)
if you notice, i have on two separate occasions made the if statement that if location is less than 0, to break. is there some way to make this a 'global' condition so i do not have repetitive code? i imagine efficiency becomes paramount with increasing program sophistication so i am trying to develop good practice now.
how would python gurus optimize this code or am i just being too nitpicky?
I think Matthew and darshan have the best solution. I will just post a variation which is based on your solution:
first = str(input())
second = str(input())
def count_needle(first, second):
    location = str.find(second,first)
    if location == -1:
        return 0 # none whatsoever
    else:
        count = 1
        while location < len(second):
            location = str.find(second,first,location +1)
            if location < 0:
                break
            count = count + 1
    return count
print(count_needle(first, second))
Idea:
use a function to structure the code when appropriate
initialising the variable location before entering the while loop saves you from checking location < 0 multiple times
Check out regular expressions, python's re module (http://docs.python.org/library/re.html). For example,
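import re
first = str(input())
second = str(input())
# a lookahead matches without consuming characters, so overlapping occurrences are counted;
# re.escape keeps characters such as '.' or '[' in the needle literal
regex = '(?=' + re.escape(first) + ')'
print(len(re.findall(regex, second)))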
As mentioned by Matthew Adams, the best way to do it is using Python's re module.
For your case the solution would look something like this:
import re
def find_needle_in_heystack(needle, heystack):
    # the lookahead counts overlapping occurrences; re.escape treats the needle as literal text
    return len(re.findall('(?=' + re.escape(needle) + ')', heystack))
Since you are learning python, best way would be to use 'DRY' [Don't Repeat Yourself] mantra. There are lots of python utilities that you can use for many similar situation.
For a quick overview of few very important python modules you can go through this class:
Google Python Class
which should only take you a day.
Even your approach could, in my opinion, be simplified by using the fact that find returns -1 even when you ask it to search from a non-existent offset:
>>> x = 'xoxoxo'
>>> start = x.find('o')
>>> indexes = []
>>> while start > -1:
...     indexes.append(start)
...     start = x.find('o',start+1)
...
>>> indexes
[1, 3, 5]
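The same idea can be wrapped in a function so that the "not found" test lives in exactly one place, the loop condition. This is just a sketch; the name count_overlapping is made up, and the empty-needle guard is an assumption about what you would want in that case.

```python
def count_overlapping(needle, haystack):
    """Count occurrences of needle in haystack, including overlapping ones."""
    if not needle:
        return 0  # assumption: an empty needle counts as zero occurrences
    count = 0
    start = haystack.find(needle)
    while start != -1:
        count += 1
        # move one character forward so overlapping matches are still found
        start = haystack.find(needle, start + 1)
    return count

print(count_overlapping('sses', 'assesses'))
print(count_overlapping('an', 'trans-Panamanian banana'))
```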
needle = "ss"
haystack = "ssi lass 2 vecess estan ss."
# note: str.count() only counts non-overlapping occurrences
print('needle occurs %d times in haystack.' % haystack.count(needle))
Here you go :
first = str(input())
second = str(input())
x=len(first)
counter=0
for i in range(0,len(second)):
    if first==second[i:(x+i)]:
        counter=counter+1
print(counter)
Answer
needle=input()
haystack=input()
counter=0
for i in range(0,len(haystack)):
    if(haystack[i:len(needle)+i]!=needle):
        continue
    counter=counter+1
print(counter)

Python, I need the following code to finish quicker

I need the following code to finish quicker without threads or multiprocessing. If anyone knows of any tricks that would be greatly appreciated. maybe for i in enumerate() or changing the list to a string before calculating, I'm not sure.
For the example below, I have attempted to recreate the variables using a random sequence, however this has rendered some of the conditions inside the loop useless ... which is ok for this example, it just means the 'true' application for the code will take slightly longer.
Currently on my i7, the example below (which will mostly bypass some of its conditions) completes in 1 second, I would like to get this down as much as possible.
import random
import time
import collections
import cProfile
def random_string(length=7):
    """Return a random string of given length"""
    return "".join([chr(random.randint(65, 90)) for i in range(length)])
LIST_LEN = 18400
original = [[random_string() for i in range(LIST_LEN)] for j in range(6)]
LIST_LEN = 5
SufxList = [random_string() for i in range(LIST_LEN)]
LIST_LEN = 28
TerminateHook = [random_string() for i in range(LIST_LEN)]
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Exclude above from benchmark
ListVar = original[:]
for b in range(len(ListVar)):
    for c in range(len(ListVar[b])):
        #If its an int ... remove
        try:
            int(ListVar[b][c].replace(' ', ''))
            ListVar[b][c] = ''
        except: pass
        #if any second sufxList delete
        for d in range(len(SufxList)):
            if ListVar[b][c].find(SufxList[d]) != -1: ListVar[b][c] = ''
        for d in range(len(TerminateHook)):
            if ListVar[b][c].find(TerminateHook[d]) != -1: ListVar[b][c] = ''
    #remove all '' from list
    while '' in ListVar[b]: ListVar[b].remove('')
print(ListVar[b])
ListVar = original[:]
That makes a shallow copy of original, so your changes to the second level lists are going to affect the original also. Are you sure that is what you want? Much better would be to build the new modified list from scratch.
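A tiny illustration of the difference, using made-up sample data: slicing copies only the outer list, while copy.deepcopy from the standard library copies the inner lists as well.

```python
import copy

original = [['AAA', 'BBB'], ['CCC']]
shallow = original[:]            # new outer list, but the same inner lists
shallow[0][0] = ''
changed_by_shallow = (original[0][0] == '')   # the original was modified too

original = [['AAA', 'BBB'], ['CCC']]
deep = copy.deepcopy(original)   # inner lists are copied as well
deep[0][0] = ''
changed_by_deep = (original[0][0] == '')      # the original is untouched

print(changed_by_shallow, changed_by_deep)
```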
for b in range(len(ListVar)):
    for c in range(len(ListVar[b])):
Yuck: whenever possible iterate directly over lists.
#If its an int ... remove
try:
    int(ListVar[b][c].replace(' ', ''))
    ListVar[b][c] = ''
except: pass
You want to ignore spaces in the middle of numbers? That doesn't sound right. If the numbers can be negative you may want to use the try..except but if they are only positive just use .isdigit().
#if any second sufxList delete
for d in range(len(SufxList)):
    if ListVar[b][c].find(SufxList[d]) != -1: ListVar[b][c] = ''
Is that just bad naming? SufxList implies you are looking for suffixes; if so, just use .endswith() (and note that you can pass a tuple in to avoid the loop). If you really do want to find whether the suffix is anywhere in the string, use the in operator.
for d in range(len(TerminateHook)):
    if ListVar[b][c].find(TerminateHook[d]) != -1: ListVar[b][c] = ''
Again use the in operator. Also any() is useful here.
#remove all '' from list
while '' in ListVar[b]: ListVar[b].remove('')
and that while is O(n^2) i.e. it will be slow. You could use a list comprehension instead to strip out the blanks, but better just to build clean lists to begin with.
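As a rough sketch of the list comprehension mentioned above (the sample row and the helper name strip_blanks are made up for illustration), this makes a single pass over the list instead of rescanning it for every removal:

```python
def strip_blanks(values):
    """Return a new list without the empty strings, in one pass."""
    return [value for value in values if value != '']

row = ['ABC', '', 'DEF', '', '']
print(strip_blanks(row))
```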
print(ListVar[b])
I think maybe your indentation was wrong on that print.
Putting these suggestions together gives something like:
suffixes = tuple(SufxList)
newListVar = []
for row in original:
    newRow = []
    newListVar.append(newRow)
    for value in row:
        if (not value.isdigit() and
                not value.endswith(suffixes) and
                not any(th in value for th in TerminateHook)):
            newRow.append(value)
    print(newRow)
