Time elapsed between linear search and binary search using Python

I have made two Python functions below, one for sequential (linear) search and the other for binary search.
I want to do these three things for each size value in the given list:
1) generate a list of random integer values (between 1 and 10,000,000) for the given list size
2) run a sequential search for -1 on the list and record the time elapsed by sequential search
3) run a binary search for -1 on the sorted list (after sorting the list) and record the time elapsed by binary search
What I have done is:
def sequentialSearch(alist, item):
    pos = 0
    found = False
    while pos < len(alist) and not found:
        if alist[pos] == item:
            found = True
        else:
            pos = pos + 1
    return found
def binSearch(list, target):
    list.sort()
    return binSearchHelper(list, target, 0, len(list) - 1)

def binSearchHelper(list, target, left, right):
    if left > right:
        return False
    middle = (left + right) // 2
    if list[middle] == target:
        return True
    elif list[middle] > target:
        return binSearchHelper(list, target, left, middle - 1)
    else:
        return binSearchHelper(list, target, middle + 1, right)
import random
import time

list_sizes = [10, 100, 1000, 10000, 100000, 1000000]
for size in list_sizes:
    list = []
    for x in range(size):
        list.append(random.randint(1, 10000000))
    sequential_search_start_time = time.time()
    sequentialSearch(list, -1)
    sequential_search_end_time = time.time()
    print("Time taken by linear search is = ", (sequential_search_end_time - sequential_search_start_time))
    binary_search_start_time = time.time()
    binSearch(list, -1)
    binary_search_end_time = time.time()
    print("Time taken by binary search is = ", (binary_search_end_time - binary_search_start_time))
    print("\n")
The output I am getting shows binary search taking more time than linear search for every list size.
As we know, binary search is supposed to be much faster than linear search.
So why is it showing the time consumed by binary search as more than the time consumed by linear search?

1) You need to account for the sorting time. Binary search works only on sorted lists, so the sort itself takes time and pushes the overall complexity to O(n log n). In your case you are sorting after the timer has started, so the measured time is higher.
2) You are searching for an element that doesn't exist in the list (-1), which is not the average case for binary search; it is the worst case, where the search keeps halving the range only to never find the element.
3) Please do not use list as a variable name; it is a Python built-in that you are clearly shadowing. Use something else.
Now, if you sort the list without timing it, the results change drastically. Here are mine.
Time taken by linear search is = 9.059906005859375e-06
Time taken by binary search is = 8.58306884765625e-06
Time taken by linear search is = 1.2159347534179688e-05
Time taken by binary search is = 4.5299530029296875e-06
Time taken by linear search is = 0.00011110305786132812
Time taken by binary search is = 5.9604644775390625e-06
Time taken by linear search is = 0.0011129379272460938
Time taken by binary search is = 8.344650268554688e-06
Time taken by linear search is = 0.011270761489868164
Time taken by binary search is = 1.5497207641601562e-05
Time taken by linear search is = 0.11133551597595215
Time taken by binary search is = 1.7642974853515625e-05
All I did was sort the list before it was timed. That is way, way better than sorting it and searching it inside the same timed block.
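For reference, a minimal sketch of the adjusted timing loop (the list is renamed to data per point 3, and binSearchHelper from the question is called directly so that the sort happens outside the timed region):

import random
import time

list_sizes = [10, 100, 1000, 10000, 100000, 1000000]

for size in list_sizes:
    data = [random.randint(1, 10000000) for _ in range(size)]

    start = time.time()
    sequentialSearch(data, -1)
    print("Time taken by linear search is = ", time.time() - start)

    data.sort()  # sorting is done before the binary-search timer starts

    start = time.time()
    binSearchHelper(data, -1, 0, len(data) - 1)
    print("Time taken by binary search is = ", time.time() - start)
    print()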

Related

Partial Match Binary Search of Complex Strings

Using python, I'm looking to iterate through a list which contains a few thousand entries. For each item in the list it needs to compare against items in other lists (which contain tens of thousands of entries), and do a partial comparison check. Once it finds a match above a set ratio, it will stop and move onto the next item.
One challenge: I am unable to install any additional Python packages to complete this and am limited to a Python 3.4.2 distribution.
Below is some sample code which I am using. It works very well if the lists are small but once I apply it on very large lists, the runtime could take multiple hours to complete.
from difflib import SequenceMatcher

ref_list = []  # (contains 4k sorted entries - long complex strings)
list1 = []     # (contains 60k sorted entries - long complex strings)
list2 = []     # (contains 30k sorted entries - long complex strings)
all_lists = [list1, list2]

min_ratio = 0.93
partMatch = ''

for ref in ref_list:
    for x in range(len(all_lists)):
        for str1 in all_lists[x]:
            check_ratio = SequenceMatcher(None, ref, str1).quick_ratio()
            if check_ratio > min_ratio:
                partMatch = str1  # do stuff with partMatch later
                break
I'm thinking a binary search on all_lists[x] would fix the issue. If my calculations are correct, a 60k list would only take 16 attempts to find the partial match.
However, the issue is with the type of strings. A typical string could be anywhere from 80 to 500 characters long e.g.
lorem/ipsum/dolor/sit/amet/consectetur/adipiscing/elit/sed/do/eiusmod/tempor/incididunt/ut/labore/et/dolore/magna/aliqua/Ut/enim/ad/minim/veniam/quis/nostrud/exercitation
and although the lists are sorted, I'm not sure how I can validate a midpoint. As an example, if I shorten the strings to make them easier to read and provide the following lists:
ref_list = ['past/pre/dest[5]']
list1 = ['abc/def/ghi','xry/dos/zanth']
list2 = ['a/bat/cat', 'ortho/coli', 'past/pre/dest[6]', 'past/tar/lot', 'rif/six/1', 'tenta[17]', 'ufra/cos/xx']
We can see that the partial match for the string in ref_list is list2[2]. However, with a binary search, how do I determine that the partial match is definitely within the first half of list2?
I'd really appreciate any help with this. Efficiency is the most important factor here considering that I need to work on lists with tens of thousands of entries.
So I did more research into the background of string comparisons and it turns out the initial problem isn't as difficult as I originally thought.
To get the midpoint for a binary search, I can simply use the < and > operators. Since every character has a code-point value, Python compares the strings character by character; in this case, it doesn't matter how complex the string is.
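For illustration (a hypothetical snippet, not from the original post), the comparison is lexicographic and is decided by the first differing character:
>>> 'past/pre/dest[5]' < 'past/tar/lot'
True
>>> 'Past' < 'past'
True
The second result also shows that uppercase letters sort before lowercase ones, which is what the .lower() calls in the code below guard against.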
However, one caveat is that some strings in the lists may have a rare naming difference of an uppercase character. To combat this, I've added str().lower() when generating the high/low/midpoints.
Working code is below. I've lowered the min_ratio value here, to cater to the short test strings but I will increase it in my main program.
#!/usr/bin/env python
# Copyright 2009-2017 BHG http://bw.org/
from difflib import SequenceMatcher

def binary_search_partmatch(arr, x):
    low = 0
    high = len(arr) - 1
    mid = 0
    min_ratio = 0.85
    partMatch = ''
    while low <= high:
        mid = (high + low) // 2
        # If midpoint is lower, ignore the left half of the array
        if str(arr[mid]).lower() < str(x).lower():
            low = mid + 1
        # If midpoint is higher, ignore the right half of the array
        elif str(arr[mid]).lower() > str(x).lower():
            high = mid - 1
        # x is present at the midpoint
        else:
            return -1
    # If we reach here, then the exact element was not present. Check for a close match.
    check_ratio = SequenceMatcher(None, x, str(arr[mid])).ratio()
    if check_ratio > min_ratio:
        partMatch = str(arr[mid])
        return partMatch
    else:
        return -2

def main():
    ref_list = ['past/pre/dest[5]', 'rif/six/1', 'testcase_no_match']
    list1 = ['abc/def/ghi', 'xry/dos/zanth']
    list2 = ['a/bat/cat', 'ortho/coli', 'past/Pre/dest[6]', 'past/tar/lot', 'rif/six/1', 'tenta[17]', 'ufra/cos/xx']
    all_lists = [list1, list2]
    for ref in ref_list:
        for x in range(len(all_lists)):
            result = binary_search_partmatch(all_lists[x], ref)
            if result == -1:
                print('Exact match found for "' + ref + '"')
                break
            elif result == -2:
                if x == (len(all_lists) - 1):
                    print('No match or partial match found for "' + ref + '"')
            else:
                print('Partial match found for "' + ref + '": "' + str(result) + '"')
                break

if __name__ == '__main__':
    main()
Output:
>>> Partial match found for "past/pre/dest[5]": "past/Pre/dest[6]"
>>> Exact match found for "rif/six/1"
>>> No match or partial match found for "testcase_no_match"
I'd still welcome any recommendations, or reports of unforeseen bugs in my test scenario here. I'm not a programmer by trade, so I may be overlooking something important.

Trying to speed up a recursive function, but memoization is making it take longer

I am trying to implement regulation 9.3 from the FIDE Chess Olympiad pairing system.
Below is the script I'm trying to run. When I comment out the @cached line, it actually runs faster. I want to use this function for even values of n up to ~100.
import itertools
from copy import deepcopy
from memoization import cached

@cached
def pairing(n, usedTeams=[], teams=None, reverse=False):
    """
    Returns the pairings of a list of teams based on their index in their position in the pool.
    Arguments:
        n = number of teams
        usedTeams = a parameter used in recursion to carry the found matches to the end of the recursion (i.e. a leaf node)
        teams = used in recursion ^^
        reverse = if you need to prioritize finding a pairing for the lowest rated team
    Returns:
        A list of lists of match pairings
    """
    # print('trying to pair', n, ' teams')
    # if n > 10:
    #     return None
    if teams is None:
        teams = list(range(0, n))
        global matches
        matches = []
        if reverse == True:
            teams.reverse()
    usedTeams = deepcopy(usedTeams)
    oppTeams = []
    if len(teams) == 2:
        usedTeams.append([teams[0], teams[1]])
        matches.append(usedTeams)
    elif len(teams) > 2:
        team = teams[0]
        oppTeams = [teams[i] for i in itertools.chain(range(round(n/2), n), range(round(n/2)-1, 0, -1))]
        currUsed = deepcopy(usedTeams)
        for opp in oppTeams:
            newUsed = currUsed + [[team, opp]]
            if len(oppTeams) > 1:
                tmpTeams = [t for t in teams if t not in [team, opp]]
                pairing(len(tmpTeams), newUsed, tmpTeams)
    return matches

import time

start = time.process_time()
pairing(12, [], None)
print(time.process_time() - start)
Any tips for making this run faster, or using memoization differently?
I modified your code to find out:
import itertools
from copy import deepcopy
from memoization import cached

# set up a record of call parameters
from collections import defaultdict
calls = defaultdict(int)

@cached
def pairing(n, usedTeams=[], teams=None, reverse=False):
    # count this call
    calls[(
        n,
        tuple(tuple(t) for t in usedTeams) if usedTeams is not None else None,
        tuple(teams) if teams is not None else None,
        reverse
    )] += 1
    ...  # your same code here, left out for brevity

import time

start = time.process_time()
pairing(12, [], None)
print(time.process_time() - start)

# print the average number of calls for any parameter combination
print(sum(calls.values()) / len(calls))
Output:
0.265625
1.0
The average number of calls using any combination of parameters is 1.0 - in other words, memoization will do exactly nothing except add overhead. Memoization can only speed up your code if the function gets called with the same parameters repeatedly, and only when that happens frequently enough to offset the overhead cost of memoization.
In this case, you're adding the overhead, but since the function is never called twice with the same parameters, there is no benefit.
And my test is being generous: it assumes that @cached will somehow cleverly figure out that two lists passed in have the same contents, for example, without incurring an impossible overhead - which I don't know that it does. So the test assumes the most favourable effectiveness of @cached, but to no avail.
More generally, it's safe to assume there is no magic sauce you can just throw at a program, without some analysis and careful application, to make it faster. If there were, the language or compiler would likely do it by default, or offer it as an easy option (for example when trading space for speed, as with memoization). You can of course get lucky and have the particular sauce you throw at it work in some cases, but even then it would probably pay to carefully analyse where it does the most good, or any good at all, instead of drowning your code in it.
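For contrast, here is an illustrative example (not from the original post) where memoization does pay off, because the same arguments recur over and over; it uses the standard library's functools.lru_cache:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Without the cache this is an exponential call tree; with it, each
    # distinct n is computed exactly once.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))  # returns instantly; the uncached version would effectively never finish

Note that the arguments here are plain integers, which are cheap to hash and compare, unlike the lists being passed around in pairing().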

Trying to simulate a jump process in discrete time

I was trying to simulate a jump process in python, defined as follows:
Actually, in order to simulate the process, I evaluate its value at each time t, building a "history" of the values the process has taken. This is done according to the following function, contained in a class:
def evaluate(self, t):
    try:
        return self.history[t], {}  # History of the process
    except:
        new_steps = t - len(self.history)  # Calculate how many steps must be evaluated in order to reach t
        values_to_sample = 1 + new_steps
        new_values = np.zeros(values_to_sample)
        temp = self.history[-1]  # Take last seen value
        for i in range(values_to_sample):
            temp = temp + self.A*( UNKNOWN TERM - temp ) + np.random.normal(scale=self.sigma)
            # UNKNOWN TERM is the element that I want to "simulate"
            new_values[i] = temp
        self.history = np.concatenate((self.history, new_values))
        return self.history[t], {}
The problem is that I do not know how to simulate the Yt term. To my understanding, what I should do is check at each new sample whether a jump has occurred, but I do not know how to do that. Can someone help me understand what I should do?
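For what it's worth, a minimal sketch of one common discrete-time approach (this assumes the missing Yt is a compound-Poisson-style jump term; the function and parameter names here are hypothetical, not from the original post): at each step, decide with some probability whether a jump fires and, if so, add a random jump size.

import numpy as np

def simulate_jump_term(n_steps, jump_prob=0.05, jump_scale=1.0, y0=0.0, seed=None):
    # jump_prob plays the role of lambda * dt for a small time step
    rng = np.random.default_rng(seed)
    y = np.empty(n_steps)
    current = y0
    for i in range(n_steps):
        if rng.random() < jump_prob:                 # did a jump occur this step?
            current += rng.normal(scale=jump_scale)  # random jump size
        y[i] = current
    return y

The resulting array could then stand in for the UNKNOWN TERM at each step, but whether that matches the intended definition depends on the process given in the original post.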

Create longestPossible(longest_possible in python) helper function that takes 1 integer argument which is a maximum length of a song in seconds

I am kind of new to coding, please help me out with this one with explanations:
songs is an array of objects which are formatted as follows:
{artist: 'Artist', title: 'Title String', playback: '04:30'}
You can expect playback value to be formatted exactly like above.
Output should be the title of the longest song from the database that matches the criterion of not being longer than the specified time. If there are no songs matching the criterion in the database, return false.
Either you could change playback so that, instead of a string, it's an integer (for instance, the length of the song in seconds) which you convert to a string for display and test from there, or, during the test, you could take playback and convert it to its length in seconds, like so:
def songLength(playback):
    seconds = playback.split(':')
    lengthOfSong = int(seconds[0]) * 60 + int(seconds[1])
    return lengthOfSong
This will give the following result:
>>> playback = '04:30'
>>> songLength(playback)
270
I'm not as familiar with the particular data structure you're using, but if you can iterate over it, you could do something like this:
def longestPossible(array, maxLength):
    longest = 0
    songName = ''
    for song in array:
        lenSong = songLength(song.playback)  # I'm formatting song's playback like this because I'm not sure how you're going to be accessing it.
        if maxLength >= lenSong and (maxLength - lenSong) < (maxLength - longest):
            longest = lenSong
            songName = song.title
    if longest != 0:
        return songName
    else:
        return ''  # Empty strings will evaluate to False.
I haven't tested this, but I think this should at least get you on the right track. There are more Pythonic ways of doing this, so never stop improving your code. Good luck!
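A quick usage sketch (illustrative only; the song objects here are built with types.SimpleNamespace purely so that the song.playback and song.title attribute access above works, and your real data structure may differ):

from types import SimpleNamespace

songs = [
    SimpleNamespace(artist='A', title='Short one', playback='02:10'),
    SimpleNamespace(artist='B', title='Long one', playback='04:30'),
    SimpleNamespace(artist='C', title='Too long', playback='07:00'),
]

print(longestPossible(songs, 300))  # 'Long one' (270 seconds is the longest that fits under 300)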

How to speed up Python string matching code

I have this code which computes the Longest Common Subsequence between random strings to see how accurately one can reconstruct an unknown region of the input. To get good statistics I need to iterate it many times, but my current Python implementation is far too slow. Even using PyPy it currently takes 21 seconds to run once, and I would ideally like to run it hundreds of times.
#!/usr/bin/python
import random
import itertools

# Test to see how many different unknowns are compatible with a set of LCS answers.

def lcs(x, y):
    n = len(x)
    m = len(y)
    # table is the dynamic programming table: rows indexed by i = 0..n, columns by j = 0..m
    table = [list(itertools.repeat(0, m+1)) for _ in xrange(n+1)]
    for i in range(n+1):        # i = 0, 1, ..., n
        for j in range(m+1):    # j = 0, 1, ..., m
            if i == 0 or j == 0:
                table[i][j] = 0
            elif x[i-1] == y[j-1]:
                table[i][j] = table[i-1][j-1] + 1
            else:
                table[i][j] = max(table[i-1][j], table[i][j-1])
    # Now table[n][m] is the length of the LCS of x and y.
    return table[n][m]

def lcses(pattern, text):
    return [lcs(pattern, text[i:i+2*l]) for i in xrange(0, l)]

l = 15

# Create the pattern
pattern = [random.choice('01') for i in xrange(2*l)]

# Create the text: start, end and unknown.
start = [random.choice('01') for i in xrange(l)]
end = [random.choice('01') for i in xrange(l)]
unknown = [random.choice('01') for i in xrange(l)]

lcslist = lcses(pattern, start+unknown+end)

count = 0
for test in itertools.product('01', repeat=l):
    test = list(test)
    testlist = lcses(pattern, start+test+end)
    if testlist == lcslist:
        count += 1
print count
I tried converting it to numpy but I must have done it badly, as it actually ran more slowly. Can this code be sped up significantly somehow?
Update: following a comment below, it would be better if lcses used a recurrence directly that gave the LCS between pattern and all sublists of text of the same length. Is it possible to modify the classic dynamic programming LCS algorithm to do this?
The recurrence table table is being recomputed 15 times on every call to lcses(), when it depends only on m and n, where m has a maximum value of 2*l and n is at most 3*l.
If your program computed table only once, that would be dynamic programming, which it is not currently. A Python idiom for this would be:
table = None

def use_lcs_table(m, n, l):
    global table
    if table is None:
        table = lcs(2*l, 3*l)
    return table[m][n]
Except that using a class instance would be cleaner and more extensible than a global table declaration. But this gives you an idea of why it's taking so much time.
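A rough sketch of that class-based alternative (illustrative names only; build_table stands for whatever actually computes the table, which the snippet above only hints at):

class LcsTableCache(object):
    """Build the table lazily on the first lookup and reuse it afterwards."""
    def __init__(self, build_table):
        self._build_table = build_table  # zero-argument callable returning the table
        self._table = None

    def lookup(self, m, n):
        if self._table is None:
            self._table = self._build_table()  # computed only once
        return self._table[m][n]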
Added in reply to comment:
Dynamic programming is an optimization that trades extra space for less time. In your example you appear to be doing a table pre-computation in lcs(), but you build the whole table on every single call and then throw it away. I don't claim to understand the algorithm you are trying to implement, but the way you have coded it, it either:
Has no recurrence relation, thus no grounds for DP optimization, or
Has a recurrence relation, the implementation of which you bungled.
