I have this code which computes the Longest Common Subsequence between random strings to see how accurately one can reconstruct an unknown region of the input. To get good statistics I need to iterate it many times, but my current Python implementation is far too slow. Even using PyPy, it currently takes 21 seconds to run once, and I would ideally like to run it hundreds of times.
#!/usr/bin/python
import random
import itertools
#test to see how many different unknowns are compatible with a set of LCS answers.
def lcs(x, y):
n = len(x)
m = len(y)
# table is the dynamic programming table
    table = [[0] * (m + 1) for _ in xrange(n + 1)]  # (n+1) rows by (m+1) columns, matching table[i][j] below
for i in range(n+1): # i=0,1,...,n
for j in range(m+1): # j=0,1,...,m
if i == 0 or j == 0:
table[i][j] = 0
elif x[i-1] == y[j-1]:
table[i][j] = table[i-1][j-1] + 1
else:
table[i][j] = max(table[i-1][j], table[i][j-1])
    # Now table[n][m] is the length of the LCS of x and y.
return table[n][m]
def lcses(pattern, text):
return [lcs(pattern, text[i:i+2*l]) for i in xrange(0,l)]
l = 15
#Create the pattern
pattern = [random.choice('01') for i in xrange(2*l)]
#create text start and end and unknown.
start = [random.choice('01') for i in xrange(l)]
end = [random.choice('01') for i in xrange(l)]
unknown = [random.choice('01') for i in xrange(l)]
lcslist = lcses(pattern, start+unknown+end)
count = 0
for test in itertools.product('01',repeat = l):
test=list(test)
testlist = lcses(pattern, start+test+end)
if (testlist == lcslist):
count += 1
print count
I tried converting it to numpy but I must have done it badly as it actually ran more slowly. Can this code be sped up a lot somehow?
Update: following a comment below, it would be better if lcses used a recurrence directly which gave the LCS between pattern and all sublists of text of the same length. Is it possible to modify the classic dynamic programming LCS algorithm somehow to do this?
The recurrence table table is being recomputed 15 times on every call to lcses(), even though it depends only on m and n, where m has a maximum value of 2*l and n is at most 3*l.
If your program computed table only once, it would be dynamic programming, which it currently is not. A Python idiom for this would be:
table = None
def use_lcs_table(m, n, l):
global table
if table is None:
        table = lcs(2*l, 3*l)  # sketch: lcs would need to return the whole table, not just its length
return table[m][n]
Using a class instance would be cleaner and more extensible than a global table declaration, but this gives you an idea of why it's taking so much time.
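For instance, a minimal sketch of the class-based variant (build_table is a hypothetical helper standing in for a version of lcs that returns the whole table rather than just its length):
class LcsCache(object):
    """Builds the full-size table once, on first use, and caches it."""
    def __init__(self, l):
        self.l = l
        self._table = None
    def lookup(self, m, n):
        if self._table is None:
            self._table = build_table(2 * self.l, 3 * self.l)  # hypothetical builder
        return self._table[m][n]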
Added in reply to comment:
Dynamic programming is an optimization that trades extra space for less time. In your example you appear to be doing a table pre-computation in lcs(), but you build the whole table on every single call and then throw it away. I don't claim to understand the algorithm you are trying to implement, but the way you have it coded, it either:
Has no recurrence relation, thus no grounds for DP optimization, or
Has a recurrence relation, the implementation of which you bungled.
I am trying to implement regulation 9.3 from the FIDE Chess Olympiad pairing system.
Below is the script I'm trying to run. When I comment out the @cached line, it actually runs faster. I want to use this function for even values of n up to ~100.
import itertools
from copy import deepcopy
from memoization import cached
@cached
def pairing(n, usedTeams = [], teams = None, reverse = False):
"""
Returns the pairings of a list of teams based on their position in the pool.
Arguments:
n = number of Teams
usedTeams = a parameter used in recursion to carry the found matches to the end of the recursion (i.e. a leaf node)
teams = used in recursion ^^
reverse = if you need to prioritize finding a pairing for the lowest rated team
Returns:
A list of lists of match pairings
"""
# print('trying to pair', n, ' teams')
# if n > 10:
# return None
if teams is None:
teams = list(range(0,n))
global matches
matches = []
if reverse == True:
teams.reverse()
usedTeams = deepcopy(usedTeams)
oppTeams = []
if len(teams) == 2:
usedTeams.append([teams[0], teams[1]])
matches.append(usedTeams)
elif len(teams) > 2:
team = teams[0]
oppTeams = [teams[i] for i in itertools.chain(range(round(n/2), n), range(round(n/2)-1,0,-1))]
currUsed = deepcopy(usedTeams)
for opp in oppTeams:
newUsed = currUsed + [[team, opp]]
if len(oppTeams) > 1:
tmpTeams = [t for t in teams if t not in [team, opp]]
pairing(len(tmpTeams), newUsed, tmpTeams)
return matches
import time
start = time.process_time()
pairing(12, [], None)
print(time.process_time() - start)
Any tips for making this run faster, or using memoization differently?
I modified your code to find out:
import itertools
from copy import deepcopy
from memoization import cached
# set up a records of call parameters
from collections import defaultdict
calls = defaultdict(int)
@cached
def pairing(n, usedTeams=[], teams=None, reverse=False):
# count this call
calls[(
n,
tuple(tuple(t) for t in usedTeams) if usedTeams is not None else None,
tuple(teams) if teams is not None else None,
reverse
)] += 1
... # your same code here, left out for brevity
import time
start = time.process_time()
pairing(12, [], None)
print(time.process_time() - start)
# print the average number of calls for any parameter combination
print(sum(calls.values()) / len(calls))
Output:
0.265625
1.0
The average number of calls using any combination of parameters is 1.0 - in other words, memoization will do exactly nothing, except add overhead. Memoization can only speed up your code if the function gets called with the same parameters repeatedly, and only when that's sufficiently frequent to offset the overhead cost of memoization.
In this case, you're adding the overhead, but since the function is never called with the same parameters, not even once, there is no benefit.
And my test is being generous: it assumes that @cached will somehow cleverly figure out that two lists passed in have the same contents, for example, without incurring an impossible overhead, which I don't know that it does. So the test assumes the most favourable effectiveness of @cached, but to no avail.
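For contrast, here is a toy case where memoization does pay off, because the same arguments recur many times (this sketch uses functools.lru_cache from the standard library, but @cached from the memoization package would behave similarly):
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # The naive recursion requests the same fib(k) over and over,
    # so caching turns exponential time into linear time.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(90))  # instant with the cache; practically never finishes without it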
More generally, it's safe to assume there's no magic sauce you can just throw at a program to make it faster without some analysis and careful application. If there were, the language or compiler would likely apply it by default, or offer it as an easy option (for example when trading space for speed, as with memoization). You can of course get lucky and have the particular sauce you throw at it work in some case, but even then it would probably pay to carefully analyse where it does the most good, or any good at all, instead of drowning your code in it.
I want to solve the MIME type challenge on codingame.com. My code passes all the tests but not the optimisation test.
I tried to remove all useless work, like parsing to string, but I think the problem is in my overall approach.
import sys
import math
# Auto-generated code below aims at helping you parse
# the standard input according to the problem statement.
n = int(input()) # Number of elements which make up the association table.
q = int(input()) # Number Q of file names to be analyzed.
dico = {}
# My function
def check(word):
for item in dico:
if(word[-len(item)-1:].upper() == "."+item.upper()):
return(dico[item])
return("UNKNOWN")
for i in range(n):
# ext: file extension
# mt: MIME type.
ext, mt = input().split()
dico[ext] = mt
for i in range(q):
    fname = input()
    print(check(fname))
# Write an action using print
# To debug: print("Debug messages...", file=sys.stderr)
#print("Debug message...", file=sys.stderr)
Failure
Process has timed out. This may mean that your solution is not optimized enough to handle some cases.
This is the right idea, but one detail appears to be destroying the performance. The problem is the line for item in dico:, which unnecessarily loops over every entry in the dictionary. This is a linear search O(n), checking for the target item-by-item. But this pretty much defeats the purpose of the dictionary data structure, which is to offer constant-time O(1) lookups. "Constant time" means that no matter how big the dictionary gets, the time it takes to find an item is always the same (thanks to hashing).
To draw a metaphor, imagine you're looking for a spoon in your kitchen. If you know where all the utensils, appliances and cookware are ahead of time, you don't need to look in every drawer to find the utensils. Instead, you just go straight to the drawer containing the spoon you want, and it's one shot!
On the other hand, if you're in someone else's kitchen, it can be difficult to find a spoon. You have to start at one end of the cupboard and check every drawer until you find the utensils. In the worst-case, you might get unlucky and have to check every drawer before you find the utensil drawer.
Back to the code, the above snippet is using the latter approach, but we're dealing with trying to find something in 10k unfamiliar kitchens each with 10k drawers. Pretty slow, right?
If you can adjust the solution to check the dictionary in constant time, without a loop, then you can handle n = 10000 and q = 10000 without having to make q * n iterations (you can do it in q iterations instead--so much faster!).
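For example, a constant-time check could split off the extension after the last dot and do a single dictionary lookup (a sketch, assuming the keys in dico were stored lowercased):
def check(word):
    if "." not in word:
        return "UNKNOWN"
    extension = word.rsplit(".", 1)[1].lower()  # text after the last '.'
    return dico.get(extension, "UNKNOWN")       # one O(1) hash lookup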
Thank you for your help,
I figured out the solution.
n = int(input()) # Number of elements which make up the association table.
q = int(input()) # Number Q of file names to be analyzed.
dico = {}
# My function
def check(word):
if("." in word):
n = len(word)-(word.rfind(".")+1)
extension = word[-n:].lower()
if(extension in dico):
return(dico[extension])
return("UNKNOWN")
for i in range(n):
# ext: file extension
# mt: MIME type.
ext, mt = input().split()
dico[ext.lower()] = mt
for i in range(q):
fname = input()
print(check(fname))
Your explanation was clear :D
Thank you
I have made two Python functions below, one for sequential (linear) search and the other for binary search.
I want to do these 3 things for each size value in the given list:
generate a list of random integer values (between 1 and 10,000,000) for a given list size
run a sequential search for -1 on the list and record the time elapsed by sequential search
run a binary search for -1 on the sorted list (after sorting the list), and record the time elapsed by binary search
What I have done is:
def sequentialSearch(alist, item):
pos = 0
found = False
while pos < len(alist) and not found:
if alist[pos] == item:
found = True
else:
pos = pos + 1
return found
def binSearch(list, target):
list.sort()
return binSearchHelper(list, target, 0, len(list) - 1)
def binSearchHelper(list, target, left, right):
if left > right:
return False
middle = (left + right)//2
if list[middle] == target:
return True
elif list[middle] > target:
return binSearchHelper(list, target, left, middle - 1)
else:
return binSearchHelper(list, target, middle + 1, right)
import random
import time
list_sizes = [10,100,1000,10000,100000,1000000]
for size in list_sizes:
list = []
for x in range(size):
list.append(random.randint(1,10000000))
sequential_search_start_time = time.time()
sequentialSearch(list,-1)
sequential_search_end_time = time.time()
print("Time taken by linear search is = ",(sequential_search_end_time-sequential_search_start_time))
binary_search_start_time = time.time()
binSearch(list,-1)
binary_search_end_time = time.time()
print("Time taken by binary search is = ",(binary_search_end_time-binary_search_start_time))
print("\n")
The output I am getting shows the time consumed by binary search as more than the time consumed by linear search.
As we know, binary search is much faster than linear search.
So I just want to know why it is showing the time consumed by binary search as more than the time consumed by linear search?
1) You need to account for the sorting time. Binary search works only on sorted lists, so the sorting step brings the overall time complexity to O(n log n). In your case you are sorting after the timer has started, so the measured time will be higher.
2) You are searching for an element that doesn't exist in the list, i.e. -1, which is not the average case for binary search. In its worst case, binary search has to make all of its jumps just to never find the element.
3) Please do not use list as a variable name; it is a Python built-in and you are shadowing it. Use something else.
Now, if you sort the list without timing it, the results change drastically. Here are mine:
Time taken by linear search is = 9.059906005859375e-06
Time taken by binary search is = 8.58306884765625e-06
Time taken by linear search is = 1.2159347534179688e-05
Time taken by binary search is = 4.5299530029296875e-06
Time taken by linear search is = 0.00011110305786132812
Time taken by binary search is = 5.9604644775390625e-06
Time taken by linear search is = 0.0011129379272460938
Time taken by binary search is = 8.344650268554688e-06
Time taken by linear search is = 0.011270761489868164
Time taken by binary search is = 1.5497207641601562e-05
Time taken by linear search is = 0.11133551597595215
Time taken by binary search is = 1.7642974853515625e-05
What I've done is just sort the list before it was timed. Way, way better than if you had to sort it and search it all inside the timed section.
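For reference, a sketch of the adjusted harness, reusing sequentialSearch and binSearchHelper from the question (the sort happens outside the timed region, and the list is no longer named list):
import random
import time

list_sizes = [10, 100, 1000, 10000, 100000, 1000000]
for size in list_sizes:
    values = [random.randint(1, 10000000) for _ in range(size)]

    start = time.time()
    sequentialSearch(values, -1)
    print("Time taken by linear search is =", time.time() - start)

    values.sort()  # sort BEFORE starting the binary-search timer
    start = time.time()
    binSearchHelper(values, -1, 0, len(values) - 1)
    print("Time taken by binary search is =", time.time() - start)
    print()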
I am running an algorithm which reads an excel document by rows, and pushes the rows to a SQL Server, using Python. I would like to print some sort of progression through the loop. I can think of two very simple options and I would like to know which is more lightweight and why.
Option A:
for x in xrange(1, sheet.nrows):
print x
cur.execute() # pushes to sql
Option B:
for x in xrange(1, sheet.nrows):
if x % some_check_progress_value == 0:
print x
cur.execute() # pushes to sql
I have a feeling that the second one would be more efficient but only for larger scale programs. Is there any way to calculate/determine this?
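One way to determine it empirically (a sketch using timeit from the standard library; replace the loop bodies with your real work):
import timeit

def option_a(n=10000):
    for x in range(n):
        print(x)            # print on every iteration

def option_b(n=10000, step=100):
    for x in range(n):
        if x % step == 0:   # print on every step-th iteration only
            print(x)

# Printing dominates, so expect option_b to be roughly step times cheaper.
print(timeit.timeit(option_a, number=1))
print(timeit.timeit(option_b, number=1))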
I'm a newbie, so I can't comment. An "answer" might be overkill, but it's all I can do for now.
My favorite thing for this is tqdm. It's minimally invasive, both code-wise and output-wise, and it gets the job done.
I am one of the developers of tqdm, a Python progress bar that tries to be as efficient as possible while providing as many automated features as possible.
The biggest performance sink we had was indeed I/O: printing to the console/file/whatever.
But if your loop is tight (more than 100 iterations per second), then it's useless to print every update; you could just as well print only 1/10 of the updates and the user would see no difference, while your bar would have 10 times less overhead.
To fix that, at first we added a mininterval parameter which updated the display only every x seconds (0.1 seconds by default; the human eye cannot really perceive anything faster than that). Something like this:
import time
def my_bar(iterator, mininterval=0.1):
counter = 0
last_print_t = 0
for item in iterator:
if (time.time() - last_print_t) >= mininterval:
last_print_t = time.time()
print_your_bar_update(counter)
counter += 1
This will mostly fix your issue, as your bar will have a constant display overhead which becomes more and more negligible as you iterate over bigger iterators.
If you want to go further in the optimization, time.time() is also an I/O operation and thus has a cost greater than simple Python statements. To avoid that, you want to minimize the calls you do to time.time() by introducing another variable: miniters, which is the minimum number of iterations you want to skip before even checking the time:
import time
def my_bar(iterator, mininterval=0.1, miniters=10):
counter = 0
last_print_t = 0
last_print_counter = 0
for item in iterator:
if (counter - last_print_counter) >= miniters:
if (time.time() - last_print_t) >= mininterval:
last_print_t = time.time()
last_print_counter = counter
print_your_bar_update(counter)
counter += 1
You can see that miniters is similar to your Option B modulus solution, but it's better fitted as an added layer over time because time is more easily configured.
With these two parameters, you can manually finetune your progress bar to make it the most efficient possible for your loop.
However, miniters (or a modulus) is tricky to get to work for everyone without manual finetuning; you need to make good assumptions and use clever tricks to automate that finetuning. This is one of the major pieces of ongoing work on tqdm. Basically, what we do is try to calculate miniters so that it corresponds to mininterval, so that checking the time isn't even needed anymore. This automagic setting kicks in after mininterval gets triggered, something like this:
from __future__ import division
import time
def my_bar(iterator, mininterval=0.1, miniters=10, dynamic_miniters=True):
counter = 0
last_print_t = 0
last_print_counter = 0
for item in iterator:
if (counter - last_print_counter) >= miniters:
cur_time = time.time()
if (cur_time - last_print_t) >= mininterval:
if dynamic_miniters:
# Simple rule of three
delta_it = counter - last_print_counter
delta_t = cur_time - last_print_t
miniters = delta_it * mininterval / delta_t
last_print_t = cur_time
last_print_counter = counter
print_your_bar_update(counter)
counter += 1
There are various ways to compute miniters automatically, but usually you want to update it to match mininterval.
If you are interested in digging further, you can check the dynamic_miniters internal parameter, maxinterval, and the experimental monitoring thread of the tqdm project.
Using the modulus check (counter % N == 0) is almost free compared to print, and it's a great solution if you run a high-frequency iteration (logging a lot).
Especially if you do not need to print on every iteration but still want some feedback along the way.
I am a fairly new programmer who has been learning Python for a few months. For the last 2 weeks, I have been writing a script to search for permutations of numbers that make magic squares.
Finally I succeeded in finding all 880 of the 4x4 magic square number sets within 30 seconds. After that I made a different Perimeter Magic Square program. It finds more than 10,000,000 permutations, so I want to store them to files part by part. The problem is that my program doesn't use all of my CPU: while it is storing partial data to a file, it stops searching for new number sets. I would like one process to keep searching while the others store the found data to files.
The following is of the similar structure to my magic square program.
while True:
print('How many digits do you want? (more than 20): ', end='')
ansr = input()
if ansr.isdigit() and int(ansr) > 20:
ansr = int(ansr)
break
else:
continue
fileNum = 0
itemCount = 0
def fileMaker():
global fileNum, itemCount
tempStr = ''
for i in permutationList:
itemCount += 1
tempStr += str(sum(i[:3])) + ' : ' + str(i) + ' : ' + str(itemCount) + '\n'
fileNum += 1
file = open('{0} Permutations {1:03}.txt'.format(ansr, fileNum), 'w')
file.write(tempStr)
file.close()
numList = [i for i in range(1, ansr+1)]
permutationList = []
itemCount = 0
def makePermutList(numList, ansr):
global permutationList
for i in numList:
numList1 = numList[:]
numList1.remove(i)
for ii in numList1:
numList2 = numList1[:]
numList2.remove(ii)
for iii in numList2:
numList3 = numList2[:]
numList3.remove(iii)
for iiii in numList3:
numList4 = numList3[:]
numList4.remove(iiii)
for v in numList4:
permutationList.append([i, ii, iii, iiii, v])
if len(permutationList) == 200000:
print(permutationList[-1])
fileMaker()
permutationList = []
fileMaker()
makePermutList(numList, ansr)
I added from multiprocessing import Pool at the top and replaced the two fileMaker() calls at the end with the following:
if __name__ == '__main__':
workers = Pool(processes=2)
workers.map(fileMaker, ())
The result? Oh no. It just works awkwardly. For now, multiprocessing looks too difficult for me.
Anybody, please, teach me something. How should my code be modified?
Well, let me address some things that are bugging me before getting to your actual question.
numList = [i for i in range(1, ansr+1)]
I know list comprehensions are cool, but please just do list(range(1, ansr+1)) if you need the iterable to be a list (which you probably don't need, but I digress).
def makePermutList(numList, ansr):
...
This is quite the hack. Is there a reason you can't use itertools.permutations(numList,n)? It's certainly going to be faster, and friendlier on memory.
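For example, the whole nested remove-and-copy construction collapses into a single call (5 matches the depth of the hand-rolled loops):
import itertools

# Lazily yields every ordered 5-element arrangement of 1..ansr,
# without copying a list at each level of recursion.
for p in itertools.permutations(range(1, ansr + 1), 5):
    permutationList.append(list(p))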
Lastly, answering your question: if you are looking to improve i/o performance, the last thing you should do is make it multithreaded. I don't mean you shouldn't do it, I mean that it should literally be the last thing you do. Refactor/improve other things first.
You need to take all of that top-level code that uses globals, apply the backspace key to it, and rewrite functions that pass data around properly. Then you can think about using threads. I would personally use from threading import Thread and manually spawn Threads to do each unit of I/O rather than using multiprocessing.
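A minimal sketch of that shape (write_chunk is a hypothetical helper, and the running item counter from the original is dropped for brevity): the search loop hands each finished batch to a worker thread and immediately carries on.
import itertools
from threading import Thread

def write_chunk(chunk, file_num, ansr):
    # One unit of I/O: format a batch of permutations and write it to its own file.
    with open('{0} Permutations {1:03}.txt'.format(ansr, file_num), 'w') as f:
        f.writelines('{0} : {1}\n'.format(sum(p[:3]), list(p)) for p in chunk)

def search(ansr, chunk_size=200000):
    threads, chunk, file_num = [], [], 0
    for p in itertools.permutations(range(1, ansr + 1), 5):
        chunk.append(p)
        if len(chunk) == chunk_size:
            file_num += 1
            t = Thread(target=write_chunk, args=(chunk, file_num, ansr))
            t.start()      # the write happens in the background
            threads.append(t)
            chunk = []     # rebind so the thread owns the old batch
    if chunk:
        file_num += 1
        t = Thread(target=write_chunk, args=(chunk, file_num, ansr))
        t.start()
        threads.append(t)
    for t in threads:      # wait for all writers before returning
        t.join()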