MIME type optimisation in Python

I want to solve the MIME type challenge on CodinGame. My code passes all the tests except the optimisation test.
I tried to remove all unnecessary work, like converting to string, but I think the problem is in the way I approached it.
import sys
import math

# Auto-generated code below aims at helping you parse
# the standard input according to the problem statement.
n = int(input())  # Number of elements which make up the association table.
q = int(input())  # Number Q of file names to be analyzed.

dico = {}

# My function
def check(word):
    for item in dico:
        if word[-len(item)-1:].upper() == "." + item.upper():
            return dico[item]
    return "UNKNOWN"

for i in range(n):
    # ext: file extension
    # mt: MIME type.
    ext, mt = input().split()
    dico[ext] = mt

for i in range(q):
    fname = input()
    print(check(fname))

# To debug: print("Debug messages...", file=sys.stderr)
Failure
Process has timed out. This may mean that your solution is not optimized enough to handle some cases.

This is the right idea, but one detail appears to be destroying the performance. The problem is the line for item in dico:, which unnecessarily loops over every entry in the dictionary. This is a linear search O(n), checking for the target item-by-item. But this pretty much defeats the purpose of the dictionary data structure, which is to offer constant-time O(1) lookups. "Constant time" means that no matter how big the dictionary gets, the time it takes to find an item is always the same (thanks to hashing).
To draw a metaphor, imagine you're looking for a spoon in your kitchen. If you know ahead of time where all the utensils, appliances and cookware are, you don't need to look in every drawer to find the utensils. Instead, you go straight to the drawer containing the spoon you want, and you find it in one shot!
On the other hand, if you're in someone else's kitchen, it can be difficult to find a spoon. You have to start at one end of the cupboard and check every drawer until you find the utensils. In the worst case, you might get unlucky and have to check every drawer before you find the utensil drawer.
Back to the code: the snippet above uses the latter approach, except it's searching 10k unfamiliar kitchens, each with 10k drawers. Pretty slow, right?
If you can adjust the solution to check the dictionary in constant time, without a loop, then you can handle n = 10000 and q = 10000 without having to make q * n iterations (you can do it in q iterations instead--so much faster!).
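To make that concrete, here is a minimal sketch of the constant-time version (the dictionary contents are made up for illustration; the asker's own fix below does the same thing):

dico = {"html": "text/html", "png": "image/png"}

def check(fname):
    # split on the last "." only; a name without "." yields a single part
    parts = fname.rsplit(".", 1)
    if len(parts) == 2:
        ext = parts[1].lower()
        if ext in dico:  # one O(1) hash lookup instead of a loop over all entries
            return dico[ext]
    return "UNKNOWN"

print(check("photo.PNG"))  # image/png
print(check("README"))     # UNKNOWN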

Thank you for your help,
I figured out the solution.
n = int(input())  # Number of elements which make up the association table.
q = int(input())  # Number Q of file names to be analyzed.

dico = {}

# My function
def check(word):
    if "." in word:
        n = len(word) - (word.rfind(".") + 1)
        extension = word[-n:].lower()
        if extension in dico:
            return dico[extension]
    return "UNKNOWN"

for i in range(n):
    # ext: file extension
    # mt: MIME type.
    ext, mt = input().split()
    dico[ext.lower()] = mt

for i in range(q):
    fname = input()
    print(check(fname))
Your explanation was clear :D
Thank you

Related

Trying to speed up a recursive function, but memoization is making it take longer

I am trying to implement regulation 9.3 from the FIDE Chess Olympiad pairing system.
Below is the script I'm trying to run. When I comment out the @cached line, it actually runs faster. I want to use this function for even values of n up to ~100.
import itertools
from copy import deepcopy
from memoization import cached

@cached
def pairing(n, usedTeams=[], teams=None, reverse=False):
    """
    Returns the pairings of a list of teams based on their index in their position in the pool.

    Arguments:
    n = number of Teams
    usedTeams = a parameter used in recursion to carry the found matches to the end of the recursion (i.e. a leaf node)
    teams = used in recursion ^^
    reverse = if you need to prioritize finding a pairing for the lowest rated team

    Returns:
    A list of lists of match pairings
    """
    # print('trying to pair', n, ' teams')
    # if n > 10:
    #     return None
    if teams is None:
        teams = list(range(0, n))
        global matches
        matches = []
        if reverse == True:
            teams.reverse()
    usedTeams = deepcopy(usedTeams)
    oppTeams = []
    if len(teams) == 2:
        usedTeams.append([teams[0], teams[1]])
        matches.append(usedTeams)
    elif len(teams) > 2:
        team = teams[0]
        oppTeams = [teams[i] for i in itertools.chain(range(round(n/2), n), range(round(n/2)-1, 0, -1))]
        currUsed = deepcopy(usedTeams)
        for opp in oppTeams:
            newUsed = currUsed + [[team, opp]]
            if len(oppTeams) > 1:
                tmpTeams = [t for t in teams if t not in [team, opp]]
                pairing(len(tmpTeams), newUsed, tmpTeams)
    return matches

import time
start = time.process_time()
pairing(12, [], None)
print(time.process_time() - start)
Any tips for making this run faster, or using memoization differently?
I modified your code to find out:
import itertools
from copy import deepcopy
from memoization import cached

# set up a record of call parameters
from collections import defaultdict
calls = defaultdict(int)

@cached
def pairing(n, usedTeams=[], teams=None, reverse=False):
    # count this call
    calls[(
        n,
        tuple(tuple(t) for t in usedTeams) if usedTeams is not None else None,
        tuple(teams) if teams is not None else None,
        reverse
    )] += 1
    ...  # your same code here, left out for brevity

import time
start = time.process_time()
pairing(12, [], None)
print(time.process_time() - start)

# print the average number of calls for any parameter combination
print(sum(calls.values()) / len(calls))
Output:
0.265625
1.0
The average number of calls using any combination of parameters is 1.0 - in other words, memoization will do exactly nothing, except add overhead. Memoization can only speed up your code if the function gets called with the same parameters repeatedly, and only when that's sufficiently frequent to offset the overhead cost of memoization.
In this case, you're adding the overhead, but since the function is never called with the same parameters twice, there is no benefit.
And my test is being generous: it assumes @cached can somehow cleverly figure out that two lists passed in have the same contents, without incurring an impossible overhead, which I don't know that it does. So the test assumes the most favourable effectiveness of @cached, but to no avail.
More generally, it's safe to assume there's no magic sauce you can throw at a program to make it faster without some analysis and careful application. If there were, the language or compiler would likely do it by default, or offer it as an easy option (for example when trading space for speed, as with memoization). You can of course get lucky and have the particular sauce you throw at it work in some case, but even then it would probably pay to analyse carefully where it does the most good, or any good at all, instead of drowning your code in it.
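For contrast, here is a minimal sketch of a case where memoization does pay off, because the same arguments recur constantly (this uses the standard library's functools.lru_cache rather than the memoization package):

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # fib(n-1) and fib(n-2) overlap heavily, so almost every call
    # after the first few is answered straight from the cache
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))  # near-instant with the cache; effectively infeasible without it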

Update: Python average income reading and writing files

I was writing code to find the average household income, and how many families are below the poverty line.
This is my code so far:
def povertyLevel():
    inFile = open('program10.txt', 'r')
    outFile = open('program10-out.txt', 'w')
    outFile.write(str("%12s %12s %15s\n" % ("Account #", "Income", "Members")))
    lineRead = inFile.readline()        # Read first record
    while lineRead != '':               # While there are more records
        words = lineRead.split()        # Split the record into substrings
        acctNum = int(words[0])         # Convert first substring to integer
        annualIncome = float(words[1])  # Convert second substring to float
        members = int(words[2])         # Convert third substring to integer
        outFile.write(str("%10d %15.2f %10d\n" % (acctNum, annualIncome, members)))
        lineRead = inFile.readline()    # Read next record
    inFile.close()                      # Close file

# Call the main function.
povertyLevel()
I am trying to find the average of annualIncome. What I tried was
avgIncome = (sum(annualIncome)/len(annualIncome))
outFile.write(avgIncome)
I did this inside the while loop; however, it gave me an error:
avgIncome = (sum(annualIncome)/len(annualIncome))
TypeError: 'float' object is not iterable
Currently I am trying to find which households exceed the average income.
The computation of avgIncome expects a sequence (such as a list) (thanks for the correction, Magenta Nova), but its argument annualIncome is a float:
annualIncome = float(words[1])
It seems to me you want to build up a list:
allIncomes = []
while lineRead != '':
    ...
    allIncomes.append(annualIncome)
averageInc = avgIncome(allIncomes)
(Note that the avgIncome call sits one indentation level out, i.e. after the loop has finished.)
Also, once you get this working, I highly recommend a trip over to https://codereview.stackexchange.com/. You could get a lot of feedback on ways to improve this.
Edit:
In light of your edits, my advice still stands. You need to first compute the average before you can do comparisons. Once you have the average, you will need to loop over the data again to compare each income. Note: I advise saving the data somehow for the second loop, instead of reparsing the file. (You may even wish to separate reading the data from computing the average entirely.) That might best be accomplished with a new object or a namedtuple or a dict.
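For example, a minimal two-pass sketch along those lines (Python 3; the file format is taken from the question, the rest is illustrative):

records = []
with open('program10.txt', 'r') as inFile:
    for line in inFile:
        words = line.split()
        # keep each record as an (account, income, members) tuple
        records.append((int(words[0]), float(words[1]), int(words[2])))

avgIncome = sum(income for _, income, _ in records) / len(records)

# second pass over the saved data, no reparsing of the file needed
aboveAverage = [acct for acct, income, _ in records if income > avgIncome]
print("Average income: %.2f" % avgIncome)
print("Households above average:", aboveAverage)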
sum() and len() both take an iterable as their argument; read the Python documentation for more on iterables. You are passing a float into them. What would it mean to get the sum, or the length, of a floating point number? Even thinking outside the world of coding, it's hard to make sense of that.
It seems like you need to review the basics of Python types.

What's wrong with my python multiprocessing code?

I am a fairly new programmer and have been learning Python for a few months. For the last two weeks I have been coding a script to search for permutations of numbers that make magic squares.
I finally succeeded in finding all 880 of the 4x4 magic square number sets within 30 seconds. After that I made a different perimeter magic square program. It finds more than 10,000,000 permutations, so I want to store them to files part by part. The problem is that my program doesn't use all my processes: while it is storing some partial data to a file, it stops searching for new number sets. I would like one CPU process to keep searching while the others store the found data to files.
The following has a similar structure to my magic square program.
while True:
    print('How many digits do you want? (more than 20): ', end='')
    ansr = input()
    if ansr.isdigit() and int(ansr) > 20:
        ansr = int(ansr)
        break
    else:
        continue

fileNum = 0
itemCount = 0

def fileMaker():
    global fileNum, itemCount
    tempStr = ''
    for i in permutationList:
        itemCount += 1
        tempStr += str(sum(i[:3])) + ' : ' + str(i) + ' : ' + str(itemCount) + '\n'
    fileNum += 1
    file = open('{0} Permutations {1:03}.txt'.format(ansr, fileNum), 'w')
    file.write(tempStr)
    file.close()

numList = [i for i in range(1, ansr+1)]
permutationList = []
itemCount = 0

def makePermutList(numList, ansr):
    global permutationList
    for i in numList:
        numList1 = numList[:]
        numList1.remove(i)
        for ii in numList1:
            numList2 = numList1[:]
            numList2.remove(ii)
            for iii in numList2:
                numList3 = numList2[:]
                numList3.remove(iii)
                for iiii in numList3:
                    numList4 = numList3[:]
                    numList4.remove(iiii)
                    for v in numList4:
                        permutationList.append([i, ii, iii, iiii, v])
                        if len(permutationList) == 200000:
                            print(permutationList[-1])
                            fileMaker()
                            permutationList = []
    fileMaker()

makePermutList(numList, ansr)
I added from multiprocessing import Pool at the top, and I replaced the two fileMaker() calls at the end with the following:
if __name__ == '__main__':
    workers = Pool(processes=2)
    workers.map(fileMaker, ())
The result? Oh no. It just works awkwardly. For now, multiprocessing looks too difficult for me.
Anybody, please, teach me something. How should my code be modified?
Well, let me address some things that are bugging me before getting to your actual question.
numList = [i for i in range(1, ansr+1)]
I know list comprehensions are cool, but please just do list(range(1, ansr+1)) if you need the iterable to be a list (which you probably don't need, but I digress).
def makePermutList(numList, ansr):
    ...
This is quite the hack. Is there a reason you can't use itertools.permutations(numList,n)? It's certainly going to be faster, and friendlier on memory.
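For instance, a minimal sketch of the itertools version (the input here is made up; in the real code numList comes from ansr):

import itertools

numList = list(range(1, 8))  # stand-in input

# equivalent to the five nested copy-and-remove loops:
# every ordered selection of 5 distinct elements
for perm in itertools.permutations(numList, 5):
    print(perm)  # e.g. (1, 2, 3, 4, 5)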
Lastly, answering your question: if you are looking to improve I/O performance, the last thing you should do is make it multithreaded. I don't mean you shouldn't do it; I mean that it should literally be the last thing you do. Refactor/improve other things first.
You need to take all of that top-level code that uses globals, apply the backspace key to it, and rewrite functions that pass data around properly. Then you can think about using threads. I would personally use from threading import Thread and manually spawn Threads to do each unit of I/O rather than using multiprocessing.
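As an illustration of that last point, here is a minimal sketch of a dedicated writer thread fed by a queue; all names here are hypothetical, not taken from the original program:

from queue import Queue
from threading import Thread

def writer():
    # dedicated I/O thread: it blocks on the queue, not on the search
    fileNum = 0
    while True:
        batch = work.get()
        if batch is None:  # sentinel: no more batches coming
            break
        fileNum += 1
        with open('permutations_%03d.txt' % fileNum, 'w') as f:
            f.write('\n'.join(map(str, batch)))

work = Queue()
t = Thread(target=writer)
t.start()

# the search loop keeps running while the writer drains the queue
for batch in ([1, 2, 3], [4, 5, 6]):  # stand-in for real search output
    work.put(batch)

work.put(None)  # tell the writer to finish
t.join()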

How to speed up Python string matching code

I have this code which computes the Longest Common Subsequence between random strings to see how accurately one can reconstruct an unknown region of the input. To get good statistics I need to iterate it many times, but my current Python implementation is far too slow. Even using PyPy, it currently takes 21 seconds to run once, and I would ideally like to run it hundreds of times.
#!/usr/bin/python
import random
import itertools

# test to see how many different unknowns are compatible with a set of LCS answers.
def lcs(x, y):
    n = len(x)
    m = len(y)
    # table is the dynamic programming table
    table = [list(itertools.repeat(0, n+1)) for _ in xrange(m+1)]
    for i in range(n+1):      # i=0,1,...,n
        for j in range(m+1):  # j=0,1,...,m
            if i == 0 or j == 0:
                table[i][j] = 0
            elif x[i-1] == y[j-1]:
                table[i][j] = table[i-1][j-1] + 1
            else:
                table[i][j] = max(table[i-1][j], table[i][j-1])
    # Now, table[n][m] is the length of LCS of x and y.
    return table[n][m]

def lcses(pattern, text):
    return [lcs(pattern, text[i:i+2*l]) for i in xrange(0, l)]

l = 15

# Create the pattern
pattern = [random.choice('01') for i in xrange(2*l)]

# Create text start, end and unknown.
start = [random.choice('01') for i in xrange(l)]
end = [random.choice('01') for i in xrange(l)]
unknown = [random.choice('01') for i in xrange(l)]

lcslist = lcses(pattern, start+unknown+end)

count = 0
for test in itertools.product('01', repeat=l):
    test = list(test)
    testlist = lcses(pattern, start+test+end)
    if testlist == lcslist:
        count += 1
print count
I tried converting it to NumPy, but I must have done it badly as it actually ran more slowly. Can this code be sped up a lot somehow?
Update: following a comment below, it would be better if lcses used a recurrence directly which gave the LCS between pattern and all sublists of text of the same length. Is it possible to modify the classic dynamic programming LCS algorithm somehow to do this?
The recurrence table (the variable table) is being recomputed 15 times on every call to lcses(), even though it only depends on m and n, where m has a maximum value of 2*l and n is at most 3*l.
If your program computed table only once, it would be dynamic programming, which it currently is not. A Python idiom for this would be:
table = None

def use_lcs_table(m, n, l):
    global table
    if table is None:
        table = lcs(2*l, 3*l)
    return table[m][n]
Except that using a class instance would be cleaner and more extensible than a global table declaration. But this gives you an idea of why it's taking so much time.
Added in reply to comment:
Dynamic programming is an optimization that trades extra space for less time. In your example you appear to be doing a table pre-computation in lcs(), but you build the whole list on every single call and then throw it away. I don't claim to understand the algorithm you are trying to implement, but the way you have it coded, it either:
1. has no recurrence relation, thus no grounds for DP optimization, or
2. has a recurrence relation, the implementation of which you bungled.

Python: creating a dictionary that writes high scores to a file

First: you don't have to code this for me, unless you're a super awesome nice guy. But since you're all great at programming and understand it so much better than me and all, it might just be easier (since it's probably not too many lines of code) than writing paragraph after paragraph trying to make me understand it.
So - I need to make a list of high scores that updates itself upon new entries. So here it goes:
First step - done
I have player-entered input, which has been taken as a data for a few calculations:
import time
import datetime
print "Current time:", time1.strftime("%d.%m.%Y, %H:%M")
time1 = datetime.datetime.now()
a = raw_input("Enter weight: ")
b = raw_input("Enter height: ")
c = a/b
Second step - making high score list
Here, I would need some sort of a dictionary or a thing that would read the previous entries and check if the score (c) is (at least) better than the score of the last one in "high scores", and if it is, it would prompt you to enter your name.
After you entered your name, it would post your name, your a, b, c, and time in a high score list.
This is what I came up with, and it definitely doesn't work:
list = [("CPU", 200, 100, 2, time1)]
player = "CPU"
a = 200
b = 100
c = 2
time1 = "20.12.2012, 21:38"
list.append((player, a, b, c, time1))
list.sort()
import pickle
scores = open("scores", "w")
pickle.dump(list[-5:], scores)
scores.close()
scores = open("scores", "r")
oldscores = pickle.load(scores)
scores.close()
print oldscores()
I know I did something terribly stupid, but anyways, thanks for reading this and I hope you can help me out with this one. :-)
First, don't use list as a variable name. It shadows the built-in list object. Second, avoid using just plain date strings, since it is much easier to work with datetime objects, which support proper comparisons and easy conversions.
Here is a full example of your code, with individual functions to help divide up the steps. I am trying not to use any more advanced modules or functionality, since you are obviously just learning:
import os
import datetime
import cPickle

# just a constant we can use to define our score file location
SCORES_FILE = "scores.pickle"

def get_user_data():
    time1 = datetime.datetime.now()
    print "Current time:", time1.strftime("%d.%m.%Y, %H:%M")
    a = None
    while True:
        a = raw_input("Enter weight: ")
        try:
            a = float(a)
        except:
            continue
        else:
            break
    b = None
    while True:
        b = raw_input("Enter height: ")
        try:
            b = float(b)
        except:
            continue
        else:
            break
    c = a/b
    return ['', a, b, c, time1]

def read_high_scores():
    # initialize an empty score file if it does
    # not exist already, and return an empty list
    if not os.path.isfile(SCORES_FILE):
        write_high_scores([])
        return []
    with open(SCORES_FILE, 'r') as f:
        scores = cPickle.load(f)
    return scores

def write_high_scores(scores):
    with open(SCORES_FILE, 'w') as f:
        cPickle.dump(scores, f)

def update_scores(newScore, highScores):
    # reuse an anonymous function for looking
    # up the `c` (4th item) score from the object
    key = lambda item: item[3]
    # make a local copy of the scores
    highScores = highScores[:]
    lowest = None
    if highScores:
        lowest = min(highScores, key=key)
    # only add the new score if the high scores
    # are empty, or it beats the lowest one
    if lowest is None or (newScore[3] > lowest[3]):
        newScore[0] = raw_input("Enter name: ")
        highScores.append(newScore)
    # take only the highest 5 scores and return them
    highScores.sort(key=key, reverse=True)
    return highScores[:5]

def print_high_scores(scores):
    # loop over scores using enumerate to also
    # get an int counter for printing
    for i, score in enumerate(scores):
        name, a, b, c, time1 = score
        # #1    50.0    jdi    (20.12.2012, 15:02)
        print "#%d\t%s\t%s\t(%s)" % \
            (i+1, c, name, time1.strftime("%d.%m.%Y, %H:%M"))

def main():
    score = get_user_data()
    highScores = read_high_scores()
    highScores = update_scores(score, highScores)
    write_high_scores(highScores)
    print_high_scores(highScores)

if __name__ == "__main__":
    main()
What it does now is only add a new score if there are no high scores yet, or if it beats the lowest one. You could modify it to always add a new score while there are fewer than 5 previous scores, instead of requiring it to beat the lowest one, and only perform the lowest check once the number of high scores is >= 5, as sketched below.
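A sketch of that modification, assuming everything else in update_scores stays the same (only the acceptance condition changes):

def update_scores(newScore, highScores):
    key = lambda item: item[3]
    highScores = highScores[:]
    # always accept while fewer than 5 scores are stored; once there
    # are 5 or more, require the new score to beat the current lowest
    if len(highScores) < 5 or newScore[3] > min(highScores, key=key)[3]:
        newScore[0] = raw_input("Enter name: ")
        highScores.append(newScore)
    highScores.sort(key=key, reverse=True)
    return highScores[:5]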
The first thing I noticed is that you did not tell list.sort() that the sorting should be based on the last element of each entry. By default, list.sort() will use Python's default sorting order, which will sort entries based on the first element of each entry (i.e. the name), then move on to the second element, the third element, and so on. So you have to tell list.sort() which item to use for sorting:
from operator import itemgetter
[...]
list.sort(key=itemgetter(3))
This will sort entries based on the item with index 3 in each tuple, i.e. the fourth item.
Also, print oldscores() will definitely not work since oldscores is not a function, hence you cannot call it with the () operator. print oldscores is probably better.
Here are the things I notice.
These lines seem to be in the wrong order:
print "Current time:", time1.strftime("%d.%m.%Y, %H:%M")
time1 = datetime.datetime.now()
When the user enters the height and weight, they are going to be read in as strings, not integers, so you will get a TypeError on this line:
c = a/b
You could solve this by casting a and b to float like so:
a = float(raw_input("Enter weight: "))
But you'll probably need to wrap this in a try/catch block, in case the user puts in garbage, basically anything that can't be cast to a float. Put the whole thing in a while block until they get it right.
So, something like this:
b = None
while b == None:
    try:
        b = float(raw_input("Enter height: "))
    except:
        print "Height should be entered using only digits, like '187'"
So, on to the second part: you shouldn't use list as a variable name, since it's a builtin. I'll use high_scores.
# Add one default entry to the list
high_scores = [("CPU", 200, 100, 2, "20.12.2012, 4:20")]
You say you want to check the player score against the high score, to see if it's best, but if that's the case, why a list? Why not just a single entry? Anyhow, that's confusing me, not sure if you really want a high score list, or just one high score.
So, let's just add the score, no matter what:
Assume you've gotten their name into the name variable.
high_scores.append((name, a, b, c, time1))
Then apply the other answer from @Tamás.
You definitely don't want a dictionary here. The whole point of a dictionary is to be able to map keys to values, without any sorting. What you want is a sorted list. And you've already got that.
Well, as Tamás points out, you've actually got a list sorted by the player name, not the score. On top of that, you want to sort in downward order, not upward. You could use the decorate-sort-undecorate pattern, or a key function, or whatever, but you need to do something. Also, you've put it in a variable named list, which is a very bad idea, because that's already the name of the list type.
Anyway, you can find out whether to add something into a sorted list, and where to insert it if so, using the bisect module in the standard library. But it's probably simpler to just use something like SortedCollection or blist.
Here's an example:
highscores = SortedCollection(scores, key=lambda x: -x[3])
Now, when you finish the game:
highscores.insert_right((player, a, b, newscore, time1))
del highscores[-1]
That's it. If you were actually not in the top 10, you'll be added at #11, then removed. If you were in the top 10, you'll be added, and the old #10 will now be #11 and be removed.
If you don't want to prepopulate the list with 10 fake scores the way old arcade games used to, just change it to this:
highscores.insert_right((player, a, b, newscore, time1))
del highscores[10:]
Now, if there were already 10 scores, when you get added, #11 will get deleted, but if there were only 3, nothing gets deleted, and now there are 4.
Meanwhile, I'm not sure why you're writing the new scores out to a pickle file, and then reading the same thing back in. You probably want to do the reading before adding the highscore to the list, and then do the writing after adding it.
You also asked how to "beautify the list". Well, there are three sides to that.
First of all, in the code, (player, a, b, c, time1) isn't very meaningful. Giving the variables better names would help, of course, but ultimately you still come down to the fact that when accessing the list, you have to do entry[3] to get the score or entry[4] to get the time.
There are at least three ways to solve this:
1. Store a list (or SortedCollection) of dicts instead of tuples. The code gets a bit more verbose, but a lot more readable. You write {'player': player, 'height': a, 'weight': b, 'score': c, 'time': time1}, and then when accessing the list, you do entry['score'] instead of entry[3].
2. Use a collection of namedtuples (see the sketch after this list). Now you can actually just insert ScoreEntry(player, a, b, c, time1), or you can insert ScoreEntry(player=player, height=a, weight=b, score=c, time=time1), whichever is more readable in a given case, and they both work the same way. And you can access entry.score or entry[3], again using whichever is more readable.
3. Write an explicit class for score entries. This is pretty similar to the previous one, but there's more code to write, and you can't do indexed access anymore; on the plus side, you don't have to understand namedtuple.
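For example, a minimal sketch of option 2 (field names follow the dict layout in option 1; ScoreEntry is the name used above):

from collections import namedtuple

ScoreEntry = namedtuple('ScoreEntry', ['player', 'height', 'weight', 'score', 'time'])

entry = ScoreEntry(player='jdi', height=200, weight=100, score=2,
                   time='20.12.2012, 21:38')
print entry.score  # access by name...
print entry[3]     # ...or by index; both work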
Second, if you just print the entries, they look like a mess. The way to deal with that is string formatting. Instead of print scores, you do something like this:
print '\n'.join("{}: height {}, weight {}, score {} at {}".format(*entry)
                for entry in highscores)
If you're using a class or namedtuple instead of just a tuple, you can even format by name instead of by position, making the code much more readable.
Finally, the highscore file itself is an unreadable mess, because pickle is not meant for human consumption. If you want it to be human-readable, you have to pick a format, and write the code to serialize that format. Fortunately, the CSV format is pretty human-readable, and most of the code is already written for you in the csv module. (You may want to look at the DictReader and DictWriter classes, especially if you want to write a header line. Again, there's the tradeoff of a bit more code for a lot more readability.)
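For instance, a minimal sketch of the CSV approach (Python 2, to match the rest of this thread; the field names are assumed):

import csv

fields = ['player', 'height', 'weight', 'score', 'time']

# write a header line plus one row per score
with open('scores.csv', 'wb') as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerow({'player': 'jdi', 'height': 200, 'weight': 100,
                     'score': 2, 'time': '20.12.2012, 21:38'})

# read it back; each row comes back as a dict keyed by the header
with open('scores.csv', 'rb') as f:
    for row in csv.DictReader(f):
        print row['player'], row['score']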
