I have a huge Python list (A) of lists. The length of A is around 90,000, and each inner list contains around 700 tuples of (datetime.date, string). I am analyzing this data as follows: I take a window of size x over each inner list, where x = len(inner list) * (some fraction <= 1), and I save every ordered pair (a, b) where a occurs before b in that window (the inner lists are sorted with respect to time). I then slide this window to the end of the list, adding one element at one end and removing one from the other, which takes O(window size) time per step since I only consider the new pairs. My code:
for i in xrange(window_size):
    j = i + 1
    while j < window_size:
        check_and_update(cur, my_list[i][1], my_list[j][1], log)
        j = j + 1

i = 1
while i <= len(my_list) - window_size:
    j = i
    k = i + window_size - 1
    while j < k:
        check_and_update(cur, my_list[j][1], my_list[k][1], log)
        j += 1
    i += 1
Here cur is a sqlite3 database cursor, my_list is a list containing the tuples, and log is an open log file; I run this code for every list in A. In check_and_update() I look the tuple up in my database: if it exists I update its total number of occurrences so far, otherwise I insert it. Code:
def check_and_update(cur, start, end, log):
    t = str(start) + ":" + str(end)
    cur.execute("INSERT OR REPLACE INTO Extra (tuple, count) "
                "VALUES (?, coalesce((SELECT count + 1 FROM Extra WHERE tuple = ?), 1))", [t, t])
As expected, the number of tuples is HUGE, and I previously experimented with a dictionary, which eats up memory very fast. So I resorted to SQLite3, but now it is too slow. I have tried indexing, but with no help. My program is probably spending way too much time querying and updating the database. Do you have any optimization ideas for this problem? Perhaps a change of algorithm, or a different approach/tools. Thank you!
Edit: My goal here is to find the total number of string tuples that occur within the window, grouped by the number of different inner lists they occur in. I extract this information with this query:
for i in range(1, size+1):
    cur.execute('SELECT * FROM Extra WHERE count = ?', (i,))
    # other stuff
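Since the goal is just the histogram of counts, one possible shortcut (a sketch against the Extra table above, not something from the original post) is a single aggregate query instead of one SELECT per count value:

# One GROUP BY pass instead of a separate query for each possible count value.
cur.execute('SELECT "count", COUNT(*) FROM Extra GROUP BY "count" ORDER BY "count"')
for count_value, num_tuples in cur.fetchall():
    # num_tuples rows in Extra share this count value
    print(count_value, num_tuples)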
For example (I am ignoring the date entries and will write them as 'dt'):
My_list = [
    [ (dt, 'user1'), (dt, 'user2'), (dt, 'user3') ],
    [ (dt, 'user3'), (dt, 'user4') ],
    [ (dt, 'user2'), (dt, 'user3'), (dt, 'user1') ]
]
Here, if I take fraction = 1, the results are:
pairs occurring in only 1 inner list: 5 (user 1-2, 1-3, 3-4, 2-1, 3-1)
pairs occurring in 2 inner lists: 1 (user 2-3)
Let me get this straight.
You have up to about 22 billion potential tuples (for 90,000 lists, each with around 700 entries, each paired with the entries that follow it, 350 on average), possibly fewer depending on the window size. You want to find, by the number of inner lists that they appear in, how many tuples there are.
Data of this size has to live on disk. The rule for data that lives on disk due to size is, "Never randomly access, instead generate and then sort."
So I'd suggest that you write out each tuple to a log file, one tuple per line. Sort that file. Now all instances of any given tuple are in one place. Then run through the sorted file and, for each tuple, emit the count of how many times it appears (that is, how many inner lists it is in). Sort that second file. Now run through that file, and you can extract how many tuples appeared 1x, 2x, 3x, etc.
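Here is a minimal sketch of that pipeline, under a few assumptions of mine: each pair has already been written one per line (e.g. "user1:user2") to a file named pairs.txt, with a pair emitted at most once per inner list, and the external sort(1) utility is available. The second sort from the description is collapsed into a Counter, since only a handful of distinct count values can exist.

import subprocess
from itertools import groupby
from collections import Counter

# Sort the raw pair log on disk so identical pairs end up on adjacent lines.
subprocess.check_call(['sort', '-o', 'pairs_sorted.txt', 'pairs.txt'])

histogram = Counter()
with open('pairs_sorted.txt') as f:
    # Each group is one distinct pair; its length is the number of inner lists
    # the pair appeared in (one line was written per inner list).
    for pair, group in groupby(line.strip() for line in f):
        histogram[sum(1 for _ in group)] += 1

for n_lists in sorted(histogram):
    print('%d pairs occur in %d inner list(s)' % (histogram[n_lists], n_lists))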
If you have multiple machines, it is easy to convert this into a MapReduce. (Which is morally the same approach, but you get to parallelize a lot of stuff.)
Apache Hadoop is one of the MapReduce implementations that is suited for this kind of problem.
My understanding of list and set in Python is mainly that a list allows duplicates, keeps its elements in order, and has positional information. I found, while trying to check whether an element is in a list, that the runtime is much faster if I convert the list to a set first. For example, I wrote code trying to find the longest consecutive sequence in a list. Using a list of the numbers 0 to 9999 as an example, the longest consecutive run is 10000. While using a list:
from datetime import datetime

start_time = datetime.now()
nums = list(range(10000))
longest = 0
for number in nums:
    if number - 1 not in nums:
        length = 0
        # Search if number + 1 is also in the list
        while number + length in nums:
            length += 1
        longest = max(length, longest)
end_time = datetime.now()
timecost = 'Duration: {}'.format(end_time - start_time)
print(timecost)
The run time for the above code is "Duration: 0:00:01.481939".
After adding only one line to convert the list to a set (the third line below):
from datetime import datetime

start_time = datetime.now()
nums = list(range(10000))
nums = set(nums)
longest = 0
for number in nums:
    if number - 1 not in nums:
        length = 0
        # Search if number + 1 is also in the set (was a list)
        while number + length in nums:
            length += 1
        longest = max(length, longest)
end_time = datetime.now()
timecost = 'Duration: {}'.format(end_time - start_time)
print(timecost)
The run time for the above code using a set is now "Duration: 0:00:00.005138", many times shorter than searching through the list. Could anyone help me understand the reason for that? Thank you!
This is a great question.
The issue with arrays is that there is no smarter way to search in some array a besides just comparing every element one by one.
Sometimes you'll get lucky and get a match on the first element of a.
Sometimes you'll get unlucky and not find a match until the last element of a, or perhaps none at all.
On average, you'll have to search half the elements of the array each time.
This is said to have a "time complexity" of O(len(a)), or colloquially, O(n). This means the time taken by the algorithm (searching for a value in the array) is linearly proportional to the size of the input (the number of elements in the array to be searched). This is why it's called "linear search". Oh, your array got 2x bigger? Well, your searches just got 2x slower. 1000x bigger? 1000x slower.
Arrays are great, but they're 💩 for searching if the element count gets too high.
Sets are clever. In Python, they're implemented as if they were a dictionary with keys and no values. Like dictionaries, they're backed by a data structure called a hash table.
A hash table uses the hash of a value as a quick way to get a "summary" of an object. This "summary" is then used to narrow down the search, so it only needs to linearly search a very small subset of all the elements. Searching in a hash table has an average time complexity of O(1). Notice that there's no "n" or len(the_set) in there. That's because the time taken to search in a hash table does not grow as the size of the hash table grows. So it's said to have constant time complexity.
By analogy, you only search the dairy aisle when you're looking for milk. You know the hash value of milk (say, its aisle) is "dairy" and not "deli", so you don't waste any time searching anywhere else.
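A small, self-contained timing sketch of the difference (the data size and the use of timeit are my own illustration, not part of the original post):

import timeit

nums_list = list(range(10000))
nums_set = set(nums_list)

# 9999 sits at the very end of the list, so the list scan compares every element;
# the set lookup jumps straight to the right hash bucket.
print(timeit.timeit('9999 in nums_list', globals=globals(), number=1000))
print(timeit.timeit('9999 in nums_set', globals=globals(), number=1000))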
A natural follow-up question is "then why don't we always use sets?". Well, there's a trade-off.
* As you mentioned, sets can't contain duplicates, so if you want to store two of something, they're a non-starter.
* Sets are unordered, so if you care about the order of elements, they won't do, either.
* Sets generally have a larger CPU/memory overhead, which adds up when using many sets containing small numbers of elements.
* Also, it's possible that because of fancy CPU features (like CPU caches and branch prediction), linear searching through small arrays can actually be faster than the hash-based look-up in sets.
I'd recommend you do some further reading into data structures and algorithms. This stuff is quite language-independent. Now that you know that set and dict use a hash table behind the scenes, you can look up resources that cover hash tables in any language, and that should help. There are also some Python-centric resources, like https://www.interviewcake.com/concept/python/hash-map
I was looking for historical data from our Brazilian stock market and found it at Bovespa's website.
The problem is that the data comes in a terrible format, mingled with all sorts of other information about each particular stock!
So far so good! A great opportunity to test my fresh Python skills (or so I thought)!
I managed to "organize/parse" pretty much all of the data with a few lines of code, and then stumbled on a very annoying fact about the data. The very information I needed, the stock prices (open, high, low, close), had no decimal separator and was formatted like this: 0000000011200, which is equivalent to 11 digits before the decimal comma.
So basically 0000000011200 = 112,00... You get the gist.
I wrote a few lines of code to fix that, and then the nightmare kicked in.
The whole data set is around 358K rows long, and with my current script, the deeper it gets into the list, the longer each edit takes.
Here is the code snippet I used for that:
#profile
def dataFix(datas):
    x = 0
    for entry in datas:
        for i in range(9, 16):
            data_org[datas.index(entry)][i] = entry[i][:11] + '.' + entry[i][11:]
        x += 1
        print x
Would anyone mind shining some light on this matter?
datas.index(entry)
There's your problem. datas.index(entry) requires Python to go through the datas list one element at a time, searching for entry. It's an incredibly slow way to do things, slower the bigger the list is, and it doesn't even work, because duplicate elements are always found at their first occurrence instead of the occurrence you're processing.
If you want to use the indices of the elements in a loop, use enumerate:
for index, entry in enumerate(datas):
    ...
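Applied to the original function, the fix might look like the sketch below (data_org is assumed, as in the original snippet, to be a pre-existing copy of the parsed rows):

def dataFix(datas):
    for row_index, entry in enumerate(datas):
        # enumerate hands us the index directly, so there is no need to
        # re-scan the list with datas.index(entry) on every iteration.
        for i in range(9, 16):
            data_org[row_index][i] = entry[i][:11] + '.' + entry[i][11:]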
First, it's probably easier to convert the price directly into a more usable format.
For example, the Decimal type lets you do calculations easily without losing precision.
Secondly, I think you don't even need the index and can just use append.
Thirdly, say welcome to list comprehensions and slices :P
from decimal import Decimal

data_org = []
for entries in datas:
    data_org.append([Decimal(entry).scaleb(-2) for entry in entries[9:16]])
or even:
data_org = [[Decimal(entry).scaleb(-2) for entry in entries[9:16]] for entries in datas]
or in a generator form:
data_org = ([Decimal(entry).scaleb(-2) for entry in entries[9:16]] for entries in datas)
or if you want to keep the text form:
data_org = [['.'.join((entry[:-2], entry[-2:])) for entry in entries[9:16]] for entries in datas]
(replacing [:11] with [:-2] makes the slice independent of the input length and takes the last two digits as the decimals)
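As a quick illustration of both conversions on the sample value from the question (expected output in the comments):

from decimal import Decimal

raw = '0000000011200'
print(Decimal(raw).scaleb(-2))           # 112.00
print('.'.join((raw[:-2], raw[-2:])))    # 00000000112.00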
I am working on a problem where I have to find if a number falls within a certain range. However, the problem is complicated due to the fact that the files I am dealing with have hundreds of thousands of lines.
Below I try to explain the problem in as simple a language as possible.
Here is a brief description of my input files :
File Ranges.txt has some ranges whose min and max are tab separated.
10 20
30 40
60 70
This can have about 10,000,000 such lines with ranges.
NOTE: The ranges never overlap.
File Numbers.txt has a list of numbers and some values associated with each number.
12 0.34
22 0.14
34 0.79
37 0.87
And so on. Again there are hundreds of thousands of such lines with numbers and their associated values.
What I wish to do is take every number from Numbers.txt and check if it falls within any of the ranges in Ranges.txt.
For all such numbers that fall within a range, I have to compute the mean of their associated values (i.e. a mean per range).
For example, in Numbers.txt above there are two numbers, 34 and 37, that fall within the range 30-40 in Ranges.txt, so for the range 30-40 I have to calculate the mean of the associated values of 34 and 37 (i.e. the mean of 0.79 and 0.87), which is 0.83.
My final output file should be the Ranges.txt but with the mean of the associated values of all numbers falling within each range. Something like :
Output.txt
10 20 <mean>
30 40 0.83
60 70 <mean>
and so on.
Would appreciate any help and ideas on how this can be written efficiently in Python.
Obviously you need to run each line from Numbers.txt against each line from Ranges.txt.
You could just iterate over Numbers.txt, and, for each line, iterate over Ranges.txt. But this will take forever, reading the whole Ranges.txt file millions of times.
You could read both of them into memory, but that will take a lot of storage, and it means you won't be able to do any processing until you've finished reading and preprocessing both files.
So, what you want to do is read Ranges.txt into memory once and store it as, say, a list of pairs of ints instead, but read Numbers.txt lazily, iterating over the list for each number.
This kind of thing comes up all the time. In general, you want to make the bigger collection into the outer loop, and make it as lazy as possible, while the smaller collection goes into the inner loop, and is pre-processed to make it as fast as possible. But if the bigger collection can be preprocessed more efficiently (and you have enough memory to store it!), reverse that.
And speaking of preprocessing, you can do a lot better than just reading into a list of pairs of ints. If you sorted Ranges.txt, you could find the closest range without going over by bisecting, then just check that one (about 18 steps), instead of checking each range exhaustively (100,000 steps).
This is a bit of a pain with the stdlib, because it's easy to make off-by-one errors when using bisect, but there are plenty of ActiveState recipes to make it easier (including one linked from the official docs), not to mention third-party modules like blist or bintrees that give you a sorted collection in a simple OO interface.
So, something like this pseudocode:
with open('ranges.txt') as f:
    ranges = sorted(tuple(map(int, line.split())) for line in f)

range_values = {}
with open('numbers.txt') as f:
    for line in f:
        num, val = line.split()
        number, value = int(num), float(val)
        # use the sorted ranges to find the appropriate range rng (if any)
        range_values.setdefault(rng, []).append(value)

with open('output.txt', 'w') as f:
    for r, values in range_values.items():
        mean = sum(values) / len(values)
        f.write('{} {} {}\n'.format(r[0], r[1], mean))
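The lookup left as a comment in the pseudocode could be filled in with bisect, assuming (as the question states) the ranges never overlap; a minimal sketch:

import bisect

def find_range(ranges, starts, number):
    """Return the (lo, hi) range containing number, or None."""
    i = bisect.bisect_right(starts, number) - 1   # right-most range starting <= number
    if i >= 0 and number <= ranges[i][1]:
        return ranges[i]
    return None

ranges = [(10, 20), (30, 40), (60, 70)]   # sorted by start, non-overlapping
starts = [lo for lo, hi in ranges]        # precomputed once, outside the loop
print(find_range(ranges, starts, 34))     # (30, 40)
print(find_range(ranges, starts, 22))     # None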
By the way, if the parsing turns out to be any more complicated than just calling split on each line, I'd suggest using the csv module… but it looks like that won't be a problem here.
What if you can't fit Ranges.txt into memory, but can fit Numbers.txt? Well, you can sort that, then iterate over Ranges.txt, find all of the matches in the sorted numbers, and write the results out for that range.
This is a bit more complicated, because you have to bisect_left and bisect_right and iterate over everything in between. But that's the only way in which it's any harder. (And here, a third-party class will help even more. For example, with a bintrees.FastRBTree as your sorted collection, it's just sorted_number_tree[low:high].)
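A minimal sketch of that lookup with the stdlib, assuming Numbers.txt has been loaded into two parallel sorted lists (the names numbers and values are mine):

import bisect

def mean_for_range(numbers, values, lo, hi):
    """Mean of the values whose numbers fall in [lo, hi]; numbers must be sorted."""
    left = bisect.bisect_left(numbers, lo)     # first index with number >= lo
    right = bisect.bisect_right(numbers, hi)   # one past the last number <= hi
    selected = values[left:right]
    return sum(selected) / len(selected) if selected else None

numbers = [12, 22, 34, 37]
values = [0.34, 0.14, 0.79, 0.87]
print(mean_for_range(numbers, values, 30, 40))   # 0.83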
If the ranges can overlap, you need to be a bit smarter—you have to find the closest range without going over the start, and the closest range without going under the end, and check everything in between. But the main trick there is the exact same one used for the last version. The only other trick is to keep two copies of ranges, one sorted by the start value and one by the end, and you'll need to have one of them be a map to indices in the other instead of just a plain list.
The naive approach would be to read Numbers.txt into some structure in number order, then read each line of Ranges, use a binary search to find the lowest number in the range, and then read through the numbers higher than that to find all those within the range, so that you can produce the corresponding line of output.
I assume the problem is that you can't have all of Numbers in memory.
So you could do the problem in phases, where each phase reads a portion of Numbers in, then goes through the process outlined above, but using an annotated version of Ranges, where each line includes the COUNT of the values so far that have produced that mean, and writes out a similarly annotated version.
Obviously, the initial pass will not have an annotated version of Ranges, and the final pass will not produce one.
It looks like your data in both the files are already sorted. If not, first sort them by an external tool or using Python.
Then you can go through the two files in parallel. You read a number from Numbers.txt and check whether it falls in a range in Ranges.txt, reading as many lines from that file as needed to answer the question. Then you read the next number from Numbers.txt and repeat. The idea is similar to merging two sorted arrays, and should run in O(n + m) time, where n and m are the sizes of the files. If you need to sort the files first, the run time is O(n lg(n) + m lg(m)). Here is a quick program I wrote to implement this:
import sys
from collections import Counter

class Gen(object):
    __slots__ = ('rfp', 'nfp', 'mn', 'mx', 'num', 'val', 'd', 'n')

    def __init__(self, ranges_filename, numbers_filename):
        self.d = Counter()  # sum of associated values, keyed by range
        self.n = Counter()  # number of elements, keyed by range
        self.rfp = open(ranges_filename)
        self.nfp = open(numbers_filename)
        # Read the first number/value pair and the first range
        num, val = self.nfp.next().split()
        self.num, self.val = float(num), float(val)  # Current number and its value
        self.mn, self.mx = [int(x) for x in self.rfp.next().split()]  # Current range

    def go(self):
        while True:
            if self.mx < self.num:
                # Current number is past this range; advance to the next range
                try:
                    self.mn, self.mx = [int(x) for x in self.rfp.next().split()]
                except StopIteration:
                    break
            else:
                if self.mn <= self.num <= self.mx:
                    self.d[(self.mn, self.mx)] += self.val
                    self.n[(self.mn, self.mx)] += 1
                try:
                    num, val = self.nfp.next().split()
                    self.num, self.val = float(num), float(val)
                except StopIteration:
                    break
        self.nfp.close()
        self.rfp.close()
        return self.d, self.n

def run(ranges_filename, numbers_filename):
    r = Gen(ranges_filename, numbers_filename)
    d, n = r.go()
    for mn, mx in sorted(d):
        s, N = d[(mn, mx)], n[(mn, mx)]
        av = s / N if N else 0
        sys.stdout.write('%d %d %.3f\n' % (mn, mx, av))
On files with 10,000,000 numbers in each, the above runs in about 1.5 minutes on my computer, not counting the output part.
I have the following situation:
class M(db.Model):
    a = db.ReferenceProperty(A)
    x = db.ReferenceProperty(X)
    y = db.ReferenceProperty(Y)
    z = db.ReferenceProperty(Z)
    items = db.StringListProperty()
    date = db.DateTimeProperty()
I want to make queries that filter on (a), (x, y or z) and (items), ordered by date i.e.
mm = M.all().filter('a =', a1).filter('x =', x1).filter('items =', i).order('-date')
There will never be a query with a filter on x and y at the same time, for example.
So, my questions are:
1) How many (and which) indexes should I create?
2) How many 'strings' can I add on items? (I'd like to add in the order of thousands)
3) How many index records will I have on a single "M" if there are 1000 items?
I don't quite understand this index stuff yet, and it is killing me. Your help will be very appreciated :)
This article explains indexes/exploding indexes quite well, and it actually fits your example: https://developers.google.com/appengine/docs/python/datastore/queries#Big_Entities_and_Exploding_Indexes
Your biggest issue will be that you will probably run into the limit of 5000 index entries per entity with thousands of items. If you take an index on a, x, items (1000 items), date: |a| * |x| * |items| * |date| == 1 * 1 * 1000 * 1 == 1000 entries.
If you have 5001 entries in items, the put() will fail with the appropriate exception.
From the example you provided, whether you filter on x, y, or anything else seems irrelevant, as there is only one value for each of those properties, so you do not run the risk of an exploding index there: 1 * 1 == 1.
Now, if you had two list properties, you would want to make sure that they are indexed separately, otherwise you'd get an exploding index. For example, if you had 2 list properties with 100 items each, that would produce 100*100 indexes, unless you split them, which would then result in only 200 (assuming all other properties were non-lists).
For the criteria you have given, you only need to create three compound indexes: (a, x, items, -date), (a, y, items, -date), (a, z, items, -date). Note that a list property creates an index entry for each item in the list.
There is a limit of 5000 index entries in total per entity. If you only have the three compound indexes, that's 5000/3 = 1666 items (capped at 1000 for a single list property).
In the case of the three compound indexes only, that's 3 * 1000 = 3000 index entries.
NOTE: the above assumes you do not have built-in indexes per property (i.e. the properties are saved as unindexed). Otherwise you need to account for the built-in indexes at 2N, where N is the number of single property values (the 2 is for asc and desc). In your case this would be 2 * (5 + number_of_items), since items is a list property and every entry creates an index entry.
See also https://developers.google.com/appengine/articles/indexselection, which describes App Engine's (relatively recent) improved query planning capabilities. Basically, you can reduce the number of index entries needed to: (number of filters + 1) * (number of orders).
Though, as the article discusses, there can be reasons that you might still use compound indexes-- essentially, there is a time/space tradeoff.
I have a service that runs that takes a list of about 1,000,000 dictionaries and does the following
myHashTable = {}
myLists = { 'hits':{}, 'misses':{}, 'total':{} }
sortedLists = { 'hits':[], 'misses':[], 'total':[] }

for item in myList:
    id = item.pop('id')
    myHashTable[id] = item
    for k, v in item.iteritems():
        myLists[k][id] = v
So, if I had the following list of dictionaries:
[ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
{'id':'id2', 'hits':300, 'misses':100, 'total':500},
{'id':'id3', 'hits':100, 'misses':400, 'total':600}
]
I end up with
myHashTable =
{
'id1': {'hits':200, 'misses':300, 'total':400},
'id2': {'hits':300, 'misses':100, 'total':500},
'id3': {'hits':100, 'misses':400, 'total':600}
}
and
myLists =
{
'hits': {'id1':200, 'id2':300, 'id3':100},
'misses': {'id1':300, 'id2':100, 'id3':400},
'total': {'id1':400, 'id2':500, 'id3':600}
}
I then need to sort all of the data in each of the myLists dictionaries.
What I'm doing currently is something like the following:
import operator

def doSort(key):
    sortedLists[key] = sorted(myLists[key].items(), key=operator.itemgetter(1), reverse=True)
which would yield, in the case of misses:
[('id3', 400), ('id1', 300), ('id2', 100)]
This works great when I have up to 100,000 records or so, but with 1,000,000 it is taking at least 5-10 minutes to sort each of them, with a total of 16 to sort (my original list of dictionaries actually has 17 fields, including the id, which is popped).
* EDIT * This service is a ThreadingTCPServer which has a method allowing a client to connect and add new data. The new data may include new records (meaning dictionaries with 'id's unique from what is already in memory) or modified records (meaning the same 'id' with different data for the other key-value pairs).
So, once this is running I would pass in
[
    {'id':'id1', 'hits':205, 'misses':305, 'total':480},
    {'id':'id4', 'hits':30, 'misses':40, 'total':60},
    {'id':'id5', 'hits':50, 'misses':90, 'total':20}
]
I have been using dictionaries to store the data so that I don't end up with duplicates. After the dictionaries are updated with the new/modified data, I re-sort each of them.
* END EDIT *
So, what is the best way for me to sort these? Is there a better method?
You may find this related answer from Guido: Sorting a million 32-bit integers in 2MB of RAM using Python
What you really want is an ordered container, instead of an unordered one. That would implicitly sort the results as they're inserted. The standard data structure for this is a tree.
However, there doesn't seem to be one of these in Python. I can't explain that; this is a core, fundamental data type in any language. Python's dict and set are both unordered containers, which map to the basic data structure of a hash table. It should definitely have an optimized tree data structure; there are many things you can do with them that are impossible with a hash table, and they're quite tricky to implement well, so people generally don't want to be doing it themselves.
(There's also nothing mapping to a linked list, which also should be a core data type. No, a deque is not equivalent.)
I don't have an existing ordered container implementation to point you to (and it should probably be implemented natively, not in Python), but hopefully this will point you in the right direction.
A good tree implementation should support iterating across a range by value ("iterate all values from [2,100] in order"), find next/prev value from any other node in O(1), efficient range extraction ("delete all values in [2,100] and return them in a new tree"), etc. If anyone has a well-optimized data structure like this for Python, I'd love to know about it. (Not all operations fit nicely in Python's data model; for example, to get next/prev value from another value, you need a reference to a node, not the value itself.)
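For what it's worth, one third-party package that offers this kind of interface is sortedcontainers (my suggestion, not something the answer points to); a tiny sketch:

from sortedcontainers import SortedList   # pip install sortedcontainers

misses = SortedList([300, 100, 400])   # kept sorted on every insert
misses.add(250)

print(misses[0], misses[-1])            # 100 400: min and max without re-sorting
print(list(misses.irange(200, 400)))    # [250, 300, 400]: iterate a value range in order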
If you have a fixed number of fields, use tuples instead of dictionaries. Place the field you want to sort on in first position, and just use mylist.sort()
This seems to be pretty fast.
raw= [ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
{'id':'id2', 'hits':300, 'misses':100, 'total':500},
{'id':'id3', 'hits':100, 'misses':400, 'total':600}
]
hits= [ (r['hits'],r['id']) for r in raw ]
hits.sort()
misses = [ (r['misses'],r['id']) for r in raw ]
misses.sort()
total = [ (r['total'],r['id']) for r in raw ]
total.sort()
Yes, it makes three passes through the raw data. I think it's faster than pulling out the data in one pass.
Instead of trying to keep your list ordered, maybe you can get by with a heap queue. It lets you push any item, keeping the 'smallest' one at h[0], and popping this item (and 'bubbling up' the next smallest) is an O(log n) operation.
So, just ask yourself:
* Do I need the whole list ordered all the time? Use an ordered structure (like Zope's BTree package, as mentioned by Ealdwulf).
* Or the whole list ordered, but only after a day's worth of random insertions? Use sort like you're doing, or like S.Lott's answer.
* Or just a few 'smallest' items at any moment? Use heapq (see the sketch below).
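A minimal heapq sketch for that last case, using the 'misses' numbers from the question as sample data:

import heapq

h = []
for id_, misses in [('id1', 300), ('id2', 100), ('id3', 400)]:
    heapq.heappush(h, (misses, id_))   # each push is O(log n)

print(h[0])               # (100, 'id2'): the smallest item, no sorting needed
print(heapq.heappop(h))   # removes (100, 'id2'); the next smallest bubbles up

# If only the few largest matter, heapq.nlargest avoids a full sort as well:
print(heapq.nlargest(2, [(300, 'id1'), (100, 'id2'), (400, 'id3')]))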
Others have provided some excellent advice, try it out.
As general advice, in situations like this you need to profile your code. Know exactly where most of the time is spent. Bottlenecks hide well, in places you least expect them.
If there is a lot of number crunching involved then a JIT compiler like the (now-dead) psyco might also help. When processing takes minutes or hours 2x speed-up really counts.
http://docs.python.org/library/profile.html
http://www.vrplumber.com/programming/runsnakerun/
http://psyco.sourceforge.net/
sorted(myLists[key], key=myLists[key].get, reverse=True)
should save you some time, though not a lot.
I would look into using a different sorting algorithm. Something like a Merge Sort might work. Break the list up into smaller lists and sort them individually. Then loop.
Pseudo code:
list1 = [] // sorted separately
list2 = [] // sorted separately

// Recombine sorted lists
result = []
while (list1.hasMoreElements || list2.hasMoreElements):
    if (! list1.hasMoreElements):
        result.addAll(list2)
        break
    elseif (! list2.hasMoreElements):
        result.addAll(list1)
        break
    if (list1.peek < list2.peek):
        result.add(list1.pop)
    else:
        result.add(list2.pop)
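The merge step from the pseudocode above, written as runnable Python (the sample chunks are mine): sort each chunk separately, then merge them lazily with heapq.merge. Note that Python's built-in sort (Timsort) already does this kind of merging internally, so this mainly helps if the chunks are sorted at different times or on different machines.

import heapq

chunk1 = sorted([('id3', 400), ('id1', 300)])
chunk2 = sorted([('id2', 100), ('id4', 250)])

merged = list(heapq.merge(chunk1, chunk2))
print(merged)   # [('id1', 300), ('id2', 100), ('id3', 400), ('id4', 250)]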
Glenn Maynard is correct that a sorted mapping would be appropriate here. This is one for python: http://wiki.zope.org/ZODB/guide/node6.html#SECTION000630000000000000000
I've done some quick profiling of both the original way and S.Lott's proposal. In neither case does it take 5-10 minutes per field. The actual sorting is not the problem. It looks like most of the time is spent in slinging data around and transforming it.
Also, my memory usage is skyrocketing: my Python process is over 350 megs of RAM! Are you sure you're not using up all your RAM and paging to disk? Even with my crappy 3-year-old power-saving laptop processor, I am seeing results way under 5-10 minutes per key sorted for a million items.
What I can't explain is the variability in the actual sort() calls. I know Python's sort is extra good at sorting partially sorted lists, so maybe his list is getting partially sorted in the transform from the raw data to the list to be sorted.
Here's the results for slott's method:
done creating data
done transform. elapsed: 16.5160000324
sorting one key slott's way takes 1.29699993134
here's the code to get those results:
starttransform = time.time()
hits= [ (r['hits'],r['id']) for r in myList ]
endtransform = time.time()
print "done transform. elapsed: " + str(endtransform - starttransform)
hits.sort()
endslottsort = time.time()
print "sorting one key slott's way takes " + str(endslottsort - endtransform)
Now the results for the original method, or at least a close version with some instrumentation added:
done creating data
done transform. elapsed: 8.125
about to get stuff to be sorted
done getting data. elapsed time: 37.5939998627
about to sort key hits
done sorting on key <hits> elapsed time: 5.54699993134
Here's the code:
import time
from operator import itemgetter

mysorted = {}
for k, v in myLists.iteritems():
    time1 = time.time()
    print "about to get stuff to be sorted"
    tobesorted = myLists[k].items()
    time2 = time.time()
    print "done getting data. elapsed time: " + str(time2 - time1)
    print "about to sort key " + str(k)
    tobesorted.sort(key=itemgetter(1))  # list.sort() sorts in place and returns None
    mysorted[k] = tobesorted
    time3 = time.time()
    print "done sorting on key <" + str(k) + "> elapsed time: " + str(time3 - time2)
Honestly, the best way is to not use Python. If performance is a major concern for this, use a faster language.