Converting assembly instructions BL and B to binary - python

I am trying to understand how a binary containing machine code gets converted to assembly instructions.
For example, here is an example output from objdump for an ARM-based application:
00008420 <main>:
8420: e92d4800 push {fp, lr}
8424: e28db004 add fp, sp, #4
8428: e24dd008 sub sp, sp, #8
842c: e59f2054 ldr r2, [pc, #84] ; 8488 <main+0x68>
8430: e24b300c sub r3, fp, #12
8434: e1a00002 mov r0, r2
8438: e1a01003 mov r1, r3
843c: ebffffc6 bl 835c <__isoc99_scanf@plt>
8440: e3a03000 mov r3, #0
8444: e50b3008 str r3, [fp, #-8]
8448: ea000006 b 8468 <main+0x48>
844c: e51b3008 ldr r3, [fp, #-8]
8450: e2833001 add r3, r3, #1
8454: e50b3008 str r3, [fp, #-8]
8458: e59f302c ldr r3, [pc, #44] ; 848c <main+0x6c>
845c: e1a00003 mov r0, r3
8460: e51b1008 ldr r1, [fp, #-8]
8464: ebffffb3 bl 8338 <printf@plt>
8468: e51b300c ldr r3, [fp, #-12]
846c: e51b2008 ldr r2, [fp, #-8]
8470: e1520003 cmp r2, r3
8474: bafffff4 blt 844c <main+0x2c>
8478: e3a03000 mov r3, #0
847c: e1a00003 mov r0, r3
8480: e24bd004 sub sp, fp, #4
8484: e8bd8800 pop {fp, pc}
8488: 00008500 .word 0x00008500
848c: 00008504 .word 0x00008504
As you can see, at offset 8464 the binary code ebffffb3 gets converted to bl 8338. I want to understand how that works.
The reason is that I want to add additional regexes for more instructions to the following Python code:
[b"[\x00\x08\x10\x18\x20\x28\x30\x38\x40\x48\x70]{1}\x47", 2, 2], # bx reg
[b"[\x80\x88\x90\x98\xa0\xa8\xb0\xb8\xc0\xc8\xf0]{1}\x47", 2, 2], # blx reg
[b"[\x00-\xff]{1}\xbd", 2, 2] # pop {,pc}
As you can see, the regex for a bx instruction in the binary is b"[\x00\x08\x10\x18\x20\x28\x30\x38\x40\x48\x70]{1}\x47" and for blx it is b"[\x80\x88\x90\x98\xa0\xa8\xb0\xb8\xc0\xc8\xf0]{1}\x47". Now I want to add two more instructions, B and BL (these are ARM instructions), but I have no idea how to convert them to a similar binary pattern.
(The source code comes from ROPGadget on GitHub.)

I am trying to understand how a binary containing machine code gets converted to assembly instructions.
Aside: All traditional CPU hardware uses binary logic, built from a few standard transistor configurations implementing NOT, NOR, NAND, etc. From these few logic gates you can implement many more sophisticated devices by combining the logic elements.
So, all CPUs extract bit fields (a few bit positions, not necessarily adjacent) and determine which type of instruction it is. Other bit fields give parameters to the particular opcode.
In C, this converts to mask-and-compare operations, where you extract the bits to be examined and then check whether the bit pattern matches. The specific implementation for the GNU tools (binutils) is arm-dis.c.
This sourceforge project is one source of information, although there are others (including the arm-dis.c file).
|31..28|27..25| 24 |23 .. 0 |
+------+------+----+--------+
| cond | 101  | L  | offset |
+------+------+----+--------+
The only constant part is the '101'. Your Python regexes are written in hexadecimal escapes. The leading nibble is a condition; if it is true, the instruction executes, otherwise it behaves like a no-op. Very old ARM CPU documentation had a 'never' condition (leading hex 'F'); it has since been repurposed to extend the instruction set. So the leading nibble (four bits) can be ignored, and in bits 27..24 you then look for either '1010b' or 0xa (for b, a plain branch) or '1011b' or 0xb (for bl, branch and link).
For example, arm-dis.c has,
{ARM_FEATURE_CORE_LOW (ARM_EXT_V1),
0x0a000000, 0x0e000000, "b%24'l%c\t%b"},
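To turn that into ROPGadget-style table entries: ARM instructions are stored little-endian, so the three offset bytes come first and the opcode byte (condition nibble plus the 1010b/1011b field) is the fourth byte of each 4-byte instruction. A sketch of what the two extra rows might look like, assuming the same [pattern, size, alignment] shape as the Thumb entries above and skipping the 0xF 'never' condition (untested):
[b"[\x00-\xff]{3}[\x0a\x1a\x2a\x3a\x4a\x5a\x6a\x7a\x8a\x9a\xaa\xba\xca\xda\xea]", 4, 4], # b imm
[b"[\x00-\xff]{3}[\x0b\x1b\x2b\x3b\x4b\x5b\x6b\x7b\x8b\x9b\xab\xbb\xcb\xdb\xeb]", 4, 4], # bl imm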
That said, the b and bl instructions are not that useful for ROP, as they do not take register arguments, so you cannot alter the control flow. Normally you would arrange for the control flow to be in your ROP gadget directly instead of trying to reach it through a fixed jump.
The ARM equivalent of b rN is mov pc, rN; but there are many other rich constructs, such as add with shift, or ldr with tables of pointers, etc. AFAIK, ROPGadget was detecting these when I ran it on an ARM glibc.

Quoting from https://www.ic.unicamp.br/~ranido/mc404/arm/arm-instructionset.pdf
Branch instructions contain a signed 2’s complement 24 bit offset.
This is shifted left two bits, sign extended to 32 bits, and added to
the PC. The instruction can therefore specify a branch of +/-
32Mbytes. The branch offset must take account of the prefetch
operation, which causes the PC to be 2 words (8 bytes) ahead of the
current instruction. Branches beyond +/- 32Mbytes must use an offset
or absolute destination which has been previously loaded into a
register. In this case the PC should be manually saved in R14 if a
Branch with Link type operation is required.
So let's look at your branch example
8464: ebffffb3 bl 8338 <printf@plt>
The processor logic takes the 24-bit offset ffffb3, sign-extends it to 32 bits (ffffffb3), and multiplies it by 4 (the offset can be encoded this compactly because instructions are 4-byte aligned). It then adds the result to the current instruction's address plus 8 (because of prefetch, the PC is two words ahead). Working it through, modulo 2^32:
ffffffb3 * 4 = fffffecc
fffffecc + 8464 + 8 = 8338 QED
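The same computation in Python, as a minimal sketch (the function name and interface are just for illustration):
def decode_branch(word, pc):
    # word: 32-bit instruction, e.g. 0xebffffb3; pc: address of the instruction
    offset = word & 0x00ffffff             # 24-bit signed offset field
    if offset & 0x00800000:                # sign-extend to 32 bits
        offset -= 0x01000000
    link = (word >> 24) & 1                # bit 24: 0 = b, 1 = bl
    target = (pc + 8 + (offset << 2)) & 0xffffffff
    return ('bl' if link else 'b'), target

kind, target = decode_branch(0xebffffb3, 0x8464)
print kind, hex(target)                    # prints: bl 0x8338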

Raspberry Pi ADC: MCP3001 using Python and spidev

I am trying to use an MCP3001 ADC with the Raspberry Pi over SPI using Python.
While doing so I am getting some strange results:
The code I am using:
import sys
import time
import spidev

spi = spidev.SpiDev()
spi.open(0, 0)

def readAdc(channel):
    r = spi.xfer2([1, 8 + channel << 4, 0])
    return ((r[1] & 3) << 8) + r[2]

while True:
    print readAdc(0)
    time.sleep(0.5)
Running the script above while measuring the center point of a voltage divider yields random switching between two values: 504 and 1016.
504 is the value I would expect to be correct, and given the binary representation of the two results:
504 --> 00111111000
1016 --> 01111111000
I assume I am accidentally 'creating' a 1 somewhere.
Can someone point me in the right direction?
Thanks in advance.
BTW: Is it me, or is there no decent documentation for the spidev lib?
Figure 5-2 in the data sheet shows what the issue is. The data returned from the device, bit by bit, looks like this (where X = don't care, and Bx is bit number x of the data; B0 is the LSB and B9 the MSB):
BYTE0: X X 0 B9 B8 B7 B6 B5
BYTE1: B4 B3 B2 B1 B0 B1 B2 B3
BYTE2: B4 B5 B6 B7 B8 B9 X X
If you change the return statement to this:
return ((r[0] & 0x1F) << 5) | ((r[1] >> 3) & 0x1F)
it might work.
But if I were you I wouldn't trust me (I no longer have a hardware setup where I could test anything). I think it's best to print the individual values of the bytes in r and ponder the result until it starts to make sense. Be aware that some SPI implementations let you read in the data on either the rising or the falling edge. If you have that wrong your data will never make sense.
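For example, something along these lines (a throwaway helper in the question's Python 2 style; the name is just for illustration) prints each returned byte bit by bit, so you can line the transfer up against the figure 5-2 layout:
def dump_adc_bytes(channel):
    # Show the three returned bytes in binary, MSB first, for
    # comparison with the data sheet's X/B9..B0 diagram.
    r = spi.xfer2([1, 8 + channel << 4, 0])
    print ' | '.join('{:08b}'.format(b) for b in r)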
Another clue here is that the three LSBs of an A/D converter reading shouldn't be consistently 000, for reading after reading. There is almost always noise in the least significant bit, at least, if not two or three bits.
BTW I agree with you about spidev. I switched to quick2wire and then things went much more smoothly. It's better written and much better documented.

Trying to read an ADC with Cython on an RPi 2 b+ via SPI (MCP3304, MCP3204, or MCP3008)?

I'd like to read differential voltage values from an MCP3304 (5v VDD, 3.3v Vref, main channel = 7, diff channel = 6) connected to an RPi 2 b+ as close as possible to the MCP3304's max sample rate of 100ksps. Preferably, I'd get > 1 sample per 100µs (> 10 ksps).
A kind user recently suggested I try porting my code to C for some speed gains. I'm VERY new to C, so I thought I'd give Cython a shot, but I can't seem to figure out how to tap into the C-based speed gains.
My guess is that I need to write a .pyx file with a more raw means of accessing the ADC's bits/bytes via SPI than the Python package I'm currently using (the Python-wrapped gpiozero package). 1) Does this seem correct, and, if so, 2) might someone please help me understand how to manipulate bits/bytes appropriately for the MCP3304 in a way that will produce speed gains from Cython? I've seen C tutorials for the MCP3008, but am having trouble adapting this code to fit the timing laid out in the MCP3304's spec sheet; though I might be able to adapt a Cython-specific MCP3008 (or other ADC) tutorial to fit the MCP3304.
Here's a little .pyx loop I wrote to test how fast I'm reading voltage values. (Timing how long it takes to read 25,000 samples). It's ~ 9% faster than running it straight in Python.
# Load packages
import time
from gpiozero import MCP3304

# a function to ping the PD every ms for 1 minute
def pd_ping():
    cdef char *FILENAME = "testfile.txt"
    cdef double v
    # cdef int adc_channel_pd = 7
    cdef size_t i
    # send message to user re: expected timing
    print "Running for", 25000, "iterations. Fingers crossed!"
    print time.clock()
    s = []
    for i in range(25000):
        v = MCP3304(channel=7, diff=True).value * 3.3
        # concatenate
        s.append(str(v))
    print "donezo", time.clock()
    # write to file
    out = '\n'.join(s)
    f = open(FILENAME, "w")
    f.write(out)
    f.close()
There is probably no need to create an MCP3304 object for each iteration. Also conversion from float to string can be delayed.
s = []
mcp = MCP3304(channel=7, diff=True)
for i in range(25000):
    v = mcp.value * 3.3
    s.append(v)
out = '\n'.join('{:.2f}'.format(v) for v in s) + '\n'
If that multiplication by 3.3 is not strictly necessary at that point, it could be done later.
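For instance, the scaling can be folded into the final formatting step, so the tight loop does nothing but read and append:
out = '\n'.join('{:.2f}'.format(v * 3.3) for v in s) + '\n'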

How to convert midi files to keypresses (in Python)?

I'm trying to read a MIDI file and then convert each note (midi number) to a simulated keypress on the keyboard (A,f,h,J,t...).
I'm able to read any MIDI file with the python-midi library like this:
pattern = midi.read_midifile("example.mid")
and I can also simulate keypresses with pywin32 like this:
shell = win32com.client.Dispatch("WScript.Shell")
shell.SendKeys(key)
But I have no idea on how to actually convert midi numbers (that are in midi files) to keypresses.
To be more precise, I'm trying to convert MIDI numbers to notes, and then notes to keypresses, according to virtualpiano.net (61-key), so that the program plays that piano by pressing the corresponding button on the keyboard (you can enable key assist in the settings of the piano to see which key is what button).
Of course I would also have to wait between keypresses but that's easy enough.
Any help is appreciated. (Windows 10 64-bit (32-bit Python 2.7))
If you take a look at the midi module that you are using, you will see that there are some constants that can be used to convert notes to their MIDI number and vice versa.
>>> import midi
>>> midi.C_0 # note C octave 0
0
>>> midi.G_3 # G octave 3
43
>>> midi.Gs_4 # G# octave 4
56
>>> midi.A_8 # A octave 8
105
>>> midi.NOTE_VALUE_MAP_SHARP[0]
'C_0'
>>> midi.NOTE_VALUE_MAP_SHARP[56]
'Gs_4'
>>> midi.NOTE_VALUE_MAP_SHARP[105]
'A_8'
Opening a MIDI file with read_midifile() returns a Pattern object which looks like this (taken from the examples):
>>> midi.read_midifile('example.mid')
midi.Pattern(format=1, resolution=220, tracks=\
[midi.Track(\
[midi.NoteOnEvent(tick=0, channel=0, data=[43, 20]),
midi.NoteOffEvent(tick=100, channel=0, data=[43, 0]),
midi.EndOfTrackEvent(tick=1, data=[])])])
The NoteOnEvent contains timing, MIDI number/pitch and velocity which you can retrieve:
>>> on = midi.NoteOnEvent(tick=0, channel=0, data=[43, 20])
>>> on.pitch
43
>>> midi.NOTE_VALUE_MAP_SHARP[on.pitch]
'G_3'
Now all of that is interesting, but you don't really need to convert the MIDI number to a note, you just need to convert it to the keyboard key for that note as used by http://virtualpiano.net/.
Middle C is equal to MIDI 60 and this note corresponds to the 25th key on the virtualpiano keyboard which is activated by pressing the letter t. The next note, Cs_5, is MIDI 61 which is uppercase T (<shift>-t). From there you can work out the mapping for the MIDI numbers to the supported virtualpiano keys; it's this:
midi_to_vk = (
[None]*36 +
list('1!2#34$5%6^78*9(0qQwWeErtTyYuiIoOpPasSdDfgGhHjJklLzZxcCvVbBnm') +
[None]*31
)
The next problem that you will face is sending the key events. Note that in MIDI multiple notes can be played simultaneously, or can overlap in time. This means that you might need to be able to send more than one key press event at the same time.
I don't think that you can handle the velocity using a computer keyboard. There is also the issue of timing, but you said that that's not a problem for you.
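Putting it together, a minimal sketch (it assumes the midi_to_vk table above and ignores velocity and timing; also note that SendKeys treats characters such as ( ) + ^ % ~ { } as special, so some mapped keys would need escaping):
import midi
import win32com.client

shell = win32com.client.Dispatch("WScript.Shell")
pattern = midi.read_midifile("example.mid")
for track in pattern:
    for event in track:
        # a NoteOn with velocity 0 is really a note-off
        if isinstance(event, midi.NoteOnEvent) and event.data[1] > 0:
            key = midi_to_vk[event.pitch]
            if key is not None:
                shell.SendKeys(key)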
Check this question first. It gives the general idea of how to simulate keypresses; it may look like a lot, but it's just a big list of keys. Then, to convert MIDI to keyboard buttons, you would create a dictionary mapping between notes and keyboard buttons.
I understand that what you really need is a way to convert a midi note number to a note in standard notation.
Here are some elements from Wikipedia:
... GM specifies that note number 69 plays A440, which in turn fixes middle C as note number 60 ...
(from MIDI - General MIDI)
Converting from MIDI note number (d) to frequency (f) is given by the following formula: f = 2^((d - 69)/12) * 440 Hz
(from MIDI Tuning Standard - Frequency values)
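As a quick check of that formula in Python:
def midi_to_freq(d):
    # f = 2**((d - 69)/12) * 440 Hz
    return 440.0 * 2 ** ((d - 69) / 12.0)

print midi_to_freq(69)   # 440.0 (A4, the tuning reference)
print midi_to_freq(60)   # ~261.63 (middle C)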
And finally from C (musical note) - Designation by octave
Scientific designation |Octave name | Frequency (Hz) | Other names
C4 | One-lined | 261.626 | Middle C
So C4 is MIDI note 60, and MIDI notes are separated by a semitone(*). As you have 12 semitones in one octave, you will get C5 at MIDI note 72, and the famous A440 (MIDI note 69) is A4 in scientific notation.
As an example for the table
Midi note | Scientific notation
60 | C4
62 | D4
64 | E4
65 | F4
67 | G4
69 | A4
71 | B4
72 | C5
And you would get F♯4 at midi 66...
(*) it is musically what is called equal temperament. Midi allows for finer pitch definition but it would be far beyond this answer.

High-speed alternatives to replace byte array processing bottlenecks

>> See EDIT below <<
I am working on processing data from a special pixelated CCD camera over serial, using FTDI D2xx drivers via pyUSB.
The camera can operate at high bandwidth to the PC, up to 80 frames/sec. I would love that speed, but I know it isn't feasible with Python, it being a scripted language; still, I would like to know how close I can get, whether through optimizations I missed in my code, threading, or some other approach. My immediate thought is to break out the most time-consuming loops and put them in C, but I don't have much experience with C and am not sure of the best way to get Python to interact inline with it, if that's possible. I have complex algorithms heavily developed in Python with SciPy/Numpy, which are already optimized and have acceptable performance, so I would just need a way to speed up the acquisition of the data to feed back to Python, if that's the best approach.
The difficulty, and the reason I used Python and not some other language, is the need to easily run it cross-platform (I develop in Windows, but am putting the code on an embedded Linux board, making a stand-alone system). If you suggest that I use another language, like C, how would I be able to work cross-platform? I have never compiled a lower-level language like C for both Windows and Linux, so I would want to be sure of that process - I would have to compile it for each system, right? What do you suggest?
Here are my functions, with current execution times:
ReadStream: 'RXcount' is 114733 for a device read, formatting from string to byte equivalent
Returns a list of bytes (0-255), representing binary values
Current execution time: 0.037 sec
def ReadStream(RXcount):
    global ftdi
    RXdata = ftdi.read(RXcount)
    RXdata = list(struct.unpack(str(len(RXdata)) + 'B', RXdata))
    return RXdata
ProcessRawData: To reshape the byte list into an array that matches the pixel orientations
Results in a 3584x32 array, after trimming off some un-needed bytes.
Data is unique in that every block of 14 rows represents 14 bits of one row of pixels on the device (with 32 bytes across * 8 bits/byte = 256 bits across), which is 256x256 pixels. The processed array has 32 columns of bytes because each byte, in binary, represents 8 pixels (32 bytes * 8 bits = 256 pixels). Still working on how to do that one... I have already posted a question for that previously
Current execution time: 0.01 sec ... not bad, it's just Numpy
def ProcessRawData(RawData):
    if len(RawData) == 114733:
        ProcessedMatrix = np.ndarray((1, 114733), dtype=int)
        np.copyto(ProcessedMatrix, RawData)
        ProcessedMatrix = ProcessedMatrix[:, 1:-44]
        ProcessedMatrix = np.reshape(ProcessedMatrix, (-1, 32))
        return ProcessedMatrix
    else:
        return None
Finally,
GetFrame: The device has a mode where it just outputs whether a pixel detected anything or not, using the lowest bit of the array (every 14th row) - Get that data and convert to int for each pixel
Results in 256x256 array, after processing every 14th row, which are bytes to be read as binary (32 bytes across ... 32 bytes * 8 bits = 256 pixels across)
Current execution time: 0.04 sec
def GetFrame(ProcessedMatrix):
    if np.shape(ProcessedMatrix) == (3584, 32):
        FrameArray = np.zeros((256, 256), dtype='B')
        DataRows = ProcessedMatrix[13::14]
        for i in range(256):
            RowData = ""
            for j in range(32):
                RowData = RowData + "{:08b}".format(DataRows[i, j])
            FrameArray[i] = [int(RowData[b:b+1], 2) for b in range(256)]
        return FrameArray
    else:
        return False
Goal:
I would like to target a total execution time of ~0.02 secs/frame by whatever suggestions you make (currently it's 0.25 secs/frame with the GetFrame function being the weakest). The device I/O is not the limiting factor, as that outputs a data packet every 0.0125 secs. If I get the execution time down, then can I just run the acquisition and processing in parallel with some threading?
Let me know what you suggest as the best path forward - Thank you for the help!
EDIT, thanks to @Jaime:
Functions are now:
def ReadStream(RXcount):
    global ftdi
    return np.frombuffer(ftdi.read(RXcount), dtype=np.uint8)
... time 0.013 sec
def ProcessRawData(RawData):
    if len(RawData) == 114733:
        return RawData[1:-44].reshape(-1, 32)
    return None
... time 0.000007 sec!
def GetFrame(ProcessedMatrix):
    if ProcessedMatrix.shape == (3584, 32):
        return np.unpackbits(ProcessedMatrix[13::14]).reshape(256, 256)
    return False
... time 0.00006 sec!
So, with pure Python, I am now able to acquire the data at the desired frame rate! After a few tweaks to the D2xx USB buffers and latency timing, I just clocked it at 47.6 FPS!
The last step: is there any way to make this run in parallel with my processing algorithms? I need some way to pass the result of GetFrame to another loop running in parallel.
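One common pattern for that last step, sketched here with Python's threading and Queue modules (process_frame is a hypothetical stand-in for the downstream algorithms):
import threading
import Queue  # 'queue' in Python 3

frames = Queue.Queue(maxsize=8)

def acquire():
    # producer: read, reshape, and unpack frames as fast as they arrive
    while True:
        matrix = ProcessRawData(ReadStream(114733))
        if matrix is not None:
            frames.put(GetFrame(matrix))

t = threading.Thread(target=acquire)
t.daemon = True
t.start()

while True:
    frame = frames.get()   # blocks until a frame is available
    process_frame(frame)   # hypothetical downstream processing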
There are several places where you can speed things up significantly. Perhaps the most obvious is rewriting GetFrame:
def GetFrame(ProcessedMatrix):
    if ProcessedMatrix.shape == (3584, 32):
        return np.unpackbits(ProcessedMatrix[13::14]).reshape(256, 256)
    return False
This requires that ProcessedMatrix be an ndarray of type np.uint8, but other than that, on my systems it runs 1000x faster.
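np.unpackbits expands each uint8 into its eight bits, most significant bit first, which matches the bit order built up by the "{:08b}" formatting in the original loop. A quick check in the interpreter:
>>> import numpy as np
>>> np.unpackbits(np.array([0b10100000], dtype=np.uint8))
array([1, 0, 1, 0, 0, 0, 0, 0], dtype=uint8)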
With your other two functions, I think that in ReadStream you should do something like:
def ReadStream(RXcount):
    global ftdi
    return np.frombuffer(ftdi.read(RXcount), dtype=np.uint8)
Even if it doesn't speed up that function much, because the read itself takes up most of the time, it will already give you a numpy array of bytes to work on. With that, you can then go on to ProcessRawData and try:
def ProcessRawData(RawData):
    if len(RawData) == 114733:
        return RawData[1:-44].reshape(-1, 32)
    return None
Which is 10x faster than your version.

What language could I use for fast execution of this database summarization task?

So I wrote a Python program to handle a little data processing
task.
Here's a very brief specification in a made-up language of the computation I want:
parse "%s %lf %s" aa bb cc | group_by aa | quickselect --key=bb 0:5 | \
flatten | format "%s %lf %s" aa bb cc
That is, for each line, parse out a word, a floating-point number, and another word. Think of them as a player ID, a score, and a date. I want the top five scores and dates for each player. The data size is not trivial, but not huge; about 630 megabytes.
I want to know what real, executable language I should have written it in to
get it to be similarly short (as the Python below) but much faster.
#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
top_5 = {}
for line in sys.stdin:
    aa, bb, cc = line.split()
    # We want the top 5 for each distinct value of aa. There are
    # hundreds of thousands of values of aa.
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    current.append((bb, cc))
    # Every once in a while, we drop the values that are not in
    # the top 5, to keep our memory footprint down, because some
    # values of aa have thousands of (bb, cc) pairs.
    if len(current) > 10:
        current.sort()
        current[:-5] = []
for aa in top_5:
    current = top_5[aa]
    current.sort()
    for bb, cc in current[-5:]:
        print aa, bb, cc
Here’s some sample input data:
3 1.5 a
3 1.6 b
3 0.8 c
3 0.9 d
4 1.2 q
3 1.5 e
3 1.8 f
3 1.9 g
Here’s the output I get from it:
3 1.5 a
3 1.5 e
3 1.6 b
3 1.8 f
3 1.9 g
4 1.2 q
There are seven values for 3, and so we drop the c and d values
because their bb value puts them out of the top 5. Because 4 has
only one value, its “top 5” consists of just that one value.
This runs faster than doing the same queries in MySQL (at least, the
way we’ve found to do the queries) but I’m pretty sure it's spending
most of its time in the Python bytecode interpreter. I think that in
another language, I could probably get it to process hundreds of
thousands of rows per second instead of per minute. So I’d like to
write it in a language that has a faster implementation.
But I’m not sure what language to choose.
I haven’t been able to figure out how to express this as a single query in SQL, and
actually I’m really unimpressed with MySQL’s ability even to merely
select * from foo into outfile 'bar'; the input data.
C is an obvious choice, but things like line.split(), sorting a list
of 2-tuples, and making a hash table require writing some code that’s
not in the standard library, so I would end up with 100 lines of code
or more instead of 14.
C++ seems like it might be a better choice (it has strings, maps,
pairs, and vectors in the standard library) but it seems like the code
would be a lot messier with STL.
OCaml would be fine, but does it have an equivalent of line.split(),
and will I be sad about the performance of its map?
Common Lisp might work?
Is there some equivalent of Matlab for database computation like this
that lets me push the loops down into fast code? Has anybody tried Pig?
(Edit: responded to davethegr8's comment by providing some sample input and output data, and fixed a bug in the Python program!)
(Additional edit: Wow, this comment thread is really excellent so far. Thanks, everybody!)
Edit:
There was an eerily similar question asked on sbcl-devel in 2007 (thanks, Rainer!), and here's an awk script from Will Hartung for producing some test data (although it doesn't have the Zipfian distribution of the real data):
BEGIN {
    for (i = 0; i < 27000000; i++) {
        v = rand();
        k = int(rand() * 100);
        print k " " v " " i;
    }
    exit;
}
I have a hard time believing that any script, without prior knowledge of the data (unlike MySQL, which has such info pre-loaded), would be faster than a SQL approach.
Aside from the time spent parsing the input, the script needs to keep the per-key lists sorted, etc.
The following is a first guess at what should work decently fast in SQL, assuming an index (*) on the table's aa, bb, cc columns, in that order. (A possible alternative would be an "aa, bb DESC, cc" index.)
(*) This index could be clustered or not, without affecting the following query. The choice of clustering or not, and of needing an "aa,bb,cc" separate index, depends on the use case, on the size of the rows in the table, etc.
SELECT T1.aa, T1.bb, T1.cc, COUNT(*)
FROM tblAbc T1
LEFT OUTER JOIN tblAbc T2 ON T1.aa = T2.aa
    AND (T1.bb < T2.bb OR (T1.bb = T2.bb AND T1.cc < T2.cc))
GROUP BY T1.aa, T1.bb, T1.cc
HAVING COUNT(*) < 5  -- trick, remember COUNT(*) goes 1,1,2,3,...
ORDER BY T1.aa, T1.bb, T1.cc, COUNT(*) DESC
The idea is to get a count of how many records, within a given aa value, are smaller than self. There is a small trick, however: we need to use a LEFT OUTER join, lest we discard the record with the biggest bb value or the last one (which may happen to be one of the top 5). As a result of the left join, the COUNT(*) value counts 1, 1, 2, 3, 4 etc., and the HAVING test therefore is "< 5" to effectively pick the top 5.
To emulate the sample output of the OP, the ORDER BY uses DESC on the COUNT(), which could be removed to get a more traditional top-5 type of listing. Also, the COUNT() in the select list can be removed if so desired; this doesn't impact the logic of the query or the ability to properly sort.
Also note that this query is deterministic in its dealing with ties, i.e. when a given set of records has the same value for bb (within an aa group); I think the Python program may provide slightly different outputs when the order of the input data is changed, because of its occasional truncating of the sorting dictionary.
Real solution: A SQL-based procedural approach
The self-join approach described above demonstrates how declarative statements can be used to express the OP's requirement. However this approach is naive, in the sense that its performance is roughly bound to the sum of the squares of the record counts within each aa 'category'. (Not O(n^2), but roughly O((n/a)^2), where a is the number of different values for the aa column.) In other words, it performs well with data such that, on average, the number of records associated with a given aa value doesn't exceed a few dozen. If the data is such that the aa column is not selective, the following approach is much -much!- better suited. It leverages SQL's efficient sorting framework, while implementing a simple algorithm that would be hard to express declaratively. This approach could further be improved for datasets with a particularly huge number of records in each/most aa 'categories' by introducing a simple binary search for the next aa value, by looking ahead (and sometimes back...) in the cursor. For cases where the number of aa 'categories' relative to the overall row count in tblAbc is low, see yet another approach, after this next one.
DECLARE @aa AS VARCHAR(10), @bb AS INT, @cc AS VARCHAR(10)
DECLARE @curAa AS VARCHAR(10)
DECLARE @Ctr AS INT

DROP TABLE tblResults;
CREATE TABLE tblResults (
    aa VARCHAR(10),
    bb INT,
    cc VARCHAR(10)
);

DECLARE abcCursor CURSOR
    FOR SELECT aa, bb, cc
    FROM tblABC
    ORDER BY aa, bb DESC, cc
    FOR READ ONLY;

OPEN abcCursor;
SET @curAa = ''

FETCH NEXT FROM abcCursor INTO @aa, @bb, @cc;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @curAa <> @aa
    BEGIN
        SET @Ctr = 0
        SET @curAa = @aa
    END
    IF @Ctr < 5
    BEGIN
        SET @Ctr = @Ctr + 1;
        INSERT tblResults VALUES(@aa, @bb, @cc);
    END
    FETCH NEXT FROM abcCursor INTO @aa, @bb, @cc;
END;

CLOSE abcCursor;
DEALLOCATE abcCursor;

SELECT * from tblResults
ORDER BY aa, bb, cc  -- OR .. bb DESC ... for a more traditional order.
Alternative to the above for cases when aa is very unselective. In other words, when we have relatively few aa 'categories'. The idea is to go through the list of distinct categories and to run a "LIMIT" (MySql) "TOP" (MSSQL) query for each of these values.
For reference purposes, the following ran in 63 seconds for tblAbc of 61 Million records divided in 45 aa values, on MSSQL 8.0, on a relatively old/weak host.
DECLARE @aa AS VARCHAR(10)
DECLARE @aaCount INT

DROP TABLE tblResults;
CREATE TABLE tblResults (
    aa VARCHAR(10),
    bb INT,
    cc VARCHAR(10)
);

DECLARE aaCountCursor CURSOR
    FOR SELECT aa, COUNT(*)
    FROM tblABC
    GROUP BY aa
    ORDER BY aa
    FOR READ ONLY;

OPEN aaCountCursor;

FETCH NEXT FROM aaCountCursor INTO @aa, @aaCount
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT tblResults
        SELECT TOP 5 aa, bb, cc
        FROM tblABC
        WHERE aa = @aa
        ORDER BY aa, bb DESC, cc
    FETCH NEXT FROM aaCountCursor INTO @aa, @aaCount;
END;

CLOSE aaCountCursor
DEALLOCATE aaCountCursor

SELECT * from tblResults
ORDER BY aa, bb, cc  -- OR .. bb DESC ... for a more traditional order.
On the question of needing an index or not. (cf OP's remark)
When merely running a "SELECT * FROM myTable", a table scan is effectively the fastest approach; no need to bother with indexes. However, the main reason why SQL is typically better suited for this kind of thing (aside from being the repository where the data has been accumulating in the first place, whereas any external solution needs to account for the time to export the relevant data) is that it can rely on indexes to avoid scanning. Many general-purpose languages are far better suited to handle raw processing, but they are fighting an unfair battle with SQL because they need to rebuild any prior knowledge of the data which SQL has gathered in the course of its data collection/import phase. Since sorting is a typically time- and sometimes space-consuming task, SQL and its relatively slower processing power often end up ahead of alternative solutions.
Also, even without pre-built indexes, modern query optimizers may decide on a plan that involves the creation of a temporary index. And, because sorting is an intrinsic part of a DBMS, SQL servers are generally efficient in that area.
So... Is SQL better?
This said, if we are trying to compare SQL and other languages for pure ETL jobs, i.e. for dealing with heaps (unindexed tables) as input, performing various transformations and filtering, it is likely that multi-thread-able utilities written in, say, C, leveraging efficient sorting libraries, would be faster. The determining question in deciding on a SQL vs. non-SQL approach is where the data is located and where it should eventually reside. If we merely need to convert a file to be supplied down "the chain", external programs are better suited. If we have or need the data in a SQL server, there are only rare cases that make it worthwhile to export and process it externally.
You could use smarter data structures and still use Python.
I've run your reference implementation and my Python implementation on my machine and even compared the outputs to be sure of the results.
This is yours:
$ time python ./ref.py < data-large.txt > ref-large.txt
real 1m57.689s
user 1m56.104s
sys 0m0.573s
This is mine:
$ time python ./my.py < data-large.txt > my-large.txt
real 1m35.132s
user 1m34.649s
sys 0m0.261s
$ diff my-large.txt ref-large.txt
$ echo $?
0
And this is the source:
#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
import heapq

top_5 = {}

for line in sys.stdin:
    aa, bb, cc = line.split()
    # We want the top 5 for each distinct value of aa. There are
    # hundreds of thousands of values of aa.
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    if len(current) < 5:
        heapq.heappush(current, (bb, cc))
    else:
        if current[0] < (bb, cc):
            heapq.heapreplace(current, (bb, cc))

for aa in top_5:
    current = top_5[aa]
    while len(current) > 0:
        bb, cc = heapq.heappop(current)
        print aa, bb, cc
Update: Know your limits.
I've also timed a noop code, to know the fastest possible python solution with code similar to the original:
$ time python noop.py < data-large.txt > noop-large.txt
real 1m20.143s
user 1m19.846s
sys 0m0.267s
And the noop.py itself:
#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
import heapq

top_5 = {}

for line in sys.stdin:
    aa, bb, cc = line.split()
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    if len(current) < 5:
        current.append((bb, cc))

for aa in top_5:
    current = top_5[aa]
    current.sort()
    for bb, cc in current[-5:]:
        print aa, bb, cc
This took 45.7s on my machine with 27M rows of data that looked like this:
42 0.49357 0
96 0.48075 1
27 0.640761 2
8 0.389128 3
75 0.395476 4
24 0.212069 5
80 0.121367 6
81 0.271959 7
91 0.18581 8
69 0.258922 9
Your script took 1m42 on this data, and the C++ example took 1m46 (g++ t.cpp -o t to compile it; I don't know anything about C++).
Java 6, not that it matters really. Output isn't perfect, but it's easy to fix.
package top5;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class Main {

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        Map<String, Pair[]> top5map = new TreeMap<String, Pair[]>();
        BufferedReader br = new BufferedReader(new FileReader("/tmp/file.dat"));

        String line = br.readLine();
        while (line != null) {
            String parts[] = line.split(" ");
            String key = parts[0];
            double score = Double.valueOf(parts[1]);
            String value = parts[2];
            Pair[] pairs = top5map.get(key);

            boolean insert = false;
            Pair p = null;
            if (pairs != null) {
                insert = (score > pairs[pairs.length - 1].score) || pairs.length < 5;
            } else {
                insert = true;
            }
            if (insert) {
                p = new Pair(score, value);
                if (pairs == null) {
                    pairs = new Pair[1];
                    pairs[0] = new Pair(score, value);
                } else {
                    if (pairs.length < 5) {
                        Pair[] newpairs = new Pair[pairs.length + 1];
                        System.arraycopy(pairs, 0, newpairs, 0, pairs.length);
                        pairs = newpairs;
                    }
                    int k = 0;
                    for (int i = pairs.length - 2; i >= 0; i--) {
                        if (pairs[i].score <= p.score) {
                            pairs[i + 1] = pairs[i];
                        } else {
                            k = i + 1;
                            break;
                        }
                    }
                    pairs[k] = p;
                }
                top5map.put(key, pairs);
            }
            line = br.readLine();
        }

        for (Map.Entry<String, Pair[]> e : top5map.entrySet()) {
            System.out.print(e.getKey());
            System.out.print(" ");
            System.out.println(Arrays.toString(e.getValue()));
        }
        System.out.println(System.currentTimeMillis() - start);
    }

    static class Pair {
        double score;
        String value;

        public Pair(double score, String value) {
            this.score = score;
            this.value = value;
        }

        public int compareTo(Object o) {
            Pair p = (Pair) o;
            return (int) Math.signum(score - p.score);
        }

        public String toString() {
            return String.valueOf(score) + ", " + value;
        }
    }
}
AWK script to fake the data:
BEGIN {
    for (i = 0; i < 27000000; i++) {
        v = rand();
        k = int(rand() * 100);
        print k " " v " " i;
    }
    exit;
}
This is a sketch in Common Lisp.
Note that for long files there is a penalty for using READ-LINE, because it conses a fresh string for each line. Then use one of the derivatives of READ-LINE that are floating around, which use a line buffer. Also, you might check whether you want the hash table to be case sensitive or not.
second version
Splitting the string is no longer needed, because we do it here. It is low-level code, written in the hope that some speed gains will be possible. It checks for one or more spaces as field delimiters, and also tabs.
(defun read-a-line (stream)
  (let ((line (read-line stream nil nil)))
    (flet ((delimiter-p (c)
             (or (char= c #\space) (char= c #\tab))))
      (when line
        (let* ((s0 (position-if #'delimiter-p line))
               (s1 (position-if-not #'delimiter-p line :start s0))
               (s2 (position-if #'delimiter-p line :start (1+ s1)))
               (s3 (position-if #'delimiter-p line :from-end t)))
          (values (subseq line 0 s0)
                  (list (read-from-string line nil nil :start s1 :end s2)
                        (subseq line (1+ s3)))))))))
Above function returns two values: the key and a list of the rest.
(defun dbscan (top-5-table stream)
  "get triples from each line and put them in the hash table"
  (loop with aa = nil and bbcc = nil do
        (multiple-value-setq (aa bbcc) (read-a-line stream))
        while aa do
        (setf (gethash aa top-5-table)
              (let ((l (merge 'list (gethash aa top-5-table) (list bbcc)
                              #'> :key #'first)))
                (or (and (nth 5 l) (subseq l 0 5)) l)))))
(defun dbprint (table output)
  "print the hashtable contents"
  (maphash (lambda (aa value)
             (loop for (bb cc) in value
                   do (format output "~a ~a ~a~%" aa bb cc)))
           table))

(defun dbsum (input &optional (output *standard-output*))
  "scan and sum from a stream"
  (let ((top-5-table (make-hash-table :test #'equal)))
    (dbscan top-5-table input)
    (dbprint top-5-table output)))

(defun fsum (infile outfile)
  "scan and sum a file"
  (with-open-file (input infile :direction :input)
    (with-open-file (output outfile
                            :direction :output :if-exists :supersede)
      (dbsum input output))))
some test data
(defun create-test-data (&key (file "/tmp/test.data") (n-lines 100000))
  (with-open-file (stream file :direction :output :if-exists :supersede)
    (loop repeat n-lines
          do (format stream "~a ~a ~a~%"
                     (random 1000) (random 100.0) (random 10000)))))

; (create-test-data)

(defun test ()
  (time (fsum "/tmp/test.data" "/tmp/result.data")))
third version, LispWorks
Uses some SPLIT-STRING and PARSE-FLOAT functions, otherwise generic CL.
(defun fsum (infile outfile)
  (let ((top-5-table (make-hash-table :size 50000000 :test #'equal)))
    (with-open-file (input infile :direction :input)
      (loop for line = (read-line input nil nil)
            while line do
            (destructuring-bind (aa bb cc) (split-string '(#\space #\tab) line)
              (setf bb (parse-float bb))
              (let ((v (gethash aa top-5-table)))
                (unless v
                  (setf (gethash aa top-5-table)
                        (setf v (make-array 6 :fill-pointer 0))))
                (vector-push (cons bb cc) v)
                (when (> (length v) 5)
                  (setf (fill-pointer (sort v #'> :key #'car)) 5))))))
    (with-open-file (output outfile :direction :output :if-exists :supersede)
      (maphash (lambda (aa value)
                 (loop for (bb . cc) across value do
                       (format output "~a ~f ~a~%" aa bb cc)))
               top-5-table))))
Here is one more OCaml version - targeted for speed - with a custom parser on Streams. Too long, but parts of the parser are reusable. Thanks to peufeu for triggering the competition :)
Speed :
simple ocaml - 27 sec
ocaml with Stream parser - 15 sec
c with manual parser - 5 sec
Compile with :
ocamlopt -pp camlp4o code.ml -o caml
Code :
open Printf

let cmp x y = compare (fst x : float) (fst y)
let digit c = Char.code c - Char.code '0'

let rec parse f = parser
  | [< a=int; _=spaces; b=float; _=spaces;
       c=rest (Buffer.create 100); t >] -> f a b c; parse f t
  | [< >] -> ()
and int = parser
  | [< ''0'..'9' as c; t >] -> int_ (digit c) t
  | [< ''-'; ''0'..'9' as c; t >] -> - (int_ (digit c) t)
and int_ n = parser
  | [< ''0'..'9' as c; t >] -> int_ (n * 10 + digit c) t
  | [< >] -> n
and float = parser
  | [< n=int; t=frem; e=fexp >] -> (float_of_int n +. t) *. (10. ** e)
and frem = parser
  | [< ''.'; r=frem_ 0.0 10. >] -> r
  | [< >] -> 0.0
and frem_ f base = parser
  | [< ''0'..'9' as c; t >] ->
      frem_ (float_of_int (digit c) /. base +. f) (base *. 10.) t
  | [< >] -> f
and fexp = parser
  | [< ''e'; e=int >] -> float_of_int e
  | [< >] -> 0.0
and spaces = parser
  | [< '' '; t >] -> spaces t
  | [< ''\t'; t >] -> spaces t
  | [< >] -> ()
and crlf = parser
  | [< ''\r'; t >] -> crlf t
  | [< ''\n'; t >] -> crlf t
  | [< >] -> ()
and rest b = parser
  | [< ''\r'; _=crlf >] -> Buffer.contents b
  | [< ''\n'; _=crlf >] -> Buffer.contents b
  | [< 'c; t >] -> Buffer.add_char b c; rest b t
  | [< >] -> Buffer.contents b

let () =
  let all = Array.make 200 [] in
  let each a b c =
    assert (a >= 0 && a < 200);
    match all.(a) with
    | [] -> all.(a) <- [b,c]
    | (bmin,_) as prev::tl -> if b > bmin then
        begin
          let m = List.sort cmp ((b,c)::tl) in
          all.(a) <- if List.length tl < 4 then prev::m else m
        end
  in
  parse each (Stream.of_channel stdin);
  Array.iteri
    (fun a -> List.iter (fun (b,c) -> printf "%i %f %s\n" a b c))
    all
Of all the programs in this thread that I've tested so far, the OCaml version is the fastest and also among the shortest. (Line-of-code-based measurements are a little fuzzy, but it's not clearly longer than the Python version or the C or C++ versions, and it is clearly faster.)
Note: I figured out why my earlier runtimes were so nondeterministic! My CPU heatsink was clogged with dust and my CPU was overheating as a result. Now I am getting nice deterministic benchmark times. I've now redone all the timing measurements in this thread, since I finally have a reliable way to time things.
Here are the timings for the different versions so far, running on a 27-million-row 630-megabyte input data file. I'm on Ubuntu Intrepid Ibex on a dual-core 1.6GHz Celeron, running a 32-bit version of the OS (the Ethernet driver was broken in the 64-bit version). I ran each program five times and report the range of times those five tries took. I'm using Python 2.5.2, OpenJDK 1.6.0.0, OCaml 3.10.2, GCC 4.3.2, SBCL 1.0.8.debian, and Octave 3.0.1.
SquareCog's Pig version: not yet tested (because I can't just apt-get install pig), 7 lines of code.
mjv's pure SQL version: not yet tested, but I predict a runtime of several days; 7 lines of code.
ygrek's OCaml version: 68.7 seconds ±0.9 in 15 lines of code.
My Python version: 169 seconds ±4 or 86 seconds ±2 with Psyco, in 16 lines of code.
abbot's heap-based Python version: 177 seconds ±5 in 18 lines of code, or 83 seconds ±5 with Psyco.
My C version below, composed with GNU sort -n: 90 + 5.5 seconds (±3, ±0.1), but gives the wrong answer because of a deficiency in GNU sort, in 22 lines of code (including one line of shell.)
hrnt's C++ version: 217 seconds ±3 in 25 lines of code.
mjv's alternative SQL-based procedural approach: not yet tested, 26 lines of code.
mjv's first SQL-based procedural approach: not yet tested, 29 lines of code.
peufeu's Python version with Psyco: 181 seconds ±4, somewhere around 30 lines of code.
Rainer Joswig's Common Lisp version: 478 seconds (only run once) in 42 lines of code.
abbot's noop.py, which intentionally gives incorrect results to establish a lower bound: not yet tested, 15 lines of code.
Will Hartung's Java version: 96 seconds ±10 in, according to David A. Wheeler’s SLOCCount, 74 lines of code.
Greg's Matlab version: doesn't work.
Schuyler Erle's suggestion of using Pyrex on one of the Python versions: not yet tried.
I suspect abbot's version comes out relatively worse for me than for them because the real dataset has a highly nonuniform distribution: as I said, some aa values ("players") have thousands of lines, while others have only one.
About Psyco: I applied Psyco to my original code (and abbot's version) by putting it in a main function, which by itself cut the time down to about 140 seconds, and calling psyco.full() before calling main(). This added about four lines of code.
I can almost solve the problem using GNU sort, as follows:
kragen@inexorable:~/devel$ time LANG=C sort -nr infile -o sorted
real 1m27.476s
user 0m59.472s
sys 0m8.549s
kragen@inexorable:~/devel$ time ./top5_sorted_c < sorted > outfile
real 0m5.515s
user 0m4.868s
sys 0m0.452s
Here top5_sorted_c is this short C program:
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

enum { linesize = 1024 };

char buf[linesize];
char key[linesize];  /* last key seen */

int main() {
    int n = 0;
    char *p;

    while (fgets(buf, linesize, stdin)) {
        for (p = buf; *p && !isspace(*p); p++)  /* find end of key on this line */
            ;
        if (p - buf != strlen(key) || 0 != memcmp(buf, key, p - buf))
            n = 0;                /* this is a new key */
        n++;

        if (n <= 5)               /* copy up to five lines for each key */
            if (fputs(buf, stdout) == EOF) abort();

        if (n == 1) {             /* save new key in `key` */
            memcpy(key, buf, p - buf);
            key[p - buf] = '\0';
        }
    }
    return 0;
}
I first tried writing that program in C++ as follows, and I got runtimes which were substantially slower, at 33.6±2.3 seconds instead of 5.5±0.1 seconds:
#include <map>
#include <iostream>
#include <string>

int main() {
    using namespace std;
    int n = 0;
    string prev, aa, bb, cc;
    while (cin >> aa >> bb >> cc) {
        if (aa != prev) n = 0;
        ++n;
        if (n <= 5) cout << aa << " " << bb << " " << cc << endl;
        prev = aa;
    }
    return 0;
}
I did say almost. The problem is that sort -n does okay for most of the data, but it fails when it's trying to compare 0.33 with 3.78168e-05. So to get this kind of performance and actually solve the problem, I need a better sort. (GNU sort's -g option does general numeric comparison and handles scientific notation, though it is slower than -n.)
Anyway, I kind of feel like I'm whining, but the sort-and-filter approach is about 5× faster than the Python program, while the elegant STL program from hrnt is actually a little slower — there seems to be some kind of gross inefficiency in <iostream>. I don't know where the other 83% of the runtime is going in that little C++ version of the filter, but it isn't going anywhere useful, which makes me suspect I don't know where it's going in hrnt's std::map version either. Could that version be sped up 5× too? Because that would be pretty cool. Its working set might be bigger than my L2 cache, but as it happens it probably isn't.
Some investigation with callgrind says my filter program in C++ is executing 97% of its instructions inside of operator >>. I can identify at least 10 function calls per input byte, and cin.sync_with_stdio(false); doesn’t help. This probably means I could get hrnt’s C program to run substantially faster by parsing input lines more efficiently.
Edit: kcachegrind claims that hrnt’s program executes 62% of its instructions (on a small 157000 line input file) extracting doubles from an istream. A substantial part of this is because the istreams library apparently executes about 13 function calls per input byte when trying to parse a double. Insane. Could I be misunderstanding kcachegrind's output?
Anyway, any other suggestions?
Pretty straightforward Caml (27 * 10^6 rows -- 27 sec, C++ by hrnt -- 29 sec)
open Printf
open ExtLib

let (>>) x f = f x
let cmp x y = compare (fst x : float) (fst y)
let wsp = Str.regexp "[ \t]+"

let () =
  let all = Hashtbl.create 1024 in
  Std.input_lines stdin >> Enum.iter (fun line ->
    let [a;b;c] = Str.split wsp line in
    let b = float_of_string b in
    try
      match Hashtbl.find all a with
      | [] -> assert false
      | (bmin,_) as prev::tl -> if b > bmin then
          begin
            let m = List.sort ~cmp ((b,c)::tl) in
            Hashtbl.replace all a (if List.length tl < 4 then prev::m else m)
          end
    with Not_found -> Hashtbl.add all a [b,c]
  );
  all >> Hashtbl.iter (fun a -> List.iter (fun (b,c) -> printf "%s %f %s\n" a b c))
Here is a C++ solution. I didn't have a lot of data to test it with, however, so I don't know how fast it actually is.
[edit] Thanks to the test data provided by the awk script in this thread, I
managed to clean up and speed up the code a bit. I am not trying to find out the fastest possible version - the intent is to provide a reasonably fast version that isn't as ugly as people seem to think STL solutions can be.
This version should be about twice as fast as the first version (goes through 27 million lines in about 35 seconds). Gcc users, remember to
compile this with -O2.
#include <map>
#include <iostream>
#include <functional>
#include <utility>
#include <string>

int main() {
    using namespace std;
    typedef std::map<string, std::multimap<double, string> > Map;
    Map m;
    string aa, cc;
    double bb;
    std::cin.sync_with_stdio(false);  // Dunno if this has any effect, but anyways.

    while (std::cin >> aa >> bb >> cc)
    {
        if (m[aa].size() == 5)
        {
            Map::mapped_type::iterator iter = m[aa].begin();
            if (bb < iter->first)
                continue;
            m[aa].erase(iter);
        }
        m[aa].insert(make_pair(bb, cc));
    }

    for (Map::const_iterator iter = m.begin(); iter != m.end(); ++iter)
        for (Map::mapped_type::const_iterator iter2 = iter->second.begin();
             iter2 != iter->second.end();
             ++iter2)
            std::cout << iter->first << " " << iter2->first << " "
                      << iter2->second << std::endl;
}
Interestingly, the original Python solution is by far the cleanest looking (although the C++ example comes close).
How about using Pyrex or Psyco on your original code?
Has anybody tried doing this problem with just awk? Specifically 'mawk'? It should be faster than even Java and C++, according to this blog post: http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/
EDIT: Just wanted to clarify that the only claim being made in that blog post is that for a certain class of problems that are specifically suited to awk-style processing, the mawk virtual machine can beat 'vanilla' implementations in Java and C++.
Since you asked about Matlab, here's how I did something like what you're asking for. I tried to do it without any for loops, but I do have one because I didn't care to take a long time with it. If you were worried about memory then you could pull data from the stream in chunks with fscanf rather than reading the entire buffer.
fid = fopen('fakedata.txt','r');
tic
A = fscanf(fid,'%d %d %d\n');
A = reshape(A,3,length(A)/3)'; % Matlab reads the data into one long column
Names = unique(A(:,1));
for i = 1:length(Names)
    indices = find(A(:,1)==Names(i));               % Grab all instances of key i
    [Y,I] = sort(A(indices,2),1,'descend');         % sort in descending order of 2nd record
    A(indices(I(1:min([5,length(indices(I))]))),:)  % Print the top five
end
toc
fclose(fid)
Speaking of lower bounds on compute time :
Let's analyze my algo above :
for each row (key, score, id):
    create or fetch a list of top scores for the row's key
    if len(this list) < N:
        append current
    else if current score > minimum score in list:
        replace minimum of list with current row
        update minimum of all lists if needed
Let N be the N in top-N
Let R be the number of rows in your data set
Let K be the number of distinct keys
What assumptions can we make ?
R * sizeof( row ) > RAM or at least it's big enough that we don't want to load it all, use a hash to group by key, and sort each bin. For the same reason we don't sort the whole stuff.
Kragen likes hashtables, so K * sizeof(per-key state) << RAM, most probably it fits in L2/3 cache
Kragen is not sorting, so K*N << R ie each key has much more than N entries
(note : A << B means A is small relative to B)
If the data has a random distribution, then
after a small number of rows, the majority of rows will be rejected by the per-key minimum condition, the cost is 1 comparison per row.
So the cost per row is 1 hash lookup + 1 comparison + epsilon * (list insertion + (N+1) comparisons for the minimum)
If the scores have a random distribution (say between 0 and 1) and the conditions above hold, both epsilons will be very small.
Experimental proof :
The 27 million rows dataset above produces 5933 insertions into the top-N lists. All other rows are rejected by a simple key lookup and comparison. epsilon = 0.0001
So roughly, the cost is 1 lookup + comparison per row, which takes a few nanoseconds.
On current hardware, there is no way this is not going to be negligible versus IO cost and especially parsing costs.
Isn't this just as simple as
SELECT DISTINCT aa, bb, cc FROM tablename ORDER BY bb DESC LIMIT 5
?
Of course, it's hard to tell what would be fastest without testing it against the data. And if this is something you need to run very fast, it might make sense to optimize your database to make the query faster, rather than optimizing the query.
And, of course, if you need the flat file anyway, you might as well use that.
Pick "top 5" would look something like this. Note that there's no sorting. Nor does any list in the top_5 dictionary ever grow beyond 5 elements.
from collections import defaultdict
import sys
def keep_5( aList, aPair ):
    minbb = min( bb for bb,cc in aList )
    bb, cc = aPair
    if bb < minbb: return aList
    aList.append( aPair )
    min_i = 0
    for i in xrange(1,6):
        if aList[i][0] < aList[min_i][0]:
            min_i = i
    aList.pop(min_i)
    return aList

top_5 = defaultdict(list)

for row in sys.stdin:
    aa, bb, cc = row.split()
    bb = float(bb)
    if len(top_5[aa]) < 5:
        top_5[aa].append( (bb,cc) )
    else:
        top_5[aa] = keep_5( top_5[aa], (bb,cc) )
The Pig version would go something like this (untested):
Data = LOAD '/my/data' using PigStorage() as (aa:int, bb:float, cc:chararray);
grp = GROUP Data by aa;
topK = FOREACH grp {
    sorted = ORDER Data by bb DESC;
    lim = LIMIT sorted 5;
    GENERATE group as aa, lim;
}
STORE topK INTO '/my/output' using PigStorage();
Pig isn't optimized for performance; its goal is to enable processing of multi-terabyte datasets using parallel execution frameworks. It does have a local mode, so you can try it, but I doubt it will beat your script.
That was a nice lunch break challenge, he he.
Top-N is a well-known database killer. As shown by the post above, there is no way to express it efficiently in common SQL.
As for the various implementations, you've got to keep in mind that the slow part in this is not the sorting or the top-N, it's the parsing of text. Have you looked at the source code for glibc's strtod() lately?
For instance, I get, using Python :
Read data : 80.5 s
My TopN : 34.41 s
HeapTopN : 30.34 s
It is quite likely that you'll never get very fast timings, no matter what language you use, unless your data is in some format that is a lot faster to parse than text. For instance, loading the test data into postgres takes 70 s, and the majority of that is text parsing, too.
If the N in your topN is small, like 5, a C implementation of my algorithm below would probably be the fastest. If N can be larger, heaps are a much better option.
So, since your data is probably in a database, and your problem is getting at the data, not the actual processing, if you're really in need of a super fast TopN engine, what you should do is write a C module for your database of choice. Since postgres is faster for about anything, I suggest using postgres, plus it isn't difficult to write a C module for it.
Here's my Python code :
import random, sys, time, heapq

ROWS = 27000000

def make_data( fname ):
    f = open( fname, "w" )
    r = random.Random()
    for i in xrange( 0, ROWS, 10000 ):
        for j in xrange( i,i+10000 ):
            f.write( "%d %f %d\n" % (r.randint(0,100), r.uniform(0,1000), j))
        print ("write: %d\r" % i),
        sys.stdout.flush()
    print

def read_data( fname ):
    for n, line in enumerate( open( fname ) ):
        r = line.strip().split()
        yield int(r[0]),float(r[1]),r[2]
        if not (n % 10000 ):
            print ("read: %d\r" % n),
            sys.stdout.flush()
    print

def topn( ntop, data ):
    ntop -= 1
    assert ntop > 0
    min_by_key = {}
    top_by_key = {}
    for key,value,label in data:
        tup = (value,label)
        if key not in top_by_key:
            # initialize
            top_by_key[key] = [ tup ]
        else:
            top = top_by_key[ key ]
            l = len( top )
            if l > ntop:
                # replace minimum value in top if it is lower than current value
                idx = min_by_key[ key ]
                if top[idx] < tup:
                    top[idx] = tup
                    min_by_key[ key ] = top.index( min( top ) )
            elif l < ntop:
                # fill until we have ntop entries
                top.append( tup )
            else:
                # we have ntop entries in list, we'll have ntop+1
                top.append( tup )
                # initialize minimum to keep
                min_by_key[ key ] = top.index( min( top ) )
    # finalize:
    return dict( (key, sorted( values, reverse=True )) for key,values in top_by_key.iteritems() )

def grouptopn( ntop, data ):
    top_by_key = {}
    for key,value,label in data:
        if key in top_by_key:
            top_by_key[ key ].append( (value,label) )
        else:
            top_by_key[ key ] = [ (value,label) ]
    return dict( (key, sorted( values, reverse=True )[:ntop]) for key,values in top_by_key.iteritems() )

def heaptopn( ntop, data ):
    top_by_key = {}
    for key,value,label in data:
        tup = (value,label)
        if key not in top_by_key:
            top_by_key[ key ] = [ tup ]
        else:
            top = top_by_key[ key ]
            if len(top) < ntop:
                heapq.heappush(top, tup)
            else:
                if top[0] < tup:
                    heapq.heapreplace(top, tup)
    return dict( (key, sorted( values, reverse=True )) for key,values in top_by_key.iteritems() )

def dummy( data ):
    for row in data:
        pass

make_data( "data.txt" )

t = time.clock()
dummy( read_data( "data.txt" ) )
t_read = time.clock() - t

t = time.clock()
top_result = topn( 5, read_data( "data.txt" ) )
t_topn = time.clock() - t

t = time.clock()
htop_result = heaptopn( 5, read_data( "data.txt" ) )
t_htopn = time.clock() - t

# correctness checking :
for key in top_result:
    print key, " : ", " ".join (("%f:%s"%(value,label)) for (value,label) in top_result[key])
    print key, " : ", " ".join (("%f:%s"%(value,label)) for (value,label) in htop_result[key])

print
print "Read data :", t_read
print "TopN : ", t_topn - t_read
print "HeapTopN : ", t_htopn - t_read

for key in top_result:
    assert top_result[key] == htop_result[key]
I love lunch break challenges. Here's a 1-hour implementation.
OK, when you don't want to do some extremely exotic crap like additions, nothing stops you from using a custom base-10 floating-point format whose only implemented operator is comparison, right? lol.
I had some fast-atoi code lying around from a previous project, so I just imported that.
http://www.copypastecode.com/11541/
This C source code takes about 6.6 seconds to parse the 580 MB of input text (27 million lines); half of that time is fgets, lol. Then it takes approximately 0.05 seconds to compute the top-n, but I don't know for sure, since the time the top-n takes is less than the timer noise.
You'll be the one to test it for correctness though XDDDDDDDDDDD
Interesting huh ?
Well, please grab a coffee and read the source code for strtod -- it's mind-boggling, but needed, if you want float -> text -> float to give back the same float you started with.... really...
Parsing integers is a lot faster (not so much in python, though, but in C, yes).
Anyway, putting the data in a Postgres table :
SELECT count( key ) FROM the dataset in the above program
=> 7 s (so it takes 7 s to read the 27M records)
CREATE INDEX topn_key_value ON topn( key, value );
191 s
CREATE TEMPORARY TABLE topkeys AS SELECT key FROM topn GROUP BY key;
12 s
(You can use the index to get distinct values of 'key' faster too but it requires some light plpgsql hacking)
CREATE TEMPORARY TABLE top AS SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1) AS r FROM topkeys a) foo;
Time: 15,310 ms
INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 1) AS r FROM topkeys a) foo;
Time: 17,853 ms
INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 2) AS r FROM topkeys a) foo;
Time: 13,983 ms
INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 3) AS r FROM topkeys a) foo;
Time: 16,860 ms
INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 4) AS r FROM topkeys a) foo;
Time: 17,651 ms
INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 5) AS r FROM topkeys a) foo;
Time: 19,216 ms
SELECT * FROM top ORDER BY key,value;
As you can see computing the top-n is extremely fast (provided n is small) but creating the (mandatory) index is extremely slow because it involves a full sort.
Your best bet is to use a format that is fast to parse (either binary, or write a custom C aggregate for your database, which would be the best choice IMHO). The runtime in the C program shouldn't be more than 1s if python can do it in 1 s.
