Parallel execution of bash task from python, extracting csv columns - python

I have a csv file with 7,221,032 columns and 37 rows. I need to map each column to a separate file, ideally from a python script. My attempt so far:
import os
import numpy as np

num_features = 7221032
binary_dir = "data_binary"
command_template = 'awk -F "\\"*,\\"*" \'{print $%s}\' %s/images_binary.txt > %s/feature_files/pixel_%s.vector'
batch_size = 100
batch_indexes = np.arange(1, num_features, batch_size)
for batch_index in batch_indexes[1:5]:
    indexes = range(batch_index - batch_size, batch_index)
    commands = [command_template % (str(i), binary_dir, binary_dir, str(i)) for i in indexes]
    # One awk process per column; each one re-reads the whole input file
    for command in commands:
        os.system(command)
However, this is rather slow. Any advice on how to speed it up?

Revised solution - using Perl
Run with perl prog.pl < /path/to/images_binary.txt
Runtime is 10 sec for 100,000 items, so it will take about 7 hours for the complete data set (7,221,032 x 37 values). I am not sure that running in parallel will do better, since the bottleneck is the opening and closing of files. Your best option for improving performance is to reduce the number of generated files, or to somehow get the input written in column order first.
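#! /usr/bin/perl
while ( my $x = <> ) {
    chomp $x ;
    my @v = split(',', $x) ;
    foreach my $i (0..$#v) {
        # Append (>>) so each input row adds one line to every column file;
        # start from an empty feature_files directory.
        open OF, ">>data_binary/feature_files/pixel_$i.vector" ;
        print OF $v[$i], "\n" ;
        close OF ;
    } ;
} ;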
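Since the question asked for Python, here is a rough sketch of the "write in column order" idea: read all rows once, then open each output file exactly once. It assumes plain comma-separated values; the function name `split_columns` and the 1-based file numbering are my own choices for illustration. It holds the whole file in memory, so for a file this wide you may have to process ranges of columns in several passes.

```python
import csv
import os


def split_columns(csv_path, out_dir):
    """Write column i of csv_path to out_dir/pixel_<i>.vector (1-based numbering)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="") as src:
        # Assumption: the file is small enough to keep every row in memory
        rows = list(csv.reader(src))
    written = 0
    # zip(*rows) walks the table column by column
    for i, column in enumerate(zip(*rows), start=1):
        with open(os.path.join(out_dir, "pixel_%d.vector" % i), "w") as dst:
            dst.write("\n".join(column) + "\n")
        written += 1
    return written
```

Each output file is opened once instead of once per row, which is where the Perl version spends most of its time.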


Throttling a step in beam application

I'm using the Apache Beam Python SDK on Google Dataflow. My pipeline looks like this:
Read image urls from file >> Download images >> Process images
The problem is that I can't let the Download images step scale as much as it wants, because my application can get blocked by the image server.
Is there a way to throttle the step, either on input or on output per minute?
Thank you.
One possibility, maybe naïve, is to introduce a sleep in the step. For this you need to know the maximum number of instances of the ParDo that can be running at the same time. If autoscalingAlgorithm is set to NONE you can obtain that from numWorkers and workerMachineType (DataflowPipelineOptions). More precisely, the rate available to each thread is the desired rate divided by the total number of threads: desired_rate/(num_workers*num_threads(per worker)). The sleep time is the inverse of that per-thread rate:
Integer desired_rate = 1; // QPS limit
int num_workers;
int num_threads;
if (options.getNumWorkers() == 0) { num_workers = 1; }
else { num_workers = options.getNumWorkers(); }
if (options.getWorkerMachineType() != null) {
    String machine_type = options.getWorkerMachineType();
    // Takes the number after the last "-" in the machine type name as the thread count
    num_threads = Integer.parseInt(machine_type.substring(machine_type.lastIndexOf("-") + 1));
}
else { num_threads = 1; }
Double sleep_time = (double)(num_workers * num_threads) / (double)(desired_rate);
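As the question is about the Python SDK, the same arithmetic can be sketched in Python. The function name `sleep_seconds` is made up for this example, and the parsing assumes machine type names such as "n1-standard-4", where the trailing number is taken as the thread count, as in the Java snippet above.

```python
def sleep_seconds(desired_qps, num_workers, machine_type=None):
    """Per-thread sleep so that all threads together stay near desired_qps."""
    workers = num_workers if num_workers > 0 else 1
    threads = 1
    if machine_type:
        # Assumption: the suffix after the last "-" is the number of threads
        suffix = machine_type.rsplit("-", 1)[-1]
        if suffix.isdigit():
            threads = int(suffix)
    return (workers * threads) / float(desired_qps)


# 2 workers x 4 threads at 1 QPS: each thread sleeps 8 seconds per element
print(sleep_seconds(1, 2, "n1-standard-4"))
```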
Then you can use TimeUnit.SECONDS.sleep(sleep_time.intValue()); or equivalent inside the throttled Fn. In my example, as a use case, I wanted to read from a public file, parse out the empty lines and call the Natural Language Processing API with a maximum rate of 1 QPS (I initialized desired_rate to 1 previously):
p
    .apply("Read Lines", TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"))
    .apply("Omit Empty Lines", ParDo.of(new OmitEmptyLines()))
    .apply("NLP requests", ParDo.of(new ThrottledFn()))
    .apply("Write Lines", TextIO.write().to(options.getOutput()));
The rate-limited Fn is ThrottledFn, notice the sleep function:
static class ThrottledFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        // Instantiates a client
        try (LanguageServiceClient language = LanguageServiceClient.create()) {
            // The text to analyze
            String text = c.element();
            Document doc = Document.newBuilder()
                .setContent(text).setType(Type.PLAIN_TEXT).build();
            // Detects the sentiment of the text
            Sentiment sentiment = language.analyzeSentiment(doc).getDocumentSentiment();
            String nlp_results = String.format("Sentiment: score %s, magnitude %s", sentiment.getScore(), sentiment.getMagnitude());
            // Throttle: each thread waits sleep_time seconds after every request
            TimeUnit.SECONDS.sleep(sleep_time.intValue());
            Log.info(nlp_results);
            c.output(nlp_results);
        }
    }
}
With this I get a rate of 1 element/s and avoid hitting the quota when using multiple workers, even if requests are not really spread out (you might get 8 simultaneous requests and then an 8 s sleep, etc.). This was just a test; a better implementation would possibly use Guava's RateLimiter.
If the pipeline is using autoscaling (THROUGHPUT_BASED) then it would be more complicated and the number of workers should be updated (for example, Stackdriver Monitoring has a job/current_num_vcpus metric). Other general considerations would be controlling the number of parallel ParDos by using a dummy GroupByKey or splitting the source with splitIntoBundles, etc. I'd like to see if there are other nicer solutions.
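For the Python side, a rate limiter in the spirit of that suggestion could look roughly like the sketch below. This is not Guava's implementation, only a minimal fixed-interval limiter; the class name `MinIntervalLimiter` is hypothetical. It only limits calls within one process, so on Dataflow the rate given to each worker would still have to be the overall rate divided by the number of workers.

```python
import threading
import time


class MinIntervalLimiter(object):
    """Lets acquire() return at most `rate` times per second within one process."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()
        self.lock = threading.Lock()

    def acquire(self):
        # Holding the lock while sleeping makes concurrent callers queue up
        with self.lock:
            now = self.clock()
            wait = max(0.0, self.next_allowed - now)
            if wait:
                self.sleep(wait)
            self.next_allowed = now + wait + self.interval
            return wait
```

Inside the throttled step you would create one limiter per process and call `limiter.acquire()` before each download.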

NetCDF file - why is file 1/3 size after fixing record dimension?

I am struggling to get to grips with this.
I create a netcdf4 file with the following dimensions and variables (note in particular the unlimited point dimension):
dimensions:
    point = UNLIMITED ; // (275935 currently)
    realization = 24 ;
variables:
    short mod_hs(realization, point) ;
        mod_hs:scale_factor = 0.01 ;
    short mod_ws(realization, point) ;
        mod_ws:scale_factor = 0.01 ;
    short obs_hs(point) ;
        obs_hs:scale_factor = 0.01 ;
    short obs_ws(point) ;
        obs_ws:scale_factor = 0.01 ;
    short fchr(point) ;
    float obs_lat(point) ;
    float obs_lon(point) ;
    double obs_datetime(point) ;
}
I have a Python program that populates this file with data in a loop (hence the unlimited record dimension: I don't know a priori how big the file will be).
After populating the file, it is 103MB in size.
My issue is that reading data from this file is quite slow. I guessed that this has something to do with chunking and the unlimited point dimension?
I ran ncks --fix_rec_dmn on the file and (after a lot of churning) it produced a new netCDF file that is only 32MB in size (which is about the right size for the data it contains).
This is a massive difference in size - why is the original file so bloated? Also - accessing the data in this file is orders of magnitude quicker. For example, in Python, to read in the contents of the hs variable takes 2 seconds on the original file and 40 milliseconds on the fixed record dimension file.
The problem I have is that some of my files contain a lot of points and seem to be too big to run ncks on (my machine runs out of memory and I have 8GB), so I can't convert all the data to a fixed record dimension.
Can anyone explain why the file sizes are so different and how I can make the original files smaller and more efficient to read?
By the way - I am not using zlib compression (I have opted for scaling floating point values to an integer short).
Chris
EDIT
My Python code is essentially building up one single timeseries file of collocated model and observation data from multiple individual model forecast files over 3 months. My forecast model runs 4 times a day, and I am aggregating 3 months of data, so that is ~120 files.
The program extracts a subset of the forecast period from each file (e.g. T+24h -> T+48h), so it is not a simple matter of concatenating the files.
This is a rough approximation of what my code is doing (it actually reads/writes more variables, but I am just showing 2 here for clarity):
import datetime as dt
import netCDF4 as nc
import numpy as np

# Create output file:
dout = nc.Dataset(fn, mode='w', clobber=True, format="NETCDF4")
dout.createDimension('point', size=None)
dout.createDimension('realization', size=24)
for varname in ['mod_hs','mod_ws']:
    v = dout.createVariable(varname, np.short,
                            dimensions=('point', 'realization'), zlib=False)
    v.scale_factor = 0.01
# Cycle over dates
date = <some start date>
end_date = <some end date>
# Keep track of record dimension ('point') size:
n = 0
while date < end_date:
    din = nc.Dataset("<path to input file>", mode='r')
    fchr = din.variables['fchr'][:]
    # get mask for specific forecast hour range
    m = np.logical_and(fchr >= 24, fchr < 48)
    sz = np.count_nonzero(m)
    if sz > 0:
        dout.variables['mod_hs'][n:n+sz,:] = din.variables['mod_hs'][:][m,:]
        dout.variables['mod_ws'][n:n+sz,:] = din.variables['mod_wspd'][:][m,:]
        # Increment record dimension count:
        n += sz
    din.close()
    # Go to next file (also when nothing matched, so the loop always advances)
    date += dt.timedelta(hours=6)
dout.close()
Interestingly, if I make the output file format NETCDF3_CLASSIC rather than NETCDF4, the output is the size that I would expect. NETCDF4 output seems to be bloated.
My experience has been that the default chunksize for record dimensions depends on the version of the netCDF library underneath. For 4.3.3.1, it is 524288. 275935 records is about half a record-chunk. ncks automatically chooses (without telling you) more sensible chunksizes than netCDF defaults, so the output is better optimized. I think this is what is happening. See http://nco.sf.net/nco.html#cnk
Please try to provide code that works without modification if possible. I had to edit yours to get it working, but it wasn't too difficult.
import netCDF4 as nc
import numpy as np
dout = nc.Dataset('testdset.nc4', mode='w', clobber=True, format="NETCDF4")
dout.createDimension('point', size=None)
dout.createDimension('realization', size=24)
for varname in ['mod_hs','mod_ws']:
    v = dout.createVariable(varname, np.short,
                            dimensions=('point', 'realization'), zlib=False, chunksizes=[1000,24])
    v.scale_factor = 0.01
date = 1
end_date = 5000
n = 0
while date < end_date:
    sz = 100
    dout.variables['mod_hs'][n:n+sz,:] = np.ones((sz,24))
    dout.variables['mod_ws'][n:n+sz,:] = np.ones((sz,24))
    n += sz
    date += 1
dout.close()
The main difference is in the createVariable command. As for file size, without providing "chunksizes" when creating the variable, I also got a file twice as large as when I added it. So for file size it should do the trick.
For reading variables from the file, I did not actually notice any difference; maybe I should add more variables?
Anyway, it should now be clear how to add a chunk size. You will probably need to test a bit to find a good configuration for your problem. Feel free to ask more if it still does not work for you, and if you want to understand more about chunking, read the HDF5 docs.
I think your problem is that the default chunk size for unlimited dimensions is 1, which creates a huge number of internal HDF5 structures. By setting the chunksize explicitly (obviously ok for unlimited dimensions), the second example does much better in space and time.
Unlimited dimensions require chunking in HDF5/netCDF4, so if you want unlimited dimensions you have to think about chunking performance, as you have discovered.
More here:
https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_perf_chunking.html
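To get a feel for the trade-off the answers describe, here is a back-of-the-envelope sketch in plain Python (no netCDF library involved). It assumes chunks of shape (chunk_records, 24), 2-byte shorts, no compression, and that a partly filled last chunk is still stored at full size; the function name is made up for this example.

```python
import math


def chunked_size(n_records, chunk_records, other_dim=24, itemsize=2):
    """Return (number_of_chunks, allocated_bytes) for one variable.

    Assumes every chunk, including a partly filled last one, takes
    chunk_records * other_dim * itemsize bytes on disk.
    """
    n_chunks = int(math.ceil(n_records / float(chunk_records)))
    return n_chunks, n_chunks * chunk_records * other_dim * itemsize


for chunk in (1, 1000, 524288):
    print(chunk, chunked_size(275935, chunk))
```

With 275935 records, a chunk length of 1 gives hundreds of thousands of tiny chunks to index, while a chunk length of 524288 gives a single chunk that is nearly half empty. A value such as 1000 sits in between.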

Optimization: Python, Perl, and a C Suffix Tree Library

I've got about 3,500 files that each consist of a single-line character string. The files vary in size (from about 200 B to 1 MB). I'm trying to compare each file with every other file and find a common subsequence of length 20 characters between two files. Note that the subsequence only needs to be common to the two files in each comparison, not to all files.
I've struggled with this problem a bit, and since I'm not an expert, I've ended up with a bit of an ad hoc solution. I use itertools.combinations to build a list in Python that ends up with around 6,239,278 combinations. I then pass the files two at a time to a Perl script that acts as a wrapper for a suffix tree library written in C called libstree. I've tried to avoid this type of solution, but the only comparable C suffix tree wrapper in Python suffers from a memory leak.
So here's my problem. I've timed it, and on my machine the solution processes about 500 comparisons in 25 seconds. That means it'll take around 3 days of continuous processing to complete the task. And then I have to do it all again to look at, say, 25 characters instead of 20. Please note that I'm way out of my comfort zone and not a very good programmer, so I'm sure there is a much more elegant way to do this. I thought I'd ask here and post my code to see if anyone has any suggestions as to how I could complete this task faster.
Python code:
from itertools import combinations
import glob, subprocess
glist = glob.glob("Data/*.g")
i = 0
for a,b in combinations(glist, 2):
    i += 1
    p = subprocess.Popen(["perl", "suffix_tree.pl", a, b, "20"], shell=False, stdout=subprocess.PIPE)
    p = p.stdout.read()
    a = a.split("/")
    b = b.split("/")
    a = a[1].split(".")
    b = b[1].split(".")
    print(str(i) + ":" + str(a[0]) + " --- " + str(b[0]))
    if p != "" and len(p) == 20:
        with open("tmp.list", "a") as openf:
            openf.write(a[0] + " " + b[0] + "\n")
Perl code:
use strict;
use Tree::Suffix;
open FILE, "<$ARGV[0]";
my $a = do { local $/; <FILE> };
open FILE, "<$ARGV[1]";
my $b = do { local $/; <FILE> };
my @g = ($a,$b);
my $st = Tree::Suffix->new(@g);
my ($c) = $st->lcs($ARGV[2],-1);
print "$c";
Rather than writing Python to call Perl to call C, I'm sure you would be better off dropping the Python code and writing it all in Perl.
If your files are certain to contain exactly one line then you can read the pairs more simply by writing just
my @g = <>;
I believe the program below performs the same function as your Python and Perl code combined, but I cannot test it as I am unable to install libstree at present.
But as ikegami has pointed out, it would be far better to calculate and store the longest common subsequence for each pair of files and put them into categories afterwards. I won't go on to code this as I don't know what information you need - whether it is just subsequence length or if you need the characters and/or the position of the subsequences as well.
use strict;
use warnings;
use Math::Combinatorics;
use Tree::Suffix;
my @glist = glob "Data/*.g";
my $iterator = Math::Combinatorics->new(count => 2, data => \@glist);
open my $fh, '>', 'tmp.list' or die $!;
my $n = 0;
while (my @pair = $iterator->next_combination) {
    $n++;
    @ARGV = @pair;
    my @g = <>;
    my $tree = Tree::Suffix->new(@g);
    my $lcs = $tree->lcs;
    @pair = map m|/(.+?)\.|, @pair;
    print "$n: $pair[0] --- $pair[1]\n";
    # No comma after the filehandle, otherwise $fh itself would be printed
    print $fh "@pair\n" if $lcs and length $lcs >= 20;
}
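If all you need is a yes/no answer to "do these two strings share a run of at least k characters?", a suffix tree is not strictly required. Below is a sketch of a different technique (a set of all length-k substrings, not a suffix tree and not what libstree does). The function name is made up, and for a 1 MB string the set holds up to about a million 20-character substrings, so memory use is noticeable.

```python
def shares_substring(a, b, k=20):
    """True if strings a and b have at least one common substring of length k."""
    if len(a) < k or len(b) < k:
        return False
    if len(a) > len(b):
        a, b = b, a  # index the shorter string to keep the set small
    seen = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in seen for i in range(len(b) - k + 1))
```

Two strings have a common substring of length 20 or more exactly when they share some substring of length exactly 20, so this gives the same pairs as the length test on the longest common substring.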

What is the best, python or bash for selectively concatenating lots of files?

I have around 20000 files coming from the output of some program, and their names follow the format:
data1.txt
data2.txt
...
data99.txt
data100.txt
...
data999.txt
data1000.txt
...
data20000.txt
I would like to write a script that gets as input argument the number N. Then it makes blocks of N concatenated files, so if N=5, it would make the following new files:
data_new_1.txt: it would contain (concatenated) data1.txt to data5.txt (like cat data1.txt data2.txt ...> data_new_1.txt )
data_new_2.txt: it would contain (concatenated) data6.txt to data10.txt
.....
I wonder what you think would be the best approach to do this: bash, Python, or something else like awk, Perl, etc.
By best I mean the simplest code.
Thanks
Here's a Python (2.6) version (if you have Python 2.5, add a first line that says
from __future__ import with_statement
and the script will also work)...:
import sys
def main(N):
    rN = range(N)
    for iout, iin in enumerate(xrange(1, 99999, N)):
        with open('data_new_%s.txt' % (iout+1), 'w') as out:
            for di in rN:
                try: fin = open('data%s.txt' % (iin + di), 'r')
                except IOError: return
                out.write(fin.read())
                fin.close()
if __name__ == '__main__':
    if len(sys.argv) > 1:
        N = int(sys.argv[1])
    else:
        N = 5
    main(N)
As you see from other answers and comments, opinions on performance differ. Some believe that Python startup (and module imports) will make this slower than bash, but the import part at least is bogus: sys, the only module needed, is built in, requires no "loading", and therefore has basically negligible import overhead. I suspect the repeated fork/exec of cat may slow bash down; others think that I/O will dominate anyway, making the two solutions equivalent. You'll have to benchmark with your own files, on your own system, to settle the performance question.
how about a one liner ? :)
ls data[0-9]*txt|sort -nk1.5|awk 'BEGIN{rn=5;i=1}{while((getline _<$0)>0){print _ >"data_new_"i".txt"}close($0)}NR%rn==0{i++}'
I like this one which saves on executing processes, only 1 cat per block
#! /bin/bash
N=5 # block size
S=1 # start
E=20000 # end
for n in $(seq $S $N $E)
do
    CMD="cat "
    i=$n
    while [ $i -lt $((n + N)) ]
    do
        CMD+="data$((i++)).txt "
    done
    $CMD > data_new_$((n / N + 1)).txt
done
Best in what sense? Bash can do this quite well, but it may be harder for you to write a good bash script if you are more familiar with another scripting language. Do you want to optimize for something specific?
That said, here's a bash implementation:
declare blocksize=5
declare i=1
declare blockstart=1
declare blockend=$blocksize
declare -a fileset
while [ -f data${i}.txt ] ; do
    fileset=("${fileset[@]}" data${i}.txt)
    i=$(($i + 1))
    if [ $i -gt $blockend ] ; then
        cat "${fileset[@]}" > data_new_${blockstart}.txt
        fileset=() # clear
        blockstart=$(($blockstart + $blocksize))
        blockend=$(($blockend+ $blocksize))
    fi
done
EDIT: I see you now say "Best" == "Simplest code", but what's simple depends on you. For me Perl is simpler than Python, for some Awk is simpler than bash. It depends on what you know best.
EDIT again: inspired by dtmilano, I've changed mine to use cat once per blocksize, so now cat will be called 'only' 4000 times.
Let's say, if you have a simple script that concatenates files and keeps a counter for you, like the following:
#!/usr/bin/bash
COUNT=0
if [ -f counter ]; then
    COUNT=`cat counter`
fi
COUNT=$((COUNT+1))
echo $COUNT > counter
cat "$@" > $COUNT.data
Then a command line like this will do:
find -name "*" -type f -print0 | xargs -0 -n 5 path_to_the_script
Since this can easily be done in any shell I would simply use that.
This should do it:
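#!/bin/sh
# Note: the glob expands in lexical order (data1, data10, data100, ...)
FILES=$1
FILENO=1
for i in data[0-9]*.txt; do
    cat $i >> "data_new_${FILENO}.txt"
    # Count down after writing, so that each block gets exactly $1 files
    FILES=`expr $FILES - 1`
    if [ $FILES -eq 0 ]; then
        FILENO=`expr $FILENO + 1`
        FILES=$1
    fi
done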
Python version:
#!/usr/bin/env python
import os
import sys
if __name__ == '__main__':
    files_per_file = int(sys.argv[1])
    i = 0
    while True:
        i += 1
        source_file = 'data%d.txt' % i
        if os.path.isfile(source_file):
            # (i - 1) // N puts files 1..N into block 1, N+1..2N into block 2, ...
            dest_file = 'data_new_%d.txt' % (((i - 1) // files_per_file) + 1)
            with open(dest_file, 'a') as dest:
                with open(source_file) as source:
                    dest.write(source.read())
        else:
            break
Simple enough?
make_cat.py
limit = 1000
n = 5
for i in range(0, (limit + n - 1) // n):
    # Files are numbered from 1, so block i covers data(i*n+1) .. data(i*n+n)
    names = ["data{0}.txt".format(j) for j in range(i * n + 1, i * n + n + 1)]
    print("cat {0} >data_new_{1}.txt".format(" ".join(names), i + 1))
Script
python make_cat.py | sh
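For completeness, a Python 3 sketch that avoids two traps seen in the snippets above: lexical versus numeric file order, and off-by-one block boundaries. The function name `concat_blocks` is made up; it assumes all inputs sit in one directory and are named data<number>.txt as in the question.

```python
import os
import re
import shutil


def concat_blocks(directory, n):
    """Concatenate data<k>.txt files, in numeric order, n files per data_new_<j>.txt."""
    pattern = re.compile(r"^data(\d+)\.txt$")  # does not match data_new_*.txt
    numbered = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    numbered.sort()  # numeric order: data2.txt comes before data10.txt
    outputs = []
    for start in range(0, len(numbered), n):
        out_path = os.path.join(directory, "data_new_%d.txt" % (start // n + 1))
        with open(out_path, "wb") as out:
            for _, name in numbered[start:start + n]:
                with open(os.path.join(directory, name), "rb") as src:
                    shutil.copyfileobj(src, out)
        outputs.append(out_path)
    return outputs
```

Calling `concat_blocks(".", 5)` in the data directory would write data_new_1.txt, data_new_2.txt and so on; a final short block is written if the file count is not a multiple of n.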

What language could I use for fast execution of this database summarization task?

So I wrote a Python program to handle a little data processing
task.
Here's a very brief specification in a made-up language of the computation I want:
parse "%s %lf %s" aa bb cc | group_by aa | quickselect --key=bb 0:5 | \
flatten | format "%s %lf %s" aa bb cc
That is, for each line, parse out a word, a floating-point number, and another word. Think of them as a player ID, a score, and a date. I want the top five scores and dates for each player. The data size is not trivial, but not huge; about 630 megabytes.
I want to know what real, executable language I should have written it in to
get it to be similarly short (as the Python below) but much faster.
#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
top_5 = {}
for line in sys.stdin:
    aa, bb, cc = line.split()
    # We want the top 5 for each distinct value of aa. There are
    # hundreds of thousands of values of aa.
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    current.append((bb, cc))
    # Every once in a while, we drop the values that are not in
    # the top 5, to keep our memory footprint down, because some
    # values of aa have thousands of (bb, cc) pairs.
    if len(current) > 10:
        current.sort()
        current[:-5] = []
for aa in top_5:
    current = top_5[aa]
    current.sort()
    for bb, cc in current[-5:]:
        print("%s %s %s" % (aa, bb, cc))
Here’s some sample input data:
3 1.5 a
3 1.6 b
3 0.8 c
3 0.9 d
4 1.2 q
3 1.5 e
3 1.8 f
3 1.9 g
Here’s the output I get from it:
3 1.5 a
3 1.5 e
3 1.6 b
3 1.8 f
3 1.9 g
4 1.2 q
There are seven values for 3, and so we drop the c and d values
because their bb value puts them out of the top 5. Because 4 has
only one value, its “top 5” consists of just that one value.
This runs faster than doing the same queries in MySQL (at least, the
way we’ve found to do the queries) but I’m pretty sure it's spending
most of its time in the Python bytecode interpreter. I think that in
another language, I could probably get it to process hundreds of
thousands of rows per second instead of per minute. So I’d like to
write it in a language that has a faster implementation.
But I’m not sure what language to choose.
I haven’t been able to figure out how to express this as a single query in SQL, and
actually I’m really unimpressed with MySQL’s ability even to merely
select * from foo into outfile 'bar'; the input data.
C is an obvious choice, but things like line.split(), sorting a list
of 2-tuples, and making a hash table require writing some code that’s
not in the standard library, so I would end up with 100 lines of code
or more instead of 14.
C++ seems like it might be a better choice (it has strings, maps,
pairs, and vectors in the standard library) but it seems like the code
would be a lot messier with STL.
OCaml would be fine, but does it have an equivalent of line.split(),
and will I be sad about the performance of its map?
Common Lisp might work?
Is there some equivalent of Matlab for database computation like this
that lets me push the loops down into fast code? Has anybody tried Pig?
(Edit: responded to davethegr8's comment by providing some sample input and output data, and fixed a bug in the Python program!)
(Additional edit: Wow, this comment thread is really excellent so far. Thanks, everybody!)
Edit:
There was an eerily similar question asked on sbcl-devel in 2007 (thanks, Rainer!), and here's an awk script from Will Hartung for producing some test data (although it doesn't have the Zipfian distribution of the real data):
BEGIN {
    for (i = 0; i < 27000000; i++) {
        v = rand();
        k = int(rand() * 100);
        print k " " v " " i;
    }
    exit;
}
I have a hard time believing that any script without any prior knowledge of the data (unlike MySQL, which has such info pre-loaded) would be faster than a SQL approach.
Aside from the time spent parsing the input, the script needs to keep re-sorting its arrays, etc.
The following is a first guess at what should work decently fast in SQL, assuming an index (*) on the table's aa, bb, cc columns, in that order. (A possible alternative would be an "aa, bb DESC, cc" index.)
(*) This index could be clustered or not, not affecting the following query. Choice of clustering or not, and of needing an "aa,bb,cc" separate index depends on use case, on the size of the rows in table etc. etc.
SELECT T1.aa, T1.bb, T1.cc, COUNT(*)
FROM tblAbc T1
LEFT OUTER JOIN tblAbc T2 ON T1.aa = T2.aa AND
    (T1.bb < T2.bb OR (T1.bb = T2.bb AND T1.cc < T2.cc))
GROUP BY T1.aa, T1.bb, T1.cc
HAVING COUNT(*) < 5 -- trick, remember COUNT(*) goes 1,1,2,3,...
ORDER BY T1.aa, T1.bb, T1.cc, COUNT(*) DESC
The idea is to get a count of how many records, within a given aa value are smaller than self. There is a small trick however: we need to use LEFT OUTER join, lest we discard the record with the biggest bb value or the last one (which may happen to be one of the top 5). As a result of left joining it, the COUNT(*) value counts 1, 1, 2, 3, 4 etc. and the HAVING test therefore is "<5" to effectively pick the top 5.
To emulate the sample output of the OP, the ORDER BY uses DESC on the COUNT(), which could be removed to get a more traditional top 5 type of listing. Also, the COUNT() in the select list can be removed if so desired, this doesn't impact the logic of the query and the ability to properly sort.
Also note that this query is deterministic in its handling of ties, i.e. when a given set of records has the same value for bb (within an aa group). I think the Python program may produce slightly different output when the order of the input data is changed, because of its occasional truncation of the sorted list.
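The self-join can be tried on the sample data from the question with Python's built-in sqlite3 module. This is only a small check of the logic (SQLite instead of MySQL or MSSQL, no index, COUNT(*) dropped from the select list and from the ORDER BY), not a performance test.

```python
import sqlite3

rows = [("3", 1.5, "a"), ("3", 1.6, "b"), ("3", 0.8, "c"), ("3", 0.9, "d"),
        ("4", 1.2, "q"), ("3", 1.5, "e"), ("3", 1.8, "f"), ("3", 1.9, "g")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblAbc (aa TEXT, bb REAL, cc TEXT)")
conn.executemany("INSERT INTO tblAbc VALUES (?, ?, ?)", rows)

query = """
SELECT T1.aa, T1.bb, T1.cc
FROM tblAbc T1
LEFT OUTER JOIN tblAbc T2
  ON T1.aa = T2.aa
 AND (T1.bb < T2.bb OR (T1.bb = T2.bb AND T1.cc < T2.cc))
GROUP BY T1.aa, T1.bb, T1.cc
HAVING COUNT(*) < 5
ORDER BY T1.aa, T1.bb, T1.cc
"""
top5 = conn.execute(query).fetchall()
for row in top5:
    print(row)
```

The c and d rows of player 3 are dropped, which matches the output listed in the question.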
Real solution: A SQL-based procedural approach
The self-join approach described above demonstrates how declarative statements can be used to express the OP's requirement. However, this approach is naive in the sense that its cost grows roughly with the sum of the squares of the record counts within each aa 'category' (not O(n^2) but roughly O((n/a)^2), where a is the number of distinct values in the aa column). In other words, it performs well when, on average, the number of records associated with a given aa value does not exceed a few dozen. If the aa column is not that selective, the following approach is much better suited. It leverages SQL's efficient sorting framework while implementing a simple algorithm that would be hard to express declaratively. For datasets with a particularly large number of records in each (or most) aa 'categories', it could be improved further by a simple binary search for the next aa value, looking ahead (and sometimes back) in the cursor. For cases where the number of aa 'categories' is low relative to the overall row count in tblAbc, see yet another approach after this one.
DECLARE @aa AS VARCHAR(10), @bb AS INT, @cc AS VARCHAR(10)
DECLARE @curAa AS VARCHAR(10)
DECLARE @Ctr AS INT
DROP TABLE tblResults;
CREATE TABLE tblResults
(   aa VARCHAR(10),
    bb INT,
    cc VARCHAR(10)
);
DECLARE abcCursor CURSOR
    FOR SELECT aa, bb, cc
    FROM tblABC
    ORDER BY aa, bb DESC, cc
    FOR READ ONLY;
OPEN abcCursor;
SET @curAa = ''
FETCH NEXT FROM abcCursor INTO @aa, @bb, @cc;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @curAa <> @aa
    BEGIN
        SET @Ctr = 0
        SET @curAa = @aa
    END
    IF @Ctr < 5
    BEGIN
        SET @Ctr = @Ctr + 1;
        INSERT tblResults VALUES(@aa, @bb, @cc);
    END
    FETCH NEXT FROM AbcCursor INTO @aa, @bb, @cc;
END;
CLOSE abcCursor;
DEALLOCATE abcCursor;
SELECT * from tblResults
ORDER BY aa, bb, cc -- OR .. bb DESC ... for a more traditional order.
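Outside the database, the same "sort once, then keep the first five rows of each group" algorithm can be sketched in Python. The function name is made up, and the sketch sorts everything in memory, which the cursor version does not need to do.

```python
from itertools import groupby, islice
from operator import itemgetter


def top_n_sorted(rows, n=5):
    """rows: iterable of (aa, bb, cc) tuples. Returns the n highest-bb rows per aa."""
    # Same order as the cursor: aa, bb DESC, cc
    ordered = sorted(rows, key=lambda r: (r[0], -r[1], r[2]))
    result = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        # Taking the first n rows of a group plays the role of the counter test
        result.extend(islice(group, n))
    return result
```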
Alternative to the above for cases when aa is very unselective. In other words, when we have relatively few aa 'categories'. The idea is to go through the list of distinct categories and to run a "LIMIT" (MySql) "TOP" (MSSQL) query for each of these values.
For reference purposes, the following ran in 63 seconds for tblAbc of 61 Million records divided in 45 aa values, on MSSQL 8.0, on a relatively old/weak host.
DECLARE @aa AS VARCHAR(10)
DECLARE @aaCount INT
DROP TABLE tblResults;
CREATE TABLE tblResults
(   aa VARCHAR(10),
    bb INT,
    cc VARCHAR(10)
);
DECLARE aaCountCursor CURSOR
    FOR SELECT aa, COUNT(*)
    FROM tblABC
    GROUP BY aa
    ORDER BY aa
    FOR READ ONLY;
OPEN aaCountCursor;
FETCH NEXT FROM aaCountCursor INTO @aa, @aaCount
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT tblResults
    SELECT TOP 5 aa, bb, cc
    FROM tblABC
    WHERE aa = @aa
    ORDER BY aa, bb DESC, cc
    FETCH NEXT FROM aaCountCursor INTO @aa, @aaCount;
END;
CLOSE aaCountCursor
DEALLOCATE aaCountCursor
SELECT * from tblResults
ORDER BY aa, bb, cc -- OR .. bb DESC ... for a more traditional order.
On the question of needing an index or not. (cf OP's remark)
When merely running a "SELECT * FROM myTable", a table scan is effectively the fastest approach; there is no need to bother with indexes. However, the main reason why SQL is typically better suited for this kind of task (aside from being the repository where the data has been accumulating in the first place, whereas any external solution needs to account for the time to export the relevant data) is that it can rely on indexes to avoid scanning. Many general-purpose languages are far better suited to raw processing, but they are fighting an unfair battle with SQL because they need to rebuild any prior knowledge of the data that SQL has gathered in the course of its data collection / import phase. Since sorting is typically a time-consuming and sometimes space-consuming task, SQL, despite its relatively slower processing power, often ends up ahead of alternative solutions.
Also, even without pre-built indexes, modern query optimizers may decide on a plan that involves the creation of a temporary index. And, because sorting is an intrinsic part of a DBMS, SQL servers are generally efficient in that area.
So... Is SQL better?
That said, if we are comparing SQL and other languages for pure ETL jobs, i.e. jobs that take heaps (unindexed tables) as input and perform various transformations and filtering, multi-threadable utilities written in, say, C, and leveraging efficient sorting libraries, would likely be faster. The determining question when deciding between a SQL and a non-SQL approach is where the data is located and where it should eventually reside. If we merely need to convert a file to be supplied down "the chain", external programs are better suited. If we have or need the data in a SQL server, there are only rare cases that make it worthwhile to export it and process it externally.
You could use smarter data structures and still use python.
I ran your reference implementation and my Python implementation on my machine, and compared the output to be sure of the results.
This is yours:
$ time python ./ref.py < data-large.txt > ref-large.txt
real 1m57.689s
user 1m56.104s
sys 0m0.573s
This is mine:
$ time python ./my.py < data-large.txt > my-large.txt
real 1m35.132s
user 1m34.649s
sys 0m0.261s
$ diff my-large.txt ref-large.txt
$ echo $?
0
And this is the source:
#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
import heapq
top_5 = {}
for line in sys.stdin:
    aa, bb, cc = line.split()
    # We want the top 5 for each distinct value of aa. There are
    # hundreds of thousands of values of aa.
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    if len(current) < 5:
        heapq.heappush(current, (bb, cc))
    else:
        # current[0] is the smallest of the five kept so far
        if current[0] < (bb, cc):
            heapq.heapreplace(current, (bb, cc))
for aa in top_5:
    current = top_5[aa]
    while len(current) > 0:
        bb, cc = heapq.heappop(current)
        print("%s %s %s" % (aa, bb, cc))
Update: Know your limits.
I've also timed a noop code, to know the fastest possible python solution with code similar to the original:
$ time python noop.py < data-large.txt > noop-large.txt
real 1m20.143s
user 1m19.846s
sys 0m0.267s
And the noop.py itself:
#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys
import heapq
top_5 = {}
for line in sys.stdin:
    aa, bb, cc = line.split()
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    if len(current) < 5:
        current.append((bb, cc))
for aa in top_5:
    current = top_5[aa]
    current.sort()
    for bb, cc in current[-5:]:
        print("%s %s %s" % (aa, bb, cc))
This took 45.7s on my machine with 27M rows of data that looked like this:
42 0.49357 0
96 0.48075 1
27 0.640761 2
8 0.389128 3
75 0.395476 4
24 0.212069 5
80 0.121367 6
81 0.271959 7
91 0.18581 8
69 0.258922 9
Your script took 1m42 on this data, and the C++ example took 1m46 (g++ t.cpp -o t to compile it; I don't know anything about C++).
Java 6, not that it matters really. Output isn't perfect, but it's easy to fix.
package top5;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
public class Main {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        Map<String, Pair[]> top5map = new TreeMap<String, Pair[]>();
        BufferedReader br = new BufferedReader(new FileReader("/tmp/file.dat"));
        String line = br.readLine();
        while(line != null) {
            String parts[] = line.split(" ");
            String key = parts[0];
            double score = Double.valueOf(parts[1]);
            String value = parts[2];
            Pair[] pairs = top5map.get(key);
            boolean insert = false;
            Pair p = null;
            if (pairs != null) {
                insert = (score > pairs[pairs.length - 1].score) || pairs.length < 5;
            } else {
                insert = true;
            }
            if (insert) {
                p = new Pair(score, value);
                if (pairs == null) {
                    pairs = new Pair[1];
                    pairs[0] = new Pair(score, value);
                } else {
                    if (pairs.length < 5) {
                        Pair[] newpairs = new Pair[pairs.length + 1];
                        System.arraycopy(pairs, 0, newpairs, 0, pairs.length);
                        pairs = newpairs;
                    }
                    // Shift smaller scores down to keep the array in descending order
                    int k = 0;
                    for(int i = pairs.length - 2; i >= 0; i--) {
                        if (pairs[i].score <= p.score) {
                            pairs[i + 1] = pairs[i];
                        } else {
                            k = i + 1;
                            break;
                        }
                    }
                    pairs[k] = p;
                }
                top5map.put(key, pairs);
            }
            line = br.readLine();
        }
        for(Map.Entry<String, Pair[]> e : top5map.entrySet()) {
            System.out.print(e.getKey());
            System.out.print(" ");
            System.out.println(Arrays.toString(e.getValue()));
        }
        System.out.println(System.currentTimeMillis() - start);
    }
    static class Pair {
        double score;
        String value;
        public Pair(double score, String value) {
            this.score = score;
            this.value = value;
        }
        public int compareTo(Object o) {
            Pair p = (Pair) o;
            return (int)Math.signum(score - p.score);
        }
        public String toString() {
            return String.valueOf(score) + ", " + value;
        }
    }
}
AWK script to fake the data:
BEGIN {
for (i = 0; i < 27000000; i++) {
v = rand();
k = int(rand() * 100);
print k " " v " " i;
}
exit;
}
This is a sketch in Common Lisp
Note that for long files there is a penalty for using READ-LINE, because it conses a fresh string for each line. Then use one of the derivatives of READ-LINE that are floating around that are using a line buffer. Also you might check if you want the hash table be case sensitive or not.
second version
Splitting the string is no longer needed, because we do it here. It is low level code, in the hope that some speed gains will be possible. It checks for one or more spaces as field delimiter and also tabs.
(defun read-a-line (stream)
(let ((line (read-line stream nil nil)))
(flet ((delimiter-p (c)
(or (char= c #\space) (char= c #\tab))))
(when line
(let* ((s0 (position-if #'delimiter-p line))
(s1 (position-if-not #'delimiter-p line :start s0))
(s2 (position-if #'delimiter-p line :start (1+ s1)))
(s3 (position-if #'delimiter-p line :from-end t)))
(values (subseq line 0 s0)
(list (read-from-string line nil nil :start s1 :end s2)
(subseq line (1+ s3)))))))))
Above function returns two values: the key and a list of the rest.
(defun dbscan (top-5-table stream)
"get triples from each line and put them in the hash table"
(loop with aa = nil and bbcc = nil do
(multiple-value-setq (aa bbcc) (read-a-line stream))
while aa do
(setf (gethash aa top-5-table)
(let ((l (merge 'list (gethash aa top-5-table) (list bbcc)
#'> :key #'first)))
(or (and (nth 5 l) (subseq l 0 5)) l)))))
(defun dbprint (table output)
"print the hashtable contents"
(maphash (lambda (aa value)
(loop for (bb cc) in value
do (format output "~a ~a ~a~%" aa bb cc)))
table))
(defun dbsum (input &optional (output *standard-output*))
"scan and sum from a stream"
(let ((top-5-table (make-hash-table :test #'equal)))
(dbscan top-5-table input)
(dbprint top-5-table output)))
(defun fsum (infile outfile)
"scan and sum a file"
(with-open-file (input infile :direction :input)
(with-open-file (output outfile
:direction :output :if-exists :supersede)
(dbsum input output))))
some test data
(defun create-test-data (&key (file "/tmp/test.data") (n-lines 100000))
(with-open-file (stream file :direction :output :if-exists :supersede)
(loop repeat n-lines
do (format stream "~a ~a ~a~%"
(random 1000) (random 100.0) (random 10000)))))
; (create-test-data)
(defun test ()
(time (fsum "/tmp/test.data" "/tmp/result.data")))
third version, LispWorks
Uses some SPLIT-STRING and PARSE-FLOAT functions, otherwise generic CL.
(defun fsum (infile outfile)
(let ((top-5-table (make-hash-table :size 50000000 :test #'equal)))
(with-open-file (input infile :direction :input)
(loop for line = (read-line input nil nil)
while line do
(destructuring-bind (aa bb cc) (split-string '(#\space #\tab) line)
(setf bb (parse-float bb))
(let ((v (gethash aa top-5-table)))
(unless v
(setf (gethash aa top-5-table)
(setf v (make-array 6 :fill-pointer 0))))
(vector-push (cons bb cc) v)
(when (> (length v) 5)
(setf (fill-pointer (sort v #'> :key #'car)) 5))))))
(with-open-file (output outfile :direction :output :if-exists :supersede)
(maphash (lambda (aa value)
(loop for (bb . cc) across value do
(format output "~a ~f ~a~%" aa bb cc)))
top-5-table))))
Here is one more OCaml version - targeted for speed - with custom parser on Streams. Too long, but parts of the parser are reusable. Thanks peufeu for triggering competition :)
Speed :
simple ocaml - 27 sec
ocaml with Stream parser - 15 sec
c with manual parser - 5 sec
Compile with :
ocamlopt -pp camlp4o code.ml -o caml
Code :
open Printf
let cmp x y = compare (fst x : float) (fst y)
let digit c = Char.code c - Char.code '0'
let rec parse f = parser
| [< a=int; _=spaces; b=float; _=spaces;
c=rest (Buffer.create 100); t >] -> f a b c; parse f t
| [< >] -> ()
and int = parser
| [< ''0'..'9' as c; t >] -> int_ (digit c) t
| [< ''-'; ''0'..'9' as c; t >] -> - (int_ (digit c) t)
and int_ n = parser
| [< ''0'..'9' as c; t >] -> int_ (n * 10 + digit c) t
| [< >] -> n
and float = parser
| [< n=int; t=frem; e=fexp >] -> (float_of_int n +. t) *. (10. ** e)
and frem = parser
| [< ''.'; r=frem_ 0.0 10. >] -> r
| [< >] -> 0.0
and frem_ f base = parser
| [< ''0'..'9' as c; t >] ->
frem_ (float_of_int (digit c) /. base +. f) (base *. 10.) t
| [< >] -> f
and fexp = parser
| [< ''e'; e=int >] -> float_of_int e
| [< >] -> 0.0
and spaces = parser
| [< '' '; t >] -> spaces t
| [< ''\t'; t >] -> spaces t
| [< >] -> ()
and crlf = parser
| [< ''\r'; t >] -> crlf t
| [< ''\n'; t >] -> crlf t
| [< >] -> ()
and rest b = parser
| [< ''\r'; _=crlf >] -> Buffer.contents b
| [< ''\n'; _=crlf >] -> Buffer.contents b
| [< 'c; t >] -> Buffer.add_char b c; rest b t
| [< >] -> Buffer.contents b
let () =
let all = Array.make 200 [] in
let each a b c =
assert (a >= 0 && a < 200);
match all.(a) with
| [] -> all.(a) <- [b,c]
| (bmin,_) as prev::tl -> if b > bmin then
begin
let m = List.sort cmp ((b,c)::tl) in
all.(a) <- if List.length tl < 4 then prev::m else m
end
in
parse each (Stream.of_channel stdin);
Array.iteri
(fun a -> List.iter (fun (b,c) -> printf "%i %f %s\n" a b c))
all
Of all the programs in this thread that I've tested so far, the OCaml version is the fastest and also among the shortest. (Line-of-code-based measurements are a little fuzzy, but it's not clearly longer than the Python version or the C or C++ versions, and it is clearly faster.)
Note: I figured out why my earlier runtimes were so nondeterministic! My CPU heatsink was clogged with dust and my CPU was overheating as a result. Now I am getting nice deterministic benchmark times, and I think I've redone all the timing measurements in this thread with this reliable setup.
Here are the timings for the different versions so far, running on a 27-million-row 630-megabyte input data file. I'm on Ubuntu Intrepid Ibex on a dual-core 1.6GHz Celeron, running a 32-bit version of the OS (the Ethernet driver was broken in the 64-bit version). I ran each program five times and report the range of times those five tries took. I'm using Python 2.5.2, OpenJDK 1.6.0.0, OCaml 3.10.2, GCC 4.3.2, SBCL 1.0.8.debian, and Octave 3.0.1.
SquareCog's Pig version: not yet tested (because I can't just apt-get install pig), 7 lines of code.
mjv's pure SQL version: not yet tested, but I predict a runtime of several days; 7 lines of code.
ygrek's OCaml version: 68.7 seconds ±0.9 in 15 lines of code.
My Python version: 169 seconds ±4 or 86 seconds ±2 with Psyco, in 16 lines of code.
abbot's heap-based Python version: 177 seconds ±5 in 18 lines of code, or 83 seconds ±5 with Psyco.
My C version below, composed with GNU sort -n: 90 + 5.5 seconds (±3, ±0.1), but gives the wrong answer because of a deficiency in GNU sort, in 22 lines of code (including one line of shell.)
hrnt's C++ version: 217 seconds ±3 in 25 lines of code.
mjv's alternative SQL-based procedural approach: not yet tested, 26 lines of code.
mjv's first SQL-based procedural approach: not yet tested, 29 lines of code.
peufeu's Python version with Psyco: 181 seconds ±4, somewhere around 30 lines of code.
Rainer Joswig's Common Lisp version: 478 seconds (only run once) in 42 lines of code.
abbot's noop.py, which intentionally gives incorrect results to establish a lower bound: not yet tested, 15 lines of code.
Will Hartung's Java version: 96 seconds ±10 in, according to David A. Wheeler’s SLOCCount, 74 lines of code.
Greg's Matlab version: doesn't work.
Schuyler Erle's suggestion of using Pyrex on one of the Python versions: not yet tried.
I suspect abbot's version comes out relatively worse for me than for them because the real dataset has a highly nonuniform distribution: as I said, some aa values (“players”) have thousands of lines, while others only have one.
About Psyco: I applied Psyco to my original code (and abbot's version) by putting it in a main function, which by itself cut the time down to about 140 seconds, and calling psyco.full() before calling main(). This added about four lines of code.
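Roughly, that arrangement looks like the sketch below. It is not the exact code that was timed: the names top5, main and enable_psyco are illustrative, and wrapping the import in try/except is an assumption so the script still runs where Psyco is not installed (Psyco is only available for older Python 2 interpreters).

```python
def top5(lines):
    """Return {aa: [(bb, cc), ...]} holding the five largest bb per aa."""
    best = {}
    for line in lines:
        aa, bb, cc = line.split()
        rows = best.setdefault(aa, [])
        rows.append((float(bb), cc))
        rows.sort(reverse=True)
        del rows[5:]              # keep only the five largest
    return best

def main(lines):
    # Doing the work inside a function makes the loop variables locals,
    # which CPython accesses faster than module-level globals.
    out = []
    for aa, rows in sorted(top5(lines).items()):
        for bb, cc in rows:
            out.append("%s %s %s" % (aa, bb, cc))
    return out

def enable_psyco():
    """Turn Psyco on if it can be imported; report whether it was."""
    try:
        import psyco              # assumption: optional dependency
        psyco.full()              # the call named in the text above
        return True
    except ImportError:
        return False

enable_psyco()
sample = ["1 0.5 a", "1 0.7 b", "2 0.1 c"]   # stand-in for sys.stdin
print("\n".join(main(sample)))
```

In a real script, sys.stdin would be passed to main() instead of the sample list.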
I can almost solve the problem using GNU sort, as follows:
kragen@inexorable:~/devel$ time LANG=C sort -nr infile -o sorted
real 1m27.476s
user 0m59.472s
sys 0m8.549s
kragen@inexorable:~/devel$ time ./top5_sorted_c < sorted > outfile
real 0m5.515s
user 0m4.868s
sys 0m0.452s
Here top5_sorted_c is this short C program:
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
enum { linesize = 1024 };
char buf[linesize];
char key[linesize]; /* last key seen */
int main() {
int n = 0;
char *p;
while (fgets(buf, linesize, stdin)) {
for (p = buf; *p && !isspace(*p); p++) /* find end of key on this line */
;
if (p - buf != strlen(key) || 0 != memcmp(buf, key, p - buf))
n = 0; /* this is a new key */
n++;
if (n <= 5) /* copy up to five lines for each key */
if (fputs(buf, stdout) == EOF) abort();
if (n == 1) { /* save new key in `key` */
memcpy(key, buf, p - buf);
key[p-buf] = '\0';
}
}
return 0;
}
I first tried writing that program in C++ as follows, and I got runtimes which were substantially slower, at 33.6±2.3 seconds instead of 5.5±0.1 seconds:
#include <map>
#include <iostream>
#include <string>
int main() {
using namespace std;
int n = 0;
string prev, aa, bb, cc;
while (cin >> aa >> bb >> cc) {
if (aa != prev) n = 0;
++n;
if (n <= 5) cout << aa << " " << bb << " " << cc << endl;
prev = aa;
}
return 0;
}
I did say almost. The problem is that sort -n does okay for most of the data, but it fails when it's trying to compare 0.33 with 3.78168e-05. So to get this kind of performance and actually solve the problem, I need a better sort.
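The failure mode can be illustrated with a small sketch. The function plain_decimal_prefix is hypothetical: it is only a rough model of a comparison that reads digits and a decimal point and stops at the exponent, not GNU sort's actual code. Sorting by the full float value gives the intended order. (GNU sort's documentation also describes a -g, --general-numeric-sort option aimed at such values; it is not timed here.)

```python
import re

def plain_decimal_prefix(text):
    """Value of the leading [-]digits[.digits] part, ignoring any exponent.

    Assumption: a simplified stand-in for a comparison that does not
    understand scientific notation.
    """
    prefix = re.match(r"-?\d*\.?\d*", text).group(0)
    if prefix in ("", "-", ".", "-."):
        return 0.0
    return float(prefix)

scores = ["0.33", "3.78168e-05", "0.121367"]

# Descending order, as with sort -nr: the exponent is ignored,
# so 3.78168e-05 is treated as 3.78168 and lands on top.
by_prefix = sorted(scores, key=plain_decimal_prefix, reverse=True)

# Descending order by the real value.
by_value = sorted(scores, key=float, reverse=True)

print(by_prefix)   # ['3.78168e-05', '0.33', '0.121367']
print(by_value)    # ['0.33', '0.121367', '3.78168e-05']
```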
Anyway, I kind of feel like I'm whining, but the sort-and-filter approach is about 5× faster than the Python program, while the elegant STL program from hrnt is actually a little slower — there seems to be some kind of gross inefficiency in <iostream>. I don't know where the other 83% of the runtime is going in that little C++ version of the filter, but it isn't going anywhere useful, which makes me suspect I don't know where it's going in hrnt's std::map version either. Could that version be sped up 5× too? Because that would be pretty cool. Its working set might be bigger than my L2 cache, but as it happens it probably isn't.
Some investigation with callgrind says my filter program in C++ is executing 97% of its instructions inside of operator >>. I can identify at least 10 function calls per input byte, and cin.sync_with_stdio(false); doesn’t help. This probably means I could get hrnt’s C++ program to run substantially faster by parsing input lines more efficiently.
Edit: kcachegrind claims that hrnt’s program executes 62% of its instructions (on a small 157000 line input file) extracting doubles from an istream. A substantial part of this is because the istreams library apparently executes about 13 function calls per input byte when trying to parse a double. Insane. Could I be misunderstanding kcachegrind's output?
Anyway, any other suggestions?
Pretty straightforward Caml (27 * 10^6 rows -- 27 sec, C++ by hrnt -- 29 sec)
open Printf
open ExtLib
let (>>) x f = f x
let cmp x y = compare (fst x : float) (fst y)
let wsp = Str.regexp "[ \t]+"
let () =
let all = Hashtbl.create 1024 in
Std.input_lines stdin >> Enum.iter (fun line ->
let [a;b;c] = Str.split wsp line in
let b = float_of_string b in
try
match Hashtbl.find all a with
| [] -> assert false
| (bmin,_) as prev::tl -> if b > bmin then
begin
let m = List.sort ~cmp ((b,c)::tl) in
Hashtbl.replace all a (if List.length tl < 4 then prev::m else m)
end
with Not_found -> Hashtbl.add all a [b,c]
);
all >> Hashtbl.iter (fun a -> List.iter (fun (b,c) -> printf "%s %f %s\n" a b c))
Here is a C++ solution. I didn't have a lot of data to test it with, however, so I don't know how fast it actually is.
[edit] Thanks to the test data provided by the awk script in this thread, I managed to clean up and speed up the code a bit. I am not trying to find out the fastest possible version - the intent is to provide a reasonably fast version that isn't as ugly as people seem to think STL solutions can be.
This version should be about twice as fast as the first version (goes through 27 million lines in about 35 seconds). Gcc users, remember to compile this with -O2.
#include <map>
#include <iostream>
#include <functional>
#include <utility>
#include <string>
int main() {
using namespace std;
typedef std::map<string, std::multimap<double, string> > Map;
Map m;
string aa, cc;
double bb;
std::cin.sync_with_stdio(false); // Dunno if this has any effect, but anyways.
while (std::cin >> aa >> bb >> cc)
{
if (m[aa].size() == 5)
{
Map::mapped_type::iterator iter = m[aa].begin();
if (bb < iter->first)
continue;
m[aa].erase(iter);
}
m[aa].insert(make_pair(bb, cc));
}
for (Map::const_iterator iter = m.begin(); iter != m.end(); ++iter)
for (Map::mapped_type::const_iterator iter2 = iter->second.begin();
iter2 != iter->second.end();
++iter2)
std::cout << iter->first << " " << iter2->first << " " << iter2->second <<
std::endl;
}
Interestingly, the original Python solution is by far the cleanest looking (although the C++ example comes close).
How about using Pyrex or Psyco on your original code?
Has anybody tried doing this problem with just awk? Specifically 'mawk'? It should be faster than even Java and C++, according to this blog post: http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/
EDIT: Just wanted to clarify that the only claim being made in that blog post is that for a certain class of problems that are specifically suited to awk-style processing, the mawk virtual machine can beat 'vanilla' implementations in Java and C++.
Since you asked about Matlab, here's how I did something like what you're asking for. I tried to do it without any for loops, but I do have one because I didn't care to take a long time with it. If you were worried about memory then you could pull data from the stream in chunks with fscanf rather than reading the entire buffer.
fid = fopen('fakedata.txt','r');
tic
A=fscanf(fid,'%d %d %d\n');
A=reshape(A,3,length(A)/3)'; %Matlab reads the data into one long column'
Names = unique(A(:,1));
for i=1:length(Names)
indices = find(A(:,1)==Names(i)); %Grab all instances of key i
[Y,I] = sort(A(indices,2),1,'descend'); %sort in descending order of 2nd record
A(indices(I(1:min([5,length(indices(I))]))),:) %Print the top five
end
toc
fclose(fid)
Speaking of lower bounds on compute time :
Let's analyze my algo above :
for each row (key,score,id) :
    create or fetch a list of top scores for the row's key
    if len( this list ) < N
        append current
    else if current score > minimum score in list
        replace minimum of list with current row
        update minimum of all lists if needed
Let N be the N in top-N
Let R be the number of rows in your data set
Let K be the number of distinct keys
What assumptions can we make ?
R * sizeof( row ) > RAM or at least it's big enough that we don't want to load it all, use a hash to group by key, and sort each bin. For the same reason we don't sort the whole stuff.
Kragen likes hashtables, so K * sizeof(per-key state) << RAM, most probably it fits in L2/3 cache
Kragen is not sorting, so K*N << R ie each key has much more than N entries
(note : A << B means A is small relative to B)
If the data has a random distribution, then
after a small number of rows, the majority of rows will be rejected by the per-key minimum condition, the cost is 1 comparison per row.
So the cost per row is 1 hash lookup + 1 comparison + epsilon * (list insertion + (N+1) comparisons for the minimum)
If the scores have a random distribution (say between 0 and 1) and the conditions above hold, both epsilons will be very small.
Experimental proof :
The 27 million rows dataset above produces 5933 insertions into the top-N lists. All other rows are rejected by a simple key lookup and comparison. epsilon = 0.0001
So roughly, the cost is 1 lookup + comparison per row, which takes a few nanoseconds.
On current hardware, there is no way this is not going to be negligible versus IO cost and especially parsing costs.
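The estimate above can be checked on a small scale with a sketch like the following. The sizes R, K and N and the fixed seed are assumptions chosen to run quickly, far smaller than the 27 million row file, and count_insertions is an illustrative name. For uniformly random scores the number of insertions per key grows only logarithmically with the rows per key, which is why epsilon shrinks as the file grows.

```python
import random

def count_insertions(rows, n):
    """Count how many rows actually change a per-key top-n list."""
    top = {}        # key -> list of up to n scores
    lowest = {}     # key -> minimum score of a full list
    insertions = 0
    for key, score in rows:
        scores = top.setdefault(key, [])
        if len(scores) < n:
            # list not full yet: always insert
            scores.append(score)
            insertions += 1
            if len(scores) == n:
                lowest[key] = min(scores)
        elif score > lowest[key]:
            # beats the current minimum: replace it and recompute the minimum
            scores[scores.index(lowest[key])] = score
            lowest[key] = min(scores)
            insertions += 1
        # otherwise: one lookup, one comparison, row rejected
    return insertions

rng = random.Random(0)        # fixed seed, illustrative only
R, K, N = 200000, 100, 5      # assumed sizes: rows, distinct keys, top-N
rows = [(rng.randrange(K), rng.random()) for _ in range(R)]
hits = count_insertions(rows, N)
print("insertions: %d of %d rows (epsilon ~ %.4f)" % (hits, R, hits / float(R)))
```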
Isn't this just as simple as
SELECT DISTINCT aa, bb, cc FROM tablename ORDER BY bb DESC LIMIT 5
?
Of course, it's hard to tell what would be fastest without testing it against the data. And if this is something you need to run very fast, it might make sense to optimize your database to make the query faster, rather than optimizing the query.
And, of course, if you need the flat file anyway, you might as well use that.
Pick "top 5" would look something like this. Note that there's no sorting. Nor does any list in the top_5 dictionary ever grow beyond 5 elements.
from collections import defaultdict
import sys
def keep_5( aList, aPair ):
    # aList already holds 5 pairs; reject aPair if it is below the minimum
    minbb= min( bb for bb,cc in aList )
    bb, cc = aPair
    if bb < minbb: return aList
    aList.append( aPair )
    # find and drop the smallest of the 6 pairs
    min_i= 0
    for i in xrange(1,6):
        if aList[i][0] < aList[min_i][0]:
            min_i= i
    aList.pop(min_i)
    return aList
top_5= defaultdict(list)
for row in sys.stdin:
    aa, bb, cc = row.split()
    bb = float(bb)
    if len(top_5[aa]) < 5:
        top_5[aa].append( (bb,cc) )
    else:
        top_5[aa]= keep_5( top_5[aa], (bb,cc) )
The Pig version would go something like this (untested):
Data = LOAD '/my/data' using PigStorage() as (aa:int, bb:float, cc:chararray);
grp = GROUP Data by aa;
topK = FOREACH grp {
    sorted = ORDER Data by bb DESC;
    lim = LIMIT sorted 5;
    GENERATE group as aa, lim;
};
STORE topK INTO '/my/output' using PigStorage();
Pig isn't optimized for performance; its goal is to enable processing of multi-terabyte datasets using parallel execution frameworks. It does have a local mode, so you can try it, but I doubt it will beat your script.
That was a nice lunch break challenge, he, he.
Top-N is a well-known database killer. As shown by the post above, there is no way to efficiently express it in common SQL.
As for the various implementations, you have to keep in mind that the slow part in this is not the sorting or the top-N, it's the parsing of text. Have you looked at the source code for glibc's strtod() lately ?
For instance, I get, using Python :
Read data : 80.5 s
My TopN : 34.41 s
HeapTopN : 30.34 s
It is quite likely that you'll never get very fast timings, no matter what language you use, unless your data is in some format that is a lot faster to parse than text. For instance, loading the test data into postgres takes 70 s, and the majority of that is text parsing, too.
If the N in your topN is small, like 5, a C implementation of my algorithm below would probably be the fastest. If N can be larger, heaps are a much better option.
So, since your data is probably in a database, and your problem is getting at the data, not the actual processing, if you're really in need of a super fast TopN engine, what you should do is write a C module for your database of choice. Since postgres is faster for about anything, I suggest using postgres, plus it isn't difficult to write a C module for it.
Here's my Python code :
import random, sys, time, heapq
ROWS = 27000000
def make_data( fname ):
    f = open( fname, "w" )
    r = random.Random()
    for i in xrange( 0, ROWS, 10000 ):
        for j in xrange( i,i+10000 ):
            f.write( "%d %f %d\n" % (r.randint(0,100), r.uniform(0,1000), j))
        print ("write: %d\r" % i),
        sys.stdout.flush()
    print
def read_data( fname ):
    for n, line in enumerate( open( fname ) ):
        r = line.strip().split()
        yield int(r[0]),float(r[1]),r[2]
        if not (n % 10000 ):
            print ("read: %d\r" % n),
            sys.stdout.flush()
    print
def topn( ntop, data ):
    ntop -= 1
    assert ntop > 0
    min_by_key = {}
    top_by_key = {}
    for key,value,label in data:
        tup = (value,label)
        if key not in top_by_key:
            # initialize
            top_by_key[key] = [ tup ]
        else:
            top = top_by_key[ key ]
            l = len( top )
            if l > ntop:
                # replace minimum value in top if it is lower than current value
                idx = min_by_key[ key ]
                if top[idx] < tup:
                    top[idx] = tup
                    min_by_key[ key ] = top.index( min( top ) )
            elif l < ntop:
                # fill until we have ntop entries
                top.append( tup )
            else:
                # we have ntop entries in list, we'll have ntop+1
                top.append( tup )
                # initialize minimum to keep
                min_by_key[ key ] = top.index( min( top ) )
    # finalize:
    return dict( (key, sorted( values, reverse=True )) for key,values in top_by_key.iteritems() )
def grouptopn( ntop, data ):
    top_by_key = {}
    for key,value,label in data:
        if key in top_by_key:
            top_by_key[ key ].append( (value,label) )
        else:
            top_by_key[ key ] = [ (value,label) ]
    return dict( (key, sorted( values, reverse=True )[:ntop]) for key,values in top_by_key.iteritems() )
def heaptopn( ntop, data ):
    top_by_key = {}
    for key,value,label in data:
        tup = (value,label)
        if key not in top_by_key:
            top_by_key[ key ] = [ tup ]
        else:
            top = top_by_key[ key ]
            if len(top) < ntop:
                heapq.heappush(top, tup)
            else:
                if top[0] < tup:
                    heapq.heapreplace(top, tup)
    return dict( (key, sorted( values, reverse=True )) for key,values in top_by_key.iteritems() )
def dummy( data ):
    for row in data:
        pass
make_data( "data.txt" )
t = time.clock()
dummy( read_data( "data.txt" ) )
t_read = time.clock() - t
t = time.clock()
top_result = topn( 5, read_data( "data.txt" ) )
t_topn = time.clock() - t
t = time.clock()
htop_result = heaptopn( 5, read_data( "data.txt" ) )
t_htopn = time.clock() - t
# correctness checking :
for key in top_result:
    print key, " : ", " ".join (("%f:%s"%(value,label)) for (value,label) in top_result[key])
    print key, " : ", " ".join (("%f:%s"%(value,label)) for (value,label) in htop_result[key])
    print
print "Read data :", t_read
print "TopN : ", t_topn - t_read
print "HeapTopN : ", t_htopn - t_read
for key in top_result:
    assert top_result[key] == htop_result[key]
I love lunch break challenges. Here's a 1 hour implementation.
OK, when you don't want to do some extremely exotic crap like additions, nothing stops you from using a custom base-10 floating point format whose only implemented operator is comparison, right ? lol.
I had some fast-atoi code lying around from a previous project, so I just imported that.
http://www.copypastecode.com/11541/
This C source code takes about 6.6 seconds to parse the 580MB of input text (27 million lines), half of that time is fgets, lol. Then it takes approximately 0.05 seconds to compute the top-n, but I don't know for sure, since the time it takes for the top-n is less than the timer noise.
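The linked C source is not reproduced here, so the following is only one way a comparison-only decimal format could look, sketched in Python. The name decimal_key is hypothetical, and the sketch assumes plain non-negative decimals with no sign and no exponent, which is narrower than what strtod accepts.

```python
def decimal_key(text):
    """Turn a plain non-negative decimal string into a comparable tuple.

    Assumptions: no sign, no exponent, ASCII digits only. The value is
    never converted to a binary float; only comparison is supported.
    """
    whole, _, frac = text.partition(".")
    whole = whole.lstrip("0")
    # More integer digits means a bigger number; with equal digit counts
    # the strings compare lexicographically. Fraction digits compare
    # lexicographically once trailing zeros are removed.
    return (len(whole), whole, frac.rstrip("0"))

samples = ["0.49357", "0.5", "12.25", "3.0", "3", "0.640761", "0.0640761"]
ordered = sorted(samples, key=decimal_key)
print(ordered)
# Same order as parsing every value with float() first:
print(ordered == sorted(samples, key=float))
```

The trade-off is that the key supports ordering and equality only, which is all a top-N filter needs.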
You'll be the one to test it for correctness though XDDDDDDDDDDD
Interesting huh ?
Well, please grab a coffee and read the source code for strtod -- it's mindboggling, but needed, if you want float -> text -> float to give back the same float you started with.... really...
Parsing integers is a lot faster (not so much in python, though, but in C, yes).
Anyway, putting the data in a Postgres table :
SELECT count( key ) FROM the dataset in the above program
=> 7 s (so it takes 7 s to read the 27M records)
CREATE INDEX topn_key_value ON topn( key, value );
191 s
CREATE TEMPORARY TABLE topkeys AS SELECT key FROM topn GROUP BY key;
12 s
(You can use the index to get distinct values of 'key' faster too but it requires some light plpgsql hacking)
CREATE TEMPORARY TABLE top AS SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1) AS r FROM topkeys a) foo;
Time: 15.310 ms
INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 1) AS r FROM topkeys a) foo;
Time: 17.853 ms
INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 2) AS r FROM topkeys a) foo;
Time: 13.983 ms
INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 3) AS r FROM topkeys a) foo;
Time: 16.860 ms
INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 4) AS r FROM topkeys a) foo;
Time: 17.651 ms
INSERT INTO top SELECT (r).* FROM (SELECT (SELECT b AS r FROM topn b WHERE b.key=a.key ORDER BY value DESC LIMIT 1 OFFSET 5) AS r FROM topkeys a) foo;
Time: 19.216 ms
SELECT * FROM top ORDER BY key,value;
As you can see computing the top-n is extremely fast (provided n is small) but creating the (mandatory) index is extremely slow because it involves a full sort.
Your best bet is to use a format that is fast to parse (either binary, or write a custom C aggregate for your database, which would be the best choice IMHO). The runtime in the C program shouldn't be more than 1s if python can do it in 1 s.
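As a sketch of what a fast-to-parse binary format might look like, the following uses Python's struct module (Python 3.4 or later for iter_unpack) with fixed-size records. The layout (32-bit key, 64-bit float score, 32-bit unsigned id) and the function names are assumptions for illustration; nothing in this thread prescribes them.

```python
import io
import struct

# Assumed record layout: little-endian int32 key, float64 score, uint32 id.
RECORD = struct.Struct("<idI")

def write_records(stream, rows):
    """Write (key, score, id) tuples as fixed-size binary records."""
    for key, score, ident in rows:
        stream.write(RECORD.pack(key, score, ident))

def read_records(stream, chunk_records=65536):
    """Yield (key, score, id) tuples, reading a chunk of records at a time."""
    while True:
        data = stream.read(RECORD.size * chunk_records)
        if not data:
            break
        # No text parsing here: each record is decoded straight from bytes.
        for record in RECORD.iter_unpack(data):
            yield record

# Round trip through an in-memory buffer standing in for a file on disk.
rows = [(42, 0.49357, 0), (96, 0.48075, 1), (27, 0.640761, 2)]
buf = io.BytesIO()
write_records(buf, rows)
buf.seek(0)
print(list(read_records(buf)))
```

A file opened with open(path, "rb") can be passed in place of the BytesIO buffer; the file size must be a whole number of records.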
