Solution
This solved all issues with my Perl code (plus some extra implementation code... :-) ). In conclusion, both Perl and Python are equally awesome.
use WWW::Curl::Easy;
Thanks to ALL who responded, very much appreciated.
Edit
It appears that the Perl code I am using is spending the majority of its time performing the HTTP GET, for example:
my $start_time = gettimeofday;
$request = HTTP::Request->new('GET', 'http://localhost:8080/data.json');
$response = $ua->request($request);
$page = $response->content;
my $end_time = gettimeofday;
print "Time taken #{[ $end_time - $start_time ]} seconds.\n";
The result is:
Time taken 74.2324419021606 seconds.
My python code in comparison:
start = time.time()
r = requests.get('http://localhost:8080/data.json', timeout=120, stream=False)
maxsize = 100000000
content = ''
for chunk in r.iter_content(2048):
    content += chunk
    if len(content) > maxsize:
        r.close()
        raise ValueError('Response too large')
end = time.time()
timetaken = end - start
print timetaken
The result is:
20.3471381664
In both cases the sort times are sub-second. So first of all I apologise for the misleading question; it is another lesson for me to never make assumptions... :-)
I'm not sure what the best thing to do with this question is now. Perhaps someone can propose a better way of performing the request in Perl?
End of edit
This is just a quick question regarding sort performance differences in Perl vs Python. This is not a question about which language is better/faster, etc. For the record, I first wrote this in Perl, noticed the time the sort was taking, and then tried to write the same thing in Python to see how fast it would be. I simply want to know: how can I make the Perl code perform as fast as the Python code?
Let's say we have the following JSON:
{
    "3434343424335": {
        "key1": 2322,
        "key2": 88232,
        "key3": 83844,
        "key4": 444454,
        "key5": 34343543,
        "key6": 2323232
    },
    "78237236343434": {
        "key1": 23676722,
        "key2": 856568232,
        "key3": 838723244,
        "key4": 4434544454,
        "key5": 3432323543,
        "key6": 2323232
    }
}
Let's say we have around 30k-40k such records which we want to sort by one of the sub-keys. We then want to build a new array of records ordered by that sub-key.
Perl - Takes around 27 seconds
my @list;
$decoded = decode_json($page);
foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
    push(@list, {"key" => $id, "key1" => $decoded->{$id}{key1}, ...etc});
}
Python - Takes around 6 seconds
list = []
data = json.loads(content)
data2 = sorted(data, key=lambda x: data[x]['key5'], reverse=True)
for key in data2:
    tmp = {'id': key, 'key1': data[key]['key1'], etc.....}
    list.append(tmp)
For the perl code, I have tried using the following tweaks:
use sort '_quicksort'; # use a quicksort algorithm
use sort '_mergesort'; # use a mergesort algorithm
Your benchmark is flawed: you're benchmarking multiple variables, not one. It is not just sorting data; it is also decoding JSON, creating strings, and appending to an array. You can't know how much time is spent sorting and how much is spent doing everything else.
The matter is made worse in that there are several different JSON implementations in Perl each with their own different performance characteristics. Change the underlying JSON library and the benchmark will change again.
If you want to benchmark sort, you'll have to change your benchmark code to eliminate the cost of loading your test data from the benchmark, JSON or not.
Perl and Python have their own internal benchmarking libraries that can benchmark individual functions, but their instrumentation can make them perform far less well than they would in the real world. The performance drag from each benchmarking implementation will be different and might introduce a false bias. These benchmarking libraries are more useful for comparing two functions in the same program. For comparing between languages, keep it simple.
Simplest thing to do to get an accurate benchmark is to time them within the program using the wall clock.
# The current time to the microsecond.
use Time::HiRes qw(gettimeofday);
my @list;
my $decoded = decode_json($page);

my $start_time = gettimeofday;
foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
    push(@list, {"key" => $id, "key1" => $decoded->{$id}{key1}, ...etc});
}
my $end_time = gettimeofday;

print "sort and append took @{[ $end_time - $start_time ]} seconds\n";
(I leave the Python version as an exercise)
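A minimal Python sketch of the equivalent wall-clock timing (an addition, not part of the original answer; it assumes content holds the JSON text fetched earlier, and the commented time.process_time() call is the Python 3 way to get the CPU seconds mentioned below):
import json
import time

# `content` is assumed to hold the JSON text fetched earlier in the question.
data = json.loads(content)            # decode outside the timed region

start = time.time()                   # wall clock
ordered = sorted(data, key=lambda k: data[k]['key5'], reverse=True)
result = [{'id': k, 'key5': data[k]['key5']} for k in ordered]
end = time.time()

# For CPU seconds instead of wall clock (Python 3), use time.process_time().
print("sort and append took %.6f seconds" % (end - start))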
From here you can improve your technique. You can use CPU seconds instead of wall clock. The array append and cost of creating the string are still involved in the benchmark, they can be eliminated so you're just benchmarking sort. And so on.
Additionally, you can use a profiler to find out where your programs are spending their time. These have the same raw performance caveats as benchmarking libraries, the results are only useful to find out what percentage of its time a program is using where, but it will prove useful to quickly see if your benchmark has unexpected drag.
The important thing is to benchmark what you think you're benchmarking.
Something else is at play here; I can run your sort in half a second. Improving that is not going to depend on sorting algorithm so much as reducing the amount of code run per comparison; a Schwartzian Transform gets it to a third of a second, a Guttman-Rosler Transform gets it down to a quarter of a second:
#!/usr/bin/perl
use 5.014;
use warnings;
my $decoded = { map( (int rand 1e9, { map( ("key$_", int rand 1e9), 1..6 ) } ), 1..40000 ) };
use Benchmark 'timethese';
timethese( -5, {
    'original' => sub {
        my @list;
        foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
            push(@list, {"key" => $id, %{$decoded->{$id}}});
        }
    },
    'st' => sub {
        my @list;
        foreach my $id (
            map $_->[1],
            sort { $b->[0] <=> $a->[0] }
            map [ $decoded->{$_}{key5}, $_ ],
            keys %{$decoded}
        ) {
            push(@list, {"key" => $id, %{$decoded->{$id}}});
        }
    },
    'grt' => sub {
        my $maxkeylen = 15;
        my @list;
        foreach my $id (
            map substr($_, $maxkeylen),
            sort { $b cmp $a }
            map sprintf('%0*s', $maxkeylen, $decoded->{$_}{key5}) . $_,
            keys %{$decoded}
        ) {
            push(@list, {"key" => $id, %{$decoded->{$id}}});
        }
    },
});
Don't create a new hash for each record. Just add the key to the existing one.
$decoded->{$_}{key} = $_
for keys(%$decoded);
my @list = sort { $b->{key5} <=> $a->{key5} } values(%$decoded);
Using Sort::Key will make it even faster.
use Sort::Key qw( rukeysort );
$decoded->{$_}{key} = $_
for keys(%$decoded);
my @list = rukeysort { $_->{key5} } values(%$decoded);
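The same trick translates to the Python side of the question: annotate the existing dicts instead of building new ones (a rough sketch added here, assuming data is the decoded dict from the question):
# Add the id into each existing record instead of building new dicts.
for key, record in data.items():
    record['id'] = key

# Sort the records themselves, descending by key5.
result = sorted(data.values(), key=lambda r: r['key5'], reverse=True)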
Related
We have a lot of historical data that we need to migrate into HBase. Our HBase setup is such that the (timestamp) versioning is relevant, and using some domain knowledge we know at which time the different columns were available. The amount of data is vast, so I was wondering what would be a good way of doing this bulk load. Scala or Python is fine, preferably with Spark.
I've published a gist that gets you most of the way there. I'll reproduce the most relevant method here:
def write[TK, TF, TQ, TV](
  tableName: String,
  ds: Dataset[(TK, Map[TF, Map[TQ, TV]])],
  batch: Int = 1000
)(implicit
  fk: TK => HBaseData,
  ff: TF => HBaseData,
  fq: TQ => HBaseData,
  fv: TV => HBaseData
): Unit = {
  ds.foreachPartition(p => {
    val hbase = HBase.getHBase
    val table = hbase.getTable(TableName.valueOf(tableName))
    val puts = ArrayBuffer[Put]()

    p.foreach(r => {
      val put = new Put(r._1)
      r._2.foreach( f => {
        f._2.foreach( q => {
          put.addColumn(f._1, q._1, q._2)
        })
      })
      puts += put
      if (puts.length >= batch) {
        table.put(puts.asJava)
        puts.clear()
      }
    })
    if (puts.nonEmpty) {
      table.put(puts.asJava)
      puts.clear()
    }
    table.close()
  })
}
The caveat is that this method only uses the HBase timestamp in its default behavior, so it will have to be extended to let you provide your own timestamp. Essentially, just make the TV type into a Map[Long, TV] and add the appropriate additional nested loop.
The HBaseData type is a case class with several implicit methods to convert from the most common types to an Array[Byte] for efficient HBase storage.
The getHBase method ensures only one connection to HBase from each partition, to avoid connecting/disconnecting for every record.
Hopefully this is all sensible, as I implemented this as a beginner in generics.
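Since the question says Python with Spark would also be fine, here is a rough PySpark sketch of the same per-partition batching idea using the happybase client. The host, table name, and the shape of the input RDD are placeholders, and happybase itself is an assumption rather than part of this answer:
import happybase

def write_partition(rows):
    # One HBase connection per partition, mirroring getHBase above.
    connection = happybase.Connection('hbase-host')   # placeholder host
    table = connection.table('my_table')              # placeholder table name
    # Batch mutations; flushed every 1000 puts and again on exit.
    with table.batch(batch_size=1000) as b:
        for row_key, families in rows:
            values = {}
            for family, qualifiers in families.items():
                for qualifier, value in qualifiers.items():
                    values['%s:%s' % (family, qualifier)] = value
            b.put(row_key, values)
    connection.close()

# rdd is assumed to hold (row_key, {family: {qualifier: value}}) pairs.
rdd.foreachPartition(write_partition)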
Good evening folks! I have been wracking my brain on this one for a good few hours now and could do with a little bit of a pointer in the right direction. I'm playing around with some API calls and trying to make a little project for myself.
The JSON data is stored in arrays, and as such, to get the information I want (it is from a Transport API) I have been doing the following:
x = apirequest
x = x.json()
for i in range(0, 4):
    print(x['routes'][i]['duration'])
    print(x['routes'][i]['departure_time'])
    print(x['routes'][i]['arrival_time'])
This will return the following
06:58:00
23:39
06:37
05:08:00
05:14
10:22
03:41:00
05:30
09:11
03:47:00
06:24
10:11
What I am trying to do is return only the shortest journeys. I could do it if it were a single-layer JSON string, but I am not too familiar with multi-level arrays. I can't return ['duration'] without utilising ['routes'] and a route index (in this case 0 through 3).
I can use an if statement to iterate through them easily enough, but there must be a way to accomplish it directly through the JSON that I am missing. I also thought about adding the results to a separate array and then filtering that, but there are a few other fields I want to grab from the data when I've cracked this part.
What I am finding as I learn is that I tend to do things a long winded way, often finding out my 10-15 line solutions on codewars are actually aimed at being done in 2-3 lines.
Example JSON data
{
    "request_time": "2018-05-29T19:03:04+01:00",
    "source": "Traveline southeast journey planning API",
    "acknowledgements": "Traveline southeast",
    "routes": [{
        "duration": "06:58:00",
        "route_parts": [{
            "mode": "foot",
            "from_point_name": "Corunna Court, Wrexham",
            "to_point_name": "Wrexham General Rail Station",
            "destination": "",
            "line_name": "",
            "duration": "00:36:00",
            "departure_time": "23:39",
            "arrival_time": "00:15"
        }]
    }]
}
Hope you can help steer me in the right direction!
Here's one solution using datetime.timedelta. Data from @fferri.
from datetime import timedelta

x = {'routes': [{'duration':'06:58:00','departure_time':'23:39','arrival_time':'06:37'},
                {'duration':'05:08:00','departure_time':'05:14','arrival_time':'10:22'},
                {'duration':'03:41:00','departure_time':'05:30','arrival_time':'09:11'},
                {'duration':'03:47:00','departure_time':'06:24','arrival_time':'10:11'}]}

def minimum_time(k):
    h, m, s = map(int, x['routes'][k]['duration'].split(':'))
    return timedelta(hours=h, minutes=m, seconds=s)

res = min(range(4), key=minimum_time)  # 2
You can then access the appropriate sub-dictionary via x['routes'][res].
Using min() with a key argument to indicate which field should be used for finding the minimum value:
x={'routes':[
{'duration':'06:58:00','departure_time':'23:39','arrival_time':'06:37'},
{'duration':'05:08:00','departure_time':'05:14','arrival_time':'10:22'},
{'duration':'03:41:00','departure_time':'05:30','arrival_time':'09:11'},
{'duration':'03:47:00','departure_time':'06:24','arrival_time':'10:11'}
]}
best=min(x['routes'], key=lambda d: d['duration'])
# best={'duration': '03:41:00', 'departure_time': '05:30', 'arrival_time': '09:11'}
The min(iterable, key=...) function is what you are looking for:
x = { 'routes': [ {'dur':3, 'depart':1, 'arrive':4},
{'dur':2, 'depart':2, 'arrive':4}]}
min(x['routes'], key=lambda item: item['dur'])
Returns:
{'dur': 2, 'depart': 2, 'arrive': 4}
First, the fact that x is initialized from JSON isn't particularly relevant. It's a dict, and that's all that is important.
To answer your question, you just need the key attribute to min:
shortest = min(x['routes'], key=lambda d: d['duration'])
I store time-series data in HBase. The rowkey is composed from user_id and timestamp, like this:
{
    "userid1-1428364800" : {
        "columnFamily1" : {
            "val" : "1"
        }
    },
    "userid1-1428364803" : {
        "columnFamily1" : {
            "val" : "2"
        }
    },
    "userid2-1428364812" : {
        "columnFamily1" : {
            "val" : "abc"
        }
    }
}
Now I need to perform per-user analysis. Here is the initialization of hbase_rdd (from here)
sc = SparkContext(appName="HBaseInputFormat")
conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)
The natural mapreduce-like way to process would be:
processed = (hbase_rdd
             .map(lambda row: (row[0].split('-')[0], (row[0].split('-')[1], row[1])))  # shift timestamp from key to value
             .groupByKey()
             .map(processUserData))  # process each user's data
While executing the first map (shifting the timestamp from key to value) it is crucial to know when the time-series data of the current user is finished, so that the groupByKey-like processing can start. Then we would not need to map over the whole table and store all the temporary data. This is possible because HBase stores row keys in sorted order.
With hadoop streaming it could be done in such way:
import sys

current_user_data = []
last_userid = None

for line in sys.stdin:
    k, v = line.split('\t')
    userid, timestamp = k.split('-')
    if userid != last_userid and current_user_data:
        print processUserData(last_userid, current_user_data)
        last_userid = userid
        current_user_data = [(timestamp, v)]
    else:
        current_user_data.append((timestamp, v))
The question is: how to utilize the sorted order of hbase keys within Spark?
I'm not super familiar with the guarantees you get with the way you're pulling data from HBase, but if I understand correctly, I can answer with just plain old Spark.
You've got some RDD[X]. As far as Spark knows, the Xs in that RDD are completely unordered. But you have some outside knowledge, and you can guarantee that the data is in fact grouped by some field of X (and perhaps even sorted by another field).
In that case, you can use mapPartitions to do virtually the same thing you did with hadoop streaming. That lets you iterate over all the records in one partition, so you can look for blocks of records w/ the same key.
val myRdd: RDD[X] = ...
val groupedRDD: RDD[Seq[X]] = myRdd.mapPartitions { itr =>
  var currentUserData = new scala.collection.mutable.ArrayBuffer[X]()
  var currentUser: X = null
  // itr is an iterator over *all* the records in one partition
  itr.flatMap { x =>
    if (currentUser != null && x.userId == currentUser.userId) {
      // same user as before -- add the data to our list
      currentUserData += x
      None
    } else {
      // it's a new user -- return all the data for the old user, and make
      // another buffer for the new user
      val userDataGrouped = currentUserData
      currentUserData = new scala.collection.mutable.ArrayBuffer[X]()
      currentUserData += x
      currentUser = x
      Some(userDataGrouped)
    }
  }
}
// now groupedRDD has all the data for one user grouped together, and we didn't
// need to do an expensive shuffle. Also, the above transformation is lazy, so
// we don't necessarily even store all that data in memory -- we could still
// do more filtering on the fly, e.g.:
val usersWithLotsOfData = groupedRDD.filter{ userData => userData.size > 10 }
I realize you wanted to use Python; sorry, I figure I'm more likely to get the example correct if I write it in Scala, and I think the type annotations make the meaning more clear, but that is probably a Scala bias... :). In any case, hopefully you can understand what is going on and translate it. (Don't worry too much about flatMap & Some & None; they're probably unimportant if you understand the idea...)
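For reference, a rough Python translation of the same mapPartitions idea (an untested sketch added here, not part of the original answer; it assumes each record arrives as a (rowkey, value) pair with row keys of the form "userid-timestamp", as in the question):
def group_sorted_users(records):
    """Group consecutive records with the same userid within one partition."""
    current_userid = None
    current_user_data = []
    for rowkey, value in records:
        userid, timestamp = rowkey.split('-')
        if userid != current_userid and current_user_data:
            yield (current_userid, current_user_data)
            current_user_data = []
        current_userid = userid
        current_user_data.append((timestamp, value))
    if current_user_data:  # flush the last user in the partition
        yield (current_userid, current_user_data)

grouped_rdd = hbase_rdd.mapPartitions(group_sorted_users)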
From http://ccl.northwestern.edu/netlogo/models/community/Astardemo, I coded an A* algorithm by using nodes in a network to define least-cost paths. The code seems to work, but it is much too slow when I use it at large spatial scales. My landscape has an extent of 1000 patches x 1000 patches with 1 patch = 1 pixel. Even if I reduce it to 400 patches x 400 patches with 1 patch = 1 pixel, it is still too slow (I can't reduce my landscape below 400 patches x 400 patches). Here is the code:
to find-path [ source-node destination-node]
  let search-done? false
  let search-path []
  let current-node 0
  set list-open []
  set list-closed []
  let list-links-with-nodes-in-list-closed []
  let list-links []
  set list-open lput source-node list-open
  while [ search-done? != true]
  [
    ifelse length list-open != 0
    [
      set list-open sort-by [[f] of ?1 < [f] of ?2] list-open
      set current-node item 0 list-open
      set list-open remove-item 0 list-open
      set list-closed lput current-node list-closed
      ask current-node
      [
        if parent-node != 0
        [
          set list-links-with-nodes-in-list-closed lput link-with parent-node list-links-with-nodes-in-list-closed
        ]
        ifelse any? (nodes-on neighbors4) with [ (xcor = [ xcor ] of destination-node) and (ycor = [ycor] of destination-node)]
        [
          set search-done? true
        ]
        [
          ask (nodes-on neighbors4) with [ (not member? self list-closed) and (self != parent-node) ]
          [
            if not member? self list-open and self != source-node and self != destination-node
            [
              set list-open lput self list-open
              set parent-node current-node
              set list-links sentence (list-links-with-nodes-in-list-closed) (link-with parent-node)
              set g sum (map [ [link-cost] of ? ] list-links)
              set h distance destination-node
              set f (g + h)
            ]
          ]
        ]
      ]
    ]
    [
      user-message( "A path from the source to the destination does not exist." )
      report []
    ]
  ]
  set search-path lput current-node search-path
  let temp first search-path
  while [ temp != source-node ]
  [
    ask temp
    [
      set color red
    ]
    set search-path lput [parent-node] of temp search-path
    set temp [parent-node] of temp
  ]
  set search-path fput destination-node search-path
  set search-path reverse search-path
  print search-path
end
Unfortunately, I don't know how to speed up this code. Is there a way to rapidly calculate least-cost paths at large spatial scales?
Thanks very much for your help.
I was curious, so I tested my A* and here is my result:
Maze: 1280 x 800 x 32-bit pixels
it took ~23 ms
no multithreading (AMD 3.2 GHz)
C++ 32-bit app (BDS2006 Turbo C++, or Borland C++ Builder 2006 if you like)
The slowest path I found was ~44 ms (filling almost the whole map).
I think that is fast enough...
Here is the source for my A* class:
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
const DWORD A_star_space=0xFFFFFFFF;
const DWORD A_star_wall =0xFFFFFFFE;
//---------------------------------------------------------------------------
class A_star
{
public:
// variables
DWORD **map; // map[ys][xs]
int xs,ys; // map resolution xs*ys<0xFFFFFFFE !!!
int *px,*py,ps; // output points px[ps],py[ps] after compute()
// internals
A_star();
~A_star();
void _freemap(); // release map memory
void _freepnt(); // release px,py memory
// inteface
void resize(int _xs,int _ys); // realloc map to new resolution
void set(Graphics::TBitmap *bmp,DWORD col_wall); // copy bitmap to map
void get(Graphics::TBitmap *bmp); // draw map to bitmap for debuging
void compute(int x0,int y0,int x1,int y1); // compute path from x0,y0 to x1,y1 output to px,py
};
//---------------------------------------------------------------------------
A_star::A_star() { map=NULL; xs=0; ys=0; px=NULL; py=NULL; ps=0; }
A_star::~A_star() { _freemap(); _freepnt(); }
void A_star::_freemap() { if (map) delete[] map; map=NULL; xs=0; ys=0; }
void A_star::_freepnt() { if (px) delete[] px; px=NULL; if (py) delete[] py; py=NULL; ps=0; }
//---------------------------------------------------------------------------
void A_star::resize(int _xs,int _ys)
{
if ((xs==_xs)&&(ys==_ys)) return;
_freemap();
xs=_xs; ys=_ys;
map=new DWORD*[ys];
for (int y=0;y<ys;y++)
map[y]=new DWORD[xs];
}
//---------------------------------------------------------------------------
void A_star::set(Graphics::TBitmap *bmp,DWORD col_wall)
{
int x,y;
DWORD *p,c;
resize(bmp->Width,bmp->Height);
for (y=0;y<ys;y++)
for (p=(DWORD*)bmp->ScanLine[y],x=0;x<xs;x++)
{
c=A_star_space;
if (p[x]==col_wall) c=A_star_wall;
map[y][x]=c;
}
}
//---------------------------------------------------------------------------
void A_star::get(Graphics::TBitmap *bmp)
{
int x,y;
DWORD *p,c;
bmp->SetSize(xs,ys);
for (y=0;y<ys;y++)
for (p=(DWORD*)bmp->ScanLine[y],x=0;x<xs;x++)
{
c=map[y][x];
if (c==A_star_wall ) c=0x00000000;
else if (c==A_star_space) c=0x00FFFFFF;
else c=((c>>1)&0x7F)+0x00404040;
p[x]=c;
}
}
//---------------------------------------------------------------------------
void A_star::compute(int x0,int y0,int x1,int y1)
{
int x,y,xmin,xmax,ymin,ymax,xx,yy;
DWORD i,j,e;
// [clear previous paths]
for (y=0;y<ys;y++)
for (x=0;x<xs;x++)
if (map[y][x]!=A_star_wall)
map[y][x]=A_star_space;
/*
// [A* no-optimizatims]
xmin=x0; xmax=x0; ymin=y0; ymax=y0;
if (map[y0][x0]==A_star_space)
for (i=0,j=1,e=1,map[y0][x0]=i;(e)&&(map[y1][x1]==A_star_space);i++,j++)
for (e=0,y=ymin;y<=ymax;y++)
for ( x=xmin;x<=xmax;x++)
if (map[y][x]==i)
{
yy=y-1; xx=x; if ((yy>=0)&&(map[yy][xx]==A_star_space)){ map[yy][xx]=j; e=1; if (ymin>yy) ymin=yy; }
yy=y+1; xx=x; if ((yy<ys)&&(map[yy][xx]==A_star_space)){ map[yy][xx]=j; e=1; if (ymax<yy) ymax=yy; }
yy=y; xx=x-1; if ((xx>=0)&&(map[yy][xx]==A_star_space)){ map[yy][xx]=j; e=1; if (xmin>xx) xmin=xx; }
yy=y; xx=x+1; if ((xx<xs)&&(map[yy][xx]==A_star_space)){ map[yy][xx]=j; e=1; if (xmax<xx) xmax=xx; }
}
*/
// [A* changed points list]
// init space for 2 points list
_freepnt();
int i0=0,i1=xs*ys,n0=0,n1=0,ii;
px=new int[i1*2];
py=new int[i1*2];
// if start is not on space then stop
if (map[y0][x0]==A_star_space)
{
// init start position to first point list
px[i0+n0]=x0; py[i0+n0]=y0; n0++; map[y0][x0]=0;
// search until hit the destination (swap point lists after each iteration and clear the second one)
for (j=1,e=1;(e)&&(map[y1][x1]==A_star_space);j++,ii=i0,i0=i1,i1=ii,n0=n1,n1=0)
// test neibours of all points in first list and add valid new points to second one
for (e=0,ii=i0;ii<i0+n0;ii++)
{
x=px[ii]; y=py[ii];
yy=y-1; xx=x; if ((yy>=0)&&(map[yy][xx]==A_star_space)){ map[yy][xx]=j; e=1; px[i1+n1]=xx; py[i1+n1]=yy; n1++; map[yy][xx]=j; }
yy=y+1; xx=x; if ((yy<ys)&&(map[yy][xx]==A_star_space)){ map[yy][xx]=j; e=1; px[i1+n1]=xx; py[i1+n1]=yy; n1++; map[yy][xx]=j; }
yy=y; xx=x-1; if ((xx>=0)&&(map[yy][xx]==A_star_space)){ map[yy][xx]=j; e=1; px[i1+n1]=xx; py[i1+n1]=yy; n1++; map[yy][xx]=j; }
yy=y; xx=x+1; if ((xx<xs)&&(map[yy][xx]==A_star_space)){ map[yy][xx]=j; e=1; px[i1+n1]=xx; py[i1+n1]=yy; n1++; map[yy][xx]=j; }
}
}
// [reconstruct path]
_freepnt();
if (map[y1][x1]==A_star_space) return;
if (map[y1][x1]==A_star_wall) return;
ps=map[y1][x1]+1;
px=new int[ps];
py=new int[ps];
for (i=0;i<ps;i++) { px[i]=x0; py[i]=y0; }
for (x=x1,y=y1,i=ps-1,j=i-1;i>=0;i--,j--)
{
px[i]=x;
py[i]=y;
if ((y> 0)&&(map[y-1][x]==j)) { y--; continue; }
if ((y<ys-1)&&(map[y+1][x]==j)) { y++; continue; }
if ((x> 1)&&(map[y][x-1]==j)) { x--; continue; }
if ((x<xs-0)&&(map[y][x+1]==j)) { x++; continue; }
break;
}
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
I know it is a bit too much code, but it is complete. The important stuff is in the member function compute, so search for [A* changed points list]. The unoptimized A* (commented out) is about 100 times slower.
The code uses bitmaps from the Borland VCL, so if you do not have it, ignore the functions get and set and rewrite them for your own input/output graphics style. They just load the map from a bitmap and draw the computed map back to a bitmap.
Usage:
// init
A_star map;
Graphics::TBitmap *maze=new Graphics::TBitmap;
maze->LoadFromFile("maze.bmp");
maze->HandleType=bmDIB;
maze->PixelFormat=pf32bit;
map.set(maze,0); // walls are 0x00000000 (black)
// this can be called repetitive without another init
map.compute(x0,y0,x1,y1); // map.px[map.ps],map.py[map.ps] holds the path
map.get(maze); // this is just for drawing the result map back to bitmap for viewing
for more info about A* see Backtracking in A star
TL;DR: Include in your node list (graph) only the patches (or agents) that are important!
One way to speed things up is to not search over every grid space. A* is a graph search, but seems like most coders just dump every point in the grid into the graph. That's not required. Using a sparse search graph, rather than searching every point on the screen, can speed things up.
Even in a complex maze, you can speed up by only including corners and junctions in the graph. Don't add hallway grids to the open list--seek ahead to find the next corner or junction. This is where pre-processing the screen/grid/map to construct the search graph can save time later.
As you can see in this image from my (rather inefficient) A* model on turtlezero.com, a naive approach creates a lot of extra steps. Any open nodes created in a long straight corridor are wasted:
By eliminating these steps from the graph, the solution could be found hundreds of times faster.
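As a rough illustration of that pre-processing step (a Python sketch added here rather than NetLogo; the 0/1 grid encoding is hypothetical), keep only the open cells that are not plain corridor cells:
def extract_waypoints(grid):
    """Keep only open cells that are junctions, corners, or dead ends.

    grid is a list of rows; 0 = open, 1 = wall (hypothetical encoding).
    Straight corridor cells (exactly two opposite open neighbours) are skipped.
    """
    h, w = len(grid), len(grid[0])
    waypoints = []
    for y in range(h):
        for x in range(w):
            if grid[y][x] == 1:
                continue
            open_dirs = [(dx, dy)
                         for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                         if 0 <= x + dx < w and 0 <= y + dy < h
                         and grid[y + dy][x + dx] == 0]
            # A straight corridor has exactly two collinear open neighbours.
            is_corridor = (len(open_dirs) == 2 and
                           open_dirs[0][0] == -open_dirs[1][0] and
                           open_dirs[0][1] == -open_dirs[1][1])
            if not is_corridor:
                waypoints.append((x, y))
    return waypoints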
Another sparse graph technique is to use a graph that is gradually less dense the further from the walker. That is, make your search space detailed near the walker, and sparse (fewer nodes, less accurate regarding obstacles) away from the walker. This is especially useful where the walker is moving through detailed terrain on a map that is changing or towards a target that is moving and the route has to be recalculated anyway.
For example, in a traffic simulation where roads may become gridlocked, or accidents occur. Likewise, a simulation where one agent is pursuing another agent on a changing landscape. In these cases, only the next few steps need to be exactly plotted. The general route to the destination can be approximate.
One simple way to implement this is to gradually increase the step size of the walker as the path becomes longer. Disregard obstacles or do a quick line-intersection or tangent test. This gives the walker a general idea of where to go.
An improved path can be recalculated with each step, or periodically, or when an obstacle is encountered.
It may only be milliseconds saved, but milliseconds wasted on the soon-to-change end of the path could be better used providing brains for more walkers, or better graphics, or more time with your family.
For an example of a sparse graph of varying density, see chapter 8 of Advanced Java Programming By David Wallace Croft from APress: http://www.apress.com/game-programming/java/9781590591239
He uses a circular graph of increasing sparseness in a demo tank game with an a* algorithm driving the enemy tanks.
Another sparse graph approach is to populate the graph with only way-points of interest. For example, to plot a route across a simple campus of buildings, only entrances, exits, and corners are important. Points along the side of a building or in the open space between are not important, and can be omitted from the search graph. A more detailed map might need more way-points--such as a circle of nodes around a fountain or statue, or where paved paths intersect.
Here's a diagram showing the paths between waypoints.
This was generated by the campus-buildings-path-graph model by me on turtlezero.com: http://www.turtlezero.com/models/view.php?model=campus-buildings-path-graph
It uses simple netlogo patch queries to find points of interest, like outside and inside corners. I'm sure a somewhat more sophisticated set of queries could deal with things like diagonal walls. But even without such fancy further optimization, the A* search space would be reduced by orders of magnitude.
Unfortunately, since now Java 1.7 won't allow unsigned applets, you can't run the model in the webpage without tweaking your java security settings. Sorry about that. But read the description.
A* combines two heuristics: Dijkstra's algorithm and greedy search. Dijkstra's algorithm searches for the shortest path from the start; greedy search heads for the goal as directly as possible. Dijkstra's algorithm is extraordinarily slow because it doesn't take risks. Multiply the effect of the greedy search to take more risks.
For example, if A* = Dijkstra + Greedy, then a faster A* = Dijkstra + 1.1 * Greedy. No matter how much you optimize your memory access or your code, it will not fix a bad approach to solving the problem. Make your A* more greedy and it will focus on finding a solution, rather than a perfect solution.
NOTE:
Greedy Search = distance from end
Dijkstra's Algorithm = distance from start
In standard A*, it will seek perfect solutions until reaching an obstacle. This video shows the different search heuristics in action; notice how fast a greedy search can be (skip to 2:22 for A*, 4:40 for Greedy). I myself had a similar issue when I first began with A*, and the modified A* I outline above improved my performance exponentially. Moral of the story: use the right tool for the job.
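A minimal sketch of that weighting idea (plain-Python grid A* with a Manhattan heuristic; the 1.1 weight echoes the example above and is not taken from the original model):
import heapq

def weighted_astar(grid, start, goal, weight=1.1):
    """A* on a 0/1 grid (1 = wall); weight > 1 makes the search greedier."""
    def h(p):  # Manhattan distance to the goal
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(weight * h(start), 0, start)]
    g = {start: 0}
    parent = {start: None}
    while open_heap:
        _, cost, node = heapq.heappop(open_heap)
        if node == goal:
            path = []
            while node is not None:       # walk parents back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]) and grid[ny][nx] == 0:
                new_cost = cost + 1
                if new_cost < g.get((nx, ny), float('inf')):
                    g[(nx, ny)] = new_cost
                    parent[(nx, ny)] = node
                    heapq.heappush(open_heap,
                                   (new_cost + weight * h((nx, ny)), new_cost, (nx, ny)))
    return None  # no path found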
If you plan to reuse the same map multiple times, some form of pre-processing is usually optimal. Effectively, you work out the shortest distances between some common points and add them to the graph as edges; this will typically help A* find a solution more quickly, although it's more difficult to implement.
E.g. you might do this for all motorway routes in a map of the UK, so the search algorithm only has to find a route to a motorway, and from the motorway junctions to its destination.
I can't tell what the actual cause of the observed slowness might be. Maybe it's just due to efficiency shortcomings imposed by the programming language at hand. How did you measure your performance? How can we reproduce it?
Besides that, the heuristic (distance metric) being used has a large influence on the amount of exploration that is done in order to find the optimal path and thus also influences the perceived efficiency of the algorithm.
In theory you have to use an admissible heuristic, that is, one that never overestimates the remaining distance.
In practice, depending on the complexity of the maze, a conservative choice for a 2D grid maze, like Manhattan distance, might significantly underestimate the remaining distance. Therefore a lot of exploration is done in areas of the maze far away from the goal. This leads to a degree of exploration that resembles an exhaustive search (e.g., breadth-first search) much more than what one would expect from an informed search algorithm.
This might be something to look into.
Also have a look at my related answer here:
https://stackoverflow.com/a/16656993/1025391
There I have compared different heuristics used with the basic A-Star algorithm and visualized the results. You might find it interesting.
It's the first time I've gone this big with Python, so I need some help.
I have a mongodb (or python dict) with the following structure:
{
    "_id": { "$oid" : "521b1fabc36b440cbe3a6009" },
    "country": "Brazil",
    "id": "96371952",
    "latitude": -23.815124482000001649,
    "longitude": -45.532670811999999216,
    "name": "coffee",
    "users": [
        {
            "id": 277659258,
            "photos": [
                {
                    "created_time": 1376857433,
                    "photo_id": "525440696606428630_277659258"
                },
                {
                    "created_time": 1377483144,
                    "photo_id": "530689541585769912_10733844"
                }
            ],
            "username": "foo"
        },
        {
            "id": 232745390,
            "photos": [
                {
                    "created_time": 1369422344,
                    "photo_id": "463070647967686017_232745390"
                }
            ],
            "username": "bar"
        }
    ]
}
Now I want to create two files: one with the summaries and the other with the weight of each connection. My loop, which works for small datasets, is the following:
#a is the dataset
data = db.collection.find()
a = [i for i in data]

#here go the connections between the locations
edges = csv.writer(open("edges.csv", "wb"))
#and here the location data
nodes = csv.writer(open("nodes.csv", "wb"))

for i in a:
    #find the users that match
    for q in a:
        if i['_id'] <> q['_id'] and q.get('users'):
            weight = 0
            for user_i in i['users']:
                for user_q in q['users']:
                    if user_i['id'] == user_q['id']:
                        weight += 1
            if weight > 0:
                edges.writerow([i['id'], q['id'], weight])
    #find the number of photos
    photos_number = 0
    for p in i['users']:
        photos_number += len(p['photos'])
    nodes.writerow([i['id'],
                    i['name'],
                    i['latitude'],
                    i['longitude'],
                    len(i['users']),
                    photos_number
                    ])
The scaling problem: I have 20,000 locations, each location might have up to 2,000 users, and each user might have around 10 photos.
Is there any more efficient way to create the above loops? Maybe multithreading, a JIT, more indexes?
Because if I run the above in a single thread, it can be up to 20000^2 * 2000 * 10 results...
So how can I handle more efficiently the above problem?
Thanks
@YuchenXie and @PaulMcGuire's suggested micro-optimizations probably aren't your main problem, which is that you're looping over 20,000 x 20,000 = 400,000,000 pairs of entries, and then have an inner loop of 2,000 x 2,000 user pairs. That's going to be slow.
Luckily, the inner loop can be made much faster by pre-caching sets of the user ids in i['users'], and replacing your inner loop with a simple set intersection. That changes an O(num_users^2) operation that's happening in the Python interpreter to an O(num_users) operation happening in C, which should help. (I just timed it with lists of integers of size 2,000; on my computer, it went from 156ms the way you're doing it to 41µs this way, for a 4,000x speedup.)
You can also cut half the work out of the main loop over pairs of locations by noticing that the relationship is symmetric: there's no point in doing both i = a[1], q = a[2] and i = a[2], q = a[1].
Taking these and @PaulMcGuire's suggestions into account, along with some other stylistic changes, your code becomes (caveat: untested code ahead):
from itertools import combinations, izip

data = db.collection.find()
a = list(data)
user_ids = [{user['id'] for user in i['users']} if 'users' in i else set()
            for i in a]

with open("edges.csv", "wb") as f:
    edges = csv.writer(f)
    for (i, i_ids), (q, q_ids) in combinations(izip(a, user_ids), 2):
        weight = len(i_ids & q_ids)
        if weight > 0:
            edges.writerow([i['id'], q['id'], weight])
            edges.writerow([q['id'], i['id'], weight])

with open("nodes.csv", "wb") as f:
    nodes = csv.writer(f)
    for i in a:
        nodes.writerow([
            i['id'],
            i['name'],
            i['latitude'],
            i['longitude'],
            len(i['users']),
            sum(len(p['photos']) for p in i['users']),  # total number of photos
        ])
Hopefully this should be enough of a speedup. If not, it's possible that @YuchenXie's suggestion will help, though I'm doubtful because the stdlib/OS is fairly good at buffering that kind of thing. (You might play with the buffering settings on the file objects.)
Otherwise, it may come down to trying to get the core loops out of Python (in Cython or handwritten C), or giving PyPy a shot. I'm doubtful that'll get you any huge speedups now, though.
You may also be able to push the hard weight calculations into Mongo, which might be smarter about that; I've never really used it so I don't know.
The bottleneck is disk I/O.
It should be much faster if you merge the results and use one or a few writerows calls instead of many writerow calls.
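For example, a small sketch of that idea (computed_edges here is a placeholder for the (i, q, weight) triples produced by the loop above):
edge_rows = []
for i_id, q_id, weight in computed_edges:   # placeholder iterable of precomputed results
    edge_rows.append([i_id, q_id, weight])
edges.writerows(edge_rows)                  # one call instead of many writerow() calls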
Does collapsing this loop:
photos_number = 0
for p in i['users']:
    photos_number += len(p['photos'])
down to:
photos_number = sum(len(p['photos']) for p in i['users'])
help at all?
Your weight computation:
weight = 0
for user_i in i['users']:
    for user_q in q['users']:
        if user_i['id'] == user_q['id']:
            weight += 1
should also be collapsible down to:
from itertools import product

weight = sum(user_i['id'] == user_q['id']
             for user_i, user_q in product(i['users'], q['users']))
Since True equates to 1, summing all the boolean conditions is the same as counting all the values that are True.