Optimize loops with big datasets in Python

This is the first time I've gone this big with Python, so I need some help.
I have a MongoDB collection (or a list of Python dicts) with the following structure:
{
    "_id": { "$oid": "521b1fabc36b440cbe3a6009" },
    "country": "Brazil",
    "id": "96371952",
    "latitude": -23.815124482000001649,
    "longitude": -45.532670811999999216,
    "name": "coffee",
    "users": [
        {
            "id": 277659258,
            "photos": [
                {
                    "created_time": 1376857433,
                    "photo_id": "525440696606428630_277659258"
                },
                {
                    "created_time": 1377483144,
                    "photo_id": "530689541585769912_10733844"
                }
            ],
            "username": "foo"
        },
        {
            "id": 232745390,
            "photos": [
                {
                    "created_time": 1369422344,
                    "photo_id": "463070647967686017_232745390"
                }
            ],
            "username": "bar"
        }
    ]
}
Now I want to create two files: one with the location summaries and the other with the weight of each connection. My loop, which works for small datasets, is the following:
import csv

# a is the dataset
data = db.collection.find()
a = [i for i in data]

# here go the connections between the locations
edges = csv.writer(open("edges.csv", "wb"))
# and here the location data
nodes = csv.writer(open("nodes.csv", "wb"))

for i in a:
    # find the users that match
    for q in a:
        if i['_id'] != q['_id'] and q.get('users'):
            weight = 0
            for user_i in i['users']:
                for user_q in q['users']:
                    if user_i['id'] == user_q['id']:
                        weight += 1
            if weight > 0:
                edges.writerow([i['id'], q['id'], weight])

    # find the number of photos
    photos_number = 0
    for p in i['users']:
        photos_number += len(p['photos'])

    nodes.writerow([i['id'],
                    i['name'],
                    i['latitude'],
                    i['longitude'],
                    len(i['users']),
                    photos_number
                    ])
The scaling problem: I have 20,000 locations, each location might have up to 2,000 users, and each user might have around 10 photos.
Is there a more efficient way to build the above loops? Maybe multithreading, a JIT, more indexes?
Because if I run the above in a single thread it can be up to 20000^2 * 2000 * 10 iterations...
So how can I handle the above problem more efficiently?
Thanks

@YuchenXie's and @PaulMcGuire's suggested micro-optimizations probably aren't your main problem, which is that you're looping over 20,000 x 20,000 = 400,000,000 pairs of entries, and then have an inner loop of 2,000 x 2,000 user pairs. That's going to be slow.
Luckily, the inner loop can be made much faster by pre-caching sets of the user ids in i['users'], and replacing your inner loop with a simple set intersection. That changes an O(num_users^2) operation that's happening in the Python interpreter to an O(num_users) operation happening in C, which should help. (I just timed it with lists of integers of size 2,000; on my computer, it went from 156ms the way you're doing it to 41µs this way, for a 4,000x speedup.)
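If you want to reproduce that comparison yourself, here is a rough, untested sketch using synthetic integer ids standing in for real user ids (the exact numbers will of course differ on your machine):
import random
import timeit

ids_i = [random.randrange(10**6) for _ in range(2000)]
ids_q = [random.randrange(10**6) for _ in range(2000)]

def nested_count():
    # O(n^2) equality tests running in the interpreter
    weight = 0
    for x in ids_i:
        for y in ids_q:
            if x == y:
                weight += 1
    return weight

q_set = set(ids_q)

def set_count():
    # set intersection runs in C, roughly O(n)
    return len(set(ids_i) & q_set)

print(timeit.timeit(nested_count, number=1))
print(timeit.timeit(set_count, number=100) / 100)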
You can also cut off half your work of the main loop over pairs of locations by noticing that the relationship is symmetric, so there's no point in doing both i = a[1], q = a[2] and i = a[2], q = a[1].
Taking these and @PaulMcGuire's suggestions into account, along with some other stylistic changes, your code becomes (caveat: untested code ahead):
import csv
from itertools import combinations, izip

data = db.collection.find()
a = list(data)
user_ids = [{user['id'] for user in i['users']} if 'users' in i else set()
            for i in a]

with open("edges.csv", "wb") as f:
    edges = csv.writer(f)
    for (i, i_ids), (q, q_ids) in combinations(izip(a, user_ids), 2):
        weight = len(i_ids & q_ids)
        if weight > 0:
            edges.writerow([i['id'], q['id'], weight])
            edges.writerow([q['id'], i['id'], weight])

with open("nodes.csv", "wb") as f:
    nodes = csv.writer(f)
    for i in a:
        nodes.writerow([
            i['id'],
            i['name'],
            i['latitude'],
            i['longitude'],
            len(i['users']),
            sum(len(p['photos']) for p in i['users']),  # total number of photos
        ])
Hopefully this is enough of a speedup. If not, it's possible that @YuchenXie's suggestion will help, though I'm doubtful because the stdlib/OS is fairly good at buffering that kind of thing. (You might play with the buffering settings on the file objects.)
Otherwise, it may come down to trying to get the core loops out of Python (in Cython or handwritten C), or giving PyPy a shot. I'm doubtful that'll get you any huge speedups now, though.
You may also be able to push the hard weight calculations into Mongo, which might be smarter about that; I've never really used it so I don't know.

The bottleneck is disk I/O.
It should be much faster if you collect the results and make one or a few writerows calls instead of many individual writerow calls.
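For example, a rough, untested sketch reusing the a list and field names from the question:
import csv

rows = []
for i in a:
    for q in a:
        if i['_id'] != q['_id'] and q.get('users'):
            weight = sum(user_i['id'] == user_q['id']
                         for user_i in i['users']
                         for user_q in q['users'])
            if weight > 0:
                rows.append([i['id'], q['id'], weight])

with open("edges.csv", "wb") as f:
    csv.writer(f).writerows(rows)  # one bulk write instead of many small writerow calls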

Does collapsing this loop:
photos_number = 0
for p in i['users']:
    photos_number += len(p['photos'])
down to:
photos_number = sum(len(p['photos']) for p in i['users'])
help at all?
Your weight computation:
weight = 0
for user_i in i['users']:
    for user_q in q['users']:
        if user_i['id'] == user_q['id']:
            weight += 1
should also be collapsible down to:
from itertools import product

weight = sum(user_i['id'] == user_q['id']
             for user_i, user_q in product(i['users'], q['users']))
Since True equates to 1, summing all the boolean conditions is the same as counting all the values that are True.
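A tiny toy example of that counting idiom (made-up ids, just to show the principle):
from itertools import product

ids_i = [1, 2, 3, 4]
ids_q = [3, 4, 5]

# True counts as 1 and False as 0, so sum() counts the matches
matches = sum(x == y for x, y in product(ids_i, ids_q))
print(matches)  # 2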

Related

Quickly mapping/modifying values in a large list of Python dicts?

I have some code that I'm trying to speed up. Maybe what I've got is right, but whenever I ask on Stack Overflow somebody usually knows a clever little trick ("Use map!", "try this lambda", "import itertools"), and I'm hoping somebody can help here. This is the section of code I'm concerned with:
# slowest part from here....
for row_dict in json_data:
    row_dict_clean = {}
    for key, value in row_dict.items():
        value_clean = get_cleantext(value)
        row_dict_clean[key] = value_clean
    json_data_clean.append(row_dict_clean)
    total += 1
# to here...
The concept is pretty simple. I have a list, millions of entries long, that contains dictionaries, and I need to run each value through a little cleaner. Then I end up with a nice list of cleaned dictionaries. Is there any clever iteration tool that I'm not aware of that I should be using? Here is a more complete minimal example to play with:
def get_json_data_clean(json_data):
    json_data_clean = []
    total = 0
    # slowest part from here....
    for row_dict in json_data:
        row_dict_clean = {}
        for key, value in row_dict.items():
            value_clean = get_cleantext(value)
            row_dict_clean[key] = value_clean
        json_data_clean.append(row_dict_clean)
        total += 1
    # to here...
    return json_data_clean

def get_cleantext(value):
    # do complex cleaning stuff on the string; I can't change what this does
    value = value.replace("bad", "good")
    return value
json_data = [
    {"key1": "some bad",
     "key2": "bad things",
     "key3": "extra bad"},
    {"key1": "more bad stuff",
     "key2": "wow, so much bad",
     "key3": "who dis?"},
    # a few million more dictionaries
    {"key1": "so much bad stuff",
     "key2": "the bad",
     "key3": "the more bad"},
]
json_data_clean = get_json_data_clean(json_data)
print(json_data_clean)
Any time I have nested for loops, a little bell rings in my head telling me there is probably a better way to do this. Any help is appreciated!
You should definitely ask the clever folks at https://codereview.stackexchange.com/, but as a quick fix it appears you can just map() your transformation function over the list of dictionaries, as below:
def clean_text(value: str) -> str:
    # ...
    return value.replace("bad", "good")

def clean_dict(d: dict):
    return {k: clean_text(v) for k, v in d.items()}

json_data = [
    {"key1": "some bad",
     "key2": "bad things",
     "key3": "extra bad"},
    {"key1": "more bad stuff",
     "key2": "wow, so much bad",
     "key3": "who dis?"},
    # a few million more dictionaries
    {"key1": "so much bad stuff",
     "key2": "the bad",
     "key3": "the more bad"},
]

x = list(map(clean_dict, json_data))
One thing that gets left out is your total counter, but it never seems to leave get_json_data_clean() anyway.
Not sure why @Daniel Gale proposed filter(), as you are not throwing any values away, just transforming them.
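If you prefer comprehensions over map(), an equivalent sketch (performance should be about the same; the real cost is inside get_cleantext itself):
# the same transformation as a list comprehension over dict comprehensions
json_data_clean = [
    {key: clean_text(value) for key, value in row_dict.items()}
    for row_dict in json_data
]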

Returning the minimum value in a JSON array

Good evening folks! I have been wracking my brain on this one for a good few hours now and could do with a little bit of a pointer in the right direction. I'm playing around with some API calls and trying to make a little project for myself.
The JSON data is stored in arrays, and to get the information I want (it is from a transport API) I have been doing the following:
x = apirequest
x = x.json
for i in range(0, 4):
    print(x['routes'][i]['duration'])
    print(x['routes'][i]['departure_time'])
    print(x['routes'][i]['arrival_time'])
This will return the following
06:58:00
23:39
06:37
05:08:00
05:14
10:22
03:41:00
05:30
09:11
03:47:00
06:24
10:11
What I am trying to do is return only the shortest journey. I could do it if it were a single-layer JSON string, but I am not too familiar with multi-level arrays; I can't return ['duration'] without going through ['routes'] and a route index (in this case 0 through 3).
I can iterate through them with an if statement easily enough, but there must be a way to accomplish it directly through the JSON that I am missing. I also thought about adding the results to a separate array and then filtering that, but there are a few other fields I want to grab from the data once I've cracked this part.
What I am finding as I learn is that I tend to do things the long-winded way, often discovering that my 10-15 line solutions on Codewars were meant to be done in 2-3 lines.
Example JSON data
{
    "request_time": "2018-05-29T19:03:04+01:00",
    "source": "Traveline southeast journey planning API",
    "acknowledgements": "Traveline southeast",
    "routes": [{
        "duration": "06:58:00",
        "route_parts": [{
            "mode": "foot",
            "from_point_name": "Corunna Court, Wrexham",
            "to_point_name": "Wrexham General Rail Station",
            "destination": "",
            "line_name": "",
            "duration": "00:36:00",
            "departure_time": "23:39",
            "arrival_time": "00:15"
        }]
    }]
}
Hope you can help steer me in the right direction!
Here's one solution using datetime.timedelta. Data from @fferri.
from datetime import timedelta

x = {'routes': [{'duration': '06:58:00', 'departure_time': '23:39', 'arrival_time': '06:37'},
                {'duration': '05:08:00', 'departure_time': '05:14', 'arrival_time': '10:22'},
                {'duration': '03:41:00', 'departure_time': '05:30', 'arrival_time': '09:11'},
                {'duration': '03:47:00', 'departure_time': '06:24', 'arrival_time': '10:11'}]}

def minimum_time(k):
    h, m, s = map(int, x['routes'][k]['duration'].split(':'))
    return timedelta(hours=h, minutes=m, seconds=s)

res = min(range(4), key=minimum_time)  # 2
You can then access the appropriate sub-dictionary via x['routes'][res].
Using min() with a key argument to indicate which field should be used for finding the minimum value:
x = {'routes': [
    {'duration': '06:58:00', 'departure_time': '23:39', 'arrival_time': '06:37'},
    {'duration': '05:08:00', 'departure_time': '05:14', 'arrival_time': '10:22'},
    {'duration': '03:41:00', 'departure_time': '05:30', 'arrival_time': '09:11'},
    {'duration': '03:47:00', 'departure_time': '06:24', 'arrival_time': '10:11'}
]}
best = min(x['routes'], key=lambda d: d['duration'])
# best = {'duration': '03:41:00', 'departure_time': '05:30', 'arrival_time': '09:11'}
The min(iterable, key=...) function is what you are looking for:
x = {'routes': [{'dur': 3, 'depart': 1, 'arrive': 4},
                {'dur': 2, 'depart': 2, 'arrive': 4}]}
min(x['routes'], key=lambda item: item['dur'])
Returns:
{'dur': 2, 'depart': 2, 'arrive': 4}
First, the fact that x is initialized from JSON isn't particularly relevant. It's a dict, and that's all that is important.
To answer your question, you just need the key argument to min:
shortest = min(x['routes'], key=lambda d: d['duration'])
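If the durations weren't guaranteed to be zero-padded HH:MM:SS strings, you could parse them before comparing; a minimal sketch along the same lines, assuming the x structure above:
shortest = min(x['routes'],
               key=lambda d: tuple(int(part) for part in d['duration'].split(':')))
print(shortest['duration'], shortest['departure_time'], shortest['arrival_time'])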

For Loop Only Iterating Once

I'm using the pyvmomi module to export data from our vCenter. I'm very close to getting the desired output, but my script is only iterating through once. Why?
If I print(d) in the for loop before updating the dictionary, it will print all of the data.
Script Summary:
top_dict = {"data": []}

def get_obj(content, vimtype, name=None):
    return [item for item in content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], recursive=True).view]

## MAIN ##
...
content = si.RetrieveContent()
d = {}
idnum = 0
for dc in get_obj(content, vim.Datacenter):
    for cluster in get_obj(content, vim.ClusterComputeResource):
        for h in cluster.host:
            host_wwpn1 = ''
            for adaptr in h.config.storageDevice.hostBusAdapter:
                if hasattr(adaptr, 'portWorldWideName'):
                    if host_wwpn1 == '':
                        host_wwpn1 = (hex(adaptr.portWorldWideName)[2:18])
                    else:
                        host_wwpn2 = (hex(adaptr.portWorldWideName)[2:18])
            d['id'] = idnum
            d['datacenter'] = dc.name
            d['cluster'] = cluster.name
            d['host'] = h.name
            d['pwwn_F1'] = host_wwpn1
            d['pwwn_F2'] = host_wwpn2
            idnum = idnum + 1
top_dict.update({"data": [d]})
Current Output:
{
    "data": [
        {
            "id": 0,
            "datacenter": "MY_DATACENTER",
            "cluster": "MY_CLUSTER",
            "host": "MY_HOSTNAME",
            "pwwn_F1": "XXXXXXXXXXXXXXXX",
            "pwwn_F2": "XXXXXXXXXXXXXXXX"
        }
    ]
}
I'm pretty sure your issue is on the very last line of the code you've shown. That line replaces the entire contents of top_dict with new values. I'm pretty sure you want to be adding your new dictionary d to the list that's inside top_dict.
Instead, I think you want to be doing top_dict["data"].append(d). You will also need to move the initialization of d to the same level as the append call (so probably between the first two loops, if you keep the last line indented as it is now).
I'm not sure whether that last line is indented the correct amount (since I don't actually know what your code is supposed to do). Currently, you might set the values in d several times before using them. You may want to indent the last line to the same level as the lines setting values in d (the initialization of d should then be at that level too).
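Putting that together, a rough, untested sketch of the restructured loop, keeping the names from your code (adjust the placement of the append to whichever level you actually intend):
top_dict = {"data": []}
idnum = 0
for dc in get_obj(content, vim.Datacenter):
    for cluster in get_obj(content, vim.ClusterComputeResource):
        for h in cluster.host:
            host_wwpn1 = host_wwpn2 = ''
            for adaptr in h.config.storageDevice.hostBusAdapter:
                if hasattr(adaptr, 'portWorldWideName'):
                    if host_wwpn1 == '':
                        host_wwpn1 = hex(adaptr.portWorldWideName)[2:18]
                    else:
                        host_wwpn2 = hex(adaptr.portWorldWideName)[2:18]
            # build a fresh dict per host and append it, instead of reusing one d
            top_dict["data"].append({
                'id': idnum,
                'datacenter': dc.name,
                'cluster': cluster.name,
                'host': h.name,
                'pwwn_F1': host_wwpn1,
                'pwwn_F2': host_wwpn2,
            })
            idnum += 1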

Python vs perl sort performance

Solution
This solved all the issues with my Perl code (plus some extra implementation code.... :-) ). In conclusion, both Perl and Python are equally awesome.
use WWW::Curl::Easy;
Thanks to ALL who responded, very much appreciated.
Edit
It appears that the Perl code I am using is spending the majority of its time performing the HTTP GET, for example:
my $start_time = gettimeofday;

$request = HTTP::Request->new('GET', 'http://localhost:8080/data.json');
$response = $ua->request($request);
$page = $response->content;

my $end_time = gettimeofday;
print "Time taken @{[ $end_time - $start_time ]} seconds.\n";
The result is:
Time taken 74.2324419021606 seconds.
My python code in comparison:
start = time.time()
r = requests.get('http://localhost:8080/data.json', timeout=120, stream=False)
maxsize = 100000000
content = ''
for chunk in r.iter_content(2048):
content += chunk
if len(content) > maxsize:
r.close()
raise ValueError('Response too large')
end = time.time()
timetaken = end-start
print timetaken
The result is:
20.3471381664
In both cases the sort times are sub-second. So first of all I apologise for the misleading question; it is another lesson for me to never, ever make assumptions.... :-)
I'm not sure what the best thing to do with this question is now. Perhaps someone can propose a better way of performing the request in Perl?
End of edit
This is just a quick question regarding sort performance differences in Perl vs Python. This is not a question about which language is better/faster, etc. For the record, I first wrote this in Perl, noticed the time the sort was taking, and then tried to write the same thing in Python to see how fast it would be. I simply want to know: how can I make the Perl code perform as fast as the Python code?
Let's say we have the following JSON:
{
    "3434343424335": {
        "key1": 2322,
        "key2": 88232,
        "key3": 83844,
        "key4": 444454,
        "key5": 34343543,
        "key6": 2323232
    },
    "78237236343434": {
        "key1": 23676722,
        "key2": 856568232,
        "key3": 838723244,
        "key4": 4434544454,
        "key5": 3432323543,
        "key6": 2323232
    }
}
Let's say we have around 30k-40k of these records and we want to sort them by one of the sub-keys, then build a new array of records ordered by that sub-key.
Perl - Takes around 27 seconds
my @list;
$decoded = decode_json($page);
foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
    push(@list, {"key" => $id, "key1" => $decoded->{$id}{key1}, ...etc });
}
Python - Takes around 6 seconds
list = []
data = json.loads(content)
data2 = sorted(data, key=lambda x: data[x]['key5'], reverse=True)
for key in data2:
    tmp = {'id': key, 'key1': data[key]['key1'], etc.....}
    list.append(tmp)
For the Perl code, I have tried the following tweaks:
use sort '_quicksort'; # use a quicksort algorithm
use sort '_mergesort'; # use a mergesort algorithm
Your benchmark is flawed, you're benchmarking multiple variables, not one. It is not just sorting data, but it is also doing JSON decoding, and creating strings, and appending to an array. You can't know how much time is spent sorting and how much is spent doing everything else.
The matter is made worse in that there are several different JSON implementations in Perl each with their own different performance characteristics. Change the underlying JSON library and the benchmark will change again.
If you want to benchmark sort, you'll have to change your benchmark code to eliminate the cost of loading your test data from the benchmark, JSON or not.
Perl and Python have their own internal benchmarking libraries that can benchmark individual functions, but their instrumentation can make them perform far less well than they would in the real world. The performance drag from each benchmarking implementation will be different and might introduce a false bias. These benchmarking libraries are more useful for comparing two functions in the same program. For comparing between languages, keep it simple.
Simplest thing to do to get an accurate benchmark is to time them within the program using the wall clock.
# The current time to the microsecond.
use Time::HiRes qw(gettimeofday);

my @list;
my $decoded = decode_json($page);

my $start_time = gettimeofday;
foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
    push(@list, {"key" => $id, "key1" => $decoded->{$id}{key1}, ...etc });
}
my $end_time = gettimeofday;

print "sort and append took @{[ $end_time - $start_time ]} seconds\n";
(I leave the Python version as an exercise)
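(For reference, a rough, untested Python equivalent, reusing json and content from the question's code, might look like this:)
import time

data = json.loads(content)

start_time = time.time()
result = []
for key in sorted(data, key=lambda x: data[x]['key5'], reverse=True):
    result.append(dict(data[key], id=key))  # copy the record and add its key
end_time = time.time()

print("sort and append took %s seconds" % (end_time - start_time))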
From here you can improve your technique. You can use CPU seconds instead of wall-clock time. The array append and the cost of creating the string are still included in the benchmark; they can be eliminated so you're benchmarking just the sort. And so on.
Additionally, you can use a profiler to find out where your programs are spending their time. These have the same raw performance caveats as benchmarking libraries, the results are only useful to find out what percentage of its time a program is using where, but it will prove useful to quickly see if your benchmark has unexpected drag.
The important thing is to benchmark what you think you're benchmarking.
Something else is at play here; I can run your sort in half a second. Improving that is not going to depend on sorting algorithm so much as reducing the amount of code run per comparison; a Schwartzian Transform gets it to a third of a second, a Guttman-Rosler Transform gets it down to a quarter of a second:
#!/usr/bin/perl
use 5.014;
use warnings;

my $decoded = { map( (int rand 1e9, { map( ("key$_", int rand 1e9), 1..6 ) } ), 1..40000 ) };

use Benchmark 'timethese';

timethese( -5, {
    'original' => sub {
        my @list;
        foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
            push(@list, {"key" => $id, %{$decoded->{$id}}});
        }
    },
    'st' => sub {
        my @list;
        foreach my $id (
            map $_->[1],
            sort { $b->[0] <=> $a->[0] }
            map [ $decoded->{$_}{key5}, $_ ],
            keys %{$decoded}
        ) {
            push(@list, {"key" => $id, %{$decoded->{$id}}});
        }
    },
    'grt' => sub {
        my $maxkeylen = 15;
        my @list;
        foreach my $id (
            map substr($_, $maxkeylen),
            sort { $b cmp $a }
            map sprintf('%0*s', $maxkeylen, $decoded->{$_}{key5}) . $_,
            keys %{$decoded}
        ) {
            push(@list, {"key" => $id, %{$decoded->{$id}}});
        }
    },
});
Don't create a new hash for each record. Just add the key to the existing one.
$decoded->{$_}{key} = $_
    for keys(%$decoded);

my @list = sort { $b->{key5} <=> $a->{key5} } values(%$decoded);
Using Sort::Key will make it even faster.
use Sort::Key qw( rukeysort );

$decoded->{$_}{key} = $_
    for keys(%$decoded);

my @list = rukeysort { $_->{key5} } values(%$decoded);

Does spark utilize the sorted order of hbase keys, when using hbase as data source

I store time-series data in HBase. The rowkey is composed of user_id and timestamp, like this:
{
    "userid1-1428364800": {
        "columnFamily1": {
            "val": "1"
        }
    },
    "userid1-1428364803": {
        "columnFamily1": {
            "val": "2"
        }
    },
    "userid2-1428364812": {
        "columnFamily1": {
            "val": "abc"
        }
    }
}
Now I need to perform per-user analysis. Here is the initialization of hbase_rdd (from here)
sc = SparkContext(appName="HBaseInputFormat")

conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)
The natural mapreduce-like way to process would be:
(hbase_rdd
 .map(lambda row: (row[0].split('-')[0], (row[0].split('-')[1], row[1])))  # shift timestamp from key to value
 .groupByKey()
 .map(processUserData))  # process each user's data
While executing the first map (shifting the timestamp from the key into the value), it is crucial to know when the time-series data of the current user is finished, so that the groupByKey transformation can start. Then we would not need to map over the whole table and store all the temporary data. This should be possible because HBase stores row keys in sorted order.
With Hadoop Streaming it could be done this way:
import sys

current_user_data = []
last_userid = None

for line in sys.stdin:
    k, v = line.split('\t')
    userid, timestamp = k.split('-')
    if userid != last_userid and current_user_data:
        print processUserData(last_userid, current_user_data)
        last_userid = userid
        current_user_data = [(timestamp, v)]
    else:
        current_user_data.append((timestamp, v))
The question is: how to utilize the sorted order of hbase keys within Spark?
I'm not super familiar with the guarantees you get with the way you're pulling data from HBase, but if I understand correctly, I can answer with just plain old Spark.
You've got some RDD[X]. As far as Spark knows, the Xs in that RDD are completely unordered. But you have some outside knowledge, and you can guarantee that the data is in fact grouped by some field of X (and perhaps even sorted by another field).
In that case, you can use mapPartitions to do virtually the same thing you did with Hadoop Streaming. That lets you iterate over all the records in one partition, so you can look for blocks of records with the same key.
val myRDD: RDD[X] = ...
val groupedData: RDD[Seq[X]] = myRDD.mapPartitions { itr =>
  var currentUserData = new scala.collection.mutable.ArrayBuffer[X]()
  var currentUser: X = null
  // itr is an iterator over *all* the records in one partition
  itr.flatMap { x =>
    if (currentUser != null && x.userId == currentUser.userId) {
      // same user as before -- add the data to our list
      currentUserData += x
      None
    } else {
      // it's a new user -- return all the data for the old user, and make
      // another buffer for the new user
      val userDataGrouped = currentUserData
      currentUserData = new scala.collection.mutable.ArrayBuffer[X]()
      currentUserData += x
      currentUser = x
      Some(userDataGrouped)
    }
  }
}

// now groupedData has all the data for one user grouped together, and we didn't
// need to do an expensive shuffle. Also, the above transformation is lazy, so
// we don't necessarily even store all that data in memory -- we could still
// do more filtering on the fly, e.g.:
val usersWithLotsOfData = groupedData.filter { userData => userData.size > 10 }
I realize you wanted to use Python -- sorry, I figure I'm more likely to get the example correct if I write it in Scala, and I think the type annotations make the meaning clearer, but that is probably a Scala bias ... :). In any case, hopefully you can understand what is going on and translate it. (Don't worry too much about flatMap, Some and None; they're probably unimportant if you understand the idea ...)
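If it helps, here is a rough, untested Python translation of the same idea, reusing hbase_rdd and processUserData from the question. Like the Scala version, it assumes a single user's rows don't straddle a partition boundary:
def group_by_consecutive_user(records):
    # records is an iterator over all (rowkey, value) pairs in one partition;
    # rows arrive sorted by rowkey, so each user's rows are consecutive
    current_user = None
    current_data = []
    for rowkey, value in records:
        userid, timestamp = rowkey.split('-')
        if userid != current_user and current_data:
            yield current_user, current_data
            current_data = []
        current_user = userid
        current_data.append((timestamp, value))
    if current_data:
        yield current_user, current_data  # flush the last user in the partition

grouped = hbase_rdd.mapPartitions(group_by_consecutive_user)
results = grouped.map(lambda kv: processUserData(kv[0], kv[1]))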
