How can I use MongoDB aggregation in this example? - python

I am currently using Python to build many of my results instead of MongoDB itself. I am trying to get my head around Aggregation, but I'm struggling a bit. Here is an example of what I am doing currently which perhaps could be better handled by MongoDB.
I have a collection of programs and a collection of episodes. Each program has a list of episodes (DBRefs) associated with it. (The episodes are stored in their own collection because both programs and episodes are quite complex and deep, so embedding is impractical). Each episode has a duration (float). If I want to find a program's average episode duration, I do this:
episodes = list(db.Episodes.find({'_Program': DBRef('Programs', ObjectId(...))}))
durations = set(e['Duration'] for e in episodes if e['Duration'] > 0)
avg_mins = int(sum(durations) / len(durations) / 60)
This is pretty slow when a program has over 1000 episodes. Is there a way I can do it in MongoDB?
Here is some sample data in Mongo shell format. There are three episodes belonging to the same program. How can I calculate the average episode duration for the program?
> db.Episodes.find(
    {'_Program': DBRef('Programs', ObjectId('4ec634fbf4c4005664000313'))},
    {'_Program': 1, 'Duration': 1}).limit(3)
{
    "_id" : ObjectId("506c15cbf4c4005f9c40f830"),
    "Duration" : 1643.856,
    "_Program" : DBRef("Programs", ObjectId("4ec634fbf4c4005664000313"))
}
{
    "_id" : ObjectId("506c15d3f4c4005f9c40f8cf"),
    "Duration" : 1598.088,
    "_Program" : DBRef("Programs", ObjectId("4ec634fbf4c4005664000313"))
}
{
    "_id" : ObjectId("506c15caf4c4005f9c40f80e"),
    "_Program" : DBRef("Programs", ObjectId("4ec634fbf4c4005664000313")),
    "Duration" : 1667.04
}

I figured it out, and it is ridiculously fast compared to pulling it all into Python.
p = db.Programs.find_one({'Title':'...'})
pipe = [
    {'$match': {'_Program': DBRef('Programs', p['_id']), 'Duration': {'$gt': 0}}},
    {'$group': {'_id': '$_Program', 'AverageDuration': {'$avg': '$Duration'}}}
]
eps = db.Episodes.aggregate(pipeline=pipe)
print eps['result']
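For reference, on PyMongo 3 and later aggregate() returns a cursor rather than a dict with a 'result' key, so the same pipeline is consumed by iterating it. A minimal sketch, reusing the collections and field names above:
# PyMongo 3+ sketch: aggregate() returns a CommandCursor, so iterate the results.
from bson import DBRef

p = db.Programs.find_one({'Title': '...'})
pipe = [
    {'$match': {'_Program': DBRef('Programs', p['_id']), 'Duration': {'$gt': 0}}},
    {'$group': {'_id': '$_Program', 'AverageDuration': {'$avg': '$Duration'}}}
]
for doc in db.Episodes.aggregate(pipe):
    print(doc['AverageDuration'])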

Related

Gym Retro/Stable-Baselines Doesn't Stop Iteration After Done Condition Is Met

I'm trying to use Gym Retro and Stable-Baselines to train a bot to play Super Mario Bros. Everything seems to work, except it appears that the environment doesn't really end/reset when it should. The BK2 files that it records are over 500 kb in size, take around 20 minutes to convert to video, and the video is about 2 hours long. The video itself starts with about three minutes of AI gameplay, but after it loses all three lives it sits on the title screen until the demo starts playing. I'm pretty sure the demo gets picked up by the reward functions, so it interferes with the training. I'm also worried it's massively slowing down training since it has to sit through 2 hours of extra "gameplay". Here's what my scenario file looks like:
{
    "done": {
        "condition": "any",
        "variables": {
            "lives": {
                "op": "equal",
                "reference": -1
            },
            "time": {
                "op": "equal",
                "reference": 0
            }
        }
    },
    "reward": {
        "variables": {
            "xscrollHi": {
                "reward": 10
            },
            "playerx": {
                "reward": 0.1
            },
            "coins": {
                "reward": 10
            }
        }
    }
}
I have verified using the Integration UI tool that the Done and Did-End variables switch to yes when either done condition is met. And just in case here's the relevant Python code:
env = DummyVecEnv([lambda: retro.make("SuperMarioBros-Nes", state="Level1-1.state", scenario="training", record="/gdrive/MyDrive/530_project")])
#model = PPO2(CnnPolicy, env, verbose=1)
for i in range(24):
    model = PPO2.load(filePath + "/" + fileName)
    model.set_env(env)
    model.learn(total_timesteps=time_steps, log_interval=1000, reset_num_timesteps=False)
    model.save(filePath + "/" + fileName)
    print("done with iteration ", i)
    del model
If you want to see the whole Python notebook here's the link: https://colab.research.google.com/drive/1ThxDqjeNQh3rNEXYqlXJQ6tn3W2TPK7k?usp=sharing
It's possible fixing this won't change how it trains, but at the very least I'd like to have smaller BK2 and MP4 files so they're easier to deal with. Any advice would be appreciated. Also let me know if there's a better place to be asking this question.
If anybody else runs into this problem, I sort of found an answer. I misunderstood what total_timesteps was; it looks like it's actually a time limit for each run. I've set it to roughly how long it takes for the time on one life to run out, so it effectively works, but it's still a little janky.
You don't need to load and delete the model each episode. total_timesteps in learn() corresponds to the total timesteps for learning over all episodes.
If you want to limit episode length, you can use the gym TimeLimit wrapper.
Your code could look like this:
from gym.wrappers.time_limit import TimeLimit
time_steps = 1000000
episode_length = 500
env = DummyVecEnv([lambda: TimeLimit('your_mario_env_config...',
                                     max_episode_steps=episode_length)])
model = PPO2(CnnPolicy, env, verbose=1)
model.learn(total_timesteps=time_steps, log_interval=1000, reset_num_timesteps=False)
model.save(filePath + "/" + fileName)
Here each episode is limited to 500 steps, while the total learning process will run for approximately 1,000,000 steps.
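As a concrete sketch (reusing the retro.make arguments from the question; the stable-baselines 2.x import paths and the 500-step cap are assumptions), the wrapper slots into the environment factory like this:
import retro
from gym.wrappers.time_limit import TimeLimit
from stable_baselines import PPO2
from stable_baselines.common.policies import CnnPolicy
from stable_baselines.common.vec_env import DummyVecEnv

def make_env():
    # Build the same Gym Retro environment as in the question, then cap each
    # episode so a run never idles on the title screen after the last life.
    env = retro.make("SuperMarioBros-Nes", state="Level1-1.state", scenario="training")
    return TimeLimit(env, max_episode_steps=500)

env = DummyVecEnv([make_env])
model = PPO2(CnnPolicy, env, verbose=1)
model.learn(total_timesteps=1000000, log_interval=1000)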

Get n number of documents from a collection using MongoDB/MongoEngine

Hi everyone, I have a document inside a collection like this (ignore the absurdity of the question):
[
    {
        "tag": "english",
        "difficulty": "hard",
        "question": "What are alphabets",
        "option_1": "98 billion light years",
        "option_2": "23.3 trillion light years",
        "option_3": "6 minutes",
        "option_4": "It is still unknown",
        "correct_answer": "option_1",
        "id": "5f80befbaaf3c9ce2f4e2fb9"
    }
]
There are about 10,000 documents like this one.
I'm trying to write a Python GET function using flask-restful to fetch n documents from this collection.
Currently, I'm confused about how to write the MongoEngine query.
This is what I do to get a single document by its id:
def get(self, id):
    questions = Question.objects.get(id=id).to_json()
    return Response(questions,
                    mimetype="application/json",
                    status=200)
For n documents, I'm unable to figure out what to write inside:
def get_n_questions(self, n):
    body = request.get_json(force=True)
    questions = ???
    return Response(questions,
                    mimetype="application/json",
                    status=200)
You can use the limit(n) method (doc) on a queryset. This will let you retrieve the first n documents from the collection.
In your case that would mean:
questions = Question.objects().limit(n).to_json()
You may also be interested in the skip(n) method; this will allow you to do pagination (similar to a LIMIT/OFFSET in MySQL, for instance).
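A rough sketch of a paginated handler along these lines (page and page_size are illustrative names, not part of the original code):
def get_page(self, page, page_size):
    # Skip past earlier pages, then take one page worth of documents.
    questions = (Question.objects()
                 .skip(page * page_size)
                 .limit(page_size)
                 .to_json())
    return Response(questions,
                    mimetype="application/json",
                    status=200)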

Bulk update in Pymongo using multiple ObjectId

I want to update thousands of documents in a mongo collection. I want to find them using ObjectId, and whichever document matches should be updated. The update is the same for all documents. I have a list of ObjectIds; for every ObjectId in the list, mongo should find the matching document and set that document's "isBad" key to "N".
ids = [ObjectId('56ac9d3fa722f1029b75b128'), ObjectId('56ac8961a722f10249ad0ad1')]
bulk = db.testdata.initialize_unordered_bulk_op()
bulk.find( { '_id': ids} ).update( { '$set': { "isBad" : "N" } } )
print bulk.execute()
This gives me result :
{'nModified': 0, 'nUpserted': 0, 'nMatched': 0, 'writeErrors': [], 'upserted': [], 'writeConcernErrors': [], 'nRemoved': 0, 'nInserted': 0}
This is expected because it is trying to match "_id" against the whole list. But I don't know how to proceed.
I know how to update every document individually, but my list is around 25,000 ids long and I don't want to make 25,000 individual calls. The number of documents in my collection is much larger. I am using Python 2 and pymongo 3.2.1.
Iterate through the id list using a for loop and send the bulk updates in batches of 500:
bulk = db.testdata.initialize_unordered_bulk_op()
counter = 0

for id in ids:
    # process in bulk
    bulk.find({'_id': id}).update({'$set': {'isBad': 'N'}})
    counter += 1

    if (counter % 500 == 0):
        bulk.execute()
        bulk = db.testdata.initialize_unordered_bulk_op()

if (counter % 500 != 0):
    bulk.execute()
Because write commands can accept no more than 1000 operations (from the docs), you have to split the bulk operations into multiple batches; here you can choose any batch size up to 1000.
The reason for choosing 500 is to ensure that the sum of the associated document from Bulk.find() and the update document stays at or below the maximum BSON document size, since there is no guarantee that the default 1000-operation requests will fit under the 16MB BSON limit. The Bulk() operations in the mongo shell and comparable methods in the drivers do not have this limit.
bulk = db.testdata.initialize_unordered_bulk_op()
for id in ids:
    bulk.find({'_id': id}).update({'$set': {"isBad": "N"}})
bulk.execute()
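Alternatively, since the update is identical for every document, a single update_many with an $in filter does the whole job in one call on PyMongo 3.x. A sketch using the ids list from the question:
# One round trip: match any _id in the list and set isBad on all of them.
result = db.testdata.update_many(
    {'_id': {'$in': ids}},
    {'$set': {'isBad': 'N'}}
)
print(result.modified_count)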

Python vs perl sort performance

Solution
This solved all issues with my Perl code (plus extra implementation code.... :-) ) In conclusion, both Perl and Python are equally awesome.
use WWW::Curl::Easy;
Thanks to ALL who responded, very much appreciated.
Edit
It appears that the Perl code I am using is spending the majority of its time performing the HTTP GET, for example:
my $start_time = gettimeofday;
$request = HTTP::Request->new('GET', 'http://localhost:8080/data.json');
$response = $ua->request($request);
$page = $response->content;
my $end_time = gettimeofday;
print "Time taken #{[ $end_time - $start_time ]} seconds.\n";
The result is:
Time taken 74.2324419021606 seconds.
My python code in comparison:
start = time.time()
r = requests.get('http://localhost:8080/data.json', timeout=120, stream=False)
maxsize = 100000000
content = ''
for chunk in r.iter_content(2048):
    content += chunk
    if len(content) > maxsize:
        r.close()
        raise ValueError('Response too large')
end = time.time()
timetaken = end-start
print timetaken
The result is:
20.3471381664
In both cases the sort times are sub-second. So first of all I apologise for the misleading question, and it is another lesson for me to never ever make assumptions.... :-)
I'm not sure what the best thing to do with this question is now. Perhaps someone can propose a better way of performing the request in Perl?
End of edit
This is just a quick question regarding sort performance differences in Perl vs Python. This is not a question about which language is better/faster; for the record, I first wrote this in Perl, noticed the time the sort was taking, and then wrote the same thing in Python to see how fast it would be. I simply want to know: how can I make the Perl code perform as fast as the Python code?
Let's say we have the following JSON:
["3434343424335": {
"key1": 2322,
"key2": 88232,
"key3": 83844,
"key4": 444454,
"key5": 34343543,
"key6": 2323232
},
"78237236343434": {
"key1": 23676722,
"key2": 856568232,
"key3": 838723244,
"key4": 4434544454,
"key5": 3432323543,
"key6": 2323232
}
]
Let's say we have around 30k-40k records that we want to sort by one of the sub-keys, then build a new array of records ordered by that sub-key.
Perl - Takes around 27 seconds
my @list;
$decoded = decode_json($page);
foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
    push(@list, {"key"=>$id, "key1"=>$decoded->{$id}{key1}...etc));
}
Python - Takes around 6 seconds
list = []
data = json.loads(content)
data2 = sorted(data, key=lambda x: data[x]['key5'], reverse=True)
for key in data2:
    tmp = {'id': key, 'key1': data[key]['key1'], etc.....}
    list.append(tmp)
For the perl code, I have tried using the following tweaks:
use sort '_quicksort'; # use a quicksort algorithm
use sort '_mergesort'; # use a mergesort algorithm
Your benchmark is flawed, you're benchmarking multiple variables, not one. It is not just sorting data, but it is also doing JSON decoding, and creating strings, and appending to an array. You can't know how much time is spent sorting and how much is spent doing everything else.
The matter is made worse in that there are several different JSON implementations in Perl each with their own different performance characteristics. Change the underlying JSON library and the benchmark will change again.
If you want to benchmark sort, you'll have to change your benchmark code to eliminate the cost of loading your test data from the benchmark, JSON or not.
Perl and Python have their own internal benchmarking libraries that can benchmark individual functions, but their instrumentation can make them perform far less well than they would in the real world. The performance drag from each benchmarking implementation will be different and might introduce a false bias. These benchmarking libraries are more useful for comparing two functions in the same program. For comparing between languages, keep it simple.
The simplest way to get an accurate benchmark is to time the code within the program itself using the wall clock.
# The current time to the microsecond.
use Time::HiRes qw(gettimeofday);

my @list;
my $decoded = decode_json($page);

my $start_time = gettimeofday;
foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
    push(@list, {"key"=>$id, "key1"=>$decoded->{$id}{key1}...etc));
}
my $end_time = gettimeofday;

print "sort and append took @{[ $end_time - $start_time ]} seconds\n";
(I leave the Python version as an exercise)
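(A rough sketch of that exercise, timing only the sort-and-append step once the JSON has already been decoded:)
import json
import time

data = json.loads(content)   # decode outside the timed region

start = time.time()
result = []
for key in sorted(data, key=lambda x: data[x]['key5'], reverse=True):
    row = dict(data[key])    # copy the record and add its key, as in the original
    row['id'] = key
    result.append(row)
end = time.time()

print('sort and append took %s seconds' % (end - start))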
From here you can improve your technique. You can use CPU seconds instead of wall clock. The array append and the cost of creating the string are still included in the benchmark; they can be eliminated so that you're benchmarking only the sort. And so on.
Additionally, you can use a profiler to find out where your programs are spending their time. Profilers have the same raw performance caveats as benchmarking libraries, and the results are only useful for finding out what percentage of its time a program spends where, but they will quickly show whether your benchmark has unexpected drag.
The important thing is to benchmark what you think you're benchmarking.
Something else is at play here; I can run your sort in half a second. Improving that is not going to depend on sorting algorithm so much as reducing the amount of code run per comparison; a Schwartzian Transform gets it to a third of a second, a Guttman-Rosler Transform gets it down to a quarter of a second:
#!/usr/bin/perl
use 5.014;
use warnings;

my $decoded = { map( (int rand 1e9, { map( ("key$_", int rand 1e9), 1..6 ) } ), 1..40000 ) };

use Benchmark 'timethese';

timethese( -5, {
    'original' => sub {
        my @list;
        foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
            push(@list, {"key"=>$id, %{$decoded->{$id}}});
        }
    },
    'st' => sub {
        my @list;
        foreach my $id (
            map $_->[1],
            sort { $b->[0] <=> $a->[0] }
            map [ $decoded->{$_}{key5}, $_ ],
            keys %{$decoded}
        ) {
            push(@list, {"key"=>$id, %{$decoded->{$id}}});
        }
    },
    'grt' => sub {
        my $maxkeylen = 15;
        my @list;
        foreach my $id (
            map substr($_, $maxkeylen),
            sort { $b cmp $a }
            map sprintf('%0*s', $maxkeylen, $decoded->{$_}{key5}) . $_,
            keys %{$decoded}
        ) {
            push(@list, {"key"=>$id, %{$decoded->{$id}}});
        }
    },
});
Don't create a new hash for each record. Just add the key to the existing one.
$decoded->{$_}{key} = $_
    for keys(%$decoded);

my @list = sort { $b->{key5} <=> $a->{key5} } values(%$decoded);
Using Sort::Key will make it even faster.
use Sort::Key qw( rukeysort );

$decoded->{$_}{key} = $_
    for keys(%$decoded);

my @list = rukeysort { $_->{key5} } values(%$decoded);

Raster: How to get elevation at lat/long using python?

I also posted this question in the GIS section of SO. As I'm not sure whether this is rather a 'pure' Python question, I'm asking it here as well.
I was wondering if anyone has some experience in getting elevation data from a raster without using ArcGIS, but rather get the information as a python list or dict?
I get my XY data as a list of tuples.
I'd like to loop through the list or pass it to a function or class-method to get the corresponding elevation for the xy-pairs.
I did some research on the topic and the GDAL API sounds promising. Can anyone advise me on how to go about this, pitfalls, sample code? Other options?
Thanks for your efforts, LarsVegas
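For the GDAL route mentioned in the question, a minimal sketch of sampling a DEM at each XY pair could look like this (dem.tif is a hypothetical file name, and the DEM is assumed to be in the same coordinate system as the input points):
from osgeo import gdal

def elevations(xy_pairs, dem_path='dem.tif'):
    # Open the raster once and sample the first band pixel by pixel.
    ds = gdal.Open(dem_path)
    band = ds.GetRasterBand(1)
    # The geotransform maps pixel/line indices to georeferenced coordinates.
    origin_x, pixel_w, _, origin_y, _, pixel_h = ds.GetGeoTransform()
    result = {}
    for x, y in xy_pairs:
        col = int((x - origin_x) / pixel_w)
        row = int((y - origin_y) / pixel_h)
        result[(x, y)] = float(band.ReadAsArray(col, row, 1, 1)[0, 0])
    return result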
I recommend checking out the Google Elevation API
It's very straightforward to use:
http://maps.googleapis.com/maps/api/elevation/json?locations=39.7391536,-104.9847034&sensor=true_or_false
{
    "results" : [
        {
            "elevation" : 1608.637939453125,
            "location" : {
                "lat" : 39.73915360,
                "lng" : -104.98470340
            },
            "resolution" : 4.771975994110107
        }
    ],
    "status" : "OK"
}
Note that the free version is limited to 2,500 requests per day.
We used this code to get elevation for a given latitude/longitude (NOTE: we only asked to print the elevation, and the rounded lat and long values).
import urllib.request
import json
lati = input("Enter the latitude:")
lngi = input("Enter the longitude:")
# url_params completes the base url with the given latitude and longitude values
ELEVATION_BASE_URL = 'http://maps.googleapis.com/maps/api/elevation/json?'
URL_PARAMS = "locations=%s,%s&sensor=%s" % (lati, lngi, "false")
url=ELEVATION_BASE_URL + URL_PARAMS
with urllib.request.urlopen(url) as f:
    response = json.loads(f.read().decode())
status = response["status"]
result = response["results"][0]
print(float(result["elevation"]))
print(float(result["location"]["lat"]))
print(float(result["location"]["lng"]))
Have a look at altimeter, a wrapper for the Google Elevation API.
Here is another nice API that I've built: https://algorithmia.com/algorithms/Gaploid/Elevation
import Algorithmia
input = {
    "lat": "50.2111",
    "lon": "18.1233"
}
client = Algorithmia.client('YOUR_API_KEY')
algo = client.algo('Gaploid/Elevation/0.3.0')
print algo.pipe(input)
