Pymongo query taking too long - python

Right now I have a problem with making a find query return results faster, and I don't know exactly how to do it. I'm fairly new to both MongoDB and Python, so please bear with me. I have a collection in MongoDB with 250,000 documents. Each one is fairly nested. They look something like this:
_id: ObjectId
shopper: ShopperId
data: {
    report: {
        items: {
            accounts: {
                account_1: {
                    mask:
                    type:
                    subtype:
                    inst_id:
                    historical_balances: {
                        object_1
                        object_2
                        ...
                    }
                },
                account_2: {
                    mask:
                    type:
                    subtype:
                    inst_id:
                    historical_balances: {
                        object_1
                        object_2
                        ...
                    }
                },
                ...
            }
        }
    }
}
And I need to get that data plus the sum of the historical balances for each account in each document. Right now it is taking forever and I don't know what to do. I am trying to download the data locally from MongoDB, but every time my internet connection drops everything is lost, and it has taken more than 10 hours to get through half of it. I tried using list(find(query)) so I could deal with it later, but I don't have enough RAM. Right now what I am doing is:
for k in cursor:
    item = k['data']['report']['items'][0]
    for j in range(len(item['accounts'])):
        account = item['accounts'][j]
        e.append([k['_id'], k['shopper'],
                  item['institution_id'], item['institution_name'],
                  account['mask'], account['type'], account['subtype'],
                  len(account['transactions'])])
In summary: each document has a data field, which has a report field, which has an items field, which has an accounts field; each accounts field holds multiple account objects that have a mask, type, subtype and historical balances. I need all of this data plus a sum of the historical balances for each account.
Right now I am using the code above to get the data, put it into a list, and then turn it into a pandas DataFrame so I can save it as a CSV file, which is what I need. I know it isn't the prettiest code, but it was the first idea I came up with. Any ideas on how I can improve this performance, which is really too slow for my needs?
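One thing that might be worth trying (a sketch only, assuming items and accounts are arrays as the loop suggests, and with hypothetical db/collection names) is to push the flattening and the balance sum to the server with an aggregation pipeline, so only small per-account rows cross the network instead of the full nested documents:
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # adjust to your deployment
coll = client.mydb.reports                          # hypothetical db/collection names

pipeline = [
    {"$unwind": "$data.report.items"},
    {"$unwind": "$data.report.items.accounts"},
    {"$project": {
        "shopper": 1,
        "institution_id": "$data.report.items.institution_id",
        "institution_name": "$data.report.items.institution_name",
        "mask": "$data.report.items.accounts.mask",
        "type": "$data.report.items.accounts.type",
        "subtype": "$data.report.items.accounts.subtype",
        # assumes historical_balances is an array of numbers; if the entries are
        # sub-documents, sum a specific field such as
        # "$data.report.items.accounts.historical_balances.amount" instead
        "balance_sum": {"$sum": "$data.report.items.accounts.historical_balances"},
    }},
]

rows = coll.aggregate(pipeline, allowDiskUse=True)
pd.DataFrame(list(rows)).to_csv("accounts.csv", index=False)
Even if you keep the Python loop, projecting only the fields you actually need instead of pulling whole documents should cut the transfer time considerably.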

Related

Extract only the data from a Django Query Set

I am working to learn Django and have built a test database to work with. I have a table that provides basic vendor invoice information, and I want to simply present the user with the total value of the invoices that have been loaded into the database. I found that the following queryset does sum the column as I'd hoped:
total_outstanding: object = Invoice.objects.aggregate(Sum('invoice_amount'))
but the result is presented on the page in the following unhelpful way:
Total $ Outstanding: {'invoice_amount__sum': Decimal('1965')}
The 1965 is the correct total for the invoices that I populated the database with, so the queryset is pulling what I want it to, but I just want to present that portion of the result to the user, without the other stuff.
Someone else asked a similar question (basically the same) here: how-to-extract-data-from-django-queryset, but the answer makes no sense to me, it is just:
k = k[0] = {'name': 'John'}
Queryset is a list.
Can anyone help me with a plain-English explanation of how I can extract just the numerical result of that query for presentation to a user?
What you get here is a dictionary that maps the name of the aggregate to the corresponding value. You can use subscripting to obtain that value:
object = Invoice.objects.aggregate(
    Sum('invoice_amount')
)['invoice_amount__sum']
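A small variation that avoids the auto-generated key name entirely is to name the aggregate yourself and subscript by that name (same Invoice model as above):
from django.db.models import Sum

# naming the aggregate replaces the default 'invoice_amount__sum' key;
# "or 0" guards against None when there are no invoices yet
total_outstanding = Invoice.objects.aggregate(total=Sum('invoice_amount'))['total'] or 0
You can then pass total_outstanding to the template context and render it directly.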

DynamoDB Querying in Python (Count with GroupBy)

This may be trivial, but I loaded a local DynamoDB instance with 30GB worth of Twitter data that I aggregated.
The primary key is id (tweet_id from the Tweet JSON), and I also store the date/text/username/geocode.
I basically am interested in mentions of two topics (let's say "Bees" and "Booze"). I want to get a count of each of those by state by day.
So by the end, I should know, for each state, how many times each topic was mentioned on a given day. And I guess it'd be nice to export that as a CSV or something for later analysis.
Some issues I had with doing this...
First, the geocode info is a tuple of [latitude, longitude] so for each entry, I need to map that to a state. That I can do.
Second, is the most efficient way to do this to go through each entry, manually check whether it mentions either keyword, and then keep a dictionary for each that maps the date/location to a count?
EDIT:
Since it took me 20 hours to load all the data into my table, I don't want to delete and re-create it. Perhaps I should create a global secondary index (?) and use that to search other fields in a query? That way I don't have to scan everything. Is that the right track?
EDIT 2:
Well, since the table is on my computer locally, I should be OK with just using expensive operations like a Scan, right?
So if I did something like this:
query = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='id, text, date, geo',
    Limit=100)
And did one scan for each keyword, then I would be able to go through the resulting filtered list and get a count of mentions of each topic for each state on a given day, right?
EDIT3:
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    Limit=100)
# do something with this set
while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('text').contains("Booze"),
        Limit=100,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    # do something with each batch of up to 100 entries
So something like that, for both keywords. That way I'll be able to go through the resulting filtered set and do what I want (in this case, figure out the location and day and create a final dataset with that info). Right?
EDIT 4
If I add:
ProjectionExpression='date, location, user, text'
into the scan request, I get an error saying "botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the Scan operation: Invalid ProjectionExpression: Attribute name is a reserved keyword; reserved keyword: location". How do I fix that?
NVM I got it. Answer is to look into ExpressionAttributeNames (see: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ExpressionPlaceholders.html)
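For anyone hitting the same error, a minimal sketch of that fix (the table name is hypothetical; several of these attribute names are on DynamoDB's reserved word list, so each one is aliased through a placeholder):
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('tweets')    # hypothetical table name

response = table.scan(
    FilterExpression=Attr('text').contains('Booze'),
    # reserved words such as 'date', 'location' and 'user' must be
    # referenced through ExpressionAttributeNames placeholders
    ProjectionExpression='#dt, #loc, #usr, #txt',
    ExpressionAttributeNames={
        '#dt': 'date',
        '#loc': 'location',
        '#usr': 'user',
        '#txt': 'text',
    },
)
items = response['Items']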
Yes, scanning the table for "Booze" and counting the items in the result should give you the total count. Please note that you need to keep scanning until LastEvaluatedKey is no longer present in the response, passing it back as ExclusiveStartKey on each follow-up call.
Scan
EDIT:
Yes, the code looks good. One thing to note: the result set won't always contain 100 items. Please refer to the Limit definition below (it is not the same as LIMIT in a SQL database).
Limit — (Integer) The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.
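If it helps, here is a rough sketch of the whole loop (latlong_to_state is the hypothetical lat/long-to-state mapping you said you already have), paginating with ExclusiveStartKey and tallying mentions per state per day:
from collections import defaultdict

from boto3.dynamodb.conditions import Attr


def count_mentions(table, keyword, latlong_to_state):
    """Return {(state, date): number of tweets mentioning keyword}."""
    counts = defaultdict(int)
    kwargs = {'FilterExpression': Attr('text').contains(keyword)}
    while True:
        response = table.scan(**kwargs)
        for item in response['Items']:
            state = latlong_to_state(item['geo'])    # your own mapping function
            counts[(state, item['date'])] += 1
        if 'LastEvaluatedKey' not in response:
            break                                    # scanned the whole table
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
    return counts


# usage (assuming `table` is the boto3 Table object from the scans above):
#   booze_counts = count_mentions(table, 'Booze', latlong_to_state)
#   bees_counts = count_mentions(table, 'Bees', latlong_to_state)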

Elasticsearch backfill two fields into one new field after calculations

I have been tasked with researching how to backfill data in Elasticsearch, and so far I'm coming up a bit empty. The basic gist is:
Notes: All documents are stored under daily indexes, with ~200k documents per day.
I need to be able to reindex about 60 days worth of data.
I need to take two fields from each doc, payload.time_sec and payload.time_nanosec, take their values, do some math on them (time_sec * 10**9 + time_nanosec), and then write the result back as a single new field in the reindexed document.
I am looking at the Python API documentation with bulk helpers:
http://elasticsearch-py.readthedocs.io/en/master/helpers.html
But I am wondering if this is even possible.
My thoughts were to use:
Use the bulk helpers to get a scroll ID (bulk _update?), iterate over each doc ID, pull in the data from the two fields for each doc, do the math, and finish the update request with the new field data.
Has anyone done this? Maybe something with a Groovy script?
Thanks!
Use the bulk helpers to get a scroll ID (bulk _update?), iterate over each doc ID, pull in the data from the two fields for each doc, do the math, and finish the update request with the new field data.
Basically, yes:
use /_search?scroll to fetch the docs
perform your operation
send /_bulk update requests
Other options are:
use the /_reindex API (probably not so good if you don't want to create a new index)
use the /_update_by_query API
Both support scripting which, if I understood it correctly, would be the perfect choice, because your update does not depend on external factors, so this could just as well be done directly on the server.
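For what it's worth, a sketch of the _update_by_query route with a Painless script (field names taken from the question, the index name is a placeholder; on Elasticsearch 5.x the script key is "inline" rather than "source", so adjust for your version):
from elasticsearch import Elasticsearch

es = Elasticsearch()

body = {
    "script": {
        "lang": "painless",
        # compute the combined field on the server side
        "source": ("ctx._source.payload.duration = "
                   "ctx._source.payload.time_sec * 1000000000L + "
                   "ctx._source.payload.time_nanosec"),
    },
    # only touch documents that actually carry both fields
    "query": {
        "bool": {
            "filter": [
                {"exists": {"field": "payload.time_sec"}},
                {"exists": {"field": "payload.time_nanosec"}},
            ]
        }
    },
}

# run it against one daily index at a time (or a wildcard pattern)
es.update_by_query(index="my-index-2017.01.01", body=body, conflicts="proceed")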
Here is where I am at (roughly):
I've been working with Python and the bulk helpers, and so far I am around here:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

doc = helpers.scan(es, query={
        "query": {
            "match_all": {}
        },
        "size": 1000
    }, index=INDEX, scroll='5m', raise_on_error=False)

count = 0
new_index_data = []
for x in doc:
    x['_index'] = NEW_INDEX
    try:
        time_sec = x['_source']['payload']['time_sec']
        time_nanosec = x['_source']['payload']['time_nanosec']
    except KeyError:
        continue  # skip docs that are missing either field
    duration = (time_sec * 10**9) + time_nanosec
    count = count + 1
    x['_source']['payload']['duration'] = duration
    new_index_data.append(x)
helpers.bulk(es, new_index_data)
From here I am just using the bulk Python helper to insert into the new index. However, I will experiment with changing this to a bulk update against the existing index and test that as well.
Does this look like the right approach?
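One refinement worth considering (a sketch based on the snippet above, with INDEX/NEW_INDEX as placeholders): feed helpers.bulk a generator instead of accumulating everything in new_index_data, so memory use stays flat no matter how many documents the scroll returns.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

INDEX = "old-index"         # placeholder source index
NEW_INDEX = "new-index"     # placeholder destination index

def reindexed_docs():
    docs = helpers.scan(es, query={"query": {"match_all": {}}},
                        index=INDEX, scroll='5m', raise_on_error=False)
    for doc in docs:
        payload = doc['_source'].get('payload', {})
        if 'time_sec' in payload and 'time_nanosec' in payload:
            payload['duration'] = payload['time_sec'] * 10**9 + payload['time_nanosec']
        yield {
            '_op_type': 'index',
            '_index': NEW_INDEX,
            '_type': doc.get('_type', 'doc'),   # drop this on ES 7+, where types are gone
            '_id': doc['_id'],
            '_source': doc['_source'],
        }

helpers.bulk(es, reindexed_docs(), chunk_size=1000)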

Which database to store very large nested Python dicts?

My script produces data in the following format:
dictionary = {
    (.. 42 values: None, 1 or 2 ..): {
        0: 0.4356,  # ints as keys, floats as values
        1: 0.2355,
        2: 0.4352,
        ...
        6: 0.6794
    },
    ...
}
where:
(.. 42 values: None, 1 or 2 ..) is a game state
inner dict stores calculated values of actions which are possible in that state
The problem is that the state space is very big (millions of states), so the whole data structure cannot be stored in memory. That's why I'm looking for a database engine that would fit my needs and that I could use from Python. I need to get the list of actions and their values for a given state (the previously mentioned tuple of 42 values) and to modify the value of a given action in a given state.
Check out ZODB: http://www.zodb.org/en/latest/
It's a native object DB for Python that supports transactions, caching, pluggable storage layers, pack operations (for trimming history) and BLOBs.
You can use a key-value store. A good one is Redis. It's very fast and simple, written in C, and it is more than just a key-value cache. Integrating it with Python takes just a few lines of code, and Redis can also be scaled very easily for really big data. I worked in the game industry and know what I am talking about.
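A minimal sketch of that idea with redis-py (the key naming is hypothetical; the 42-value state tuple is serialized into the key and the inner action-to-value dict becomes a Redis hash, which gives you per-action reads and writes):
import redis

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

state = (None, 1, 2) + (None,) * 39                  # example 42-value game state
key = 'state:' + ','.join('x' if v is None else str(v) for v in state)

# store or update the action values for this state (redis-py 3.5+ for mapping=)
r.hset(key, mapping={0: 0.4356, 1: 0.2355, 2: 0.4352})

# read back all actions and their values for the state
actions = {int(field): float(value) for field, value in r.hgetall(key).items()}

# modify the value of a single action in a single state
r.hset(key, 1, 0.31)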
Also, as already mentioned here, you can use a more complex solution, a proper database rather than a cache: PostgreSQL. It now supports a binary JSON field type, JSONB. I think the best Python database ORM is SQLAlchemy, and it supports PostgreSQL out of the box; I will use it in the code below. For example, say you have a table:
class MobTable(db.Model):
    __tablename__ = 'mobs'

    id = db.Column(db.Integer, primary_key=True)
    stats = db.Column(JSONB, index=True, default={})
If you have a mob with JSON stats like this:
{
    id: 1,
    title: 'UglyOrk',
    resists: {cold: 13}
}
You can search for all mobs with a non-null cold resist:
expr = MobTable.stats[("resists", "cold")]
q = (session.query(MobTable.id, expr.label("cold_protected"))
     .filter(expr != None)
     .all())
I recommend you use HDF5. It's a data format with excellent Python support (via h5py or PyTables) that stores the data in binary form, which reduces the size of the stored data to a great extent. More importantly, it gives you the ability of random access, which I believe serves your purposes. Also, if you do not use any compression method you will retrieve the data at the highest possible speed.
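A rough sketch of that with h5py (the file and key layout are made up; note that millions of tiny per-state datasets carry per-object overhead in HDF5, so it is worth benchmarking against your real state count first):
import h5py
import numpy as np

state = (None, 1, 2) + (None,) * 39                    # example 42-value game state
state_key = ','.join('x' if v is None else str(v) for v in state)

with h5py.File('states.h5', 'a') as f:
    # one small float array of action values per state, addressed by name
    if state_key in f:
        del f[state_key]
    f.create_dataset(state_key, data=np.array([0.4356, 0.2355, 0.4352]))

    values = f[state_key][:]                           # random access by state key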
You can also store it as JSONB in PostgreSQL DB.
For connecting with PostgreSQL you can use psycopg2, which is compliant with Python Database API Specification v2.0.
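A short sketch of that route (table and column names are hypothetical; psycopg2's Json adapter handles the dict-to-JSONB conversion, and the upsert needs PostgreSQL 9.5+ for ON CONFLICT):
import json
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=gamedb user=postgres")   # adjust credentials
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS states (
        state_key TEXT PRIMARY KEY,   -- serialized 42-value tuple
        actions   JSONB NOT NULL      -- action id -> value
    )
""")

state_key = json.dumps([None, 1, 2] + [None] * 39)        # example state
actions = {0: 0.4356, 1: 0.2355, 2: 0.4352}

# upsert the whole action map for a state
cur.execute(
    "INSERT INTO states (state_key, actions) VALUES (%s, %s) "
    "ON CONFLICT (state_key) DO UPDATE SET actions = EXCLUDED.actions",
    (state_key, Json(actions)),
)

# fetch the action values for a given state (JSONB keys come back as strings)
cur.execute("SELECT actions FROM states WHERE state_key = %s", (state_key,))
row = cur.fetchone()[0]
conn.commit()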

How come a document in MongoDB sometimes gets inserted, but most often doesn't?

con = pymongo.Connection('localhost',27017)
db = con.MouseDB
post = { ...some stuff }
datasets = db.datasets
datasets.insert(post)
So far, there are only 3 records, and it's supposed to have about 100...
Did you check to make sure that you don't have collisions on your primary keys? Take a look at the console and it should tell you why a document is not inserting. I'd also recommend trying to insert the documents one at a time from the command-line tool to get better information on why they may not be inserting correctly. If you want to update existing documents, use save instead of insert; save performs an upsert.
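If it helps, a minimal sketch of surfacing the failures instead of letting them pass silently (assuming a reasonably recent PyMongo, where MongoClient and insert_one replace Connection and insert, and writes are acknowledged by default):
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, PyMongoError

client = MongoClient('localhost', 27017)
datasets = client.MouseDB.datasets

post = {"some": "stuff"}                      # the document you are inserting

try:
    result = datasets.insert_one(post)
    print('inserted', result.inserted_id)
except DuplicateKeyError:
    print('a document with this _id already exists')
except PyMongoError as exc:
    print('insert failed:', exc)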
