DynamoDB Querying in Python (Count with GroupBy) - python

This may be trivial, but I loaded a local DynamoDB instance with 30GB worth of Twitter data that I aggregated.
The primary key is id (tweet_id from the Tweet JSON), and I also store the date/text/username/geocode.
I'm basically interested in mentions of two topics (let's say "Bees" and "Booze"), and I want to get a count of each of those by state by day.
So by the end, I should know, for each state, how many times each topic was mentioned on a given day. And I guess it'd be nice to export that as a CSV or something for later analysis.
Some issues I had with doing this...
First, the geocode info is a tuple of [latitude, longitude] so for each entry, I need to map that to a state. That I can do.
Second, is the most efficient way to do this just to go through each entry, manually check whether it mentions either keyword, and then keep a dictionary for each topic that maps date/location to a count?
EDIT:
Since it took me 20 hours to load all the data into my table, I don't want to delete and re-create it. Perhaps I should create a global secondary index (?) and use that to search other fields in a query? That way I don't have to scan everything. Is that the right track?
EDIT 2:
Well, since the table is on my computer locally, I should be OK with just using expensive operations like a Scan, right?
So if I did something like this:
from boto3.dynamodb.conditions import Attr

query = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='id, text, date, geo',
    Limit=100)
And did one scan for each keyword, then I would be able to go through the resulting filtered list and get a count of mentions of each topic for each state on a given day, right?
EDIT 3:
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    Limit=100)
# do something with this first set
while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('text').contains("Booze"),
        Limit=100,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    # do something with each batch of up to 100 entries
So something like that, for both keywords. That way I'll be able to go through the resulting filtered set and do what I want (in this case, figure out the location and day and create a final dataset with that info). Right?
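For what it's worth, a hedged sketch of how that aggregation could look. The tally helper, the attribute names (text, date, geo), and state_from_geo (the [latitude, longitude]-to-state mapping mentioned above) are assumptions from the question, not tested against your table:
import csv
from collections import defaultdict
from boto3.dynamodb.conditions import Attr

counts = defaultdict(int)  # (state, date, keyword) -> count

def tally(items, keyword):
    for item in items:
        state = state_from_geo(item['geo'])  # your [lat, lon] -> state mapping
        counts[(state, item['date'], keyword)] += 1

for keyword in ("Bees", "Booze"):
    scan_kwargs = {'FilterExpression': Attr('text').contains(keyword)}
    while True:  # keep scanning until no LastEvaluatedKey is returned
        response = table.scan(**scan_kwargs)
        tally(response['Items'], keyword)
        if 'LastEvaluatedKey' not in response:
            break
        scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

# Export for later analysis, as mentioned in the question.
with open('mentions.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['state', 'date', 'keyword', 'count'])
    for (state, date, keyword), n in sorted(counts.items()):
        writer.writerow([state, date, keyword, n])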
EDIT 4:
If I add:
ProjectionExpression='date, location, user, text'
into the scan request, I get an error saying "botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the Scan operation: Invalid ProjectionExpression: Attribute name is a reserved keyword; reserved keyword: location". How do I fix that?
Never mind, I got it. The answer is to use ExpressionAttributeNames (see: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ExpressionPlaceholders.html).
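For reference, a minimal sketch of that fix: alias each attribute name in the ProjectionExpression with a #-placeholder (aliasing all four here to be safe, since several of them are DynamoDB reserved words):
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='#d, #loc, #u, #t',
    ExpressionAttributeNames={
        '#d': 'date',
        '#loc': 'location',
        '#u': 'user',
        '#t': 'text',
    },
    Limit=100)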

Yes, scanning the table for "Booze" and counting the items in the result should give you the total count. Please note that you need to scan recursively until LastEvaluatedKey is no longer present in the response.
Refer to ExclusiveStartKey as well, and to the Scan documentation.
EDIT:
Yes, the code looks good. One thing to note: the result set won't always contain 100 items. Please refer to the Limit definition below (it is not the same as LIMIT in a SQL database).
Limit — (Integer) The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.
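In practice, that means a single page's Count can be anywhere from 0 up to Limit, so the true total is the sum of Count across all pages. A hedged sketch (Select='COUNT' returns only counts, not the items themselves):
total = 0
scan_kwargs = {
    'FilterExpression': Attr('text').contains("Booze"),
    'Select': 'COUNT',
    'Limit': 100,
}
while True:
    response = table.scan(**scan_kwargs)
    total += response['Count']  # matching items evaluated on this page
    if 'LastEvaluatedKey' not in response:
        break
    scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
print(total)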

Related

How to use start_at() & end_at() in Firebase query with Python?

My Firebase Realtime Database schema (screenshot omitted): a tracks node whose children are keyed by push IDs.
Suppose the database schema above. I want to get data with order_by_key(), taking the items after the first 5 and up to the first 10 (a range of 5-10), like in the image.
My key is always starting with -.
I tried this, but it fails: it returns 0 results. How can I do this?
snapshot = ref.child('tracks').order_by_key().start_at('-\5').end_at(u'-\10').get()
Firebase queries are based on cursor/anchor values, not on offsets. This means that the start_at and end_at calls expect values of the thing you order on; since you order on keys here, they expect the keys of those nodes.
To get the slice you indicate you'll need:
ref.child('tracks').order_by_key().start_at('-MQJ7P').end_at(u'-MQJ8O').get()
If you don't know either of those values, you can't specify them and can only start from the first item or end on the last item.
The only exception is that you can specify a limit_to_first instead of end_at to get a number of items at the start of the slice:
ref.child('tracks').order_by_key().start_at('-MQJ7P').limit_to_first(5).get()
Alternatively if you know only the key of the last item, you can get the five items before that with:
ref.child('tracks').order_by_key().end_at('-MQJ8O').limit_to_last(5).get()
But you'll need to know at least one of the keys, typically because you've shown it as the last item on the previous page/first item on the next page.
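A hedged sketch of that paging pattern with the Python Admin SDK; the anchor key '-MQJ7P' is just the example value from above, and in a real app it would be the last key shown on the previous page:
from firebase_admin import db

ref = db.reference('tracks')
# start_at is inclusive of the anchor key, so fetch one extra item...
page = ref.order_by_key().start_at('-MQJ7P').limit_to_first(6).get()
# ...and drop the anchor, keeping the next 5 items.
items = list(page.items())[1:]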

How to know number of position document in MongoDB

What is the best way to look for a document’s position in a collection?
I'm using the following code, but it performs badly on my big collection of documents.
def get_top_func(user_score):
    # PyMongo cursors don't support len(), so count on the server instead
    return db.collection.count_documents({'score': {'$gt': user_score}}) + 1
Considering your query, and the fact that you mention it is slow, I am guessing that you don't have an index on the score field. Creating an index should make this query faster:
db.collection.createIndex( { score: -1 } )
After that, you should be able to run the query you have with better performance.
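Since the question is in Python, the PyMongo equivalent would be something like this (a sketch, assuming db is a pymongo Database handle):
import pymongo

db.collection.create_index([('score', pymongo.DESCENDING)])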
You can have a field like myDocId as the key, with its value taken from a variable that functions as a counter.
That way, while inserting each document into the collection, you also store its number along with the document data.
Each document in a collection has an identifier key stuffed in by MongoDB itself, _id, but that won't tell you which document it is in sequence (the nth document), as it's made up of 4 parts:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
Also refer to https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/
and also MongoDB Query, sort then take nth document for group.
So what you can do is use aggregate: apply your filters, then project as needed to get the nth element using $arrayElemAt.
https://docs.mongodb.com/manual/reference/operator/aggregation/arrayElemAt/
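A hedged sketch of that approach; the field name score and the index 4 are illustrative only:
pipeline = [
    {'$sort': {'score': -1}},                             # your filters/sort here
    {'$group': {'_id': None, 'ids': {'$push': '$_id'}}},  # collect ordered ids
    {'$project': {'nth': {'$arrayElemAt': ['$ids', 4]}}}  # 0-based: the 5th document
]
result = list(db.collection.aggregate(pipeline))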

Firebase is only responding with a single document when I query it even though multiple meet the query criteria

I'm trying to write a cloud function that returns users near a specific location. get_nearby() returns a list of tuples containing upper and lower bounds for a geohash query, and then this loop should query firebase for users within those geohashes.
user_ref = db.collection(u'users')
db_response = []
for query_range in get_nearby(lat, long, radius):
    query = (user_ref
             .where(u'geohash', u'>=', query_range[0])
             .where(u'geohash', u'<=', query_range[1])
             .get())
    for el in query:
        db_response.append(el.to_dict())
For some reason when I run this code, it returns only one document from my database, even though there are three other documents with the same geohash as that one. I know the documents are there, and they do get returned when I request the entire collection. What am I missing here?
edit:
The database currently has 4 records in it, 3 of which should be returned in this query:
[
    {name: "Trevor", geohash: "dnvtz"},  # this is the one that gets returned
    {name: "Test", geohash: "dnvtz"},
    {name: "Test", geohash: "dnvtz"}
]
query_range is a tuple with two values. A lower and upper bound geohash. In this case, it's ("dnvt0", "dnvtz").
I decided to clear all documents from my database and then generate a new set of sample data to work with (everything there was only for testing anyway, nothing important). After pushing the new data to Firestore, everything is working. My only assumption is that even though the strings matched up, I'd used the wrong encoding on some of them.

Correct the sequence generator start number?

My PostgreSQL database is allocating ids that already exist. From what I've read, this can be a problem with the sequence generator.
I seem to get sequence corruption often, with the sequence's next number falling before the last id in the database.
I know I can change the number in pgAdmin, but how can I auto-correct this behavior in production?
I'm using Python/Django. Is it possible to catch the error somehow and reset the sequence?
For sequences it goes something like:
select setval('foo_id_seq', max(id), true) from foo;
for the appropriate values of 'foo_id_seq', foo, and id.
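A hedged sketch of how that could be automated from Django; the table app_foo and its id column are hypothetical, and pg_get_serial_sequence looks up the sequence name for you:
from django.db import IntegrityError, connection

def save_with_sequence_reset(instance):
    try:
        instance.save()
    except IntegrityError:
        # Re-sync the pk sequence with the table's current max id, then retry once.
        with connection.cursor() as cursor:
            cursor.execute(
                "select setval(pg_get_serial_sequence('app_foo', 'id'), "
                "(select max(id) from app_foo), true)"
            )
        instance.save()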

NDB query by keys value

New to using Python NDB.
I have something like:
class Survey(ndb.Model):
    same = ndb.StringProperty(required=True)

class User(ndb.Model):
    seen_list = ndb.KeyProperty(kind=Survey, repeated=True)
I want to be able to query for users that have not seen certain surveys.
What I am doing now is:
users = User.query(User.seen_list != 'survey name').fetch()
This does not work. What would be the proper way to do this? Should I first query the Survey list to get the key of the survey with a certain name? Is the != part correct?
I could not find any examples similar to this.
Thanks.
Unfortunately, if your seen list is a repeated property, it won't work that way. When you query a repeated property, the datastore tries EVERY entry in your list, and if one matches, it returns the item. So when you say != 'survey name 1', if you have at least ONE entry in your list that isn't 'survey name 1', the entity comes back as a match, even if another entry IS 'survey name 1'.
It's counterintuitive if you come from an SQL background, I know. The only way around it is to filter programmatically: evaluate the entities your query returns. It comes from the fact that, for repeated values, Bigtable "flattens" your results, meaning it creates one row for EVERY value in the repeated attribute. So as it scans, it eventually finds one "correct" row with your info, grabs the entity key from there, and returns the entity.
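A hedged sketch of that programmatic filtering; survey_key is a hypothetical key for the survey in question (you could look it up first by querying Survey on its name):
survey_key = ndb.Key(Survey, 'some-survey-id')  # hypothetical id
unseen = [u for u in User.query().fetch() if survey_key not in u.seen_list]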
