How to get the row count of a table instantly in DynamoDB? - python

I'm using boto.dynamodb2, and it seems I can use Table.query_count(). However, it raises an exception when no query filter is applied.
What can I do to fix this?
BTW, where is the documentation for the filters that boto.dynamodb2.table.Table.query can use? I tried searching for it but found nothing.

There are two ways you can get a row count in DynamoDB.
The first is performing a full table scan and counting the rows as you go. For a table of any reasonable size this is generally a horrible idea as it will consume all of your provisioned read throughput.
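For completeness, here is a rough boto3 sketch of that scan-and-count approach (the table name is a placeholder). Select='COUNT' avoids transferring the items themselves, but the scan still consumes read capacity and has to be paginated via LastEvaluatedKey:
import boto3

client = boto3.client('dynamodb')

def scan_count(table_name):
    total = 0
    kwargs = {'TableName': table_name, 'Select': 'COUNT'}
    while True:
        response = client.scan(**kwargs)
        total += response['Count']
        if 'LastEvaluatedKey' not in response:
            return total
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

print(scan_count('tableName'))  # 'tableName' is a placeholder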
The other way is to use the Describe Table request to get an estimate of the number of rows in the table. This will return instantly, but will only be updated periodically per the AWS documentation.
The number of items in the specified index. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.

As per the boto3 documentation:
"The number of items in the specified table. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value."
import boto3
dynamoDBResource = boto3.resource('dynamodb')
table = dynamoDBResource.Table('tableName')
print(table.item_count)
or you can use DescribeTable:
import boto3
dynamoDBClient = boto3.client('dynamodb')
table = dynamoDBClient.describe_table(
    TableName='tableName'
)
print(table)

If you want to count the number of items:
import boto3
client = boto3.client('dynamodb', region_name='us-east-1')
response = client.describe_table(TableName='test')
print(response['Table']['ItemCount'])
#ItemCount (integer) --The number of items in the specified table.
# DynamoDB updates this value approximately every six hours.
# Recent changes might not be reflected in this value.
Ref: Boto3 Documentation (under ItemCount in describe_table())

You can use this to get the item count of the entire table:
from boto.dynamodb2.table import Table
dynamodb_table = Table('Users')
dynamodb_table.count() # updated approximately every 6 hours
Refer here: http://boto.cloudhackers.com/en/latest/ref/dynamodb2.html#module-boto.dynamodb2.table
The query_count method will return the item count based on the indexes you provide.
For example,
from boto.dynamodb2.table import Table
dynamodb_table = Table('Users')
print(dynamodb_table.query_count(
    index='first_name-last_name-index',  # get index names from the Indexes tab in the DynamoDB console
    first_name__eq='John',               # append __eq to the attribute name for an exact match
    last_name__eq='Smith'                # this is your range key
))
You can add the primary index or global secondary indexes along with range keys.
Possible comparison operators:
__eq for equal
__lt for less than
__gt for greater than
__gte for greater than or equal
__lte for less than or equal
__between for between
__beginswith for begins with
Example for __between:
print(dynamodb_table.query_count(
    index='first_name-last_name-index',  # get index names from the Indexes tab in the DynamoDB console
    first_name__eq='John',               # append __eq to the attribute name for an exact match
    age__between=[30, 50]                # this is your range key
))

Related

Best way to speed up PyMongo loop

I'm currently using a MongoDB database where I'm storing product data. I'm looping over roughly 50 IDs; with each iteration I search for the ID, add it if it doesn't exist, and if it exists and another column has a specific value, I run a function.
for id in ids:
    value = db.find_one({"value": id})
    if value:
        # It checks some other columns here using both the ID and the returned document
        pass
    else:
        # It adds the ID and some other information to the database
        pass
The problem here is that this is incredibly inefficient. When searching around for other ways to do this, all results show how to get a list of the results, but I'm not sure how this would be implemented in my scenario since I'm running functions and checks with each result and ID.
Thank you!
You can improve this by doing only one find request, and then, in a second step, adding all the missing documents at once, maybe with an insert_many?
values = db.find({"value": {"$in": ids}})
for value in values:
    # It checks some other columns here using both the ID and the returned document
    ids.remove(value["value"])

# Do all your inserts
# with a loop
for id in ids:
    db.insert_one(...)
# or with insert_many
db.insert_many(...)
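If the work done for existing documents can itself be expressed as an update, another option is a single bulk_write of upserts, which collapses the find and insert passes into one round trip. A minimal sketch, assuming each document is keyed by the "value" field as in the question (the database and collection names are placeholders):
from pymongo import MongoClient, UpdateOne

client = MongoClient()                   # assumed local connection
collection = client['mydb']['products']  # placeholder database/collection names

ids = ["id1", "id2", "id3"]              # your ~50 IDs

# Insert the document if "value" is missing, otherwise leave it untouched;
# extra "$set" fields could be added if existing documents also need changes.
operations = [
    UpdateOne({"value": id}, {"$setOnInsert": {"value": id}}, upsert=True)
    for id in ids
]
result = collection.bulk_write(operations, ordered=False)
print(result.upserted_count, "new documents inserted")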

How to use start_at() & end_at() in Firebase query with Python?

My Firebase Realtime Database schema:
Let's suppose the above Firebase database schema.
I want to get data with order_by_key(), but only the items after the first 5 and up to the first 10, not more. The range should be 5-10, like in the image.
My keys always start with -.
I'm trying this, but it fails and returns 0. How can I do this?
snapshot = ref.child('tracks').order_by_key().start_at('-\5').end_at(u'-\10').get()
Firebase queries are based on cursor/anchor values, not on offsets. This means that the start_at and end_at calls expect values of the thing you order on, so since you order by key here, they expect the keys of those nodes.
To get the slice you indicate you'll need:
ref.child('tracks').order_by_key().start_at('-MQJ7P').end_at(u'-MQJ8O').get()
If you don't know either of those values, you can't specify them and can only start from the first item or end on the last item.
The only exception is that you can specify a limit_to_first instead of end_at to get a number of items at the start of the slice:
ref.child('tracks').order_by_key().start_at('-MQJ7P').limit_to_first(5).get()
Alternatively if you know only the key of the last item, you can get the five items before that with:
ref.child('tracks').order_by_key().end_at('-MQJ8O').limit_to_last(5).get()
But you'll need to know at least one of the keys, typically because you've shown it as the last item on the previous page/first item on the next page.
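For example, to page through the data without hardcoding both ends of the range, you can anchor each page on the last key of the previous one. A rough sketch with the firebase_admin SDK (the credentials path, database URL, and page size are assumptions, not part of the question):
import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate('serviceAccount.json')  # placeholder path
firebase_admin.initialize_app(cred, {'databaseURL': 'https://your-db.firebaseio.com'})  # placeholder URL

ref = db.reference('tracks')

# First page: the first 5 items by key.
first_page = ref.order_by_key().limit_to_first(5).get()

# Next page: start at the last key of the previous page. start_at is inclusive,
# so fetch one extra item and drop the duplicate.
last_key = sorted(first_page.keys())[-1]
next_page = ref.order_by_key().start_at(last_key).limit_to_first(6).get()
next_page.pop(last_key, None)
print(list(next_page.keys()))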

API Get method to get all tweets with hashtag count greater than within MongoDB in JSON format

I have a MongoDB database that contains a number of tweets. I want my API to return a JSON list of all the tweets that contain a number of hashtags greater than that specified by the user in the URL (e.g. http://localhost:5000/tweets?morethan=5, which is 5 in this case).
The hashtags are contained inside the entities column in the database, along with other columns such as user_mentions, urls, symbols and media. Here is the code I've written so far, but it doesn't return anything.
#!flask/bin/python
from flask import Flask, request
from pymongo import MongoClient

client = MongoClient()
app = Flask(__name__)

@app.route('/tweets', methods=['GET'])
def get_tweets():
    # Connect to database and pull back collections
    db = client['mongo']
    collection = db['collection']
    parameter = request.args.get('morethan')
    if parameter:
        gt_parameter = int(parameter) + 1  # question said greater than, not greater or equal
        key_im_looking_for = "entities.hashtags.{}".format(gt_parameter)  # create the namespace
        cursor = collection.find({key_im_looking_for: {"$exists": True}})
EDIT: IT WORKS!
The code in question was this line:
cursor = collection.find({"entities": {"hashtags": parameter}})
This answer explains why it is impossible to directly perform what you ask.
mongodb query: $size with $gt returns always 0
That answer also describes potential (but poor) ideas to get around it.
The best suggestion is to modify all your documents and put a "num_hashtags" key in somewhere, index that, and query against it.
Using the Twitter JSON API you could update all your documents and put the num_hashtags key in the entities document.
Alternatively, you could solve your immediate problem by doing a very slow full collection scan on every query, abusing MongoDB dot notation to check whether a hashtag exists at the index one greater than your parameter.
gt_parameter = int(parameter) + 1 # question said greater than not greater or equal
key_im_looking_for = "entities.hashtags.{}".format(gt_parameter) #create the namespace#
# py2.7 => key_im_looking_for = "entities.hashtags.%s" %(gt_parameter)
# in this example it would be "entities.hashtags.6"
cursor = collection.find({key_im_looking_for: {"$exists": True}})
The best answer (and the key reason to use a NoSQL database in the first place) is that you should modify your data to suit your retrieval. If possible, you should perform an in-place update adding the num_hashtags key.
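A hedged sketch of that in-place update with PyMongo: on MongoDB 4.2+ an aggregation-pipeline update can backfill num_hashtags from the existing array, after which the endpoint reduces to a simple indexed range query (the collection variable and field names follow the question):
# Backfill num_hashtags from the existing entities.hashtags array
# (pipeline-style updates require MongoDB 4.2+).
collection.update_many(
    {},
    [{"$set": {"num_hashtags": {"$size": {"$ifNull": ["$entities.hashtags", []]}}}}]
)
collection.create_index("num_hashtags")

# The endpoint then becomes an indexed range query.
gt_parameter = int(parameter)
cursor = collection.find({"num_hashtags": {"$gt": gt_parameter}})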

Checking millions of MySQL rows in Python

I have a Python program that downloads a text file of over 100 million unique values and applies the following logic to each value:
If the value already exists in the table, update the entry's last_seen date (SELECT id WHERE <col> = <value>;)
If the value does not exist in the table, insert the value into the table
I queue up entries that need to be added and then insert them in a bulk statement after a few hundred have been gathered.
Currently, the program takes over 24 hours to run. I've created an index on the column that stores the values.
I'm currently using MySQLdb.
It seems that checking for value existence is taking the lion's share of the runtime. What avenues can I pursue to make this faster?
Thank you.
You could try loading the values into a set, so you can do the lookups without hitting the database every time, assuming the table is not being updated by anyone else and you have sufficient memory.
# Let's assume you have a function runquery, that executes the
# provided statement and returns a collection of values as strings.
existing_values = set(runquery('SELECT DISTINCT value FROM table'))

with open('big_file.txt') as f:
    inserts = []
    updates = []
    for line in f:
        value = line.strip()
        if value in existing_values:
            updates.append(value)
        else:
            existing_values.add(value)
            inserts.append(value)
        if len(inserts) > THRESHOLD or len(updates) > THRESHOLD:
            # Do bulk updates and clear inserts and updates
            pass
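Another angle, sketched below under the assumption that the values column has a UNIQUE index (table and column names are placeholders): let MySQL do the insert-or-update in one statement with INSERT ... ON DUPLICATE KEY UPDATE, so no existence check is needed on the Python side at all.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()

# Insert new values; for existing ones just bump last_seen.
sql = ("INSERT INTO entries (value, last_seen) VALUES (%s, NOW()) "
       "ON DUPLICATE KEY UPDATE last_seen = NOW()")

batch = []
with open('big_file.txt') as f:
    for line in f:
        batch.append((line.strip(),))
        if len(batch) >= 1000:
            cur.executemany(sql, batch)
            conn.commit()
            batch = []
if batch:
    cur.executemany(sql, batch)
    conn.commit()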

DynamoDB Querying in Python (Count with GroupBy)

This may be trivial, but I loaded a local DynamoDB instance with 30GB worth of Twitter data that I aggregated.
The primary key is id (tweet_id from the Tweet JSON), and I also store the date/text/username/geocode.
I basically am interested in mentions of two topics (let's say "Bees" and "Booze"). I want to get a count of each of those by state by day.
So by the end, I should know for each state, how many times each was mentioned on a given day. And I guess it'd be nice to export that as a CSV or something for later analysis.
Some issues I had with doing this...
First, the geocode info is a tuple of [latitude, longitude] so for each entry, I need to map that to a state. That I can do.
Second, is the most efficient way to do this to go through each entry, manually check whether it contains a mention of either keyword, and then keep a dictionary for each that maps date/location to a count?
EDIT:
Since it took me 20 hours to load all the data into my table, I don't want to delete and re-create it. Perhaps I should create a global secondary index (?) and use that to search other fields in a query? That way I don't have to scan everything. Is that the right track?
EDIT 2:
Well, since the table is on my computer locally I should be OK with just using expensive operations like a Scan right?
So if I did something like this:
query = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='id, text, date, geo',
    Limit=100)
And did one scan for each keyword, then I would be able to go through the resulting filtered list and get a count of mentions of each topic for each state on a given day, right?
EDIT3:
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    Limit=100)
# do something with this set
while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('text').contains("Booze"),
        Limit=100,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    # do something with each batch of up to 100 entries
So something like that, for both keywords. That way I'll be able to go through the resulting filtered set and do what I want (in this case, figure out the location and day and create a final dataset with that info). Right?
EDIT 4
If I add:
ProjectionExpression='date, location, user, text'
into the scan request, I get an error saying "botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the Scan operation: Invalid ProjectionExpression: Attribute name is a reserved keyword; reserved keyword: location". How do I fix that?
NVM I got it. Answer is to look into ExpressionAttributeNames (see: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ExpressionPlaceholders.html)
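For reference, the fix looks roughly like this; date, user, and text are also reserved words in DynamoDB, so they get placeholders too (the placeholder names themselves are arbitrary):
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='#dt, #loc, #usr, #txt',
    ExpressionAttributeNames={
        '#dt': 'date',
        '#loc': 'location',
        '#usr': 'user',
        '#txt': 'text',
    },
    Limit=100)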
Yes, scanning the table for "Booze" and counting the items in the result should give you the total count. Please note that you need to scan repeatedly until LastEvaluatedKey is no longer returned.
Refer to ExclusiveStartKey as well.
Scan
EDIT:-
Yes, the code looks good. One thing to note: the result set won't always contain 100 items. Please refer to the Limit definition below (it is not the same as LIMIT in a SQL database).
Limit — (Integer) The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.
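Putting it together, a rough sketch of the counting step for one keyword; lat_lon_to_state() is an assumed helper for the geo-to-state mapping mentioned in the question, and the attribute-name placeholder is needed because date is a reserved word:
import csv
from collections import Counter
from boto3.dynamodb.conditions import Attr

counts = Counter()
kwargs = {'FilterExpression': Attr('text').contains('Booze'),
          'ProjectionExpression': '#dt, geo',
          'ExpressionAttributeNames': {'#dt': 'date'}}
while True:
    response = table.scan(**kwargs)
    for item in response['Items']:
        state = lat_lon_to_state(item['geo'])   # assumed helper
        counts[(state, item['date'], 'Booze')] += 1
    if 'LastEvaluatedKey' not in response:
        break
    kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

with open('booze_counts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['state', 'date', 'keyword', 'count'])
    for (state, day, keyword), count in counts.items():
        writer.writerow([state, day, keyword, count])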
